More assumptions blown apart, and a whole lot of data wrangling I didn’t think we’d have to deal with since we have ‘clean data’.
NOTE: this post was written ‘as is’, while thinking out loud, to help keep things authentic. Please pardon any grammatical and typographical errors, as well as any incoherence in the structure.
Change in the research question
Having seen that the measurement of ‘cognitive decline’ is not black and white, and given the additional complexities of dealing with it, we made a collective decision to narrow our scope again.
Rather than exploring the impact of childhood experience on cognitive decline, we decided to explore its impact on dementia.
While we only have about 220 cases of dementia reported in the entire ELSA dataset, we have several thousand other participants to explore.
Initially, I thought we’d do some sort of treatment / control grouping wherein we match a sample of 220 non-dementia cases to the 220 dementia cases and see if we can find anything interesting.
Of course, that would mean ignoring about 95% of the sample, and we wanted to make as much use of the dataset as possible.
Thus, matching is unlikely at this point in time.
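Still, for the record, here’s a minimal sketch of what that (now-shelved) matching idea might look like in pandas – assuming a DataFrame `df` with a binary `dementia_indicator` column; the names are hypothetical, not our actual code:

```python
import pandas as pd

# Sketch of naive 1:1 case-control sampling (hypothetical names).
# Assumes `df` has a binary `dementia_indicator` column.
cases = df[df["dementia_indicator"] == 1]
controls = df[df["dementia_indicator"] == 0].sample(
    n=len(cases), random_state=42  # random controls; no covariate matching
)
matched = pd.concat([cases, controls])
```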
Anyway, with a modified research question and dataset in place, we spent almost the whole day cleaning the dataset – something I really did not expect to be doing, particularly since we’re meant to be dealing with a “clean” dataset!
I’d hate to imagine the format of this dataset prior to DPUK’s processing. What a nightmare!
Data wrangling
Childhood data is available in ‘Wave 3’ of the ELSA dataset, within the ‘life_history’ subset. Most if not all of the childhood data is retrospective. That is, it’s a result of interviewers asking interviewees stuff about their childhood from memory.
Importantly, ELSA is not a dataset of all dementia patients, so the vast majority will be able to remember parts of their childhood.
It’s interesting for me to work with this type of dataset because it’s actually pretty rare to work with this sort of data in Finance, where for the most part, surveys and interviews tend to be frowned upon.
Merging the ‘waves’
We thought we’d need to merge the ‘wave 3’ data with all the other waves. Fortunately, we took a little bit of time exploring the files manually. This allowed us to see that a ‘master’ file already exists with Waves 1 through to 7, excluding the childhood related data.
Merging the ‘wave 3’ data with the master file was of course trivial, and an inner merge meant we keep only the participants who provided retrospective data about their childhood.
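In pandas terms, the merge boils down to something like the below – the file names are hypothetical (the real files live in the secure environment), though `idauniq` is, as far as I can tell, the ELSA participant identifier:

```python
import pandas as pd

# Hypothetical file names; the real files sit in the secure environment.
master = pd.read_csv("elsa_master_waves_1_to_7.csv")
life_history = pd.read_csv("elsa_wave3_life_history.csv")

# Inner merge: keep only participants present in both files, i.e. those
# who answered the retrospective life-history module.
merged = master.merge(life_history, on="idauniq", how="inner")
```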
Identifying the ‘dementia’ outcome variable
There are a good few variables relating to dementia within ELSA, including `r{w}demene`, `s{w}demene`, `r{w}demens`, and `r{w}alzhe`, amongst others, where w is the wave number, w ∈ {1, 2, …, 8}.
At first, we naively assumed that all the variables were independent, which almost led us to significantly overestimate the number of dementia cases.
Fortunately, we read the documentation and found that `s{w}demene` is a subset of `r{w}demene`.
What’s more interesting, however, is the consideration of the `r{w}demene` (dementia) variable vs. the `r{w}alzhe` (Alzheimer’s) variable.
We have 3 people on the team specialised in dementia (and its related areas), and at one point they quite fiercely debated whether Alzheimer’s is, or is not, a form of dementia.
While the debate wasn’t entirely settled, we somehow seemed to agree that counting Alzheimer’s (`r{w}alzhe`) as an independent case would lead to overestimating the number of dementia outcomes.
Thus, we ended up creating our own `dementia_indicator`, which is equal to 1 if a subject has either or both of Alzheimer’s and dementia, 0 if they have neither, and NaN otherwise (e.g. “don’t know”, “refused to answer”, missing, etc.).
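A sketch of that logic, assuming the per-wave flags have already been collapsed into two columns, `has_dementia` and `has_alzheimers`, coded 1 / 0 / NaN (hypothetical names):

```python
import numpy as np

# Hypothetical columns: `has_dementia` and `has_alzheimers`, coded 1/0/NaN.
def make_dementia_indicator(row):
    flags = [row["has_dementia"], row["has_alzheimers"]]
    if any(f == 1 for f in flags):
        return 1       # either or both conditions reported
    if all(f == 0 for f in flags):
        return 0       # explicitly neither
    return np.nan      # don't know / refused / missing

df["dementia_indicator"] = df.apply(make_dementia_indicator, axis=1)
```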
Exploratory data analysis
After creating some of our variables (aka “feature extraction”), we started exploring the data in more detail via some exploratory data analysis (EDA).
This largely involved plotting stuff to try and gauge relationships between and within variables.
I can’t show any of the plots since they were all created in a secure environment, and the only way to get them out is via a “file out request” – something I haven’t run yet but will try later on.
Anyway, we plotted out things like the number and proportion of dementia cases over time. Consistent with expectations, it is upward sloping.
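The gist of those plots is roughly the below, assuming a long-format frame with one row per participant per wave (again, hypothetical names):

```python
import matplotlib.pyplot as plt

# Count and proportion of dementia cases by wave.
by_wave = df.groupby("wave")["dementia_indicator"].agg(["sum", "mean"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
by_wave["sum"].plot(ax=ax1, title="Dementia cases per wave")
by_wave["mean"].plot(ax=ax2, title="Proportion of dementia cases per wave")
plt.tight_layout()
plt.show()
```

Note that the mean skips NaNs, so the proportion is among valid responses only.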
What was interesting – and indeed, unexpected – was that we saw a decline in the average age of people with dementia.
This, however, is not likely to be true: it may just be a case of more people ‘dropping out’ of waves and younger people coming into them, so while it appears that the average age is declining, that might not be the case in reality.
Just throw it in the model
At one point we kind of got a little fed up with the sheer volume of variables we’re looking at.
Naturally, we thought – why not just “chuck” it all into a random forest or similar and see what we get?
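For what it’s worth, the “chuck it in” version really is only a few lines – sketched here with scikit-learn, assuming a numeric feature matrix `X` and outcome `y` with NaN outcomes dropped beforehand (all hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical: X is a numeric feature DataFrame, y the dementia indicator.
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

# Importances hint at predictive variables, but say nothing about causality.
top = sorted(zip(X.columns, rf.feature_importances_),
             key=lambda pair: -pair[1])[:10]
for name, importance in top:
    print(f"{name}: {importance:.3f}")
```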
But of course, the problem with that is that it doesn’t quite help us answer our research question.
We’re trying to understand the impact / effect of childhood experiences on dementia. Doing so naturally requires us to look at variables relating to childhood.
And that’s where the struggle is at the moment: trying to identify which of the thousands of variables available to us actually relate to childhood experiences.
Solving it is not quite as simple as running some sort of topic model or word2vec, because the data has varying labels (“yes” / “no”, “good” / “bad” / “excellent”, etc.) and there are no specific mentions of childhood within the responses.
We might be able to pull something out by using the data dictionary – for instance, using the descriptions of the variables to try and identify childhood-related ones.
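Something along these lines, assuming the data dictionary can be loaded as a mapping from variable names to their text descriptions (hypothetical structure):

```python
# Hypothetical: `data_dict` maps variable names to text descriptions.
keywords = ["child", "school", "parent", "mother", "father"]

childhood_vars = [
    var for var, desc in data_dict.items()
    if any(kw in desc.lower() for kw in keywords)
]
print(f"{len(childhood_vars)} candidate childhood-related variables")
```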
To be fair though, we have someone on the team who specialises in childhood-related studies. Time permitting, it’d of course be useful to use their insights to decide which variables we work with.