Day 1 of the DPUK Datathon was quite intense, but an incredible learning experience. Plenty of my assumptions were blown apart, and I saw data-related issues I didn’t think existed. Perhaps most importantly, it was incredible because I was surrounded by some of the brightest minds in the country.
NOTE: please excuse the grammar / punctuation errors / incoherence. I’ve written this post while thinking out loud, soon after Day 1 of the datathon. The first post of this ‘series’ is viewable here.
The first half of the day comprised several presentations on the current research agenda and the data available. In the second half, we were split into teams of 6-7 to start exploring research questions.
An insight into dementia (data)
There were 11 presentations (1 of which ended up getting cancelled) between 9am and 12pm. Topics included:
- The role of data and technology in addressing the questions of early dementia diagnosis and prevention
- Introduction to the English Longitudinal Study of Ageing (ELSA) data
- Exploring the DPUK Data Portal
- A game app used to collect Big Data that serves as a benchmark for clinical / experimental studies.
- New networks for researchers and data scientists looking to expand the dementia research agenda.
I found these short presentations (each about 15 minutes long) quite insightful, especially since I’m pretty much clueless about dementia in general.
It’s a pretty big problem
The presentations really helped put the scale of the problem in perspective. For instance, it’s apparently predicted that 1 in 3 people born in 2019 will end up getting dementia.
Globally, that figure is set to increase into the tens if not hundreds of millions over the next few decades.
Apart from the sheer number of people expected to ‘get’ dementia, the problem is accentuated by the fact that we still don’t quite know what causes it.
There seems to be some hope though: advancements in technology could help identify the cause(s), thanks to the diversity and scale of new data.
Exploring the technology to diagnose dementia
One of the presentations highlighted how existing technologies broadly fall into 2 kinds:
- Behavioural / Cognitive Tests, and
- Physiological tests, including:
  - Brain scans (e.g. CT, MRI, and PET scans),
  - Blood tests (apparently they’re looking for a Vitamin B-12 deficiency), and
  - Neurological evaluations.
Promising / “futuristic” tech includes:
- Using human interactions with phones (e.g. tapping, scrolling, voice coherence),
- Retinal scans (something about it sort of ‘mapping’ what your brain does),
- Blood biomarkers (apparently this is the “holy grail” for many), including proteins such as tau and neurofilament light (NfL), as well as miRNA
- Big Data
- Wearables and sensors
Taking stock of the Big Data
DPUK is a data processor, meaning they don’t own the data, but instead ‘clean’ it to make it usable for researchers and the like. Thanks to their significant efforts, they provide access to a wealth of different datasets in formats that are considerably easier to work with.
While they have access to dozens of different datasets (and growing), we mainly explored the 3 we have access to as part of the datathon.
ELSA
One of the biggest goliaths is the English Longitudinal Study of Ageing (ELSA).
ELSA consists of a representative sample of people in England aged 50+ in 2002, when the data collection began. Since then they’ve added more people and tracked responses of all ‘original’ and ‘newer’ subjects in “waves”.
Topics include health, disability, cognition, economic position, wealth, income, and childhood resources, to name a few.
While I often deal with “Big Data” as part of my PhD Finance research, the ELSA dataset is quite different.
It’s got about 20,000 participants (“subjects”), so not exactly “Big Data” in that sense. But it has approximately 5,000 different variables for each participant.
So while in my previous post I said all the data we’d be dealing with is clean, I hadn’t quite fathomed the sheer size of the data in terms of diversity.
My expectation was a few variables across hundreds of thousands or millions of people. The reality is “few” participants (tens of thousands) across tens of thousands of variables (and not just with ELSA).
Deep and Frequent Phenotyping (DFP)
This dataset is quite new, with the first set of data collected in October 2018, if I’m not mistaken. The dataset is pretty incredible in that it has PET and MRI scans, gait measurements, and other data, as well as cognitive assessments of subjects.
Since the dataset is new though, there aren’t too many observations at this time.
While exploring the datasets, I think I saw something like 200 – 300 observations for scans. Unfortunately, that’s too few for any meaningful statistical analysis.
The Caerphilly Prospective Study (CaPS)
The CaPS database provides access to information about lifestyle and other factors associated with incident vascular disease in men. Data collection for CaPS began in 1979, making it one of the oldest datasets we have access to during the Datathon.
Information includes things like whether subjects lead an active physical life and what speed they walk at, amongst other things.
Other datasets
There are several other datasets available to researchers here. I’ve only discussed the 3 above since these are the ones we have access to during the datathon.
And I’m honestly pleased it’s “just” these 3, because the data complexity is far greater than I thought; this is going to be quite interesting and challenging as it is.
Getting teams together
With the “army” of data scientists, mathematicians, psychologists, psychiatrists, neuroscientists and a host of other experts armed with new / reinforced knowledge about dementia, we were ready to get our hands dirty.
First, we were split into 8 teams of 6-7 participants each.
All but 1 team had a really diverse mix of experts. The 1 “non-diverse” team was the one with 6/6 statisticians, who said they were going to “just wrangle the data and see what they find”.
After introducing one another and briefly talking about our backgrounds, we started discussing potential research questions to explore.
Exploring research questions
We considered exploring brain imaging (from DFP data), but the small sample size (N ~ 280) made it less attractive.
Prior to the datathon, I browsed around various online forums related to dementia. I saw something about some dementia patients ‘clinging’ to money, possibly because they want to feel secure; this was discussed on a thread at Alzheimer’s Society UK.
Thus, I thought perhaps we could look at how income and / or financial security has an impact on dementia.
Collectively, we figured it’d be good to incorporate ‘security’ / feeling secure as a whole (i.e., not restricted to financial security exclusively). Hence we figured maybe we should look at how childhood “experiences” impact dementia.
With a broad research question roughly agreed, we took a look at the kind of data we could work with. Both ELSA and CaPS provided some sort of useful data, with the former providing more direct data on childhood ‘experiences’.
The drawback of ELSA however, is that there are only about 200 instances of dementia. Naturally, this meant we couldn’t realistically explore the impact of childhood ‘experiences’ on dementia. Thus, we broadened the question to cognitive decline instead of dementia exclusively. This is reasonable since there is some evidence of people with cognitive decline or cognitive impairments ending up with dementia.
Exploring the data
With a rough research question in mind, we started to explore the ELSA dataset in more detail.
This is where things got messy.
Really, really messy.
Different data, no unique mergers
A lot of these datasets are quite rich and diverse by themselves. Some of them aren’t, but can be really useful when used in conjunction with other datasets.
In Finance, we usually work across multiple datasets, and for the most part, merging datasets is pretty straightforward. For instance, we have standardised firm identifiers which allow us to match different characteristics for the same firm across different databases.
In the “dementia research world”, there are different datasets, but the ‘subjects’ aren’t necessarily the same.
Thus, the combinations of different datasets can only really be used to derive aggregate level insights.
For instance, one might merge 2 completely different datasets on age, allowing us to derive insights for people across different age brackets (see the sketch below).
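To make this concrete, here’s a minimal toy example with made-up data (cohort names, variables, and numbers are all hypothetical): two unrelated cohorts can only be joined on an aggregate attribute like an age bracket, never on a subject-level key.

```python
import numpy as np
import pandas as pd

# Two hypothetical, unrelated cohorts with no shared subject IDs
rng = np.random.default_rng(1)
cohort_a = pd.DataFrame({"age": rng.integers(50, 90, 1000),
                         "memory_score": rng.normal(25, 5, 1000)})
cohort_b = pd.DataFrame({"age": rng.integers(50, 90, 800),
                         "walk_speed": rng.normal(1.2, 0.2, 800)})

# Bin subjects into shared age brackets
bins = [50, 60, 70, 80, 90]
for df in (cohort_a, cohort_b):
    df["age_bracket"] = pd.cut(df["age"], bins=bins, right=False)

# Aggregate within brackets, then join the aggregates: without a
# subject-level key, bracket-level insights are the best we can do
agg_a = cohort_a.groupby("age_bracket", observed=True)["memory_score"].mean()
agg_b = cohort_b.groupby("age_bracket", observed=True)["walk_speed"].mean()
print(pd.concat([agg_a, agg_b], axis=1))
```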
So in our context, we’re largely restricting ourselves to the ELSA database despite having access to other data. We expect to use the CaPS database for robustness checks, but not a whole lot more.
Determining relevant variables
At the time of writing, we’re in a place where we literally have thousands of variables within the childhood ‘experience’ domain alone.
Of course, we can run some sort of principal component analysis (PCA) or other feature extraction technique to determine which of the variables best relate to cognitive decline; a rough sketch of this is shown below.
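The data here is simulated, standing in for whatever childhood-related columns we eventually extract from ELSA; it’s a toy illustration rather than our actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the (hypothetical) childhood variables
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 20)),
                 columns=[f"child_var_{i}" for i in range(20)])

# PCA is scale-sensitive, so standardise the variables first
X_scaled = StandardScaler().fit_transform(X)

# Keep however many components explain ~90% of the variance
pca = PCA(n_components=0.90)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.round(3))
```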
The issue we have at the moment, though, is that we can’t quite identify all of the relevant variables to feed into this sort of analysis.
This is largely because of the naming conventions inside the data itself – quite different to what we see in the ‘Cohort Directory’ on the DPUK Data Portal, and in some cases, quite different from what we see in the documentation, too.
One of the things we managed to do was extract variable names and labels from the .dta Stata files as .csvs. This will allow us to map variable names to labels and then potentially run some sort of text ‘mining’ to extract variables relevant / related to “childhood”.
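For anyone curious, this is roughly how the extraction works in pandas (the file name here is hypothetical): `pd.read_stata` with `iterator=True` returns a `StataReader`, and its `variable_labels()` method gives the name-to-label mapping without loading the full dataset.

```python
import pandas as pd

# Pull variable names and labels without reading the whole file;
# "wave_1.dta" is a hypothetical file name
with pd.read_stata("wave_1.dta", iterator=True) as reader:
    labels = reader.variable_labels()  # {variable_name: label}

# Save the name -> label mapping for later text 'mining'
label_df = pd.DataFrame(list(labels.items()), columns=["variable", "label"])
label_df.to_csv("wave_1_labels.csv", index=False)

# Crude keyword filter for variables related to "childhood"
childhood = label_df[label_df["label"].str.contains("child", case=False, na=False)]
print(childhood.head())
```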
I’ll report back on how this goes in the next blog post, but we also have other issues to deal with which we hadn’t quite anticipated.
Differences in values vs. labels
We had an instance where we read a .dta file into a pandas dataframe, only to find that the values within a given column C were the labels for a given question. In other words, we had strings like “good”, “excellent”, “not sure”, etc., instead of the numeric values (1, 2, …, N).
Surprisingly, when the same file is read into Stata, the numeric values are visible on further inspection: the labels are only for display, with the actual values being the numerical ones.
In other words, Stata maps the numbers 1, 2, …, N for a column C to labels L in {“good”, “excellent”, “not sure”, …, “bad”}, but reading the file into pandas only showed the labels L.
Now, of course, we could create dictionary maps for the column, but this would be a ridiculous task given we have several thousand columns which could have this sort of issue!
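As it happens, pandas can handle this for us: `pd.read_stata` accepts a `convert_categoricals` flag that keeps the underlying numeric codes, and the label definitions can be pulled out separately. A minimal sketch (the file name is hypothetical):

```python
import pandas as pd

# convert_categoricals=False keeps the numeric codes Stata stores
# internally, instead of the display labels pandas applies by default;
# "elsa_wave.dta" is a hypothetical file name
df = pd.read_stata("elsa_wave.dta", convert_categoricals=False)

# The value-label definitions come out separately, so we don't have to
# hand-build dictionaries for thousands of columns
with pd.read_stata("elsa_wave.dta", iterator=True) as reader:
    # keyed by label-set name (often the variable name):
    # {label_set: {numeric_code: label_string}}
    value_labels = reader.value_labels()
```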
Wrapping up
So that’s about where we’re at right now.
We have a rough research question in mind – how do childhood “experiences” impact cognitive decline?
Note that we still haven’t quite defined what we mean by childhood “experiences”.
We have some idea of the broad dataset we’ll be working with, and have some ideas of the issues we have / need to be aware of.
One of the team members won’t be with us on Friday so we’re going to be one man down after tomorrow.
We have no idea how we’re going to ship something by Friday but know that we have to get something out.
The day ended with a glorious drinks reception at the Sainsbury’s Art Centre at the University, where we were given a really cool tour of some of the art pieces.
We were also treated to a delicious dinner before 4 of us made our way for a drink at the student bar (1 from my team and 2 other participants from other teams).
2 of the participants ended up playing some really great music on the piano and that marked the end of the night.
All in all, I think Day 1 was quite intense, really well thought out, and a fantastic experience. I’m well excited about the sort of data I’ll be working with and am looking forward to what we end up doing.
I’ll post the link to Day 2 here later on.
Feature Image by Arif Wahid on Unsplash, modified on Snagit.