
Data Diversity: Ten practical approaches to acquire the right datasets Part 2

The outcomes of machine learning activities are heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socioeconomic status, are poorly represented in the data sources that can inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on the data acquisition process and outline some possible approaches.

Your disease influences your landscape.

The disease programmes that organisations choose will influence the landscape of data sources they can pursue. For example, if you are going to work on a rare genetic disorder like Hunter's Syndrome, you may encounter few data sources, each with few patients. The disease affects almost exclusively males, whose life expectancy varies between 10 and 20 years of age. In this case, the characteristics of disease prevalence may make it difficult to improve diversity by including many more females or patients who are middle-aged or older.

On the other hand, if the disease of interest is more common, like prostate cancer, there will be more data sources with more patients represented in them. From the diversity perspective, we know that prostate cancer has a disproportionate incidence and mortality in black male groups compared with white male groups. Yet most of the participants in prostate cancer trials are white males. In this case, there may be more room to seek out black male groups across many data sources in order to represent them better.

This leads to an important question: how do you measure whether demographic groups are adequately represented?

Determine how you will assess adequate representation.

Below is a chart that may help guide the metrics you might want to use to measure adequate diversity. Along the top are three different types of data sources: disease-agnostic ones (e.g. a national biobank), a common disease (e.g. a type of cancer) and a rare disease (e.g. a rare genetic disorder).

Down the side are demographic attributes. For a biobank, it may make sense to measure these attributes relative to the demographics of the areas the biobank serves. For common and rare diseases, we would mostly rely on prevalence and consider the demographics of the disease population.
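As a rough illustration of measuring an attribute relative to a reference population, here is a minimal Python sketch. It is not taken from the Diversity Analysis Tool: the function name `representation_ratios`, the group labels and the reference shares are all hypothetical, and the numbers are made up. It simply divides each group's share in a data set by its share in whichever reference population you have chosen (the area a biobank serves, or the disease population).

```python
from collections import Counter


def representation_ratios(observed_labels, reference_shares):
    """Compare each group's share in a data set with its share in a
    reference population.

    A ratio near 1.0 suggests the group appears roughly in proportion,
    below 1.0 suggests under-representation, and 0.0 means the group
    is absent from the data set altogether.
    """
    counts = Counter(observed_labels)
    total = sum(counts.values())
    ratios = {}
    for group, reference_share in reference_shares.items():
        observed_share = counts.get(group, 0) / total if total else 0.0
        # A reference share of zero gives no meaningful ratio.
        if reference_share > 0:
            ratios[group] = observed_share / reference_share
        else:
            ratios[group] = float("nan")
    return ratios


# Hypothetical example: the labels and shares are illustrative only.
dataset_sex = ["male"] * 70 + ["female"] * 30
reference = {"male": 0.5, "female": 0.5}
ratios = representation_ratios(dataset_sex, reference)

for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

What counts as an acceptable ratio is a judgement call that depends on the disease and on the reference population you choose, so treat the output as a prompt for discussion rather than a pass or fail result.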

Sometimes a demographic group appears rarely in the data not because the disease is less prevalent in that group but because the data is absent, i.e. a part of the population is not represented in the data. In those cases, you may want to view an improvement in diversity as including people who are typically missing from many of the disease's data sources.

In any case, determining how you will assess adequate representation is important and something to add as part of your process.

Estimating a minimum N.

Suppose the goal is to improve the representation of a particular cross-section of society in the data. In that case, we need to ask: “what is the minimum number of people for a given demographic so that the analyses can generate meaningful results?”

This can be a very challenging question to answer because the minimum sample size can be influenced by many factors. It can depend on the disease, the disease trait, and how frequently that trait appears in a population. It also depends on the power, the significance level, the number of hypotheses tested, the effect size and the population heterogeneity. Machine learning scientists want to know whether a data source has enough data to distinguish a signal from noise for some characteristic of interest.
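To show how some of these factors interact, here is a minimal Python sketch of one textbook calculation: the normal-approximation sample size for comparing the means of two equally sized groups. It is only one simplified case, not a recommended method for any particular study. The function name `min_n_per_group` is hypothetical, the effect size is assumed to be a standardised difference in means, and the number of hypotheses is handled with a simple Bonferroni correction. Population heterogeneity and trait frequency are not modelled.

```python
import math
from statistics import NormalDist


def min_n_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Approximate sample size per group for a two-sided comparison of
    two group means, using the normal approximation.

    effect_size: standardised difference in means (assumed > 0).
    alpha:       overall significance level.
    power:       desired probability of detecting the effect.
    n_tests:     number of hypotheses tested (Bonferroni-adjusted).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    adjusted_alpha = alpha / n_tests
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Illustrative inputs only: a moderate effect, tested once and then
# as one of 100 hypotheses.
print(min_n_per_group(0.5))
print(min_n_per_group(0.5, n_tests=100))
```

Smaller effects, stricter significance levels, higher power and more hypotheses all push the estimate up, which is one reason the minimum N for an under-represented group can be larger than expected.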

Initially, the answer won't be a number but a checklist of things that will help to provide a rough idea of what the N will be. That N will change as your analysis matures, but it is useful to revisit it to help keep data acquisition cost-effective.

Look for your patients.

After all these steps, this is where you can begin to appreciate all the other factors that you may have to consider alongside diversity.

And you may start asking yourself, is the data answering the scientific questions I am interested in? Is the quality of clinical and molecular data in line with what I am searching for? How quick and easy is the application process for getting data? Are there any intellectual property considerations? How are longitudinal data points distributed across a data source?

Finding diversity in the data landscape for a disease can call for creative thinking that may not be obvious at first. For example, if you want to find N patients of a given ethnicity, you need to think globally about where people from that demographic live, because the location will influence the administrative processes for acquiring data.

Create the demand: Put it in the contract.

When you find a prospective data source you like, ask for diverse data that includes age, sex, ethnicity, race and socioeconomic status as part of the data access contract, and indicate what you want to do with it. Create demand by specifying it in the contract, and hopefully this will influence data providers to record that information better in the future.



Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine


Test our Diversity Analysis Tool: a simple open-source programme

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub.
