Data Diversity: Ten practical approaches to acquiring the right datasets (Part 1)

The impact of machine learning models is heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socio-economic status, are often poorly represented in the data sources that inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on those that could be made to the data acquisition process, and outline some possible approaches.

Define and prioritise diversity attributes.

Diversity is a broad topic, and when patient data is powering algorithms, it is critical to identify the relevant attributes of this data. These attributes can form the metrics for coverage and are essential in diversity discussions, but it's also worth prioritising them.

As an example, let’s think about sex, age and ethnicity, in that order, on the one hand, and race followed by socio-economic status on the other. These are imperfect and sometimes unclear concepts. Sex and gender, for instance, are two different concepts that are often conflated in health records. Ethnicity and race are often mixed together. And socio-economic status can refer to different kinds of indicators.

Sex, age and ethnicity are the attributes most biologically relevant to disease and the most often collected. The variables are ordered partly by their power to include or exclude parts of society. Race and socio-economic status, by contrast, provide indicators of health inequalities, which can affect health outcomes.

Why order them? Because in any given health data set about patients, some of these diversity attributes may be missing. Ordering them helps to prioritise data acquisition efforts to find what is absent.
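As a rough illustration of that prioritisation, the sketch below keeps the attributes in the order discussed above and reports which ones a data set lacks, highest priority first. It is not taken from the Diversity Analysis Tool; the column names and the function `missing_attributes` are hypothetical, and real sources will label these fields in many different ways.

```python
# Priority order follows the discussion above.
# The exact column names are an assumption made for this example.
PRIORITISED_ATTRIBUTES = [
    "sex",
    "age",
    "ethnicity",
    "race",
    "socio_economic_status",
]


def missing_attributes(dataset_columns):
    """Return the prioritised attributes absent from a data set.

    The result keeps the priority order, so the first item is the
    attribute to chase first in any data acquisition effort.
    """
    # Normalise case and surrounding whitespace before comparing.
    present = {column.strip().lower() for column in dataset_columns}
    return [
        attribute
        for attribute in PRIORITISED_ATTRIBUTES
        if attribute not in present
    ]


# Made-up column headings for a hypothetical patient data set.
example_columns = ["patient_id", "Age", "sex", "diagnosis"]
print(missing_attributes(example_columns))
# prints ['ethnicity', 'race', 'socio_economic_status']
```

Keeping the order in a single list means that changing your organisation's priorities is a one-line edit rather than a change to the logic.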

Acknowledge the landscape of barriers

There's an inertia in the way health data have been collected. We can't go back in time and change which patients were recruited for past or long-running studies. Sometimes it can be difficult or impossible to retrospectively add important demographic metadata to an existing source. Clinical data, molecular data, genomics data and sources of tissue that show demographic bias can become heavily referenced in scientific literature. Their influence can be hard to ignore.

Once you have defined what diversity means to your organisation, and you have acknowledged the challenges, you can move on with trying to make what progress you can.

Create a corporate mission statement.

There are at least three good reasons for coming up with a corporate mission statement. First, to get agreement on it, you'll need input from the highest levels of the company. When senior leaders embrace it, everyone else in the organisation is more likely to treat it as necessary. Second, it tells your prospective data providers that you are interested in improving diversity in data and that you are bearing it in mind. Third, from an information governance perspective, a mission statement reminds you to stay aware of the scope of purpose for which you want to use personal data.

Identify activities.

Just as important as the mission statement is the work that goes into making it happen. Identifying activities helps to encourage ideas about the areas in which diversity coverage can be improved. Some precision medicine activities are more amenable to diversity enhancement than others. As an example, if an activity is about finding insights into existing data, then adding new people may not make sense. If you're working with a client who has certain data sources in mind, you can invite them to consider demographic coverage.

Take a demographic inventory of current data sources.

Systematically gather consistent data: The goal is to gather data on age, sex, ethnicity, race and socio-economic status, understand where they have been reported, and try to establish demographic coverage for disease activities. Many of these demographic attributes will undoubtedly already appear in various analysis reports. Age may have come up as important in one activity, ethnicity in another. The main goal is to gather consistent data about all of these attributes systematically. Doing so may not change any of your results, but it will help to describe them better.

Revisit your purpose for processing: This is a crucial information governance concept. You need to check the needs of your Diversity in Data Initiative against your original stated purpose for gathering data, and verify that the new purpose falls within that original scope.

Check whether the attributes appear in the data source: Attributes may have already been reported. It could be that the attributes are in the data sets but didn't make it into some of the reports. It could be that the provider has access to diversity attributes but nobody thought to ask for them. It could be that the provider doesn't have them. In any case, ask the question and try to find out.
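The inventory steps above could be sketched along the following lines. This is a minimal illustration under stated assumptions, not the code in the Diversity Analysis Tool: the field names, the list of "missing" markers and the function `demographic_inventory` are all hypothetical, and the records are made up for the example. For each attribute it reports whether the field appears in the source at all and what fraction of records carry a usable value.

```python
# Attribute names and missing-value markers are assumptions for this sketch.
ATTRIBUTES = ["sex", "age", "ethnicity", "race", "socio_economic_status"]
MISSING_MARKERS = {"", "unknown", "not recorded", "n/a"}


def is_filled(value):
    """Treat None and common placeholder strings as missing."""
    if value is None:
        return False
    return str(value).strip().lower() not in MISSING_MARKERS


def demographic_inventory(records):
    """Summarise, per attribute, presence in the source and completeness."""
    total = len(records)
    inventory = {}
    for attribute in ATTRIBUTES:
        # Does the field exist in any record, even if it is empty?
        in_source = any(attribute in record for record in records)
        # How many records hold a usable value for it?
        filled = sum(1 for record in records if is_filled(record.get(attribute)))
        inventory[attribute] = {
            "in_source": in_source,
            "completeness": filled / total if total else 0.0,
        }
    return inventory


# Made-up records, purely for illustration.
example_records = [
    {"sex": "F", "age": 54, "ethnicity": "Unknown"},
    {"sex": "M", "age": 61, "ethnicity": "White British"},
    {"sex": "", "age": 47, "ethnicity": "Black Caribbean"},
    {"sex": "F", "age": None, "ethnicity": ""},
]

for attribute, summary in demographic_inventory(example_records).items():
    print(attribute, summary)
```

Separating "in the source" from "complete" mirrors the distinction drawn above: a field that is absent calls for a question to the provider, while a field that is present but sparsely filled points to a reporting or collection gap.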


Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine

Test our Diversity Analysis Tool: a simple open-source programme

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub.
