Data Diversity: Ten practical approaches to acquiring the right datasets Part 1

The impact of machine learning models is heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socio-economic status, are often poorly represented in the data sources that inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on the data acquisition process and outline some possible approaches.

Define and prioritise diversity attributes.

Diversity is a broad topic, and when patient data is powering algorithms, it is critical to identify the relevant attributes of this data. These attributes can form the metrics for coverage and are essential in diversity discussions, but it's also worth prioritising them.

As an example, let’s think about sex, age and ethnicity, in that order, on the one hand, and race followed by socio-economic status on the other. These are imperfect and sometimes unclear concepts. Sex and gender, for instance, are two different concepts that are often conflated in health records. Ethnicity and race are often mixed together. And socio-economic status can refer to different kinds of indicators.

Sex, age and ethnicity are the most biologically relevant to disease and the most often collected. The variables are ordered partly by their power to include or exclude parts of society. Race and socio-economic status, by contrast, provide indicators of health inequalities, which can have an impact on health outcomes.

Why order them? Because in any given health data set about patients, some of these diversity attributes may be missing. Ordering them helps to prioritise data acquisition efforts for finding what is absent.
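As a minimal sketch of this prioritisation, assuming a data set can be described simply by its column names, the snippet below lists the diversity attributes that are absent, highest priority first. The attribute names, the `PRIORITY` list and the `missing_attributes` function are illustrative assumptions and are not taken from the Diversity Analysis Tool.

```python
# Hypothetical priority order, following the ordering discussed above.
PRIORITY = ["sex", "age", "ethnicity", "race", "socio_economic_status"]


def missing_attributes(columns, priority=PRIORITY):
    """Return the diversity attributes absent from `columns`, highest priority first."""
    # Normalise column names so that "Age" and " age " both match "age".
    present = {name.strip().lower() for name in columns}
    return [attr for attr in priority if attr not in present]


# Example: a data set that records only age and sex.
example_columns = ["patient_id", "Age", "Sex", "diagnosis"]
print(missing_attributes(example_columns))
# → ['ethnicity', 'race', 'socio_economic_status']
```

In practice, column names vary widely between providers, so a real check would likely need a mapping from each provider's field names to these attributes.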

Acknowledge the landscape of barriers

There's an inertia in the way health data have been collected. We can't go back in time and change which patients were recruited for past or long-running studies. Sometimes it can be difficult or impossible to retrospectively add important demographic metadata to an existing source. Clinical data, molecular data, genomics data and sources of tissue that show demographic bias can become heavily referenced in scientific literature. Their influence can be hard to ignore.

Once you have defined what diversity means to your organisation, and you have acknowledged the challenges, you can move on with trying to make what progress you can.

Create a corporate mission statement.

There are at least three good reasons for coming up with a corporate mission statement. First, to get agreement on it, you'll need input from the highest levels of the company. When they embrace it, everyone else in the organisation is more likely to consider it necessary. Second, it tells your prospective data providers that you're interested in making that change and are bearing it in mind. Third, from an information governance perspective, a mission statement reminds you to stay aware of the scope of purpose for which you want to use personal data.

Identify activities.

Just as important as the mission statement is the work that goes into making that happen. Identifying the activities involved helps to encourage ideas about where improvements in diversity coverage can be made. Some precision medicine activities are more amenable to diversity enhancement than others. As an example, if an activity is about finding insights into existing data, then adding new people may not make sense. If you're working with a client who has certain data sources in mind, you can invite them to consider demographic coverage.

Take a demographic inventory of current data sources.

Systematically gather consistent data: The goal would be to gather data around age, sex, ethnicity, race, and socio-economic status, understand where they have been reported, and try to get demographic coverage for disease activities. Many of these demographic attributes will undoubtedly already appear in various analysis reports. Age may have come up as important in one activity, ethnicity in another. The main goal is to systematically gather consistent data about all of these attributes. Doing so may not change any of your results, but it will help describe them better.

Revisit your purpose for processing: This is a crucial information governance concept. You need to check the needs of your Diversity in Data Initiative against your original stated purpose for gathering data, and verify that your purpose is within that original scope.

Check whether the attributes appear in the data source: Attributes may have already been reported. It could be that the data sets are there but didn't make it into some of the reports. It could be that the provider has access to diversity attributes but nobody thought to ask for them. It could be that the provider doesn't have them. In any case, ask the question and try to find out.
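The inventory steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: records are plain dictionaries, the attribute names in `ATTRIBUTES` and the missing-value codes in `MISSING_CODES` are hypothetical, and `demographic_inventory` is not a function from the Diversity Analysis Tool. For each attribute it reports whether the field appears in the source at all, what share of records carry a usable value, and how the recorded values are distributed.

```python
from collections import Counter

# Hypothetical attribute names and missing-value codes; adjust to your source.
ATTRIBUTES = ["sex", "age", "ethnicity", "race", "socio_economic_status"]
MISSING_CODES = {"", "unknown", "not recorded"}


def is_missing(value):
    """Treat None and the assumed placeholder strings as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_CODES


def demographic_inventory(records, attributes=ATTRIBUTES):
    """Summarise, per attribute, whether it appears and how completely it is filled in."""
    total = len(records)
    inventory = {}
    for attr in attributes:
        recorded = [r[attr] for r in records if attr in r and not is_missing(r[attr])]
        inventory[attr] = {
            # Does the field exist in any record, even if empty?
            "appears": any(attr in r for r in records),
            # Share of records with a usable value.
            "completeness": len(recorded) / total if total else 0.0,
            # Distribution of the recorded values.
            "counts": Counter(recorded),
        }
    return inventory


# Made-up example records, for illustration only.
records = [
    {"sex": "F", "age": 54, "ethnicity": "unknown"},
    {"sex": "M", "age": 61, "ethnicity": "Asian"},
    {"sex": "F", "age": None, "ethnicity": ""},
    {"sex": "F", "age": 47, "ethnicity": "White"},
]

for attr, summary in demographic_inventory(records).items():
    print(f"{attr}: appears={summary['appears']}, "
          f"completeness={summary['completeness']:.0%}")
```

Separating "appears" from "completeness" keeps the two questions apart: a field that is present but mostly empty calls for a different conversation with the provider than a field that was never supplied.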


Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine

Test our Diversity Analysis Tool: a simple open-source programme

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub here
