Data Diversity: Ten practical approaches to acquiring the right datasets Part 1

The impact of machine learning models is heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socio-economic status, are often poorly represented in the data sources that inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on those that could be made to the data acquisition process, and outline some possible approaches.

Define and prioritise diversity attributes.

Diversity is a broad topic, and when patient data is powering algorithms, it is critical to identify the relevant attributes of this data. These attributes can form the metrics for coverage and are essential in diversity discussions, but it's also worth prioritising them.

As an example, let's think about sex, age and ethnicity, in that order, on the one hand, and race followed by socio-economic status on the other. These are imperfect and sometimes unclear concepts. Sex and gender, for instance, are two different concepts that are often conflated in health records. Ethnicity and race are often mixed together. And socio-economic status can refer to different kinds of indicators.

Sex, age and ethnicity are the attributes most biologically relevant to disease and the ones most often collected, whereas race and socio-economic status provide indicators of health inequalities, which can have an impact on health outcomes. The variables are ordered partly by their power to include or exclude parts of society. Why order them? Because in any given health data set about patients, some of these diversity attributes may be missing. Ordering them helps to prioritise data acquisition efforts to find what is absent.
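
As a minimal sketch of this idea, the ordering can be written down as a list and used to report which attributes a data set lacks, highest priority first. The attribute names, the example column headers and the `missing_attributes` helper are illustrative assumptions, not part of any particular tool.

```python
# Illustrative priority order taken from the discussion above; adjust it
# to reflect your own organisation's priorities.
DIVERSITY_ATTRIBUTES = ["sex", "age", "ethnicity", "race", "socio_economic_status"]


def missing_attributes(columns, priority=DIVERSITY_ATTRIBUTES):
    """Return the attributes absent from `columns`, highest priority first."""
    present = {name.strip().lower() for name in columns}
    return [attribute for attribute in priority if attribute not in present]


# Hypothetical column headers from a patient data set.
example_columns = ["patient_id", "Age", "sex", "diagnosis"]
print(missing_attributes(example_columns))
# prints ['ethnicity', 'race', 'socio_economic_status']
```

In practice, column names rarely match this neatly, so a real check would probably need a mapping from each source's own field names to the attributes being tracked.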

Acknowledge the landscape of barriers

There's an inertia in the way health data have been collected. We can't go back in time and change which patients were recruited for past or long-running studies. Sometimes it can be difficult or impossible to retrospectively add important demographic metadata to an existing source. Clinical data, molecular data, genomics data and sources of tissue that show demographic bias can become heavily referenced in scientific literature. Their influence can be hard to ignore.

Once you have defined what diversity means to your organisation, and you have acknowledged the challenges, you can move on with trying to make what progress you can.

Create a corporate mission statement.

There are at least three good reasons for coming up with a corporate mission statement. First, to get agreement on it, you'll need input from the highest levels of the company. When they embrace it, everyone else in the organisation is more likely to treat it as necessary. Second, it tells your prospective data providers that you're interested in making that change and that it's something you're bearing in mind. Third, from an information governance perspective, a mission statement helps remind you of the scope of purpose for which you want to use personal data.

Identify activities.

Just as important as the mission statement is the work that goes into making it happen. Identifying activities helps to encourage ideas about the areas in which diversity coverage can be improved. Some precision medicine activities are more amenable to diversity enhancement than others. As an example, if an activity is about finding insights in existing data, then adding new people may not make sense. If you're working with a client who has certain data sources in mind, you can invite them to consider demographic coverage.

Take a demographic inventory of current data sources.

Systematically gather consistent data: The goal would be to gather data around age, sex, ethnicity, race, and socio-economic status, understand where they have been reported, and try to get demographic coverage for disease activities. Many of these demographic attributes will undoubtedly already appear in various analysis reports. Age may have come up as important in one activity, ethnicity in another. The main goal is to systematically gather consistent data about all of these attributes. Doing so may not change any of your results, but it will help describe them better.

Revisit your purpose for processing: This is a crucial information governance concept. You need to make sure you have checked the needs of your Diversity in Data Initiative against your original stated purpose for gathering data. Verify that your purpose is within that original scope.

Check whether the attributes appear in the data source: Attributes may have already been reported. It could be that they are in the data sets but didn't make it into some of the reports. It could be that the provider has access to diversity attributes but nobody thought to ask for them. It could be that the provider doesn't have them. In any case, ask the question and try to find out.
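
The inventory steps above can be sketched as a small coverage report: for each data source, the fraction of records with a reported value for each attribute, plus a list of the attributes that never appear and are therefore worth raising with the provider. This is only a sketch under stated assumptions. The source names, the record layout, the `MISSING_VALUES` codes and the function names are all hypothetical, and it is not the Diversity Analysis Tool mentioned below.

```python
from collections import Counter

ATTRIBUTES = ["sex", "age", "ethnicity", "race", "socio_economic_status"]
# Values treated as "not reported"; extend this set for your own sources.
MISSING_VALUES = {"", "unknown", "not recorded", "n/a"}


def is_reported(value):
    """True if a field holds a usable value rather than a missing-value code."""
    return value is not None and str(value).strip().lower() not in MISSING_VALUES


def coverage(records, attributes=ATTRIBUTES):
    """Fraction of records with a reported value, per attribute."""
    total = len(records)
    counts = Counter()
    for record in records:
        for attribute in attributes:
            if is_reported(record.get(attribute)):
                counts[attribute] += 1
    return {a: (counts[a] / total if total else 0.0) for a in attributes}


def inventory(sources):
    """Build {source name: {attribute: coverage}} for every data source."""
    return {name: coverage(records) for name, records in sources.items()}


# Two hypothetical data sources, each a list of patient records.
sources = {
    "study_a": [
        {"sex": "F", "age": 54, "ethnicity": "unknown"},
        {"sex": "M", "age": 61, "ethnicity": "Group A"},
    ],
    "biobank_b": [
        {"sex": "F", "age": None},
        {"sex": "", "age": 47},
    ],
}

for name, result in inventory(sources).items():
    # Attributes with no coverage at all are the ones to ask the provider about.
    to_ask = [a for a, fraction in result.items() if fraction == 0.0]
    print(name, result, "ask provider about:", to_ask)
```

Using the same attribute list and the same missing-value rules for every source is what keeps the inventory consistent. A coverage of zero does not distinguish between "never collected" and "collected but not shared", which is why the last step is a question for the provider rather than a conclusion.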

Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine

Test our Diversity Analysis Tool: a simple open-source programme

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub.