Data Diversity: Ten practical approaches to acquiring the right datasets Part 1

The impact of machine learning models is heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socio-economic status, are often poorly represented in the data sources that inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on the data acquisition process and outline some possible approaches.

Define and prioritise diversity attributes.

Diversity is a broad topic, and when patient data is powering algorithms, it is critical to identify the relevant attributes of this data. These attributes can form the metrics for coverage and are essential in diversity discussions, but it's also worth prioritising them.

As an example, let’s think about sex, age and ethnicity, in that order, on the one hand, and race followed by socio-economic status on the other. These are imperfect and sometimes unclear concepts. Sex and gender, for instance, are two different concepts that are often conflated in health records. Ethnicity and race are often mixed together. And socio-economic status can refer to different kinds of indicators.

Sex, age and ethnicity are the most biologically relevant to disease and the most often collected. Race and socio-economic status, by contrast, provide indicators of health inequalities, which can have an impact on health outcomes. The variables are ordered partly based on their power to include or exclude parts of society.

Why order them? Because in any given health data set about patients, some of these diversity attributes may be missing. Ordering them can help to prioritise data acquisition efforts towards finding what is absent.
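The ordering above can be turned into a simple check. What follows is a minimal sketch, not code from any particular tool: the `PRIORITY` list, the column names and the `missing_attributes` function are all hypothetical, and it assumes a data set's column names match the attribute names once trimmed and lower-cased. Real sources will usually need a mapping step first.

```python
# Hypothetical priority order, following the ordering discussed above.
PRIORITY = ["sex", "age", "ethnicity", "race", "socio_economic_status"]


def missing_attributes(columns, priority=PRIORITY):
    """Return the diversity attributes absent from `columns`, highest priority first."""
    # Assumption: column names equal attribute names after trimming and lower-casing.
    present = {column.strip().lower() for column in columns}
    return [attribute for attribute in priority if attribute not in present]


# Illustrative column names only; they do not describe a real data set.
columns = ["patient_id", "Age", "sex", "diagnosis"]
print(missing_attributes(columns))
# -> ['ethnicity', 'race', 'socio_economic_status']
```

The result lists what to ask a data provider about first. It says nothing about how well the attributes that are present have been filled in.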

Acknowledge the landscape of barriers.

There's an inertia in the way health data have been collected. We can't go back in time and change which patients were recruited for past or long-running studies. Sometimes it can be difficult or impossible to retrospectively add important demographic metadata to an existing source. Clinical data, molecular data, genomics data and sources of tissue that show demographic bias can become heavily referenced in scientific literature. Their influence can be hard to ignore.

Once you have defined what diversity means to your organisation, and you have acknowledged the challenges, you can move on to making what progress you can.

Create a corporate mission statement.

There are at least three good reasons for coming up with a corporate mission statement. First, to get agreement on it, you'll need input from the highest levels of the company. When they embrace it, everyone else in the organisation is more likely to consider it necessary. Second, it tells your prospective data providers that you're interested in making that change and that it's something you're bearing in mind. Third, from an information governance perspective, a mission statement reminds you to stay aware of the scope of purpose for which you want to use personal data.

Identify activities.

Just as important as the mission statement is the work that goes into making it happen. Identifying activities helps to encourage ideas about the areas in which diversity coverage can be improved. Some precision medicine activities are more amenable to diversity enhancement than others. As an example, if an activity is about finding insights in existing data, then adding new people may not make sense. If you're working with a client who has certain data sources in mind, you can invite them to consider demographic coverage.

Take a demographic inventory of current data sources.

Systematically gather consistent data: The goal would be to gather data around age, sex, ethnicity, race, and socio-economic status, understand where they have been reported, and try to get demographic coverage for disease activities. Many of these demographic attributes will undoubtedly already appear in various analysis reports. Age may have come up as important in one activity, ethnicity in another. The main goal is to systematically gather consistent data about all of these attributes. Doing so may not change any of your results, but it will help to describe them better.

Revisit your purpose for processing: This is a crucial information governance concept. You need to make sure you have checked the needs of your Diversity in Data Initiative against your original stated purpose for gathering data. Verify that your purpose is within that original scope.

Check whether the attributes appear in the data source: Attributes may have already been reported. It could be that the data are there but didn't make it into some of the reports. It could be that the provider has access to diversity attributes but nobody thought to ask for them. It could be that the provider doesn't have them. In any case, ask the question and try to find out.
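As an illustration of the inventory steps above, here is a minimal sketch that records, for each data source, whether each attribute appears at all and how completely it is filled in. It is a sketch under stated assumptions rather than a description of any existing tool: the attribute names, the placeholder values treated as missing, the `inventory` function and the sample records are all hypothetical.

```python
ATTRIBUTES = ["age", "sex", "ethnicity", "race", "socio_economic_status"]

# Assumption: these strings stand in for "no usable value" in the sources.
PLACEHOLDERS = {"", "unknown", "not recorded"}


def is_filled(value):
    """True if the value carries usable information."""
    return value is not None and str(value).strip().lower() not in PLACEHOLDERS


def inventory(sources, attributes=ATTRIBUTES):
    """Map each source name to {attribute: completeness}.

    Completeness is the fraction of records with a usable value,
    or None when the attribute does not appear in the source at all.
    """
    report = {}
    for name, records in sources.items():
        row = {}
        for attribute in attributes:
            if not any(attribute in record for record in records):
                row[attribute] = None  # never collected: ask the provider
                continue
            filled = sum(1 for record in records if is_filled(record.get(attribute)))
            row[attribute] = filled / len(records)
        report[name] = row
    return report


# Invented records for illustration only.
sources = {
    "study_a": [
        {"age": 54, "sex": "F", "ethnicity": "unknown"},
        {"age": 61, "sex": "M", "ethnicity": "White British"},
    ],
    "study_b": [
        {"age": 47, "sex": ""},
        {"age": None, "sex": "F"},
    ],
}

report = inventory(sources)
for name, row in report.items():
    print(name, row)
```

Separating "absent from the source" (`None`) from "present but poorly filled" (a low fraction) mirrors the distinction drawn above: the first is a question for the provider, the second a data quality issue within the source you already have.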

Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine



Test our Diversity Analysis Tool: a simple open-source programme

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub.
