Data Diversity: Ten practical approaches to acquiring the right datasets Part 1

The impact of machine learning models is heavily influenced by the data sources they use, and it is well known that many demographic groups, defined by sex, age, ethnicity, race and socio-economic status, are often poorly represented in the data sources that inform precision medicine.

The need to improve the diversity of data used in the field is driven by moral, scientific and economic motives. Improvements could be made to many aspects of the drug discovery process, but here we focus on those that could be made to the data acquisition process and outline some possible approaches.

Define and prioritise diversity attributes.

Diversity is a broad topic, and when patient data is powering algorithms it is critical to identify the relevant attributes of this data. These attributes can form the metrics for coverage and are essential in diversity discussions, but it's also worth prioritising them.

As an example, let’s think about sex, age and ethnicity, in that order, on the one hand, and race followed by socio-economic status on the other. These are imperfect and sometimes unclear concepts. Sex and gender, for instance, are two different concepts that are often conflated in health records. Ethnicity and race are often mixed together. And socio-economic status can refer to different kinds of indicators.

Sex, age and ethnicity are the most biologically relevant to disease and the most often collected. The variables are ordered partly based on their power to include or exclude parts of society. Race and socio-economic status, by contrast, provide indicators of health inequalities, which can have an impact on health outcomes.

Why order them? Because in any given health data set about patients, some of these diversity attributes may be missing. Ordering them helps to prioritise data acquisition efforts to find what is absent.
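As a rough illustration of how such an ordering might be used, the sketch below lists which prioritised attributes a data set lacks, highest priority first. It is a minimal Python sketch, not part of any published tool: the attribute names, the `missing_attributes` function and the example column names are all hypothetical, and it assumes column names match the attribute names once lower-cased.

```python
# Hypothetical priority order, following the ordering discussed above.
PRIORITISED_ATTRIBUTES = ["sex", "age", "ethnicity", "race", "socio_economic_status"]


def missing_attributes(available_columns, priorities=PRIORITISED_ATTRIBUTES):
    """Return the prioritised attributes absent from a data set, highest priority first."""
    # Normalise column names so that "Age" and " age " both count as "age".
    available = {name.strip().lower() for name in available_columns}
    return [attr for attr in priorities if attr not in available]


# Example: column names from an imaginary patient data set.
columns = ["patient_id", "Age", "sex", "diagnosis"]
print(missing_attributes(columns))
# prints ['ethnicity', 'race', 'socio_economic_status']
```

In practice, real data sets rarely use such tidy column names, so a mapping from each provider's field names to your own attribute list would probably be needed first.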

Acknowledge the landscape of barriers

There's an inertia in the way health data have been collected. We can't go back in time and change which patients were recruited for past or long-running studies. Sometimes it can be difficult or impossible to retrospectively add important demographic metadata to an existing source. Clinical data, molecular data, genomics data and sources of tissue that show demographic bias can become heavily referenced in scientific literature. Their influence can be hard to ignore.

Once you have defined what diversity means to your organisation, and you have acknowledged the challenges, you can move on with trying to make what progress you can.

Create a corporate mission statement.

There are at least three good reasons for coming up with a corporate mission statement. First, to get agreement on it, you'll need input from the highest levels of the company. When they embrace it, it becomes easier for everyone else in the organisation to treat it as necessary. Second, it tells your prospective data providers that you're interested in making that change and that it's something you're bearing in mind. Third, from an information governance perspective, a mission statement helps remind you of the scope of purpose for which you want to use personal data.

Identify activities.

Just as important as the mission statement is the work that goes into making it happen. Identifying activities helps to encourage ideas about the areas in which diversity coverage can be improved. Some precision medicine activities are more amenable to diversity enhancement than others. As an example, if an activity is about finding insights in existing data, then adding new people may not make sense. If you're working with a client who has certain data sources in mind, you can invite them to consider demographic coverage.

Take a demographic inventory of current data sources.

Systematically gather consistent data: The goal is to gather data on age, sex, ethnicity, race and socio-economic status, understand where they have been reported, and try to establish demographic coverage for disease activities. Many of these demographic attributes will undoubtedly already appear in various analysis reports. Age may have come up as important in one activity, ethnicity in another. The main goal is to systematically gather consistent data about all of these attributes. Doing so may not change any of your results, but it will help to describe them better.

Revisit your purpose for processing: This is a crucial information governance concept. You need to check the needs of your Diversity in Data Initiative against your original stated purpose for gathering data, and verify that they fall within that original scope.

Check whether the attributes appear in the data source: Attributes may have already been reported. It could be that the attributes are in the data sets but didn't make it into some of the reports. It could be that the provider has access to diversity attributes but nobody thought to ask for them. It could be that the provider doesn't have them. In any case, ask the question and try to find out.
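The inventory described in this section could be sketched as a simple coverage count: for each demographic attribute, what fraction of records actually report a value? The Python below is an illustrative sketch only. The attribute names, the placeholder values treated as "not reported", and the example records are assumptions made up for this example, not taken from any real data source or tool.

```python
# Values treated as "not reported"; this set is an assumption for illustration.
MISSING_MARKERS = {"", "unknown", "not recorded", "n/a"}

# Hypothetical attribute names for the five demographic concepts.
ATTRIBUTES = ["age", "sex", "ethnicity", "race", "socio_economic_status"]


def is_reported(value):
    """True when a field holds a usable value rather than a blank or placeholder."""
    if value is None:
        return False
    return str(value).strip().lower() not in MISSING_MARKERS


def demographic_inventory(records, attributes=ATTRIBUTES):
    """Map each attribute to the fraction of records that report it."""
    total = len(records)
    inventory = {}
    for attr in attributes:
        # rec.get returns None when the attribute is absent from a record.
        reported = sum(1 for rec in records if is_reported(rec.get(attr)))
        inventory[attr] = reported / total if total else 0.0
    return inventory


# Made-up records standing in for rows from a data source.
records = [
    {"age": 54, "sex": "F", "ethnicity": "Unknown"},
    {"age": 61, "sex": "M", "ethnicity": "White British"},
    {"age": None, "sex": "F"},
    {"age": 47, "sex": "", "ethnicity": "Black Caribbean"},
]

for attr, fraction in demographic_inventory(records).items():
    print(f"{attr}: {fraction:.0%} of records")
```

A coverage of zero does not say why an attribute is absent. It may never have been collected, or it may simply not have been requested from the provider, which is why it is still worth asking the question.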

BLOG Part 2 →

Kevin Garwood
Patient Data Manager: Information Governance and Data Acquisition in Precision Medicine

Test our Diversity Analysis Tool: a simple open-source program

The codebase we have open-sourced is meant to be simple and to provide some basic code examples that may inspire you to develop more sophisticated solutions. We also provide information about common data processing issues that relate to these demographic concepts. The codebase may help you make a better assessment of the diversity in data sets you already work with. It may also help you evaluate the diversity in future health data sets you may encounter. However you use the code, we hope it encourages you to think more about improving the diversity in data.

The Diversity Analysis Tool is available on GitHub here
