Artificial intelligence (AI) is a broad and evolving scientific field, and the value it can deliver at various stages of the drug discovery process is now widely accepted in the pharmaceutical industry. This blog seeks to demystify the application of AI in drug discovery, focusing on its key challenges, opportunities and successes.
Over one million scientific articles are published every year in the biomedical domain alone, and every year brings new methods for data collection and more detailed data modalities. While scientists have access to an exponentially increasing amount of knowledge and data, biological data is messy and incomplete; it may contain conflicting evidence, suppositions, biases, uncertainty, gaps in knowledge or misclassifications. This prevents us from understanding the full biology landscape and complicates decision making.
That incomplete data makes training computer models to recognise patterns challenging. In classic computer vision, for example, machines are trained to identify, say, a cat from thousands of examples of cats. In drug discovery, by contrast, we are operating in uncharted territory, with gaps in understanding; there is no clear definition or example to train the machine on, because researchers don’t know what a potential treatment looks like.
In addition, many demographic groups relating to sex, age, ethnicity, race and socioeconomic status are often poorly represented in biomedical data. This data gap can seriously undermine attempts to understand disease variation and predict drug outcomes across populations.
The human body is an astonishingly complex information system. Biology spans more than ten orders of magnitude, from energy interactions at the molecular scale (10⁻¹⁰ m) that determine protein binding to systems biology questions that concern the whole organism (10⁰ m). In time, meaningful events range from interactions lasting nanoseconds (10⁻⁹ s) to disease progressions measured in years (10⁸ s). Representing biology at the right scale and specificity is incredibly difficult. Effects cascade across these orders of magnitude - for example, each of our cells holds two meters of DNA tightly coiled into a package mere micrometres across, and that tiny code determines in large part how we are physically assembled as human beings.
Not only is biology hierarchical, but the code depends on complicated regulatory interactions. A neuron in the brain and a muscle cell in the arm both contain the same DNA; what makes them different is how their genes are expressed (i.e. turned ‘on’ or ‘off’). Of the two meters of DNA in each cell, only about 6 cm actually codes for the proteins that are the main workers in a cell. The rest influences gene expression, allowing a neuron to become a neuron and a muscle cell to become a muscle cell. Each expressed gene plays an important and complex role in biological mechanisms and their interactions.
Therefore, when we identify a gene or protein that is dysregulated in a disease, we don’t necessarily know whether that is the protein we need to target - it might just be a downstream effect of another dysregulated gene. Similarly, the targeted protein might have a downstream effect on other biological mechanisms that could rule it out as a potential target. These dynamic interactions are very difficult to comprehend and model, making it hard to predict which protein to target and what the effect of drugging a protein might be.
Once a protein has been identified as a promising therapeutic target, the next step is to develop a small molecule drug. The number of possible molecules is staggering: there are more possible molecules than there are particles in the universe. Drug design - finding and developing one of those molecules - is an incredible challenge, especially as we have found and characterized only tiny pockets of this vast molecular space. Whether selecting, optimising or designing a molecule, it is extremely challenging to predict how best to improve it. Chemists must balance a variety of parameters to minimise adverse effects while delivering the desired effect(s) on the disease. From all the molecules that could be considered to the one that makes it to market, there is a long and difficult journey.
There are rapidly emerging areas of AI that can be applied to tackle the challenge of understanding biology to improve health.
Natural language processing (NLP) and methods such as entity recognition, entity linking, and relation extraction allow scientists to extract meaningful information from biomedical literature at scale. These methods can recognise patterns in the literature and help further our understanding of biological disease drivers, uncover novel insights, and identify opportunities hidden in large volumes of data or scattered across disparate domains.
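As a flavour of what entity recognition and relation extraction mean in practice, here is a deliberately minimal sketch in pure Python. The vocabularies and the co-occurrence "relation" are toy stand-ins of our own invention; production systems use large curated ontologies and learned models rather than dictionary lookups.

```python
import re

# Toy vocabularies standing in for curated biomedical ontologies
# (hypothetical; real systems use large gene and disease databases).
GENES = {"AAK1", "JAK1", "JAK2"}
DISEASES = {"rheumatoid arthritis", "COVID-19"}

def recognise_entities(sentence):
    """Match known gene and disease names in a sentence (entity recognition)."""
    found = []
    for name in GENES:
        if re.search(rf"\b{re.escape(name)}\b", sentence):
            found.append(("GENE", name))
    for name in DISEASES:
        if name.lower() in sentence.lower():
            found.append(("DISEASE", name))
    return found

def extract_relations(sentence):
    """Pair every gene with every disease mentioned in the same sentence -
    a crude stand-in for learned relation extraction."""
    ents = recognise_entities(sentence)
    genes = [e for t, e in ents if t == "GENE"]
    diseases = [e for t, e in ents if t == "DISEASE"]
    return [(g, "co-mentioned-with", d) for g in genes for d in diseases]

text = "Baricitinib inhibits AAK1 and is approved for rheumatoid arthritis."
print(extract_relations(text))
# [('AAK1', 'co-mentioned-with', 'rheumatoid arthritis')]
```

Even this crude version hints at the payoff: run over millions of sentences, such extractions accumulate into a knowledge graph linking genes, drugs and diseases.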
Unsupervised learning approaches can help identify latent biological mechanisms that influence complex diseases from molecular or clinical data sets in an unbiased way. These approaches are complementary to NLP models, as the latter rely on biomedical literature, which covers only a small fraction of the 20,000 human genes. Careful investigation of the outputs of unsupervised methods applied to molecular or clinical data, in combination with other approaches such as NLP, may lead to the discovery of novel disease mechanisms and biomarkers. Unsupervised learning can also uncover disease subgroups at the patient level, with applications in precision medicine and clinical trial stratification.
Representation learning lies at the heart of how we use AI to learn and model biology. This field studies how to represent concepts from the real world and “featurize” them to make them useful for AI tasks. For example, representation learning examines how best to encode a molecule so computers can predict important properties like protein binding, or, in the target identification step of drug discovery, how to encode concepts to predict new therapies. Techniques such as graph neural networks, tensor factorization models, and transformer-based NLP embedding models have recently been applied to learn meaningful biomedical representations.
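A toy example of "featurizing" a molecule: count the element symbols in its SMILES string to get a fixed-length vector a model could consume. This bag-of-atoms encoding is our own simplification for illustration - learned representations such as graph neural network embeddings are far richer - but the principle is the same: map a structure to numbers.

```python
from collections import Counter

# Elements tracked by this toy featurizer (an illustrative subset).
ELEMENTS = ["C", "N", "O", "S"]

def featurize(smiles):
    """Encode a molecule as counts of single-letter element symbols.

    A toy sketch: it ignores multi-letter symbols (e.g. 'Cl') and
    lowercase aromatic notation, which real SMILES parsers handle."""
    counts = Counter(ch for ch in smiles if ch in ELEMENTS)
    return [counts[e] for e in ELEMENTS]

# Caffeine (C8H10N4O2) written as a SMILES string.
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
print(featurize(caffeine))  # [8, 4, 2, 0]
```

The output vector [8, 4, 2, 0] matches caffeine's heavy-atom formula (8 carbons, 4 nitrogens, 2 oxygens, no sulfur); downstream models would consume such vectors to predict properties like protein binding.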
Human-in-the-loop design helps human experts work hand-in-hand with AI to enhance their understanding and help the AI learn, forming a positive feedback loop that benefits from human insight and the incredible recall of AI systems. Well-designed systems give scientists confidence in their decision-making by enabling end-users to interpret messy and incomplete biomedical data, to understand what data the models used to make predictions, and to contextualize that information among all the biomedical data.
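One common shape for such a feedback loop is active learning: the model scores candidates, the expert reviews the one the model is least certain about, and the new label improves future scoring. The sketch below is hypothetical throughout - the scorer, the "shared first letter" notion of evidence, and the candidate names are all invented to show the loop's structure, nothing more.

```python
def model_score(candidate, labelled):
    """Toy scorer: fraction of labelled examples sharing the candidate's
    first letter (a stand-in for shared evidence) marked relevant."""
    neighbours = [v for k, v in labelled.items() if k[0] == candidate[0]]
    return sum(neighbours) / len(neighbours) if neighbours else 0.5

def most_uncertain(candidates, labelled):
    """Pick the candidate whose score is closest to 0.5 - the one the
    model is least certain about, hence most worth an expert's time."""
    return min(candidates, key=lambda c: abs(model_score(c, labelled) - 0.5))

labelled = {"aspirin": 1}           # expert feedback gathered so far
pool = ["atenolol", "baricitinib"]  # unlabelled candidates

# "atenolol" shares evidence with a labelled example (score 1.0), while
# "baricitinib" has none (score 0.5), so the expert reviews it next.
print(most_uncertain(pool, labelled))  # baricitinib
```

Each expert review shrinks the model's uncertainty where it matters most, which is what makes the loop a genuine partnership rather than one-way automation.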
While AI-enabled drug discovery is certainly not immune to the hyped promises surrounding most AI applications, 2020 saw the field make impressive progress on some of the world’s most complex scientific challenges.
Proteins are large, complex molecules consisting of chains of amino acids folded into unique three-dimensional shapes. Proteins are essential to all known forms of life and are the most common substrate targeted by therapeutics. Working out how they fold into 3D shapes is an astonishingly hard scientific problem whose solution could impact disease understanding, drug discovery, and protein design. In November 2020, Google DeepMind trained its AI system, AlphaFold, on the sequences and structures of 100,000+ proteins and predicted, with impressive speed and accuracy, a protein’s shape from its sequence of amino acids. While there is certainly more work to be done, this landmark achievement demonstrates the utility of AI in scientific discovery.
A crisis is emerging in antibiotic discovery, caused by the rise of antibiotic-resistant bacteria and declining approvals of novel antibiotics. While screening the vast chemical space using traditional methods is expensive and time-consuming, computational approaches can help explore broader chemical space, faster. In 2020, MIT researchers used predictive computer models to screen over 100 million molecules and select potential antibiotics that kill bacteria with novel mechanisms of action. The machine-learning algorithm identified a powerful new antibiotic compound called halicin. Laboratory tests showed the compound killed some of the most drug-resistant bacteria and it was further validated in two different mouse models. This proof of concept highlights the power of computational approaches in antibiotic discovery.
In late January 2020, a specialist BenevolentAI team used its biomedical knowledge graph and AI tools to identify Eli Lilly’s baricitinib as a potential COVID-19 treatment. Our AI methods focus on extracting data and uncovering hidden relationships to infer previously unknown scientific information. Our scientific experts identified baricitinib - an approved rheumatoid arthritis drug - as the strongest option based on the novel finding that baricitinib could not only reduce inflammation but also block the virus from entering and infecting lung cells by inhibiting a protein called AAK1. The team published their findings in The Lancet in February 2020, and by April 2020, this hypothesis had prompted global clinical trials. Baricitinib combined with remdesivir was awarded emergency authorization by the FDA in November 2020, based on data showing the combination reduced recovery time and improved clinical outcomes. Baricitinib was then shown to reduce mortality of hospitalised patients by 38% in Eli Lilly’s COV-BARRIER trial - the largest reduction in mortality in this COVID-19 patient population reported to date - and the drug has since been approved for emergency use in Japan and India. Our novel AI-derived hypothesis for baricitinib’s combined antiviral and anti-inflammatory mechanism of action demonstrates the value of using ML to extract and infer new scientific information, and shows how AI can accelerate the search for potential drug candidates.
While AI and ML can help accelerate the pace, broaden the scale and increase precision in drug R&D, they are no silver bullet. Once an AI platform or system has made a prediction or recommendation, it must be interpreted by expert scientists. In the case of Benevolent’s search for a potential COVID-19 treatment, for example, our AI technology played a significant role in accelerating the discovery of a list of potential candidates, streamlining the triage process and enhancing the ability to query the results. However, it was our Benevolent scientists who evaluated the recommendations and put forward the hypothesis.
For applications of AI and ML in drug discovery to be successful, they must be designed to augment or enhance the scientist. By keeping humans in the loop, new technologies can empower scientists to unlock hidden insights in the data and allow them to interact with and interpret this data in previously impossible ways. Through this partnership of human and machine intelligence, we will see AI become an invaluable tool in helping advance life-changing discoveries through to the clinic.
Rachel leads the application of AI to Target ID at Benevolent. She has a PhD in Computational Biology and is currently obsessed with knowledge graphs and explainable AI.
Nathan is recognised as a global thought leader in Chemoinformatics and the inventor of the first multiobjective de novo molecular design system. He joined BenevolentAI in 2017 from The Institute of Cancer Research, where he founded and led the In Silico Medicinal Chemistry team, making a significant scientific impact on drugs in active clinical trials and developing new algorithms for drug discovery.
Aylin graduated in mathematics and computer science from the TU Braunschweig and was awarded a PhD from Queen Mary University of London. She joined the Francis Crick Institute in 2014, where she developed and applied new ML methods to biomedical data. At Benevolent, Aylin leads the team that evaluates and qualifies workflows and processes combining genetic, molecular and clinical data for drug discovery.