Advanced AI Scientist
Large-scale knowledge graphs have gained prominence over the past several decades as a means for representing structured data at scale.
But encoding complex facts within the confines of a knowledge graph can be challenging. Often, in the real world, we would like to ask questions that are contextualised by multiple conditions, such as locations, types or dates. For example, a politician holds a given elected position for a particular timespan, so we might ask, “Who did the USA elect as President in 1960?”. Traditional knowledge graphs represent facts as a triplet of two entities and a relationship, such as (USA, ElectedPresident, JFK), and are unable to represent the additional conditions, such as “in 1960”, without an explosion in relationship types. This paved the way for more recent hyper-relational (or n-ary) knowledge graphs, which can represent higher-order facts.
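To make the difference concrete, here is a minimal sketch of a triplet next to a hyper-relational fact that carries extra conditions. The class names (`Triple`, `NaryFact`), the `matches` helper and the `year` qualifier key are illustrative assumptions, not part of any particular knowledge graph library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A traditional fact: two entities joined by one relationship."""
    head: str
    relation: str
    tail: str


@dataclass(frozen=True)
class NaryFact:
    """A triplet plus qualifiers (hypothetical layout for illustration)."""
    triple: Triple
    qualifiers: tuple = ()  # tuple of (key, value) pairs, e.g. (("year", 1960),)


def matches(fact, head, relation, **conditions):
    """Return True if the fact has this head and relation and meets every condition."""
    quals = dict(fact.qualifiers)
    return (
        fact.triple.head == head
        and fact.triple.relation == relation
        and all(quals.get(key) == value for key, value in conditions.items())
    )


facts = [
    NaryFact(Triple("USA", "ElectedPresident", "JFK"), (("year", 1960),)),
    NaryFact(Triple("USA", "ElectedPresident", "Nixon"), (("year", 1968),)),
]

# "Who did the USA elect as President in 1960?"
answers = [
    f.triple.tail
    for f in facts
    if matches(f, "USA", "ElectedPresident", year=1960)
]
print(answers)
```

Without the qualifier, both facts share the same head and relationship, so the plain triplets alone cannot distinguish the 1960 election from the 1968 one.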
Higher-order facts are central to drug discovery. Biological processes can depend entirely on certain contexts. For example, proteins interact differently depending on which tissue or cell type is under consideration. The ability to encode these dependencies could enable a more accurate prediction of therapeutic targets for drug interventions.
Machine learning models have been developed to predict new or unknown information from knowledge graphs in a task called knowledge graph completion (or link prediction). For instance, a link prediction model might reason from a knowledge graph containing the triplet (USA, ElectedPresident, JFK) to infer that the triplet (JFK, BornInCountry, USA) also likely exists. Traditional knowledge graph completion models have been carefully modified to encode these higher-order facts. However, these modifications present two problems:
- Literals (numerical values such as dates or gene expression levels) can’t be represented efficiently as discrete graph entities and are typically ignored.
- Construction of partially complete n-ary knowledge graphs in new domains, such as biology, is expensive and time-consuming.
An alternative approach is to apply Language Models (LMs) to encode complex facts instead. Pre-trained language models like BERT and GPT have transformed the field of natural language processing in recent years, setting record performance across a huge array of tasks. Natural language can incorporate additional context easily, and LMs can represent literals efficiently through sub-word tokenisation. Sub-word tokenisation includes frequently occurring words (or literals) in the vocabulary, while rare words are split into frequent sub-words. Additionally, huge auxiliary natural language corpora can be used to circumvent the need for partially complete n-ary knowledge graphs.
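As a rough illustration of sub-word tokenisation, the sketch below splits a word by greedy longest-match against a vocabulary, in the style of WordPiece tokenisers. The tiny vocabulary and the `tokenise` function are assumptions made for this example; a real model learns a vocabulary of tens of thousands of entries from its training corpus.

```python
def tokenise(word, vocab):
    """Greedy longest-match-first split of a single word into sub-words.

    Continuation pieces are prefixed with "##" (a WordPiece-style convention).
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no sub-word in the vocabulary fits
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical): a frequent literal is kept whole,
# while rarer strings must be assembled from frequent pieces.
VOCAB = {"1960", "19", "##73", "express", "##ion"}

print(tokenise("1960", VOCAB))        # frequent literal, one token
print(tokenise("1973", VOCAB))        # rarer literal, split into pieces
print(tokenise("expression", VOCAB))  # rare word, split into pieces
```

Because every literal decomposes into a small set of shared pieces, the model does not need a separate graph entity for each possible date or measurement.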
Here we present Hyper-ELC, a masked language modelling approach to n-ary link prediction. Hyper-ELC is pre-trained to predict the unique identifier of a masked entity in an auxiliary entity-linked corpus. It is then fine-tuned on cloze-style sentence templates created from the n-ary knowledge graph.
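One way to picture a cloze-style template is sketched below: an n-ary fact is rendered as a sentence with one entity replaced by a mask token, and the masked entity becomes the prediction target. The template wording, the `cloze` function and the `[MASK]` string are illustrative assumptions and are not taken from the Hyper-ELC implementation.

```python
MASK = "[MASK]"

# Hypothetical templates: one sentence pattern per relationship type,
# plus one phrase per qualifier key.
TEMPLATES = {"ElectedPresident": "{head} elected {tail} as President"}
QUALIFIER_PHRASES = {"year": "in {value}"}


def cloze(head, relation, tail, qualifiers, mask="tail"):
    """Render a fact as a sentence with one entity masked.

    Returns the cloze sentence and the entity the model should predict.
    """
    slots = {"head": head, "tail": tail}
    target = slots[mask]
    slots[mask] = MASK
    sentence = TEMPLATES[relation].format(**slots)
    # Append each condition, including literals such as dates, as plain text.
    for key, value in qualifiers.items():
        sentence += " " + QUALIFIER_PHRASES[key].format(value=value)
    return sentence + ".", target


sentence, target = cloze("USA", "ElectedPresident", "JFK", {"year": 1960})
print(sentence)
print(target)
```

In this sketch the literal condition "in 1960" is simply part of the sentence, so it reaches the language model as ordinary text.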
Hyper-ELC is the first purely natural language-based approach to n-ary link prediction, the first approach that leverages literal conditions and the first approach that does not require a partially complete n-ary knowledge graph to perform link prediction.
These distinct features enable us to ask ever-more flexible questions and retrieve increasingly precise answers to important questions in more diverse domains. We hope that the application of this novel approach in biomedicine will advance us towards discovering life-changing medicines with a greater likelihood of positive outcomes in the clinic.
On Masked Language Models for Contextual Link Prediction