May 27, 2022
ACL 2022

On Masked Language Models for Contextual Link Prediction

Angus Brayne, Maciej Wiatrak, Dane Corneil

In the real world, many relational facts require context; for instance, a politician holds a given elected position only for a particular timespan. This context (the timespan) is typically ignored in knowledge graph link prediction tasks, or is leveraged by models designed specifically to make use of it (i.e. n-ary link prediction models). Here, we show that the task of n-ary link prediction is easily performed using language models, applied with a basic method for constructing cloze-style query sentences. We introduce a pre-training methodology based around an auxiliary entity-linked corpus that outperforms other popular pre-trained models like BERT, even with a smaller model. This methodology also enables n-ary link prediction without access to any n-ary training set, which can be invaluable in circumstances where expensive and time-consuming curation of n-ary knowledge graphs is not feasible. We achieve state-of-the-art performance on the primary n-ary link prediction dataset WD50K and on WikiPeople facts that include literals - typically ignored by knowledge graph embedding methods.


Can masked language modelling beat knowledge graph methods for contextual link prediction?

Large-scale knowledge graphs have gained prominence over the past several decades as a means for representing structured data at scale. But encoding complex facts within the confines of a knowledge graph can be challenging. Often, in the real world, we would like to ask questions that are contextualised by multiple conditions, such as locations, types or dates. For example, a politician holds a given elected position for a particular timespan, so we might ask, “Who did the USA elect as President in 1960?”. Traditional knowledge graphs represent facts as a triplet of two entities and a relationship, such as (JFK, elected president, USA), and are unable to represent the additional conditions, such as “in 1960”, without an explosion in relationship types. This paved the way for more recent hyper-relational (or n-ary) knowledge graphs, which can represent higher-order facts.
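The distinction can be sketched in code. The data structures below are illustrative (loosely modelled on Wikidata's statement-plus-qualifiers format), not the representation used by any particular system: a hyper-relational fact is a base triplet plus a set of relation-value qualifiers that carry the extra context.

```python
# Sketch of triplet vs. hyper-relational (n-ary) fact representations.
# Field names are illustrative, loosely following Wikidata's
# statement-plus-qualifiers format.
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str


@dataclass(frozen=True)
class HyperRelationalFact:
    base: Triplet
    # Extra (relation, value) conditions, such as a timespan or location.
    qualifiers: tuple = ()


plain = Triplet("JFK", "elected_president", "USA")
contextual = HyperRelationalFact(
    base=plain,
    qualifiers=(("in_year", "1960"),),
)

print(contextual.base.relation)     # elected_president
print(dict(contextual.qualifiers))  # {'in_year': '1960'}
```

Representing the year as a qualifier avoids minting a new relation type (e.g. `elected_president_in_1960`) for every possible context value.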

Higher-order facts are central to drug discovery. Biological processes can depend entirely on certain contexts. For example, proteins interact differently depending on which tissue or cell type is under consideration. The ability to encode these dependencies could enable a more accurate prediction of therapeutic targets for drug interventions.

Machine learning models have been developed to predict new or unknown information from knowledge graphs in a task called knowledge graph completion (or link prediction). For instance, a link prediction model might reason from a knowledge graph containing the triplet (USA, ElectedPresident, JFK) to infer that the triplet (JFK, BornInCountry, USA) also likely exists. Traditional knowledge graph completion models have been carefully modified to encode these higher-order facts. However, these modifications present two problems:

  1. Literals (numeric values such as dates or gene expression levels) can’t be represented efficiently as discrete graph entities and are typically ignored.
  2. Construction of partially complete n-ary knowledge graphs in new domains, such as biology, is expensive and time-consuming.

An alternative approach is to apply language models (LMs) to encode complex facts instead. Pre-trained language models like BERT and GPT have transformed the field of natural language processing in recent years - setting record performance across a huge array of tasks. Natural language can easily incorporate additional context, and LMs can represent literals efficiently through sub-word tokenisation, which keeps frequently occurring words (or literals) in the vocabulary while splitting rare words into frequent sub-words. Additionally, huge auxiliary natural language corpora can then be used to circumvent the need for partially complete n-ary knowledge graphs.
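A minimal sketch of how sub-word tokenisation handles literals and rare words, using a toy vocabulary and greedy longest-match splitting in the style of WordPiece (the vocabulary here is an assumption chosen for illustration; real models learn vocabularies of tens of thousands of pieces):

```python
# Toy greedy longest-match sub-word tokeniser (WordPiece-style).
# The vocabulary is illustrative: frequent words and literals like "1960"
# get their own token, while rare words fall back to sub-word pieces
# (the "##" prefix marks a word-internal continuation).
VOCAB = {"who", "did", "the", "usa", "elect", "as", "president",
         "in", "1960", "electro", "##card", "##io", "##gram", "##s"}


def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:  # take the longest matching piece
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matches at this position
    return pieces


print(wordpiece("1960"))                # ['1960'] - the literal stays whole
print(wordpiece("electrocardiograms"))  # ['electro', '##card', '##io', '##gram', '##s']
```

Because literals like “1960” occur frequently, they end up in the vocabulary as single tokens, so the model can treat them the same way it treats any other word.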

Here we present Hyper-ELC - a masked language modelling approach to n-ary link prediction. Hyper-ELC is pre-trained to predict the unique identifier of a masked entity in an auxiliary entity-linked corpus. It is then fine-tuned on cloze-style sentence templates created from the n-ary knowledge graph.
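The fine-tuning step relies on turning n-ary facts into cloze-style sentences. The paper's exact template format isn't reproduced here, but the idea can be sketched as follows - the template strings and helper function are illustrative assumptions:

```python
# Illustrative sketch: turning an n-ary fact into a cloze-style query
# sentence for a masked language model. The template strings are
# assumptions for illustration, not the paper's exact format.
def make_cloze(head, relation, qualifiers, mask="[MASK]"):
    """Build a cloze sentence asking for the missing tail entity."""
    templates = {
        "elected_president": "{head} elected {mask} as president",
    }
    sentence = templates[relation].format(head=head, mask=mask)
    # Append each context condition (e.g. a timespan) as natural language.
    for qual_relation, value in qualifiers:
        sentence += f" {qual_relation.replace('_', ' ')} {value}"
    return sentence + "."


query = make_cloze("the USA", "elected_president", [("in_year", "1960")])
print(query)  # the USA elected [MASK] as president in year 1960.
```

The language model then scores candidate entities at the `[MASK]` position, so adding a qualifier is as simple as appending another phrase to the sentence - no change to the model architecture is needed.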

Hyper-ELC is the first purely natural language-based approach to n-ary link prediction, the first approach that leverages literal conditions, and the first that does not require a partially complete n-ary knowledge graph to perform link prediction.

These distinct features enable us to ask ever-more flexible questions and retrieve increasingly precise answers to important questions in more diverse domains. We hope that the application of this novel approach in biomedicine will advance us towards discovering life-changing medicines with a greater likelihood of positive outcomes in the clinic.

Watch Video Presentation →