20 May 2021


Authors: Héléna Alexandra Gaspar, Mohamed Ahmed, Thomas Edlich, Benedek Fabian, Zsolt Varszegi, Marwin Segler, Joshua Meyers, Marco Fiscato


Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences.

Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings, based on subword-segmented language models trained on mammalian sequences, against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy protein one-hot encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small, and that combining NLP-based and MSA-based descriptors consistently increases predictive performance across different splitting strategies. This work was presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry symposium in September 2020.
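The core PCM setup described above pairs a fixed ligand descriptor with an interchangeable protein descriptor before fitting a single activity model. As a minimal sketch, the snippet below builds PCM feature vectors for the dummy one-hot protein baseline mentioned in the abstract; the ligand descriptor values and protein names are placeholders (real work would use e.g. circular fingerprints and the paper's learned embeddings), not the authors' actual pipeline.

```python
import numpy as np

def one_hot_proteins(protein_ids):
    """Dummy protein descriptor: one-hot encode protein identity."""
    uniq = sorted(set(protein_ids))
    index = {p: i for i, p in enumerate(uniq)}
    mat = np.zeros((len(protein_ids), len(uniq)))
    for row, p in enumerate(protein_ids):
        mat[row, index[p]] = 1.0
    return mat

def pcm_features(ligand_desc, protein_ids):
    """Concatenate ligand and protein descriptors row-wise,
    giving one feature vector per ligand-protein activity pair."""
    return np.hstack([ligand_desc, one_hot_proteins(protein_ids)])

# Three hypothetical ligand-protein pairs with 4-dim ligand descriptors.
ligands = np.array([[0.1, 0.0, 1.0, 0.5],
                    [0.9, 0.2, 0.0, 0.1],
                    [0.3, 0.7, 0.4, 0.0]])
proteins = ["CYP3A4", "CYP2D6", "CYP3A4"]

X = pcm_features(ligands, proteins)
print(X.shape)  # (3, 6): 4 ligand dims + 2 one-hot protein dims
```

Swapping `one_hot_proteins` for an NLP-based embedding or an MSA-derived similarity row, while keeping the ligand block fixed, mirrors the comparison the benchmark performs.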
