Proteochemometric Models Using Multiple Sequence Alignments and a Subword Segmented Masked Language Model

20 May 2021

CHEMRXIV


Authors: Héléna Alexandra Gaspar, Mohamed Ahmed, Thomas Edlich, Benedek Fabian, Zsolt Varszegi, Marwin Segler, Joshua Meyers, Marco Fiscato

Abstract

Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences.

Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings, based on subword-segmented language models trained on mammalian sequences, against four alternatives: pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy one-hot encodings of the proteins, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small, and that combining NLP-based and MSA-based descriptors increases predictive performance consistently across different splitting strategies. This work was presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meeting in September 2020.
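As a rough illustration of how a PCM input can be assembled, the sketch below concatenates fixed ligand descriptors with either a one-hot protein encoding or a row of a protein-protein similarity matrix. It is a minimal sketch on toy data, not the authors' pipeline: the function names, the ligand bit vectors, and the similarity values are all hypothetical, and in practice the similarity matrix would be derived from an MSA.

```python
import numpy as np

def one_hot_proteins(protein_ids):
    """Return the sorted distinct protein ids and a one-hot matrix (one column per protein)."""
    unique = sorted(set(protein_ids))
    index = {p: i for i, p in enumerate(unique)}
    encoded = np.zeros((len(protein_ids), len(unique)), dtype=float)
    for row, p in enumerate(protein_ids):
        encoded[row, index[p]] = 1.0
    return unique, encoded

def similarity_rows(protein_ids, unique, similarity):
    """For each ligand-protein pair, look up the protein's row in a similarity matrix."""
    index = {p: i for i, p in enumerate(unique)}
    return similarity[[index[p] for p in protein_ids]]

def pcm_features(ligand_desc, protein_desc):
    """Concatenate ligand and protein descriptors row by row."""
    return np.hstack([ligand_desc, protein_desc])

# Toy data (hypothetical): three ligand-protein pairs, 4-bit ligand descriptors.
ligand_desc = np.array([[1, 0, 1, 0],
                        [0, 1, 1, 0],
                        [1, 1, 0, 1]], dtype=float)
protein_ids = ["P2", "P1", "P2"]

unique, onehot = one_hot_proteins(protein_ids)

# Hypothetical protein-protein similarity matrix, ordered like `unique`.
similarity = np.array([[1.0, 0.3],
                       [0.3, 1.0]])
msa_desc = similarity_rows(protein_ids, unique, similarity)

x_onehot = pcm_features(ligand_desc, onehot)  # ligand bits + one-hot protein baseline
x_msa = pcm_features(ligand_desc, msa_desc)   # ligand bits + similarity-based descriptor
```

Either feature matrix could then be passed to any standard regressor or classifier. Because the ligand columns are identical in both, differences in performance would come from the protein representation alone.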

