07 Dec 2022

NeurIPS 2022

Authors: Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu


Electronic health records (EHRs) offer the opportunity for richer phenotype definition and more accurate risk prediction than bespoke cohorts collected for specific research purposes. Combining multiple structured data sources, such as primary and secondary care records, is crucial for understanding patient trajectories and disease severity. However, a key challenge lies in combining multiple ontologies. Current approaches rely on manually curated mappings between ontologies and are often prone to error.

In this paper, we unify ontologies using textual descriptors of concepts such as diagnostic codes. We fine-tune pretrained language models to denoise and identify misdiagnosed or undiagnosed individuals based on their medical history. We validate our approach using the UK Biobank, a large-scale biomedical database. We demonstrate that our method yields calibrated disease predictions for undiagnosed patients, in comparison with non-text and single-ontology approaches. Finally, we demonstrate empirically how our method can be used for cohort expansion, with an in-depth clinical evaluation for sex-specific diseases and for a Type II Diabetes Mellitus use case.
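As a rough illustration of what unifying ontologies through textual descriptors could look like, the sketch below replaces each code in a patient's history with its text description, so that events from different coding systems end up in one shared text sequence that a language model could consume. It is not the authors' implementation: the ontology names, the codes, the descriptions, and the helpers `describe` and `patient_history_text` are all hypothetical placeholders chosen for this example.

```python
from datetime import date
from typing import NamedTuple


class Event(NamedTuple):
    day: date
    ontology: str  # hypothetical ontology label, e.g. "primary_care"
    code: str      # hypothetical code within that ontology


# Hypothetical descriptor table keyed by (ontology, code).
# A real table would be loaded from the official ontology releases.
DESCRIPTIONS = {
    ("primary_care", "PC001"): "type 2 diabetes mellitus",
    ("hospital", "H100"): "type 2 diabetes mellitus without complications",
    ("hospital", "H200"): "essential hypertension",
}


def describe(event: Event) -> str:
    """Return the text descriptor for an event, or a fallback for unknown codes."""
    return DESCRIPTIONS.get((event.ontology, event.code), f"unknown code {event.code}")


def patient_history_text(events: list[Event], separator: str = "; ") -> str:
    """Order events by date and join their descriptors into one text sequence."""
    ordered = sorted(events, key=lambda e: e.day)
    return separator.join(describe(e) for e in ordered)


history = [
    Event(date(2015, 3, 2), "hospital", "H200"),
    Event(date(2012, 6, 9), "primary_care", "PC001"),
    Event(date(2016, 1, 20), "hospital", "H100"),
]
text = patient_history_text(history)
print(text)
```

Because both coding systems are mapped into natural language, no manually curated code-to-code mapping is needed in this sketch; the resulting string would then be tokenized and passed to a pretrained language model for fine-tuning.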
