sEHR-CE: Language modelling of structured EHR data for efficient and generalizable patient cohort expansion

07 Dec 2022

NeurIPS 2022


Authors: Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu

Abstract

Electronic health records (EHR) offer the opportunity for richer phenotype definition and more accurate risk prediction than bespoke cohorts collected for specific research purposes. Combining multiple structured data sources, such as primary and secondary care records, is crucial for understanding patient trajectories and disease severity. However, a key challenge lies in combining multiple ontologies. Current approaches rely on manually curated mappings between ontologies and are often prone to error.

In this paper, we unify ontologies using textual descriptors of concepts such as diagnostic codes. We fine-tune pretrained language models to denoise and identify misdiagnosed or undiagnosed individuals based on their medical history. We validate our approach using the UK Biobank, a large-scale biomedical database. We show that our method yields calibrated disease predictions for undiagnosed patients, compared with non-text and single-ontology approaches. Finally, we demonstrate empirically how our method can be used for cohort expansion, with an in-depth clinical evaluation for sex-specific diseases and for a Type II Diabetes Mellitus use case.
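To make the idea of unifying ontologies through textual descriptors concrete, here is a minimal sketch of how coded events from two sources might be turned into one text sequence. It is an illustration under stated assumptions, not the authors' implementation: the `Event` class, the `history_to_text` function, the `[SEP]` separator, and the primary care codes `PC001` and `PC002` are all hypothetical. Only the ICD-10 codes `E11` and `I10` are real codes.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative descriptor tables (assumption: real ontologies hold many
# thousands of entries; PC001/PC002 are made-up primary care codes).
DESCRIPTORS = {
    "secondary_care": {  # ICD-10
        "E11": "Type 2 diabetes mellitus",
        "I10": "Essential hypertension",
    },
    "primary_care": {  # hypothetical codes
        "PC001": "Type 2 diabetes mellitus",
        "PC002": "Essential hypertension",
    },
}


@dataclass(frozen=True)
class Event:
    """One coded entry in a patient's record (hypothetical structure)."""
    when: date
    source: str  # key into DESCRIPTORS
    code: str


def describe(event: Event) -> str:
    """Replace an ontology-specific code with its textual descriptor."""
    return DESCRIPTORS[event.source].get(event.code, "unknown concept")


def history_to_text(events: list[Event], sep: str = " [SEP] ") -> str:
    """Order events by date and join their descriptors into one string.

    Codes from different ontologies that share a descriptor become the
    same text, so no manual code-to-code mapping is needed.
    """
    ordered = sorted(events, key=lambda e: e.when)
    return sep.join(describe(e) for e in ordered)


events = [
    Event(date(2015, 3, 1), "secondary_care", "E11"),
    Event(date(2012, 6, 9), "primary_care", "PC002"),
    Event(date(2016, 1, 20), "primary_care", "PC001"),
]
print(history_to_text(events))
```

A string like this could then be tokenized and passed to a pretrained language model for fine-tuning. The choice of model, separator, and input formatting here is left open, since the abstract does not specify them.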

