sEHR-CE: Language modelling of structured EHR data for efficient and generalizable patient cohort expansion

07 Dec 2022

NeurIPS 2022

Authors: Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu

Abstract

Electronic health records (EHR) offer the opportunity for richer phenotype definition and more accurate risk prediction than bespoke cohorts collected for specific research purposes. Combining multiple structured data sources, such as primary and secondary care records, is crucial to understanding patient trajectories and severity for a given disease. However, a key challenge lies in combining multiple ontologies. Current approaches rely on manually curated mappings between ontologies and are often prone to error.

In this paper, we unify ontologies using textual descriptors of concepts such as diagnostic codes. We fine-tune pretrained language models to denoise and identify mis- or undiagnosed individuals based on their medical history. We validate our approach using the UK Biobank, a large-scale biomedical database. We show that our method yields calibrated disease predictions for undiagnosed patients compared to non-text and single-ontology approaches. Finally, we demonstrate empirically how our method can be used for cohort expansion, with an in-depth clinical evaluation for sex-specific diseases and for a Type II Diabetes Mellitus use case.
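
To make the idea of unifying ontologies through text concrete, the following is a minimal sketch, not the authors' implementation. It assumes a patient history is a list of (ontology, code) pairs and that each code has a textual descriptor. The `DESCRIPTORS` table, the `pc_*` primary care codes, the `history_to_text` helper, and the `[SEP]` separator are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Toy descriptor tables (illustrative entries only, not a real mapping file).
# The "pc_*" primary care codes are hypothetical placeholders.
DESCRIPTORS: Dict[str, Dict[str, str]] = {
    "secondary_care": {
        "E11": "type 2 diabetes mellitus",
        "I10": "essential (primary) hypertension",
    },
    "primary_care": {
        "pc_001": "type 2 diabetes mellitus",
        "pc_002": "essential hypertension",
    },
}


def history_to_text(history: List[Tuple[str, str]], sep: str = " [SEP] ") -> str:
    """Replace each (ontology, code) pair with its textual descriptor.

    Codes from different ontologies end up in one shared text vocabulary,
    so no manually curated code-to-code mapping is needed. Codes without
    a known descriptor are skipped in this sketch.
    """
    parts = []
    for ontology, code in history:
        description = DESCRIPTORS.get(ontology, {}).get(code)
        if description is None:
            continue  # unknown code: dropped here, an assumption of this sketch
        parts.append(description)
    return sep.join(parts)


# A patient with records from both sources, including one unknown code.
patient = [
    ("primary_care", "pc_002"),
    ("secondary_care", "E11"),
    ("primary_care", "pc_999"),
]
text = history_to_text(patient)
print(text)
```

The resulting string could then be passed to the tokenizer of a pretrained language model for fine-tuning. In this sketch, the same concept recorded in two different ontologies produces the same text, so the model would see it as the same input regardless of where it was recorded.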
