Authors: Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu
Electronic health records (EHRs) offer the opportunity for richer phenotype definition and more accurate risk prediction than bespoke cohorts collected for specific research purposes. Combining multiple structured data sources, such as primary and secondary care records, is crucial for understanding patient trajectories and disease severity. However, a key challenge lies in combining multiple ontologies. Current approaches rely on manually curated mappings between ontologies and are often prone to error.
In this paper, we unify ontologies using textual descriptors of concepts such as diagnostic codes. We fine-tune pretrained language models to denoise and identify mis- or undiagnosed individuals based on their medical history. We validate our approach using the UK Biobank, a large-scale biomedical database. We demonstrate that our method yields calibrated disease predictions for undiagnosed patients compared to non-text and single-ontology approaches. Finally, we demonstrate empirically how our method can be used for cohort expansion, with an in-depth clinical evaluation for sex-specific diseases and for a Type II Diabetes Mellitus use case.
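As a rough illustration of unifying ontologies through text, the sketch below replaces each coded event in a patient history with its textual descriptor, so that records from different coding systems end up in one shared text representation that a language model could consume. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `DESCRIPTORS` table, the `primary_care` ontology name, the `PC_...` codes, and the `history_to_text` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: (ontology, code) -> textual descriptor.
# The entries are illustrative only and are not an authoritative mapping;
# the "primary_care" codes are placeholders, not real codes.
DESCRIPTORS: Dict[Tuple[str, str], str] = {
    ("icd10", "E11"): "type 2 diabetes mellitus",
    ("icd10", "I10"): "essential (primary) hypertension",
    ("primary_care", "PC_0001"): "type 2 diabetes mellitus",
    ("primary_care", "PC_0002"): "essential hypertension",
}


def history_to_text(events: List[Tuple[str, str]], sep: str = " ; ") -> str:
    """Turn a list of (ontology, code) events into one text sequence.

    Codes without a known descriptor are rendered as "unknown concept"
    rather than dropped, so the length of the history is preserved.
    """
    texts = [DESCRIPTORS.get(event, "unknown concept") for event in events]
    return sep.join(texts)


# A history mixing primary care and hospital (ICD-10) events.
history = [("primary_care", "PC_0002"), ("icd10", "E11")]
print(history_to_text(history))
```

Because both ontologies are projected into the same descriptor text, no manually curated code-to-code mapping is needed in this sketch; how the resulting text is tokenised and fed to a fine-tuned model is left open here.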