
AI Shows Promise for Mapping Disease Progression

Science Highlight

November 1, 2024

By Elizabeth Ball
Contact: cscomms@lbl.gov


Berkeley Lab AMCR researchers found that biomedical large language models (LLMs) show some promise for organizing electronic health records. [Credit: Rayne Zaayman-Gallant/EMBL under Creative Commons license (CC-BY-NC-ND 4.0)]

Science Breakthrough

With help from NERSC’s Perlmutter supercomputer, researchers at the Berkeley Lab Applied Mathematics and Computational Research division (AMCR) and a student from the San Juan Bautista School of Medicine (SJBSM) have proposed a framework for investigating embeddings, the numerical representations of objects such as text that are designed to be consumed by AI models like large language models (LLMs). They also tested the hypothesis that LLMs might organize data from electronic medical records in a manner that helps characterize patients’ diseases and conditions and how they progress over time. Their work was released on medRxiv in July.
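To make the idea of an embedding concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy word-count embedding over a four-word vocabulary in place of a real LLM, and the example notes, the `embed` helper, and the vocabulary are all hypothetical. It shows only the general principle: texts become vectors, and similar texts end up with similar vectors.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding (an assumption, not an LLM): count each vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary and notes, invented for illustration only.
vocabulary = ["apnea", "snoring", "fracture", "wrist"]
note_a = embed("loud snoring and witnessed apnea", vocabulary)
note_b = embed("apnea episodes with snoring", vocabulary)
note_c = embed("wrist fracture after fall", vocabulary)

print(cosine_similarity(note_a, note_b))  # two sleep-related notes
print(cosine_similarity(note_a, note_c))  # sleep-related vs. unrelated note
```

A real biomedical LLM produces vectors with hundreds or thousands of learned dimensions rather than word counts, but the comparison step works the same way.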

Science Background

Electronic medical records can contain useful information on patients’ conditions and on the progression of disease, but the volume and complexity of the data, as well as issues like missing information and different modalities of data collection, can make the records difficult to organize in a way that captures key findings and the nuances of patient health. Through the MVP-CHAMPION partnership between the U.S. Department of Energy (DOE) and the U.S. Department of Veterans Affairs (VA), researchers at eight DOE labs are investigating ways in which high performance computing (HPC) might be applied to patient data to increase healthcare providers’ diagnostic power and improve patient outcomes. HPC is a critical tool for the MVP-CHAMPION project because the VA datasets, which include genomics and electronic health records, are vast.

As part of this project, lead LLM engineer Rafael Zamora-Resendiz and Berkeley Lab PI for the MVP-CHAMPION project Silvia Crivelli are developing a pair of LLMs pretrained from scratch on VA biomedical data. They plan to apply their LLMs to a number of projects aimed at improving health outcomes in areas such as obstructive sleep apnea (OSA), lung cancer, and suicide and overdose.

To help organize and interpret the massive, messy dataset, the team has researched natural language processing (NLP) techniques that can help glean key information from “unstructured” data such as doctors’ notes and discharge summaries. These tools have great potential, but understanding how they work and how they translate to meaningful clinical applications is an essential step before they can be applied usefully at scale.

For this paper, Zamora-Resendiz and Crivelli worked with SJBSM student Ifrah Khurram. The team used a publicly available dataset of electronic health records to see how different biomedical LLMs organized patient information. They focused on common patterns the LLMs learned and how the models characterized patients’ diseases and disease stages, in this case for OSA. Specifically, they observed how different LLMs organized hospital admission reports and found that LLMs with more parameters and better-curated training data organized patients by diagnosis more effectively. They also noted that the models tended to organize patients by time until death. This organizational scheme suggests that clinical LLMs may someday help detect the onset and progression of severe diseases.
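One simple way to ask whether an embedding space "organizes patients by diagnosis" is to check how often a report's nearest neighbor carries the same diagnosis label. The sketch below illustrates that general idea only. It is not the assessment used in the paper, and the two-dimensional points, the labels, and the function name `nearest_neighbor_agreement` are all made up for the example.

```python
import math


def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of points whose nearest other point shares the same label."""
    matches = 0
    for i, point in enumerate(embeddings):
        best_j, best_dist = None, math.inf
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # a point is never its own neighbor
            dist = math.dist(point, other)  # Euclidean distance
            if dist < best_dist:
                best_j, best_dist = j, dist
        if labels[best_j] == labels[i]:
            matches += 1
    return matches / len(embeddings)


# Hypothetical 2-D "embeddings" of admission reports with invented labels.
embeddings = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),  # a tight cluster labeled OSA
    (10.0, 10.0), (10.0, 11.0),          # a distant cluster labeled other
    (3.0, 0.0),                          # an "other" report near the OSA cluster
]
labels = ["OSA", "OSA", "OSA", "other", "other", "other"]

score = nearest_neighbor_agreement(embeddings, labels)
print(f"nearest-neighbor label agreement: {score:.2f}")
```

A score near 1.0 would mean reports with the same diagnosis sit close together. Comparing such a score across models is one plausible way to rank how well each one separates diagnoses.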

Science Breakdown

One of the researchers’ tasks was to improve data parallelism in preparation for running on Perlmutter’s GPUs, which sped up the LLMs’ processing of narratives from clinical text pertaining to 145,915 patients, about three gigabytes of data.
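Data parallelism, in its simplest form, means splitting the input into shards so that each worker (for example, each GPU) processes its own share independently. The following is a minimal sketch of that splitting step, not the team's actual pipeline. The function name `shard_documents`, the note identifiers, and the choice of four workers are assumptions made for illustration.

```python
def shard_documents(documents, num_workers):
    """Split documents into num_workers near-equal shards with a round-robin stride."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Worker `rank` takes items rank, rank + num_workers, rank + 2 * num_workers, ...
    return [documents[rank::num_workers] for rank in range(num_workers)]


# Hypothetical stand-ins for clinical notes.
notes = [f"note-{i}" for i in range(10)]
shards = shard_documents(notes, num_workers=4)

for rank, shard in enumerate(shards):
    print(rank, shard)
```

Each shard can then be fed to a separate copy of the model. Because no shard depends on another, the work scales with the number of workers as long as the shards stay balanced.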

Perlmutter’s GPUs allowed the team to run a suite of open-source biomedical/clinical LLMs including the University of Florida’s GatorTron, the Massachusetts Institute of Technology’s BioGPT, and Oak Ridge Leadership Computing Facility’s Forge, and to develop assessments that can be applied to other biomedical LLMs. In the future, they hope to use Perlmutter to assess larger commercial LLMs that are being fine-tuned for medical problems, like Meta’s Llama-3.

Going forward, the framework developed on Perlmutter will be tested on the models developed for the VA, which cover much longer clinical timelines. Zamora-Resendiz and Crivelli anticipate that the VA models trained from scratch on decades of patient clinical text will be able to learn a higher-resolution map of disease progression and will, as a result, outperform other models when fine-tuned. More importantly, they will contribute to the advancement of precision medicine.

Given the richness of the VA's longitudinal data (20+ years' worth), they expect that running the assessment with VA data would offer stronger evidence that LLMs can organize patients by time until death. However, scaling to the size of the VA data will be a challenge.

Research Lead

Rafael Zamora-Resendiz (Berkeley Lab, AMCR Division)

Co-authors

Ifrah Khurram (San Juan Bautista School of Medicine)

Silvia Crivelli (Berkeley Lab, AMCR Division)

Publication

Towards Maps of Disease Progression: Biomedical Large Language Model Latent Spaces For Representing Disease Phenotypes And Pseudotime. Rafael Zamora-Resendiz, Ifrah Khurram, Silvia Crivelli
medRxiv 2024.06.16.24308979; doi: https://doi.org/10.1101/2024.06.16.24308979

Funding

U.S. Department of Veterans Affairs via Interagency Agreement with DOE
Student Ifrah Khurram was supported by the Berkeley Lab Science Undergraduate Laboratory Internship program (SULI) and the Sustainable Research Pathways program (SRP). 

User Facilities

NERSC


About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.