Not the Models You Are Looking For: An Evaluation of Performance, Privacy, and Fairness of LLMs in EHR Tasks

We use a private dataset derived from Vanderbilt University Medical Center’s EHR to compare GPT-3.5, GPT-4, and traditional ML models, measuring predictive performance, output calibration, privacy-utility tradeoff, and algorithmic fairness. Traditional ML vastly outperformed GPT-3.5 and GPT-4 in both predictive performance and output probability calibration. We find that traditional ML is much more robust than GPT-3.5 and GPT-4 to efforts to generalize demographic information. Surprisingly, GPT-4 is the fairest model according to our selected metrics. These findings imply that additional research into LLMs is necessary before deploying them as clinical prediction models.

Learning Objectives

  • Identify the concerns associated with using large language models (LLMs) as clinical prediction models and demonstrate how to assess their performance and reliability in clinical applications.

Speakers

  • Katherine Brown, PhD (Vanderbilt University Medical Center)

Leveraging Open-Source Large-Language Model-Enabled Identification of Undiagnosed Patients with Rare Genetic Aortopathies

Hereditary aortopathies are often underdiagnosed, with many patients not receiving genetic testing until after a cardiac event. In this pilot study, we investigate the use of open-source LLMs for recommending genetic testing based on clinical notes. We evaluate the utility of injecting disease-specific knowledge into retrieval-augmented generation (RAG)-based and fine-tuned models. Our result of 93% accuracy using a base model alone surprisingly suggests that incorporating domain knowledge may sometimes hinder clinical model performance.

Learning Objectives

  • Identify the challenges involved in diagnosing rare genetic diseases. 
  • Explain how large language models (LLMs) can be used to screen patients for genetic aortopathies. 
  • Apply an open-source LLM-enabled recommender pipeline to make patient-level diagnostic predictions using clinical notes.

Speakers

  • Zilinghan Li, Master of Science (Argonne National Laboratory)

Identifying Opioid Overdose and Opioid Use Disorder and Related Information from Clinical Narratives Using Large Language Models

In this study, we evaluated the ability of ChatGPT-4o-mini to extract three social and environmental determinants of health (SEDoH) indicators (housing stability, substance use, and socio-economic status) from clinical notes against a manually annotated reference standard, showing extraction with moderate accuracy, precision, and recall. The model exhibited moderate performance in identifying “socio-economic status,” highlighting its potential for standardizing and integrating SEDoH data into healthcare systems.

Learning Objectives

  • Understand the critical public health implications of opioid overdose and opioid use disorder (OUD) in the United States. 
  • Identify opioid overdose, problematic opioid use, and other related concepts to facilitate studies countering the opioid crisis. 
  • Apply encoder-only large language models (LLMs) and decoder-based generative LLMs to extract opioid-related information from clinical notes, and understand the strengths and weaknesses of the two types of LLMs. 
  • Understand the cost-effective p-tuning algorithm for adapting encoder-based and decoder-based LLMs for patient information extraction.

Speakers

  • Daniel Paredes, MS (University of Florida)

Exploring ChatGPT 3.5 for structured data extraction from oncological notes

In large-scale clinical informatics, there is a need to maximize the amount of usable data from electronic health records. With the adoption of large language models in HIPAA-secure environments, there is potential to use them to extract structured data from unstructured clinical notes. We explored how ChatGPT 3.5 could be used to supplement data in cancer research. We assessed how GPT used clinical notes to answer six relevant clinical questions. Four prompt engineering strategies were used: zero-shot, zero-shot with context, few-shot, and few-shot with context. Few-shot prompting often decreased the accuracy of GPT outputs, and context did not consistently improve accuracy. GPT extracted patients’ Gleason scores and ages with an F1 score of 0.99, and it identified whether patients received palliative care and whether patients were in pain with an F1 score of 0.86. This approach has the potential to increase interoperability between healthcare and clinical research.
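The four prompting strategies named above differ only in which pieces are prepended to the note before the question. A minimal sketch of how such prompts might be assembled; the template wording, context sentence, and example cases here are hypothetical illustrations, not the study’s actual prompts:

```python
# Hypothetical sketch of the four prompting strategies (zero-shot,
# zero-shot with context, few-shot, few-shot with context).
# All template text and examples are illustrative, not the study's prompts.

QUESTION = "What is the patient's Gleason score?"
CONTEXT = "Gleason scores grade prostate cancer from 6 (low) to 10 (high)."
EXAMPLES = [
    ("Pathology: adenocarcinoma, Gleason 3+4=7.", "7"),
    ("Biopsy shows Gleason score 4+5=9 disease.", "9"),
]

def build_prompt(note: str, with_context: bool = False, few_shot: bool = False) -> str:
    """Assemble a prompt from optional context, optional worked examples, and the note."""
    parts = []
    if with_context:
        parts.append(f"Context: {CONTEXT}")
    if few_shot:
        # Few-shot: show solved note/question/answer triples before the real note.
        for ex_note, ex_answer in EXAMPLES:
            parts.append(f"Note: {ex_note}\n{QUESTION}\nAnswer: {ex_answer}")
    # The note under study, with the answer left for the model to complete.
    parts.append(f"Note: {note}\n{QUESTION}\nAnswer:")
    return "\n\n".join(parts)

note = "MRI-guided biopsy: Gleason 4+3=7 prostate adenocarcinoma."
zero_shot = build_prompt(note)
few_shot_with_context = build_prompt(note, with_context=True, few_shot=True)
```

The other two strategies fall out of the same function (`with_context=True` alone, or `few_shot=True` alone), which is why studies like this one can compare all four conditions while holding the note and question fixed.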

Learning Objectives

  • Recognize the potential and limitations of using ChatGPT to extract structured data from unstructured clinical notes.

Speakers

  • Ty Skyles, BS candidate (Brigham Young University)

Enhancing Disease Detection in Radiology Reports Through Fine-tuning Lightweight LLM on Weak Labels

Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application. Among these are constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning with datasets using synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model’s strong inherent capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
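The micro F1 scores quoted above pool true positives, false positives, and false negatives across every disease label before computing a single precision/recall/F1, so frequent labels weigh more than rare ones. A minimal stdlib sketch of the standard metric (the report labels are illustrative, not from the study’s data):

```python
# Micro-averaged F1 over multi-label predictions: pool TP/FP/FN across
# all labels and all reports, then compute one precision/recall/F1.

def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative per-report disease labels (not the study's data):
gold = [{"pneumonia"}, {"edema", "effusion"}, set()]
pred = [{"pneumonia"}, {"edema"}, {"effusion"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1 -> F1 = 2/3
```

Because pooling happens before the F1 is taken, a model can beat its noisy teacher on this metric (as the 0.67 vs. 0.63 comparison shows) by being right more often on the labels that dominate the pooled counts.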

Learning Objectives

  • Identify key limitations preventing the practical application of large language models (LLMs) in the medical domain, including model size constraints and lack of cohort-specific labeled datasets. 
  • Describe the process of fine-tuning lightweight LLMs, such as Llama 3.1-8B, using datasets with synthetic labels to improve task-specific performance. 
  • Evaluate the impact of synthetic label quality on model performance, comparing outcomes from high-quality sources (e.g., GPT-4o) and lower-quality datasets (e.g., MIMIC-CXR). 
  • Analyze the ability of fine-tuned Llama 3.1-8B to surpass noisy teacher labels, demonstrating the model’s inherent capability in disease detection and related tasks.

Speakers

  • Yishu Wei, PhD (Department of Population Health Sciences, Weill Cornell Medicine)

Predicting Antibiotic Resistance Patterns Using Sentence-BERT: A Machine Learning Approach

Antibiotic resistance poses a significant threat in inpatient settings, where it carries high mortality. Using MIMIC-III data, we generated Sentence-BERT embeddings from clinical notes and applied neural networks and XGBoost to predict antibiotic susceptibility. XGBoost achieved an average F1 score of 0.86, while neural networks scored 0.84. This study is among the first to use document embeddings to predict antibiotic resistance, offering a novel pathway for improving antimicrobial stewardship.

Learning Objectives

  • Explain how Sentence-BERT embeddings and machine learning models, including neural networks and XGBoost, can be applied to predict antibiotic resistance patterns using clinical documentation and microbiology data.

Speakers

  • Mahmoud Alwakeel, MD (Mahmoud Alwakeel, MD)

Continuing Education Credit

Physicians

The American Medical Informatics Association is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to provide continuing medical education for physicians.

The American Medical Informatics Association designates this online enduring material for 1.5 AMA PRA Category 1 Credits™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.

Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.

ACHIPs™

AMIA Health Informatics Certified Professionals™ (ACHIPs™) can earn 1 professional development unit (PDU) per contact hour.

ACHIPs™ may use CME/CNE certificates or the ACHIPs™ Recertification Log to report 2024 Symposium sessions attended for ACHIPs™ Recertification.

Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.

FAQs

All content was recorded live at AMIA’s Annual Symposium event November 9-13, 2024, in San Francisco, CA. Plan now to join us for the next Annual Symposium!

Yes! Purchase the AMIA 2024 Annual Symposium On Demand Bundle to enjoy all recorded sessions available at the best value. Get the bundle.

Purchase the AMIA 2024 Annual Symposium On Demand Bundle for the best value on all top 20 sessions. Additional individual sessions are also available for purchase in the catalog.

Claim credit no later than January 20, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after January 20, 2028.

Yes! AMIA 2024 Annual Symposium On Demand is available for anyone to purchase. Become an AMIA member before you purchase to receive exclusive member discounts. Join AMIA today.

We’re glad you asked! AMIA offers a variety of membership options, all with exclusive benefits and abundant networking opportunities. Choose the membership that’s right for you.

The Audio-only format of all 20 sessions is available free of charge exclusively to AMIA members. Access the AMIA 2024 Annual Symposium On Demand Audio Library. Log in required.

Join us at the next Annual Symposium and engage with leaders from across the health informatics field. Learn more.

Yes! You can claim Self-Study credit when you complete AMIA 2024 Annual Symposium On Demand sessions, in addition to claiming Live credit for attending the live event. View the full details on self-study accreditation for this product.

Yes, The AMIA 2024 Annual Symposium On Demand Bundle (Presenter, Slides, and Audio) may be purchased for 8 educational credits using your health system’s code at checkout. Individual sessions (Presenter, Slides, and Audio) may be purchased for 1 educational credit per session using your health system’s code at checkout.

Type: AMIA On Demand
Course Format(s): On Demand
Credits: 1.50 CME
Price: Member: $60, Nonmember: $85