Not the Models You Are Looking For: An Evaluation of Performance, Privacy, and Fairness of LLMs in EHR Tasks
Using a private dataset derived from Vanderbilt University Medical Center's EHR, we evaluate GPT-3.5, GPT-4, and traditional ML models, measuring predictive performance, output calibration, the privacy-utility tradeoff, and algorithmic fairness. Traditional ML vastly outperformed GPT-3.5 and GPT-4 in both predictive performance and output probability calibration. We find that traditional ML is also far more robust than GPT-3.5 and GPT-4 to efforts to generalize demographic information. Surprisingly, GPT-4 is the fairest model according to our selected metrics. These findings imply that additional research into LLMs is necessary before they are deployed as clinical prediction models.
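The abstract names calibration and fairness metrics without specifying them. Below is a minimal, hypothetical sketch (not the study's code) of how such metrics are commonly computed, assuming placeholder arrays y_true, y_prob, and a binary demographic group indicator:

```python
# Illustrative sketch only: computes the kinds of metrics the abstract names
# (discrimination, calibration, group fairness). All arrays are hypothetical
# placeholders, not data from the study.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])                  # hypothetical outcomes
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1, 0.8, 0.3])  # model probabilities
group  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                  # hypothetical demographic group

auroc = roc_auc_score(y_true, y_prob)     # predictive performance
brier = brier_score_loss(y_true, y_prob)  # output probability calibration

# Demographic parity difference: gap in positive prediction rates between groups.
y_pred = (y_prob >= 0.5).astype(int)
dpd = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

print(f"AUROC={auroc:.3f}  Brier={brier:.3f}  demographic parity diff={dpd:.3f}")
```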
Learning Objectives
- Identify the concerns associated with using large language models (LLMs) as clinical prediction models and demonstrate how to assess their performance and reliability in clinical applications.
Speakers
- Katherine Brown, PhD (Vanderbilt University Medical Center)
Leveraging Open-Source Large-Language Model-Enabled Identification of Undiagnosed Patients with Rare Genetic Aortopathies
Hereditary aortopathies are often underdiagnosed, with many patients not receiving genetic testing until after a cardiac event. In this pilot study, we investigate the use of open-source LLMs for recommending genetic testing based on clinical notes. We evaluate the utility of injecting disease-specific knowledge into retrieval-augmented generation (RAG)-based and fine-tuned models. Our result of 93% accuracy using a base model alone surprisingly suggests that incorporating domain knowledge may sometimes hinder clinical model performance.
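As a rough illustration of the RAG-style setup described above (not the study's pipeline), the sketch below retrieves a disease-specific knowledge snippet for a clinical note and assembles a prompt for an open-source LLM. The knowledge strings, the note, and the commented-out generate() call are all hypothetical:

```python
# Minimal RAG-style sketch: retrieve the most relevant knowledge snippet for a
# note, then build a yes/no prompt for an (unspecified) open-source LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [  # hypothetical disease-specific knowledge snippets
    "Marfan syndrome: FBN1 variants; aortic root dilation, ectopia lentis.",
    "Loeys-Dietz syndrome: TGFBR1/2 variants; arterial tortuosity, bifid uvula.",
]
note = "Tall patient with aortic root dilation and lens dislocation."

vec = TfidfVectorizer().fit(knowledge + [note])
sims = cosine_similarity(vec.transform([note]), vec.transform(knowledge))[0]
context = knowledge[sims.argmax()]  # top-1 retrieved snippet

prompt = (
    f"Context: {context}\n"
    f"Clinical note: {note}\n"
    "Should this patient be referred for genetic aortopathy testing? Answer yes or no."
)
# response = llm.generate(prompt)  # some local open-source model; not specified here
print(prompt)
```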
Learning Objectives
- Identify the challenges involved in diagnosing rare genetic diseases.
- Explain how large language models (LLMs) can be used to screen patients for genetic aortopathies.
- Apply an open-source LLM-enabled recommender pipeline to make patient-level diagnostic predictions using clinical notes.
Speakers
- Zilinghan Li, Master of Science (Argonne National Laboratory)
Identifying Opioid Overdose and Opioid Use Disorder and Related Information from Clinical Narratives Using Large Language Models
In this study, we evaluated the ability of ChatGPT-4o-mini to extract three social and environmental determinants of health (SEDoH) indicators (housing stability, substance use, and socio-economic status) from clinical notes against a manually annotated reference standard, showing extraction with moderate accuracy, precision, and recall. The model exhibited moderate performance in identifying "socio-economic status", highlighting its potential for use in standardizing and integrating SEDoH data into healthcare systems.
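A hypothetical sketch of this kind of note-level extraction using the OpenAI Python client follows; the prompt wording, JSON keys, and example note are illustrative assumptions, not the study's actual annotation instructions:

```python
# Hypothetical SEDoH extraction sketch; not the study's prompts or data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

note = "Patient lives in a shelter, reports daily alcohol use, currently unemployed."
prompt = (
    "From the clinical note, extract these SEDoH indicators as JSON with keys "
    '"housing_stability", "substance_use", and "socio_economic_status":\n'
    f"Note: {note}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # expected to be a small JSON object
```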
Learning Objectives
- Understand the critical public health implications of opioid overdose and opioid use disorder (OUD) in the United States.
- Identify opioid overdose, problematic opioid use, and other related concepts to facilitate studies countering the opioid crisis.
- Apply encoder-only large language models (LLMs) and decoder-based generative LLMs to extract opioid-related information from clinical notes, and understand the strengths and weaknesses of the two types of LLMs.
- Understand the cost-effective p-tuning algorithm for adapting encoder-based and decoder-based LLMs to patient information extraction (a minimal sketch follows this list).
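Referenced from the last objective above: a hedged sketch of p-tuning using Hugging Face PEFT, whose PromptEncoderConfig implements p-tuning. The base checkpoint and binary label setup are illustrative assumptions, not the session's configuration:

```python
# P-tuning sketch with Hugging Face PEFT: only a small prompt encoder is
# trained while the base model stays frozen, which is what makes it cheap.
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # hypothetical: overdose vs. no overdose
)
config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # shows the tiny trainable fraction
```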
Speakers
- Daniel Paredes, MS (University of Florida)
Exploring ChatGPT 3.5 for structured data extraction from oncological notes
In large-scale clinical informatics, there is a need to maximize the amount of usable data from electronic health records. With the adoption of large language models in HIPAA-secure environments, there is potential to use them to extract structured data from unstructured clinical notes. We explored how ChatGPT 3.5 could be used to supplement data in cancer research, assessing how GPT used clinical notes to answer six relevant clinical questions. Four prompt engineering strategies were used: zero-shot, zero-shot with context, few-shot, and few-shot with context. Few-shot prompting often decreased the accuracy of GPT outputs, and context did not consistently improve accuracy. GPT extracted patients' Gleason scores and ages with an F1 score of 0.99, and it identified whether patients received palliative care and whether patients were in pain with an F1 score of 0.86. This approach has the potential to increase interoperability between healthcare and clinical research.
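For concreteness, here is an illustrative construction of the four prompting strategies the abstract compares; the question, note, context, and few-shot exemplar are invented placeholders, not the study's prompts:

```python
# Hypothetical assembly of the four prompting strategies named in the abstract.
question = "What is the patient's Gleason score?"
note = "Prostate biopsy: Gleason 3+4=7 adenocarcinoma."
context = "Gleason scores grade prostate cancer from 6 (low) to 10 (high)."
examples = 'Note: "Gleason 4+4=8." -> Answer: 8'

prompts = {
    "zero-shot":         f"{question}\nNote: {note}",
    "zero-shot+context": f"{context}\n{question}\nNote: {note}",
    "few-shot":          f"{examples}\n{question}\nNote: {note}",
    "few-shot+context":  f"{context}\n{examples}\n{question}\nNote: {note}",
}
for name, p in prompts.items():
    print(f"--- {name} ---\n{p}\n")
```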
Learning Objectives
- Recognize the potential and limitations of using ChatGPT to extract structured data from unstructured clinical notes.
Speakers
- Ty Skyles, BS candidate (Brigham Young University)
Enhancing Disease Detection in Radiology Reports Through Fine-tuning Lightweight LLM on Weak Labels
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application, among them constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning on datasets with synthetic labels. Two tasks were jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model's strong inherent capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
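A hedged sketch of the general recipe described above, using Hugging Face TRL's SFTTrainer: the two toy instruction examples, checkpoint name, and hyperparameters are assumptions rather than the study's configuration, and TRL API details vary by version:

```python
# Sketch of instruction fine-tuning on synthetic (teacher-generated) labels.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction data: report text paired with weak disease labels
# (e.g., from GPT-4o or MIMIC-CXR-derived labelers).
data = Dataset.from_list([
    {"text": "Report: Bibasilar opacities.\nDiseases: pneumonia"},
    {"text": "Report: Clear lungs, normal heart size.\nDiseases: none"},
])

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # requires access to this checkpoint
    train_dataset=data,
    args=SFTConfig(output_dir="llama31-weak-labels", max_steps=10),
)
trainer.train()
```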
Learning Objectives
- Identify key limitations preventing the practical application of large language models (LLMs) in the medical domain, including model size constraints and lack of cohort-specific labeled datasets.
- Describe the process of fine-tuning lightweight LLMs, such as Llama 3.1-8B, using datasets with synthetic labels to improve task-specific performance.
- Evaluate the impact of synthetic label quality on model performance, comparing outcomes from high-quality sources (e.g., GPT-4o) and lower-quality datasets (e.g., MIMIC-CXR).
- Analyze the ability of fine-tuned Llama 3.1-8B to surpass noisy teacher labels, demonstrating the model’s inherent capability in disease detection and related tasks.
Speakers
- Yishu Wei, PhD (Department of Population Health Sciences, Weill Cornell Medicine)
Predicting Antibiotic Resistance Patterns Using Sentence-BERT: A Machine Learning Approach
Antibiotic resistance poses a significant threat in inpatient settings, where it carries high mortality. Using MIMIC-III data, we generated Sentence-BERT embeddings from clinical notes and applied neural networks and XGBoost to predict antibiotic susceptibility. XGBoost achieved an average F1 score of 0.86, while neural networks scored 0.84. This study is among the first to use document embeddings to predict antibiotic resistance, offering a novel pathway for improving antimicrobial stewardship.
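A minimal sketch of the described pipeline, assuming hypothetical notes and susceptibility labels and a common Sentence-BERT checkpoint; the study's actual features and models are not reproduced here:

```python
# Sketch: Sentence-BERT document embeddings -> XGBoost susceptibility classifier.
from sentence_transformers import SentenceTransformer
from xgboost import XGBClassifier

notes = [  # hypothetical clinical notes
    "Blood culture positive for E. coli; patient on ceftriaxone.",
    "Sputum culture grows MRSA; vancomycin started.",
]
labels = [1, 0]  # hypothetical: 1 = susceptible, 0 = resistant

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a common SBERT checkpoint
X = encoder.encode(notes)                          # one embedding per note

clf = XGBClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(encoder.encode(["Urine culture grows E. coli."])))
```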
Learning Objectives
- Explain how Sentence-BERT embeddings and machine learning models, including neural networks and XGBoost, can be applied to predict antibiotic resistance patterns using clinical documentation and microbiology data.
Speakers
- Mahmoud Alwakeel, MD
Continuing Education Credit
Physicians
The American Medical Informatics Association is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to provide continuing medical education for physicians.
The American Medical Informatics Association designates this online enduring material for 1.5 AMA PRA Category 1 Credits™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.
ACHIPs™
AMIA Health Informatics Certified Professionals™ (ACHIPs™) can earn 1 professional development unit (PDU) per contact hour.
ACHIPs™ may use CME/CNE certificates or the ACHIPs™ Recertification Log to report 2024 Symposium sessions attended for ACHIPs™ Recertification.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.