Not the Models You Are Looking For: An Evaluation of Performance, Privacy, and Fairness of LLMs in EHR Tasks
Using a private dataset derived from Vanderbilt University Medical Center's EHR, we evaluate GPT-3.5, GPT-4, and traditional ML models, measuring predictive performance, output calibration, the privacy-utility tradeoff, and algorithmic fairness. Traditional ML vastly outperformed GPT-3.5 and GPT-4 in both predictive performance and output probability calibration. We find that traditional ML is also far more robust than GPT-3.5 and GPT-4 to efforts to generalize demographic information. Surprisingly, GPT-4 is the fairest model according to our selected metrics. These findings imply that additional research into LLMs is necessary before they are deployed as clinical prediction models.
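The abstract names calibration and fairness metrics without specifying them. Below is a minimal, hypothetical sketch (not the study's code) of how such metrics are commonly computed, assuming placeholder arrays y_true, y_prob, and a binary demographic group indicator:

```python
# Illustrative sketch only: computes the kinds of metrics the abstract names
# (discrimination, calibration, group fairness). All arrays are hypothetical
# placeholders, not data from the study.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])                  # hypothetical outcomes
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1, 0.8, 0.3])  # model probabilities
group  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                  # hypothetical demographic group

auroc = roc_auc_score(y_true, y_prob)     # predictive performance
brier = brier_score_loss(y_true, y_prob)  # output probability calibration

# Demographic parity difference: gap in positive prediction rates between groups.
y_pred = (y_prob >= 0.5).astype(int)
dpd = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

print(f"AUROC={auroc:.3f}  Brier={brier:.3f}  demographic parity diff={dpd:.3f}")
```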
Learning Objectives
- Identify the concerns associated with using large language models (LLMs) as clinical prediction models and demonstrate how to assess their performance and reliability in clinical applications.
Speakers
- Katherine Brown, PhD (Vanderbilt University Medical Center)
Leveraging Open-Source Large-Language Model-Enabled Identification of Undiagnosed Patients with Rare Genetic Aortopathies
Hereditary aortopathies are often underdiagnosed, with many patients not receiving genetic testing until after a cardiac event. In this pilot study, we investigate the use of open-source LLMs for recommending genetic testing based on clinical notes. We evaluate the utility of injecting disease-specific knowledge into retrieval-augmented generation (RAG)-based and fine-tuned models. Our result of 93% accuracy using a base model alone surprisingly suggests that incorporating domain knowledge may sometimes hinder clinical model performance.
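As a rough illustration of the RAG-style setup described above (not the study's pipeline), the sketch below retrieves a disease-specific knowledge snippet for a clinical note and assembles a prompt for an open-source LLM. The knowledge strings, the note, and the commented-out generate() call are all hypothetical:

```python
# Minimal RAG-style sketch: retrieve the most relevant knowledge snippet for a
# note, then build a yes/no prompt for an (unspecified) open-source LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [  # hypothetical disease-specific knowledge snippets
    "Marfan syndrome: FBN1 variants; aortic root dilation, ectopia lentis.",
    "Loeys-Dietz syndrome: TGFBR1/2 variants; arterial tortuosity, bifid uvula.",
]
note = "Tall patient with aortic root dilation and lens dislocation."

vec = TfidfVectorizer().fit(knowledge + [note])
sims = cosine_similarity(vec.transform([note]), vec.transform(knowledge))[0]
context = knowledge[sims.argmax()]  # top-1 retrieved snippet

prompt = (
    f"Context: {context}\n"
    f"Clinical note: {note}\n"
    "Should this patient be referred for genetic aortopathy testing? Answer yes or no."
)
# response = llm.generate(prompt)  # some local open-source model; not specified here
print(prompt)
```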
Learning Objectives
- Identify the challenges involved in diagnosing rare genetic diseases.
- Explain how large language models (LLMs) can be used to screen patients for genetic aortopathies.
- Apply an open-source LLM-enabled recommender pipeline to make patient-level diagnostic predictions using clinical notes.
Speakers
- Zilinghan Li, Master of Science (Argonne National Laboratory)
Identifying Opioid Overdose and Opioid Use Disorder and Related Information from Clinical Narratives Using Large Language Models
In this study, we evaluated the ability of ChatGPT-4o-mini to extract three social and environmental determinants of health (SEDoH) indicators (housing stability, substance use, and socio-economic status) from clinical notes against a manually annotated reference standard, showing extraction with moderate accuracy, precision, and recall. The model exhibited moderate performance in identifying "socio-economic status", highlighting its potential for use in standardizing and integrating SEDoH data into healthcare systems.
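A hypothetical sketch of this kind of note-level extraction using the OpenAI Python client follows; the prompt wording, JSON keys, and example note are illustrative assumptions, not the study's actual annotation instructions:

```python
# Hypothetical SEDoH extraction sketch; not the study's prompts or data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

note = "Patient lives in a shelter, reports daily alcohol use, currently unemployed."
prompt = (
    "From the clinical note, extract these SEDoH indicators as JSON with keys "
    '"housing_stability", "substance_use", and "socio_economic_status":\n'
    f"Note: {note}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # expected to be a small JSON object
```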
Learning Objectives
- Understand the critical public health implications of opioid overdose and opioid use disorder (OUD) in the United States.
- Identify opioid overdose, problematic opioid use, and other related concepts to facilitate studies countering the opioid crisis.
- Apply encoder-only large language models (LLMs) and decoder-based generative LLMs to extract opioid-related information from clinical notes, and understand the strengths and weaknesses of the two types of LLMs.
- Understand the cost-effective p-tuning algorithm for adapting encoder-based and decoder-based LLMs to patient information extraction (a minimal sketch follows this list).
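Referenced from the last objective above: a hedged sketch of p-tuning using Hugging Face PEFT, whose PromptEncoderConfig implements p-tuning. The base checkpoint and binary label setup are illustrative assumptions, not the session's configuration:

```python
# P-tuning sketch with Hugging Face PEFT: only a small prompt encoder is
# trained while the base model stays frozen, which is what makes it cheap.
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # hypothetical: overdose vs. no overdose
)
config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # shows the tiny trainable fraction
```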
Speakers
- Daniel Paredes, MS (University of Florida)
Exploring ChatGPT 3.5 for structured data extraction from oncological notes
In large-scale clinical informatics, there is a need to maximize the amount of usable data from electronic health records. With the adoption of large language models in HIPAA-secure environments, there is potential to use them to extract structured data from unstructured clinical notes. We explored how ChatGPT 3.5 could be used to supplement data in cancer research, assessing how GPT used clinical notes to answer six relevant clinical questions. Four prompt engineering strategies were used: zero-shot, zero-shot with context, few-shot, and few-shot with context. Few-shot prompting often decreased the accuracy of GPT outputs, and context did not consistently improve accuracy. GPT extracted patients' Gleason scores and ages with an F1 score of 0.99, and it identified whether patients received palliative care and whether patients were in pain with an F1 score of 0.86. This approach has the potential to increase interoperability between healthcare and clinical research.
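For concreteness, here is an illustrative construction of the four prompting strategies the abstract compares; the question, note, context, and few-shot exemplar are invented placeholders, not the study's prompts:

```python
# Hypothetical assembly of the four prompting strategies named in the abstract.
question = "What is the patient's Gleason score?"
note = "Prostate biopsy: Gleason 3+4=7 adenocarcinoma."
context = "Gleason scores grade prostate cancer from 6 (low) to 10 (high)."
examples = 'Note: "Gleason 4+4=8." -> Answer: 8'

prompts = {
    "zero-shot":         f"{question}\nNote: {note}",
    "zero-shot+context": f"{context}\n{question}\nNote: {note}",
    "few-shot":          f"{examples}\n{question}\nNote: {note}",
    "few-shot+context":  f"{context}\n{examples}\n{question}\nNote: {note}",
}
for name, p in prompts.items():
    print(f"--- {name} ---\n{p}\n")
```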
Learning Objectives
- Recognize the potential and limitations of using ChatGPT to extract structured data from unstructured clinical notes.
Speakers
- Ty Skyles, BS candidate (Brigham Young University)
Enhancing Disease Detection in Radiology Reports Through Fine-tuning Lightweight LLM on Weak Labels
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application, among them constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning on datasets with synthetic labels. Two tasks were jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the model's strong inherent capability. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
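A hedged sketch of the general recipe described above, using Hugging Face TRL's SFTTrainer: the two toy instruction examples, checkpoint name, and hyperparameters are assumptions rather than the study's configuration, and TRL API details vary by version:

```python
# Sketch of instruction fine-tuning on synthetic (teacher-generated) labels.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction data: report text paired with weak disease labels
# (e.g., from GPT-4o or MIMIC-CXR-derived labelers).
data = Dataset.from_list([
    {"text": "Report: Bibasilar opacities.\nDiseases: pneumonia"},
    {"text": "Report: Clear lungs, normal heart size.\nDiseases: none"},
])

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # requires access to this checkpoint
    train_dataset=data,
    args=SFTConfig(output_dir="llama31-weak-labels", max_steps=10),
)
trainer.train()
```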
Learning Objectives
- Identify key limitations preventing the practical application of large language models (LLMs) in the medical domain, including model size constraints and lack of cohort-specific labeled datasets.
- Describe the process of fine-tuning lightweight LLMs, such as Llama 3.1-8B, using datasets with synthetic labels to improve task-specific performance.
- Evaluate the impact of synthetic label quality on model performance, comparing outcomes from high-quality sources (e.g., GPT-4o) and lower-quality datasets (e.g., MIMIC-CXR).
- Analyze the ability of fine-tuned Llama 3.1-8B to surpass noisy teacher labels, demonstrating the model’s inherent capability in disease detection and related tasks.
Speakers
- Yishu Wei, PhD (Department of Population Health Sciences, Weill Cornell Medicine)
Predicting Antibiotic Resistance Patterns Using Sentence-BERT: A Machine Learning Approach
Antibiotic resistance poses a significant threat in inpatient settings, where it carries high mortality. Using MIMIC-III data, we generated Sentence-BERT embeddings from clinical notes and applied neural networks and XGBoost to predict antibiotic susceptibility. XGBoost achieved an average F1 score of 0.86, while neural networks scored 0.84. This study is among the first to use document embeddings to predict antibiotic resistance, offering a novel pathway for improving antimicrobial stewardship.
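A minimal sketch of the described pipeline, assuming hypothetical notes and susceptibility labels and a common Sentence-BERT checkpoint; the study's actual features and models are not reproduced here:

```python
# Sketch: Sentence-BERT document embeddings -> XGBoost susceptibility classifier.
from sentence_transformers import SentenceTransformer
from xgboost import XGBClassifier

notes = [  # hypothetical clinical notes
    "Blood culture positive for E. coli; patient on ceftriaxone.",
    "Sputum culture grows MRSA; vancomycin started.",
]
labels = [1, 0]  # hypothetical: 1 = susceptible, 0 = resistant

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a common SBERT checkpoint
X = encoder.encode(notes)                          # one embedding per note

clf = XGBClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(encoder.encode(["Urine culture grows E. coli."])))
```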
Learning Objectives
- Explain how Sentence-BERT embeddings and machine learning models, including neural networks and XGBoost, can be applied to predict antibiotic resistance patterns using clinical documentation and microbiology data.
Speakers
- Mahmoud Alwakeel, MD
Continuing Education Credit
Physicians
The American Medical Informatics Association is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to provide continuing medical education for physicians.
The American Medical Informatics Association designates this online enduring material for 1.5 AMA PRA Category 1 Credits™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.
ACHIPs™
AMIA Health Informatics Certified Professionals™ (ACHIPs™) can earn 1 professional development unit (PDU) per contact hour.
ACHIPs™ may use CME/CNE certificates or the ACHIPs™ Recertification Log to report 2024 Symposium sessions attended for ACHIPs™ Recertification.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.