Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find that this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM
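The ensembling principle mentioned in the abstract can be illustrated with a minimal sketch. This is not FREEFORM's actual implementation (see the repository for that). It assumes that several independent LLM runs have each proposed a list of variant features, and it combines them by simple majority vote. The function name `ensemble_select` and the variant names are hypothetical placeholders.

```python
from collections import Counter


def ensemble_select(candidate_lists, k):
    """Return the k features proposed most often across candidate selections.

    candidate_lists: one feature list per (hypothetical) LLM selection run.
    Voting scheme is an assumption for illustration, not FREEFORM's method.
    """
    votes = Counter()
    for features in candidate_lists:
        # Each run contributes at most one vote per feature.
        votes.update(set(features))
    # Rank by vote count (descending), then by name so ties are deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Placeholder outputs of three selection runs (variant names are made up).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_b", "snp_e"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Features that recur across runs are kept, while features proposed only once are dropped, which is one simple way to reduce the variability of any single selection.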
Learning Objectives
- Analyze the challenges associated with applying data-driven approaches to high-dimensional genotype data.
- Evaluate the effectiveness of advanced feature selection and engineering techniques informed by the latest developments in large language models (LLMs).
- Compare conventional data-driven methods with LLM-based knowledge-driven approaches for reducing genetic features and mitigating overfitting.
- Apply a novel knowledge-driven framework that leverages chain-of-thought reasoning and ensembling principles to enhance genetic feature selection and improve phenotype prediction with limited data.
Speaker
- Joseph Lee, Bachelor of Science in Networked and Social Systems Engineering (University of Pennsylvania)
Inter-tissue coordination patterns of metabolic transcriptomes
Understanding inter-organ communication across the entire body is crucial for comprehending health and disease. We present a computational approach that defines inter-tissue communication and a general coordination pattern of metabolic transcriptomes at a whole-body scale, applied to 19 human tissues and validated using external datasets. We reveal known and novel inter-tissue metabolic links and a significant global coregulation pattern. Our framework may apply to other types of transcriptomes and may be used to detect changes across different conditions.
Learning Objectives
- Understand that metabolic transcriptomes are positively coordinated, highly connected, and form a significantly large community.
Speaker
- Judith Somekh, PhD (University of Haifa)
Evolution of Genomic Indicators for Pharmacogenomics: Retrospective Analysis and Implications for Knowledge Management
Pharmacogenomics (PGx) incorporates patient genetic data into pharmacotherapy guidelines to improve patient outcomes. Clinical decision support (CDS) systems rely on underlying knowledge bases, information models, and encoded rule logic to implement clinical guidelines. However, changes in PGx knowledge and result reporting standards necessitate continual maintenance of CDS rule logic and data reporting in electronic health records (EHRs). We reviewed over 12 years of PGx CDS implementation at Mayo Clinic, identifying three different methods of recording patient PGx data in multiple EHRs. Prior to enterprise-wide EHR convergence, each Mayo Clinic site followed task force-developed gene-drug guidelines to develop rules for annotating gene-phenotype data within patient allergy and problem lists. These annotations frequently lacked discrete genotype or provenance data, precluding detailed tracking of changes in each system. After EHR convergence, all Mayo Clinic sites used Genomic Indicator (GI) profiles (N=158) within an EHR module specifically designed to capture gene-phenotype information. Several post-implementation modification events incorporated new PGx knowledge, including adding new gene-drug indicator sets, updating genotype-phenotype specifications, and assigning haplotype enzyme activity score data for quantitative phenotypes. The incorporation of phenotype results from a large multi-gene panel resulted in the creation of 29 test-specific indicators, 12 of which were later removed or merged with previously established GIs due to the use of non-standardized nomenclature and classifications. Our results demonstrate the limitations of using pre-coordinated terms for complex and evolving knowledge and suggest the need for a robust knowledge model and standardized nomenclature to provide adequate data provenance and support genomic medicine at scale.
Learning Objectives
- Describe the role of pharmacogenomics (PGx) in integrating patient genetic data into pharmacotherapy guidelines to improve patient outcomes.
- Describe 3 types of events that can impact the design of genomic indicators.
- Identify 3-5 design decisions to consider when creating genomic indicators, which may result in more stable implementations.
Speaker
- Robert Freimuth, PhD (Mayo Clinic)
Continuing Education Credit
Physicians
The American Medical Informatics Association is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to provide continuing medical education for physicians.
The American Medical Informatics Association designates this online enduring material for 1.0 AMA PRA Category 1™ credits. Physicians should claim only the credit commensurate with the extent of their participation in the activity.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.
ACHIPs™
AMIA Health Informatics Certified Professionals™ (ACHIPs™) can earn 1 professional development unit (PDU) per contact hour.
ACHIPs™ may use CME/CNE certificates or the ACHIPs™ Recertification Log to report 2024 Symposium sessions attended for ACHIPs™ Recertification.
Claim credit no later than March 10, 2028 or within two years of your purchase date, whichever is sooner. No credit will be issued after March 10, 2028.