Improving model transferability for clinical note section classification models using continued pretraining
Statement of Purpose
Section classification refers to the task of assigning section names to clinical note segments. It is typically a preprocessing step that benefits downstream NLP tasks, such as named entity recognition and cohort discovery, for two key reasons. First, state-of-the-art information extraction models such as transformer encoders (e.g., BERT) have fixed input length limits, while clinical notes are often longer than those limits allow. Truncating irrelevant parts of a note using section name metadata can help the remaining text fit within the model's input. Second, specific clinical information may appear only in particular sections of a note; for example, social determinants of health are typically documented in the social history section. Identifying these sections through section classification helps locate such information.
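As a concrete illustration of the first point, the minimal sketch below keeps only task-relevant sections of a note before passing the text to a length-limited encoder. The section names, note content, and relevance set are invented for illustration, not taken from the study:

```python
# Hypothetical example: keep only the sections relevant to a downstream task
# so the remaining text fits within the encoder's input length limit.
RELEVANT_SECTIONS = {"social history", "assessment"}

def truncate_by_section(note_segments):
    """note_segments: list of (section_name, text) pairs parsed upstream."""
    kept = [text for name, text in note_segments
            if name.lower() in RELEVANT_SECTIONS]
    return "\n".join(kept)

note = [
    ("Chief Complaint", "Shortness of breath for two days."),
    ("Social History", "Patient lives alone; former smoker, quit 2015."),
    ("Medications", "Albuterol inhaler PRN."),
    ("Assessment", "Likely COPD exacerbation."),
]
print(truncate_by_section(note))
```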
Our study explored section classification methods with enhanced generalizability across healthcare institutions. Different institutions often use distinct note types and section names. To investigate the transferability of section classification models across institutions, we mapped section names into SOAP (Subjective, Objective, Assessment, and Plan) categories, ensuring a consistent label space across datasets. We then enhanced model transferability by applying continued pretraining, which further pretrains a model on an unannotated corpus of texts. We also measured the impact of adding annotated target-domain data on model performance, and quantified the value of continued pretraining in units of annotated target-domain samples.
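The mapping into a shared label space can be as simple as a lookup table. The following sketch normalizes institution-specific section names into SOAP categories; the name-to-category pairs here are invented examples, not the study's actual mapping:

```python
# Illustrative mapping from institution-specific section names to the
# shared SOAP label space (example entries only).
SOAP_MAP = {
    "chief complaint": "Subjective",
    "history of present illness": "Subjective",
    "vital signs": "Objective",
    "laboratory results": "Objective",
    "impression": "Assessment",
    "plan of care": "Plan",
}

def to_soap(section_name: str) -> str:
    """Return the SOAP category for a raw section name, or 'Unknown'."""
    return SOAP_MAP.get(section_name.strip().lower(), "Unknown")

assert to_soap("Vital Signs") == "Objective"
```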
Learning Objectives
- Understanding the SOAP (Subjective, Objective, Assessment, and Plan) note section categories and how they can be used to study the performance of section classification models across healthcare institutions
- Understanding how continued pretraining on unlabeled target-domain notes can improve pretrained language models such as BERT (see the first sketch after this list)
- Measuring the value of continued pretraining in units of the number of annotated target-domain samples (see the second sketch after this list)
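Concretely, continued pretraining can be run as domain-adaptive masked-language-model training on the unlabeled target-domain notes. Below is a minimal sketch using the Hugging Face Transformers library; the model name, file path, and hyperparameters are placeholders rather than the study's actual configuration:

```python
# Minimal sketch of continued pretraining (masked-language-model training
# on unlabeled target-domain notes) with Hugging Face Transformers.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabeled target-domain notes, one note per line (hypothetical file).
with open("target_domain_notes.txt") as f:
    notes = [line.strip() for line in f if line.strip()]

encodings = tokenizer(notes, truncation=True, max_length=512)

class NotesDataset(torch.utils.data.Dataset):
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

# The collator pads batches and applies random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-continued",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=NotesDataset(encodings),
    data_collator=collator,
)
trainer.train()
# The adapted encoder is then fine-tuned on the section classification task.
```

The third objective can be read as a sample-equivalence question: how many annotated target-domain samples would a baseline model (without continued pretraining) need to match the continued-pretrained model's score? A hedged sketch of that measurement follows; `evaluate_baseline` is a hypothetical helper that fine-tunes on n labeled samples and returns a performance score:

```python
# Express continued pretraining's benefit in units of annotated samples:
# find the smallest annotation budget at which the baseline catches up.
def annotation_equivalent(evaluate_baseline, cpt_score, budgets):
    for n in sorted(budgets):
        if evaluate_baseline(n) >= cpt_score:
            return n  # continued pretraining is "worth" ~n annotated samples
    return None  # baseline never matches within the tested budgets
```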