
Improving model transferability for clinical note section classification models using continued pretraining

Weipeng Zhou, Meliha Yetisgen, Majid Afshar, Yanjun Gao, Guergana Savova, Timothy A Miller, Improving model transferability for clinical note section classification models using continued pretraining, Journal of the American Medical Informatics Association, 2023, ocad190.



Weipeng Zhou, BA
University of Washington

Weipeng Zhou is a PhD student in biomedical informatics at the University of Washington. Weipeng’s research explores the intersection of natural language processing (NLP) and healthcare. He studies clinical note section classification and suicide report characterization, and uses clinical narratives to gain insights into out-of-hospital cardiac arrest and Long COVID. He previously earned a BA in Computer Science and Statistics from the University of Wisconsin.

Timothy A. Miller, PhD
Boston Children’s Hospital and Harvard Medical School

Tim Miller is an Associate Professor in the Computational Health Informatics Program at Boston Children’s Hospital, Department of Pediatrics at Harvard Medical School, and at the Harvard-MIT Center for Regulatory Science. He is the PI of the Machine Learning for Medical Language Lab, home of several federally funded projects, including projects focused on basic biomedical NLP research, as well as projects that are driven by biomedical use cases. His research focuses on domain adaptation/generalizability of ML-based NLP methods, as well as methods for learning universal patient representations.



Statement of Purpose

Section classification is the task of assigning section names to segments of a clinical note. It is typically a preprocessing step that benefits downstream NLP tasks, such as named entity recognition and cohort discovery, for two main reasons. First, state-of-the-art information extraction models such as transformers (e.g., BERT) limit the length of their input, and clinical notes are often longer than that limit. Using section names to drop irrelevant parts of a note can help the remaining text fit within the model's input limit. Second, specific clinical information may appear only in particular sections of a clinical note; social determinants of health, for example, are found in the social history section. Section classification helps locate such information.

Our study explored how to build section classification methods that generalize better across healthcare institutions. Different institutions often use distinct note types and section names. To investigate the transferability of clinical note section classification models across institutions, we mapped section names to SOAP (Subjective, Objective, Assessment, and Plan) categories so that all datasets share a consistent label space. We then improved model transferability by applying continued pretraining, which involves further pretraining a model on an unannotated corpus of texts. We also measured how adding annotated target-domain data affects model performance, and expressed the benefit of continued pretraining in terms of the amount of annotated target-domain data it is worth.
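As a rough illustration of the SOAP mapping step described above, the following Python sketch normalizes a raw section header and looks it up in a small table. The table entries and the `to_soap` helper are hypothetical examples chosen for illustration; they are not the mapping used in the study.

```python
from typing import Optional

# Hypothetical mapping from institution-specific section headers to SOAP
# categories. These entries are illustrative only, not the study's mapping.
SECTION_TO_SOAP = {
    "chief complaint": "Subjective",
    "history of present illness": "Subjective",
    "social history": "Subjective",
    "physical exam": "Objective",
    "laboratory data": "Objective",
    "assessment": "Assessment",
    "impression": "Assessment",
    "plan": "Plan",
    "discharge instructions": "Plan",
}


def to_soap(section_name: str) -> Optional[str]:
    """Return the SOAP category for a raw section header, or None if unmapped."""
    # Lowercase, drop colons, and collapse repeated whitespace so that
    # "Social History:" and "social  history" resolve to the same key.
    key = " ".join(section_name.lower().replace(":", " ").split())
    return SECTION_TO_SOAP.get(key)


if __name__ == "__main__":
    for header in ["Social History:", "PHYSICAL EXAM", "Family Tree"]:
        print(header, "->", to_soap(header))
```

Because every dataset is projected onto the same four labels, a classifier trained at one institution can be evaluated directly on another, even when the original section names differ.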

Learning Objectives

  1. Understanding the SOAP (Subjective, Objective, Assessment, and Plan) note section definition and using it to study the performance of section classification models across healthcare institutions
  2. Understanding how to use continued pretraining to improve pretrained language models (e.g., BERT) with unlabeled notes from the target domain
  3. Measuring the value of continued pretraining in terms of the number of annotated target-domain samples it is worth
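
One possible way to read the third objective is as a learning-curve lookup: find how many annotated target-domain samples a baseline model would need to match the score a continued-pretrained model reaches without them. The sketch below assumes a monotonically non-decreasing baseline curve and uses linear interpolation; the function name and the numbers are made up for illustration and are not results or methods taken from the study.

```python
from typing import List, Optional, Tuple


def annotated_sample_equivalent(
    baseline_curve: List[Tuple[int, float]], target_score: float
) -> Optional[float]:
    """Estimate how many annotated samples the baseline needs to reach target_score.

    baseline_curve holds (number of annotated samples, score) points and is
    assumed to be non-decreasing in score. Returns None if the target lies
    above every point on the curve.
    """
    points = sorted(baseline_curve)
    # The baseline already meets the target at its smallest sample count.
    if target_score <= points[0][1]:
        return float(points[0][0])
    # Walk adjacent points and interpolate linearly inside the bracketing pair.
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        if s0 <= target_score <= s1 and s1 > s0:
            return n0 + (target_score - s0) * (n1 - n0) / (s1 - s0)
    return None


if __name__ == "__main__":
    # Made-up scores in percentage points, for illustration only.
    toy_baseline = [(0, 60), (100, 70), (200, 75)]
    toy_continued_pretraining_score = 65
    print(annotated_sample_equivalent(toy_baseline, toy_continued_pretraining_score))
```

With these toy numbers, continued pretraining would be "worth" the number of annotated samples at which the baseline curve crosses the same score.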


  • 35-minute presentation by article author(s) considering salient features of the published study and its potential impact on practice
  • 25-minute discussion of questions submitted by listeners via the webinar tools and moderated by JAMIA Student Editorial Board members. 

Accreditation Statement

The American Medical Informatics Association is accredited by the Accreditation Council for Continuing Medical Education to provide continuing medical education for physicians.

Credit Designation Statement

The American Medical Informatics Association designates this live activity for a maximum of 1.0 AMA PRA Category 1 Credits™. Physicians should claim only the credit commensurate with the extent of their participation in the activity.

Type: Webinar
Course Format(s): On Demand
Price: Free