
Development and Application of a High Throughput Natural Language Processing Architecture to Convert All Clinical Documents in a Clinical Data Warehouse into Standardized Medical Vocabularies

This on-demand webinar does not offer CE credit.

Afshar M, Dligach D, Sharma B, et al. Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies. J Am Med Inform Assoc. 2019 May 30. pii: ocz068. doi: 10.1093/jamia/ocz068.

Read the article

Watch the Recording


Majid Afshar, MD, MSCR
Assistant Professor
Division of Pulmonary and Critical Care
Department of Public Health Sciences
Ron Price, Jr.
Associate Provost
Office of Informatics and Systems Development
Loyola University Chicago Health Sciences


Daniel Feller, MS
PhD Candidate in Biomedical Informatics
Columbia University


Kelson Zawack, PhD
Postdoctoral Fellow, Biostatistics Department
Yale University
Tiffany J. Callahan, MPH, PhD Candidate
Computational Bioscience Program
University of Colorado Denver Anschutz Medical Campus

Statement of Purpose

Information in the clinical narrative of the electronic health record (EHR) is a rich source of data and comprises a large majority of patient data, but its unstructured format renders it complex and difficult to utilize. Clinical data warehouses (CDWs) of health systems are becoming larger and more efficient in today’s health data ecosystem; therefore, high throughput architectures to manage and process the data are needed. Large-scale efforts at de-identification of clinical notes and curation of the data for research purposes are underway in the National Center for Advancing Translational Sciences (NCATS). Methods in natural language processing (NLP) have proven effective in automatic semantic analyses of clinical documents with concept mapping to standardized medical vocabularies. Several centers have demonstrated success in high throughput NLP, but little guidance exists on optimizing its performance for an entire health system. First, we aim to develop a high throughput NLP architecture using the cTAKES engine to concept map over ten years of clinical documents from our CDW using the Unified Medical Language System (UMLS). Second, we aim to examine the application of our architecture in the context of a hospital 30-day readmission prediction task.

Our high throughput NLP architecture converted our health system’s data corpus of over 84 million unstructured clinical notes into a completely de-identified data repository of nearly 40 billion structured and standardized data elements. This task was accomplished at a rate of over 500,000 documents per hour through our on-premise data center. The results for predicting 30-day hospital readmission demonstrate that mapped concepts from UMLS performed similarly to n-grams. The processed data is a new addition to our clinical research database for researchers and administrators interested in data mining and analytics from any note or report. This repository may be more appealing for end-users and researchers interested in using clinical notes from their CDW, and our results suggest that concept unique identifier (CUI) features drawn from standardized medical vocabularies are one option for large-scale clinical research in data analytics.
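To put the reported figures in perspective, the short sketch below works out a rough processing time from the two numbers given above. It treats "over 84 million" notes and "over 500,000" documents per hour as exact values, which is an assumption made only for illustration; the actual wall-clock time is not stated here.

```python
# Rough estimate only: both inputs are the rounded lower bounds quoted in
# the summary, not exact measurements.
total_notes = 84_000_000   # "over 84 million" clinical notes
notes_per_hour = 500_000   # "over 500,000 documents per hour"

hours = total_notes / notes_per_hour
days = hours / 24

print(f"Approximate processing time: {hours:.0f} hours ({days:.0f} days)")
```

Under these assumptions the full corpus corresponds to about a week of continuous processing at the reported rate.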

Target Audience

The target audience for this activity is professionals and students interested in biomedical and health informatics.

Learning Objectives

The general learning objective for all of the JAMIA Journal Club webinars is that participants will

  • Use a critical appraisal process to assess article validity and to gauge article findings' relevance to practice

After this live activity, the participant should be better able to:

  • Understand how to design a high throughput NLP architecture to produce a deidentified clinical data warehouse of a health system’s corpus of notes converted into standardized medical vocabularies, and
  • Apply concept unique identifiers (CUIs) from a big database/clinical data warehouse to perform data analytics such as applied predictive modelling or phenotyping tasks.
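
As a rough illustration of the second objective, the sketch below turns one document's CUIs into a count vector of the kind a predictive model could consume, alongside a word n-gram equivalent for comparison. It is a minimal sketch and not the presenters' pipeline: it assumes the CUIs have already been extracted by an NLP engine, the CUI strings are made-up placeholders, and the helper names (`bag_of_cuis`, `word_ngrams`, `to_vector`) are hypothetical.

```python
from collections import Counter


def bag_of_cuis(cuis):
    """Count how often each CUI occurs in one document."""
    return Counter(cuis)


def word_ngrams(text, n=1):
    """Count lowercase whitespace-delimited word n-grams in raw text."""
    tokens = text.lower().split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def to_vector(counts, vocabulary):
    """Lay counts out in a fixed vocabulary order, using 0 for absent terms."""
    return [counts.get(term, 0) for term in vocabulary]


# Placeholder CUIs for illustration only; these are not real UMLS identifiers.
doc_cuis = ["C0000001", "C0000002", "C0000001"]
vocabulary = sorted(set(doc_cuis))
vector = to_vector(bag_of_cuis(doc_cuis), vocabulary)

# The same idea applied to raw text with word bigrams.
bigrams = word_ngrams("chest pain chest pain", n=2)
```

In this representation each document becomes a vector of concept counts rather than word counts, so the features carry no free text from the original note.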

This JAMIA Journal Club does not offer continuing education credit.

In our dedication to providing unbiased education even when no CE credit is associated with it, we provide planners’ and presenters’ disclosure of relevant financial relationships with commercial interests that have the potential to introduce bias in the presentation:

Disclosures for this Activity

These faculty, planners, and staff who are in a position to control the content of this activity disclose that they and their life partners have no relevant financial relationships with commercial interests: 

JAMIA Journal Club presenters: Majid Afshar, Ron Price, Jr.
JAMIA Journal Club planners: Michael Chiang, Kelson Zawack, Tiffany J. Callahan, Daniel Feller
AMIA staff: Susanne Arnold, Pesha Rubinstein


Type: Webinar
Course Format(s): On Demand