Deploying NLP and machine learning in real-world health systems requires far more than model development—it demands scalable infrastructure, rigorous evaluation, and integration within complex operational environments. In this talk, we describe the design and deployment of a national NLP pipeline across the U.S. Department of Veterans Affairs, processing over a million clinical notes per day to extract clinically and operationally relevant signals for suicide prevention, overdose surveillance, and population health management. We highlight key architectural decisions, including deployment within secure cloud and high-performance computing environments, approaches to maintaining data provenance and auditability, and the use of task-driven data representations that depart from strict common data model constraints to better support real-time operational use.
We then present our integration of large language models into this production pipeline, focusing on where they provide measurable gains in signal extraction and where they introduce new challenges related to stability, cost, and validation at scale. We further discuss how these extracted features are incorporated into downstream predictive modeling frameworks and evaluated using metrics aligned with operational decision-making, including risk concentration under constrained intervention capacity. Finally, we outline next-generation directions, including hybrid symbolic–LLM architectures, dynamic phenotype modeling, and the integration of clinical, social, and economic data streams to support population-level inference. This work provides a concrete, experience-driven roadmap for translating advanced AI methods into durable, high-impact systems—and highlights key technical and scientific challenges that remain for the field.
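As an illustration of the evaluation idea mentioned above, risk concentration under constrained intervention capacity can be read as: if staff can only reach a fixed number of patients, how many of the eventual adverse events fall among the patients the model ranks highest? The following is a minimal sketch under that reading, not the metric definition used in the pipeline itself. The function name `risk_concentration`, its parameters, and the toy scores and outcomes are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def risk_concentration(
    scores: Sequence[float],
    outcomes: Sequence[int],
    capacity: int,
) -> Dict[str, float]:
    """Summarize how much observed risk falls within the top-`capacity` patients.

    scores   -- model risk scores, one per patient (higher means riskier)
    outcomes -- 1 if the adverse event occurred, 0 otherwise
    capacity -- number of patients the program can actually reach (assumed fixed)
    """
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    if not 0 < capacity <= len(scores):
        raise ValueError("capacity must be between 1 and the number of patients")

    total_events = sum(outcomes)
    if total_events == 0:
        raise ValueError("no events observed; concentration is undefined")

    # Rank patient indices from highest to lowest score (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = ranked[:capacity]

    captured = sum(outcomes[i] for i in flagged)
    capture_rate = captured / total_events        # share of all events reached
    flagged_fraction = capacity / len(scores)     # share of population reached

    return {
        "captured_events": float(captured),
        "capture_rate": capture_rate,
        "event_rate_in_flagged": captured / capacity,
        # Values above 1 mean events are concentrated in the flagged group
        # relative to reaching the same number of patients at random.
        "concentration_ratio": capture_rate / flagged_fraction,
    }


# Toy data: 10 patients, 3 events, capacity to reach 3 patients.
toy_scores = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
toy_outcomes = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
result = risk_concentration(toy_scores, toy_outcomes, capacity=3)
print(result)
```

Unlike threshold-free summaries such as AUROC, a measure of this kind is tied to a specific capacity, so two models with similar overall discrimination can differ noticeably at the operating point a program can actually staff.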