Large language models (LLMs) are increasingly used in healthcare, yet most evaluations rely on clean, exam-style datasets that fail to capture the complexity of real-world clinical data, such as electronic health records (EHRs), and rarely keep pace with rapidly evolving LLMs.
In this talk, we present BRIDGE, a multilingual benchmark built from real clinical tasks and over one million EHR-derived samples, on which we evaluated 95 leading LLMs across 24,000+ experiments and 39 million predictions. Our findings show wide variation across model families, tasks, and languages, with several open-source models matching proprietary ones. We also observe that chain-of-thought prompting often lowers accuracy on these clinical tasks, and we provide the first large-scale analysis of stigmatized language generated during model reasoning.