Large language models (LLMs) are increasingly used in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on qualitative human assessment, which is time-consuming, resource-intensive, and potentially subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation.
Author(s): Hong, Chuan; Chowdhury, Anand; Sorrentino, Anthony D; Wang, Haoyuan; Agrawal, Monica; Bedoya, Armando; Bessias, Sophia; Economou-Zavlanos, Nicoleta J; Wong, Ian; Pean, Christian; Li, Fan; Pollak, Kathryn I; Poon, Eric G; Pencina, Michael J
DOI: 10.1093/jamia/ocaf023