Large language models (LLMs) hold immense promise for democratizing access to medical information and assisting physicians in delivering higher-quality care. However, realistic evaluations of LLMs in clinical contexts have been limited, with much of the focus placed on multiple-choice assessments of clinical knowledge. In this talk, I will present a four-level framework for clinical evaluations, encompassing multiple-choice knowledge assessments, open-ended human ratings, offline human evaluations of real tasks, and online real-world studies within actual workflows. I will discuss the strengths and weaknesses of each approach and argue that advancing toward more realistic evaluations is crucial for realizing the full potential of LLMs in medicine.