Result: Large Language Model-Based Patient Simulation to Foster Communication Skills in Health Care Professionals: User-Centered Development and Usability Study.
Further information
Background: Case-based learning using standardized patients is a key method for teaching communication skills in medicine. Besides posing logistical and financial hurdles, standardized patients portrayed by actors cannot cover the full diversity of patients' sociodemographic factors. Large language models (LLMs) show promise for creating scalable patient simulations and could potentially cover a broader diversity of factors. They could also be integrated into the continuous training of future health care professionals' communication and interaction skills.
Objective: This study aimed to introduce the system architecture of a digital tool that leverages LLMs to simulate patient conversations for medical education, focusing specifically on medical history taking. Through an explorative analysis, we aimed to assess the tool's usability and examine differences between LLMs in simulating patient encounters.
Methods: We followed a user-centered design process, gathering initial requirements from 2 medical students. We then developed a fully functional web prototype using a Python Flask backend and a PostgreSQL database, integrating 5 LLMs from OpenAI, Anthropic, and xAI. The system includes an artificial intelligence-assisted case vignette generator and a dynamic patient simulator. For the explorative analysis of the prototype, we conducted a task-based usability test with 5 medical students, measuring their experience using the System Usability Scale (SUS) questionnaire and qualitative questions. We then conducted an explorative analysis in which 4 practicing physicians evaluated the simulation quality of 3 models (Grok 3, GPT-4, and Claude 3 Opus) across 7 criteria on a 5-point Likert scale.
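As a minimal sketch of how the SUS questionnaire mentioned above is conventionally scored (10 items answered on a 1-5 scale, odd items positively worded, even items negatively worded, total rescaled to 0-100), the following Python snippet may help. The participant answers are hypothetical and are not the study data; the abstract does not state whether the reported SD is a sample or population standard deviation, so the use of the sample SD here is an assumption.

```python
from statistics import mean, stdev


def sus_score(responses):
    """Return the SUS score (0-100) for one participant's 10 answers (each 1-5)."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly 10 answers, each an integer from 1 to 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd items are positively worded: contribution is answer - 1.
            total += answer - 1
        else:
            # Even items are negatively worded: contribution is 5 - answer.
            total += 5 - answer
    # The summed contributions (0-40) are rescaled to 0-100.
    return total * 2.5


# Hypothetical answers from three participants (NOT the study data).
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [5, 2, 4, 1, 5, 1, 4, 2, 5, 1],
]
scores = [sus_score(p) for p in participants]
print("mean SUS:", mean(scores))
print("sample SD:", stdev(scores))  # assumption: sample SD, not population SD
```

A per-participant score is computed first and the mean and SD are taken over participants, which is the usual way a figure such as "mean SUS score (SD)" is obtained.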
Results: Usability testing yielded a mean SUS score of 91.5 (SD 8.40), indicating high perceived usability in a small formative sample. Students praised the system's simplicity and intuitive design but noted the absence of a formal conclusion and performance feedback, expressing a desire for a "didactic loop" to maximize learning. The models showed limitations in simulating uncertainties and memory lapses, responding to follow-up questions, and producing natural conversational flow. They performed well in simulating a coherent symptom profile, in using patient-like language, and in describing a realistic timeline and symptom progression. The differences among the models were not statistically significant. Ratings showed limited discriminative reliability (Kendall W=0-0.19, ie, very low) and a ceiling effect, with most scores clustered at 4-5, constraining interpretation; all group differences should therefore be viewed as exploratory.
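To illustrate the agreement statistic reported above, here is a minimal sketch of Kendall's coefficient of concordance (W) with the standard correction for ties, which matters when Likert ratings cluster at 4-5. The abstract does not describe how the authors arranged their ratings or which software they used, so the layout (one row per rater, one column per rated item) and the example ratings are assumptions for illustration only.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """ratings[j][i] is the score from rater j for item i; returns tie-corrected W."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of t tied scores per rater.
    ties = sum(t ** 3 - t for row in ratings for t in Counter(row).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    # If every rater gives identical scores to all items, W is undefined.
    return float("nan") if denom == 0 else 12 * s / denom


# Hypothetical ratings: 4 raters (rows) scoring 3 models (columns) on one criterion.
example = [
    [4, 5, 4],
    [5, 4, 4],
    [4, 4, 5],
    [5, 5, 4],
]
print("Kendall W:", kendalls_w(example))
```

W ranges from 0 (no agreement among raters on the ordering of items) to 1 (identical orderings). When most scores sit at the top of the scale, many ranks are tied and there is little variance left to order the items, which is one way a ceiling effect can coincide with low W.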
Conclusions: We successfully developed a highly usable patient simulation tool that serves as a foundation for further development. Our results show that while the tool could be effective for communication training, its full potential will only be realized by integrating an automated feedback mechanism to create a complete didactic loop, as requested by the test users. Future work should assess in more depth the differences among the models in simulating psychosocial patient characteristics.
(©Ahmed Elhilali, Andy Suy-Huor Ngo, Daniel Reichenpfader, Kerstin Denecke. Originally published in JMIR Medical Education (https://mededu.jmir.org), 12.12.2025.)