ACCP - 2024 Annual Meeting

Sun-76 - Evaluating Accuracy and Reproducibility of Large Language Model Performance in Pharmacy Education

Original Research
Sunday, October 13, 2024
12:45 PM–02:15 PM

Abstract

Introduction: Large language models (LLMs) have demonstrated acceptable performance in the context of structured problems; however, model performance may be suboptimal when applied to complex scenarios.

Research Question or Hypothesis: How does performance compare across various LLMs when applied to multiple-choice case-based pharmacotherapy questions, and can performance be improved by prompt engineering or model customization?

Study Design: Comparative analysis of LLM performance on multiple-choice questions.

Methods: Five LLMs (ChatGPT with GPT-3.5 and GPT-4, Claude 2, Llama2-7b, and Llama2-13b) were evaluated on a dataset of 219 multiple-choice pharmacotherapy questions. Each LLM was queried five times to evaluate the primary outcome of accuracy (i.e., correctness) and the key secondary outcome of variance. Additional secondary outcomes included performance on knowledge-based vs skill-based questions, the impact on performance of prompt engineering techniques (zero-shot chain-of-thought (CoT), few-shot CoT, and self-consistency) and of training a customized GPT, and performance relative to third-year pharmacy students on a subset of 120 multiple-choice questions.
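As an illustration of the repeated-query design, the sketch below scores each run against an answer key and then summarizes accuracy and variance across runs. It is a minimal sketch under stated assumptions: the abstract does not say how variance was calculated, so the population variance of per-run accuracy is assumed here, and the function names, answer key, and responses are hypothetical toy values, not study data.

```python
from statistics import mean, pvariance

def run_accuracy(answers, key):
    """Fraction of questions answered correctly in a single run."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def summarize_runs(runs, key):
    """Mean and population variance of accuracy across repeated runs.

    Assumption: variance is taken over per-run accuracy; the abstract
    does not state the definition the authors used.
    """
    accuracies = [run_accuracy(r, key) for r in runs]
    return mean(accuracies), pvariance(accuracies)

# Hypothetical toy data: four questions, five repeated queries of one model.
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "D"],
    ["A", "C", "B", "A"],
    ["A", "B", "B", "D"],
    ["A", "C", "B", "D"],
    ["A", "C", "D", "D"],
]

mean_acc, var_acc = summarize_runs(runs, key)
print(f"mean accuracy = {mean_acc:.3f}, variance = {var_acc:.4f}")
```

Other definitions of variance (for example, per-question disagreement across the five responses) would give different values, so the numbers from this sketch are not comparable to those reported in the Results.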

Results: ChatGPT with GPT-4 exhibited the highest accuracy (71.6%), while Llama2-13b had the lowest variance (0.070). All LLMs were more accurate on knowledge-based than on skill-based questions (e.g., GPT-4: 87% vs 67%). When applied to GPT-4, few-shot CoT across five runs improved accuracy (77.4% vs 71.5%) with no effect on variance. Self-consistency and the custom-trained GPT demonstrated accuracy similar to that of GPT-4 with few-shot CoT. Overall pharmacy student accuracy was 81%, compared with a best overall LLM accuracy of 73%. By question type, six of the LLM configurations evaluated demonstrated equivalent or higher accuracy than pharmacy students on knowledge-based questions (e.g., self-consistency vs students: 93% vs 84%), but pharmacy students achieved higher accuracy than all LLMs on skill-based questions (e.g., self-consistency vs students: 68% vs 80%).
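Self-consistency is commonly described as sampling several chain-of-thought responses to the same question and keeping the most frequent final answer. The sketch below shows only that voting step, as an assumption about the general technique; the abstract does not specify how many samples were drawn or how ties were handled, and the function name and sample answers are hypothetical.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Return the most frequent answer among the sampled responses.

    Ties resolve to the answer that was sampled first, because Counter
    orders equal counts by first occurrence (an illustrative choice,
    not a rule taken from the study).
    """
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled responses.
samples = ["B", "A", "B", "C", "B"]
voted = self_consistent_answer(samples)
print(voted)
```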

Conclusion: LLMs demonstrate accuracy comparable to that of pharmacy students on knowledge-based assessments. LLM performance on complex tasks can be improved by prompt engineering and model training.

Presenting Author

Tara Kennell PharmD Candidate
University of Georgia College of Pharmacy

Authors

Mengxuan Hu PhD Candidate
School of Data Science, University of Virginia

Sheng Li PhD
School of Data Science, University of Virginia

Amoreena Most PharmD, BCCCP
University of Georgia College of Pharmacy

Brian Murray PharmD, BCCCP
University of Colorado Skaggs School of Pharmacy and Pharmaceutical Sciences

Andrea Sikora Newsome PharmD, MSCR, BCCCP, FCCM
Augusta University Medical Center/UGA College of Pharmacy

Susan E. Smith PharmD, BCCCP, FCCM
University of Georgia College of Pharmacy

Huibo Yang PhD Candidate
Department of Computer Science, University of Virginia

Posters

Scientific Poster Session II - Original Research
