RACS ASC 2025
Which large language model performs best in the Generic Surgical Sciences Examination?
Verbal Presentation

12:19 pm

05 May 2025

Meeting Room C4.9

RESEARCH PAPERS

Talk Description

Institution: Westmead Hospital, NSW, Australia

Purpose

Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude are being explored as tools in surgical decision-making. Because LLMs process information through neural networks, their performance in medicine can be assessed the same way human performance is: with examinations. This study evaluates LLM performance in the RACS Generic Surgical Sciences Examination (GSSE), a mandatory hurdle for Australian surgical trainees.

Methodology

ChatGPT 4.0, Gemini 2.0, and Claude Sonnet 3.5 were tested against 650 questions from a public GSSE question bank, weighted per RACS guidelines. A score above 65% in each component and overall was required to pass. A 100,000-iteration Monte Carlo simulation calculated failure probabilities and mean, maximum, and minimum scores (a simulation of this kind is sketched below).

Results

Claude performed best, averaging 84.8% with a 0.1% failure probability. Gemini averaged 84.1% with a 0.2% failure rate, while ChatGPT scored 78.5% on average, failing 1.7% of the time. Claude excelled in anatomy (76.4%), outperforming Gemini (72.6%) and ChatGPT (67.3%). In physiology, Gemini led with 88.9%, followed by Claude (85.6%) and ChatGPT (75.8%). Pathology scores were similar for Claude (90.0%) and Gemini (91.4%), both surpassing ChatGPT’s 82.9%. Differences were not statistically significant on t-testing.

Conclusions

Claude performed most consistently, but all three LLMs showed strong potential as decision-making tools in surgery. Further validation in advanced postgraduate and Fellowship-level examinations is needed to confirm their utility in surgical practice.
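The following is a minimal sketch of a Monte Carlo pass/fail simulation of the kind described in the Methodology. The component question counts and weights are hypothetical (the actual RACS weightings are not given here); the per-component accuracies echo Claude's reported component scores from the Results, used purely for illustration. Modelling each component as a binomial draw treats every question as an independent trial at the observed accuracy, which is the simplest assumption consistent with the description above.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical component structure: question counts and weights are
# illustrative assumptions, not the study's actual exam parameters.
# Accuracies are Claude's reported per-component scores from the abstract.
components = {
    "anatomy":    {"n_questions": 250, "accuracy": 0.764, "weight": 0.40},
    "physiology": {"n_questions": 200, "accuracy": 0.856, "weight": 0.30},
    "pathology":  {"n_questions": 200, "accuracy": 0.900, "weight": 0.30},
}
PASS_MARK = 0.65   # a score above 65% is required in each component and overall
N_ITER = 100_000   # iteration count stated in the Methodology

fails = 0
overall_scores = np.empty(N_ITER)
for i in range(N_ITER):
    weighted_total = 0.0
    all_components_passed = True
    for c in components.values():
        # Draw the number of correct answers as a binomial sample at the
        # model's observed accuracy for this component.
        correct = rng.binomial(c["n_questions"], c["accuracy"])
        score = correct / c["n_questions"]
        if score <= PASS_MARK:
            all_components_passed = False
        weighted_total += c["weight"] * score
    overall_scores[i] = weighted_total
    # A sitting fails if any component, or the weighted overall score,
    # does not exceed the pass mark.
    if not all_components_passed or weighted_total <= PASS_MARK:
        fails += 1

print(f"mean {overall_scores.mean():.1%}  "
      f"min {overall_scores.min():.1%}  max {overall_scores.max():.1%}  "
      f"P(fail) {fails / N_ITER:.2%}")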
Authors

Dr Asanka Wijetunga