Talk Description
Institution: Westmead Hospital - NSW, Australia
Purpose
Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude are being explored as tools in surgical decision-making. Because LLMs process information through neural networks, their performance in medicine can be assessed much as a human’s would be: with examinations. This study evaluates LLM performance on the Royal Australasian College of Surgeons (RACS) Generic Surgical Sciences Examination (GSSE), a mandatory hurdle for Australian surgical trainees.
Methodology
ChatGPT 4.0, Gemini 2.0, and Claude 3.5 Sonnet were tested against 650 questions from a public GSSE question bank, weighted according to RACS guidelines. Component and overall scores above 65% were required to pass. A 100,000-iteration Monte Carlo simulation was used to estimate each model's failure probability and its mean, maximum, and minimum scores; a minimal sketch of this approach is given below.
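The following Python sketch illustrates one way such a simulation could be set up, assuming per-question correctness data resampled with replacement in each iteration. The question counts, component weights, per-component accuracies, and pass rule are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np

# Minimal sketch of a 100,000-iteration Monte Carlo pass/fail simulation.
# All numbers below (question counts, accuracies, weights) are assumptions for illustration.
rng = np.random.default_rng(0)
results = {
    "anatomy":    rng.binomial(1, 0.76, size=250),   # 1 = correct, 0 = incorrect
    "physiology": rng.binomial(1, 0.86, size=200),
    "pathology":  rng.binomial(1, 0.90, size=200),
}
weights = {"anatomy": 0.4, "physiology": 0.3, "pathology": 0.3}  # assumed weighting
PASS_MARK = 0.65
N_ITER = 100_000

overall_scores = np.empty(N_ITER)
failures = 0
for i in range(N_ITER):
    component_scores = {}
    for component, answers in results.items():
        # Resample the model's answers with replacement to model run-to-run variation.
        sample = rng.choice(answers, size=answers.size, replace=True)
        component_scores[component] = sample.mean()
    overall = sum(weights[c] * s for c, s in component_scores.items())
    overall_scores[i] = overall
    # Fail if the overall score or any component score does not exceed the pass mark.
    if overall <= PASS_MARK or min(component_scores.values()) <= PASS_MARK:
        failures += 1

print(f"failure probability: {failures / N_ITER:.2%}")
print(f"mean {overall_scores.mean():.1%}, "
      f"min {overall_scores.min():.1%}, max {overall_scores.max():.1%}")
```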
Results
Claude performed best, averaging 84.8% with a 0.1% failure probability. Gemini averaged 84.1% with a 0.2% failure rate, while ChatGPT averaged 78.5%, failing 1.7% of the time. Claude led in anatomy (76.4%), outperforming Gemini (72.6%) and ChatGPT (67.3%). In physiology, Gemini led with 88.9%, followed by Claude (85.6%) and ChatGPT (75.8%). Pathology scores were similar for Claude (90.0%) and Gemini (91.4%), both surpassing ChatGPT’s 82.9%. Differences between the models were not statistically significant on t-testing.
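For the significance testing step, a comparison of two models' simulated overall-score distributions might look like the sketch below; the arrays are placeholders rather than the study's data, and the printed result reflects only those placeholders (the study itself found no significant differences).

```python
import numpy as np
from scipy import stats

# Placeholder simulated overall-score distributions for two models (assumed values).
rng = np.random.default_rng(1)
scores_model_a = rng.normal(loc=0.85, scale=0.08, size=1000)
scores_model_b = rng.normal(loc=0.79, scale=0.08, size=1000)

# Welch's t-test, which does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(scores_model_a, scores_model_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```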
Conclusions
Claude performed most consistently, but all three LLMs showed strong potential as decision-making tools in surgery. Further validation in advanced postgraduate and Fellowship-level exams is needed to confirm their utility in surgical practice.
Presenters
Authors
Dr Asanka Wijetunga -