RACS ASC 2025
Which large language model performs best in the Generic Surgical Sciences Examination?
Verbal Presentation

12:19 pm

05 May 2025

Meeting Room C4.9

RESEARCH PAPERS

Talk Description

Institution: Westmead Hospital, NSW, Australia

Purpose

Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude are being explored as tools in surgical decision-making. Because LLMs process information through neural networks, their performance in medicine can be assessed the same way human performance is: with examinations. This study evaluates LLM performance in the RACS Generic Surgical Sciences Examination (GSSE), a mandatory hurdle for Australian surgical trainees.

Methodology

ChatGPT 4.0, Gemini 2.0, and Claude Sonnet 3.5 were tested against 650 questions from a public GSSE question bank, weighted per RACS guidelines. A score above 65% in each component and overall was required to pass. A 100,000-iteration Monte Carlo simulation calculated failure probabilities and mean, maximum, and minimum scores (a simulation of this kind is sketched below).

Results

Claude performed best, averaging 84.8% with a 0.1% failure probability. Gemini averaged 84.1% with a 0.2% failure rate, while ChatGPT scored 78.5% on average, failing 1.7% of the time. Claude excelled in anatomy (76.4%), outperforming Gemini (72.6%) and ChatGPT (67.3%). In physiology, Gemini led with 88.9%, followed by Claude (85.6%) and ChatGPT (75.8%). Pathology scores were similar for Claude (90.0%) and Gemini (91.4%), both surpassing ChatGPT’s 82.9%. Differences were not statistically significant on t-testing.

Conclusions

Claude performed most consistently, but all three LLMs showed strong potential as decision-making tools in surgery. Further validation in advanced postgraduate and Fellowship-level examinations is needed to confirm their utility in surgical practice.
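The following is a minimal sketch of a Monte Carlo pass/fail simulation of the kind described in the Methodology. The component question counts and weights are hypothetical (the actual RACS weightings are not given here); the per-component accuracies echo Claude's reported component scores from the Results, used purely for illustration. Modelling each component as a binomial draw treats every question as an independent trial at the observed accuracy, which is the simplest assumption consistent with the description above.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical component structure: question counts and weights are
# illustrative assumptions, not the study's actual exam parameters.
# Accuracies are Claude's reported per-component scores from the abstract.
components = {
    "anatomy":    {"n_questions": 250, "accuracy": 0.764, "weight": 0.40},
    "physiology": {"n_questions": 200, "accuracy": 0.856, "weight": 0.30},
    "pathology":  {"n_questions": 200, "accuracy": 0.900, "weight": 0.30},
}
PASS_MARK = 0.65   # a score above 65% is required in each component and overall
N_ITER = 100_000   # iteration count stated in the Methodology

fails = 0
overall_scores = np.empty(N_ITER)
for i in range(N_ITER):
    weighted_total = 0.0
    all_components_passed = True
    for c in components.values():
        # Draw the number of correct answers as a binomial sample at the
        # model's observed accuracy for this component.
        correct = rng.binomial(c["n_questions"], c["accuracy"])
        score = correct / c["n_questions"]
        if score <= PASS_MARK:
            all_components_passed = False
        weighted_total += c["weight"] * score
    overall_scores[i] = weighted_total
    # A sitting fails if any component, or the weighted overall score,
    # does not exceed the pass mark.
    if not all_components_passed or weighted_total <= PASS_MARK:
        fails += 1

print(f"mean {overall_scores.mean():.1%}  "
      f"min {overall_scores.min():.1%}  max {overall_scores.max():.1%}  "
      f"P(fail) {fails / N_ITER:.2%}")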
Authors

Dr Asanka Wijetunga