Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis.

Publication Type Academic Article
Authors Gebhard H, Kartal A, Manalil N, Chung L, Daulat S, Cheng C, Waheed A, Shahzadi A, Hussain I, Härtl R, Elsayed G
Journal Global Spine J
Pagination 21925682261448823
Date Published 05/06/2026
ISSN 2192-5682
Abstract Study Design: Diagnostic accuracy study. Objective: Prior evaluations of frontier models as radiology decision-support tools relied on 2-dimensional images or text reports; their ability to interpret volumetric data remains unclear. This study assessed Google Gemini 3 Pro for grading lumbar spinal canal stenosis on video lumbar magnetic resonance imaging (MRI) and evaluated diagnostic accuracy, agreement with neuroradiologist consensus, and the effect of localizer-assisted input. Methods: The Radiological Society of North America (RSNA) 2024 Lumbar Spine Degenerative Classification Dataset, with American Society of Neuroradiology (ASNR) consensus labels, served as a reference benchmark; interobserver agreement among contributing readers was not reported. One hundred examinations yielded 500 disc-level observations (371 normal/mild, 74 moderate, 55 severe), demonstrating marked class imbalance. Native imaging series were converted into synchronized video montages. Gemini 3 Pro generated one grade per disc level with and without localizer overlays. Primary outcome was linearly weighted kappa (κw); secondary outcomes included class-wise performance, severe-case error patterns, and overall accuracy. Results: Without localizer, overall accuracy was 75.6% (378/500) with fair agreement (κw = 0.39). Severe stenosis sensitivity was 41.8%; 43.6% of severe cases were downgraded to normal/mild, and 58.2% overall were graded as non-severe (normal/mild or moderate). With localizer overlays, accuracy was 73.2% (366/500) with κw = 0.32, and severe sensitivity decreased to 30.9%; severe-to-normal/mild misses increased to 52.7%. Differences were not significant. Conclusions: Gemini 3 Pro showed fair agreement with the neuroradiologist consensus benchmark, but apparent overall accuracy was inflated by the majority normal/mild class and masked clinically unacceptable under-detection of severe stenosis. Localizer-assisted input did not improve performance.
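The abstract's point that overall accuracy is inflated by the majority class can be illustrated with a minimal sketch. It assumes the standard definition of linearly weighted kappa (disagreement weight |i - j| / (k - 1) for k ordered grades) and uses only the class counts reported above (371 normal/mild, 74 moderate, 55 severe). The function name `linear_weighted_kappa` and the "always predict normal/mild" grader are hypothetical illustrations, not the study's code, and the study's full confusion matrix is not given in the abstract.

```python
def linear_weighted_kappa(confusion):
    """Linearly weighted kappa for a square confusion matrix.

    confusion[i][j] = number of cases with reference grade i and predicted
    grade j, with grades ordered (0 = normal/mild, 1 = moderate, 2 = severe).
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            observed += weight * confusion[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    return 1.0 - observed / expected


# Class counts reported in the abstract.
class_counts = [371, 74, 55]
total = sum(class_counts)

# Hypothetical grader that labels every disc level normal/mild.
majority_confusion = [[count, 0, 0] for count in class_counts]
majority_accuracy = class_counts[0] / total
majority_kappa = linear_weighted_kappa(majority_confusion)

print(f"majority-class accuracy: {majority_accuracy:.1%}")
print(f"majority-class weighted kappa: {majority_kappa:.2f}")
```

Under these assumptions, a grader that never reports moderate or severe stenosis still reaches 74.2% accuracy (371/500) while its weighted kappa is 0, which is why the reported 75.6% accuracy says little on its own and κw together with severe-class sensitivity is the more informative summary.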
DOI 10.1177/21925682261448823
PubMed ID 42092340
PubMed Central ID PMC13149351