Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis.

Publication Type	Academic Article
Authors	Gebhard H, Kartal A, Manalil N, Chung L, Daulat S, Cheng C, Waheed A, Shahzadi A, Hussain I, Härtl R, Elsayed G
Journal	Global Spine J
Pagination	21925682261448823
Date Published	05/06/2026
ISSN	2192-5682
Abstract	Study Design: Diagnostic accuracy study. Objective: Prior evaluations of frontier models as radiology decision-support tools relied on 2-dimensional images or text reports; their ability to interpret volumetric data remains unclear. This study assessed Google Gemini 3 Pro for grading lumbar spinal canal stenosis on video lumbar magnetic resonance imaging (MRI) and evaluated diagnostic accuracy, agreement with neuroradiologist consensus, and the effect of localizer-assisted input. Methods: The Radiological Society of North America (RSNA) 2024 Lumbar Spine Degenerative Classification Dataset, with American Society of Neuroradiology (ASNR) consensus labels, served as a reference benchmark; interobserver agreement among contributing readers was not reported. 100 examinations yielded 500 disc-level observations (371 normal/mild, 74 moderate, 55 severe), demonstrating marked class imbalance. Native imaging series were converted into synchronized video montages. Gemini 3 Pro generated one grade per disc level with and without localizer overlays. Primary outcome was linearly weighted kappa (κw); secondary outcomes included class-wise performance, severe-case error patterns, and overall accuracy. Results: Without localizer, overall accuracy was 75.6% (378/500) with fair agreement (κw = 0.39). Severe stenosis sensitivity was 41.8%; 43.6% of severe cases were downgraded to normal/mild, and 58.2% to non-severe. With localizer overlays, accuracy was 73.2% (366/500) with κw = 0.32, and severe sensitivity decreased to 30.9%; severe-to-normal/mild misses increased to 52.7%. Differences were not significant. Conclusions: Gemini 3 Pro showed fair agreement with the neuroradiologist consensus benchmark, but apparent overall accuracy was inflated by the majority normal/mild class and masked clinically unacceptable under-detection of severe stenosis. Localizer-assisted input did not improve performance.
DOI	10.1177/21925682261448823
PubMed ID	42092340
PubMed Central ID	PMC13149351
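
Illustrative note on the abstract's class-imbalance point: the sketch below is not the authors' analysis code. It assumes a hypothetical grader that labels every disc level normal/mild and applies it to the class counts reported in the abstract (371 normal/mild, 74 moderate, 55 severe). The function names and the linear weight |i - j| / (k - 1) are assumptions chosen for illustration; the study's full confusion matrix is not given in this record, so none is reproduced here.

```python
# Hypothetical baseline: every one of the 500 disc levels is graded normal/mild.
# Rows are reference grades, columns are predicted grades,
# in the order (normal/mild, moderate, severe).
baseline = [
    [371, 0, 0],  # reference normal/mild
    [74, 0, 0],   # reference moderate
    [55, 0, 0],   # reference severe
]


def accuracy(confusion):
    """Fraction of observations on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def linear_weighted_kappa(confusion):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed += weight * confusion[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    if expected == 0.0:
        # Degenerate case: all observations fall in a single grade for both raters.
        return 1.0
    return 1.0 - observed / expected


print(f"baseline accuracy: {accuracy(baseline):.3f}")
print(f"baseline weighted kappa: {linear_weighted_kappa(baseline):.2f}")
```

Under these assumptions the constant grader reaches 371/500 = 74.2% accuracy while detecting none of the 55 severe cases, and its weighted kappa is zero because observed and chance-expected disagreement coincide. That is close to the 75.6% accuracy reported without localizer, which is consistent with the abstract's conclusion that overall accuracy was inflated by the majority class.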