Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis.
| Publication Type | Academic Article |
| Authors | Gebhard H, Kartal A, Manalil N, Chung L, Daulat S, Cheng C, Waheed A, Shahzadi A, Hussain I, Härtl R, Elsayed G |
| Journal | Global Spine J |
| Pagination | 21925682261448823 |
| Date Published | 05/06/2026 |
| ISSN | 2192-5682 |
| Abstract | Study Design: Diagnostic accuracy study. Objective: Prior evaluations of frontier models as radiology decision-support tools relied on 2-dimensional images or text reports; their ability to interpret volumetric data remains unclear. This study assessed Google Gemini 3 Pro for grading lumbar spinal canal stenosis on video lumbar magnetic resonance imaging (MRI) and evaluated diagnostic accuracy, agreement with neuroradiologist consensus, and the effect of localizer-assisted input. Methods: The Radiological Society of North America (RSNA) 2024 Lumbar Spine Degenerative Classification Dataset, with American Society of Neuroradiology (ASNR) consensus labels, served as a reference benchmark; interobserver agreement among contributing readers was not reported. 100 examinations yielded 500 disc-level observations (371 normal/mild, 74 moderate, 55 severe), demonstrating marked class imbalance. Native imaging series were converted into synchronized video montages. Gemini 3 Pro generated one grade per disc level with and without localizer overlays. Primary outcome was linearly weighted kappa (κw); secondary outcomes included class-wise performance, severe-case error patterns, and overall accuracy. Results: Without localizer, overall accuracy was 75.6% (378/500) with fair agreement (κw = 0.39). Severe stenosis sensitivity was 41.8%; 43.6% of severe cases were downgraded to normal/mild, and 58.2% to non-severe. With localizer overlays, accuracy was 73.2% (366/500) with κw = 0.32, and severe sensitivity decreased to 30.9%; severe-to-normal/mild misses increased to 52.7%. Differences were not significant. Conclusions: Gemini 3 Pro showed fair agreement with the neuroradiologist consensus benchmark, but apparent overall accuracy was inflated by the majority normal/mild class and masked clinically unacceptable under-detection of severe stenosis. Localizer-assisted input did not improve performance. |
| DOI | 10.1177/21925682261448823 |
| PubMed ID | 42092340 |
| PubMed Central ID | PMC13149351 |
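
The abstract's point that overall accuracy was inflated by the majority class can be checked with simple arithmetic on the reported class counts. The sketch below is illustrative only and is not the authors' analysis code: `linear_weighted_kappa` is a hypothetical helper implementing the standard linearly weighted Cohen's kappa, and the `baseline` matrix describes a hypothetical classifier (not part of the study) that labels every disc level normal/mild. Only the class counts (371, 74, 55) and the 378/500 accuracy come from the abstract.

```python
from typing import Sequence


def linear_weighted_kappa(confusion: Sequence[Sequence[int]]) -> float:
    """Linearly weighted Cohen's kappa for a square confusion matrix.

    Rows are reference grades and columns are predicted grades,
    both ordered by severity (normal/mild, moderate, severe).
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
            observed += weight * confusion[i][j] / n
            expected += weight * row_tot[i] * col_tot[j] / (n * n)

    if expected == 0:
        return float("nan")  # kappa is undefined without chance disagreement
    return 1.0 - observed / expected


# Class counts reported in the abstract: normal/mild, moderate, severe.
class_counts = [371, 74, 55]
total = sum(class_counts)

# Hypothetical baseline (an assumption for illustration, not a study arm):
# every disc level is graded normal/mild.
baseline = [[count, 0, 0] for count in class_counts]

baseline_accuracy = class_counts[0] / total
reported_accuracy = 378 / total  # without localizer, from the abstract

print(f"Always-normal/mild baseline accuracy: {baseline_accuracy:.1%}")
print(f"Reported accuracy without localizer:  {reported_accuracy:.1%}")
print(f"Baseline weighted kappa: {linear_weighted_kappa(baseline):.2f}")
```

Under these assumptions the trivial baseline reaches 74.2% accuracy while detecting no severe cases and scoring a weighted kappa of zero, which is why the reported 75.6% accuracy says little on its own and the weighted kappa and severe-class sensitivity are the more informative figures.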