MedFM-Robust: Benchmarking Robustness
of Medical Foundation Models

Anonymous

MICCAI 2026 · Under Review

MedFM-Robust Framework

Fig. 1. (a) Overview of our robustness evaluation framework. We generate SSIM-calibrated perturbations across five severity levels, combining base corruptions with modality-specific artifacts. We benchmark three Med-VLMs and two SAM-based segmentation models under a unified protocol, and investigate multiple fine-tuning strategies across VQA, captioning, visual grounding, and segmentation tasks.
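The SSIM calibration mentioned in the caption can be sketched as a search over perturbation strength: for each severity level, find the parameter whose perturbed image lands at a target SSIM. The sketch below is illustrative, not the paper's exact procedure — it uses Gaussian-noise sigma as the strength parameter and a simplified single-window SSIM (no 11×11 sliding window); all function and parameter names are ours.

```python
import numpy as np

def global_ssim(x, y, L=1.0):
    """Simplified SSIM computed over the whole image (no sliding window)."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

def calibrate_sigma(img, target_ssim, lo=0.0, hi=1.0, iters=40, seed=0):
    """Binary-search the Gaussian-noise sigma whose perturbed image hits target_ssim.

    SSIM decreases monotonically as sigma grows, so bisection converges."""
    rng = np.random.default_rng(seed)        # fixed noise realisation keeps the search stable
    noise = rng.standard_normal(img.shape)
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        s = global_ssim(img, np.clip(img + sigma * noise, 0.0, 1.0))
        if s > target_ssim:                  # still too similar -> push sigma up
            lo = sigma
        else:
            hi = sigma
    return 0.5 * (lo + hi)
```

Running this once per target SSIM band (e.g. five descending targets) yields one calibrated strength per severity level, per image.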


Key Results

Fig. 2. (a) Traditional clean-image evaluation pipeline. (b) Our robustness benchmark applies modality-adaptive perturbations before the encoder and evaluates both Med-VLM tasks and segmentation under matched settings. (c) Metrics comparison: IoU drop closely tracks Dice drop, and representative fatal perturbations cause increasing performance degradation with higher severity levels.
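The observation in Fig. 2(c) that IoU drop closely tracks Dice drop is expected: for binary masks the two overlap metrics are deterministically linked by Dice = 2·IoU / (1 + IoU), so their degradations move together. A minimal check (mask construction is illustrative):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union for boolean masks; empty-vs-empty counts as a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    """Dice coefficient for boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0
```

Because Dice = 2·IoU / (1 + IoU) is monotone, a ranking of perturbations by IoU drop and one by Dice drop agree in order, which is why either metric supports the same conclusions.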


40 Perturbation Types · 8 Imaging Modalities · 5 VLMs Evaluated · 2 Seg. Models · 5 Fine-tuning Strategies · 5 Severity Levels

Abstract

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash, and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies.

Our findings reveal: (1) Fine-tuning strategy dominates robustness: LoRA exhibits nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers a favorable efficiency–robustness trade-off. (2) Medical-specific perturbations disproportionately degrade segmentation, with 9 of the 15 most damaging corruptions being domain-specific. (3) LoRA-tuned visual grounding drops by over 40 points, whereas zero-shot captioning remains stable (<7% drop). These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI.

Key Findings

01

Fine-tuning strategy dominates robustness. LoRA exhibits ≈2× the degradation of full fine-tuning; SAM-Med2D's Adapter offers the best efficiency–robustness trade-off among PEFT methods.

02

Medical-specific corruptions are disproportionately harmful. 9 of the top 15 perturbations are domain-specific; standard benchmarks therefore underestimate real deployment risks.

03

Task formulation determines VLM robustness. LoRA-tuned Grounding drops >40 points, while zero-shot Captioning stays stable (<7% drop).

04

General VLMs excel at VQA but fail on Grounding. Gemini-2.5-flash: 54% relative drop. Medical VLMs are more stable; MedGemma shows the smallest drops overall.

Results

Comprehensive evaluation across segmentation and VLMs under 40 perturbation types at 5 severity levels.

Full results figure
Fig. 3. Left (Segmentation): (a) Performance–robustness trade-off. (b) Strategy ranking. (c) Model comparison. (d) Dataset sensitivity. (e) Top 15 perturbations. (f) Severity level impact. Right (VLMs): (g–i) Clean vs. perturbed on VQA, Grounding, Captioning. (j–l) Per-perturbation impact.

Segmentation — Strategy Ranking

Rank | Strategy         | IoU Drop
-----|------------------|---------
1    | Full fine-tuning | 0.025
2    | Dec-Only         | 0.029
2    | Enc-Partial      | 0.029
2    | Dec-Prompt       | 0.029
5    | Adapter          | 0.033
6    | LoRA             | 0.048 (≈2×)
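The ranking above scores each strategy by its clean IoU minus its mean IoU over all (perturbation, severity) runs. A hedged sketch of that aggregation — the dictionary layout and strategy names below are illustrative, and the numbers are made up, not the paper's:

```python
import statistics

def mean_iou_drop(clean_iou, perturbed_ious):
    """Robustness score: clean IoU minus the mean IoU across all perturbed runs."""
    return clean_iou - statistics.mean(perturbed_ious)

def rank_strategies(results):
    """results: {strategy: (clean_iou, [perturbed IoUs...])} -> list sorted by drop, best first."""
    drops = {s: mean_iou_drop(clean, perturbed) for s, (clean, perturbed) in results.items()}
    return sorted(drops.items(), key=lambda kv: kv[1])
```

Averaging before subtracting (rather than averaging per-run drops) is equivalent here because the clean score is constant per strategy, but reporting per-severity drops separately, as in Fig. 3(f), preserves the severity trend.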

VLM — Task Robustness

Task                   | Setting   | Drop
-----------------------|-----------|------------------
Captioning             | Zero-shot | <0.02 BLEU
VQA (medical VLMs)     | Zero-shot | <8 pts
VQA (Gemini-2.5-flash) | Zero-shot | 36.1 pts (54%)
Grounding              | LoRA FT   | >40 pts

Benchmark Coverage

Perturbation Types (40 total)

  • Base (12): Gaussian/salt-pepper/speckle noise, Gaussian/motion blur, brightness, contrast, JPEG, pixelation, rotation, scaling, translation
  • Med-Specific (28): CT metal artifacts, MRI ghosting & bias-field, US acoustic shadowing, pathology stain variations, endoscopy bubbles & specular reflections, OCT shadow/blink/defocus, X-ray scatter & exposure, angiography haze
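As one example of the medical-specific corruptions listed above, MRI motion ghosting can be approximated by overlaying attenuated copies of the image shifted along the phase-encode axis. This is a rough illustrative model, not the benchmark's exact implementation; the function and parameter names are ours.

```python
import numpy as np

def mri_ghosting(img, num_ghosts=2, shift=8, alpha=0.2, axis=0):
    """Overlay attenuated, shifted copies along one axis to mimic motion ghosting.

    img: float array in [0, 1]; alpha: per-ghost attenuation; axis: phase-encode direction."""
    out = img.astype(float).copy()
    for k in range(1, num_ghosts + 1):
        out += (alpha ** k) * np.roll(img, k * shift, axis=axis)
    # renormalise back into [0, 1] so downstream models see a valid intensity range
    return np.clip(out / out.max(), 0.0, 1.0)
```

Severity could then be swept by increasing `alpha` or `num_ghosts`, with the SSIM calibration in Fig. 1 mapping those parameters onto the five severity levels.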

Datasets & Models

  • Segmentation: ISIC 2016, Brain Tumor MRI, Glaucoma Disc/Cup, Kvasir-SEG
  • VLM: OmniMedVQA, ROCOv2, MeCoVQA
  • Seg Models: MedSAM, SAM-Med2D
  • VLMs: LLaVA-Med, MedGemma, MedGemma-1.5, GPT-4o-mini, Gemini-2.5-flash
  • Modalities: Dermoscopy, MRI, Fundus / OCT, Endoscopy, CT, X-ray, Ultrasound, Pathology, Angiography

BibTeX

@inproceedings{anonymous2026medfmrobust,
  title     = {MedFM-Robust: Benchmarking Robustness of
               Medical Foundation Models},
  author    = {Anonymous},
  booktitle = {Medical Image Computing and Computer
               Assisted Intervention (MICCAI)},
  year      = {2026},
  note      = {Under Review}
}