Abstract
Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains.
However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology.
While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities — a frequent challenge in clinical settings.
To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging.
Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data.
To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships.
Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions.
We demonstrate our approach on glioma subtype classification from multi-sequence brain MRI, achieving a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%.
Beyond this specific application, our framework provides a scalable and robust blueprint for various multi-modal medical imaging problems, effectively leveraging vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.
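The two architectural ideas in the abstract (multi-modal patch embeddings and full-modality masking) can be illustrated with a minimal NumPy sketch. All names, shapes, and the masking probability below are illustrative assumptions, not the authors' actual implementation: each modality's patches share one positional grid, receive a learnable modality embedding, and entire modalities are occasionally dropped during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's configuration).
num_modalities = 4          # e.g. T1, T1c, T2, FLAIR
patches_per_modality = 16   # patches per modality slice
dim = 32                    # embedding dimension

def embed_multimodal(patch_tokens, pos_emb, mod_emb):
    """Combine per-patch features with positional embeddings shared
    across modalities (same spatial grid) and a per-modality
    embedding that tags which MRI sequence a patch came from."""
    # patch_tokens: (num_modalities, patches_per_modality, dim)
    return patch_tokens + pos_emb[None, :, :] + mod_emb[:, None, :]

def full_modality_mask(tokens, drop_prob=0.25, rng=rng):
    """Zero out *all* patches of randomly chosen modalities,
    simulating a missing MRI sequence so the model must rely on
    cross-modality relationships among the remaining ones."""
    keep = rng.random(tokens.shape[0]) >= drop_prob
    if not keep.any():  # always keep at least one modality
        keep[rng.integers(tokens.shape[0])] = True
    return tokens * keep[:, None, None], keep

patch_tokens = rng.normal(size=(num_modalities, patches_per_modality, dim))
pos_emb = rng.normal(size=(patches_per_modality, dim))
mod_emb = rng.normal(size=(num_modalities, dim))

tokens = embed_multimodal(patch_tokens, pos_emb, mod_emb)
masked, keep = full_modality_mask(tokens)
print(tokens.shape, masked.shape, keep)
```

Masking a whole modality (rather than random patches across modalities) is what forces the network to reconstruct the missing sequence's information from the others, which is the stated motivation for robustness to missing modalities.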
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1896_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/daniel-scholz/mm-dinov2
Link to the Dataset(s)
N/A
BibTex
@InProceedings{SchDan_MMDINOv2_MICCAI2025,
author = { Scholz, Daniel and Erdur, Ayhan Can and Ehm, Viktoria and Meyer-Baese, Anke and Peeken, Jan C. and Rueckert, Daniel and Wiestler, Benedikt},
title = { { MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {320--330}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging, using multi-modal patch embeddings and a design intended to increase robustness to missing modalities.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper clearly describes its contributions, and the structure of the paper is well organised.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major:
- I found that “Crops are always centred around one voxel containing tumour tissue” is quite unrealistic in any application-like scenario.
- It’s not clear how the embedding decomposes the third dimension.
- The ResNet34 [13] baseline is implemented in 2D, but the problem being solved is 3D; this is not really a fair comparison.
- It would be great, as a baseline, to compare the performance of ad-hoc 3D CNN trained for the task.
- Page 5 – Please describe the datasets briefly instead of using only references ([18, 3, 2, 26, 4, 6, 27, 12, 3]). Give the names of the datasets and their composition; a table would be fine. We can’t ask readers to open nine different papers to understand the data you used.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is trying to solve a 3D problem with a 2D method, but this is never discussed except in the conclusion (as future work). Furthermore, no comparisons are made with existing 3D models.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The comparisons made with 2D only methods may provide a fair evaluation if the aim is to compare 2D methods, but not if the goal is to create a good classification system. Regarding centring the slices around the tumour, while automated detection could enable such cropping, your method does not include this step and doesn’t address the implications. Adding a detection stage would fundamentally alter the architecture and proposal, which is not reflected in your current work.
The paper lacks sufficient novelty, as the core approach relies on assumptions (e.g., precise tumor localization) that are not addressed within the proposed architecture. Additionally, the results section is weak, with limited empirical evidence to convincingly support the method’s effectiveness in realistic scenarios.
Review #2
- Please describe the contribution of the paper
The paper introduces MM-DINOv2, an adaptation of the DINOv2 vision foundation model for multi-modal medical imaging. It proposes two key contributions: multi-modal patch embedding, enabling the model to jointly process different medical image modalities (e.g., the MRI sequences T1, T2, T1c, and FLAIR), and full-modality masking, which improves robustness by training the model to handle missing modalities—a common scenario in clinical practice. The authors also utilise a semi-supervised learning framework with a teacher-student DINO approach to effectively leverage large-scale unlabelled BraTS data for improved representation learning.
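The teacher-student setup mentioned above follows the standard DINO recipe, in which the teacher's weights track an exponential moving average (EMA) of the student's. The sketch below is a generic illustration of that update rule, not the paper's actual code; the momentum value is an assumption.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """DINO-style teacher update: the teacher's weights are an
    exponential moving average of the student's, so the teacher
    provides slowly evolving targets for the student to match."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

# Toy single-parameter "networks" to show the update converging.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
for _ in range(5):  # a few training steps with a frozen student
    teacher = ema_update(teacher, student)
print(teacher["w"])  # drifts slowly toward the student's weights
```

With a frozen student, the teacher after n steps equals 1 - momentum**n of the way to the student, which is why a momentum close to 1 yields a slowly changing, stable target network.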
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Coherent Component Integration: The motivation behind introducing multi-modal patch embedding and full-modality masking is well-justified, and the two components complement each other effectively within the proposed end-to-end pipeline.
- Complementary Design: The feasibility of full-modality masking is inherently enabled by the use of multi-modal patch embeddings, indicating thoughtful architectural synergy.
- Clear and Concise Writing: The paper is written in a simple and clean manner, making it easy to follow and understand the proposed ideas.
- Effective Use of a Large-Scale Dataset: The utilization of the BraTS dataset, a large and diverse benchmark in medical imaging, is appropriate and demonstrates the model’s improved performance.
- Performance Gains in Classification Tasks: Notable improvements in glioma classification are reported under both supervised and semi-supervised settings.
- Well-Conducted Ablation Studies: Ablation studies are included to highlight the contribution of each component, strengthening the overall empirical validation of the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) Overstated Title Scope: The title implies a broad multi-modal approach involving diverse data types, but the work is restricted solely to medical image-based glioma subtype classification.
(2) Minimal Adaptation of DINOv2: The adaptation from self-supervised to semi-supervised learning via a teacher-student setup is a standard extension, offering limited technical novelty.
(3) Intuitive Architectural Modifications: The modifications to DINOv2 are minor and expected, lacking significant innovation or architectural advancement.
(4) Unclear Motivation for Learnable Modality Embeddings: While modality embeddings are introduced as learnable, the paper does not examine what is actually captured, missing an opportunity for deeper modality-specific insights.
(5) Section Title: The heading “Material and Methods” is unusual for a technical paper and feels inconsistent with standard academic structure.
(6) Insufficient Explanation of Loss Function: The loss formulation is not clearly explained, which could hinder reproducibility and understanding, especially for readers less familiar with the framework.
(7) Weak Justification of Experimental Protocol: Section 3.5 lacks clarity and rationale for chosen training/testing splits and evaluation settings.
(8) Lack of Detailed Analysis of Table 1: Table 1 lacks sufficient analysis to explain key inconsistencies—for instance, why F1 scores are higher for Ext than Int, while MCC and AUROC show the opposite trend. Additionally, the drastic performance fluctuations for the Oligo class across Int and Ext are unexplained, as are the better results under semi-supervised settings with RGB-DINOv2.
(9) Writing Quality Deterioration After Section 3.4: The paper’s narrative and clarity decline significantly in later sections, especially the results and discussion, which reduces overall readability and impact.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper tackles a relevant problem by extending a foundation model to the medical imaging domain through semi-supervised learning. Its empirical results and ablation studies highlight the model’s effectiveness for glioma subtype classification.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces MM-DINOv2, a framework that adapts the DINOv2 vision foundation model for multi-modal medical imaging. It preserves spatial relationships across modalities using modality-specific and positional embeddings and introduces full modality masking to ensure robustness to missing MRI sequences. The model also supports semi-supervised learning, effectively leveraging unlabeled data. Applied to glioma subtype classification, MM-DINOv2 achieves a 0.6 MCC on an external test set, outperforming prior methods by 11.1%, demonstrating its potential for real-world clinical applications.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Multi-modal adaptation: Introduces multi-modal patch embeddings to adapt DINOv2 for medical imaging, enabling modality-specific feature extraction and spatial awareness.
- Robustness to missing modalities: Uses full modality masking during training to handle missing MRI sequences, a common issue in clinical datasets.
- Semi-supervised learning: Effectively leverages unlabeled data alongside labeled samples, improving performance in low-annotation settings.
- Improved classification performance: Achieves 0.6 MCC on an external glioma subtype test set, outperforming existing supervised methods by 11.1%.
- Extensive ablation study: Key design choices are validated through ablation studies, confirming their individual contributions to model performance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There are no major weaknesses in the paper. The methodology is sound, the contributions are clear, and the results are well-supported. However, there are a few minor issues related to clarity, phrasing, and analysis, which are detailed in the comments section below. Addressing these would further strengthen the overall presentation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Please provide the full forms of the tumor type abbreviations (e.g., Astro, GBM, Oligo) when they are first mentioned for clarity to readers who may not be familiar with these terms.
- The paper states that it is the “first to extend DINOv2 to handle multiple imaging modalities,” despite acknowledging that [14] also applies DINOv2 to multi-modal medical imaging, even though using a less sophisticated approach. While the proposed method clearly improves upon [14], the claim of being the first seems overstated. It would be more accurate to state this as a significant advancement or improvement over prior work rather than a first.
- The F1 score for oligodendroglioma drops in the semi-supervised setting, which the paper attributes to class imbalance in the unlabeled data. This explanation is reasonable, but in a future work, it would be valuable if the authors briefly discussed how this issue might be addressed.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a strong and well-motivated framework for adapting DINOv2 to multi-modal medical imaging. The method is technically sound, addresses important real-world challenges like missing modalities and limited labels, and shows clear performance improvements over prior work. While there are a few minor issues related to clarity and framing, they do not detract from the overall quality and relevance of the contribution.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their thoughtful feedback. We appreciate their recognition of the paper’s clear organization (R1) and the novelty and efficiency of our approach (R1), which addresses a common issue in clinical datasets (R3). We are pleased that our experiments and ablation studies were seen as well-conducted (R2) and extensive (R3), showing notable improvements in classification performance (R2, R3). We address the reviewers’ comments below and will incorporate their feedback into the paper.

Addressing 3D Challenges Through 2D Methodology (R1): Although the original MRI scans are volumetric, we use 2D slices as input for improved computational efficiency and because the DINOv2 pre-trained ViT is incompatible with 3D volumes. Adapting DINOv2 to full 3D data remains a non-trivial challenge, as highlighted by recent literature (see Future Work section). While our current modality and positional embeddings do not explicitly encode 3D spatial relationships, we mitigate this by sampling slices from multiple orientations and tumor regions, thus preserving essential 3D context within the 2D framework. Importantly, all compared models, including baselines, operate on identical 2D slice data, ensuring a fair evaluation. Scholz et al. (MIDL 2024) trained and evaluated a 3D ResNet on the same datasets, achieving an MCC of 0.55, closely matching our 2D approach (0.54 MCC).

Adaptations to DINOv2 (R2): While the semi-supervised extension we use is an established concept, our contribution lies in demonstrating clinically relevant performance gains for glioma subtyping (Contribution 3) through its integration with DINOv2. Our intuitive architectural modification is crucial for adapting foundation models to multi-modal medical imaging, as evidenced by our ablation studies (Table 3).

Inconsistency in Table 1 (R2): We regret inadvertently mixing up the class-wise F1 column headers. The MCC/AUROC table headers are correct, while the order of the F1-score columns should be: Int Astro, Int GBM, Int Oligo, Ext Astro, Ext GBM, Ext Oligo. In short, the first three F1-score columns correspond to the internal test set and the last three to the external test set. After correcting this, better internal than external performance is also evident in the class-wise F1 scores. We will correct the table headings. Our method’s performance remains unchanged, as only the column headers were mixed up, not the rows.

Cropping Around the Tumor (R1): While cropping around the tumor tissue may seem unrealistic at first, automated detection methods make such cropping feasible as part of a standard two-stage workflow. Importantly, our approach does not rely on cropping and remains applicable even without this step.

Title Scope (R2): While we believe our work serves as a blueprint for various multi-modal medical imaging problems, we will clarify and highlight the exact scope of our paper in both the abstract and conclusion.

Learnable Embeddings (R2): We made the modality embeddings learnable, as prior work has shown that learned positional embeddings can be more powerful. Additional analysis of the modality embeddings is an interesting avenue for future work.

Context of Loss Functions (R2): We will add further context on the loss functions to the paper. To ensure reproducibility, we will release our source code, including exact implementations, upon acceptance.

Dataset Choice and Splits (R2): We chose the splits to achieve a roughly equal number of internal and external subjects. The crop size and the minimum threshold of tumor voxels were chosen heuristically to ensure a sufficiently large portion of the tumor in each crop; no tumor tissue was excluded.

Additional Clarifications: We will reformulate our contribution as improving the utilization of DINOv2 for multi-modal imaging and add the drop in oligodendroglioma performance to future work (R3). We will add the dataset names, class distributions (R1), and table abbreviations (R3) to the Dataset Section.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors address the concerns in the rebuttal.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Although some minor issues are not well addressed, the overall technical contributions are sufficient and meet the bar of MICCAI.