Paper Info Reviews Author Feedback Meta-Review Back to top

List of Papers Browse by Subject Areas Author List

Abstract

As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our multimodal learning approach, offering a scalable, low-cost solution with significant potential for more complicated clinical applications that allow missing modality input. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2038_paper.pdf

SharedIt Link: https://rdcu.be/eG4Dk

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05182-0_28

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/omron-sinicx/medical-modality-dropout

Link to the Dataset(s)

PE dataset: https://stanfordaimi.azurewebsites.net/datasets/3a7548a4-8f65-4ab7-85fa-3d68c9efc1bd NLST dataset: https://www.cancerimagingarchive.net/collection/nlst/ NLST CT features by Google CT Foundation Model: https://research.google/blog/taking-medical-imaging-embeddings-3d/

BibTex

@InProceedings{GuYi_Learning_MICCAI2025,
        author = { Gu, Yi AND Saito, Kuniaki AND Ma, Jiaxin},
        title = { { Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        page = {280 -- 290}
}

Reviews

Review #1

Please describe the contribution of the paper

The multimodal learning framework that combined modality dropout and contrastive multimodal pretraining.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors proposed the simultaneous modality dropout, learnable modality tokens and contrastive multimodal fusion strategy for multimodal learning.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

(1) The description of tabular data is not sufficient. It is suggested to show the attribute details of tabular data. If the attribute is obtained by analyzing CT images, the redundancy problem needs to be addressed. (2) It is recommended to give more details about missing modalities. Experimental results showed the effects of multimodal data. However, the issue of missing modality is not discussed effectively.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

(1) It is advised to optimize the Table 1. All methods should select the inference mode. Additionally, the meaning of the Dagger symbol after some methods should be clarified. (2) In page 4, section 3 Experiment, “dtasets” should be “datasets”.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

In this study, the topic of multimodal learning and missing modalities is valuable. However, the definition of missing modality is not clear. The issue of missing modality is not discussed effectively.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have successfully addressed most of my previous concerns and questions.

Review #2

Please describe the contribution of the paper
- Authors introduced a new method that helps AI learn from different types of medical data, even when some data is missing, using improved dropout and contrastive learning techniques.
- They showed that their method works well on large, public medical datasets for tasks like disease detection and prediction.
- They improved the performance of a recent CT scan model in a cost-effective way.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is clearly written and presents content that is valuable to the medical research community. Therefore, it is also relevant and of interest to the MICCAI community.
- The smart design lets the AI work even with missing data by training it to handle incomplete information and using placeholder tokens when certain patient data isn’t available.
- Authors achieved higher results for different evaluation metrics than state of the art models, which added to the quality of the proposed method and proved that their method is
- Including an ablation study and showing improved results strengthened the overall quality of the paper.
- References are up to date and relevant to the studied architecture.
- The fact that this framework is both suitable for predictive and detection tasks added to the quality of the paper.
- The system saves computing power by only training a small connection module while keeping the main data processors fixed, making it practical for real-world hospital use.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Is there a good explanation of why your batch size was set to 8 and is too small?
- Throughout the paper, authors mention that their method is more cost-effective than others, but there’s no table or evidence provided to show or compare the actual cost difference.
- Authors only test on public datasets without validating their model in real hospital settings or with outside patient groups. This factor limits the real-world confidence.
- Authors need to include more illustration: examples of images for specific medical detection tasks where the algorithm did succeed or failed the specific task. There should be a discussion section that highlights as well objectively the weaknesses and advantages of the proposed foundational model.
- The study fails to explain how doctors can understand the AI’s decisions, which features matter most for predictions, or how medical staff would use the system in daily practice, these factors are crucial for real medical tools.
- The authors should clearly highlight what makes their framework better than other foundational models. With so many new models available, why should a reader choose this one over the others?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Releasing your code upon acceptance would be highly valuable to the research community. Please ensure that it is made publicly available if the paper is accepted for publication.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well-written, relevant to the MICCAI community, and presents a framework that performs strongly across both predictive and detection tasks. It demonstrates higher performance than state-of-the-art models and includes an effective ablation study, which strengthens the overall contribution. References are up to date, and the proposed method shows clear potential for impact in clinical AI applications. However, several issues must be addressed in the rebuttal. These include but are not limited to:
- Clarifying cost-effectiveness claims with proper comparisons or supporting evidence.
- Providing a rationale for design choices, such as the small batch size.
- Adding visual examples to better illustrate successes and failures in medical tasks.
- Highlighting the unique advantages of the framework over other modern foundational models. If the authors address these concerns with clear responses and revisions, the paper is suitable for acceptance.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper introduces a new multimodal learning framework that combines an advanced modality dropout technique with contrastive learning to enhance robustness and improve representation quality. The proposed approach is tested on large-scale public clinical datasets for disease detection and prediction, showing effectiveness and efficiency.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Unlike prior methods that replace missing modalities with zero matrices, this paper introduces a simultaneous modality dropout strategy and replaces zero matrices with learnable modality tokens specific to each modality. Also, by applying contrastive loss between fused and individual modality representations, the proposed method achieves superior performance compared to conventional approaches.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The paper proposes simultaneous modality dropout as a more effective alternative to random sampling. However, the experimental evidence supporting this claim is limited. Even though this work provides a basic performance comparison, no training stability or convergence speed is analyzed. Therefore, it remains unclear how much benefit this method provides over random dropout.
2. While the proposed method works well with a few modalities, the number of combinations grows exponentially as more modalities are added. Although the paper mentions potential scalability, no experiments or analyses support this claim.
3. Since all encoders are frozen during training, the performance gains may heavily depend on the quality of the pre-trained encoders. Although the authors acknowledge this point, there is no investigation into how the method performs when the encoder quality is lower or when end-to-end training is used.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I would like to give a weekly acceptance for the paper because it identifies the missing modality problem and proposes a technique to address it. However, there is insufficient analysis on whether the simultaneous modality dropout offers advantages over the random sampling technique. Additionally, there is a lack of validation of the proposed method’s effectiveness in multi-modality environments, and the approach heavily relies on the performance of pre-trained encoders, which I see as a limitation. Due to the lack of research and experiments in these areas, I have decided on a weak acceptance.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We thank the reviewers for their highly constructive and positive feedback. Our work focuses on improving multimodal learning accuracy while introducing negligible extra training and inference costs and supporting missing modalities–key for clinical practicality. We acknowledge that the original submission lacked sufficient discussion on handling missing modalities, efficiency analysis, and deployment potential. We address these below and follow with point-by-point responses.

Q1: (R1, R2) Handling missing modality in training, inference, and clinical use. We define a fixed set of target modalities (e.g., CT and tabular data) for all patients, but some may lack one or more in practice, resulting in missing modalities. Our model incorporates modality dropout (MD) during training to simulate missing-modality cases, enabling robustness at inference. When a modality is absent (e.g., missing tabular data), the system detects it and inserts a learned token to represent the missingness. Our model showed superior performance under both full-modality and missing-modality settings (see Tables 1 and 2), outperforming existing methods. Clinically, our method requires only available modalities at input, functioning seamlessly when modalities are partially available. The revised paper will add more discussion on the missing modality.

Q2: (R2, R3) Cost effectiveness. Our method adds only a lightweight fusion module (2.52M parameters) atop frozen unimodal encoders, making it a highly efficient plug-and-play solution. For example, training on the PE dataset took under 5 minutes on a Tesla V100 (16GB). Compared to conventional MD, which requires >300 epochs for convergence, our method converged in fewer than 50 epochs (6x faster) while achieving higher performance. This highlights the benefit of the proposed method over naive MD. The revised paper will include training cost comparisons to emphasize these practical advantages.

Q3: (R1) Insufficient tabular data description. We apologize for the insufficient tabular data description. For the PE dataset, we adopted RadFusion’s attributes process and removed duplicates, yielding 1226 fields. We applied SHAP-based ranking and performed ablation that led to selecting the top 8 features for best tabular-only performance. For NLST, we collected 36 non-cancer-related attributes from official public releases and selected the top 16 via the same procedure. The revised paper will provide further details, and we will release code for this process upon acceptance.

Q4: (R2) Highlighting the unique advantages. Our model offers a minimal-cost upgrade path to multimodal learning using any frozen unimodal encoder. It consistently improved both unimodal and multimodal performance across datasets, requiring minimal training effort and no encoder retraining. The revised paper will make the method’s benefits more explicit, especially the balance of low cost and high gain.

Q5: (R3) Robustness to encoder quality and end-to-end potential. We tested two encoder types: PENet (a small task-specific CT model, 26M params) and a large CT foundation model. In both cases, our method improved estimation performance, demonstrating robustness across model scales. The revised paper will highlight this finding. While this submission uses frozen encoders, extending to end-to-end training is a promising future direction.

Q6: (R2) The reason for the 8 batch size. All models were trained on a 16GB Tesla V100 GPU. Given the high memory requirements of 3D CT inputs, 8 was the maximum batch size that fit in memory for baseline models in the PE experiment. We used the same batch size across all methods in the same experiment for fairness and consistency. The revised paper will clarify this choice in the revised paper.

Q7: (R2, R3) Code release. The code will be released upon acceptance.

Other comments: The revised paper will refine all tables, correct typos (R1), and discuss success and failure cases using visual examples (R2).

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

back to top

Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction

Author(s):