Abstract
Ophthalmologists often use multimodal data to improve diagnostic accuracy, but complete multimodal datasets are rare in real-world applications due to limited medical equipment and concerns about data privacy. Traditional deep learning methods typically address these issues by learning representations in a latent space. However, we identify two main challenges with these approaches: (i) task-irrelevant redundant information in complex modalities (such as numerous slices) leads to substantial redundancy in latent space representations, and (ii) overlapping multimodal representations make it difficult to extract features that are unique to each modality. To address these challenges, we propose the Essence-Point and Disentangle Representation Learning (EDRL) strategy, which incorporates a self-distillation mechanism into an end-to-end framework to improve feature selection and disentanglement for robust multimodal learning. Specifically, the Essence-Point Representation Learning module selects discriminative features that improve disease grading performance, while the Disentangled Representation Learning module separates multimodal data into modality-common and modality-unique representations. This reduces feature entanglement and enhances both robustness and interpretability in ophthalmic disease diagnosis. Experimental results on ophthalmology multimodal datasets show that the EDRL strategy outperforms state-of-the-art methods significantly. Our code is released at https://github.com/xinkunwang111/Robust-Multimodal-Learning-for-Ophthalmic-Disease-Grading-via-Disentangled-Representation.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0678_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
Link to the Dataset(s)
https://yutianyt.com/projects/fairvision30k/
BibTex
@InProceedings{WanXin_Robust_MICCAI2025,
author = { Wang, Xinkun and Wang, Yifang and Liang, Senwei and Tang, Feilong and Liu, Chengzhi and Hu, Ming and Hu, Chao and He, Junjun and Ge, Zongyuan and Razzak, Imran},
title = { { Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {449--459}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a novel approach, Essence-Point and Disentangle Representation Learning (EDRL), for diagnosing ophthalmic diseases using multimodal data. The EDRL strategy addresses the redundancy in complex modalities and the challenge of extracting unique features from overlapping multimodal representations. The strategy consists of two modules: Essence-Point Representation Learning (EPRL) for feature selection and Disentangle Representation Learning (DiLR) for separating multimodal data. The method was tested on the Harvard-30k dataset, and the results indicate that it outperforms existing methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses a significant problem in the field of ophthalmic disease diagnosis and proposes a novel solution.
- The EDRL strategy’s design is innovative and addresses two critical limitations of current deep learning approaches.
- The experimental results show that the proposed method outperforms existing methods significantly.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors present the Essence-Point and Disentangle Representation Learning (EDRL) strategy to tackle the challenges of task-irrelevant redundant information and overlapping multimodal representations in ophthalmic disease grading. While the proposed approach appears promising, several critical aspects are underdeveloped, limiting the clarity and reproducibility of the work.
Firstly, the theoretical foundation of the method requires further elaboration. Specifically, the concept of essence-points is introduced without a clear explanation of how they are modeled or how they contribute to selecting discriminative information within each modality. Similarly, the role of the Disentangle Representation Learning (DiLR) module is not well articulated—there is little detail on how it separates feature embeddings into modality-common and modality-unique components.
Experimental validation is carried out on three ophthalmic multimodal datasets; however, crucial details about these datasets are missing. The paper lacks information regarding dataset sizes, class distributions, and characteristics of the missing data. Furthermore, the training procedure is insufficiently described. There is no mention of the optimization algorithm, hyperparameter choices, or other key training configurations, making it difficult to reproduce the experiments.
Several parts of the paper suffer from a lack of clarity. For instance, the integration of the self-distillation mechanism into the EDRL framework is not adequately explained. Additionally, the figures are poorly annotated and do not effectively illustrate the overall architecture or the flow of the method. Including clear diagrams with labels showing where different losses are applied would significantly enhance understanding.
While EDRL presents a potentially novel approach to multimodal learning for ophthalmology applications, the related work section is thin and does not sufficiently contextualize the method within existing literature. This omission makes it challenging to evaluate the novelty and contribution of the work.
The authors claim that EDRL significantly outperforms state-of-the-art methods, but the evidence presented does not convincingly support this assertion. The experimental results are presented in a somewhat disorganized manner, lacking the clarity needed to assess the performance advantages of the proposed method.
Minor Issues:
- Typos:
- Section 2.3: “correlation” should be corrected (likely a grammatical or contextual error).
- Section 3.2: “withour” should be “without”.
- Figure design: Including visual annotations to indicate the location and interaction of different loss functions would greatly aid comprehension.
- Table placement: The main results table should not appear within the Methods section; it disrupts the logical flow of the paper.
- Statistical analysis: While AUC values are reported, no statistical tests are provided to support the significance of the findings. This weakens the argument made in Table 1.
- Comparative analysis: The method is not sufficiently compared with other state-of-the-art techniques.
- Discussion of limitations: The paper does not address potential drawbacks or limitations of the proposed approach, which is important for a balanced evaluation.
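Regarding the statistical-analysis point above: one common way to test whether an AUC difference between two models is significant is a paired bootstrap over patients (an alternative to DeLong's test). A minimal NumPy sketch, with illustrative names (`y_true`, `scores_a`, `scores_b` are not from the paper's code):

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC (equivalent to the Mann-Whitney U statistic; assumes no ties)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def paired_bootstrap_pvalue(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for the AUC difference between models A and B."""
    rng = np.random.default_rng(seed)
    observed = auc(y_true, scores_a) - auc(y_true, scores_b)
    diffs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)      # resample subjects with replacement
        yt = y_true[idx]
        if yt.min() == yt.max():         # need both classes in the resample
            continue
        diffs.append(auc(yt, scores_a[idx]) - auc(yt, scores_b[idx]))
    diffs = np.asarray(diffs)
    # p-value: how often the bootstrap distribution of the difference crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

Reporting such a p-value (or a bootstrap confidence interval on the AUC gap) alongside Table 1 would substantiate the significance claim.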
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Provide a more detailed explanation of the proposed EDRL strategy, including how the essence-points are modeled and how they guide the selection of discriminative information.
- Provide more details about the datasets used in the experiments, including the size of the datasets, the distribution of classes, and the nature of the missing data.
- Provide a clear explanation of how the model was trained, including details about the training procedure, the optimization algorithm used, and the hyperparameters selected.
- Improve the clarity of the figures and provide a clear visual representation of the proposed method.
- Provide a thorough review of related work to demonstrate the novelty and contribution of the proposed method.
- Present the experimental results in a clear and understandable manner, and provide sufficient evidence to support the claim that the proposed method outperforms state-of-the-art methods.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Despite minor issues related to comparison with other methods, limitations discussion, and reproducibility, the paper presents a novel and promising approach to diagnosing ophthalmic diseases using multimodal data. The experimental results demonstrate the effectiveness of the proposed method. Therefore, I recommend accepting this paper.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper introduces Essence-Point and Disentangle Representation Learning (EDRL), a novel framework addressing two critical challenges in multimodal learning for ophthalmic disease grading: task-irrelevant redundant information and overlapping multimodal representations. The approach consists of two key components: an Essence-Point Representation Learning (EPRL) module that selects discriminative features by modeling prototype “essence-points” for each modality and class, and a Disentangled Representation Learning (DiLR) module that separates multimodal data into modality-common and modality-unique representations. The framework also incorporates a self-distillation mechanism to enhance robustness when handling missing-modality scenarios. The paper presents extensive experiments on three ophthalmology multimodal datasets that demonstrate superior performance over state-of-the-art methods in both complete and incomplete modality conditions.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper tackles a significant real-world challenge in multimodal medical imaging, specifically addressing the common problem of missing modalities in clinical settings.
- The proposed EDRL framework offers a novel approach to feature selection and disentanglement that effectively reduces redundancy and enhances discriminative capabilities.
- Comprehensive experimental validation across three ophthalmic disease datasets (AMD, DR, and Glaucoma) demonstrates consistent performance improvements.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper lacks a discussion on computational complexity. Given the architectural design involving multiple attention mechanisms, proxy-based sampling, and a dual-pipeline for self-distillation, both training and inference are expected to be computationally intensive. However, the paper does not provide any analysis of runtime efficiency, training time, or hardware specifications (e.g., GPU models), making it difficult to assess the method’s scalability and practicality in real-world or resource-constrained clinical settings.
- The paper does not provide a thorough theoretical justification for modeling the essence-points using Gaussian distributions. While this formulation is common in probabilistic representation learning, it remains unclear why it is particularly suitable or optimal for this specific task and application
- Although the paper claims to include a unified self-distillation mechanism, the implementation of the feature-level component (MMD) is missing from the code, so it is unknown how loss_MDD is calculated and the reproducibility of the self-distillation mechanism cannot be verified (see fusion_train.py, line 198). Moreover, although the authors state that Jensen-Shannon divergence is used for logits distillation, line 207 is commented out and never used afterwards. How can the use of this mechanism be validated?
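For reference, the two distillation losses discussed above have standard textbook definitions that the released code could be checked against. A minimal NumPy sketch (function names are illustrative, not taken from the authors' repository):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(logits_p, logits_q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two batches of logits,
    averaged over the batch -- the usual form of logits-level distillation."""
    p, q = softmax(logits_p), softmax(logits_q)
    m = 0.5 * (p + q)
    def kl(a, b):
        return (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=-1)
    return 0.5 * (kl(p, m) + kl(q, m)).mean()

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between two feature batches --
    a common choice for feature-level distillation/alignment."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
```

Confirming in the repository which of these (if either) is actually computed, and with what weights, would resolve the reproducibility concern.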
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Figure 1 presents qualitative evidence of improved feature separation through visualization, which is intended to support the method’s ability to capture discriminative features. However, the claim would be more convincing if supported by quantitative separability metrics, such as Fisher’s discriminant ratio or inter-class/intra-class variance ratios, beyond just cosine distance.
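As a concrete example of such a metric, a trace-based Fisher discriminant ratio can be computed directly from the learned embeddings. A minimal NumPy sketch (names illustrative):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Trace-based Fisher discriminant ratio tr(S_b) / tr(S_w):
    between-class scatter over within-class scatter.
    Higher values indicate better class separability in feature space."""
    overall_mean = features.mean(axis=0)
    s_b, s_w = 0.0, 0.0
    for c in np.unique(labels):
        xc = features[labels == c]
        mc = xc.mean(axis=0)
        s_b += len(xc) * ((mc - overall_mean) ** 2).sum()  # between-class scatter
        s_w += ((xc - mc) ** 2).sum()                      # within-class scatter
    return s_b / s_w
```

Reporting this ratio before and after applying EPRL/DiLR would quantify the separation that Figure 1 shows only qualitatively.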
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a well-motivated and methodologically novel framework, EDRL, to tackle key challenges in multimodal ophthalmic disease grading. The integration of essence-point-based representation learning, disentangled feature modeling, and self-distillation is innovative and demonstrates consistent performance gains across multiple datasets. However, there are several areas that can be improved. The paper does not adequately address computational complexity, leaving out runtime performance and hardware details, which are crucial for evaluating the method’s scalability. Furthermore, theoretical justification for modeling essence points as Gaussian distributions is lacking, and the reproducibility of the self-distillation component is questionable, as the logits-level distillation is commented out and feature-level implementation (MMD) lacks a clear explanation in the code. Additionally, the paper contains qualitative visualization for feature separability but it might need more complete quantitative analysis. Despite these issues, the contribution remains significant and methodologically sound. I give a weak accept and would consider raising the score if the authors address these concerns during the rebuttal phase.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose a multimodal framework for ophthalmic disease grading that leverages representation learning to disentangle modality-specific and modality-common features. The method includes three key components: 1) Essence-Point Representation Learning (EPRL), which learns the most discriminative class-wise features, referred to as “class essence points”; 2) Disentangled Representation Learning (DiLR), which aims to separate shared and modality-specific representations; 3) a unified self-distillation mechanism, where logits distillation helps generate more accurate representations for incomplete modalities. The approach is evaluated on the Harvard-30k dataset across three diseases with different grading scales.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed framework addresses a relevant and challenging problem in multimodal learning, particularly within the context of ophthalmic disease grading.
- The idea of learning disentangled features between modalities is interesting and can help improve both interpretability and generalization.
- The experimental results demonstrate promising improvements over baselines, supported by comprehensive ablation studies to assess the contributions of each module.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Given the model’s multimodal design for feature learning, it would be beneficial to evaluate it on a broader set of tasks, such as classification and segmentation, to better demonstrate its general applicability. While not mandatory for this submission, exploring this direction in future work could strengthen the impact of the method. Additionally, the learned essence points might be useful as class prototypes in downstream applications—this could be an interesting avenue to explore.
- The mathematical notation throughout the paper lacks consistency, which can hinder understanding. For example, K is used ambiguously (e.g., number of patients, number of samples); D is used for both the number of classes and the feature representation space; c is used as a class index and later as a similarity score in Equation 2. Some symbols (e.g., B in Equation 5) are not defined: if B refers to batch size, this should be clarified, especially since N is used earlier in Equation 1 for the same purpose.
- It is unclear which modality the noise is applied to during training. Does it impact performance differently depending on the modality?
- In the baseline model, why are different backbones used for each modality? What is the backbone used in the proposed model—is it ViT? These should be clearly stated in the experiments section to improve reproducibility.
- How is the number of essence points in each grading class determined? If it is m*c, does that mean the second modality has more essence points than the first, and the third modality (if there is one) more than the first and the second? Please elaborate on this in Section 2.2.
- Equation 1 suggests comparing all features with all essence points. Please clarify whether this is the case, and if so, provide a brief discussion of the computational complexity involved.
- In Fig. 2, why are the OCT B-scans rotated?
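On the Equation 1 point above: if every feature is compared against every essence point, the computation reduces to a single matrix multiply with O(N·M·D) cost for N features, M essence points, and feature dimension D, so the overhead is easy to state explicitly. A minimal NumPy sketch (names illustrative, not from the paper's code):

```python
import numpy as np

def cosine_similarity_matrix(features, essence_points, eps=1e-12):
    """All-pairs cosine similarity between an (N, D) feature batch and
    (M, D) essence points: one matrix multiply, O(N*M*D) time, (N, M) output."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    e = essence_points / (np.linalg.norm(essence_points, axis=1, keepdims=True) + eps)
    return f @ e.T
```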
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The paper presents a novel and promising approach to multimodal disease grading through disentangled representation learning. While the proposed method is supported by strong experiments and interesting ideas, there are several areas where the manuscript would benefit from clearer explanations, particularly in terms of notation, architectural choices, and reproducibility. With these improvements, the work would make a meaningful contribution to the field.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Some details regarding the ablation studies and methodology are missing.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank all reviewers, the area chair, and the program committee for their thoughtful comments and constructive feedback. We are grateful for the opportunity to address the concerns and clarify the design, implementation, and motivation behind our proposed EDRL framework.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The reviewers leaned toward a cautious acceptance, with all three giving a “weak accept” recommendation. Several sound critiques from the reviewers should be addressed in the revised paper: why the “essence-points” are modeled the way they are, the details of the datasets, and the computational cost, among others. Trusting that the authors will do so, I recommend a provisional accept.