Abstract

Multimodal ophthalmic imaging-based diagnosis integrates colour fundus imaging with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources means that real-world clinical scenarios often involve incomplete multimodal data, which significantly compromises diagnostic accuracy. Commonly used deep learning pipelines, such as modality imputation and distillation methods, face notable limitations: 1) imputation methods struggle to accurately reconstruct key lesion features, since OCT lesions are localized while fundus images vary in style; 2) distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework that robustly handles missing modalities in ophthalmic diagnostics. Motivated by the distinctive feature characteristics of OCT and fundus images, we emphasise the alignment of semantic features within the same category and explicitly learn a soft matching between modalities, allowing a missing modality to draw on information from the available one and achieving robust cross-modal feature alignment under missing-modality conditions. Specifically, we leverage an Optimal Transport (OT) mechanism for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of the OCT and fundus modalities. Extensive evaluations on three large-scale ophthalmic multimodal datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving state-of-the-art results under both complete-modality and inter-modality-incomplete conditions. The implementation code is available at https://github.com/Qinkaiyu/RIMA.
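
To make the class-constrained OT alignment described above concrete, the following minimal NumPy sketch computes an entropy-regularized transport plan over a cosine-distance cost that forbids cross-class matches, then imputes a missing OCT representation as a barycentric projection under the plan. All names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation; see the repository above for the actual code.

import numpy as np

def cosine_cost(A, B):
    # Pairwise cosine-distance cost between row-wise feature sets A and B.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T

def sinkhorn(C, eps=0.1, n_iters=300):
    # Entropy-regularized OT with uniform marginals (Sinkhorn iterations);
    # eps is the weight of the entropy term in the OT objective.
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-C / eps)  # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan T

rng = np.random.default_rng(0)
F_fundus = rng.normal(size=(8, 16))   # hypothetical fundus features
F_oct = rng.normal(size=(8, 16))      # hypothetical OCT features
y = rng.integers(0, 2, size=8)        # shared class labels

# "Labeled" constraint: a prohibitive cost confines transport within each class.
C = cosine_cost(F_fundus, F_oct) + 1e6 * (y[:, None] != y[None, :])
T = sinkhorn(C)
print("within-class mass:", T[y[:, None] == y[None, :]].sum())  # ~1.0

# A missing OCT feature can be imputed from the available OCT features via the
# barycentric projection of the plan (soft matching, not hard assignment).
F_oct_imputed = (T / T.sum(axis=1, keepdims=True)) @ F_oct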

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0878_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Qinkaiyu/RIMA

Link to the Dataset(s)

https://yutianyt.com/projects/fairvision30k/

BibTex

@InProceedings{YuQin_Robust_MICCAI2025,
        author = { Yu, Qinkai and Xie, Jianyang and Zhao, Yitian and Chen, Cheng and Zhang, Lijun and Chen, Liming and Cheng, Jun and Liu, Lu and Zheng, Yalin and Meng, Yanda},
        title = { { Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {560--569}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a novel method for aligning data from the fundus and OCT modalities, such that their approach makes it possible to impute the representation of OCT based on fundus, and vice versa. They further propose a method for fusing these representations to improve performance. Their alignment method is a variation of Optimal Transport, using the Gromov–Wasserstein distance with additional constraints. Averaged over multiple datasets, their proposed approach (and its individual components) leads to state-of-the-art performance.
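
    As a point of reference for the Gromov–Wasserstein ingredient mentioned above, here is a minimal sketch using the POT library. It shows only the plain GW coupling between two feature sets of different dimensionality, not the paper's constrained variant, and all shapes below are hypothetical:

    import numpy as np
    import ot  # POT: Python Optimal Transport (pip install pot)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 32))   # e.g. fundus features
    Y = rng.normal(size=(10, 64))   # e.g. OCT features; dimensions may differ

    # GW matches intra-modality distance structures, so no shared feature
    # space is required between the two modalities.
    C1 = ot.dist(X, X); C1 /= C1.max()
    C2 = ot.dist(Y, Y); C2 /= C2.max()
    p, q = ot.unif(len(X)), ot.unif(len(Y))

    T = ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')
    print(T.shape, T.sum())         # coupling matrix with marginals p and q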

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors properly position themselves in the current field of research. The authors demonstrate the extent of the issue they are addressing. The authors extensively test their methods on different datasets, and conducted many different experiments and an ablation study. The authors (anonymously) share their code. The authors reflect on the impact of, and decision process behind, several of their choices. The "inter-modality missing" experiments yield very nice results.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In Table 1, the authors average the performance over three different datasets and report this average. This does not seem like a fair way to compare methods: e.g., for the accuracies under "complete-modality fusion", their method is outperformed by IMDR on the AMD and Glaucoma tasks by 1.5–2%, and only on DR does their method perform significantly better. The average therefore misrepresents the results. For nearly all datasets, the accuracy and F1-scores of the "inter-modality missing" experiment for fundus data seem to outperform "complete-modality fusion". This means that (possibly due to the lower performance of the OCT branch) the model's performance decreases when using all data. This is not discussed in the paper, although it seems noteworthy. In Figure 3, only the AUC and F1-scores are reported, not the accuracies, without any clear reason. Why is this?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Figure 1 seems a bit crowded; it may be better to split Fig. 1a and Fig. 1b into separate figures. There are quite a few grammar errors that make some sentences awkward to read. The choice of some variable names is non-ideal (e.g., e and epsilon), though this may be a personal preference. In Table 1, the 2nd-best performance is not always underlined. For Fig. 2, the caption should state that it is a t-SNE visualization. For Fig. 3, it may be nice to mention that this is the "proportional random missing" experiment.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose novel approaches to an important problem, not just in this domain. The results show that their approaches work well (though better for fundus than for OCT). Further, the work is well structured and the approaches are well argued. I would give it a [5. Accept] were it not for the slight concerns about transparency listed in the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel multi-modal alignment and fusion method, capable of learning multi-modal feature representations that are robust to randomly missing data from either modality, and show its application to an ophthalmic disease classification task from fundus and OCT images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Interesting methodology with many novel components, solving an interesting clinical problem and showing good performance.

    The use of an optimal transport strategy as an alignment technique, and splitting this into class (prototype) vs. feature alignment components, each with an associated loss function contributing to the classification loss in the context of an attention-based classifier, is a very interesting approach.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors go into heavy mathematical detail from the outset, without first clarifying the exact problem statement, providing lay insight into the method, or giving a clear description of what they mean by class vs. feature alignment in this context early in the text. Clarifying these concepts as early as possible (e.g., an intuitive explanation of what is meant by a class prototype in this context) would enhance the readability of the paper and help the reader understand the concepts better.

    Similarly, the high-level overview of the framework could be made clearer. The paper currently relies mostly on a figure for this, which is quite dense and minimal in explanation, and thus not very clear. A textual overview of the framework would help set the scene early and improve comprehension of the manuscript.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting contribution and clinical problem. Highly technical but novel and would be of interest to the conference audience.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel multimodal alignment and fusion method designed for robust ophthalmic disease diagnosis, particularly addressing scenarios with incomplete modality data (missing fundus or OCT images). The contributions of the paper include: 1) A formulation for robustly aligning semantic features across fundus and OCT modalities, specifically handling modality incompleteness by explicitly modeling class-wise and feature-wise relationships. 2) Introducing a multi-level alignment strategy utilizing class prototypes and feature alignment through OT, achieving soft matching between modalities. 3) A modality-specific fusion mechanism that leverages the global features from fundus imaging and localized semantic features from OCT data effectively. The authors thoroughly validated their framework across multiple ophthalmic datasets (Harvard-30k: AMD, DR, Glaucoma) under various conditions of modality completeness and incompleteness, demonstrating significant performance improvements over existing state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The use of Labeled OT to achieve multi-scale alignment of fundus and OCT modalities is novel and addresses critical limitations of existing modality-imputation and distillation methods. 2) The method significantly outperforms prior methods under challenging missing modality scenarios, showing practical applicability. 3) Comprehensive experimentation has been performed, including complete modality fusion, inter-modality missing scenarios, and random proportional missing scenarios, which convincingly demonstrates the effectiveness and robustness of the proposed method. 4) The paper includes rigorous ablation studies demonstrating the importance and contribution of each proposed component (class-wise alignment, feature-wise alignment, asymmetric fusion).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) While the paper presents extensive evaluations, it does not discuss the computational overhead or complexity of the proposed OT-based alignment method, particularly considering potential scalability issues in clinical deployments. 2) Although methodologically robust, the clinical implications and interpretability of the alignment results are minimally discussed. It remains unclear whether the enhanced performance translates into clinically meaningful improvements or decision-support benefits. 3) The authors primarily compare against recently published deep-learning-based methods but omit classical multimodal approaches that could serve as additional meaningful baselines. Including such comparisons would strengthen the claim of state-of-the-art performance. 4) Shorten overly complex sentences, especially technical explanations (e.g., equations and optimization descriptions in Section 2.1). Example: “Differently, we impose additional constraints on the cost function to confine transport operations within the same class, thereby enabling the semantic alignment of identical lesions across different modalities.” Recommended simpler version: “We introduce constraints to limit transport operations within each class, aligning identical lesions across modalities.” 5) Provide brief introductions to each sub-section in the Methods to improve the narrative flow.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    ** It would significantly benefit the paper to provide some insight into computational resource requirements and runtime complexity, given the potential practical clinical implementation.
    ** Consider including a brief discussion on how ophthalmologists or clinicians would interpret or use these multimodal alignment results in practice.
    ** Clearly explain all key components in Figure 1 within the main text. For instance, explicitly detail how the self-attention module integrates the aligned features.
    ** For Fig. 2, the t-SNE plots: please increase the plot size and legend clarity. Consider using additional metrics (e.g., cluster compactness) to quantify the improvement in feature separability.
    ** For Fig. 3, standardize the Y-axis ranges across plots or explicitly mention the reason for different scales.
    ** For Table 1, clarify explicitly why certain methods are missing results for some conditions. For example, briefly state why the results of other baselines (e.g., Eye-most) are absent under the missing-modality conditions.
    ** For Table 9, consider providing confidence intervals or standard deviations to highlight statistical significance.
    ** Equation (2) clearly introduces the class-wise transport cost. It is theoretically sound, though somewhat complex. Please add brief intuitive explanations of each term explicitly below the equation to improve reader understanding. Currently, the entropy regularization term εH(T_c) could benefit from further explanation regarding its effect on optimization stability (see the short numerical sketch after this list).
    ** For Equation (4), briefly discuss how this flexibility contributes to improved robustness under missing modalities.
    ** For Equation (5), briefly justify explicitly why cosine similarity is chosen over other distance metrics (e.g., Euclidean or KL-divergence).
    ** For the datasets, explicitly state the class distribution, class-imbalance handling, and whether stratified splits were used in cross-validation.
    ** For the robustness analysis, consider a brief discussion of why the model maintains performance even at high missing ratios, linking explicitly to methodological choices (class-wise vs. feature-wise alignment).
    ** For the ablation study, include a discussion of the interactions or potential redundancy between class-wise and feature-wise alignments.
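
    Regarding the εH(T_c) comment above, a small numerical sketch (pure NumPy; the 3x3 cost matrix and ε values are illustrative, not taken from the paper) shows how the entropy weight trades plan sharpness against smoothness, which is also what stabilizes the Sinkhorn iterations:

    import numpy as np

    def sinkhorn_plan(C, eps, n_iters=500):
        # Entropy-regularized OT plan with uniform marginals.
        a = np.ones(C.shape[0]) / C.shape[0]
        b = np.ones(C.shape[1]) / C.shape[1]
        K = np.exp(-C / eps)
        v = np.ones(C.shape[1])
        for _ in range(n_iters):
            u = a / (K @ v)
            v = b / (K.T @ u)
        return u[:, None] * K * v[None, :]

    C = np.array([[0.1, 0.8, 0.9],
                  [0.7, 0.2, 0.8],
                  [0.9, 0.7, 0.1]])
    for eps in (1.0, 0.1, 0.01):
        T = sinkhorn_plan(C, eps)
        entropy = -(T * np.log(T + 1e-12)).sum()
        # Large eps -> smooth, high-entropy plan (stable optimization);
        # small eps -> near one-to-one matching, approaching unregularized OT.
        print(f"eps={eps:<4}  H(T)={entropy:.3f}  diag mass={np.trace(T):.3f}")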

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper represents a strong, innovative methodological contribution with robust theoretical underpinnings (Labeled Optimal Transport) and extensive empirical validation. Its clear strength is in handling modality incompleteness scenarios, outperforming current state-of-the-art methods consistently across multiple datasets.

    However, minor improvements in the clarity of technical explanations, computational discussions, clinical interpretation, and visualization quality would significantly enhance the paper’s overall clarity, readability, and practical impact.

    Overall, these strengths far outweigh the noted weaknesses, making this paper a strong candidate for acceptance at MICCAI 2025. After addressing suggested improvements, it has strong potential to be highlighted as an oral presentation and contender for best paper consideration.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

N/A




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


