Abstract

Diabetic retinopathy (DR) is a serious complication of diabetes, requiring rapid and accurate assessment through computer-aided grading of fundus photography. To enhance the practical applicability of DR grading, domain generalization (DG) and foundation models have been proposed to improve accuracy on data from unseen domains. Despite recent advancements, foundation models trained in a self-supervised manner still exhibit limited DG capabilities, as self-supervised learning does not account for domain variations. In this paper, we revisit masked image modeling (MIM) in foundation models to advance DR grading for domain generalization. We introduce a MIM-based approach that transforms images into a standardized color representation across domains. By transforming images from various domains into this color space, the model can learn consistent representations even for unseen images, promoting domain-invariant feature learning. Additionally, we employ joint representation learning of both the original and transformed images, using cross-attention to integrate their respective strengths for DR classification. We demonstrate a performance improvement of up to nearly 4% across three datasets, positioning our method as a promising solution for domain-generalized medical image classification.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4143_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{JanEoj_Revisiting_MICCAI2025,
        author = { Jang, Eojin and Kang, Myeongkyun and Kim, Soopil and Sagong, Min and Park, Sang Hyun},
        title = { { Revisiting Masked Image Modeling with Standardized Color Space for Domain Generalized Fundus Photography Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors revisited the use of MIM in foundation models to advance domain generalization for color fundus photography (CFP) classification. Specifically, they proposed an MIM-based approach that transforms images from different domains into a standardized color space, allowing the model to learn consistent representations. Furthermore, joint representation learning of the original and transformed images was performed using a cross-attention mechanism, and LoRA fine-tuning was also applied. Experiments demonstrated the effectiveness of the method in CFP classification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study proposes a joint training strategy that combines color space standardization with masked image modeling (MIM) to enhance the generalization capability of color fundus photography (CFP) classification models in cross-domain scenarios. By integrating domain-invariant representation learning with image reconstruction supervision, the approach represents an innovative extension of self-supervised pretraining in the context of domain generalization for medical imaging tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors propose a Domain Generalization method that combines color space standardization with self-supervised masked image modeling (MIM), achieving strong cross-domain performance on several fundus image classification tasks. The method is grounded in practical motivation, supported by extensive experiments, and demonstrates a certain level of novelty. However, several aspects still require clarification or enhancement to improve the academic and practical value of the paper:

    1. The paper proposes a masked image modeling strategy based on a standardized color space for domain generalization, but it does not sufficiently discuss the distinctions or advantages of this strategy compared to existing methods such as color style transfer or color augmentation. Additionally, the method uses RGB statistics from the training set as the standardization reference. It remains unclear whether this approach is more effective than using other standardized color spaces (e.g., Lab, HSV), and this deserves further discussion or empirical comparison.

    2. The method states that the color standardization parameters (μc, σc) are computed per class. This implies that label information is involved in the color transformation process, which raises concerns about potential label leakage or bias during testing, where ground-truth labels are not available. The authors should clarify how this transformation is handled during inference and ensure that it does not compromise the fairness of evaluation.

    3. Although the authors mention using cross-attention to fuse features from the original and transformed images, the paper lacks sufficient details regarding the specific design of the attention mechanism (e.g., single-head vs. multi-head, dimensional allocation). It is recommended to provide additional architectural details, ideally through a module diagram or pseudocode.

    4. The proposed two-stage training pipeline (MIM pretraining + cross-attention fine-tuning) may introduce significant computational overhead. It is recommended to report the training and inference time, GPU memory consumption, and other resource-related metrics to assess the method’s practicality for deployment.

    5. While the abstract and introduction focus heavily on diabetic retinopathy (DR) as the main clinical context, the experimental section is predominantly based on glaucoma datasets. This inconsistency makes the overall narrative somewhat confusing. The authors are encouraged to revise the motivation and organization to maintain logical coherence throughout the paper.

    6. Since the method aims to enhance discriminative ability via color standardization and joint representation, the current visualizations are insufficient. More qualitative results (e.g., attention or activation maps) are recommended to improve interpretability. Additionally, the disease types in Figure 2 should be clearly labeled for reader clarity.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses the challenge of cross-domain generalization in color fundus image classification and proposes a novel and effective joint training strategy that combines color space standardization with Masked Image Modeling. The method achieves competitive performance across multiple benchmark datasets. However, further improvements are needed to ensure the completeness of the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Thank you to the authors for the detailed response to the comments. The authors’ rebuttal effectively addresses the main issues I raised. While the brevity of the rebuttal limits the depth of discussion on all points, the authors have satisfactorily clarified the main issues. I believe the manuscript is suitable for acceptance.



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel framework to address the out-of-domain CFP classification. Experimental results reveal that the proposed method achieves SOTA performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Domain-generalized CFP classification is an interesting topic and is significant for clinical application. This paper presents a MIM-based DG method that outperforms comparison methods on three benchmarks. This work is solid and makes a good contribution to the medical imaging community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors present a standardized color transformation to make images from different domains have a similar appearance, thereby reducing the domain gap. However, grayscale conversion or CLAHE could also address this issue. A comparison with these commonly used image preprocessing methods is lacking, and the motivation behind the standardized color transformation is insufficient.

    As observed in Table 1, the authors only report experimental results for the comparison methods, and the performance of the baseline is unclear, where the baseline means training a classification model directly on the source domains.

    Although the proposed model outperforms the comparison methods, my concern is whether the improvement is caused by RETFound, since it is pretrained on 0.9 million images. In contrast, the comparison methods only utilize the benchmark datasets. In my opinion, this is not a fair comparison.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic is interesting, however, the experimental results and ablation studies are not convincing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors do not provide any explanations for my concerns. Why?



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel approach to domain generalization in color fundus photography. The key contribution is a two-stage pipeline: first, a color standardization model is trained to map fundus images into a standardized color space, and second, a downstream classification model is trained that utilizes both an original image and its standardized version for classification. The core idea is to mitigate domain-specific color variations in input images, thereby improving robustness across domains without requiring labeled data from unseen target domains. The authors also introduce a cross-attention fusion strategy to combine information from the original and standardized images, which further enhances generalization capability.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clarity and Structure: The paper is well-written and easy to follow. The abstract summarizes well the motivation, approach, and results. The introduction clearly outlines the research problem, limitations in current work, and the paper’s contributions.

    • Methodological Soundness: The proposed method is described in a logically structured and comprehensible manner.

    • Comprehensive Evaluation: The experimental section is thorough. The authors evaluate against a diverse set of baseline methods, report results over multiple seeds, and conduct well-designed ablation studies that help to isolate the contributions of individual components. This improves confidence in the reported improvements and the reproducibility of the findings.

    • Insightful Ablation Studies: The ablations provide useful insights into the roles of various components, including the use of cross-attention, the pretraining strategy, and different branches of the network. This increases transparency regarding how performance gains are achieved.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the core idea is well-motivated and the experimental validation is strong, several conceptual and methodological points require clarification or additional evidence:

    Methodological Concerns and Clarifications

    • Potential for Hallucination: The color standardization model may be prone to hallucinating features in underexposed or overexposed images if the standardized target is drastically different from the source image. It is unclear how the method ensures content preservation under such extreme variations (for example, the one shown in Fig. 1, Domain 1), and whether such effects were observed in practice. It would be great if the authors could elaborate on this.

    • Bidirectional Cross-Attention: The rationale for using cross-attention in both directions ((Q^f, K^g, V^g) and (Q^g, K^f, V^f)) is not sufficiently justified. A discussion on why a single-directional fusion would not suffice, or why the current symmetrical cross-attention leads to performance improvement, would be valuable.

    • Label-wise Standardization: The decision to compute the mean and standard deviation for color standardization on a per-label basis raises concerns, especially under imbalanced datasets. It is unclear whether this undermines the domain generalization goal by implicitly encoding label distribution biases into the standardization process. That is, as it stands, the standardized color space might vary considerably depending on how many samples of a specific label each domain contributes to the training set. In extreme cases, this could lead to large differences in the standardized images between distinct labels and might hamper model training. Furthermore, the manuscript states that the lack of label information during testing leads to lower-quality standardization. However, if standardized images are only used for pretraining, it is unclear why label information would be required at test time. This part of the methodology would benefit from clearer explanation or correction.

    • Masking Strategy During Pretraining: The Phase I pretraining relies on masked autoencoding despite already using a foundation model trained with masking as the backbone. It is not clear whether this second masking benefits model performance or limits it, because fine-grained clinical details might be masked out. Therefore, shouldn’t Phase I training with unmasked full images yield even better results for this specific task? Did the authors consider this, and can they share their insights?

    • Definition of “Imperfect” Standardization: The claim that the standardization is imperfect is not substantiated. If this impacts the pretraining objective or the effectiveness of the reconstruction, more explanation is needed on how the model compensates for this and whether it affects overall robustness.

    • LoRA Implementation Details: More detail is needed regarding the LoRA integration: Are the low-rank adapters merged with the base encoder weights post-fine-tuning? If so, how does this affect the encoder-decoder’s ability to produce the color-standardized reconstruction for the generation of Z_i after the encoder weights have been updated for the classification task? Furthermore, it is also unclear whether the same encoder instance is used for both reconstruction and classification, or if these are separate instances of the same encoder network.

    Experimental Setup and Comparisons

    • Pretraining Across Domains: It is not stated whether the encoder is pretrained on the full training set including all domains, or if a leave-one-domain-out protocol is used to align with the domain generalization evaluation setup. This distinction is critical, particularly when comparing results with methods such as RETFound.

    • Fairness of Qualitative Comparisons: The comparison with RETFound in terms of reconstruction quality is potentially unfair unless both models are tested on unseen domains. Clarifying whether RETFound has seen the test data or similar distributions is necessary to evaluate the qualitative results appropriately.

    Results and Interpretation

    • Unexpected Performance Patterns: Despite the poor reconstruction quality, RETFound-derived embeddings (Z_i w/ R) perform almost competitively with the proposed method. This discrepancy between perceptual quality and downstream utility deserves further discussion.

    • Effect of Masking and Label-wise Normalization: It would strengthen the study to explicitly evaluate whether the masking strategy or label-based color standardization (label-based mean and standard deviation values) are essential for achieving strong results. Their necessity is currently only assumed but not validated or proven.

    • Cross-Attention Directionality: Similar to earlier comments, the necessity of bidirectional attention remains unclear. Since Z_i appears to serve an auxiliary role, a unidirectional mechanism might suffice, and this could simplify the model without significant performance loss.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The use of multiple similarly labeled variables (e.g., X_i, X̂_i, X′_i, Z_i) in the ablation discussion is difficult to parse at times. It would be helpful to have a more distinguishable and clearer variable naming.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are several methodological concerns and open questions that warrant further clarification as pointed out under point 7, the core contribution is clear, the experiments are comprehensive, and the paper is overall well-written. If the (major) issues are adequately addressed in the rebuttal, I would be inclined to revise my score to an accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reviewing the authors’ rebuttal and carefully considering the reviews and concerns raised by the other reviewers, I now support acceptance of the paper. The authors have adequately addressed my main concerns, and their responses were clear, well-reasoned, and demonstrated a solid grasp of the technical issues raised during the initial review.

    While the other reviewers raise reasonable points that merit attention, I believe the strengths of the paper, particularly its meaningful contribution, technical soundness, and thorough evaluation, outweigh the remaining minor concerns. The rebuttal further reinforced the paper’s value and clarified several important aspects, strengthening the overall case for acceptance.

    I encourage the authors to incorporate key clarifications from their rebuttal into the final camera-ready version, where possible, to enhance the clarity and accessibility of the work for the broader research community.




Author Feedback

  • Clarify color transformation - R2,R3 As shown in prior studies, grayscale/CLAHE pre-processing and color augmentation have proven insufficient for the CFP DG task; similarly, we obtained poor results in our experiments (see Mixup [30] in Table 1, which employs advanced augmentation). Regarding label use in color standardization, we discovered a noticeable difference in color distribution across labels, which led us to adopt a strategy of standardizing images using label-specific μ and σ. Notably, labels are used only during training, not during testing; instead, our method uses generated images that do not require labels. Furthermore, our preliminary experiments demonstrated that label-specific standardization leads to higher accuracy than not using it. We believe this superior performance stems from (a) sophisticated image generation that leverages improved statistics, and (b) indirect supervision driven by varying label statistics in Stage 1, which ultimately benefits feature learning for downstream classification. Additionally, RET-FT/LP/LoRA are baseline methods where RETFound is fine-tuned using only the source domain and evaluated on the target domain (R1). The low baseline accuracy compared to the comparison methods (e.g., CauDR) suggests that fine-tuning alone is insufficient. Thus, our effectiveness arises not from large-scale parameters or RETFound’s pretraining, but rather from our own strategy. In addition, we used RGB, as it is the most commonly used color space in image processing. Exploring other color spaces will be considered in future work.
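
    For concreteness, the label-specific standardization described in this response could look like the following minimal sketch. It is an illustrative reconstruction under stated assumptions (Reinhard-style per-channel RGB statistic matching; the helper names are hypothetical), not the authors' released implementation.

```python
# Minimal sketch (assumption): Reinhard-style per-channel standardization
# toward label-specific RGB statistics, as discussed in the rebuttal.
import numpy as np

def class_statistics(images, labels):
    """Per-label, per-channel RGB mean/std over the training set.

    images: float array (N, H, W, 3) in [0, 1]; labels: int array (N,).
    """
    stats = {}
    for c in np.unique(labels):
        pixels = images[labels == c].reshape(-1, 3)
        stats[c] = (pixels.mean(axis=0), pixels.std(axis=0) + 1e-8)
    return stats

def standardize(image, label, stats):
    """Shift one image's channel statistics onto the label-specific reference."""
    mu_tgt, sigma_tgt = stats[label]
    mu_src = image.reshape(-1, 3).mean(axis=0)
    sigma_src = image.reshape(-1, 3).std(axis=0) + 1e-8
    return np.clip((image - mu_src) / sigma_src * sigma_tgt + mu_tgt, 0.0, 1.0)
```

    Consistent with the rebuttal, such a label-dependent mapping would only be needed during training; at test time the method relies on generated images, which require no labels.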

  • Cross-attention details - R2,R3 We carefully chose bi-cross-attention after comparing it with one-way attention, mean, and concat methods, as it achieved the highest accuracy (as similarly reported in [19]). Also, we used multi-head attention; details can be found in the released code.
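
    Since the rebuttal confirms bidirectional multi-head cross-attention but defers specifics to the code, a minimal sketch of symmetric cross-attention between original-image tokens f and standardized-image tokens g is given below; the mean fusion, head count, and dimensions are assumptions, not the authors' configuration.

```python
# Minimal sketch (assumptions: PyTorch, batch-first tokens, mean fusion of
# the two attended streams; head count and dims are illustrative only).
import torch
import torch.nn as nn

class BiCrossAttention(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.f_to_g = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.g_to_f = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f, g):
        # (Q^f, K^g, V^g): original-image queries attend to standardized tokens.
        f_att, _ = self.f_to_g(query=f, key=g, value=g)
        # (Q^g, K^f, V^f): the symmetric direction.
        g_att, _ = self.g_to_f(query=g, key=f, value=f)
        return (f_att + g_att) / 2  # fused joint representation

fuse = BiCrossAttention()
z = fuse(torch.randn(2, 197, 768), torch.randn(2, 197, 768))  # (2, 197, 768)
```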

  • Impact of masking - R2 As demonstrated in the MAE [14] paper, masking plays a crucial role in unsupervised model training. To identify the optimal setting, we conducted preliminary experiments by varying the masking ratio from 0% to 100%. The results indicated that the original RETFound setting of 75% masking was most suitable, and thus we adopted it.
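
    The 75% ratio follows MAE/RETFound; for readers unfamiliar with it, a minimal sketch of the standard MAE random-masking step (a generic reimplementation, not the authors' code) is:

```python
# Minimal sketch (assumption): MAE-style random masking keeping 25% of patch
# tokens, matching the 75% ratio adopted from RETFound.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (batch, num_patches, dim) -> kept tokens and their indices."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    ids_shuffle = torch.rand(b, n).argsort(dim=1)  # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_keep
```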

  • Computation - R3 Despite requiring roughly twice the training and inference time, our method remains practical, with 0.02 s inference and standard GPU usage, as accuracy is paramount in the medical domain.

  • Target disease - R3 The proposed method was primarily validated on DR, specifically 4DR and APTOS. Experiments on the glaucoma dataset assess the applicability of our method to other eye diseases.

  • Robust to extreme cases? - R2 As shown in the Domain 4 image of Fig. 2, our model demonstrates stable reconstruction even in extreme cases. In addition, the generated image is blended with the original to form a joint representation, which helps alleviate potential negative effects caused by hallucinations.

  • DG protocol - R2 We pretrained the encoder in Stage 1 using a leave-one-domain-out protocol, and there is no overlap between the DG datasets and RETFound’s pretraining dataset.

  • RETFound reconstruction - R2 Although RETFound’s generated results may appear somewhat unnatural (e.g., blue edges), such results are consistent with the qualitative results presented in the original RETFound paper and do not indicate an error caused by domain differences.

  • Explain w/ R ablations - R2 Since Zᵢ is formed by blending the original image Xᵢ with the generated image, the prediction remains valid even if RETFound generates a low-quality image, as Xᵢ is still involved in the prediction process.
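
    Read literally, this blending can be seen as a convex combination of the original image and its reconstruction; the blend weight alpha below is hypothetical, as the excerpt does not specify it.

```python
# Minimal sketch (assumption): Z_i as a pixel-wise blend of the original X_i
# and the generated image; alpha = 0.5 is illustrative, not the paper's value.
import torch

def blend(x_orig: torch.Tensor, x_gen: torch.Tensor, alpha: float = 0.5):
    return alpha * x_orig + (1.0 - alpha) * x_gen
```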

  • LoRA details - R2 The E and D used for generating Zᵢ are separate from the LoRA-adapted E used for classification. Therefore, training the classifier via LoRA does not impact the Zᵢ generation.
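
    For readers unfamiliar with LoRA, a minimal sketch of a standard low-rank adapter is shown below; it illustrates why adapting the classification encoder leaves a separately instantiated, frozen encoder-decoder for Zᵢ generation untouched. This is the generic LoRA formulation, not the authors' exact integration.

```python
# Minimal sketch (assumption): generic LoRA linear layer, y = W x + (alpha/r) B A x.
# The frozen base weights are untouched; only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep base frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```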

  • Details - R3 In Fig. 2, the disease label is Normal for Domains 1, 2, and 4, and Mild-DR for Domain 3.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have clearly clarified the issues raised by the reviewers. The explanations in the rebuttal look reasonable and correct to me.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


