Abstract

Diabetic Retinopathy (DR), induced by diabetes, poses a significant risk of visual impairment. Accurate and effective grading of DR aids in the treatment of this condition. Yet existing models suffer notable performance degradation on unseen domains due to domain shifts. Previous methods address this issue by simulating domain styles through simple visual transformations and mitigating domain noise by learning robust representations. However, domain shifts encompass more than image styles, and these methods overlook biases caused by implicit factors such as ethnicity, age, and diagnostic criteria. In our work, we propose a novel framework in which representations of paired data from different domains are decoupled into semantic features and domain noise. The resulting augmented representation comprises the original retinal semantics and domain noise from other domains, aiming to generate enhanced representations aligned with real-world clinical needs and incorporating rich information from diverse domains. Subsequently, to improve the robustness of the decoupled representations, class and domain prototypes are employed to interpolate the disentangled representations, and data-aware weights are designed to focus on rare classes and domains. Finally, we devise a robust pixel-level semantic alignment loss to align the retinal semantics decoupled from features, maintaining a balance between intra-class diversity and dense class features. Experimental results on multiple benchmarks demonstrate the effectiveness of our method on unseen domains. The code implementation is available at https://github.com/richard-peng-xia/DECO.
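
To make the core mechanism concrete, below is a minimal PyTorch-style sketch of the decouple-and-recombine step, assuming (as discussed in the reviews) that per-channel instance-normalization statistics are treated as domain noise and the normalized feature map as retinal semantics. This is an illustrative approximation only, not the authors' exact method; the full implementation, including prototype interpolation and the alignment loss, is in the repository linked above.

import torch

def decouple_recombine(r_i: torch.Tensor, r_j: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Re-dress the semantics of r_i with the domain statistics of r_j.
    r_i, r_j: intermediate feature maps of shape (B, C, H, W) from paired
    samples of the same DR grade drawn from different source domains."""
    # Per-channel instance statistics serve as a proxy for domain noise.
    mu_i = r_i.mean(dim=(2, 3), keepdim=True)
    sigma_i = r_i.std(dim=(2, 3), keepdim=True) + eps
    mu_j = r_j.mean(dim=(2, 3), keepdim=True)
    sigma_j = r_j.std(dim=(2, 3), keepdim=True) + eps

    z_i = (r_i - mu_i) / sigma_i      # domain-invariant retinal semantics of x_i
    return sigma_j * z_i + mu_j       # semantics of x_i dressed in the domain noise of x_j

# Example: augment the features of x_i with the domain noise of a same-class x_j.
r_i, r_j = torch.randn(4, 64, 32, 32), torch.randn(4, 64, 32, 32)
r_aug = decouple_recombine(r_i, r_j)  # shape: (4, 64, 32, 32)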

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0781_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0781_supp.pdf

Link to the Code Repository

https://github.com/richard-peng-xia/DECO

Link to the Dataset(s)

https://www.kaggle.com/competitions/aptos2019-blindness-detection
https://github.com/deepdrdoc/DeepDRiD
https://csyizhou.github.io/FGADR/
https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid
https://www.adcis.net/en/third-party/messidor2/
https://www.kaggle.com/datasets/mariaherrerot/eyepacspreprocess
https://www.kaggle.com/datasets/mariaherrerot/ddrdataset



BibTex

@InProceedings{Xia_Generalizing_MICCAI2024,
        author = { Xia, Peng and Hu, Ming and Tang, Feilong and Li, Wenxue and Zheng, Wenhao and Ju, Lie and Duan, Peibo and Yao, Huaxiu and Ge, Zongyuan},
        title = { { Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a domain generalization method for RGB images that follows the idea of MixStyle.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It is interesting to include prototypes in MixStyle;
    2. The results are encouraging;
    3. The presentation is clear and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experimental settings are unclear. There is no explanation of how the hyper-parameters are selected; the authors just claim that they adopt cross validation. Is there any validation set? What is the search range for each parameter? How are the parameters tuned for the comparison methods?
    2. The citation is not proper for the section “decoupling and recombination of representation”. This section clearly follows MixStyle, which is not properly cited.
    3. The proposed method is only tested on RGB images and is not tested on more general medical images.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and the idea is interesting, with encouraging experimental results. However, the experimental settings are rather ambiguous to me, which makes the experiments not convincing enough.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents an interesting idea of tackling the generalization of deep learning based methods to unseen domains in diabetic retinopathy through a representation disentanglement of domain and semantic features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of leveraging instance normalization to disentangle semantic and domain representations seems novel in DR research. Recombining the semantic and domain representations of different domains is simple but seems to be effective.
    2. The authors propose a robust pixel-level alignment loss to further align retinal semantics between original features and the augmented features (via the proposed decoupling and recombining).
    3. Experiments on leave-one-domain-out and train-on-single-domain generalization setups demonstrated the effectiveness of the proposed method in generalizing to unseen domains over other competing methods. Ablation studies on model design variants were also provided to support the effectiveness of the proposed components.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Though I like the main idea of this paper in decoupling and recombining of retinal semantics and domain noise, I have the following concerns and questions:

    1. The presentation of this paper makes it harder for readers to follow in the following aspects: a) The notation used in the “Task Settings” of Section 2 (“Methodology”) is a bit messy and hard to parse. The authors use both “i” and “d” to index both domains and samples. E.g., “n_i = {n_{d, 1}, …, n_{d, c}}” should be “n_d = {n_{d, 1}, …, n_{d, c}}”. The source domains should be defined as D_S = {D_1, …, D_s} to be consistent with the notation D_T used for the target domain. I also suggest using “d” to index the source domains and “i” to index data triplets {x_i, y_i, d} to make it clearer. The notation in the paper makes it hard for me to follow. b) I believe there is also an error in Eq. (3), where “\sigma(r_i)” should be “\sigma(r_j)” based on the description of combining the semantic representation of x_i (i.e., z_i) with the domain representation of x_j (denoted by \mu(r_j) and \sigma(r_j)), consistent with the notation used in Eq. (8). And “\hat{y}=y_i” should be “y_j=y_i”, which indicates that the two samples are from the same category but different domains (see the sketch after this list). c) Which term corresponds to the pixel-level loss is extremely unclear to me (from my understanding, it should be the second term). Still, the notation of “I” and “k” is overloaded multiple times.
    2. The pixel-level alignment loss looks more like a contrastive loss to me; Eq. (9) in the paper is highly close to Eq. (1) in [5]. If that is the case, I think the authors should elaborate on how the proposed pixel-level loss differs from the one used in [5]; otherwise, I do not think this can be listed as a contribution. Another question is why Eq. (9) provides pixel-level alignment of semantics. According to the notation, z(r_k) corresponds to the latent features of a sample from a mini-batch. I understand that z has spatial resolution, but the summation is not over the spatial dimensions but over the number of samples in a mini-batch. I hope the authors can clarify this.
    3. How the hyper-parameters (\alpha in Eq. (9)) were selected is neither reported nor discussed. Even though the authors provided ablations on different components, it would still be informative for readers to know the importance of each loss component.
    4. The authors claim the improvement was significant, but no statistical test was performed.
    5. I am a bit surprised that training on the APTOS dataset generalized worse than training on the IDRID or Messidor dataset, given that APTOS is almost 7/2 times the size of the IDRID/Messidor datasets, respectively. I am wondering about the authors’ insight into this.
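
    For concreteness, the corrected recombination in point 1(b) would presumably take an AdaIN/MixStyle-style form (a reconstruction based only on the description above, not necessarily the paper's exact equation):

    z_i = \frac{r_i - \mu(r_i)}{\sigma(r_i)}, \qquad \hat{r}_i = \sigma(r_j)\, z_i + \mu(r_j), \qquad \text{s.t. } y_j = y_i,\ d_j \neq d_i,

    where \mu(\cdot) and \sigma(\cdot) denote instance-wise channel statistics (domain noise) and z_i is the retinal-semantic representation of x_i.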
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Not a huge issue here, but I hope the authors can release the code and the pretrained weights in the near future.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although I think the main idea is interesting and the experiments are extensive, the presentation of this paper needs major revision. The technical contribution of the pixel-level alignment loss also needs in-depth discussion. Some of the results are not convincing to me, especially since no statistical tests were performed. I hope the authors can address those concerns in the rebuttal phase. I’ll read the rebuttal carefully to reconsider my evaluation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The manuscript presents a novel approach aimed at addressing domain shifts in grading diabetic retinopathy (DR) across retinal images from diverse sources. The proposed methodology involves three key components: the disentanglement of semantic features from domain-specific features, the derivation of class and domain prototypes, and the implementation of a pixel-level semantic alignment loss. The authors have conducted an extensive evaluation of their method across six DR datasets, claiming superior performance over conventional methods, domain-generalization techniques, and other feature representation approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The paper addresses a critical issue in medical imaging—domain shift—which significantly impacts the deployment of deep learning methods in real clinical environments. This problem is both relevant and timely within the medical imaging community.
    • The methodological foundations and their derivation appear sound and are well-conceptualized and articulated.
    • The evaluation is thorough, encompassing multiple datasets and comparing the proposed method against a range of state-of-the-art domain-generalization and feature representation methods. This comprehensive benchmarking provides a robust foundation for the claims of improved performance.
    • The manuscript includes detailed ablation studies that help to isolate and demonstrate the impact of each component of the proposed method. This approach enhances the clarity and credibility of the results presented.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Comments:

    • While there are prior works that employ prototype learning, such as those mentioned in reference [1], it is important for the sake of contextual accuracy that these existing contributions be acknowledged in the manuscript. I do not request a direct comparison with various prototype learning strategies or specific referencing of this particular paper, but rather a general acknowledgment of prior work in the field.
    • The disentanglement strategy employs instance normalization, traditionally associated with style transfer applications. It would be beneficial for the authors to clarify how this strategy avoids the mentioned pitfalls typically associated with style transfer, such as the potential to overlook biases caused by implicit factors like ethnicity and age. How were ethnicity and age specifically addressed by the proposed method? If they were not, please reconsider these claims/assumptions or provide experimental results to back them.
    • The treatment of essential biomarkers such as age and ethnicity as domain noise needs further discussion. The manuscript should discuss whether these factors could influence diagnostic outcomes in DR grading and whether treating them as domain noise could potentially have a negative effect on diagnostic performance.
    • The authors claim improvements in detecting rare classes. Clarification is needed on how the method enhances this aspect of DR grading. If this claim is not empirically validated, it should be reconsidered or supported with additional experimental evidence. Similarly, what constitutes a ‘rare’ domain?
    • The paper mentions pixel-level semantic alignment but appears to reference alignments in the feature space, not actual pixels. Clarification on this point would prevent potential misunderstandings about the nature of the alignment being performed.
    • How are the parameters lambda_c and gamma_c determined?
    • The term “significant” should be used in conjunction with statistical tests. The authors are encouraged to specify where statistical tests have been used to substantiate claims of improvement or difference, providing p-values or confidence intervals where applicable.

    [1] Gallée L, Beer M, Götz M. Interpretable Medical Image Classification Using Prototype Learning and Privileged Information. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2023, pp. 435-445. Cham: Springer Nature Switzerland.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See comments above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript employs a combination of various (some of which known) techniques to address the challenging issue of domain shift in grading diabetic retinopathy. While some individual technique might be established in the field, their amalgamation into a single, cohesive framework to tackle this specific problem could still be considered novel. A major strength of the manuscript is the comprehensive evaluation conducted across multiple datasets, coupled with the reported improvements over existing methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a new domain generalization framework for diabetic retinopathy (DR) by decoupling DR-related (semantic) features shown in images from non-DR-related (domain) noise, such as image style or demographic differences. The authors perform data augmentation by combining semantic features with domain noise from other domains. In the test stage, classification is performed via pixel-wise alignment between the input image features and the augmented data. Their experiments show results that outperform state-of-the-art methods on various datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is reasonable to decouple the DR-related features shown in images from the non-DR-related features in order to augment the dataset, especially in clinical applications that require non-image information for classification. In addition, since in clinical applications the semantic features and domain features are sometimes correlated with each other, it seems that the authors introduced ‘class/domain prototypes’ to avoid too many non-realistic cases. It is a pretty well written paper: easy to understand, the mathematical expressions are well defined, and the parameters are given in the supplementary material.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The prototype part is not very clear to me. As I understand it, decoupling assumes that the two features are (almost) independent, which is not actually the case in practice, so the authors introduce the concept of a ‘prototype’; but how did you optimize the balance? Isn’t it biased toward the existing training data again?

    As a minor point, they did not explain exactly what kind of source domains are used in the experiments for diabetic retinopathy classification.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors did not mention or provide the source code, but at least this work used open datasets. Some of the parameters are still not clear; if they were clarified, it would help reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Can you please answer the question described in the weaknesses section? And how did you decide the prototype parameters given in the supplementary material?

    Can you please state exactly what kind of source domains are used in the experiments?

    Please add the following reference and explain how the techniques and results differ, as the titles are very similar: Galappaththige et al., “Generalizing to Unseen Domains in Diabetic Retinopathy Classification”, WACV, 2024.

    In the supplementary material, the method name in the parameter table is GDRNet. Please change it to DECO.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors decoupled features to generate DR-related features shown in images and non-DR-related features, which is more clinically meaningful.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate the positive feedback and suggestions from the reviewers on our work, and we will consider their suggestions to further improve the quality of this work in either the final version or future work.




Meta-Review

Meta-review not available, early accepted paper.


