Abstract

Diabetic retinopathy (DR) is a complication of diabetes and usually takes decades to reach sight-threatening levels. Accurate and robust detection of DR severity is critical for the timely management and treatment of diabetes. However, most current DR grading methods suffer from insufficient robustness to data variability (e.g. colour fundus images), posing a significant difficulty for accurate and robust grading. In this work, we propose a novel DR grading framework CLIP-DR based on three observations: 1) Recent pre-trained visual language models, such as CLIP, showcase a notable capacity for generalisation across various downstream tasks, serving as effective baseline models. 2) The grading of image-text pairs for DR often adheres to a discernible natural sequence, yet most existing DR grading methods have primarily overlooked this aspect. 3) A long-tailed distribution among DR severity levels complicates the grading process. This work proposes a novel ranking-aware prompting strategy to help the CLIP model exploit the ordinal information. Specifically, we sequentially design learnable prompts between neighbouring text-image pairs in two different ranking directions. Additionally, we introduce a Similarity Matrix Smooth module into the structure of CLIP to balance the class distribution. Finally, we perform extensive comparisons with several state-of-the-art methods on the GDRBench benchmark, demonstrating our CLIP-DR’s robustness and superior performance. The implementation code is available at https://github.com/Qinkaiyu/CLIP-DR.
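For context, the sketch below shows how a frozen, pre-trained CLIP model (OpenAI "clip" package with its ResNet-50 backbone) can score a fundus image against one text prompt per DR grade. This is an illustration of the zero-shot baseline only, not the CLIP-DR implementation: the learnable ranking-aware prompts and the Similarity Matrix Smooth module described in the paper are omitted, and the image path is hypothetical.

```python
# Illustrative sketch only, not the authors' CLIP-DR implementation.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet-50 image encoder, Transformer text encoder

# The five DR severity levels form a natural ordinal sequence.
grades = ["no DR", "mild DR", "moderate DR", "severe DR", "proliferative DR"]
text = clip.tokenize([f"This image is {g}" for g in grades]).to(device)
image = preprocess(Image.open("fundus.jpg")).unsqueeze(0).to(device)  # hypothetical path

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # scaled cosine similarities to the five grade prompts
    probs = logits_per_image.softmax(dim=-1)   # distribution over the five grades

print(dict(zip(grades, probs.squeeze(0).tolist())))
```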

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1493_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1493_supp.pdf

Link to the Code Repository

https://github.com/Qinkaiyu/CLIP-DR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yu_CLIPDR_MICCAI2024,
        author = { Yu, Qinkai and Xie, Jianyang and Nguyen, Anh and Zhao, He and Zhang, Jiong and Fu, Huazhu and Zhao, Yitian and Zheng, Yalin and Meng, Yanda},
        title = { { CLIP-DR: Textual Knowledge-Guided Diabetic Retinopathy Grading with Ranking-aware Prompting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces CLIP-DR, a variation of CLIP adapted specifically for DR grading. Generally speaking, it consists of a classic CLIP framework with some task-specific engineering that enables its application to this particular field. The main technical contributions are, in my opinion, two auxiliary components tailored for this task: a Similarity Matrix Smooth module, which helps mitigate the effect of class imbalance, and a rank-aware loss, which pushes the model to be aware of the ordinal relationship between the target variables. Experiments performed in the domain generalization setting of GDRBench show improvements in performance compared with multiple state-of-the-art techniques. The ablation study also demonstrates the incremental contributions of the components.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written, easy to follow and understand, and focuses on a topic that is relevant for the medical community (enabling DR grading with rank awareness).

    • It introduces an adaptation of CLIP to this particular problem, which had not been done before. It features two new components tailored to two problems associated with DR grading (class imbalance and ranking awareness).

    • Results in the benchmark datasets used in GDRBench are higher than those obtained by other state-of-the-art approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I struggle to see the point of using a framework like CLIP directly for DR grading, when no text is available to learn that context and improve the ranking task. If the text is always the same and there is no extra information available for the text encoder, why then should we use this framework at all?

    • There is a significant drop in performance when moving from training on multiple domains and evaluating on one to training on one and evaluating on the rest. Although the authors justify this by mentioning that they do not focus on domain generalization as a target task, a more in-depth explanation of the likely cause is missing.

    • No qualitative results in terms of class activation maps are provided.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No code is released but information is provided in the supplementary materials to train the models. Evaluation is performed on publicly available datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • My main concern is that it is not clear to me why CLIP is a good choice in this particular scenario. CLIP has been tremendously useful for pre-training models in applications where a large amount of (image, text) pairs is available. In this context, I don’t see the actual contribution of a text encoder that only sees the same piece of text (“This image is”) followed by the particular class of the image (e.g. “no DR”). The text encoder is not collecting any valuable context from the prompt, resulting in a configuration that looks very much like a typical contrastive learning task. I’d like to see a proper justification of this in the rebuttal (and the manuscript) so that the idea is better articulated.

    • In line with my previous comment, I think that the comparison with OrdinalCLIP, although reasonable, is unfair. OrdinalCLIP extracts valuable information from the encoded text and uses that to improve the ranking task. But in this particular scenario, the text available is way too limited, rendering the core contribution of OrdinalCLIP practically useless, and resulting in the matrix illustrated in Fig. 3. I would also like the authors to elaborate on this topic in the rebuttal.

    • Section 3 mentions that a ResNet50 network was used as an image and text encoder. How did you adapt the ResNet50 to deal with text? Please, detail that in the rebuttal and in the manuscript.

    • It is a pity, but the supplementary material is well beyond the page limit (the limit was 2 pages, and it has 5). Furthermore, a whole section of text is included right on the first page, which contradicts the guidelines for authors. It should be pointed out that this is unfair to other works that made the effort to comply.

    There are also some minor formatting issues, spelling and grammatical errors to correct, namely:

    • Page 4. There is an isolated “Model Structure” title right before Section 2.2. I guess it was a typo, so it should be deleted.

    • Also in Page 4, Section 2.2, first sentence. It says “Soomth” but should be “Smooth”.

    • Also in Page 4, Section 2.2, there is a “We” with capital W that should be “we”.

    • In equations that use text (e.g. (7), (8), (9), or in the subscripts of the losses), you should use \text{} in LaTeX to wrap the text and avoid italics.

    • In the supplementary materials and in Fig. 3 you refer to “proliferative DR” as “proliferation”. Please, correct it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces engineering tricks that enable the application of a learning framework like CLIP to a DR grading task. Experiments clearly demonstrate the contribution of each of those components, and the differences with the state of the art demonstrate improvements. However, I struggle to see why CLIP is a good idea in this scenario, considering that it was conceived as a pre-training framework to benefit from the intrinsic knowledge in text, whereas here the text is just a sentence carrying the classification information. We need to clearly see how the text encoder is making a difference here. My second concern is that I am not sure why (and how!) the authors used a ResNet-50 as a text encoder. If the authors could answer those two questions in a reasonable way, I’d be glad to change my rating; the numbers are very good.

    Aside from this, I’d like to raise my concern about the supplementary materials not fulfilling the guidelines for authors.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper describes a diabetic retinopathy (DR) severity assessment algorithm for fundus photographs. It is based on CLIP, with a few improvements to address class imbalance and take into account the fact that DR severity grades are ordered labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Exploring CLIP-based algorithms is still a topic of interest in the community, although I cannot say this approach takes advantage of image-text correlations (contrary to what the title suggests).
    • The proposed framework was assessed on GDRBench, a recent domain generalization (DG) benchmark. It was found to be at least non-inferior to the SOTA method (GDRNet) on this benchmark. DG is an important research topic.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Methodological novelty is not sufficiently stressed. For instance, the differences with OrdinalCLIP should be highlighted (it is based on a different idea, but this should be stated). In particular, regarding SMS for imbalanced regression, the similarities and differences with “Delving into deep imbalanced regression” [19] should be clarified.
    • Conceptually, the proposed L_rank loss is not clinically accurate, as a patient might go directly from mild or moderate DR to proliferative DR, without going through severe DR. DR severity simply expresses a likelihood to develop proliferative DR. The questionable assumption of decreasing similarities should be discussed.
    • As far as I know, the proposed L_rank loss is novel. However, in essence, it is related to the quadratic-weighted Kappa loss, often used for classification with ordered labels: the proposed approach should be better motivated.
    • Comparison with the baseline (GDRNet) is not convincing: it is not clear who the winner is, if any. Statistical significance needs to be reported.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets are already public (GDRBench).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The temperature parameter \tau in equations 8 and 9 needs to be defined.
    • Fig. 3 is misleading: it looks like 2 confusion matrices, while in fact the matrices are based on only 5 samples. The better behavior of CLIP-DR might be due to chance. I would recommend plotting average values over all images in each DR severity class to alleviate this potential issue.
    • Typo: Soomth -> Smooth
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty and superiority of the proposed approach were not established. However, I believe these two points can be addressed in the camera-ready paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have clarified some concerns about novelty and comparison with the state of the art, so I believe the paper can be accepted. Yet novelty is limited, so I keep my weak accept recommendation.



Review #3

  • Please describe the contribution of the paper

    The authors present a new work that exploits CLIP with a new ranking-aware prompting strategy to help the model in pathological analysis. The authors introduce a Similarity Matrix Smooth module to help CLIP balance the class distribution.

    The proposal seems interesting and the results are adequate. The experimentation also includes ablation studies demonstrating the impact of each proposed part.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The adaptation and improvement of a strategy such as CLIP.

    The application to a practical, real-world domain, namely diabetic retinopathy.

    The experiments and results that are presented, which are adequate.

    Ablation studies were conducted to demonstrate the impact of the modifications and improvements.

    Also, a comparison with the SOTA was included in the manuscript.

    The availability of the source code.

    The extent of the data used, with a benchmark involving 8 public datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The whole proposal is focused on DR. I would like to see it tested on other applications or even other medical imaging modalities to check whether the paradigm works well in other domains.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposal is well explained; ablation studies, a comparison with the SOTA, and representative results are also reported.

    My main reservation lies in whether it would work well in other medical applications, either on other retinographies or on other medical imaging modalities.

    Regarding the SOTA, little information is given about the works that were included and analyzed at the beginning of Section 4 (Results). How were these methods selected and introduced into the comparison? There are plenty of SOTA methods for this pathological scenario, and a clear explanation of their inclusion and analysis would be a valuable addition to the manuscript.

    Have you tested it on other eye fundus diseases?

    Have you tested it on other medical imaging modalities?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the application domain of the proposal (DR), a new approach that was well tested, the availability of source code and image datasets, and the experiments and ablation studies that were conducted, I find this work suitable for the audience of the MICCAI conference.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The explanations provided by the authors are in line with my previous impression of the work. Therefore, I find this proposal suitable for the audience of the MICCAI conference.




Author Feedback

We thank reviewers R1, R3, and R4 for their reviews and recognition of our work’s novelty and strengths. We will update the response and correct minor typos in the camera-ready version.

Reply to R1: #Q6-1: In the results section, we introduced OrdinalCLIP [21] and its limitations in DR grading. Both our approach and [21] aim to learn ranked feature spaces in CLIP [9]. While [21] uses linear interpolation, we make the DR fundus order learnable via text-image pairs. Additionally, our SMS addresses data imbalance by smoothing similarity vectors, unlike the entropy-based method in [19]. #Q6-2: This is indeed observed and should be considered with longitudinal datasets for individual patients. Our dataset’s DR colour fundus images carry natural rank information, which we used for diagnosis rather than prognosis. We will discuss the assumption of decreasing similarities in the camera-ready version. #Q6-3: While related, the weights in the L_rank loss are self-adaptive and non-linear, unlike the fixed linear or quadratic weights in the kappa loss. #Q6-4: We have conducted a t-test comparing our model to GDRNet [14] (SOTA). Our model outperformed it with a statistically significant p-value of 0.02, where ours achieved the best F1 or AUC. #Q10-1: It was set to 1. #Q10-2: Both confusion matrices in Fig. 3 are averages over all samples, as described in Section 4 (page 7).

Reply to R3: #Q6-1 & Q10-1 & Q10-2: Our model does not obtain rank information from learning optimal context prompts. Instead, we make the natural ordering of DR colour fundus images learnable in the CLIP [9] feature space, and we designed an innovative rank loss to rank the aligned text-image pairs in order. Similarly, in OrdinalCLIP [21] the ranking information does not come from learning optimal context prompts; instead, it uses linear interpolation of the base rank embeddings of [class] to obtain ranked features. The text encoders in both our model and [21] are frozen during training, so the rank representation does not come from the text encoder. Our input text format follows [21]: both use the same text plus the specific class of the image, e.g. ‘age estimation: the age of the person is [class]’ for [21] and ‘This image is [class]’ for ours. [21] reports performance for 8 initial prompts, with only a 0.8% difference (min 2.30, max 2.32); please see their OpenReview rebuttal for details. Notably, [21] requires far fewer base rank classes (e.g., 1, 10) than the actual number of classes (100), which limits its ability to learn rank information when there are few classes (e.g., the five classes in DR grading); see Section 4, Table 3 of their manuscript for details. #Q6-2: Besides domain issues, insufficient training data to fine-tune the CLIP-based model is a primary reason. This is explained in the supplementary material (Section 1, page 1). #Q6-3: We will add them to the camera-ready version. #Q9: We provide an anonymous code link in the abstract, including the test code and pre-trained model weights. We stated that the training code will be released after the paper is accepted. #Q10-3: Thanks for pointing out this error! We used the ResNet-50 version of the pre-trained CLIP model (see the official CLIP GitHub page). The text encoder is a Transformer, and the image encoder is a ResNet-50. Sorry for the confusion; we will correct the typo in the camera-ready version. #Q10-4: We apologize. We will abridge our supplementary material to meet the MICCAI Conference requirements.

Reply to R4: #Q10-1: We compared against the same methods as the GDRNet [14] benchmark for a fair comparison and included [9] and [21] for a comprehensive comparison. #Q10-2 & Q10-3: Yes, our model applies to any classification task with inherent rank information. We have tested it on the CORN1500 (Mou et al., 2022) benchmark for corneal nerve tortuosity, achieving a 3% better AUC. Thank you for your suggestions; we will test more eye diseases and imaging modalities in future work.
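To make the discussion of the ranking objective more concrete, below is a minimal sketch of a generic pairwise ranking penalty over CLIP image-text similarities, under the assumption stated in the paper that similarity should decrease as the text grade moves away from the ground-truth grade. It is not the authors' exact L_rank: the self-adaptive non-linear weighting mentioned in the rebuttal is replaced by a plain hinge, and the function name, its arguments, and the temperature tau are illustrative only.

```python
# Sketch of a pairwise ranking-aware penalty on CLIP similarities (not the authors' L_rank).
import torch
import torch.nn.functional as F

def ranking_aware_loss(sim, target, tau=1.0):
    """sim: (B, C) image-to-grade-text similarities; target: (B,) true grade indices."""
    _, C = sim.shape
    sim = sim / tau
    grades = torch.arange(C, device=sim.device)
    dist = (grades.unsqueeze(0) - target.unsqueeze(1)).abs()  # (B, C): ordinal distance to the true grade
    closer = dist.unsqueeze(2) < dist.unsqueeze(1)            # (B, C, C): grade i is closer to the target than grade j
    margin = sim.unsqueeze(1) - sim.unsqueeze(2)              # (B, C, C): s_j - s_i
    # hinge-penalise every pair where a farther grade scores higher than a closer one
    return (F.relu(margin) * closer).sum() / closer.sum().clamp(min=1)

# toy usage: sim would normally be image_features @ text_features.T for the five DR grades
sim = torch.randn(4, 5)
target = torch.tensor([0, 2, 3, 4])
print(ranking_aware_loss(sim, target))
```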




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper gathered two accept recommendations, and Reviewer 3 suggested weak rejection based on doubts about how the CLIP embedding could be useful for this particular problem. While I agree with R3's concern, the fact that the other two reviewers kept their acceptance recommendations post-rebuttal, and R3 did not come back, leads me to back the main trend of accepting this paper. I would still ask the authors to please address R3's concerns in the camera-ready version of the paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


