Abstract

Vision-Language Foundation models are increasingly investigated in computer vision and natural language processing, yet their exploration in ophthalmology and broader medical applications remains limited. A key challenge is the lack of labeled data for training foundation models. To address this issue, this paper develops a CLIP-style retinal image foundation model. Our foundation model, RET-CLIP, is trained on a dataset of 193,865 patients to extract general features of color fundus photographs (CFPs), employing a tripartite optimization strategy that operates at the left-eye, right-eye, and patient levels to reflect real-world clinical scenarios. Extensive experiments demonstrate that RET-CLIP outperforms existing benchmarks across eight diverse datasets spanning four critical diagnostic categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and multi-label classification of multiple diseases, demonstrating the performance and generality of our foundation model. We will release our pre-trained model publicly in support of further research.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1812_paper.pdf

SharedIt Link: https://rdcu.be/dY6k4

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_66

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1812_supp.zip

Link to the Code Repository

https://github.com/sStonemason/RET-CLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Du_RETCLIP_MICCAI2024,
        author = { Du, Jiawei and Guo, Jia and Zhang, Weihang and Yang, Shengzhu and Liu, Hanruo and Li, Huiqi and Wang, Ningli},
        title = { { RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {709--719}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper employs a CLIP-based model structure to combine image and text information for retinal image classification. Superior performance is achieved compared with baseline methods, including non-CFP foundation models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Large-scale data (including multiple datasets) are used for model training. The performance shows improvements over the baseline models. The proposed pipeline also conducts contrastive learning at multiple levels (monocular and patient).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The pipeline is based on the CLIP structure. The differences from the traditional CLIP model are not clearly explained.
    2. According to the method description, the added text information is an important reason for the performance improvement. The procedure used to process and “standardize” the text needs a better explanation.
    3. The compared foundation models are pre-trained on non-CFP data, whereas the proposed method is exposed to in-domain data.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The code and a sample of the text data are important for reproducing the pipeline.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    An extra paragraph could be provided to explain the differences from the traditional CLIP model. The performance comparison needs in-domain foundation model results to make the comparison complete.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The model structure is based on CLIP and shows no obvious architectural improvements. Since the performance improvement is one of the major contributions, the source of that improvement is not clear from the paper’s discussion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The explanation of the text preprocessing is satisfactory. The model comparison with traditional CLIP-based models mainly concerns different data domains. The novelty of adapting a foundation model to a specific field is limited.



Review #2

  • Please describe the contribution of the paper

    An adaptation of the CLIP model is proposed to create a foundation model for color fundus photographs. The authors propose to split the loss into a monocular level, where each eye contributes separately to the loss, and a patient level, where the information from both eyes is concatenated. The results show the superiority of the proposed model compared to other relevant foundation models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed adaptation of the CLIP model is very reasonable for this ophthalmic modality, and the results clearly demonstrate its superiority to other relevant foundation models.
    2. Although no models or data are shared, the method is simple enough to be quickly adopted by the ophthalmology image analysis community, which may contribute to the high impact of this work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No description of the training data for the downstream tasks.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The source code is included. The data and the trained models are not included, which is unfortunate since these would be very useful to the community.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed RET-CLIP model is an excellent contribution. First, it follows a very simple and reasonable approach for color fundus photographs. Second, there is an extensive evaluation with good results, which makes it clear that RET-CLIP outperforms other models. Regarding this evaluation, it is worth mentioning the effort made to replicate the experiments with different model seeds, which enables the inclusion of error bars and statistical significance of the different results.

    My only concern with this paper is that there is no description on how the training/validation/test splits are created for the downstream tasks. Because of this, it is not clear how to interpret the results.

    Two minor questions:

    • “The diagnostic reports in Chinese are also included, extending linguistic versatility of the research domain beyond English” -> Which languages are included? From this sentence I would assume it’s both English and Chinese, but it is later mentioned that the weights “are initialized with the Chinese-CLIP weights”, which makes me think that maybe it’s all in Chinese? In general, it would be nice to have more details about the content of the clinical reports.
    • Are the different random seeds used for the pre-trained models, or the downstream linear prediction head?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a clear methodological contribution that, in my opinion, deserves acceptance. However, while I would like to assume there is a fair distribution of the training/validation/testing data for the downstream task, the results are not clearly valid without this description.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My only concern before the rebuttal was related to the lack of clarity in the data split for the downstream task. The authors have clarified this in the rebuttal.



Review #3

  • Please describe the contribution of the paper

    This work proposes a tailored CLIP structure for retinal image diagnosis. Specifically, the authors propose a RET-CLIP method that considers three levels (i.e., left eye, right eye, and patient) of representation to conduct contrastive learning. With a detailed methodology and thorough experiments, they demonstrate that their method outperforms the other methods in this domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This is a solid pipeline for applying CLIP to retina image understanding tasks. Specifically, the authors consider the properties of retinal images and have modified the CLIP pipeline accordingly, as well as considered additional representations. With minimal effort, the pipeline has proven to be effective in the chosen downstream tasks involving CFPs. (2) The data collection and experimental process are extensive, and the scale of data can support the fine-tuning process in the selected sub-domain. The evaluation process is also informative, proving it’s an effective approach with potential to be extended. (3) The presentation of the paper (i.e., methodology, image illustration, and quantitative analysis) is satisfactory and relatively smooth to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The application of VLMs to medical image analysis is not under-explored. In fact, multiple papers have applied this concept to different sub-domains, including chest X-ray, hematology, and geometric imaging, among others [2-4] (the authors also mention FLAIR in the introduction). Simply applying the concept without carefully adapting the existing VLM is not sufficient for a solid acceptance. (2) Given (1), the proposed pipeline is not novel, considering multiple works existed before the submission deadline of MICCAI’24 [1, 5-7]. Although this topic attracts much attention, the authors fail to stand out by proposing a novel and insightful addition to the CLIP structure for representation learning of retinal images. Simply separating levels during training is not enough. (3) Considering (1) and (2), this paper, though valuable in its adoption of multimodal data in CLIP and its tailored method for retinal images, still falls below the quality needed for acceptance independent of rebuttal. I would like to see the authors’ response regarding the exclusive novelty of this approach and would potentially raise my rating from weak accept to accept.

    [1] Silva-Rodriguez J, Chakor H, Kobbi R, et al. A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision. arXiv preprint arXiv:2308.07898, 2023.
    [2] Chambon P, Bluethgen C, Langlotz C P, et al. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv preprint arXiv:2210.04133, 2022.
    [3] Monajatipoor M, Rouhsedaghat M, Li L H, et al. BERTHop: An effective vision-and-language model for chest x-ray disease diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022: 725-734.
    [4] Qin Z, Yi H, Lao Q, et al. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv preprint arXiv:2209.15517, 2022.
    [5] Zhou Y, Chia M A, Wagner S K, et al. A foundation model for generalizable disease detection from retinal images. Nature, 2023, 622(7981): 156-163.
    [6] Wei H, Liu B, Zhang M, et al. VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis. arXiv preprint arXiv:2403.10823, 2024.
    [7] Tan T F, Chang S Y H, Ting D S W. Deep learning for precision medicine: Guiding laser therapy in ischemic retinal diseases. Cell Reports Medicine, 2023, 4(10).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors mention their intention to open-source their dataset in the abstract but do not provide details later on. With that, the reproducibility of the work is questionable.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) Consider highlighting the novelty of this paper beyond choosing data representations to satisfy the requirement of applying CLIP to a downstream task of medical image analysis. (2) Consider adding qualitative analysis (i.e., how the embedding changed before and after applying RET-CLIP) to solidify the conclusion made by the authors.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solidity of the pipeline and paper writing, questionable novelty, and reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Novelty clarification (Reviewer#3, #4): Thanks for the valuable comments. We will rearrange our contributions and revise them accordingly. Our contributions can be summarized as follows: 1) Clinical diagnostic reports are utilized as the textual information for CLIP, which we believe is the first attempt in retinal foundation models. The benefits of employing diagnostic reports include the following: First, the textual information in a diagnostic report is richer than a classification label, which can improve the performance of foundation models. Second, the diagnostic report is patient-specific, so the image-text correspondence is stronger, which is more appropriate for the training objective of CLIP. Last, diagnostic reports are widely available in ophthalmology, which makes the proposed method easy to apply in clinics. 2) A novel strategy is proposed to decouple the information of the left and right eyes in diagnostic reports, which is a simple yet effective paradigm for building a retinal foundation model. In practical scenarios, diagnostic reports are usually at the patient level, mixing information from both eyes, which poses a major challenge for directly using CLIP to build foundation models. The proposed monocular- and patient-level contrastive learning approach handles this challenge in the ophthalmology domain. 3) Previous CFP foundation models either did not incorporate textual information (RETFound), employed simple labels and fixed descriptions constructed in advance as textual information (FLAIR), or utilized images and diagnostic reports generated by AI, which lack clinical reliability (VisionCLIP). To the best of our knowledge, our paper presents the first work to directly integrate large-scale clinical data to build a well-established CFP foundation model. Various downstream tasks demonstrate that our model exhibits superior performance. We hope the release of our model can benefit the retinal image processing community so that the foundation model can facilitate practical applications.
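To make the monocular- and patient-level contrastive learning idea above concrete, here is a minimal sketch of such a tripartite objective. It is an illustration only, not the authors' released code: the symmetric InfoNCE form, the feature-averaging fusion at the patient level, and all names are assumptions (the paper may instead concatenate features, as Review #2 suggests).

```python
# Hypothetical sketch of a tripartite (left-eye, right-eye, patient-level)
# CLIP-style contrastive loss; all design choices here are assumptions.
import torch
import torch.nn.functional as F

def info_nce(img_feat, txt_feat, temperature=0.07):
    """Symmetric CLIP-style InfoNCE over a batch of paired features [B, D]."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def tripartite_loss(left_img, right_img, left_txt, right_txt, patient_txt):
    # Monocular level: each eye's CFP is matched with its own portion of the report.
    loss_left = info_nce(left_img, left_txt)
    loss_right = info_nce(right_img, right_txt)
    # Patient level: both eyes are fused (averaging here; concatenation plus a
    # projection head is another plausible choice) and matched with the full report.
    patient_img = (left_img + right_img) / 2
    loss_patient = info_nce(patient_img, patient_txt)
    return loss_left + loss_right + loss_patient
```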

Reproducibility: The source code is already attached, and the pre-trained model will be released after acceptance.

Reviewer#1 Dataset split: The division ratio for the training/validation/test sets is 0.56:0.14:0.3 for all downstream datasets, which is consistent with RETFound. To ensure distribution consistency across categories, we divide the data of each category according to this ratio and then combine them into the complete splits. Minor questions: 1) The clinical diagnostic reports are all in Chinese, but the proposed method is also applicable to English. 2) Different random seeds are only used in the downstream tasks, determining the shuffling of the training data.
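A per-category (stratified) 0.56:0.14:0.3 split like the one described can be sketched as below. This is an illustration with scikit-learn, not the authors' code; the helper name and the two-stage use of train_test_split are assumptions.

```python
# Hypothetical stratified 56% / 14% / 30% split for a downstream dataset.
from sklearn.model_selection import train_test_split

def stratified_split(samples, labels, seed=0):
    # First carve out the 30% test set, preserving class proportions.
    x_trainval, x_test, y_trainval, y_test = train_test_split(
        samples, labels, test_size=0.30, stratify=labels, random_state=seed)
    # Then split the remaining 70% into 56% train / 14% val (0.14 / 0.70 = 0.2).
    x_train, x_val, y_train, y_val = train_test_split(
        x_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```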

Reviewer#3 Text preprocessing: The “text standardization” here only involves correcting typos and consecutive-punctuation errors caused by human input, restoring abbreviations to their full expressions, and unifying mixed Chinese and English expressions into Chinese. We will explain this in the final version. Model and performance comparisons: Following RETFound, published in Nature, which is a retinal foundation model pre-trained on CFPs with a traditional MAE, we propose the first (to the best of our knowledge) retinal foundation model pre-trained on CFPs together with diagnostic reports. For model comparisons, CN-CLIP and DINOv2 are two foundation models pre-trained on natural images, PMC-CLIP is pre-trained on general medical images, while RETFound and FLAIR are both pre-trained on CFPs (in-domain). We follow the comparison studies used in these works. The above methods do not consider the mixed information of the left and right eyes in diagnostic reports and are thus not suited to being trained on our dataset. The domains of the compared pre-trained models will be clearly noted in the comparison tables in the final version.
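The kind of rule-based text standardization described above could look roughly like the following sketch. The specific rules and abbreviation dictionary are illustrative placeholders only, not the authors' actual preprocessing; OD/OS/OU are standard ophthalmic abbreviations for right/left/both eyes.

```python
# Hypothetical rule-based report standardization: collapse repeated punctuation
# and expand common abbreviations into full Chinese expressions.
import re

ABBREVIATIONS = {"OD": "右眼", "OS": "左眼", "OU": "双眼", "DR": "糖尿病视网膜病变"}

def standardize_report(text: str) -> str:
    # Collapse consecutive punctuation introduced by manual input (e.g. ",," or "。。").
    text = re.sub(r"([,，。;；])\1+", r"\1", text)
    # Expand abbreviations to their full expressions.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    return text.strip()
```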

Reviewer#4 Highlighting novelty and qualitative analysis: Thanks for the valuable suggestions. We will follow the suggestions and revise accordingly in the final version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The novelty of the proposed method is questioned, as it mainly builds upon existing technologies without significant new contributions. The differences between the RET-CLIP model and traditional CLIP models are not clearly explained, and the method’s reliance on non-CFP pretrained models raises questions about its generalization. However, the method demonstrates superior results compared to other relevant foundation models, highlighting its potential impact on the ophthalmology image analysis community. The paper includes a comprehensive evaluation with detailed experiments, robust statistical analysis, and replication with different model seeds, ensuring the reliability of the results.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although there are concerns about the technical novelty raised by some reviewers, the application to retinal images is interesting. There are adequate evaluations and results, and the rebuttal was able to address some of the concerns.



