Abstract

Retinal foundation models aim to learn generalizable representations from diverse retinal images, facilitating label-efficient model adaptation across various ophthalmic tasks. Despite their success, current retinal foundation models are generally restricted to a single imaging modality, such as Color Fundus Photography (CFP) or Optical Coherence Tomography (OCT), limiting their versatility. Moreover, these models may struggle to fully leverage expert annotations and overlook the valuable domain knowledge essential for domain-specific representation learning. To overcome these limitations, we introduce UrFound, a retinal foundation model designed to learn universal representations from both multimodal retinal images and domain knowledge. UrFound is equipped with a modality-agnostic image encoder and accepts either CFP or OCT images as inputs. To integrate domain knowledge into representation learning, we encode expert annotations as text supervision and propose a knowledge-guided masked modeling strategy for model pre-training. It involves reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding image. This approach aligns multimodal images and textual expert annotations within a unified latent space, facilitating generalizable and domain-specific representation learning. Experimental results demonstrate that UrFound exhibits strong generalization ability and data efficiency when adapting to various tasks in retinal image analysis. By training on ~180k retinal images, UrFound significantly outperforms the state-of-the-art retinal foundation model trained on up to 1.6 million unlabelled images across 8 public retinal datasets. Our code and data are available at https://github.com/yukkai/UrFound.
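As a rough illustration of the pre-training objective described above, the following is a minimal PyTorch-style sketch of knowledge-guided masked modeling: masked image patches from either CFP or OCT are reconstructed, while masked text tokens from the expert annotation are predicted conditioned on the image latent. All module names, dimensions, and the equal loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeGuidedMaskedModel(nn.Module):
    """Sketch of joint masked image modeling (MIM) and image-conditioned masked
    language modeling (MLM) over retinal patches and expert-annotation text."""

    def __init__(self, dim=256, patch_dim=768, vocab_size=30522, n_heads=8):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Modality-agnostic image encoder: the same weights process CFP or OCT patches.
        self.image_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.image_decoder = nn.Linear(dim, patch_dim)   # reconstructs masked patches
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Fusion module: text tokens are processed together with image features, so
        # masked-token prediction is conditioned on the image latent.
        self.fusion = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.mlm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches, patch_mask, text_ids, text_mask):
        # patches:    (B, N, patch_dim) flattened patches from a CFP or an OCT image
        # patch_mask: (B, N) bool, True where a patch is masked
        # text_ids:   (B, T) token ids of the textual expert annotation
        # text_mask:  (B, T) bool, True where a token is masked
        # Masked patches are zeroed here for brevity (MAE instead drops them entirely).
        visible = patches * (~patch_mask).unsqueeze(-1).float()
        z = self.image_encoder(self.patch_embed(visible))
        loss_mim = F.mse_loss(self.image_decoder(z)[patch_mask], patches[patch_mask])

        tokens = self.text_embed(text_ids.masked_fill(text_mask, 0))
        fused = self.fusion(torch.cat([z, tokens], dim=1))[:, z.size(1):]
        loss_mlm = F.cross_entropy(self.mlm_head(fused)[text_mask], text_ids[text_mask])
        return loss_mim + loss_mlm   # equal weighting of the two losses is an assumption


# Toy usage with random tensors (shapes are illustrative only).
model = KnowledgeGuidedMaskedModel()
loss = model(torch.randn(2, 196, 768), torch.rand(2, 196) < 0.75,
             torch.randint(0, 30522, (2, 32)), torch.rand(2, 32) < 0.15)
```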

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1942_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1942_supp.pdf

Link to the Code Repository

https://github.com/yukkai/UrFound

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yu_UrFound_MICCAI2024,
        author = { Yu, Kai and Zhou, Yang and Bai, Yang and Soh, Zhi Da and Xu, Xinxing and Goh, Rick Siow Mong and Cheng, Ching-Yu and Liu, Yong},
        title = { { UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This research presents a foundation model (UrFound) for disease detection in retinal images. UrFound processes colour fundus photographs (CFP) and optical coherence tomography (OCT) images within one framework and encodes the disease labels as auxiliary information for model training. The authors have evaluated UrFound on multiple retinal disease benchmarks and compared UrFound with multiple recent methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper is well organised with clear research questions and a concise logical flow, despite a few typos such as ‘per-training’ in the last paragraph of the introduction and ‘molding’ in the first paragraph of the conclusion. It is worth noting, however, that the supplementary material has more than two pages and includes massive text descriptions. I defer this point to the Area Chair.

    2) The authors combined the advantages of RETFound [19] and FLAIR [15] and proposed a conditional masked language modelling approach, which is simple but effective.

    3) The authors compared UrFound and SOTA methods like RETFound and FLAIR and included several benchmarks for model evaluation. The Ablation study is designed to verify the efficacy of different model components.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The external evaluation, e.g. cross-validation between IDRiD, APTOS2019, and Messidor2, is absent from this paper. Such analyses are important for assessing model generalisability after adaptation.

    2) Equation 2 needs revision. The language encoder apparently has trainable parameters and was updated during training. It would be better to express the reconstructed masked tokens in Equation 2 using the language decoder l(·), w_{j}, and z (a sketch of this suggestion follows the list). Otherwise, only the image encoder and decoder appear to be updated during training according to the equations. Additionally, incorporating the backpropagation gradient flow in Figure 1 would enhance clarity.

    3) The assertion that UrFound captures complementary information from CFP and OCT images appears overstated unless the authors show that a model given paired CFP+OCT inputs outperforms models given a single modality.
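    As a purely illustrative sketch of the form suggested in point 2 above, assuming the paper's notation with w_j the masked text tokens, z the image latent, l(·) the language decoder, and M the set of masked token positions (this is not the paper's actual Equation 2):

```latex
% Illustrative only: writing the conditional MLM loss explicitly through the
% language decoder l(.) makes clear that its parameters receive gradients.
\mathcal{L}_{\mathrm{MLM}}
  = - \sum_{j \in \mathcal{M}} \log p\bigl( w_j \,\bigm|\, l(z, \mathbf{w}_{\setminus \mathcal{M}}) \bigr)
```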

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) The issues raised above should be addressed, such as involving external validation and correcting equations.

    2) Given that the downstream tasks focus on disease detection, incorporating disease labels, especially those matching the downstream diseases, in model pre-training can hardly be regarded as a fair comparison to models trained solely on unlabelled images. While acknowledging the benefits of leveraging publicly available resources for training, it is essential to remain mindful of this potential limitation.

    3) While open-source research is not compulsory, the authors are encouraged to follow it, particularly when their work has benefited a lot from open-source resources.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is on the borderline. It introduces a clear research question, effective strategy, and comparison to state-of-the-art methods. However, this research lacks some crucial analyses for the foundation model, and its reproducibility cannot be assessed. I may consider raising the score if the authors address my concerns in the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concerns. Considering that this paper lies on the borderline and ranks first in my review pool, I raised the score to weak accept.



Review #2

  • Please describe the contribution of the paper

    This paper aims to propose a foundation model for retinal images. They leverage the idea from Masked Autoencoder (CVPR 2022) and combine it with multimodal clinical description data in a self-supervised learning pipeline. The authors used 25 CFP datasets and one large OCT dataset to pretrain the model. Experimental results show it outperforms other SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Interesting baseline multimodal model for retinal images. The problem under consideration is of high practical importance since MAEs are becoming standard.

    2. The paper’s readability is satisfactory.

    3. Experiment: the authors claim their method is competitive with baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major weaknesses:

    1. The novelty is limited. There is a large body of work focusing on applying MIM and MLM to the medical domain; see [1], [2].

    2. The experiment section is flawed due to an unfair comparison and a lack of compared methods. The authors only compared their SSL pre-trained method with some supervised training methods, and the advantages of their method derive solely from the data rather than from the model architecture.

    3. The baseline methods also rely on MIM and MLM. It is unclear whether the additional performance is due to the use of specific datasets for pre-training or results from the initialization itself; this should be further verified.

    4. The absence of statistical significance analysis for the reported results hampers the ability to draw reliable conclusions regarding the significance of the performance improvements achieved with the proposed method.

    References:
    [1] Xiao, Junfei, et al. “Delving into masked autoencoders for multi-label thorax disease classification.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.
    [2] Chen, Zhihong, et al. “Multi-modal masked autoencoders for medical vision-and-language pre-training.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None. I think the method is simple to reproduce, as there are many existing repositories for MAE in the general domain.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The motivation behind the work is too simple and not clearly explained. Why was MAE chosen for retinal images? What about using other pre-trained weights, such as DINOv2, in this study?

    2. Compared with other SSL papers, what is the superiority of the method proposed in the manuscript? The authors should clarify this.

    3. To improve the paper, I suggest adding 2-3 ablation studies with other self-supervised approaches for the visual encoder, such as MoCo v3, to provide a fairer comparison and a more comprehensive evaluation.

    4. This paper is primarily application-oriented, and it appears to be an attempt to combine previously implemented ideas. From a motivational perspective, the paper requires substantial revision, along with a clinical application study, to enhance the novelty.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of this paper is limited and the experiment of the paper has major flaws.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks for the feedback, but I still have some concerns:

    The motivation of the proposed method is now better clarified. However, the methodological difference between the proposed method and other MAE-based medical imaging papers still seems minor.

    The novelty of the proposed method is limited. It is an application paper that uses MAE in a domain-specific area and does not provide enough interpretability. Besides, it is still unclear whether the comparison between “Ours” and the other baseline methods is fair.



Review #3

  • Please describe the contribution of the paper

    The authors propose a retinal foundation model (UrFound) that is trained on multi-modal data, including CFP, OCT, and text. The method is novel and the results are good.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. UrFound accepts either CFP or OCT as input, where the complementary information in both modalities improves the performance.
    2. UrFound is trained by reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding retinal image. Categorical labels and clinical descriptions are converted to a general text format for training.
    3. The method is novel, the results are promising, and the writing is good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. OCT scans are usually 3D volumes, whereas the method is trained on 2D images; the 3D spatial correlation is neglected.
    2. It would be interesting to see a comparison to more general multi-modal pre-training, e.g. BEiT-3 [a], and to MAE on retinal datasets. The paper lacks a discussion of SOTA multi-modal (vision-language) pre-training methods.

    [a] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. Implementation details are missing, e.g. input shape and data pre-processing.
    2. I hope the authors can make their code and dataset public. The foundation model will have a good impact.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In Fig. 2(a) and (b), adding labels for the x and y axes would make the figures easier to read.
    2. The downstream tasks are mainly classification tasks. Perhaps in future work, you could explore whether UrFound achieves good results in dense prediction tasks, e.g. lesion segmentation in CFP or layer segmentation in OCT.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes UrFound, which is trained with multi-modal data and achieves good results. Although it lacks a comparison to and discussion of general vision-language pre-training methods, the paper's merits outweigh its weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their valuable comments and address the concerns as follows:

To All Reviewers:

Q1 Open-Source Research

We will make our data, code, and pre-trained models open-source after acceptance.

To R#1

Q1 External Evaluation

We conducted external evaluations on the IDRiD, APTOS2019, and Messidor datasets and found that our UrFound model demonstrates strong generalizability, outperforming RETFound and FLAIR in most cases, with statistical significance confirmed by a t-test at a significance level of 0.05.

Q2 Eqn 2, Fig 1, and Typos

Thanks for pointing these out. We will revise Eqn 2, improve Fig 1, and correct the typos accordingly.

Q3 Complementary Information Between CFP and OCT

We acknowledge that our claim may be overstated without testing on paired CFP and OCT images. We will revise our statement to: “UrFound captures information from both CFP and OCT images and performs well with both imaging modalities.”

Q4 Comparison to Models Trained with Labels

We have compared our UrFound model against FLAIR and supervised task-specific models, which are also trained with disease labels. However, we acknowledge that UrFound relies on disease labels for training, and we will make this point clearer in our revised paper.

To R#3

Q1 OCT Scans as 3D Volumes

We followed the same settings as RETFound to train on 2D OCT slices. Learning from 3D OCT scans could be an interesting avenue for future work.

Q2 Discussion and Comparison to General Vision-Language Models

We have discussed related works on vision-language pre-training in the supplementary materials. We have compared our model with MAE and FLAIR, a CLIP-based retinal model, and plan to explore more vision-language approaches in the future.

Q3 Implementation Details

We resize the input image to 224×224 and preprocess the data following RETFound. Full implementation details will be included in our revised paper.
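
For reference, a minimal sketch of such preprocessing is given below, assuming a RETFound-style pipeline with bicubic resizing to 224×224 and ImageNet normalization; the exact transform parameters are assumptions, as they are not specified in the rebuttal.

```python
# Illustrative only: resize-to-224 preprocessing in the spirit of RETFound.
# The interpolation mode and normalization statistics are assumptions.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
# tensor = preprocess(pil_image)  # pil_image: a CFP image or a 2D OCT slice loaded with PIL
```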

To R#4

Q1 Novelty

We want to clarify that most existing medical vision-language pre-training models are not trained on retinal images, and it is unclear whether methods designed for other medical images work well with retinal images, especially when dealing with multiple imaging types such as CFP and OCT. In this work, we propose the first universal retinal foundation model for both CFP and OCT images that uses expert knowledge, which has not been studied before.

Q2 Fair Comparison

We do compare with SSL methods, including RETFound and FLAIR, which are based on MAE and CLIP, respectively. They are SOTA retinal foundation models.

Our UrFound training dataset has significantly fewer images than RETFound's (160K vs. 1.6M) and is actually a SUBSET of the training data used by FLAIR. This demonstrates that UrFound's improvements stem from our knowledge-guided pre-training strategy, not just the data.

Q3 Statistical Significance Analysis

We repeated our experiments 10 times with different random seeds and conducted a t-test at a significance level of 0.05. Our results show that UrFound performs similarly to the second-best method on IDRiD and JSIEC, and significantly better on the other six datasets.
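
As an illustration of this testing procedure, the sketch below runs a paired two-sided t-test over per-seed scores at the 0.05 significance level; the score values and the choice of a paired test are hypothetical, not the paper's actual numbers or protocol.

```python
# Illustrative only: paired two-sided t-test over 10 random seeds at alpha = 0.05.
# The AUROC values below are made up for demonstration purposes.
from scipy import stats

urfound_auroc  = [0.912, 0.908, 0.915, 0.910, 0.913, 0.909, 0.916, 0.911, 0.914, 0.907]
runnerup_auroc = [0.901, 0.899, 0.905, 0.900, 0.903, 0.898, 0.906, 0.902, 0.904, 0.897]

t_stat, p_value = stats.ttest_rel(urfound_auroc, runnerup_auroc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```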

Q4 Why UrFound Adapts MAE

We develop UrFound based on MAE for fair comparison and systematic analysis, since the SOTA retinal foundation model, RETFound, is based on MAE. We adopted the same experimental setup as RETFound, ensuring that any improvements stem from the proposed pre-training strategy rather than differences in model architectures or initialized weights.

Q5 Superiority of UrFound

UrFound has two main advantages over existing retinal foundation models: 1. It uses expert knowledge via text supervision to capture domain-specific features, improving representation learning. 2. Unlike existing methods that require separate models for CFP and OCT images, UrFound handles both modalities within a single model.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper “UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling” presents a novel and effective approach for disease detection in retinal images by integrating color fundus photographs (CFP) and optical coherence tomography (OCT) into a unified framework. The method leverages a conditional masked language modeling technique, combining the strengths of existing models like RETFound and FLAIR, and is evaluated extensively across multiple retinal disease benchmarks, demonstrating superior performance. The proposed model is well-organized, with clear research questions and a concise logical flow, and includes a thorough ablation study to verify the efficacy of its components.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I agree that the paper has made some novel contributions. I recommend accept provided that the authors will make the revisions as promised.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



