Abstract
The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline the assessment of retinal diseases. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.7, 3.9, and 2.1 percentage points over their baselines, respectively.
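Although the paper's training objectives are detailed in its Section 2, the image-text contrastive (ITC) loss named in the reviews below is representative of how textual data provides the supervisory signal. Here is a minimal CLIP-style sketch; the projection heads and temperature value are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric image-text contrastive (ITC) loss.

    img_emb, txt_emb: (batch, dim) projections of paired OCT scans and
    report texts; positive pairs lie on the diagonal of the similarity
    matrix, and all other pairs in the batch act as negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```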
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0660_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ronnief1/RetFiner
Link to the Dataset(s)
Duke iAMD: https://people.duke.edu/~sf59/RPEDC_Ophth_2013_dataset.htm
Harvard Glaucoma: https://github.com/Harvard-Ophthalmology-AI-Lab/Harvard-GDP
Noor Eye Hospital: https://hrabbani.site123.me/available-datasets/dataset-for-oct-classification-50-normal-48-amd-50-dme
OCTDL: https://data.mendeley.com/datasets/sncdhf53xc/4
OCTID: https://borealisdata.ca/dataverse/OCTID
NEHUT: https://data.mendeley.com/datasets/8kt969dhx6/1
BibTex
@InProceedings{FecRon_RetFiner_MICCAI2025,
author = { Fecso, Ronald and Morano, José and Schmidt-Erfurth, Ursula and Bogunović, Hrvoje},
title = { { RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {543--553}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a refinement framework to enhance the performance of existing foundational models. Based on this framework and further training on a private dataset, a new benchmark for OCT foundational models is established.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The problem setting considered in this paper is meaningful. Previous OCT foundation models have not evaluated their linear probing performance on downstream tasks, and the improvements under this setting presented in this work contribute to enhancing their generalization and practical value.
- The paper demonstrates the proposed foundation model's superior performance across a range of downstream experiments and commits to releasing the model weights upon the paper's acceptance, which is a valuable contribution to the field.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited methodological novelty: The proposed refinement framework is essentially a combination of widely used loss functions and lacks a significantly novel design. Additionally, there is no analysis of how these optimization objectives assist or interact with each other during joint use. Furthermore, in Section 2.1, the authors state that the MLM objective is introduced to address the false-negative issue in ITC and ITM, but the theoretical rationale for this claim is questionable. Moreover, as shown in Table 3, adding the MLM objective on top of ITC does not yield a performance improvement.
- Insufficient experimental analysis: It remains unclear whether the performance improvement brought by ReRead is due to the proposed refinement framework or to the inclusion of new data. The paper does not report the performance of the MAE-based model pretrained on the private dataset. I have concerns about whether introducing new data with simple optimization objectives on top of existing FMs (considering that previous FMs might serve as better initializations than the private baseline) will actually improve performance. Additionally, building on the first issue above, the ablation studies lack a thorough discussion of the actual contributions of the different optimization objectives; the loss combinations provided in Table 3 are insufficient.
- A new pooling strategy is proposed in this paper. It is important to state whether the comparison models used their original optimal strategies in the downstream experiments, as this is crucial for the fairness of the comparison. While ReRead improves linear probing performance, how does it perform under fine-tuning?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The ReRead model proposed in this paper establishes a new baseline for OCT foundational models, which holds some value. However, the framework presented is merely a naive combination of existing loss functions, lacking substantial novelty. This is the primary limitation of the paper’s contribution.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While the methodological novelty of the paper is limited, the proposed model still holds value for the community. The manuscript is generally well-written and readable, and the authors have designed comprehensive experiments to demonstrate that their model can establish a new baseline for OCT foundation models.
I encourage the authors to reconsider the presentation of the methodology section, particularly to clearly articulate the rationale behind the choice of four loss functions and to cite relevant prior works. The authors’ rebuttal statement that “ReRead is the first to simultaneously employ four loss functions, while others ultimately use only three” is far from sufficient to support claims of methodological novelty. The motivation remains vague and unconvincing.
Finally, and most importantly, I hope the authors will fulfill their promise to release the proposed model. This would constitute the core contribution of the paper.
Review #2
- Please describe the contribution of the paper
This paper proposes ReRead, a vision-language refinement scheme designed to enhance retinal foundation models using paired OCT images and associated electronic health records (EHRs). ReRead integrates multiple training objectives—image-text contrastive (ITC), image-text matching (ITM), masked language modeling (MLM), and generative modeling (GM)—to improve cross-modal alignment and semantic representation. The approach is applied to refine three retinal foundation models (RETFound, UrFound, and VisionFM), using a 100k in-house OCT-EHR dataset. ReRead demonstrates consistent performance gains in linear probing across seven downstream OCT classification datasets, including public and in-house data. The paper includes detailed ablation studies and publishes code and model weights to support reproducibility.
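To make the image-text matching (ITM) objective mentioned above concrete, here is a minimal sketch of a binary matching head over fused image-text features; the fusion encoder and hard-negative mining are assumptions in the style of common vision-language pretraining frameworks, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary classifier: does this image-text pair match?"""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, 2)  # matched vs. mismatched

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, dim) multimodal features, e.g. the [CLS]
        # output of a cross-attention fusion encoder.
        return self.fc(fused)

def itm_loss(head: ITMHead, fused_pos: torch.Tensor,
             fused_neg: torch.Tensor) -> torch.Tensor:
    """Positives are true pairs; negatives are mismatched pairs, often
    hard negatives mined from the ITC similarity matrix."""
    logits = head(torch.cat([fused_pos, fused_neg]))
    labels = torch.cat([
        torch.ones(len(fused_pos), dtype=torch.long),   # matched
        torch.zeros(len(fused_neg), dtype=torch.long),  # mismatched
    ]).to(logits.device)
    return F.cross_entropy(logits, labels)
```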
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper evaluates three foundation models and seven diverse downstream datasets, showing comprehensive empirical validation.
- The ablation study is sufficient and carefully examines the contribution of each component in the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although the authors present ReRead as a vision-language refinement scheme, the method closely resembles a modular vision-language pretraining framework. Related works such as BLIP-2, LiT, and FILIP are not discussed or cited, despite conceptual similarities—particularly with BLIP-2, which also combines ITC, ITM, and GM objectives.
- The performance gains may largely stem from the additional in-house dataset. To fairly assess the contribution of ReRead, the authors should also compare against other modular pretraining approaches using the same data.
- The authors adopt linear probing as the primary evaluation protocol. Why was full fine-tuning not considered or compared?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a weak accept. Despite some missing references and comparative baselines, this paper is well-written, presents a clear and practical method, and is supported by extensive experiments across multiple datasets. Addressing the noted weaknesses would further improve the clarity and rigor of the contribution.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes a refinement scheme for retinal foundation models using paired image-text data. By re-training for fewer than 10 epochs, it improves diagnostic performance for three state-of-the-art OCT foundation models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The idea of the proposed method is simple but effective
- The training cost is limited and controllable
- The generalizability of the proposed method is demonstrated on three different FMs
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors use a 15% mask ratio for the MLM loss but 60% for the GM loss. I wonder whether the results change notably if these ratios are varied? (See the masking sketch after this list.)
- I am curious whether the authors found subgroup preferences of the original FMs and the ReRead-refined FMs for patients in diverse demographics.
- In Fig. 2, the authors present attention maps only for the ReRead-refined FMs. It is recommended to add attention maps for the original FMs and a comparison between the two sets of maps.
- In the Introduction section, the discussion of related work is too lengthy.
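Regarding the masking ratios questioned in the first point of this list, here is a minimal sketch of ratio-dependent token masking, assuming BERT/MAE-style random masking; the 15%/60% values come from the review, while the procedure itself is an assumption:

```python
import torch

def random_token_mask(tokens: torch.Tensor, mask_ratio: float,
                      mask_id: int = 103) -> tuple:
    """Mask a fixed fraction of tokens, returning model inputs and targets.

    tokens: (batch, seq_len) token ids. A ratio of 0.15 would correspond
    to the MLM loss and 0.60 to the GM loss in the reviewed setup; the
    loss is then computed only at masked positions (-100 is ignored by
    PyTorch's cross-entropy).
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    inputs = tokens.masked_fill(mask, mask_id)
    targets = tokens.masked_fill(~mask, -100)
    return inputs, targets
```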
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The idea is simple but effective, the experiment is solid, the manuscript is clear and well organized.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have answered my questions well. I maintain my original stance and support the acceptance of this article.
Author Feedback
We thank the reviewers for their insightful comments and for finding our method effective (R1–3), our models valuable (R1–3), the validation and ablation comprehensive (R2–3), and the paper well-written (R2–3). We address the main concerns below. Although the conference guidelines forbid new experiments, we will consider them for future work.
—Methodological novelty (R1) Methodologically, ReRead is the first approach to combine all 4 losses. Previous methods are either based on contrastive losses only [24] or use at most 3 [14]. ReRead also uses a novel pooling strategy that significantly outperforms regular pooling methods. However, the main novelty of our approach lies in applying vision–language pretraining (VLP) ideas to efficiently enhance existing vision foundation models (FMs) for retina. With ReRead, we aim to shift current FM development and application paradigms (i.e., from-scratch FM training and downstream fine-tuning [FT]) by introducing a second-stage VLP method that exploits unprocessed medical images and their electronic health records (EHRs). To validate ReRead, we conducted the first thorough evaluation of OCT FMs using linear probing (LP), with our effective pooling, on a diverse benchmark. Finally, we believe our public ReRead models will be relevant to the community by providing better baseline FMs.
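For context on the "regular pooling methods" referenced above, the two standard ways of pooling ViT features for linear probing are sketched below; the paper's own pooling strategy is not specified on this page, so only these common baselines are shown:

```python
import torch

def cls_pooling(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, 1 + num_patches, dim); the first token is [CLS]."""
    return tokens[:, 0]

def mean_pooling(tokens: torch.Tensor) -> torch.Tensor:
    """Average over the patch tokens, skipping the [CLS] token."""
    return tokens[:, 1:].mean(dim=1)
```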
—Losses and their interactions (R1) This issue was studied in previous work [4,14,28], which found that while ITC provides strong global unimodal representations, ITM better models fine-grained image–text interactions, and GM and MLM learn fine-grained text representations that improve text understanding. Their combination leads to better multi- and unimodal representations and results in better image classification performance (Table 3) and more accurate attention maps (preliminary analysis). This will be further discussed in the paper. Due to limited space, only some of the loss combinations we tested were shown, although we previously confirmed that removing any of the 4 losses decreased performance.
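A minimal sketch of how the four losses might be combined in practice, assuming a simple weighted sum with equal weights (the actual weighting is not stated on this page):

```python
import torch

def total_loss(l_itc: torch.Tensor, l_itm: torch.Tensor,
               l_mlm: torch.Tensor, l_gm: torch.Tensor,
               w: tuple = (1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the four objectives; equal weights are a
    placeholder assumption."""
    return w[0] * l_itc + w[1] * l_itm + w[2] * l_mlm + w[3] * l_gm
```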
—Improvements from method or data (R1–2) Our ablation (Table 3) shows the performance of our private MAE-pretrained model (MPM) after ReRead tuning with different losses. Since the 100k OCTs with associated EHRs used for ReRead were a subset of the total 261k OCTs used to pretrain MPM, no additional imaging data was introduced. While our pre-analysis showed inferior performance of MPM vs. MPM tuned with ReRead, further confirming that the improvement came from the VL method, we limited the ablation to VL methods for fairness.
—Fine-tuning experiments (R1–2) FT experiments would be valuable. However, due to limited space, we focused on LP, as we think it offers a more precise evaluation of FMs. In LP, all parameters are frozen except for a final linear layer. Thus, the evaluation is less dependent on the adequacy of downstream optimization, and directly determines how discriminative (i.e. meaningful) the features extracted by the FMs are for final tasks.
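A minimal sketch of the linear probing protocol described above, with the optimizer choice and feature dimensions as illustrative assumptions:

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int,
                       num_classes: int):
    """Freeze every backbone parameter; only the linear head trains."""
    for p in backbone.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer

def probe_logits(backbone: nn.Module, head: nn.Linear,
                 images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():           # backbone acts as a frozen extractor
        feats = backbone(images)    # assumed to return pooled (batch, feat_dim)
    return head(feats)
```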
—Evaluation details (R1,R3) All evaluations used our proposed pooling, as we initially observed that it benefited all the FMs. Masking ratios come from previous work [5] and grid search. While the results on diverse datasets indicate robustness to diverse demographics, in-depth analysis would be valuable future work. Only ReRead maps are shown because they are generated using text encoder cross-attentions. We will add self-attention maps of FMs for comparison.
—Related work (R2–3) We thank R2 for the relevant references. While BLIP-2 has similarities with our work, the goal is different. BLIP-2 develops a unified VL model by tuning specialized modules that interact with frozen, separately trained vision and text encoders, whereas ReRead leverages VLP to improve existing vision FMs using text as a supervisory signal. BLIP-2 is nevertheless an interesting baseline for future comparison and will be discussed, along with the other suggested references, when summarizing related work in the Introduction.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers have reached a consensus to accept the paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A