Abstract

Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate.
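For readers unfamiliar with the vision-language objective that such pre-training builds on, a minimal sketch of one CLIP-style contrastive step is given below. The encoder modules and variable names are illustrative placeholders, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_step(image_encoder, text_encoder, images, texts, temperature=0.07):
        # Matched image-text pairs sit on the diagonal of the similarity matrix;
        # the symmetric cross-entropy pulls them together and pushes apart all
        # other pairings in the batch.
        img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
        txt_emb = F.normalize(text_encoder(texts), dim=-1)     # (B, D)
        logits = img_emb @ txt_emb.t() / temperature           # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2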

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0935_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Yuang-Yao/RetCoP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YaoYua_Continual_MICCAI2025,
        author = { Yao, Yuang and Wu, Ruiqi and Zhou, Yi and Zhou, Tao},
        title = { { Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {110--120}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a method for continual pre-training of a foundation model in retinal imaging. Three different retinal imaging modalities are integrated into a joint model in a sequential setup. To align the features, textual guidance (CLIP) is used. The approach combines CLIP with rehearsal and off-diagonal information distillation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and has a good structure.
    • The proposal of joint pre-training in a continual setting is novel in retinal imaging. Such a model can be interesting, especially if learning another domain (e.g., OCT) boosts classification on, e.g., CFP.
    • The chosen baselines are reasonable and a valuable ablation study is provided.
    • Source code available.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • In the paper, the images from OCT and CFP/FFA are encoded by the same encoder. However, OCT depicts different features than CFP/FFA. What is the rationale behind a unified feature extractor? Why not use modality-specific feature extractors while still aligning them through the text encoder?

    • In Section 2.2, Representative Joint Embedding Rehearsal, I think the k-means sampling makes sense and is a reasonable choice. However, I am missing some key numbers. Could the authors provide the hyperparameter settings for it? What value of k? How many samples per cluster? How many samples in the rehearsal buffer? I would recommend adding those details to the experimental setup section.

    • For the embedding rehearsal, I don’t understand the reasoning behind choosing the samples closest to the centroid. Does that not limit the diversity of the samples, since they are not only close to the centroid but also close to each other? How about sampling uniformly over a given cluster?

    • The authors claim that they are inspired by MOD-X [16] for their off-diagonal information distillation. However, I was not able to find the differences between MOD-X and the proposed method. Could the authors please clarify and explain the differences between their off-diagonal information distillation and MOD-X? If there are no differences, please use proper wording such as “we use MOD-X” instead of “inspired by”. Combining MOD-X with rehearsal would still be novel.

    • In 3.1 it is stated that all images are resized. However, would that mean that OCT B-scans, which are usually wide-format, get squeezed? Or is cropping applied?

    • The presentation of results is hard to follow for a CL-focused paper. From my point of view, the most important comparison is the performance at the end of training (that is, after all modalities have been observed). To get this information, one has to jump between Tables 1, 2, and 3. Results after an intermediate stage (stage 1, 2) are interesting for more insight; however, they should not be the focus of the validation. I would recommend reordering the results so that the performance for the three modalities after stage 3 is presented side by side.

    • For Table 1, the difference between the FIVES and the ODIR dataset is explained poorly. Why is the performance on FIVES so much worse than on ODIR? From looking at the code, I suspect the reason is that the inclusion of a subset of ODIR in the training data (there is no data leakage; training and testing are split correctly!) helps the model retain the knowledge better. This also hints at reduced generalization performance across the CFP domain (generalization in terms of datasets within CFP).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    For the results in 3.2. Experimental Results, for future work I would recommend adding baselines like a jointly trained model and single-modality models, as an upper bound for the achievable performance.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper has some strength in the way it is presented and its motivation, I see two major factors that lead to my decision: First, it is not clear what parts are novel (i.e. “inspired by”) and what are not. Second, the presentation of results is hard to follow for a CL-focused paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I made my final decision as reject. The explanations about the buffer size (20k is not a “slim” rehearsal buffer) and the choice of samples are not convincing. The case for the unified encoder is not clear; the mentioned MedCoSS paper uses specialized tokenizers for 1D, 2D, and 3D data.



Review #2

  • Please describe the contribution of the paper

    This paper presents a continual vision-language pre-training framework for the retinal domain, which incrementally integrates multiple fundus imaging modalities to construct a unified foundation model. To address the catastrophic forgetting, the method leverages a rehearsal strategy using representative image-text pairs and introduces an off-diagonal information distillation mechanism.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novelty of the setting: This paper introduces the first work of a continual vision-language pretraining framework tailored to the retinal domain. The motivation to unify multiple fundus imaging modalities into a single foundation model is clearly articulated in practical clinical imaging.
    2. Clarity in Methodological Motivation: The stages and most components of the proposed framework are well motivated. For instance, the off-diagonal information distillation module is introduced to address the drift in the alignment between visual and textual features. Its role in mitigating misalignment in the representation space is clearly explained.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited baseline comparison: The experimental section does not include comparisons with recent and relevant works such as [1] and [2], which also tackle continual learning in multi-modal medical settings even if they are not CLIP-based. Note that the most recent baseline included in the paper is from 2019. Considering the rapid development of foundation models, especially those built on CLIP-like vision-language architectures, it would strengthen the study to include more recent CLIP-based continual learning studies such as [3] and [4].
    2. Ablation study needs clarification: In Table 4, the discussion refers to “increased the model’s average forgetting rate”, yet the table reports accuracy-related metrics like ACC and AUC. It is unclear how these metrics are linked to forgetting, making the interpretation difficult to follow.
    3. Insufficient metric definitions: Section 3.1 lacks clear definitions for the evaluation metrics used. For instance, if ACC and AUC are used, they should be defined mathematically or described explicitly to ensure reproducibility and clarity.
    4. Privacy concerns with the rehearsal-based strategy: The use of a data rehearsal strategy, where representative samples are stored and replayed, could raise practical and ethical concerns in the medical domain, due to the sensitivity of patient data. Alternatives such as feature-level replay or synthetic memory could be considered as more privacy-preserving options.

    [1] Ye, Y., Xie, Y., Zhang, J., Chen, Z., Wu, Q. and Xia, Y., 2024. Continual self-supervised learning: Towards universal multi-modal medical data representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11114-11124).
    [2] Liao, W., Xiong, H., Wang, Q., Mo, Y., Li, X., Liu, Y., Chen, Z., Huang, S. and Dou, D., 2022, September. MUSCLE: Multi-task self-supervised continual learning to pre-train deep models for x-ray images of multiple body parts. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 151-161). Cham: Springer Nature Switzerland.
    [3] Zheng, Z., Ma, M., Wang, K., Qin, Z., Yue, X. and You, Y., 2023. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 19125-19136).
    [4] Ding, Y., Liu, L., Tian, C., Yang, J. and Ding, H., 2022. Don’t stop learning: Towards continual learning for the CLIP model. arXiv preprint arXiv:2207.09248.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While this paper presents an interesting setting for continual vision-language pretraining in the retinal domain, I believe the current experimental validation is insufficient to support a strong claim of method superiority. In particular, the absence of comparisons with recent and relevant baselines, especially CLIP-based continual learning methods, limits the ability to assess the effectiveness of the proposed method. Additionally, some confusion remains in the interpretation of ablation results and evaluation metric definitions. If the authors can address these concerns, I would be open to reconsidering my score in a future revision.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My main concerns pertained to the lack of experimental comparisons with recent baselines (all baselines are from 2019 or earlier); the rebuttal claims the method outperforms MedCoSS (2024) and ZSCL (2023). Other concerns are addressed to some extent. Since I find this continual pre-training problem quite interesting in the field, I recommend an accept.



Review #3

  • Please describe the contribution of the paper

    The paper presents RetCoP, a novel and timely framework for continual vision-language pre-training in the fundus imaging domain, addressing the practical challenge of incremental availability of multi-modal retinal data in real-world clinical environments. The main contributions of the work are as follows:

    1. Novel Problem Formulation: The authors are the first, to the best of my knowledge, to tackle continual pre-training across imaging modalities in retinal vision-language tasks. Unlike existing retinal foundation models that focus on single modalities, this work models incremental multi-modal learning, which is more aligned with clinical data acquisition processes.
    2. Methodological Innovation: The proposed framework introduces two effective mechanisms to mitigate catastrophic forgetting during continual pre-training: A Representative Joint Embedding Rehearsal strategy that selects and reuses key image-text pairs from previous modalities based on joint embedding similarity and K-means clustering. An Off-Diagonal Information Distillation (ODID) module that preserves semantic alignment by distilling similarity distributions across training stages, focusing on non-diagonal similarity patterns to maintain inter-modal consistency.
    3. Comprehensive Evaluation: The authors conduct extensive experiments on multiple public fundus datasets (CFP, FFA, OCT modalities), demonstrating consistent superiority of RetCoP over competitive baselines (e.g., LWF, EWC, ICARL, ER, SeqFT) across various evaluation settings including zero-shot, linear probe, and CLIP-adapter (a minimal zero-shot scoring sketch is given after this list). The method shows lower forgetting rates and better generalization, particularly under modality shifts.
    4. Ablation Studies and Insightful Analysis: The paper includes detailed ablation studies that verify the contribution of each proposed component. In particular, the benefit of retaining the text encoder weights across stages and the necessity of both rehearsal and distillation strategies are well justified.
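    As referenced in item 3, a minimal sketch of zero-shot scoring with a trained image-text model is shown below; the encoders and prompt handling are illustrative placeholders, not the authors' code.

        import torch
        import torch.nn.functional as F

        @torch.no_grad()
        def zero_shot_predict(image_encoder, text_encoder, images, class_prompts):
            # Score each image against one tokenized text prompt per class
            # (e.g. "a fundus photograph of diabetic retinopathy") and take the argmax.
            txt = F.normalize(text_encoder(class_prompts), dim=-1)   # (C, D)
            img = F.normalize(image_encoder(images), dim=-1)         # (B, D)
            return (img @ txt.t()).argmax(dim=-1)                    # (B,) class indices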
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One of the major strengths of this paper lies in its novel problem formulation of modality-incremental vision-language pre-training in the retinal imaging domain, which is both technically underexplored and highly relevant to real-world clinical workflows. Unlike previous methods that assume all imaging modalities are available upfront or focus on single-modal pre-training, this work uniquely models a realistic scenario where data from different imaging modalities (e.g., CFP, FFA, OCT) arrive sequentially over time, necessitating continual updates to the foundation model. This is particularly important in clinical environments where medical institutions often acquire new imaging data asynchronously due to differences in equipment, diagnostic focus, or patient cohorts. By designing a framework that can incrementally integrate image-text pairs from different modalities without catastrophic forgetting, the authors move beyond traditional static pre-training approaches and offer a more flexible, clinically feasible paradigm for building robust and scalable vision-language foundation models in medicine. This formulation alone opens a new direction for continual learning research in multi-modal medical AI.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    A notable weakness of the paper is the limited novelty in the core contrastive learning framework and the lack of sufficient comparison to recent vision-language pre-training methods tailored for medical data. While the proposed RetCoP framework introduces new components such as representative joint embedding rehearsal and off-diagonal information distillation, the overall training paradigm closely follows the standard CLIP-style contrastive loss and rehearsal-based continual learning, which have been previously explored in both natural and medical domains. For example, MedCoSS (Ye et al., CVPR 2024) has already introduced a rehearsal-based continual self-supervised learning framework in the medical imaging context, and MOD-X (Ni et al., ICML 2023) proposed the off-diagonal distillation idea for continual representation learning; RetCoP extends it to the vision-language setting but does not thoroughly contrast or benchmark against these methods in terms of representational stability or cross-modal alignment. Furthermore, although the authors claim clinical feasibility due to incremental modality integration, the paper does not include any qualitative results to demonstrate that the model indeed captures semantically consistent cross-modal features across stages. Including more task-specific metrics or downstream clinical outcomes would strengthen the practical relevance of the method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a timely and well-motivated contribution by addressing the challenge of continual vision-language pre-training across incrementally arriving imaging modalities in the retinal domain — a realistic and clinically relevant setting that is underexplored in the current literature. The proposed RetCoP framework is thoughtfully designed, with two novel components (joint embedding rehearsal and off-diagonal distillation) that are both intuitive and empirically effective. The evaluation is comprehensive, covering multiple datasets, modalities, and settings (zero-shot, linear probe, CLIP-adapter), and demonstrates strong performance and minimal forgetting compared to several competitive baselines. While the core contrastive pre-training setup builds upon known techniques, the adaptation to the medical multi-modal continual learning scenario is original and clearly justified. A more thorough comparison to closely related medical vision-language frameworks and qualitative or task-specific clinical analyses would further strengthen the work, but the current contribution is already significant, well-executed, and relevant to both the MICCAI and broader medical foundation model communities.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Thanks for all the constructive suggestions.

To R1: Q1: Unified encoder A1: A unified encoder supports incremental modalities without configuring a specific extractor for each modality. Current model backbones provide enough capacity for encoding diverse modalities, and many works, e.g., MedCoSS [28], adopt this design. Sharing encoder parameters enables intermediate layers to implicitly correlate and compensate features across modalities. We conducted experiments showing the unified encoder performs as well as modality-specific ones.

Q2: K-means sampling A2: K is 2000, samples per cluster roughly vary from 90 to 450, and the buffer size is 20k. We ensure sample diversity by setting a large number of clusters, and then guarantee representativeness within each cluster. Uniform sampling risks selecting non-representative cluster peripheries.
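For concreteness, a minimal sketch of such centroid-nearest buffer construction is given below, using scikit-learn's KMeans on the joint image-text embeddings. The function name and the fixed per-cluster quota are illustrative assumptions; only K = 2000 and the 20k buffer size come from the reply above.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_rehearsal_buffer(joint_embeddings, n_clusters=2000, buffer_size=20000):
        # Cluster joint image-text embeddings; many clusters give diversity, while
        # keeping only the points nearest each centroid keeps the buffer representative.
        km = KMeans(n_clusters=n_clusters, random_state=0).fit(joint_embeddings)
        per_cluster = max(1, buffer_size // n_clusters)  # e.g. 10 pairs per cluster
        selected = []
        for c in range(n_clusters):
            idx = np.where(km.labels_ == c)[0]
            if idx.size == 0:
                continue
            dist = np.linalg.norm(joint_embeddings[idx] - km.cluster_centers_[c], axis=1)
            selected.extend(idx[np.argsort(dist)[:per_cluster]].tolist())
        return selected  # indices of image-text pairs to replay in later stages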

Q3: ODID & novelty clarification A3: Thanks for the correction on wording. The contribution here is to uncover the underexplored synergy between rehearsal and MOD-X in continual pretraining. Overall, our key innovations are:

  1. The first continual vision-language pretraining framework in the fundus domain.
  2. A novel joint embedding-based rehearsal that synergizes with MOD-X. MOD-X enhances the encoder’s knowledge preservation across stages, while rehearsal leverages these knowledge-retentive embeddings to select more representative samples. This complementary combination of sample-level rehearsal and feature-level distillation effectively mitigates forgetting (a minimal sketch of such a distillation term follows this list).
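As a rough illustration (our reading of the MOD-X-style idea, not necessarily the exact loss used in the paper), an off-diagonal distillation term can be written as a KL divergence between the previous-stage model's image-text similarity rows and the current model's on replayed pairs, so that the relative (off-diagonal) similarity structure, not just the matched pairs, is preserved:

    import torch
    import torch.nn.functional as F

    def off_diagonal_distillation(img_new, txt_new, img_old, txt_old, tau=0.07):
        # Similarity matrices of the current model and the frozen previous-stage model
        # on a batch of replayed image-text pairs; off-diagonal entries encode how
        # each image relates to the *other* texts in the batch.
        sim_new = F.normalize(img_new, dim=-1) @ F.normalize(txt_new, dim=-1).t()
        sim_old = F.normalize(img_old, dim=-1) @ F.normalize(txt_old, dim=-1).t()
        return F.kl_div(F.log_softmax(sim_new / tau, dim=-1),
                        F.softmax(sim_old.detach() / tau, dim=-1),
                        reduction="batchmean")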

Q4: OCT image resizing A4: OCT images are resized with isotropic scaling and zero-padding (black background), preserving aspect ratio and structural integrity without squeezing, cropping, or distortion. Details have been added.
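A minimal sketch of this isotropic-scale-then-pad preprocessing, written with Pillow, is shown below; the 512-pixel target size is an illustrative assumption, not the paper's setting.

    from PIL import Image

    def resize_with_padding(img: Image.Image, target: int = 512) -> Image.Image:
        # Scale the longer side to `target` while preserving aspect ratio, then paste
        # onto a black square canvas so wide OCT B-scans are neither squeezed nor cropped.
        w, h = img.size
        scale = target / max(w, h)
        resized = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
        canvas = Image.new(img.mode, (target, target), 0)
        canvas.paste(resized, ((target - resized.width) // 2, (target - resized.height) // 2))
        return canvas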

Q5: Results presentation A5: We have aligned the presentation with CL-focused papers by revising it as follows: (1) consolidating the end-of-training performance for all three modalities into a single table, (2) visualizing intermediate results per modality with line graphs, (3) adding joint-training and single-modality baselines.

Q6: FIVES & ODIR difference A6: The model was never trained on FIVES, but was partially trained on ODIR with three unseen categories withheld. ODIR’s partially seen distribution led to better test performance than the fully unseen FIVES. Both results show strong generalization to new datasets (FIVES) and new categories (ODIR).

To R2: Q1: Experimental comparison A1:

  1. Our VL contrastive pretraining achieves much higher average ACC (+20% CFP, +43.2% FFA, +41% OCT) than [1], since image-text paired data introduce valuable aligned knowledge compared to image-only SSL methods.
  2. Recent CLIP-based continual learning studies [3,4] primarily focus on adapting pretrained models rather than continual pretraining from scratch. These methods emphasize preserving the original zero-shot performance on simple downstream class-incremental tasks, but are less effective for our challenge of continually pretraining a foundation model from scratch in medical imaging. Our method reduces average forgetting (-5.6% ACC & -3.9% AUC) compared to [3]. All the suggested references will be properly compared and cited. As [2,4] lack open-source code, we will reimplement them later.

Q2: Ablation results & evaluation metric A2: Forgetting is measured as Δ(ACC/AUC), now clearly marked with ↑/↓ in the revised ablation tables. ACC/AUC definitions will be included.
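For reference, one standard continual-learning way to compute such a forgetting score (not necessarily the exact Δ used in the paper) is the average drop from the best accuracy a previously seen modality ever reached to its accuracy after the final stage T:

    F = \frac{1}{T-1} \sum_{i=1}^{T-1} \Big( \max_{t \in \{1, \dots, T-1\}} a_{t,i} - a_{T,i} \Big)

where a_{t,i} denotes ACC (or AUC) on modality i after training stage t.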

Q3: Privacy concerns of rehearsal-based strategy A3: Our current study only uses public datasets, without private data involved. We appreciate your suggestion and will consider them in future work.

To R3: Thanks for your support of our work. Q1: More comparison A1: As addressed in R2’s Q1, our comparative results against [1] MedCoSS (Ye et al., CVPR2024) and [3] ZSCL (Zheng et al., ICCV2023) further validate RetCoP’s superior performance.

Q2: Qualitative cross-modal feature & clinical outcomes A2: We added heatmap visualization of cross-modal feature consistency. RetCoP is transferable to various clinical tasks, like lesion detection and multi-disease diagnosis, which future studies will validate.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


