Abstract

Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at https://github.com/MICV-yonsei/PRETI

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0634_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MICV-yonsei/PRETI

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LeeYeo_PRETI_MICCAI2025,
        author = { Lee, Yeonkyung and Han, Woojung and Jun, Youngjun and Kim, Hyeonmin and Cho, Jungkyung and Hwang, Seong Jae},
        title = { { PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {526--536}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces Learnable Metadata Embedding (LME) for dynamically refining metadata representations, and Retina-Aware Adaptive Masking (RAAM) with a dynamic masking ratio for improved foundation model performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Integration of age and gender metadata
    • Extensive evaluation against recent ViTs
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • RAAM appears a minor contribution
    • Non-public dataset used in training affects reproducibility
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. In Section 2.1, it is claimed that grouping CFPs potentially spans different scanners (cameras); was this actually the case for any of the datasets used in the evaluation?

    2. In Section 2.2, it is stated that the background (non-retinal) area patches are not included in the adaptive masking strategy. It might be clarified as to whether the background patches are then always non-masked but provided as input to the model, and how partial-background patches are defined (background or retinal).

    3. In Section 2.2, a cosine decay masking ratio is described. It might be clarified as to whether the decay function was empirically determined.

    4. In Table 1, an ROC of about 0.50 is reported for RetFound on Glaucoma and ViT prediction, which is essentially random. It might be checked as to whether the models were properly trained; if possible, results using baseline CNN models should also be reported.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed extensions appear relatively minor. In particular, it is not known if simply combining metadata with the output of a ViT through ensembling would also give similar results.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel ophthalmic foundation model that integrates color fundus photography with patient meta data, specifically age and sex. The model is built around three key design components tailored to this multimodal setting, and its performance is evaluated on a range of downstream tasks, including ocular disease classification and the prediction of systemic biomarkers. Unlike previous ophthalmic foundation models that typically incorporate diagnostic reports for external context, this work explores the integration of structured meta data, which is both novel and clinically meaningful.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The use of structured meta data (age and sex) is a significant strength. Previous ophthalmic foundation models have primarily relied on diagnostic reports, which are more complex and less standardized. Incorporating meta data provides a more interpretable and broadly applicable source of patient context.

    2. The paper introduces several data-specific architectural innovations, such as patient-level paired data pretraining, Learnable Metadata Embedding, and Retina-Aware Adaptive Masking, all of which are logical and well-suited to the problem setting.

    3. The paper is clearly written, and both the methodology and results are presented in a mature and coherent way. The inclusion of attention map visualizations adds an extra layer of interpretability, making the overall work more comprehensive and transparent.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of robustness analysis: The primary concern lies in the evaluation methodology. The results are reported from a single experimental run, which introduces uncertainty due to potential random fluctuations. As demonstrated in other foundation models such as RETFound, it is critical to evaluate performance across multiple random seeds and report statistical significance (e.g., p-values) to establish robustness and reliability.

    2. Ambiguity in dataset description: The paper states that “PRETI was trained on an in-house dataset comprising 1,017,549 CFPs from 292,006 patients across six medical institutions including UK Biobank.” This raises a concern: why is UK Biobank considered part of an “in-house” dataset as it is publicly available? A more precise and transparent characterization of the dataset, such as data sources, population distribution, and access conditions, is needed.

    3. The captions for the main tables could be made more informative. For example, Table 1 should clearly indicate that the results are from the in-house dataset, whereas Table 2 should emphasize that the results are based on public datasets.

    4. The choice of Coronary Artery Calcium (CAC) score and estimated Glomerular Filtration Rate (eGFR) as downstream biomarkers is not well-justified. Are these selected due to their strong correlation with age and sex metadata? What is the clinical or physiological rationale for using these biomarkers in conjunction with retinal imaging?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a well-motivated and novel approach by introducing meta data into the pretraining of a retinal foundation model, supported by thoughtful architectural innovations. The methodological presentation is clear and the use of attention visualization enhances interpretability. However, the experimental section lacks robustness analyses and suffers from insufficient dataset transparency and justification for biomarker choices. Addressing these limitations would greatly strengthen the paper’s impact and reliability.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents PRETI, a novel foundation model for retinal image analysis that integrates metadata (age, gender) directly into its learning process. It is built upon a self-supervised learning paradigm, leveraging an enhanced SiamMAE backbone. The model is trained on over a million images from patients and evaluated on diverse tasks (e.g., DR, AMD, glaucoma, CAC, eGFR).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Building a foundation model with metadata is extremely clinically meaningful and useful.
    2. The experiments are strong enough to prove the efficacy of the models.
    3. The results are satisfactory.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    One minor weakness is that there might be metadata beyond the fields used here, and those could also be meaningful in determining the prediction outcome.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It would be great to have the trained model released so that users or clinicians can directly verify the performance on individual datasets.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the strengths and weaknesses of the paper above. The paper proposes a very clinically relevant solution to the problem. The results are also strong and meaningful.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their feedback and support. Below, we respond to the main concerns and aim to clear up any possible misconceptions:

R3: Robustness of evaluation. We fix the random seed to 0 to ensure consistent comparisons across experimental settings. As the reviewer pointed out, we fully recognize that experiments conducted with a single random seed may not eliminate the possibility of stochastic variation. Due to rebuttal constraints, we could not include new experimental results, but we acknowledge this concern and plan to address it with comprehensive multi-seed evaluations in future work.

R3: Clarity of data description. We use a non-public dataset because our method requires patient-level image pairs along with age and gender metadata, and images from multiple scanners to learn robust representations. Such comprehensive data is rarely available in existing public fundus datasets, making in-house data essential for this approach. The UK Biobank dataset is public, but we applied internal processing for our purposes and referred to it as in-house. We acknowledge that the term "in-house" may cause confusion and will revise the phrasing for clarity. While we are unable to share detailed population statistics due to institutional constraints, we release the full code, pretrained weights, and implementation details to support reproducibility.

R3: Justification for certain biomarker selections. Previous studies have shown that CAC and eGFR can be predicted from retinal images. These biomarkers also strongly correlate with metadata such as age and sex, making them suitable and representative endpoints for our experiments.

R1: Grouping CFPs in the evaluation. We clarify that grouping CFPs to handle scanner variability (Sec 2.1) is used only during pre-training. At evaluation, our method takes a single retinal image without grouping. We will further clarify this distinction in the final manuscript.

R1: Contribution of LME compared to ensembling. Combining metadata only with the output of a ViT model treats image features and metadata as independent, which limits the model's ability to learn interactions between them during training. In contrast, our method incorporates metadata at the input stage, allowing the encoder to learn patient-specific representations from the outset. This design choice is motivated by prior studies demonstrating that joint integration of auxiliary information yields more effective representation learning than combining modalities only at the output stage, particularly in medical imaging, where metadata can significantly influence visual interpretation.

R1: Clarification on RAAM. We apply adaptive unmasking only within the retinal region, ignoring background patches. A patch is considered retinal if over 50% of its pixels belong to the retinal area. We use cosine decay for the masking ratio to ensure stable early training and gradual detail learning. In our experiments, linear decay led to rapid early drops and performance degradation, making cosine decay the more effective choice.

R1: Verification of the baseline results. The low AUROC of RETFound on Glaucoma does not indicate a training failure, as the model performs well on other tasks under the same setup (DR: 0.871, CAC: 0.783). In addition, our in-house dataset, collected from six hospitals with varying scanners, introduces greater heterogeneity than the public GF dataset, which originates from a single site. This increased variability may have further contributed to the performance drop. We excluded CNN baselines to maintain consistency with ViT-based comparisons. Although additional experiments were not feasible during the rebuttal phase, we will consider them in future work.

R2: Metadata expansion. We show that even basic metadata can enhance representation learning when integrated into the model. Incorporating richer metadata is a natural extension of our approach and could further broaden its applicability.
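The two RAAM mechanics described in the rebuttal (the >50%-retinal-pixel rule for deciding which patches participate in masking, and the cosine-decay schedule for the masking ratio) can be sketched as below. This is an illustrative reconstruction, not the authors' released code: the schedule endpoints `r_start`/`r_end`, the patch size, and the function names are assumptions.

```python
import numpy as np

def masking_ratio(step, total_steps, r_start=0.9, r_end=0.6):
    """Cosine-decay masking ratio: starts at r_start, decays to r_end.

    The endpoint values are illustrative, not taken from the paper.
    """
    t = step / max(total_steps - 1, 1)  # progress in [0, 1]
    return r_end + 0.5 * (r_start - r_end) * (1.0 + np.cos(np.pi * t))

def retinal_patches(retina_mask, patch_size=16, threshold=0.5):
    """Boolean grid over patches: True where >threshold of pixels are retinal.

    retina_mask: (H, W) binary array, 1 = retinal pixel, 0 = background.
    Only patches marked True would be candidates for masking under RAAM.
    """
    H, W = retina_mask.shape
    grid = retina_mask.reshape(H // patch_size, patch_size,
                               W // patch_size, patch_size)
    frac_retinal = grid.mean(axis=(1, 3))  # per-patch retinal-pixel fraction
    return frac_retinal > threshold
```

Masking would then be applied by sampling `masking_ratio(step, total_steps)` of the `True` patches at each training step, leaving background patches untouched.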




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The reviewers appreciated the novelty of integrating structured metadata into a foundation model for retinal imaging and recognized the clear presentation, strong experimental results, and clinical relevance of the work.

    Some concerns were raised regarding the robustness of evaluation (e.g., performance across multiple runs), clarity of dataset description, and justification for certain biomarker selections. We encourage you to address these points carefully in the final version to further strengthen the work’s transparency and reliability.


