Abstract
The scarcity of high-quality, labelled retinal imaging data presents a significant challenge to the development of machine learning models for ophthalmology and hinders progress in the field. Existing methods for synthesising Colour Fundus Photographs (CFPs) largely rely on predefined disease labels, which restricts their ability to generate images that reflect fine-grained anatomical variations, subtle disease stages, and diverse pathological features beyond coarse class categories. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale captioned retinal dataset of 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses a visual language model (VLM) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Building on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Through extensive experiments, our method demonstrates superior performance across multiple datasets, with 62.07% of text-driven synthetic CFPs indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 5%-10% in diabetic retinopathy grading and glaucoma detection. Code is available at https://github.com/uni-medical/retina-text2cfp.
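For readers who want a feel for caption-driven CFP sampling, a minimal sketch follows. It uses the public PixArt-alpha checkpoint as a DiT-based stand-in; the actual RetinaLogos pipeline and weights (see the repository above) may differ, and the caption is an invented example in the style of RetinaLogos-1400k.

```python
# Minimal sketch of caption-driven CFP sampling with a DiT-style text-to-image
# pipeline. RetinaLogos is DiT-based, but the checkpoint below is a generic
# public stand-in (PixArt-alpha), NOT the authors' released model -- see
# https://github.com/uni-medical/retina-text2cfp for the actual weights.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# A fine-grained clinical caption of the kind RetinaLogos-1400k contains
# (invented example for illustration).
caption = (
    "Colour fundus photograph: moderate non-proliferative diabetic "
    "retinopathy with scattered microaneurysms and dot-blot haemorrhages "
    "along the temporal arcades; optic disc with a cup-to-disc ratio of 0.4."
)

image = pipe(caption, num_inference_steps=20).images[0]
image.save("synthetic_cfp.png")
```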
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0673_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{NinJun_RetinaLogos_MICCAI2025,
author = { Ning, Junzhi and Tang, Cheng and Zhou, Kaijing and Song, Diping and Liu, Lihao and Hu, Ming and Li, Wei and Xu, Huihui and Su, Yanzhou and Li, Tianbin and Liu, Jiyao and Ye, Jin and Zhang, Sheng and Ji, Yuanfeng and He, Junjun},
title = { { RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15975},
month = {September},
pages = {474 -- 484}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper enables a pipeline for the generation of a large synthetic retinal image dataset, using LLMs and captions for retinal conditions. The generated synthetic images were shown to improve DR and glaucoma classification by a significant amount.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Provision of large synthetic caption retinal dataset
- Potential to improve classification for rare conditions
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Training data (image and EHR) source and methodology unclear
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
-
While the main contribution of the paper is a large dataset (RetinaLogos) of 1.4 million retinal images, it is not clear whether these are synthetic generated images, or real-world images. In particular, Section 1 claims that it is a “synthetic retinal caption dataset”, while Section 2.1 claims that they are “real-world fundus images sourced from both open-access and private datasets”. This should be carefully clarified.
-
If the 1.4 million images in RetinaLogos are indeed synthetically generated, then the training procedure for the LLM and CLIP models (Figure 2b) should be stated. Section 3.1 is titled “Dataset and Training Details”, but the text that follows covers only the evaluation of generated images and implementation details, without the training itself.
-
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While a true 1.4-million synthetic fundus image dataset would be a contribution to the field, clarity on the source of the training data, as well as on the training methodology, is lacking. This recommendation hopes to give the authors an opportunity to clarify these points.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Authors have addressed our prior doubts on the nature of the dataset. The scale of the dataset should contribute to the field of retinal image analysis.
Review #2
- Please describe the contribution of the paper
This paper introduces a large synthesized dataset, generated with a text-to-image method based on the Diffusion Transformer (DiT) architecture. The quality of the generated results is further improved by the paper’s filtering and semantic-refinement method.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strengths of the paper lie in the use of text-to-image generation methods for fundus image generation. The use of text-to-image generation allowed for a more robust model, capable of generating images with flexible input conditions. Specifically, the generation model used did not rely on fixed conditional class inputs, such as specific disease gradings, but rather relied on flexible text inputs that may or may not include the different class conditions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper presents relevant quantitative and qualitative results demonstrating the effectiveness of textual conditions in generating diverse outputs. However, it also presents several results that do not aid understanding, some of which even undermine the methods used in the paper.
-
The first such result is Figure 3d, which shows resolution enhancement. The figure description is inadequate: it does not explain where these images came from or what they are meant to demonstrate.
-
The second is in the ablation study, which tests the relevance of the paper’s generation methods. The numerical evidence shows that the PL and HR components contributed no improvement in FID or CLIP scores, yet they are still presented as relevant contributions to the study. Further numerical evidence is required to support the relevance of these methods.
-
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
There were a few numerical issues present in the paper that should be addressed.
- The results state that model accuracy improved by 10–25%, when in reality it improved by only 7–10% for each backbone model used.
- If the previous results refer to an improvement in F1-Score, the paper should specifically state so.
- The legibility of Table 3 is low, and certain aspects of the table should be clarified. For example, the term (ground truth) should not be applied to synthetic images.
- Other data-synthesis papers also use GAN discriminators to evaluate the quality and authenticity of synthesized images. The discriminator score can be used as a better alternative to FID and KID.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents strong contributions to the domain of medical research. It provides relevant progress: fundus images can be synthetically generated to supply a large dataset, which is usually challenging to obtain. Furthermore, the synthesis method contributes to the paper’s novelty by using textual conditions, which gave significantly better performance than class conditions. However, the paper fails to support some of the methods mentioned, negatively affecting its overall novelty. Figures such as 3(a) should be clarified so readers can understand the significance of the results, and additional explanation is needed for the necessity and relevance of modules that produced negative results in the ablation study.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The paper’s novelty lies in its provision of a large synthetic dataset that helped improve model performance on held-out test datasets. There were a few major concerns, which, after clarification, proved to stem from the paper’s limited clarity. The rebuttal clarified the following points:
- The figures and paper do not demonstrate upscaling of pre-existing images; rather, the model leverages an earlier-stage model’s capabilities to render upscaled synthetic images.
- The ablation study serves as evidence of how the synthetic image dataset contributes to better-learned models, compared with models that merely received further training.
- Additional metrics (FID, KID, IS, and discriminator score) were measured and presented in the rebuttal, showing that the proposed method improves significantly on the baseline models.
As such, the paper’s synthetic dataset, which improves the scores of downstream models, is of relevance for future research and discussion.
Review #3
- Please describe the contribution of the paper
This paper introduces RetinaLogos, a novel text-to-image framework designed to synthesize high-resolution retinal fundus images from fine-grained captions. The authors construct a massive dataset (RetinaLogos-1400k) of 1.4 million image-caption pairs and develop a three-stage training framework. The generated images are validated both quantitatively and qualitatively, including expert evaluations and improvements in downstream clinical tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) Large-scale dataset construction and a good engineering of a full pipeline. (2) Clinical grounding through expert evaluation and real downstream performance gains. (3) Potentially impactful for synthetic medical imaging, especially in under-resourced settings.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The proposed pipeline is well-engineered and contributes meaningfully to ophthalmic AI. However, from a technical innovation perspective, the methodology primarily integrates existing components rather than introducing new generative mechanisms or architectural changes. The work is solid in execution but moderate in novelty.
- The contribution is more practical than methodological.
- The expert evaluation adds clinical credibility, but it lacks information about the number of clinicians involved; the sample size and selection process for expert scoring should be explicitly stated.
- Since private datasets and GPT-based generation are used, a discussion is needed on: How diverse or redundant these captions are across disease classes. How many captions were reviewed and modified by ophthalmologists vs. accepted as-is. Note that LLM-generated captions might embed hallucinated clinical details.
- Comparisons with recent state-of-the-art generative models are limited.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents a well-executed and clinically motivated application of large-scale text-to-image generation in ophthalmology. While it does not introduce new generative techniques, its integration of LLMs, clinical context, and detailed evaluation make it a valuable contribution to the field.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed all of my concerns effectively. Overall, the paper presents a good contribution and is deserving of acceptance.
Author Feedback
Thank you for the feedback: Reviewers 1 and 2 praise our 1.4 million semantically refined retinal image–caption dataset and rare-condition gains, while Reviewer 3 highlights our expert validation and clinical improvements; our responses follow.
R1:
- Purpose of Fig.3d: We apologize for the oversight. Fig. 3d shows simplified CFPs generated under different settings to illustrate resolution enhancement. We will update its caption to make this clear.
- Ablation Modules: PL is a diagnostic control that probes the baseline ceiling (FID 240.406/247.135/251.953 on APTOS/EyePACs/AIROGS; CLIP 0.5398) and confirms that extra training alone yields no gains. SR is the key quality module: adding SR after PL cuts FID to 68.79/73.64/61.46 and raises CLIP to 0.5561, showing that improvements stem from data-quality filtering rather than training length. HR transfers SR’s benefits to outputs of up to 1024² with flexible aspect ratios; with SR (Tab. 4, Exp. IV) it further reduces FID and boosts CLIP while enabling high-resolution CFPs. In short, PT + PL set the baseline and identify the limit, SR ensures image quality, and HR enables high resolution.
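For reference, a minimal sketch of how FID and CLIP scores of the kind cited above can be computed with torchmetrics; this is an illustrative stand-in for the authors’ evaluation code, run here on dummy tensors rather than real or generated CFPs.

```python
# Illustrative FID / CLIP-score computation with torchmetrics (requires
# torchmetrics and transformers). Dummy uint8 batches stand in for real
# and generated CFPs; estimates on 16 samples are noisy and only show usage.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
captions = ["colour fundus photograph, mild NPDR"] * 16  # placeholder captions

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())  # lower is better

clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip(fake_images, captions).item())  # higher is better
```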
- Reporting & Presentation Corrections: We apologize for the inconsistency. The 10–25% figures denote relative accuracy gains (e.g., 0.4563→0.5533 with ViT-B/16, ≈21%), while the 7–10% values are absolute increases. We have clarified these definitions, corrected the numbers, and removed the minor ground-truth errors in Tab. 3 for a cleaner presentation.
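The distinction between the two reporting conventions can be made concrete with the ViT-B/16 numbers quoted in the response above:

```python
# Worked example of relative vs. absolute accuracy gains, using the
# ViT-B/16 numbers from the rebuttal (0.4563 -> 0.5533).
baseline, with_synthetic = 0.4563, 0.5533

absolute_gain = with_synthetic - baseline               # 0.097 -> ~9.7 points
relative_gain = (with_synthetic - baseline) / baseline  # 0.2126 -> ~21%

print(f"absolute: +{absolute_gain:.4f} ({absolute_gain * 100:.1f} points)")
print(f"relative: +{relative_gain * 100:.1f}%")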
- Alternative Evaluation Metrics: We trained a GAN discriminator on the real IDRiD dataset. Scores were 0.5189 (SynFundus-1M) and 0.7552 (RetinaLogos), confirming our model’s superior fidelity. We will cite the reference.
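A minimal sketch of the discriminator-score protocol described above: a discriminator from a GAN trained on real CFPs is frozen and used to score synthetic images, with a higher mean output meaning the synthetic set looks more “real”. The architecture and the scoring batch below are illustrative placeholders, not the authors’ network or data.

```python
# Discriminator-score sketch: score synthetic images with a frozen GAN
# discriminator. The tiny CNN and random batch are placeholders; in the
# rebuttal's protocol the discriminator comes from a GAN trained on IDRiD.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def discriminator_score(disc: nn.Module, synthetic: torch.Tensor) -> float:
    """Mean P(real) that a trained discriminator assigns to a synthetic batch."""
    disc.eval()
    return torch.sigmoid(disc(synthetic)).mean().item()

disc = Discriminator()                   # stands in for trained GAN weights
fake_batch = torch.rand(8, 3, 256, 256)  # dummy synthetic images in [0, 1]
print(f"discriminator score: {discriminator_score(disc, fake_batch):.4f}")
```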
R2:
- Unclear Image Source: Thank you for pointing this out. All 1.4M CFPs in RetinaLogos are real CFPs sourced from open-access datasets (e.g., AIROGS, 140k images) and private datasets (1.2M images). “Synthetic” applies only to the captions we generated for the paired CFPs to build our caption dataset.
- Lack of Explanation for Training: We apologize for the confusion caused by the organization of Section 3.1. To clarify, our dataset consists of 1.4M real CFP images, not synthetic ones; the synthetic component refers to the generated captions, and the generated CFPs shown in later sections are outputs of the RetinaLogos model, used for validation purposes. Our pipeline includes three stages: (1) collecting real CFP images from open-source and private sources (including EHRs); (2) generating detailed captions using a structured data-to-text pipeline; and (3) training a text-driven image synthesis model in a staged manner (resolution from 256 to 1024), which is then evaluated via expert-designed and automated metrics. We will update the title and explanation of Sec. 3.1 to describe model training.
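A schematic sketch of the staged training schedule described above, with stage names borrowed from the ablation (PT/PL/SR/HR); the data descriptions and the training function are placeholders for illustration, not the authors’ exact recipe.

```python
# Schematic of the staged training schedule (resolution raised from 256 to
# 1024 across stages). Stage names follow the ablation (PT / PL / SR / HR);
# everything else is an illustrative placeholder.
STAGES = [
    {"name": "PT", "resolution": 256,  "data": "RetinaLogos-1400k (all)"},
    {"name": "PL", "resolution": 256,  "data": "prolonged training, same data"},
    {"name": "SR", "resolution": 256,  "data": "quality-filtered, semantically refined subset"},
    {"name": "HR", "resolution": 1024, "data": "high-resolution subset, flexible aspect ratios"},
]

def train_stage(name: str, resolution: int, data: str) -> None:
    # Placeholder for one fine-tuning stage of the DiT text-to-image model.
    print(f"stage {name}: fine-tune at {resolution}px on {data}")

for stage in STAGES:
    train_stage(**stage)
```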
R3:
- Moderate Novelty: Thank you for recognizing our practical impact. Our innovation is a scalable data-collection pipeline with a reproducible, text-driven framework, enabling fine-grained, clinically grounded caption generation and high-fidelity CFP synthesis. To our knowledge, it is the first large-scale text-to-retinal-image model trained on over 1M real CFPs with captions.
- Expert Review: Thank you for pointing this out. We collaborated with 2 ophthalmologists to design a clinical evaluation featuring a Turing test and scoring of 5 retinal features (Tab. 3). Each reviewed a few hundred images, sampled at a 1:1 real-to-generated ratio and including rare disease cases (Fig. 4), to ensure clinical validity.
- Caption Diversity & Reliability: Thank you for emphasizing clinical credibility. We applied two safeguards: (1) structured EHR priors in prompts, reducing direct duplication to <1% (Fig. 2a); and (2) ophthalmologist review of 10% of the captions, of which 72% were accepted as-is, 21% slightly edited, and 7% rewritten.
- More Method Comparisons: These were omitted from the paper due to page limits; we report them here: Diffusion with LoRA, FID 118.8 / KID 0.1445 / IS 1.829; Stable Diffusion, FID 121.9 / KID 0.1084 / IS 2.632. Thank you for your consideration.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A