Abstract

Diabetic macular edema (DME) is a leading cause of severe vision loss in the working-age population. Optical coherence tomography (OCT) is the gold standard for DME management and primary care referrals, providing retinal thickness maps (RTMs) that quantify retinal pathologies. However, its limited accessibility in resource-constrained settings necessitates more efficient solutions. While color fundus photography (C-FP) is a cost-effective screening tool, its potential for quantitative thickness evaluation remains underexplored. In this paper, we propose a novel Global-to-Local conditional Diffusion model for Retinal Thickness prediction (GLD-RT), the first attempt to predict RTMs solely from C-FP. Our framework predicts thickness distributions of the macular region from 2D inputs through a diffusion process guided by hierarchical global-to-local retinal features. Experimental results demonstrate that GLD-RT accurately depicts both physiological and pathological retinal morphology, achieving superior performance in thickness quantification and enabling a more detailed examination of retinal structures. Furthermore, C-FP-generated RTMs exhibit promising utility in facilitating DME diagnosis. This approach transforms conventional fundus imaging into a comprehensive and cost-effective diagnostic tool for DME screening and monitoring in resource-limited settings, thereby holding significant clinical implications.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1179_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CheWen_Seeing_MICCAI2025,
        author = { Cheng, Wenquan and Sun, Yihua and Wang, Jinyuan and Guo, Jia and Li, Zihan and Wang, Zhuhao and Ning, Guochen and Zheng, Yingfeng and Liao, Hongen and Wong, Tien Yin and Song, Su Jeong},
        title = { { Seeing Beyond the Surface: Retinal Thickness Prediction from Color Fundus Photography for DME Management } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        page = {562 -- 572}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a model that generates retinal thickness map predictions from color fundus photographs alone. They combine local and global features extracted with two different encoders and test their method on images from 283 patients. Moreover, they perform foveal region segmentation to carry out binary classification of diabetic macular edema (DME) on the publicly available dataset mBRSET.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and the motivation to exploit the practicality of color fundus images to improve the diagnosis of DME is well justified.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The authors introduce a second model in the Results section with limited information on their motivation for optimizing two separate models.

    • The authors do not provide a fair comparison for their model's performance on mBRSET. Their baseline for binary DME classification on mBRSET is lower than the one reported in the corresponding study, which used a simpler method and a larger test set.

    • The authors miss important state-of-the-art references that combine OCT and color fundus images to characterize diabetic retinopathy, e.g., 10.4103/IJO.IJO_2614_22 and https://doi.org/10.1016/j.oret.2021.12.021.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the authors mention the problems previous approaches had when image quality was poor (Arcadu et al.), they do not address this issue in their current work. Moreover, the work by Arcadu et al. was tested on a much larger dataset, and the multi-center validation study by Liu et al. contains a similar number of cases to the datasets used in this study. These previous studies also performed a more comprehensive evaluation of DME, predicting quantitative measures of thickness and/or classifying diabetic retinopathy. It is not clear whether the presented approach outperforms the state of the art on this topic. The authors could better contextualize their work and compare it with the state of the art in the discussion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This manuscript proposes a method to estimate retinal thickness maps (RTM) from color fundus photography (CFP) images. To prepare training data, RTM and CFP are registered by aligning vessels, using infrared fundus photography (IRFP) as an intermediate modality. Then, a neural network architecture is proposed that uses two encoders (“local” based on Swin-Transformer zoomed in on the macula, and “global” based on pre-trained visual transformer with the entire fundus as input) and a conditional diffusion decoder to predict the RTM. The method is compared to other architectures that use either CFP or IRFP or both combined as input, and an ablation experiment is performed. Promising results in terms of MAE and PSNR and visual examples are shown on an in-house dataset (1418 patients with Diabetic Macular Edema (DME)). Moreover, in an external test set (1291 patients with and without DME), the predicted RTM along with the CFP is used as input for a ResNet50 trained to diagnose the presence of DME, and compared to a ResNet50 using only CFP as input; the results with RTM show better diagnostic performance in terms of area under the curve.
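For readers less familiar with the two metrics named above, their usual definitions are sketched below. This is the textbook form, not necessarily the exact variant used in the paper: the symbols $M_i$ (reference thickness at pixel $i$), $\hat{M}_i$ (predicted thickness), $N$ (number of pixels), and $R$ (the dynamic range assumed for the thickness map) are introduced here for illustration only, and the review does not state how $R$ was chosen.

```latex
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| M_i - \hat{M}_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( M_i - \hat{M}_i \right)^2,
\qquad
\mathrm{PSNR} = 10 \log_{10} \frac{R^2}{\mathrm{MSE}}
```

A lower MAE and a higher PSNR both indicate a closer match to the OCT-derived map. PSNR values are only comparable across methods when the same $R$ is used.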

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -S1) The manuscript is clearly written. -S2) The results are impressive. -S3) The last experiment predicting DME is a nice contribution, demonstrating the clinical value of the estimated RTMs. -S4) Clinical relevance seems high.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    -W1) The results are reported without measures of variance (for example, confidence intervals, or standard deviation across cross-validation iterations).
    -W2) The architecture of the neural network could have been motivated better. Why is a normal U-Net not sufficient? What makes the problem of CFP->RTM prediction unique compared to other tasks? Why is this particular architecture, with two encoders and a diffusion-based decoder, necessary for this application?
    -W3) The ablation test is nice, but the variants with only the global encoder have not been tested.
    -W4) It would have been interesting to add an experiment using IR-FP instead of CFP as input for the GLD-RT method.
    -W5) The proposed architecture and the competing architectures all have some hyperparameters, e.g., number of filters, levels, etc. Have there been any attempts to optimize these hyperparameters? At the least, it would be insightful to report the number of trainable network parameters for each method.
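The request in W1 for a measure of variance could be met in several ways. One option is a percentile bootstrap confidence interval for the mean absolute error, resampling whole patients so that both eyes of a patient stay together. The sketch below is illustrative only: the function name, the dictionary layout, and the default settings are assumptions, not anything described in the paper.

```python
import random
from statistics import mean


def bootstrap_mae_ci(abs_errors_by_patient, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a patient-level mean absolute error.

    abs_errors_by_patient: dict mapping a patient id to a list of per-eye
    absolute errors (hypothetical layout, e.g. in micrometres).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    # Collapse to one value per patient so eyes are never resampled separately.
    per_patient = [mean(errs) for errs in abs_errors_by_patient.values()]
    n = len(per_patient)

    boot_means = []
    for _ in range(n_boot):
        # Draw n patients with replacement and record the resampled mean.
        sample = [per_patient[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean(sample))
    boot_means.sort()

    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    # Note: the point estimate is the mean of per-patient means.
    return mean(per_patient), lower, upper
```

Note that the point estimate here weights every patient equally, which differs slightly from an eye-level MAE when some patients contribute one eye and others two.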

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    FURTHER DETAILED COMMENTS:
    -D1) It would be good to report some measures of computation time (for training and inference) and memory usage for the different architectures.
    -D2) The notation of Eq. 1 could be improved. The expectation is taken over M, T, F_m and F_g, but if I understand correctly, M, F_m and F_g form the triplets of training data, so it would be clearer if they were grouped, e.g.: \mathbb{E}_{t, \{M, F_m, F_g\}}
    -D3) The introduction states that “C-FP offers superior spatial resolution, …”, but in the experiments the CFP images are downsampled to the same resolution (544x544) as IR-FP. Please discuss.
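To make the grouping suggested in D2 concrete, here is a sketch of how such an objective might be written. It assumes, without confirmation from the review text, that Eq. 1 is a standard noise-prediction diffusion loss. The noise $\epsilon$, the denoiser $\epsilon_\theta$, the noisy map $M_t$, and the schedule term $\bar{\alpha}_t$ are hypothetical symbols borrowed from the usual DDPM formulation, and $F_m$ and $F_g$ are read here as the macular and global fundus inputs.

```latex
\mathcal{L} = \mathbb{E}_{t,\,(M, F_m, F_g),\,\epsilon}
\left[ \left\| \epsilon - \epsilon_\theta\!\left( M_t,\, t,\, F_m,\, F_g \right) \right\|_2^2 \right],
\qquad
M_t = \sqrt{\bar{\alpha}_t}\, M + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```

Writing the training triplet as a single tuple under the expectation makes it explicit that $M$, $F_m$, and $F_g$ are drawn jointly from the paired dataset, while $t$ and $\epsilon$ are sampled independently.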

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clinical application is interesting, the results are quite promising, and the presentation is clear. As an application paper, it could therefore be a nice contribution to the conference. The methodological innovations are less convincing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This manuscript presents a novel framework for estimating retinal thickness maps from color fundus photography (CFP). It is based on an encoder-decoder architecture in which separate processing streams encode a detailed view of the macula (using Swin Transformers) and a global view of the full fundus (using the RETFound foundation model). The decoder is a conditional diffusion model. The framework is trained and evaluated on paired OCT, infrared, and color fundus images, and is reported to outperform previous approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Estimating retinal thickness from CFP is an interesting challenge and could become practically relevant if sufficiently reliable

    • The proposed architecture is plausible and uses state-of-the-art network architectures, as well as a recent domain-specific foundation model

    • The method is evaluated on a reasonably sized test set (283 patients). Quantitative and qualitative results indicate a benefit over simpler alternatives.

    • A meaningful ablation study clarifies, both qualitatively and quantitatively, the relative benefits of individual parts of the pipeline, such as including global context, using the domain-specific foundation model, and using the diffusion decoder.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Technical novelty is limited; the proposed approach mostly stitches together network architectures that are natural candidates for the respective subtasks

    • A sufficiently clear and detailed description is given that should make it possible to build a very similar framework. I expect that exactly reproducing the results will be difficult since source code will apparently not be released, and some details, in particular with respect to the diffusion decoder, remain vague.

    • The qualitative results in Fig. 2 look convincing, but it is not clear how representative they are: Were they selected to highlight cases where the proposed approach outperforms the SOTA alternative particularly strongly? Were there also cases in which the SOTA gave more accurate results that are not shown here?

    • It is questionable to treat the two eyes of the same patient as independent in the Wilcoxon signed-rank test

    • There is no validation on an additional, external dataset
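The concern above about treating both eyes of a patient as independent in the Wilcoxon signed-rank test is commonly handled by collapsing the data to one paired difference per patient before testing. The sketch below shows that aggregation and the resulting signed-rank statistic. The record layout and function names are hypothetical, and this is one possible remedy, not the procedure used in the paper.

```python
from collections import defaultdict
from statistics import mean


def per_patient_differences(records):
    """Collapse eye-level paired errors to one difference per patient.

    records: iterable of (patient_id, error_method_a, error_method_b) tuples,
    one tuple per eye (hypothetical layout).
    """
    diffs = defaultdict(list)
    for patient_id, err_a, err_b in records:
        diffs[patient_id].append(err_a - err_b)
    # Average the eye-level differences within each patient.
    return {pid: mean(d) for pid, d in diffs.items()}


def signed_rank_statistic(differences):
    """Return (W_plus, W_minus) for the Wilcoxon signed-rank test.

    Zero differences are dropped; tied absolute values share average ranks.
    """
    nonzero = sorted((d for d in differences if d != 0), key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(nonzero):
        # Find the run of entries tied in absolute value with entry i.
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for d in nonzero[i:j]:
            if d > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j
    return w_plus, w_minus
```

The per-patient differences can equally be passed to a library implementation of the signed-rank test to obtain a p-value. The point of the sketch is only that the unit of analysis becomes the patient rather than the eye.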

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Some of the text in 3.4 reaches into the page margin.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-written MICCAI paper with a clear goal, plausible technical approach, and suitable evaluation. Even though the question of potential cherry-picking in the qualitative results might be addressed in a rebuttal, and even though releasing the source code would be greatly appreciated, I am optimistic enough about this work that I would not insist on a rebuttal unless others raise more serious concerns.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their insightful comments, which we will incorporate into the revised manuscript and future work.

R1 - Qualitative Results Representativeness: The cases illustrated in Fig. 2 intentionally reflect both typical and challenging scenarios. Our method notably excels in capturing pathological irregularities characteristic of DME compared to standard U-Net models. We will enhance the qualitative analysis discussion in the revised manuscript.

R2 - Variance Measures: If allowed, we will include standard deviation metrics from cross-validation iterations in the revised manuscript.

R2 - Model Architecture Motivation: We appreciate the inquiry into our model’s architectural design. The task of predicting RTM from C-FP is uniquely challenging due to the intrinsic multi-scale complexity of retinal structures, including vessels, optic nerve heads, and various retinal layers. Pathological features of DME significantly alter vessel morphology—such as trajectory, branching patterns, caliber, and density—and consequently impact adjacent retinal tissues. U-Net architectures often fall short in capturing the simultaneous presence of fine-grained pathological changes and broader anatomical context. To address this, our approach integrates detailed pathological information with comprehensive global anatomical structures, ensuring robustness and precision in representing subtle yet clinically significant DME-related retinal alterations. We will enhance the discussion in the revised manuscript.

R2 - Hyperparameter Optimization: We explored various hyperparameters (e.g., time steps, network depth). If allowed, we would detail these efforts and report the number of trainable parameters.

R2 - Resolution Clarification: Registration constraints necessitated downsampling CFP images to match IR-FP resolution (544x544 pixels). Despite this, the inherent chromatic contrast of CFP images provides richer structural information. Our experiments demonstrated that substituting IR-FP with registered CFP in baseline models consistently improved performance, particularly in the clinically significant foveal region. Future studies will explore the impact of native resolutions on prediction performance.

R3 - Baseline Comparison in mBRSET: We used concatenated inputs as a 4-channel image, training from scratch without ImageNet-pretrained weights for a fair comparison. This approach differs from the referenced study using pretrained weights and larger transformer models, and will be explicitly clarified in the final version.

R3 - Introduction of Second Model: The second model objectively validated the clinical utility of retinal thickness maps (RTM) for diabetic macular edema (DME) diagnosis. We acknowledge that subjective validation would strengthen our findings and plan to include this in our future studies.

R3 - Comparison with Previous Studies of Diabetic Retinopathy and DME Diagnosis: Our work uniquely focuses on generating RTMs from color fundus photography (C-FP). This approach complements previous studies, providing quantitative references valuable for clinical scenarios such as patient follow-ups.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


