Abstract

Color fundus photography (CFP) is widely used in clinical practice for its convenience and accessibility. However, it suffers from low image quality, limited depth information, susceptibility to artifacts, and low contrast, which reduce diagnostic accuracy and hinder the detection of small lesions. Fluorescein angiography (FA), on the other hand, effectively highlights features such as vascular leakage and non-perfusion, but it carries health risks and lacks color information. To address these challenges, we propose a multi-stage retinal image fusion framework, RIFNet, which improves image quality and diagnostic efficacy by integrating multimodal information from CFP and FA. First, because FA is an invasive examination and therefore often unavailable, we design a bi-stream generative subnetwork, pre-trained with real CFP images as the generation condition, to synthesize pseudo-FA images and thereby supplement the missing modality. Next, the color representations of the two modalities are unified through color coding and fed into a multimodal discriminative fusion network that produces fused color-coded images. Finally, a multiscale reconstruction method generates a high-resolution, high-contrast enhanced image. Experiments demonstrate that this multimodal fusion framework supplements FA information, reduces medical costs, and reveals lesion details unobservable in a single modality, supporting accurate ocular disease diagnosis.
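
For orientation, the sketch below mirrors the three stages described in the abstract using toy convolutional blocks and hypothetical module names; it is a reading aid under those assumptions, not the authors' released architecture.

    # Minimal, runnable sketch of the three stages (toy layers, hypothetical names).
    import torch
    import torch.nn as nn

    def conv_block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

    class ToyRIFNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Stage 1: generator that synthesizes a 1-channel pseudo-FA map from CFP.
            self.pseudo_fa_gen = nn.Sequential(conv_block(3, 16), nn.Conv2d(16, 1, 3, padding=1))
            # Stage 2: fusion network over the concatenated (color-coded) modalities.
            self.fusion = nn.Sequential(conv_block(3 + 1, 16), conv_block(16, 16))
            # Stage 3: multiscale reconstruction to a higher-resolution color output.
            self.reconstruct = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                conv_block(16, 16), nn.Conv2d(16, 3, 3, padding=1))

        def forward(self, cfp, fa=None):
            if fa is None:                       # missing-modality case
                fa = torch.sigmoid(self.pseudo_fa_gen(cfp))
            fused = self.fusion(torch.cat([cfp, fa], dim=1))
            return self.reconstruct(fused)

    cfp = torch.rand(1, 3, 128, 128)
    out = ToyRIFNet()(cfp)                       # FA omitted -> pseudo-FA path is used
    print(out.shape)                             # torch.Size([1, 3, 256, 256])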

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1554_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Liyuyu666/RIFNet

Link to the Dataset(s)

The Isfahan MISP dataset: https://misp.mui.ac.ir/fa/node/1399
The DRIVE dataset: https://drive.grand-challenge.org/

BibTex

@InProceedings{LiYuq_RIFNet_MICCAI2025,
        author = { Li, Yuqing and Hou, Qingshan and Cao, Peng and Ju, Jianguo and Wang, Tianqi and Wang, Meng and Zou, Ke and Tham, Yih Chung and Fu, Huazhu and Zaiane, Osmar R.},
        title = { { RIFNet: Bridging Modalities for Accurate and Detailed Ocular Disease Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {519 -- 529}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. The authors propose RIFNet, a multi-modal framework for retinal image fusion that combines color fundus photography (CFP) with fluorescein angiography (FA) and generates a high-quality fused image.
    2. The idea is that the fusion may enhance the visibility of lesions as well as other structures (the vascular tree, vessel leakage, thrombosis leading to non-perfusion) that can aid retinal disease diagnosis.
    3. The authors mention that there is no
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors have sufficiently motivated the need for combining CFP and FA images.
    2. The RIFNet approach seems to be a novel technique.
    3. The methodology is adequately explained. However, without the release of the code, it is hard to assess its practical utility.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. There are multiple comparisons being performed, but there is no statistical testing to determine the improvement in performance using the proposed RIFNet.
    2. A proxy segmentation task was evaluated on the DRIVE dataset, but there is no description of segmentation metrics (Dice, Hausdorff/normalized surface distances); a minimal Dice example is sketched after this list.
    3. The viability of the approach is in question because it was evaluated on a test subset of only 6 CFP-FA image pairs. There is no external testing, and even the results on the DRIVE dataset are not particularly strong.
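
    As a reference for the metrics named in item 2, here is a minimal Dice implementation on toy masks (Hausdorff or normalized surface distances would typically come from a library such as MedPy or MONAI); this is an illustration, not the authors' evaluation code.

        # Dice score on two binary masks (toy data only).
        import numpy as np

        def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
            pred, gt = pred.astype(bool), gt.astype(bool)
            inter = np.logical_and(pred, gt).sum()
            return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

        pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
        gt   = np.zeros((64, 64), dtype=bool); gt[15:45, 15:45] = True
        print(f"Dice = {dice(pred, gt):.3f}")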
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Introduction:

    1. The authors do not have enough paired data, which makes sense. But what is the utility for the clinician? If they normally use both mutually informative sources of data, then how is this approach going to fit into their workflow? This aspect is unclear.

    Methods:

    1. Testing on 6 CFP-FA image pairs calls into question whether the results will translate to a larger dataset. No meaningful claims about performance can be made from such a small sample.
    2. Following on from the above, the sample is not even large enough for statistical testing. This brings me to the lack of a statistical test to compare the perceptual quality of the fusion, especially in the setting of multiple comparisons (the authors compared seven other approaches against their work).
    3. I recommend the authors focus less on the perceptual quality results (especially in light of the small test sample size) and instead focus on a downstream task (e.g., segmentation and/or classification). They could have used the MESSIDOR or EyePACS dataset to show that their approach works better. (https://www.ophthalmologyscience.org/article/S2666-9145(23)00133-1/fulltext)

    Results

    1. There is no description of the segmentation metrics.

    Discussion/Conclusion:

    1. There is no mention of the limitations of the work - 6 samples is pretty small.
    2. What is the clinical utility of this work? Why should an ophthalmologist use this approach and deviate from their current clinical workflow?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Small test data subset size brings into question the generalizability of the results to other (larger) external datasets, lack of statistical testing, lack of segmentation metrics, lack of classification results for a target task.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a multimodal fusion framework for retinal imaging (RIFNet), which enables the generation of pseudo fluorescein angiography (FA) images from color fundus photography (CFP). The method addresses the missing-modality problem with respect to FA and achieves outstanding reconstruction quality, particularly in generating high-contrast CFP images. Compared to existing generative methods for this task, RIFNet demonstrates outstanding performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In a novel approach, the method introduces a GAN-based architecture with two discriminators, enabling the model to capture visual patterns from both imaging modalities. As a result, the generated images exhibit high-quality, detailed reconstructions that preserve critical visual features from both CFP and FA. Additionally, the inclusion of a dedicated sub-network for pseudo-FA generation is particularly relevant in addressing the missing modality problem. The results clearly show that, in scenarios where the FA modality is absent, the method significantly outperforms other approaches in reconstructing CFP, an outcome that is both practically valuable and clinically relevant.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The paper is very detailed, but in places overly so: certain steps, such as the “viridis colormap encoding”, are described with excessive granularity, which makes it harder to follow the core contribution, the generative framework. I would recommend simplifying the overall description of the proposed method, including Figure 1, so that the method and its main components are easier to follow. Some aspects, e.g. the viridis colormap conversion, do not warrant a complete section (Section 2.2) and, moreover, are not mathematically explained there. It might be clearer to simply state that the images are converted to the viridis colormap (as pre-processing, and back to grayscale as post-processing), unless a stronger justification is provided for its importance to the method.

    • Do the authors have an explanation for why the proposed method improves on the QAB/F metric when the FA modality is missing? This is unusual, as most other state-of-the-art methods show a performance drop in this setting, which would be expected given the missing modality. Could this behavior indicate potential overfitting, raising concerns about the model’s robustness and generalizability to new scenarios? Some discussion or analysis of this phenomenon would be helpful.

    • The abstract and conclusions refer to the method as “multistage,” while the main body of the paper presents it as a multimodal fusion approach. This inconsistency in terminology is confusing.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    There are other minor remarks to improve paper quality:

    • The dataset description and implementation details are mixed and should be better separated. For example, the second item regarding the GPU setting appears in the middle of the explanation of the dataset. It would be clearer to present the dataset and implementation details as distinct paragraphs (or sections).

    • The captions in Figure 2 are not clearly visible.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strategy is particularly relevant as it achieves outstanding performance in the common scenario where the FA modality is missing (often the case in clinical routine, since FA is a costly and hard-to-acquire modality), outperforming several state-of-the-art methods. This highlights the practical value of the approach.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes a novel framework for fusing multimodal fundus images (color fundus photography and fluorescein angiography images). This includes a component for generating a high-resolution version of the FA image for fusion purposes, or generating one from scratch if missing.

    The end result of the pipeline is generated Color Fundus Photography images with retinal vessel shadows (and other FA-relevant details) “enhanced” on the CFP image. This can be used for clearer clinical examination, or as an input to further medical image analysis algorithms.

    The superiority of the produced images is demonstrated on a segmentation task, compared against competing image fusion methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A very clear, albeit non-trivial pipeline, well presented and easy to follow in terms of the steps involved.

    A methodologically novel architecture, using individual components of the pipeline that are well established methods in the field, and with a clear demonstration of improved performance in a clinically-relevant task.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper makes certain assumptions of ‘goodness’ which are not explicitly shown to have been tested or presented.

    E.g., one of the steps in the pipeline depends on a registration step, which is simply assumed to be ‘perfect’. It would have been useful to confirm this quantitatively, and to demonstrate examples where registration goes wrong and how it affects the final result. More generally, it would be useful to show ‘worst case’ fusion examples to get an idea of the algorithm’s robustness.

    Similarly, the high-resolution “generated” FA images are assumed to capture all the important information that actual (upscaled) FA images would capture. But this has not been tested, and it would also be surprising, given that CFP images might not show the detail that is captured by FA. The extent to which the generated FA images match known FA image ‘ground truth’ would have been useful to mention here. If the generated FA images add or omit important information, then this is clinically important.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Presumably row 2 in Fig 3 represents the GradCam result. Please state this clearly in the caption.

    2. Please be more explicit about what ground truth your validation metrics are actually compared against, and whether they take into account only one modality for their computation (and if so, why the other modality was not deemed important to validate against).

    3. In the segmentation task, if possible, please include the ground truth segmentation from the dataset for the particular example shown. It would be interesting to know if the ‘fine details’ appearing in your result were also captured at the ground truth level or not.

    4. The viridis scheme is not a ‘better’ colormap in general, let alone for computer processing tasks. It is simply a “perceptually uniform” map, meaning it is specifically useful for human perception, taking into account how humans process luminance. So it is unclear why it would make a better shared colourspace than other maps. Also, the exact manner in which it is applied is not clear in the pipeline. A ‘viridis’ image is still an RGB image with 3 channels. Does this mean your transformed input contains 3 channels? In your schematic it looks more like a single-channel input to the network. Does this mean the viridis image is converted back into grayscale? (In which case, is there a comparison to a simple grayscaling scheme applied directly to the original photos?) This part of the paper is a bit unclear; it would be useful to clarify the exact inputs/outputs of that component (a toy round-trip example is sketched below).
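
    To make the channel question concrete, here is a hedged toy round trip through the viridis colormap using matplotlib (a stand-in for the seaborn-based encoding the rebuttal describes; random placeholder data, not the authors' pipeline).

        # Grayscale -> viridis RGBA -> RGB -> approximate grayscale recovery.
        import numpy as np
        from matplotlib import cm

        fa_gray = np.random.rand(64, 64).astype(np.float32)       # stand-in for a normalized FA image

        rgba = cm.viridis(fa_gray)             # shape (64, 64, 4): the colormap call returns RGBA
        rgb = rgba[..., :3]                    # dropping alpha leaves a 3-channel RGB image

        # Viridis is (approximately) monotonic in luminance, so a nearest-color lookup
        # against its 256-entry LUT recovers the single-channel intensity reasonably well.
        lut = cm.viridis(np.linspace(0.0, 1.0, 256))[:, :3]        # (256, 3) reference colors
        dists = np.linalg.norm(rgb[:, :, None, :] - lut, axis=-1)  # distance to every LUT entry
        recovered = dists.argmin(axis=-1) / 255.0                  # back to one channel in [0, 1]

        print(rgba.shape, rgb.shape, recovered.shape)              # (64, 64, 4) (64, 64, 3) (64, 64)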

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A very clear, well-presented paper with an interesting, novel algorithm for the fusion of multimodal fundus images. The method is nontrivial but presented and explained very clearly, with a clear improvement over the state of the art.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal adds some further clarity to the work and sufficiently addresses most concerns / suggestions from this review. Happy to accept.




Author Feedback

We appreciate the reviewers’ efforts in evaluating our work and acknowledge the paper’s contributions, including its innovative method (R1) and the practical value of the strategy (R3). We address each reviewer’s questions as follows:

I. “Goodness” assumptions (R1-7)
(a) Ablation experiments show that registration errors cause artifacts in the fusion results, which are resolved by the fusion network integrating multimodal information. SSIM between CFP and FA improves by 13% (51%→64%) after registration, confirming the effectiveness of rigid registration. Notably, rigid registration serves only as a preprocessing step to provide data for supervised training in subsequent stages when paired images are unavailable.
(b) FA provides high-contrast information on fundus vasculature and aids lesion identification; vascular and lesion details in CFP are highlighted during generation. The generated FA is not intended for clinical diagnosis, but rather serves as supplementary information to enhance lesion details during training.

II. Experimental evaluation (R1/R3)
(a) Fig. 3 shows that RIFNet extracts finer details than the other methods. Quantitatively, RIFNet achieves the highest Dice, improving by 3.61% (68.93%→72.54%) over non-fusion, confirming that the fine details appearing in our results align with those in the ground truth. (R1-10.Q3)
(b) To avoid the color distortion associated with IHS or YUV, we use the viridis colormap when fusing RGB and grayscale images. Its pixel intensity shows clearer peaks after preprocessing, aiding lesion extraction. Directly converting CFP/FA to grayscale blurs details and lacks robustness to registration errors. The viridis conversion uses seaborn to encode images, which outputs 3-channel RGB for JPG and 4-channel RGBA for PNG. We use PNG, so the fusion network input after the viridis step in Fig. 1 is 4-channel. (R1-10.Q4)
(c) As R3 noted, we agree that excessive technical detail (viridis colormap encoding) may distract from the main contribution, so we will simplify the description and revise Fig. 1 for clarity. (R3-7.Q1)
(d) To assess whether multimodal information is preserved, columns 2–7 in Tab. 1 show the summed results of comparing the fusion image separately with CFP and FA. Columns 8–13 compare only with CFP to evaluate CFP quality enhancement, which might be mistaken as ignoring FA or favoring one modality (R1-10.Q2). Fusion images enhance local texture and structural clarity, which helps to improve QAB/F (R3-7.Q2).

III. Feasibility and clinical use (R6)
(a) Due to the limited availability of open CFP–FA paired images, we use 10-fold cross-validation to reduce sample bias; Tab. 1 shows the related results. Statistical significance was assessed by paired t-test, showing improvement in 39 of 42 comparisons (7 methods × 6 metrics) at p < 0.05, especially for MI and VIF (p < 1e-6), indicating high significance.
(b) For external validation, we applied RIFNet to DRIVE for vessel segmentation; Fig. 3 shows qualitative results. Quantitatively, it outperforms the other methods and the original images in Dice (72.54%), AUC (85.17%), and Kappa (69.42%), showing improved capture of vascular detail. Further tests on Messidor and PALM19 show top results in DR grading (Accuracy: 69.58%, Recall: 51.52%, F1: 47.49%) and OD segmentation (Dice: 82.64%, IoU: 78.58%, Kappa: 83.75%), demonstrating adaptability across tasks and datasets.
(c) CFP is widely used clinically, while FA is invasive and difficult to obtain. When FA is unavailable, RIFNet enhances CFP by integrating vascular and lesion features from the generated FA. This method is especially valuable in clinical scenarios requiring enhanced diagnostic accuracy in the absence of FA.

IV. Structure and wording (R1/R3/R6)
(a) We will revise the Fig. 2/Fig. 3 captions and unify the term “multistage” in the camera-ready version to avoid confusion, and we will separate the dataset descriptions from the implementation details, per R3’s suggestion, for clarity. (R1-10.Q1, R3-10.Q2/3)
(b) We have uploaded the code to the CMT system, but due to this year’s policy it is currently not visible. (R6-9)
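
To make the significance protocol in item III(a) concrete, here is a hedged sketch of a paired t-test between RIFNet and one baseline on per-fold scores, with a Bonferroni note for the 42 comparisons; the numbers below are placeholders, not the paper's results.

    # Paired t-test on per-fold scores for two methods (placeholder data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    rifnet_scores = rng.normal(0.72, 0.03, size=10)     # e.g. per-fold Dice for RIFNet
    baseline_scores = rng.normal(0.69, 0.03, size=10)   # same folds, one baseline method

    t, p = stats.ttest_rel(rifnet_scores, baseline_scores)
    print(f"paired t = {t:.2f}, p = {p:.4f}")

    # With 7 methods x 6 metrics = 42 comparisons, a multiplicity correction
    # (e.g. Bonferroni or Holm) keeps the family-wise error rate controlled.
    corrected_alpha = 0.05 / 42
    print("Bonferroni-corrected alpha:", corrected_alpha)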




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    N/A

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


