Abstract
Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work on retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection that is comprehensive and systematic in terms of both data and algorithms. By categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers significant performance drops when encountering certain unseen anomalies. Inspired by the memory bank mechanisms of one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate this degradation, resulting in a more powerful and stable approach. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1683_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/DopamineLcy/BenchReAD
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LiaChe_BenchReAD_MICCAI2025,
  author    = {Lian, Chenyu and Zhou, Hong-Yu and Hu, Zhanli and Qin, Jing},
  title     = {{BenchReAD: A systematic benchmark for retinal anomaly detection}},
  booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  volume    = {LNCS 15961},
  month     = {September},
  pages     = {35--45}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces BenchReAD, a systematic benchmark for retinal anomaly detection covering multiple datasets. It also proposes NFM-DRA, an extension of an existing supervised anomaly detection method, which combines it with a normal feature memory to improve robustness to unseen anomalies.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper presents BenchReAD, a comprehensive benchmark for retinal anomaly detection that integrates two widely used imaging modalities: fundus photography and optical coherence tomography (OCT). Additionally, it introduces NFM-DRA, a novel model that leverages disentangled representations of abnormalities and a memory bank mechanism to enhance robustness against unseen anomalies.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While the benchmark is comprehensive and the evaluation thorough, the paper’s contributions lack substantial novelty in methodology or benchmark design. The proposed method, though effective, is a relatively incremental combination of existing ideas.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper introduces a comprehensive benchmark for retinal anomaly detection and proposes a straightforward extension (NFM-DRA) that enhances robustness to unseen anomalies, its contribution is primarily empirical and incremental, lacking substantial scientific novelty.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
There is no meaningful scientific contribution in this paper.
Review #2
- Please describe the contribution of the paper
The paper introduces a new benchmark designed for evaluating retinal anomaly detection methods. This benchmark includes two imaging modalities commonly used in ophthalmology: fundus photography and optical coherence tomography (OCT). For each modality, the authors provide training and testing datasets that include a diverse set of anomalies. The training datasets are curated to support methods with different supervision levels, including unsupervised, one-class, semi-supervised, and fully supervised. The corresponding test datasets contain both seen and unseen anomalies to assess generalization performance. In addition to the benchmark, the paper proposes a novel anomaly detection method that combines two state-of-the-art techniques, aiming to improve performance across both modalities for unseen anomalies.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
S1. [Relevant for the field] The lack of extensive benchmarks is one of the biggest issues in the evaluation of anomaly detection methods. Many existing approaches demonstrate strong performance on narrow datasets but fail to prove generalization across different imaging modalities or anomaly types, where they typically fail. This has led to a proliferation of methods that appear effective but are in fact tailored to the specific particularities of limited benchmarks. To advance the field, broader and more diverse benchmarks are needed. This paper makes a valuable contribution by proposing a benchmark for retinal anomaly detection that addresses this gap. It includes two imaging modalities (fundus and OCT), a diverse range of anomalies, and a larger dataset than prior benchmarks. The paper is a step toward more generalizable evaluation.
S2. [Detailed dataset report] The paper offers a clear and detailed analysis of the benchmark composition, including the number of images per split and per anomaly. These details will allow researchers to better understand and utilize the benchmark. The visual summary in Figure 1 is particularly helpful.
S3. [Extensive evaluation] The experimental evaluation is another strength of the work. The authors compare six state-of-the-art anomaly detection methods alongside a novel proposed approach that combines two existing methods. The evaluation is thorough, with performance broken down by dataset and anomaly type, revealing substantial variability in performance across different conditions. This analysis highlights the challenges of generalization in anomaly detection and motivates the proposed benchmark.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
W1. [Limited novelty] The paper does not introduce a new dataset but instead just combines existing datasets to construct the proposed benchmark. While this is a valid approach and contributes to standardization of evaluation, it limits the novelty of the contribution.
W2. [Unclear metrics] The authors argue for greater emphasis on threshold-dependent metrics, but do not clarify how thresholds were set during evaluation. This omission is significant, as threshold selection can greatly influence sensitivity and specificity. It is unclear whether thresholds were optimized for sensitivity, specificity, or another criterion. Tables 1 and 2 include extreme results, with some methods showing near-zero specificity and near-perfect sensitivity, or vice versa. This suggests that thresholds may have been chosen inappropriately or inconsistently, especially considering the ROC curves of Figure 2. The authors should clarify how thresholds were selected (and perhaps improve the threshold selection protocol) for transparency and reproducibility.
W3. [Minor issues] While the paper is generally clear, several minor improvements could enhance clarity and presentation. For example:
- Figure 1 would benefit from indicating which anomalies are present in the training set and which are not (this could be achieved, e.g., through color coding or markers).
- A complementary figure to Figure 1b for the training set would also help complete the overview.
- Including references to the methods directly in figures and tables would make them more complete and self-contained.
- It would also be helpful to indicate the supervision level associated with each method (e.g., through colors, line styles, or grouped columns in tables).
- In Figure 1a, the term “evaluation set” is used. I assume this is the “test set”. For consistency with the rest of the paper, I suggest replacing it with “test set”.
- In the experiment section, the authors include a short description of NFM-DRA (starting from “Motivated by…” at the end of page 6). The method NFM-DRA has already been properly described in Section 2.3, so this description is redundant and feels out of place. I suggest either moving it to Section 2.3 or omitting it entirely.
Typos:
- Page 6: ODR should likely be ODE?
- Fig 3: AUORC should be AUROC.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Despite its simplicity and minor issues, the paper presents a valuable contribution. The proposed benchmark addresses an important necessity in the evaluation of anomaly detection methods and, as such, it has potential for broad impact in the community.
The paper is well written and it includes a thorough experimental evaluation. While some aspects of the presentation could be improved, I do not think this undermines the overall value of the work.
Nevertheless, I encourage the authors to address the issues mentioned in the weaknesses section to further strengthen the final version.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I appreciate the authors’ rebuttal.
I already had a positive view of this paper prior to the rebuttal and continue to see it as a valuable contribution to the field. Therefore, I am maintaining my final recommendation as accept.
Review #3
- Please describe the contribution of the paper
The paper introduces a comprehensive retinal anomaly detection benchmark (BenchReAD) and proposes NFM-DRA, a novel method that enhances model generalization by integrating a Normal Feature Memory to improve performance on unseen anomalies.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Consideration of vast amounts of labeled abnormal data and unlabeled data in actual clinical settings.
- The dataset has been made publicly available, facilitating replication and follow-up of the paper’s findings.
- The dataset incorporates two common ophthalmic imaging examinations: fundus photography and optical coherence tomography (OCT).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In the fundus benchmark, the authors state, “ensuring statistical robustness by excluding anomaly categories with fewer than 20 images,” which raises concerns, as rare diseases do occur in actual clinical settings.
- The comparative methods considered in the experiments seem to lack the latest approaches; for instance, the fully supervised evaluation only includes the DRA method. In reality, numerous subsequent methods have been proposed, such as BGAD presented at CVPR 2023 [1] and AHL introduced at CVPR 2024 [2]. Thus, claiming that NFM-DRA achieves state-of-the-art performance is not rigorous.
- Although a 95% confidence interval for the AUC is provided, the significance of the AUC comparisons between NFM-DRA and the other methods has not been calculated.
[1] Yao, Xincheng, et al. “Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[2] Zhu, Jiawen, et al. “Anomaly heterogeneity learning for open-set supervised anomaly detection.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although the benchmark has limitations regarding the comparison methods not being state-of-the-art, it thoughtfully incorporates a wide range of anomaly detection settings, including scenarios where the training dataset contains both unlabeled data and labeled abnormal data. It is beneficial for the field of anomaly detection.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal addressed my concerns. I believe this paper makes a meaningful contribution to the field of anomaly detection and is suitable for acceptance.
Author Feedback
We sincerely thank the reviewers for their valuable feedback and for recognizing our benchmark’s contributions to the community, comprehensive experiments, and clarity. We also thank the meta-reviewer for their oversight and coordination.
Q1: Contributions in benchmark design and methodology. (R1) A1: Our paper’s primary contribution is establishing the first benchmark for generalizable retinal anomaly detection (acknowledged by R2 and R4), with a secondary contribution of a methodology improvement derived from benchmark insights.
- Benchmark contribution: (a) A new evaluation framework: We provided a more rigorous assessment platform than prior single-dataset evaluations, which integrates 7 datasets across different retinal imaging modalities. (b) Generalizability assessment: We addressed an essential capability for clinical deployment that remains unaddressed by existing benchmarks. We specifically design the benchmark to evaluate both seen and unseen anomalies across different data sources. (c) Novel methodology taxonomy: We revealed that previous medical benchmarks have overlooked the potential of utilizing available abnormal data. We introduced the first complete categorization of anomaly detection approaches based on supervision levels. (d) Clinically practical metrics: Our benchmark enables evaluation that better reflects real-world clinical requirements by incorporating threshold-dependent metrics alongside traditional AUROC.
- Methodological innovation: (a) Benchmark-driven insight: Our systematic evaluation identified a previously unreported limitation: fully supervised methods excel with seen anomalies but degrade when facing certain unseen anomalies. (b) Memory-enhanced architecture: We introduced a new approach that enhances disentangled representation learning with normal feature memory to address the identified weakness. (c) Quantifiable advancement: This innovation produced substantial performance gains (AUROC improved from 85.6% to 89.0% on RIADD and 89.2% to 96.7% on JSIEC).
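To make the memory-bank mechanism concrete, below is a minimal sketch of how a normal feature memory can complement a supervised anomaly score. This is an illustration under stated assumptions, not the released NFM-DRA implementation: the class name NormalFeatureMemory, the cosine nearest-neighbor scoring, and the fusion weight alpha are all hypothetical (see the repository for the actual code).

```python
import torch
import torch.nn.functional as F

class NormalFeatureMemory:
    """Illustrative memory bank holding features of normal training images.

    At test time, low similarity between a query feature and its nearest
    stored normal feature yields a high anomaly score, complementing the
    supervised DRA score on anomaly types absent from training.
    """

    def __init__(self, normal_features: torch.Tensor):
        # normal_features: (N, D) embeddings extracted from normal images
        self.bank = F.normalize(normal_features, dim=1)

    def score(self, query: torch.Tensor) -> torch.Tensor:
        # query: (B, D) embeddings of test images
        q = F.normalize(query, dim=1)
        sim = q @ self.bank.T               # (B, N) cosine similarities
        return 1.0 - sim.max(dim=1).values  # (B,) distance to nearest normal

def nfm_dra_score(dra_score: torch.Tensor,
                  memory_score: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    # Hypothetical convex fusion of the two scores; the paper's actual
    # fusion rule may differ.
    return alpha * dra_score + (1.0 - alpha) * memory_score
```

The design intuition matches the benchmark finding: the supervised (DRA) branch excels on seen anomalies, while the distance to stored normal features remains informative for anomaly types absent from training.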
Q2: The benchmark is not based on newly gathered data. (R2) A2: We carefully integrated datasets from various sources to construct a large, diverse, and reliable benchmark tailored for retinal anomaly detection; the benchmark design itself is novel and systematic. It provides a generalizability assessment, a novel methodology taxonomy, and clinically practical metrics. Moreover, as highlighted in A1 above, our benchmark has already facilitated methodological innovation.
Q3: How thresholds were selected. (R2) A3: Thresholds were selected based on the best F1 scores on the validation sets. We will clarify this in the manuscript.
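A minimal sketch of such an F1-based selection protocol is given below, assuming scikit-learn is available; select_threshold_by_f1 is a hypothetical helper matching the rebuttal's description, not the authors' exact code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def select_threshold_by_f1(val_scores: np.ndarray,
                           val_labels: np.ndarray) -> float:
    """Pick the decision threshold that maximizes F1 on a validation set."""
    precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
    # precision/recall have one more entry than thresholds; align by dropping it
    denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    f1 = 2 * precision[:-1] * recall[:-1] / denom
    return float(thresholds[np.argmax(f1)])
```

Other criteria (e.g., Youden's J) would trade sensitivity against specificity differently, which is why reporting the selection rule matters for reproducibility.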
Q4: Minor issues with figures, tables, expressions, and typos. (R2) A4: We will revise the manuscript to address these issues.
Q5: Rationale for excluding anomaly categories with fewer than 20 images. (R4) A5: Our primary objective is to create a robust evaluation platform for algorithms. Therefore, to ensure the statistical stability and reliability of the evaluation results, we excluded categories with too few images. We recognize the clinical importance of rare diseases and will develop a dedicated benchmark for them.
Q6: Lack of benchmarking the latest approaches. (R4) A6: We selected methods from top-tier journals and conferences with public code, prioritizing those with higher citation counts, such as DRA (~150 citations). We will benchmark more recent approaches to drive methodological innovation. Also, the claim “NFM-DRA achieves state-of-the-art performance” will be revised to “NFM-DRA outperforms all benchmarked methods” for rigor.
Q7: The absence of significance analysis. (R4) A7: We calculated 95% confidence intervals by bootstrapping. On the RIADD and JSIEC datasets, non-overlapping 95% CIs between our NFM-DRA and the other methods indicate statistical significance. For the other three datasets, NFM-DRA performs comparably to the best benchmarked methods while showing greater robustness.
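As an illustration of the stated protocol, the following sketch computes a percentile-bootstrap 95% CI for AUROC; the helper name, resample count, and degenerate-resample handling are assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for AUROC."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    n = len(labels)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        if labels[idx].min() == labels[idx].max():
            continue                       # skip single-class resamples
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Non-overlapping CIs between two methods are a conservative criterion; a paired test on the same bootstrap resamples, or DeLong's test, would be a more direct significance check.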
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Despite its simplicity, this paper addresses an important issue of out-of-distribution and anomaly detection in retinal imaging. The paper is well written, and the authors have addressed the reviewers’ concerns.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A