Abstract

Scanning laser ophthalmoscopy (SLO) images provide ophthalmologists with a non-invasive way to examine the retina for diagnostic and treatment purposes. Manually reading SLO images is a tedious task for ophthalmologists, so developing trustworthy disease detection algorithms is becoming urgent. However, up to now, there is no large-scale SLO image database. In this paper, we collect and release a new SLO image dataset, named Retina-SLO, containing 7943 images of 4102 eyes from 2440 subjects with labels for three diseases, i.e., macular edema (ME), diabetic retinopathy (DR), and glaucoma. To our knowledge, Retina-SLO is the largest publicly available SLO image dataset for multiple retinal disease detection. While numerous deep learning-based methods for disease detection with medical images have been proposed, they ignore model trustworthiness. In particular, from a user’s perspective, a detection model is highly untrustworthy if it makes inconsistent predictions on different SLO images of the same eye captured within relatively short time intervals. To address this issue, we propose TrustDetector, a novel disease detection method that leverages eye-wise consistency learning and rank-based contrastive learning to ensure consistent predictions and ordered representations aligned with disease severity levels on SLO images. Experimental results show that our TrustDetector achieves better detection performance and higher consistency than state-of-the-art methods. Dataset and code are available at https://drive.google.com/drive/TrustDetector/Retina-SLO.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1108_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1108_supp.zip

Link to the Code Repository

https://drive.google.com/drive/folders/1wzoCppWgUhM_9pN1kVrhmHycc8MYEj06

Link to the Dataset(s)

https://drive.google.com/drive/folders/1wzoCppWgUhM_9pN1kVrhmHycc8MYEj06

BibTex

@InProceedings{Hu_AScanning_MICCAI2024,
        author = { Hu, Yichen and Wang, Chao and Song, Weitao and Tiulpin, Aleksei and Liu, Qing},
        title = { { A Scanning Laser Ophthalmoscopy Image Database and Trustworthy Retinal Disease Detection Method } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose (i) a new SLO ophthalmoscopy dataset consisting of ~8k images from ~4k eyes along with disease labels; (ii) a new training method (TrustDetector) that encourages, via a new contrastive loss, learning a representation that is similar for different images of the same eye and different for images of different eyes; and (iii) leveraging a rank loss that learns an ordered representation reflecting disease severity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Proposing a new dataset, which is a valuable contribution to facilitate open and reproducible science.

    • A new loss to encourage eye-wise disease prediction consistency across images.

    • Leveraging rank loss for disease severity.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Clarifications needed in different parts of the paper.
    • Novelty lies only in the application of a known loss (Rank-N-Contrast).
    • Concerns about the completeness of the results and the messages they claim to deliver.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Detailed notes:

    • The new Dataset contribution:
    • Will the image IDs of the different splits be released for facilitating comparison across methods using the same dataset?
    • Similarly, will standard code be released to accompany the data for performing consistent evaluation of different methods?
    • Will metadata (e.g., sex, age, ethnicity, etc.) be released, to examine its effect on prediction?
    • Is the data unbiased with respect to sensitive attributes such as sex, age, and ethnicity, so as to support the development of fair AI?
    • Is it common clinical practice to acquire more than one image per eye?

    • The TrustDetector contribution:
    • “…while push away features from different eyes”. Shouldn’t the loss also consider that different eyes that exhibit the same disease have a more similar representation compared to that of different eyes that exhibit different diseases?

    • The Rank-based Contrastive Learning:
    • How is this loss different from Rank-N-Contrast or RankSim?

    • Experiments:
    • Augmentation: Aren’t ophthalmoscopy images captured under a standard orientation? Is a -30 to 30 degree rotation augmentation plausible and useful? Similarly, or more critically, is an up-down flip justified?
    • The authors should report results in Table 3 for a baseline without L_{eyeCon} and without L_{rank}, to better assess the value added by each of these losses over the baseline.
    • Some of the differences in mean performance metrics in Table 2 seem not statistically significant when examining the standard deviations.
    • The potential lack of statistical significance is even more critical in Table 3. For example, examining the rows TrustDetector vs w/o L_rank, 2 out of 5 metrics do not favour the proposed TrustDetector. In particular, the standard deviations may be too large for the differences to be statistically significant, e.g., for mKappa: 90.58+/-0.47 vs 90.45+/-0.35, and for mF1: 58.96+/-0.89 vs 57.78+/-1.17.
    • Have alpha and beta been optimized for the proposed method (as seen in Table 4)? This raises the question of whether better performance could be obtained by the competing methods in Table 3 if their parameters had also been optimized.
    • If it is common for more than one image to be collected per eye, then I wonder how a naive approach, of passing each image through a predictor (that was not trained with any of these proposed losses) and then combining the predictions (e.g. by averaging the probability vectors), would compare to the proposed method.

    Other:

    • Define the acronym UWF.
    • The caption of Fig. 3 (or the paper body text) should ideally guide the reader through the visuals in Fig. 3.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see weaknesses and detailed notes.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors’ answer “L_Rank enforces different eyes of same disease” is convincing. However, the answers to other issues that are, in my opinion, important are not; for example:

    • The authors seem unconcerned about the importance of attending to sensitive attributes when releasing a new dataset.

    • The authors described the “Differences of RNC&RankSim” but the question was asking about the difference between the proposed “Rank-based contrastive loss” and those two, in order to clarify any technical novelty.

    • The answer about justifying the augmentation is unsatisfying; e.g., why augment with up-down flips for medical images that typically have a standard orientation?

    • The authors dismiss the need for ablation study on the two used losses, i.e. dismiss the need to report a baseline method performance where both new proposed losses are removed: without L_{eyeCon} and without L_{rank}. It is not impossible that adding these losses degraded the result below the baseline, hence the request to see the results of the baseline.

    • The review noted that the standard deviations may be too large for the results to be statistically significant, but the authors responded that the standard deviations of their approach are smaller than those of competing methods. While this may be true and a good thing, it does not answer the question of whether the difference between the means is statistically significant. To exaggerate the situation for clarity, imagine method 1 with an average accuracy of 70%+/-20% and method 2 with an average accuracy of 72%+/-25%: even though the std. dev. of method 1 is smaller (20 < 25), the std. devs. are still too large compared to the difference between the mean accuracies, rendering the apparent advantage of one method over the other not statistically significant and possibly due to chance with high probability.

    • The authors did not comment on how the hyperparameters of competing methods were optimized, raising the concern that the proposed method may have had an unfair advantage, since its parameters were optimized.



Review #2

  • Please describe the contribution of the paper

    This paper collected and released a new SLO dataset with labels for three diseases: ME, DR, and glaucoma. They also propose a novel disease detection method that ensures consistent predictions and ordered representations aligned with disease severity levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Problem well defined and structured
    • Good figure for overview
    • Aware of class imbalance problem that is often existent in pathological datasets
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • It would be best to also evaluate on another publicly available dataset, besides the collected one, for better comparison with the state of the art.
    • Missing a good description of the state of the art and of how this work improves on previous papers.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Visual representations of the results are often better than numerical results alone.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The problem is well defined and the methodology well explained, but the paper is missing a good comparison with and description of related work.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors create a new SLO (Scanning Laser Ophthalmoscopy) image set called Retina-SLO. It has the potential to be used in deep learning algorithms for diagnostic purposes. Compared to existing datasets, it contains more images, labeled for three different diseases: macular edema (ME), diabetic retinopathy (DR), and glaucoma. Next, the authors propose TrustDetector. It leverages eye-wise consistency learning and rank-based contrastive learning techniques to ensure consistent predictions and ordered representations aligned with disease severity levels on SLO images. By ensuring consistent predictions on different SLO images of the same eye captured within relatively short time intervals, TrustDetector aims to enhance the trustworthiness of the disease detection model. The experimental results presented in the paper demonstrate that TrustDetector achieves better detection performance and higher consistency compared to state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Creating a large image set containing around 8000 images covering three different eye diseases is appreciable and has the potential to be used by many researchers. Proposing a new trustworthy disease detection method based on consistency and contrastive learning is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The detection results are only comparable to those of previous methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Apart from trustworthy disease detection, publishing a rich dataset available to other researchers is appreciable.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Making a new dataset available.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We highly appreciate the reviewers’ comments, constructive suggestions, and recognition of our work, particularly the new dataset for trustworthy multi-disease detection (R3, R4, R5), the consistency loss (R3), the well-defined problem and good figures (R4), and the framework (R5). Below we address the reviewers’ concerns.

To R3: Dataset&code&Metadata&Bias Data and code will be released in June. Metadata will not be released, as it constitutes sensitive personal data. There is no sex bias, but age and ethnicity biases exist. Fair AI is important but beyond the scope of this work.

Is multi-image/eye common? Weight the results? 2031 of 4102 eyes have more than one image (see Tab. 1). Suppose the ME probabilities for the 3 images in Fig. 1 are 0.4, 0.95, and 0.45. Average(0.4, 0.95, 0.45)=0.6, indicating positive; Average(0.4, 0.45)=0.425, indicating negative. Weighting the predictions therefore does not guarantee consistency.
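To make the inconsistency concrete, here is a minimal Python sketch (illustrative only, not code from the paper): thresholding the average of per-image probabilities can flip the eye-level decision depending on which images of the eye happen to be available.

```python
# Illustrative sketch (not from the paper): averaging per-image probabilities
# and thresholding the mean can yield different eye-level decisions depending
# on which images of the same eye are available.
def eye_level_decision(probs, threshold=0.5):
    """Average per-image disease probabilities and threshold the mean."""
    mean_prob = sum(probs) / len(probs)
    return mean_prob, mean_prob >= threshold

# Hypothetical ME probabilities for three SLO images of one eye (as in the rebuttal).
all_three_images = [0.4, 0.95, 0.45]
only_two_images = [0.4, 0.45]

print(eye_level_decision(all_three_images))  # mean ~0.6   -> positive
print(eye_level_decision(only_two_images))   # mean ~0.425 -> negative (inconsistent)
```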

TrustDetector L_EyeCon enforces different images from the same eye to have more similar representations than different images from different eyes. L_Rank enforces different eyes with the same disease to have more similar representations than eyes with different diseases.
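For illustration only, a minimal PyTorch-style sketch of an eye-wise consistency loss in the spirit of L_EyeCon could look as follows; the batching, temperature, and exact formulation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def eye_consistency_loss(features, eye_ids, temperature=0.1):
    """Sketch of an eye-wise consistency loss: pull together embeddings of
    images from the same eye and push apart embeddings from different eyes
    (supervised-contrastive / InfoNCE style).

    features: (N, D) image embeddings; eye_ids: (N,) integer eye identifiers.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    self_mask = torch.eye(len(z), device=z.device).bool()
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-pairs

    pos_mask = (eye_ids.unsqueeze(0) == eye_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Mean log-likelihood over the positives of each anchor; anchors whose eye
    # has only one image in the batch are skipped.
    has_pos = pos_mask.any(dim=1)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -pos_log_prob[has_pos].mean()

# Example: a batch of 6 embeddings from 3 eyes (two images per eye).
features = torch.randn(6, 128)
eye_ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = eye_consistency_loss(features, eye_ids)
```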

Differences of RNC&RankSim RNC learns ordered features for samples based on their labels via contrasting in feature space. RankSim learns features by matching the sorted list of a sample’s neighbours in feature space with the sorted list of its neighbours in label space.
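For readers unfamiliar with these objectives, below is a simplified, unvectorized sketch of an RNC-style loss as described above, in which the contrast set for each anchor-positive pair contains every sample at least as far from the anchor in label space; this reflects the general idea only, not the exact loss of RNC, RankSim, or the paper's L_Rank.

```python
import torch
import torch.nn.functional as F

def rank_n_contrast_style_loss(features, labels, temperature=0.5):
    """Unvectorized sketch of an RNC-style objective: for anchor i and
    'positive' j, the denominator sums over every sample k at least as far
    from i in label space as j is, which encourages feature-space distances
    to follow the ordering of label (severity) distances.

    features: (N, D) embeddings; labels: (N,) continuous or ordinal labels.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                                   # (N, N) similarities
    label_dist = (labels.unsqueeze(0) - labels.unsqueeze(1)).abs()  # (N, N) label distances

    n = len(z)
    not_self = ~torch.eye(n, device=z.device).bool()
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Contrast set S_{i,j}: samples (other than i) at least as far from i as j is.
            in_set = (label_dist[i] >= label_dist[i, j]) & not_self[i]
            log_denominator = torch.logsumexp(sim[i][in_set], dim=0)
            loss = loss - (sim[i, j] - log_denominator)
            count += 1
    return loss / max(count, 1)

# Example: 6 embeddings with ordinal severity labels 0-2.
features = torch.randn(6, 128)
labels = torch.tensor([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
loss = rank_n_contrast_style_loss(features, labels)
```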

Augmentation and baseline Rotation is inevitable. Such augmentations are widely used, e.g., in [1]. The baseline is ConvNeXt V1 in Table 2.
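For reference, such augmentations are commonly expressed as follows (a torchvision sketch with a hypothetical input size and normalization constants; the paper's exact pipeline may differ).

```python
from torchvision import transforms

# Illustrative augmentation pipeline (the paper's exact settings may differ):
# random rotation within [-30, 30] degrees and a random vertical (up-down)
# flip, as questioned by the reviewer, plus resizing and normalization.
train_transform = transforms.Compose([
    transforms.Resize((512, 512)),           # hypothetical input size
    transforms.RandomRotation(degrees=30),   # rotation sampled from [-30, 30] degrees
    transforms.RandomVerticalFlip(p=0.5),    # up-down flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```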

Means and stds in Table 2 The 2nd best, ConvNeXt V1: 56.47+/-1.26, 50.89+/-1.96, 93.18+/-0.41, 94.35+/-0.57, 89.69+/-0.81. Ours: 58.96+/-0.89, 53.67+/-0.88, 93.42+/-0.30, 95.07+/-0.47, 90.58+/-0.47. The means of our method are higher and the stds smaller than those of the 2nd best, so our method is better.
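To address the significance concern more directly, one could test whether the gap between means is statistically significant given the reported standard deviations, e.g., with Welch's t-test computed from summary statistics. The sketch below uses the mF1 values above and a hypothetical number of repeated runs; the actual number of trials should be substituted.

```python
from scipy import stats

# Welch's t-test from summary statistics (mF1 of ours vs. the 2nd-best
# ConvNeXt V1 as reported above). n_runs is hypothetical; use the actual
# number of repeated trials behind each mean +/- std.
n_runs = 5
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=58.96, std1=0.89, nobs1=n_runs,
    mean2=56.47, std2=1.26, nobs2=n_runs,
    equal_var=False,   # Welch's t-test (unequal variances)
)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```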

Lack of statistical significance, large stds in Table 3 mAcc considers whether the prediction is correct. Standard deviations (std) in Tab. 3 are about 0.3, which is acceptable. mF1 and mKappa account for the extreme class imbalance. Slight differences in the predictions on the small number of positive samples in each trial can lead to large stds. A larger-scale dataset would alleviate this issue in the future.

Table 3 shows how the two losses affect the results. We never claim that L_Rank contributes to consistency. L_Rank considers the severity levels and contributes to the rating agreement between predictions and the GT labels; in other words, it improves mKappa. L_EyeCon is responsible for the improvement on mAccCon. With the two losses together, as Tab. 2 shows, ours achieves the best results.

Applying the proposed losses to SOTAs? Applying the losses to the SOTAs in Tab. 2 would improve their performance. For example, setting alpha to 0.2 and beta to 0.1, the performance of ConvNeXt V2 increases to 55.55+/-0.85, 50.09+/-0.80, 93.06+/-0.27, 95.23+/-0.40, 90.07+/-0.53 from 52.04+/-1.99, 47.11+/-2.41, 93.02+/-0.13, 95.17+/-0.90, 89.67+/-0.58.

Other UWF: ultra-widefield. A detailed description of Fig. 3 will be included in its caption.

R4: Results on another dataset, description of SOTAs, how the proposed method improves on SOTAs, visualising tables. Experiments on DeepDRiD [1], a dual-view fundus image dataset for 5-level DR grading, have been conducted but were not included due to the page limit. The SOTAs will be described in detail in the revised version. We will use t-SNE to visualise the feature space to illustrate how the proposed method improves on SOTAs, and radar charts to visualise the tables.

R5: Comparable detection results For the balanced evaluation metrics mF1 and mKappa, our method surpasses the 2nd best, ConvNeXt V1, by 2.49 and 2.78, respectively. In mAcc, it surpasses the 2nd best by 0.24. The improvements are clear. For the consistency-related metrics mCon and mAccCon, ours surpasses the 2nd best by 0.72 and 0.89. In the future, we will use diffusion models to generate multiple views of images to further improve consistency.

[1] Liu R, et al. DeepDRiD: Diabetic retinopathy grading and image quality estimation challenge. Patterns, 2022.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper has good contributions in terms of dataset and a novel method incorporating trustworthiness. This will be a good contribution to the field.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The paper has good contributions in terms of dataset and a novel method incorporating trustworthiness. This will be a good contribution to the field.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    R3 gave this paper a weak reject based on the methodology but did not comment on the usefulness of the new dataset. While I agree with R3 that there are some flaws in describing the methodology and the results are not very convincing, I find the other two reviewers to be very enthusiastic about the release of the new dataset. I am hence inclined towards accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    R3 gave this paper a weak reject based on the methodology but did not comment on the usefulness of the new dataset. While I agree with R3 that there are some flaws in describing the methodology and the results are not very convincing, I find the other two reviewers to be very enthusiastic about the release of the new dataset. I am hence inclined towards accepting this paper.


