Abstract

The detection of semantic and covariate out-of-distribution (OOD) examples is a critical yet overlooked challenge in digital pathology (DP). Recently, substantial insight and methods on OOD detection were presented by the ML community, but how do they fare in DP applications? To this end, we establish a benchmark study, our highlights being: 1) the adoption of proper evaluation protocols, 2) the comparison of diverse detectors in both a single and multi-model setting, and 3) the exploration into advanced ML settings like transfer learning (ImageNet vs. DP pre-training) and choice of architecture (CNNs vs. transformers). Through our comprehensive experiments, we contribute new insights and guidelines, paving the way for future research and discussion.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2427_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Oh_Are_MICCAI2024,
        author = { Oh, Ji-Hun and Falahkheirkhah, Kianoush and Bhargava, Rohit},
        title = { { Are We Ready for Out-of-Distribution Detection in Digital Pathology? } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study benchmarks various classifiers for histopathology imaging, focusing on addressing out-of-distribution (OoD) problems. It explores the use of transfer learning with both convolutional neural networks (CNNs) and transformers, presenting results that demonstrate the impact of transfer learning on histopathology tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This benchmark study in histopathology imaging classification introduces a promising approach, featuring recent architectures to ensure a fair comparison. The juxtaposition of CNNs and transformers is insightful, and the inclusion of transfer learning from the pathology domain is particularly promising. However, the study requires substantial enhancements to fully substantiate its findings and meet its stated research objectives.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This study raises several crucial considerations that need addressing. Firstly, there needs to be greater consistency between the problem statement and the evaluations conducted. The study should encompass scenarios and cases that align with the primary motivation of the research—determining whether classifiers can acknowledge their limitations by indicating “I don’t know.” It should also explore how such acknowledgments affect subsequent tasks and how pathologist insights could mitigate these issues. Additionally, the analysis should examine how different classifiers tackle this challenging problem, particularly in instances where some fail to classify correctly. It is also essential to clarify how these issues relate to out-of-distribution (OoD) and generalization problems. The problem formulation in the current study lacks clarity, and the novel insights proposed are not sufficiently supported by the experimental results and evaluations provided.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors can make the code available on GitHub for reproducibility and to support their findings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This benchmark study in the field of histopathology imaging classification presents intriguing initial findings, yet several areas require significant refinement to enhance the clarity and completeness of the research. The problem statement, which queries whether classifiers can express uncertainty by stating “I don’t know” and under what conditions, needs to be more prominently and clearly defined. Moreover, the study should conclusively address this question, supported by appropriate visual and quantitative data. Several improvements are suggested for this study:

    1. The concept of semantic out-of-distribution (SOoDD) detection is introduced on Page 1 as “Any sample {x, y} ∉ P(X_ID, Y_ID) is viewed as OoD, and SOoDD specifically seeks to detect semantic OoDs shifted in the label space, i.e., {x*, y*} with y* ∉ Y_ID.” However, the explanation lacks clarity, particularly in differentiating SOoDD from general OoD shifts. If SOoDD involves shifts in the label space, this could also imply a domain shift in the input space. The manuscript should clearly delineate how SOoDD differs from traditional OoD and provide examples to illustrate cases where the input distribution might be OoD while the label space remains within the distribution, and vice versa. The authors should also cite work on covariate OoD on Page 1.
    2. Is there a specific reason to formulate the detection problem as a binary classification task? The study revolves around the classification task, and the authors have motivated the study based on two types of detection problems mentioned on Page 1. The reviewer suggests sticking with classification instead of detection for consistency in the work.
    3. On Page 2, in the Related Work subsection, citations are needed to support the statement that “Although limited, a few studies have evaluated detection tasks in digital pathology (DP).”
    4. Additionally, the simulation of an Open Set Recognition (OSR) setting, as mentioned on Page 2, by excluding a small fraction of the class during training and then performing SOoDD on the held-out classes, seems to align more closely with a generalization test set scenario rather than true out-of-distribution (OoD) or semantic out-of-distribution (SOoDD) detection. This distinction requires clarification.
    5. On Page 3, the claim that “Transformers are better than CNNs, as many recent studies suggest” should be substantiated with specific citations. Furthermore, the discussion about novel insights regarding transformers needs to be refined. While the manuscript mentions novel findings, the literature already contains several studies addressing this topic [1,2,3,4]. It is crucial that the authors specify how their insights differ from or advance the existing research. This will help reinforce the originality and value of the current study.
    6. Similarly, previous studies [5,6] have explored the topic of Uncertainty Quantification (UQ), which the authors describe as novel insights. It is essential for the authors to specify how their contributions differ from existing works to underscore the novelty of their research. Additionally, considering other recent works [7,8] that touch on related topics could provide further context and enhance the originality and relevance of the study’s findings. The reviewer recommends that the authors integrate such references to deepen the discussion of their purported novel insights. The reviewer thinks that the novel insights need to be revisited and reconsidered.
    7. The explanation provided for the problem illustrated in Figure 1 is not clear. The authors should enhance the clarity to improve readers’ understanding.
    8. On Page 3, the abbreviation MUS is mentioned but not defined. It appears to be a typo.
    9. The presentation of Table 1 is confusing, particularly the significance of the four numbers of OSR for each dataset. Could the authors provide a clearer explanation?
    10. In the caption of Table 1, please add “respectively” at the end like this: “I and O denote ID (closed-set) and OoD (open-set), respectively.”
    11. On Page 4, the term NCT-CRC is introduced without explanation. What does NCT-CRC stand for?
    12. The phrase “reaching count milestones” mentioned on Page 4 is vague. Could the authors explain it?
    13. Citations are missing on Page 4 where the text discusses “Multiple recent works have proposed a general-purpose DP model for TL.”
    14. There is a typo in the second-last paragraph on Page 4: “We make these supervised checkpoints public: .” It should be corrected to “We make these supervised checkpoints public.”
    15. The abbreviation DE (Deep Ensemble) is used on Page 5 without prior introduction. Please define this abbreviation the first time it is used in the text.
    16. The term “macro-accuracy” mentioned on Page 5 is unclear. Could the authors clarify what this refers to and provide citations to support their discussion?
    17. The clarity and presentation of Table 3 need improvement. The table features a column for accuracy and separate columns for AUROC percentages for different methods, but it does not explain what the accuracy refers to, which can be confusing. It would be helpful to label these clearly; for example, headers such as “BreakHis: ID Acc% & SOoDD AUROC%” or “NCT-CRC: ID Acc% & SOoDD AUROC%” need proper explanation to enhance the readability and comprehension of Tables 3, 4, and 5.
    18. It appears that ImageNet pretrained models are outperforming other methods. The authors should delve deeper into this observation, providing insights and potential reasons behind these results.
    19. The manuscript primarily discusses various classifiers without sufficiently connecting these to the main question of classifier uncertainty. Furthermore, the study lacks robust visual and quantitative support for its conclusions. To improve, the authors should refocus on the primary research question, provide detailed data analysis, and explore the implications of pathologist intervention more explicitly. The reviewer hopes that these modifications will improve the current study.

    References cited by the reviewer:
    [1] Atabansi, C.C., Nie, J., Liu, H., et al. “A survey of Transformer applications for histopathological image analysis: New developments and future directions.” BioMed Eng OnLine 22, 96 (2023).
    [2] Deininger, L., Stimpel, B., et al. “A comparative study between vision transformers and CNNs in digital pathology.” arXiv, 2022.
    [3] Bai, Y., Mei, J., Yuille, A., Xie, C. “Are Transformers More Robust Than CNNs?” NeurIPS 2021.
    [4] Matsoukas, C., Haslum, J.F., Söderberg, M., Smith, K. “Is it Time to Replace CNNs with Transformers for Medical Images?” ICCV Workshop, 2021.
    [5] Zou, K., Chen, Z., Yuan, X., Shen, X., Wang, M., Fu, H. “A review of uncertainty estimation and its application in medical imaging.” Meta-Radiology, Volume 1, Issue 1, 2023.
    [6] Kurz, A., Hauser, K., Mehrtens, H.A., et al. “Uncertainty Estimation in Medical Image Classification: Systematic Review.” JMIR Med Inform, 2022.
    [7] Marini, N., Marchesin, S., et al. “Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations.” NPJ Digit Med, 2022.
    [8] Senousy, Z., Abdelsamea, M.M., et al. “MCUa: Multi-Level Context and Uncertainty Aware Dynamic Deep Ensemble for Breast Cancer Histology Image Classification.” IEEE Transactions on Biomedical Engineering, 2022.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The benchmark study in histopathology imaging classification offers promising insights but falls short in several key areas. The core research question, “Can the classifier say ‘I don’t know’?”, needs to be clearly defined and more thoroughly addressed. The manuscript primarily discusses various classifiers without sufficiently connecting these to the main question of classifier uncertainty. Furthermore, the study lacks robust visual and quantitative support for its conclusions. To improve, the authors should refocus on the primary research question, provide detailed data analysis, and explore the implications of pathologist intervention more explicitly. These revisions are crucial to enhance the study’s clarity and relevance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I would like to change my score as the authors have promised to address the raised concerns, particularly point 1, which is a significant issue. The authors have addressed most of my concerns, and I hope these changes will be reflected in the final camera-ready paper. Additionally, I believe that some of the references I provided can help differentiate this study from previous work in the field. It is crucial that the authors specify how their insights differ from or advance existing research to reinforce the originality and value of their study. The authors should revise the presentation of the results in Tables 3, 4, and 5 for better understanding and readability in the final camera-ready version.



Review #2

  • Please describe the contribution of the paper

    The paper presents a thorough investigation into semantic out-of-distribution detection and misclassification detection for digital pathology. The authors identified four major issues with previous studies: misleading practices, limited detectors, simple or non-public datasets, and insufficient depth. Despite these limitations, the authors presented a robust study with certified protocols and a broader scope. The study employs two open-source digital pathology datasets (BreakHis and NCT-CRC). These datasets were adapted for SOoDD and MD by excluding classes from the iD training set and using them only during testing, as well as by applying heavy augmentations. The study employed a diverse set of neural network architectures, ranging from convolutional neural networks to visual transformers. This study used nine OoD detectors as well as Deep Ensembles with two levels of uncertainty. The results are presented as averages from multiple runs, with AUROC and PRR.
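
    To make the described hold-out protocol concrete, here is a minimal illustrative sketch (an editorial example assuming (path, label) sample tuples and a hypothetical osr_split helper, not the authors' pipeline) of an OSR-style split in which held-out classes serve as semantic OoD at test time:

```python
# Minimal sketch (assumptions, not the authors' pipeline): build an OSR-style
# split by holding out a subset of classes as semantic OoD for testing.
from typing import List, Sequence, Tuple

Pair = Tuple[str, int]  # (image path, class label)

def osr_split(samples: Sequence[Pair], holdout_classes: set) -> Tuple[List[Pair], List[Pair]]:
    """Split (path, label) pairs into an iD (closed-set) pool and an OoD (open-set) pool."""
    id_pool = [(p, y) for p, y in samples if y not in holdout_classes]
    ood_pool = [(p, y) for p, y in samples if y in holdout_classes]
    return id_pool, ood_pool

# Example: hold out classes {2, 5}; a classifier is trained only on id_pool,
# and ood_pool is presented only at test time as semantic OoD.
```

A model trained on the closed-set pool can then be scored on both pools, so SOoDD performance is measured without conflating it with generalization.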

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper does a great job of defining the contexts of SOoDD and MD and how they relate to DP image classification.

    The related work and the identified limitations give a good impression of the current state of the literature and other work in this area and provide a great base for this work to improve on.

    The wide range of neural network architectures gives an evaluation of methods that are becoming more popular within the MICCAI community.

    The range of OoD detection methods was good and helped make this study more robust when testing applications of SOoDD.

    Repeating the experiments multiple times makes the results more robust to random variations in the training and evaluation process.

    The insights from the experimental results are strong, provide useful guidance to others in the MICCAI community, and can be useful for digital pathology and beyond.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Despite running the experiments for multiple runs, no indication of variance is presented over the runs. This would help indicate how robust each of the models is from the different training and evaluation random variations.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have included all source code in the supplementary materials. This should be included as a link to a repository in the final version of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses for some feedback.

    I think the authors should be made aware of the paper from Jaeger et al. (2022), “A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification”. This paper presents a similar study (with natural images) and proposes a unified single metric for evaluation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a robust study of SOoDD and MD using two digital pathology datasets, various neural network architectures, OoD methods, and Deep Ensembles. This paper is a valuable addition to the MICCAI community due to its wide range of methods and results, as well as the authors’ insightful contributions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors did a good job at addressing the points from reviewers.



Review #3

  • Please describe the contribution of the paper

    Authors have addressed the out-of-distribution (OoD) and misclassification problems in ML models applied to digital pathology (DP) images. On one hand, they have provided a way to construct datasets addressing both problems while avoiding the confusion between generalization and OoD detection. On the other hand, they ran a comparative study of state-of-the-art models, spanning CNNs and Transformers with pretraining from ImageNet and DP. Finally, they have derived some insights from these experiments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Underscores an issue often conflated in the ML community: misclassification and out-of-distribution detection are two separate problems, yet they are frequently treated together.
    2. Provides two ways to create dedicated datasets from existing ones to evaluate these two robustness dimensions separately and unambiguously.
    3. Provides some insights and points to research avenues, backed by experiments, for researchers working on digital pathology images.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The dataset construction for addressing the misclassification problem is not novel per se, as it is taken entirely from another article.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the Introduction, the sentence “we limit our work to (non-semantic) OoD, referred to as covariate OoD” should be removed, as SOoD is part of the work.

    In the experiments, it would be good to apply an often-neglected way to avoid misclassification and, sometimes, SOoD: converting the multi-class classification task into a multi-label classification task. In this setting, the model can output zero or many co-existing classes at once; here, the list would ideally contain only one element. When no class is output, either the model has detected an OoD sample or it does not have the confidence to make a choice. It would be interesting to see the performance of CNNs and Transformers using the multi-label approach.
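
    As a minimal illustration of this suggestion (the multilabel_predict helper and the 0.5 threshold are editorial assumptions, not part of the paper), an empty multi-label prediction set can act as an “I don’t know” signal:

```python
# Minimal sketch of the reviewer's multi-label suggestion (illustrative only):
# per-class sigmoid probabilities are thresholded independently; an empty
# prediction set means the model abstains (possible OoD or low confidence).
import torch

def multilabel_predict(logits: torch.Tensor, threshold: float = 0.5):
    """logits: (batch, num_classes). Returns a list of predicted class-index lists."""
    probs = torch.sigmoid(logits)        # independent per-class probabilities
    keep = probs >= threshold
    return [torch.nonzero(row).flatten().tolist() for row in keep]  # [] = abstain
```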

    Finally, the abbreviation DE has been used without a definition; although it can be inferred from the context (Deep model ensemble), it is better to define it clearly.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses the issue of robustness along two important dimensions and provides an approach to assess them. This is an important issue when deploying ML models.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I made two minor observations and another one about a new experiment (which is not a valid one); these do not affect my decision. One minor observation has not been taken into account yet: the abbreviation DE has been used without a definition; although it can be inferred from the context (Deep model ensemble), it is better to define it clearly.




Author Feedback

We thank the reviewers for their feedback and appreciate the overall positive responses (6, 5, 2). Our comments follow:

R1: We correct the deficiencies and agree that this approach could be a very good one.

R3: We appreciate the reviewer’s enthusiasm and desire to see a full-blown application to pathology (“…explore how pathologist insights could mitigate these issues”). While we fully agree, this falls outside of our scope. There seems to be a misunderstanding in “need of greater consistency between the problem and evaluations”, since our problem (OoD and misclassification detection) is fully aligned with its literature. Please see [1, 2] and the references in Tab. 2. Adopting OSR for OoD detection is also well-established [3, 4]. For misclassification detection, we direct to [5-7]. Also, “…instances where some fail to classify correctly” seems to refer to misclassification detection. We believe our OoD problem statement and its discussion w.r.t. generalization are sufficient in the intro (“…clarify how it relates to OoD and generalization. The problem formulation lacks clarity”). We are also surprised by the reviewer’s broad “…novel insights proposed are not sufficiently supported by the experimental results…”. We respectfully disagree: the results in S3 are quantitatively substantiated by Tabs. 3-5. Specifically, for each point:

  1. In the literature, an OoD sample is “semantic” or “covariate” depending on whether it falls under an iD label. In the former, it does not, and typically, a semantic shift is accompanied by a shift in the input space too. Semantic OoD is not a “non-traditional” OoD but a rudimentary category. In covariate OoD, the labels are still iD but the input distribution is different; hence, the case mentioned by the reviewer is covariate. We will make these concepts clearer.
  2. It permits quantitative assessment of OoD detection method performance, e.g., by measuring AUROC to reflect iD-OoD separability (a minimal sketch of this protocol appears after this list). This is the established norm in the literature [1-7], and the task is called “OoD detection”, not “OoD classification”.
  3. No, OSR cannot be a generalization problem since there is no intersection in labels. For instance, a model predicting “cat vs dog” cannot generalize to birds. OSR is a semantic OoD task.
  4. These references do not seem to relate to OoD detection. While we are not the first to compare “transformer vs CNN” in OoD detection, we are the first to do so in the context of digital pathology.
  5. Again, these references do not relate to OoD detection. While a few histopathology studies do address this task, we explain in the intro their shortcomings and how our study is different. Notably, we are the first to compare UQ methods and advanced post-hoc detectors for histology OoD detection. The insights (page 7) are novel to the medical community and of great relevance.
  6. Using multiple OSR settings reduces bias; the four-fold split is arbitrary.
  7. The name of the dataset.
  8. We compute the validation acc% at every interval and step the learning rate down upon failing to improve for the 2nd time, 4th time, etc.
  9. Macro-average over all configs.
  10. We clarify “accuracy” in the main text (page 5) and within the tables themselves.
  11. No, this is not true - see the discussion on page 8. There are a few exceptions; however, an in-depth analysis is not possible within the space constraints.
  12. Our main question is OoD detection, not classifier uncertainty. In this context, the latter is just one competing method among many, listed in Tab. 2. There is no special need to position the detectors relative to such uncertainty. Points 3, 7, 8, 10, 13, 14, and 15 are fixed.
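
For illustration, here is a minimal sketch of this standard evaluation protocol (the maximum-softmax-probability score and the helper names are editorial assumptions; the paper benchmarks many other detectors):

```python
# Minimal sketch (not the authors' code): score iD and OoD samples with the
# maximum softmax probability (MSP) and measure iD-OoD separability via AUROC.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def msp_scores(model, loader, device="cpu"):
    """Return the max softmax probability per sample (higher = more iD-like)."""
    model.eval()
    scores = []
    for images, _ in loader:
        logits = model(images.to(device))
        scores.append(F.softmax(logits, dim=1).max(dim=1).values.cpu().numpy())
    return np.concatenate(scores)

def ood_auroc(id_scores, ood_scores):
    """AUROC with iD labeled 1 and OoD labeled 0, as in standard OoD evaluation."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    return roc_auc_score(labels, np.concatenate([id_scores, ood_scores]))
```

Any other detector (energy-based, ensemble-based, etc.) would plug in by swapping the scoring function while keeping the same AUROC computation.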

R4: The Jaeger et al. reference is now included.

[1] https://doi.org/10.48550/arXiv.2210.07242
[2] https://doi.org/10.48550/arXiv.2110.11334
[3] https://doi.org/10.48550/arXiv.2110.06207
[4] https://doi.org/10.48550/arXiv.2106.03917
[5] https://doi.org/10.48550/arXiv.2107.00649
[6] https://doi.org/10.48550/arXiv.2211.16158
[7] https://doi.org/10.48550/arXiv.2106




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There was one reviewer recommending rejection (R3), who gave extensive feedback. The authors made an effort to respond to each of the (18!) raised concerns one by one, and as a result R3 now recommends acceptance, as do R1 and R4. I am happy to back this recommendation, congratulations!




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper is not without its flaws, but it is 1000 times more valuable to the MICCAI community than papers that merely “add a new loss and improve performance by 0.01%.” The authors could do better on the title.



