Abstract

Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model’s reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample’s deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods. Code Available: https://github.com/bhattarailab/NERO
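The scoring mechanism described in the abstract can be sketched in a few lines: compute a neuron-level relevance vector per sample, form one representative centroid per in-distribution class (the paper clusters relevances; a simple per-class mean is used here as a stand-in), and score a new sample by its distance to the nearest centroid. The function names and synthetic relevance vectors below are illustrative assumptions, not the released implementation.

```python
import numpy as np

def fit_centroids(relevances, labels):
    """One centroid per ID class: here a per-class mean of neuron-level
    relevance vectors (the paper clusters; a mean is a simplification)."""
    classes = np.unique(labels)
    return np.stack([relevances[labels == c].mean(axis=0) for c in classes])

def relevance_distance_score(r, centroids):
    """OOD score: distance from a sample's relevance vector to the
    nearest ID-class centroid (larger means more OOD-like)."""
    return np.linalg.norm(centroids - r, axis=1).min()

# Toy illustration with synthetic relevance vectors.
rng = np.random.default_rng(0)
id_rel = rng.normal(0.0, 0.1, size=(40, 8))   # ID relevances cluster near 0
labels = np.repeat([0, 1], 20)
cents = fit_centroids(id_rel, labels)
id_score = relevance_distance_score(rng.normal(0.0, 0.1, size=8), cents)
ood_score = relevance_distance_score(rng.normal(3.0, 0.1, size=8), cents)
assert ood_score > id_score  # OOD sample lies farther from all centroids
```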

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0796_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/bhattarailab/NERO

Link to the Dataset(s)

Kvasir Dataset: https://datasets.simula.no/kvasir/

GastroVision Dataset: https://github.com/DebeshJha/GastroVision

BibTex

@InProceedings{ChhAnj_NERO_MICCAI2025,
        author = { Chhetri, Anju and Korhonen, Jari and Gyawali, Prashnna and Bhattarai, Binod},
        title = { { NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance in Gastrointestinal Imaging } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {348--358}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper deals with an important task in medical imaging and translational applications: out-of-distribution (OOD) detection. Specifically, the authors propose a novel methodology named NERO that leverages neuron-level relevance for distinguishing ID and OOD samples. Importantly, the authors form clusters (centroids) of ID classes based on neuron relevance scores and then use a simple distance metric to evaluate whether an input sample lies far enough from the centroids to be flagged as OOD. If not, standard multi-class ID classification is performed. The authors benchmark their proposed methodology on the gastrointestinal imaging datasets Kvasir and GastroVision and claim improvements over SOTA approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Below are some of the major strengths of the paper:

    1. The paper is well written, with a strong introduction and clear motivation. The introduction also covers the existing streams of work predominantly employed in OOD detection.

    2. The methodology details are clear in most aspects and easy to follow, with the motivation for adopting neuron-level relevance well explained.

    3. The methodology provides an additional capability of explainability along with the OOD detection, which is a plus when compared to the existing OOD detection techniques.

    4. The authors have demonstrated their idea using multiple network architectures - both CNNs and DeiT (Transformers) which is good to see how their idea can generalize across different architectures.

    5. The implementation details are clearly written and explained.

    6. The authors have covered and compared against most existing SOTA OOD detection approaches across different streams, although a couple of major ones are still missing.

    7. The author’s demonstration of visualization cases for explainability of OOD detection approach is impressive, especially when coupled with the mean relevancy score distributions for ID and OOD. This is probably one of the few works that have demonstrated this way of explainability along with the OOD detection capability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Below are some of the weakness, questions, and suggestions to improve the paper:

    1. There is a major lack of motivation and understanding around introducing the neuron for bias relevance. The authors simply state that this bias neuron is necessary, but the motivation for why is not clear. I really want the authors to clarify this, because in my understanding and experience, most works rely on the features extracted from within the network and do not introduce any additional neurons/parameters specifically for OOD detection.

    2. On top of the above comment, the authors also indicate in their methodology that “Through empirical observations, we found that relevance scores distributed in the bias nodes exhibit strong discriminative power for OOD detection”. This is contrary to the feature-space learning and neuron-level relevance that I expected to be better suited for discriminating between OOD and ID samples. Moreover, as these are empirical observations, it is not clear what kind of observations they are; for example, did the authors plot the relevance score metrics for penultimate-layer neurons and bias neurons and discover this? A further concern is whether such an empirical observation would still hold for a dataset from a different medical domain, e.g., radiology, dermatology, or histopathology.

    3. The benchmarking SOTA OOD techniques don’t include two major ones that have been heavily relied on feature space learning and distance metric utilization - falling in a similar stream that the proposed methodology of authors is. Please see the works [1,2], which I believe should have been included as well for benchmarking, but given that the authors have also compared to other works in this stream, I am not so demanding to be included in this version of the paper, but something that the authors should definitely consider for their future extensions.
      • [1] Chen, G., Peng, P., Wang, X., & Tian, Y. (2021). Adversarial reciprocal points learning for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 8065-8081.
      • [2] Vaze, S., Han, K., Vedaldi, A., & Zisserman, A. (2021). Open-set recognition: A good closed-set classifier is all you need?.
    4. The authors have lacked an ablation study in experiments of including / not including the bias neuron. I understand that this might be due to space constraints and the authors may have been inclined to demonstrate the explainability of their method, but I believe this ablation study is one of the major points of this paper that should have been included. Can the authors clarify if they did any experiments without the bias neuron introduction and how was the performance?

    5. The explainability maps demonstrated in Fig 3. are considered using the top four channels in terms of relevance score for the ID classes. Is there any reasoning why only the top four channels were selected and how was this number arrived at?

    6. The author’s approach demonstrated impressive results with the DeiT as the backbone and has surpassed all the methods for both the datasets. Specifically, the FPR95 score improves significantly which is hard to see for OOD detection methods. However, when using Resnet-18 as the backbone, the author’s proposed method doesn’t achieve such a good performance when compared to the other techniques. There is a lack of explanation regarding this discrepancy. Can the authors explain what made this performance difference so obvious for CNN and transformer? This could be an interesting analysis to be made and I wish the authors could include this either in this work or their future extensions.

    7. Why the authors only resorted to the penultimate layer of neurons for their approach? Have the authors thought about a collective way of considering the relevancy scores of all the neurons?

    8. For the explainability part shown in Fig. 3, it is good to demonstrate the mean relevancy score for the ID/OOD samples of the respective classes considered and the number of channels. However, a more comprehensive representation would consider all ID and OOD samples from all classes and demonstrate this difference. Specifically for a future extension, I would really like to see whether there is a stark difference between the mean relevancy scores for certain ID and OOD classes that are easily confused (i.e., near-ID classes that fall as OOD).

    Minor:

    1. Paper title: The paper concentrates its evaluation on the gastrointestinal imaging space and benchmarks only on gastro datasets. Moreover, as the introduction (paragraph 1, lines 7-10) is also motivated from the gastrointestinal/endoscopy angle, I believe the paper title should mention OOD detection for endoscopy/gastrointestinal imaging. The authors have not demonstrated how their approach fares on other domains of medical imaging, although this may be future work. As it stands, the current title reads as generic and applicable to multiple medical domains, but that is not demonstrated by the authors.

    2. Figure 1 flow: I believe Fig 1(c) and Fig 1(d) should be interchanged in their order as usually a left-to-top right and bottom approach is followed for figure flow. So, (c) should be on the top and (d) should be on the bottom.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I really liked reading this paper and the work is interesting with the idea being simple but intuitive. Moreover, it’s one of the few post-hoc methods in OOD detection which could also explain through relevancy maps and relevancy scores about the decision made for OOD detection / ID classification. However, there are certain shortcomings in terms of motivation for the bias neuron relevance inclusion and also the experiments section regarding this inclusion. Moreover, the empirical results suggested that bias neuron relevance was more important to distinguish the ID and OOD samples, so this becomes really important to know how and why this happened.

    I am happy to revise my rating if my concerns and doubts are addressed appropriately in the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my major concern about including the bias neuron for relevance, and it makes sense. Moreover, this is a very different stream of work that analyses neuron-level relevance for OOD detection, and I thus think it is highly interesting to the community.

    The authors have conducted enough experiments to demonstrate the superiority of their approach, providing valuable explainability for the task of OOD detection, which is a special point of this paper.

    The only limitation is the evaluation on different medical domain datasets was not conducted and was limited to gastrovision/endoscopy datasets. But, I believe this is due to space limitations and also a core medical domain problem which suffers from the issue of OOD. However, I do believe that this approach could be applied to different datasets in other medical domains.

    The clarity of the paper writing could be improved (especially the motivation behind the bias neuron) and some figures could be improved as has been suggested in the initial review. I do hope the authors improve their presentation in the final version of the paper.

    One final point is about the title of the paper. I strongly believe it should include OOD detection for endoscopy / gastrovision as the Introduction is motivated from that angle as well as the experiments are concentrated in this domain and it’s not a general application.

    Thus, I recommend for acceptance of this paper given that the above suggestions about clarity and paper title will be incorporated in the final version.



Review #2

  • Please describe the contribution of the paper

    The paper introduces NERO, a novel post-hoc out-of-distribution (OOD) detection method that leverages neuron-level relevance to distinguish between in-distribution and OOD samples. By clustering relevance scores for each class and formulating an OOD score that combines the minimum distance to class centroids, the authors not only provide quantitative improvements on gastrointestinal imaging benchmarks but also offer an explainable framework.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The work presents a fresh perspective on OOD detection by focusing on neuron-level relevance rather than relying solely on activation magnitudes or feature-space representations. This idea, grounded in techniques like Layer-Wise Relevance Propagation (LRP), is well motivated within the context of ensuring diagnostic reliability.

    2. One of the standout contributions is the integration of explainable AI. By providing visualizations of relevance maps and comparing ID versus OOD samples, the method offers deeper insights into the decision process, which is especially valuable in sensitive domains like medical imaging.

    3. The introduction does a commendable job of guiding even non-experts into the field. It clearly presents the state of the art and effectively highlights the existing gaps that this work aims to address.

    4. The authors validate NERO on two real-world datasets and two distinct architectures (ResNet-18 and DeiT). The experiments demonstrate competitive or superior performance compared to established methods, as evidenced by AUROC and FPR95 metrics.

    5. The paper provides a rigorous explanation of the relevance estimation procedure, the construction of class-specific centroids, and the formulation of the OOD score.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Although the paper introduces a lambda term to balance the bias relevance, additional analysis or discussion on how much this term contributes to the overall performance of the method would help clarify its importance.

    2. The concept of selecting “bottom k channels” for further scaling would benefit from additional explanation or illustrative examples to help readers understand this parameter’s role in the OOD score.

    3. The paper primarily relies on LRP-0 for relevance estimation, but it does not provide a clear rationale for this choice. Including a discussion on alternative variants—such as the alpha–beta or gamma rules—and their potential impact on performance would offer valuable insight into the robustness and flexibility of the proposed approach.

    4. While the mathematical formulations are comprehensive, simplifying or adding intuitive explanations in parts could make the material more accessible to a broader readership, without compromising the technical rigor.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The introduction is well-written, and the methods section provides a clear explanation of the motivation and underlying mechanics of the proposed approach. However, the results section could be improved—particularly the presentation of findings related to the bottom-k channels, which remains unclear, as this concept is not introduced or explained in the methods section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel and explainable OOD detection method that shows strong empirical performance and is well-motivated, especially for medical imaging. Its use of neuron-level relevance and relevance map visualizations is a clear strength. However, the paper has notable weaknesses in clarity and methodological explanation—particularly around the choice of LRP-0, the role of the lambda term, and the unexplained bottom-k channel selection in the results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the introduction of NERO (Neuron-level Relevance-based Out-of-distribution Detection), a novel and explainable post-hoc OOD detection framework for medical imaging.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The paper introduces a new formulation for OOD detection by computing neuron-level relevance scores (using LRP-0) for each sample and clustering them per class to form relevance centroids.

    2. The method is evaluated on two diverse gastrointestinal datasets (Kvasir-v2 and GastroVision) and across two backbone architectures (ResNet-18 and DeiT).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The method applies Principal Component Analysis (PCA) to reduce neuron-level relevance vectors. While effective, PCA is a linear method and may not preserve non-linear or subtle spatial dependencies in relevance maps, which could affect OOD scoring.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel and interpretable post-hoc OOD detection method (NERO) that leverages neuron-level relevance patterns for improved reliability in medical image analysis.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors provided a clear and well-organized rebuttal that effectively addresses the primary concerns raised by the reviewers. Although some points, like the choice of PCA and model architecture comparisons, remain areas for future exploration, these do not detract significantly from the overall contribution.




Author Feedback

We thank all reviewers for their constructive and clear feedback. Below, we address key concerns:

Use of LRP-0 for relevance estimation [R3Q3]: As described in Section 2, we focus our analysis on the penultimate layer, where LRP-0 provides conservative relevance propagation aligned with the model’s final decision; in contrast, LRP variants such as the alpha-beta or gamma rules are better suited to lower layers [22].

Inclusion of the bias neuron [R4Q1]: We follow the approach from [22] and include a bias neuron in our setup to represent the contribution of the bias term to the final output (please see “Relevance Estimation” in Section 2). As discussed in [3], when neurons have non-zero bias terms, part of the relevance is injected or absorbed through the bias; including the bias neuron therefore enables us to attribute relevance to all components influencing the model’s output. While many OOD methods focus solely on internal features, our approach aims to construct an OOD score that holistically accounts for all contributing factors, including bias. We will make this clearer in the final version.
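As a rough illustration of how a bias neuron can be carried through relevance propagation, the sketch below applies the standard LRP-0 rule to a single linear layer and routes the relevance absorbed by the bias term into an explicit extra entry. This is a toy sketch under the usual LRP-0 definition, not the paper’s implementation; all names are hypothetical.

```python
import numpy as np

def lrp0_linear(a, W, b, R_out):
    """LRP-0 backward pass through one linear layer z = a @ W + b.

    Returns per-neuron relevance for the inputs plus a final entry for
    a 'bias neuron', so the relevance absorbed by the bias term is
    tracked explicitly rather than discarded."""
    z = a @ W + b                       # pre-activations, shape (out,)
    z = z + 1e-9 * np.sign(z + 1e-12)   # numeric stabiliser against /0
    s = R_out / z                       # relevance per unit of activation
    R_in = a * (W @ s)                  # input-neuron relevance
    R_bias = b @ s                      # relevance captured by the bias
    return np.append(R_in, R_bias)

# Toy layer: 3 inputs, 2 outputs.
a = np.array([1.0, 2.0, 0.5])
W = np.array([[0.2, -0.1], [0.4, 0.3], [-0.2, 0.5]])
b = np.array([0.1, -0.05])
R_out = np.array([1.0, 0.5])            # relevance at the output layer
R = lrp0_linear(a, W, b, R_out)
# LRP-0 is conservative: total relevance is preserved once the bias
# neuron is included.
assert np.isclose(R.sum(), R_out.sum())
```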

Empirical justification for the bias neuron [R4Q2]: As stated in Section 1, the core novelty of our method lies in the relevance distance metric based on neuron-level relevance. The bias neuron was added to further refine OOD detection. We plotted the relevance scores for both the feature and bias neurons and found that while both showed discriminative signals, the bias neuron’s performance was not superior to the feature relevance score. We will revise the text to avoid potential ambiguity. While we do not claim that this empirical observation generalizes across domains, our method is generic and performed consistently on two different datasets, so we expect similar trends in other medical domains.

Bias neuron ablation/lambda [R3Q1, R4Q4]: Per rebuttal guidelines, we do not include additional experimental results here, but we clarify that experiments with the bias neuron showed consistent improvements over those without, particularly in reducing FPR while maintaining or slightly improving AUROC. Due to space constraints we omitted them from the current version, but we will include them in the final version.

Explainability maps [R4Q5]: While we analyzed more channels initially, limiting to four offered the best balance between showing key findings and maintaining visual clarity (Please see Fig. 3).

DeiT vs ResNet-18 [R4Q6]: Thank you for your suggestion. Similar performance gaps between these backbones are observed across other feature-based OOD detection methods such as ViM, Mahalanobis, and NECO (Please see Table 1). This indicates that the observed differences align with broader trends.

Penultimate layer selection [R4Q7]: Owing to the presence of skip connections and attention layers in CNNs and Transformers, estimating relevance scores for lower layers is not straightforward. We leave this study for future work.

Use of PCA [R1Q1]: We chose PCA for its simplicity and effectiveness in our initial experiments. Exploring non-linear dimensionality reduction techniques might improve the performance, and is an interesting direction for future work.

Bottom-k channel selection [R3]: We have shown performance across all K values, demonstrating our method’s competitive performance across different choices of K; we found the range [2N/5, 3N/5] to be effective (please see Fig. 2), with the single best value selected for reporting.

Clarity [R3Q4, R3Q2, R4Q4]: We thank the reviewers for their suggestions on improving clarity. We recognize the value of simplifying mathematical formulations (R3Q4), including additional baselines (R4Q3), and better illustrating the contribution of parameters (R3Q2, R4Q4), which we could not fully address due to space constraints.

To all reviewers, we will release our code upon acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The authors should clarify the points raised by the reviewers in their rebuttal.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I would like to back R4’s suggestion of changing the title to include the fact that the strategy has only been validated in endoscopic/gastro data, and can not be called “generic” so far. It also makes room for a nice extension if the authors want to extend their tool to other domains.


