Abstract

Recent advancements in deep neural networks have shown promise in aiding disease diagnosis and medical decision-making. However, ensuring transparent decision-making processes of AI models in compliance with regulations requires a comprehensive understanding of the model’s internal workings. Unfortunately, previous methods rely heavily on expensive pixel-wise annotated datasets for interpreting the model, a significant drawback in medical domains. In this paper, we propose a novel medical neuron concept annotation method, named Mask-free Medical Model Interpretation (MAMMI), that addresses these challenges. By using a vision-language model, our method relaxes the need for pixel-level masks for neuron concept annotation. MAMMI achieves superior performance compared to other interpretation methods, demonstrating its efficacy in providing rich representations for neurons in medical image analysis. Our experiments on a model trained on NIH chest X-rays validate the effectiveness of MAMMI, showcasing its potential for transparent clinical decision-making in the medical domain. The code is available at https://github.com/ailab-kyunghee/MAMMI.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0914_paper.pdf

SharedIt Link: https://rdcu.be/dV55a

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_49

Supplementary Material: N/A

Link to the Code Repository

https://github.com/ailab-kyunghee/MAMMI

Link to the Dataset(s)

https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

https://github.com/Deepwise-AILab/ChestX-Det10-Dataset?tab=readme-ov-file

https://physionet.org/content/mimic-cxr/2.0.0/

BibTex

@InProceedings{Kim_MaskFree_MICCAI2024,
        author = { Kim, Hyeon Bae and Ahn, Yong Hyun and Kim, Seong Tae},
        title = { { Mask-Free Neuron Concept Annotation for Interpreting Neural Networks in Medical Domain } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {524 -- 533}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces MAMMI, a framework for annotating concepts to neurons in medical image classifiers. Unlike previous methods relying on pixel-wise annotations, MAMMI utilizes a vision-language model to annotate neurons without requiring masks, thus reducing the need for expensive datasets. The authors demonstrate the performance of their approach on chest X-ray classifiers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Understanding the contribution of different neurons in a deep-learning classifier is important, and this work extends that analysis to the chest X-ray classifier.
    • Eliminating the need for pixel-level annotation maps makes the method more usable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Some design choices seem arbitrary. For instance, they extract concepts by filtering nouns in a report without providing much justification. The ablation study presented in Table 1 uses two other completely non-medical concept sets.
    • It’s unclear how/why the authors can attribute one concept to a neuron, since deeper layers tend to have entangled representations of concepts and don’t correspond to a single concept.
    • Limited evaluation on different medical datasets. The authors claim that they propose mask-free annotation in the medical domain; however, they only provide experiments on a single dataset, NIH-14.
    • Only one pre-trained chest X-ray classifier is used for analysis (DenseNet121).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Comments: The paper primarily evaluates MAMMI on a single dataset of chest X-rays (NIH-14). Have you considered validating MAMMI on diverse medical datasets covering various imaging modalities, diseases, and imaging challenges? Can you provide more justification for why a single concept is annotated for each neuron? Also, how is θ_concept obtained? In Figure 2b, how is the adaptive distribution the same as the train distribution, as claimed by the authors on Page 6 in the ablation study on adaptive neuron representative image selection? Could the authors also provide more details about how Fig. 3 was generated? The activation map seems to highlight a larger portion of the image compared to the ground truth.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper needs to be improved significantly in both analysis and writing. Additional datasets and models need to be evaluated, and design choices need to be justified with thorough ablation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I thank the authors for answering my questions during the rebuttal phase. After carefully reviewing the responses and other reviews, I would like to keep my original score. Concept sets created by extracting nouns need some validation, clinical or literature based. I agree with R6 that the faithfulness of explanation is not validated empirically. Assigning individual concepts to neurons isn’t ideal since there can be multiple medical concepts that each neuron can represent. This could lead to unfaithful explanation for a given neuron-concept pair.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a mask-free neuron concept annotation method, MAMMI, for interpreting neural networks for medical tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    There are extensive experiments, especially the ablations. The method is technically sound. The related work is introduced sufficiently. Generally, the writing is clear in the introduction and related work part.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Technical details regarding the noun extraction are not very clear.
    2. The comparison to FALCON and TSI might not be very fair, since the concept sets are different.
    3. A significant part of the proposed approach follows [1] (as stated on page 5, Sec. 3.3). The clinical impact (application novelty) of the method is not highlighted enough in the paper, and the methodological novelty might be a bit limited for MICCAI.
    4. The proposed Adaptive Neuron Representative Image Selection might undermine the faithfulness of explanations.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Writing: the introduction and related work of the submission are written well. However, the authors could consider improving the organization of the method part. As a technical paper, the method details should be described carefully.
    2. The extraction of nouns from the medical reports needs to be elaborated. Considering the large scale of the dataset, manual selection seems infeasible. Are any NLP techniques involved in the process, e.g., simple matching of words against a pre-defined dictionary, or any LLMs? Are the extracted concept lists validated by clinicians, who are the end users of the explanations? Does each concept consist of only a single word, or can it be multiple words?
    3. In Tables 4 and 5, the comparison with baselines might be unfair, since the concept sets are distinct. Is it common to make such a comparison in prior works performing similar tasks?
    4. As stated in weaknesses, the authors could consider reformulating the paper to spotlight the clinical use of the suggested method, as they claim the pipeline is mainly following [1].
    5. I have some concerns about the faithfulness of explanations. In the suggested adaptive neuron representative image selection, although ‘adaptive’, there is still a hyperparameter \alpha decided by the method’s performance on the validation set, which is evaluated by the ‘cosine similarity score’ metric that is suggested to measure explanation faithfulness. The operation relies heavily on the assumption that the network being probed, i.e., the DenseNet121, behaves as the ground truth expects (which is obtained from CLIP in [1, 14]). This assumption cannot be generalized to arbitrary neural networks. As a consequence, it is hard to validate whether the model is wrong or the explanation is wrong; in this case, does the hyperparameter optimization in Fig. 2(a) really improve the faithfulness of the explanations?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has extensive experiments and good writing as well as organization in the first half. However, I do have some concerns regarding the method and experiment part, as stated in the comments. Therefore, I would suggest a weak rejection. I am happy to see rebuttals from the authors if I am understanding anything wrongly.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for addressing many of my concerns! However, the following questions are still not well answered:

    1. Concept sets. To my knowledge, many medical terms consist of multiple words. This could indicate that an interpretation grounded in single words might not accurately satisfy clinicians’ needs. I note that interpretation with multiple concepts (where each concept is one word) is different from a single concept whose terminology consists of multiple words. In fact, explaining the network with such concepts is not unheard of in the community, like [2] noted by R4 or post-hoc concept bottleneck models. I acknowledge their application is different and I am only talking about the definition of concepts.

    2. The results in Tables 1 and 2 can barely demonstrate the methodological novelty of the paper, since it is evident that the CV method cannot be directly applied to medical datasets, given the different concept set.

    3. I still worry about the faithfulness. The example given in Fig. 3 might not be enough. My question is, should we expect the probed DenseNet121 to think like our humans? And tune the \theta to fit our assumption?

    As a consequence, I decided to keep my original grade - actually a borderline reject, which means that accepting the paper is also acceptable to me.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel method, Mask-free Medical Model Interpretation (MAMMI), for interpreting deep neural networks in the medical domain without relying on expensive pixel-wise annotated datasets. MAMMI utilizes a vision-language model to annotate neuron concepts, eliminating the need for pixel-level masks. Experiments on a model trained on NIH chest X-rays validate the effectiveness of MAMMI, highlighting its potential for enabling transparent clinical decision-making in the medical field.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The concept-based interpretation method is valuable for transparent model decision-making.
    2. Using a vision-language model to annotate concepts is interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The main claimed novelty, using a VL model to annotate concepts without requiring mask annotations, has been widely explored in the computer vision community [1-3]. The paper lacks substantial novel contributions.
    2. The paper focuses on interpretable medical AI, especially concept-based explanations, but fails to compare or discuss concept activation vector-based methods [4-5] in the introduction or related work section.
    3. The experimental baselines are limited. More VL baselines, such as [1-3], should be included.

    [1] Label-Free Concept Bottleneck Models.
    [2] Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification.
    [3] Representing Visual Classification as a Linear Combination of Words.
    [4] Towards Trustable Skin Cancer Diagnosis via Rewriting Model’s Decision.
    [5] An Explainable AI Life Cycle for Iterative Bias Correction of Deep Models.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Although the novelty appears weak, the application of mask-free concept-based explanations for medical AI is valuable in this field. The paper can be accepted if the authors address the three weaknesses. For W1, the authors should emphasize their genuine novelty compared to [1-3]. For W2, the authors should discuss [4-5] in their paper and analyse them. For W3, the authors should choose suitable baselines, such as [1-3], but not necessarily those specific papers. The authors can select baselines they consider appropriate.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I keep my original score.




Author Feedback

We thank all reviewers for their valuable comments and appreciate the constructive feedback. Reviewers recognized that our method is novel and valuable (R4) and important (R5), and that extensive experiments are presented (R6).

[R4] Comparison with Other Studies: We clarify that the main contribution of our method is to interpret an off-the-shelf target model by understanding the concept of each neuron. TSI (Network Dissection) [11] is the most relevant study, which is why we selected TSI as a baseline. CBMs [1, 2] are studied to design interpretable architectures, which need retraining. We appreciate that CBM studies are important, but their objective and scope are different from this study. [3] explains input samples with words. CAV [4, 5] needs additional human annotation for naming each concept vector, which is expensive in the medical domain. CAV also makes it hard to identify the input regions related to a concept. Our method can interpret neurons of medical models without additional training, and these concept-annotated neurons and their activation maps can be used, by looking at important neurons (e.g., via Shapley value in Fig. 3), to explain the decision in terms of ‘where and what’.

[R5, R6] Experimental Settings: TSI used expensive mask-labeled datasets. Since both concept sets include ground-truth labels, it is much easier for TSI (a smaller number of concepts) to match ground-truth concepts. Nevertheless, MAMMI showed better performance in Table 4. Also, with the same concept set, MAMMI achieved better performance than TSI. In Table 5, all methods used the same MIMIC-CXR report dataset as a concept set. Therefore, the comparisons were fair.

[R5] Model and Dataset: Our method can interpret other architectures (it is model-agnostic), which is a merit of the method compared to TSI (only applicable to CNNs due to the feature-map matching with mask annotations). We also compared three CLIP models. We believe that the paper proved the validity of the method and that the Experiments section was appropriate for MICCAI. Also, we selected the CXR task, where VL models (i.e., CLIP) are widely studied and the source for concepts (i.e., reports) is publicly well established. MAMMI can be used for other domains with suitable VL models and concept sets.

[R5, R6] Concept Set Construction/Annotation: We focus on interpreting neurons in deep layers, where neurons represent high-level objects that can be interpreted by nouns. For other layers, other suitable concept sets can be used. To construct concept sets, nouns are automatically extracted from MIMIC-CXR reports with the TextBlob library. We clarify that a concept is a single word, but neurons can have multiple concepts that exceed the threshold θ_concept, which is computed by the formula in Section 3.3. Our results show that 35.2% of the penultimate neurons (361/1024) have multiple concepts in DenseNet.
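The thresholded annotation described in the rebuttal above can be sketched with cosine similarities in a shared embedding space. This is a minimal illustration, not the paper's implementation: the random vectors stand in for CLIP image/text embeddings, the concept names are hypothetical, and θ_concept is a placeholder constant rather than the formula from Section 3.3.

```python
import numpy as np

# Hypothetical setup: 4 neurons, 5 candidate concepts, 8-dim embeddings.
# In MAMMI, image embeddings come from a CLIP-style vision encoder over each
# neuron's representative images, and concept embeddings from the text encoder
# over nouns extracted from reports; random vectors stand in here.
rng = np.random.default_rng(0)
neuron_emb = rng.normal(size=(4, 8))
concept_emb = rng.normal(size=(5, 8))
concepts = ["opacity", "effusion", "cardiomegaly", "pneumothorax", "nodule"]

# Normalize so the dot product is cosine similarity.
neuron_emb /= np.linalg.norm(neuron_emb, axis=1, keepdims=True)
concept_emb /= np.linalg.norm(concept_emb, axis=1, keepdims=True)
sim = neuron_emb @ concept_emb.T  # shape: (neurons, concepts)

# Annotate each neuron with every concept scoring above the threshold,
# so a neuron may carry multiple concepts (theta_concept is a placeholder;
# the paper derives it from a formula in Section 3.3).
theta_concept = 0.3
annotations = [
    [concepts[j] for j in range(len(concepts)) if sim[i, j] > theta_concept]
    for i in range(sim.shape[0])
]
print(annotations)
```

Because the annotation rule keeps every concept above the threshold rather than only the top match, a single neuron can receive zero, one, or several concepts, consistent with the 35.2% multi-concept figure quoted in the rebuttal.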

[R6] Faithfulness of Explanations: We clarify that concept annotation globally interprets a well-trained model [25] and that it is evaluated by mpnet cosine similarity, F1 score, and hit rate, in addition to CLIP cosine similarity. The explanation in Fig. 3 is analyzed on a correct prediction.

[R6] Clinical Impact: Interpreting the concepts of hidden-layer neurons can help clinicians understand the model and the basis of its decisions. This study sheds light on more reliable and interpretable use of DNNs. Note that applying CV-domain methods directly to CXR is not satisfactory (Tables 1-2). This is due to the class-imbalanced dataset and the lack of clinical terms, which shows the importance of this study (adaptive image selection and medical concept set construction are important).

[R5] Questions on Figures: Figure 2(b) shows the effectiveness of adaptive image selection, which is better than without adaptive selection [1, 14] because it considers class imbalance. In Figure 3, the size of the neuron activation map was 7×7, and we upscaled it to 224×224. Our goal is to provide explanations to clinicians on the regions of important neurons based on model interpretation, not to segment ground-truth regions.
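The activation-map upscaling mentioned in this answer can be sketched as follows. The interpolation method is an assumption (the rebuttal does not specify it), so nearest-neighbor replication via `np.kron` is used purely for illustration; the activation values are random stand-ins.

```python
import numpy as np

# Hypothetical 7x7 activation map for one penultimate-layer neuron
# (DenseNet121 on a 224x224 input yields 7x7 feature maps).
rng = np.random.default_rng(1)
act = rng.random((7, 7))

# Upscale by a factor of 32 to the 224x224 input resolution so the map
# can be overlaid on the chest X-ray. np.kron with a block of ones
# replicates each activation value over a 32x32 patch (nearest-neighbor).
scale = 224 // 7  # 32
heatmap = np.kron(act, np.ones((scale, scale)))
print(heatmap.shape)  # (224, 224)
```

This coarse-to-fine replication also explains the reviewer's observation that the overlaid map highlights a larger region than the ground-truth mask: each of the 49 activations covers a 32×32 pixel patch.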




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed recommendations. After carefully reviewing the raised concerns as well as the authors’ responses, I believe this paper presents a useful contribution to the community. The authors are particularly encouraged to address the clarifications noted by the reviewers in their revised version.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers found the rebuttal to be insufficient to address their concerns. Therefore, the decision is to reject at this moment.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces Mask-free Medical Model Interpretation (MAMMI), an interesting method for interpreting deep neural networks in the medical field without needing costly pixel-wise annotated datasets. MAMMI leverages a vision-language model to annotate neuron concepts, bypassing the need for pixel-level masks. Overall, I believe this work addresses an interesting and valid research problem that has not been extensively studied in the medical field. While image-level interpretation based on vision-language models is not completely new in the broad field of multi-modal AI, this work stands as a pioneering effort to demonstrate its potential for medical applications. The paper could be of sufficient interest to the MICCAI audience and bring some insights.



