Abstract

The black-box nature of deep learning models has raised concerns about their interpretability for successful deployment in real-world clinical applications. To address the concerns, eXplainable Artificial Intelligence (XAI) aims to provide clear and understandable explanations of the decision-making process. In the medical domain, concepts such as attributes of lesions or abnormalities serve as key evidence for deriving diagnostic results. Existing concept-based models mainly depend on concepts that appear independently and require fine-grained concept annotations such as bounding boxes. However, a medical image usually contains multiple concepts, and the fine-grained concept annotations are difficult to acquire. In this paper, we aim to interpret representations in deep neural networks by aligning the axes of the latent space with known concepts of interest. We propose a novel Concept-Attention Whitening (CAW) framework for interpretable skin lesion diagnosis. CAW is comprised of a disease diagnosis branch and a concept alignment branch. In the former branch, we train a convolutional neural network (CNN) with an inserted CAW layer to perform skin lesion diagnosis. The CAW layer decorrelates features and aligns image features to conceptual meanings via an orthogonal matrix. In the latter branch, the orthogonal matrix is calculated under the guidance of the concept attention mask. We particularly introduce a weakly-supervised concept mask generator that only leverages coarse concept labels for filtering local regions that are relevant to certain concepts, improving the optimization of the orthogonal matrix. Extensive experiments on two public skin lesion diagnosis datasets demonstrated that CAW not only enhanced interpretability but also maintained a state-of-the-art diagnostic performance.
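
The two operations the abstract attributes to the CAW layer, decorrelating features and then rotating them with an orthogonal matrix, can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: ZCA whitening, the `eps` stabilizer, the toy mixing matrix, and the random orthogonal matrix `Q` (standing in for the learned concept-alignment matrix) are all assumptions.

```python
import numpy as np

def whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (n_samples, d): zero mean, ~identity covariance."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)          # cov is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

def random_orthogonal(d, seed=0):
    """Random orthogonal matrix via QR; a stand-in for the learned matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

# Toy correlated features (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.5],
                [0.0, 0.0, 0.0, 1.0]])
X = rng.standard_normal((500, 4)) @ mix

Z = whiten(X)              # decorrelated features
Q = random_orthogonal(4)   # orthogonal rotation
Z_rot = Z @ Q              # rotated features; each axis could carry one concept
```

Because `Q` is orthogonal, the rotation leaves the whitened covariance unchanged, so the axes can be re-oriented toward concepts without re-introducing correlations.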

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1272_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1272_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hou_ConceptAttention_MICCAI2024,
        author = { Hou, Junlin and Xu, Jilan and Chen, Hao},
        title = { { Concept-Attention Whitening for Interpretable Skin Lesion Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an interpretable algorithm for predicting lesions in medical images by first detecting clinical concepts that are understandable to clinical experts. It first uses a dataset to train an orthogonal concept matrix Q. Q is then used to train a disease classification network. The authors propose a novel CAW layer that uses Q for disease classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes an interpretable method that uses clinical concepts as the reference for classification. Clinicians may use such concepts in clinical practice for disease diagnosis.
    2. The paper creates a method that can exploit datasets with different advantages to improve performance on a specific clinical task.
    3. The paper includes a solid evaluation and comparison with other methods, as well as a complete ablation study that illustrates the contribution of each part.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. No claim explicitly or implicitly indicates to whom the method is interpretable, which leaves the purpose of this work vague.
    2. Clinical concepts are typically partially related to each other. The method uses an orthogonal matrix, which eliminates the relationships between them.
    3. The math of the method is clear, but the purpose of using an orthogonal matrix is unclear.
    4. The dataset section is not clearly written.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. No claim explicitly or implicitly indicates to whom the method is interpretable, which leaves the purpose of this work vague.
    2. Clinical concepts are typically partially related to each other. The method uses an orthogonal matrix Q to represent concepts, which ignores the relationships between concepts.
    3. The purpose of using Q is to create an orthogonal feature space for disease classification. However, another simple solution is to create a heatmap for each concept and concatenate the heatmaps for the final prediction. What is the performance difference between this method and the proposed method?
    4. The terms “disease” and “concept” are not clearly distinguished when the datasets are introduced. As stated in the method section, there should be two datasets, but both datasets introduced seem to be concept datasets, because only “concepts” are mentioned and “diseases” do not appear. Moreover, the number of concepts is not consistent between the dataset introduction and Figure 2 (i.e., for Derm7pt, 2 vs. 12 concepts).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The authors state that the method is interpretable but do not say who the user is or to whom the method is interpretable.
    2. The math of the method is clear, but the purpose of using an orthogonal matrix is unclear. A simpler method with heatmaps seems able to achieve the same purpose.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a concept-attention whitening method for interpretable skin lesion diagnosis. The concept-attention layer is inserted into a standard CNN; it decorrelates features and aligns image features to conceptual meanings via an orthogonal matrix.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. In general, the paper is well written.
    2. The proposed method is validated on two public benchmarks and with extensive ablation studies.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It seems that the method only changes one BatchNorm layer of a typical CNN. Why does such a modification improve model interpretability? It is also not very clear how the concepts are used to make the model interpretable.
    2. From Fig. 1, the concept mask generator is actually a black-box classifier that generates class-activation maps, which are binarized to help the concept alignment. This serves as the main contribution/novelty of the paper. One potential issue is that the masks are derived from a black-box classifier: how do the authors ensure that it always provides accurate guidance for the alignment? It is well recognised that class-activation maps focus only on the most discriminative and coarse features.
    3. In the two datasets, are the concept annotations used for training? Or only for evaluating the concept detection performance?
    4. In Table 1, the results of the CW and CAW methods are consistently inferior to those of the black-box ResNet in terms of classification performance. Can the authors explain the reason for this?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Will the paper's code be made publicly available?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    What does the abbreviation CNN stand for? Please give the full name at first use.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See the weaknesses listed above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ feedback. The authors have addressed some of my concerns, but others remain unresolved. I would like to keep my rating.



Review #3

  • Please describe the contribution of the paper

    This paper deals with an explainable DNN strategy for skin lesion diagnosis. Instead of performing post-hoc explanations on already trained DNNs, as in most XAI strategies, the authors adopt a strategy where the hidden representations of the data are disentangled to represent interpretable concepts, as in [3]. Compared with [3], the main contribution of the paper is the use of a weakly-supervised concept mask generator for filtering local regions that are relevant to certain concepts. Results are shown on the Derm7pt dataset [12], which contains about 1K images annotated with 7 clinical concepts related to melanoma skin lesions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This kind of application is of high interest for the MICCAI community.
    • The paper is well motivated and reads well.
    • The defended key idea intuitively makes sense.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method assessment has good overall scientific value, but remains too superficial in my opinion.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned above, my main concern deals with the results:

    • The most important comparisons in the results section are those comparing the proposed method to [3] in Table 1 and those comparing the method with and without masks in Fig. 2 and Table 2. Although the best results were obtained using the proposed method, the differences between the compared scores are not large. Since these results were obtained on only about 150 observations (15% of the data used for testing when performing the cross-validation), I wonder whether these results are significant. I therefore believe that their impact would be much higher if a K-fold strategy, or more generally a bootstrapping strategy, were adopted. This would make it possible to evaluate the stability of the results and potentially their significance.
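
The bootstrapping strategy suggested here can be sketched as a percentile bootstrap over per-image outcomes. This is only an illustration of the general technique: the outcome list (150 test images, 120 classified correctly), the number of resamples, and the 95% level are made-up assumptions, not values from the paper.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 flags, one per test image.
    Returns (accuracy, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical outcomes for a test split of about 150 images.
correct = [1] * 120 + [0] * 30
acc, lower, upper = bootstrap_accuracy_ci(correct)
print(f"accuracy = {acc:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

If the intervals of two compared methods overlap heavily on a test set of this size, a small gap between their scores is hard to interpret as a real improvement.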
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main idea defended by this paper is interesting, but the method assessment could be more convincing after a more in-depth statistical analysis of the results.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thanks for the comments.

[R1-W1. Statistical analysis] Following previous works, we report the results as mean ± std over three random runs for a fair comparison. In future work, we will add more statistical analysis as suggested.
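
For reference, a mean and standard deviation over three runs can be computed as below. The three scores are placeholders rather than results from the paper, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the rebuttal does not say which variant is reported.

```python
import statistics

def mean_std(values):
    """Return the mean and the sample standard deviation of a list of scores."""
    return statistics.mean(values), statistics.stdev(values)

# Placeholder accuracies from three hypothetical random seeds.
runs = [0.80, 0.82, 0.81]
m, s = mean_std(runs)
print(f"{m:.3f} ± {s:.3f}")
```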

[R3-W1. Purpose of this work] Our goal is to interpret representations in DNNs by aligning the axes of the latent space with known concepts of interest. For example, concepts such as “blue whitish veil (BWV)” and “atypical pigment network (PN_ATP)” are important for diagnosing melanoma skin disease. Given the global image feature F=[f1,f2,…,fd] of dimension d, we aim to assign the first activation value f1 with concept “BWV”, f2 with “PN_ATP”, etc. In this way, we can identify the concepts that significantly contribute to the disease diagnosis, making the decision-making process interpretable.
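
A minimal sketch of how such an axis-to-concept assignment could be read at inference time is given below. The axis ordering, the feature values, and the helper `top_concepts` are hypothetical and only illustrate the idea that the first activations are tied to named concepts.

```python
# Assumed assignment: axis 0 -> "BWV", axis 1 -> "PN_ATP" (ordering is hypothetical).
CONCEPT_AXES = ["BWV", "PN_ATP"]

def top_concepts(feature, concept_axes, k=1):
    """Return the k (concept, activation) pairs with the largest activations.

    Only the first len(concept_axes) entries of `feature` are concept-aligned;
    the remaining entries are treated as unassigned dimensions.
    """
    scored = [(name, feature[i]) for i, name in enumerate(concept_axes)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up global image feature F = [f1, f2, ..., fd] with d = 4.
feature = [0.2, 1.7, -0.4, 0.9]
print(top_concepts(feature, CONCEPT_AXES))
```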

[R3-W2. Relationship of concepts] As our goal is to build an exact biunique association between a hidden value of the image feature and a predefined concept, we apply orthogonalization to ensure the disentanglement of concepts. In future work, we plan to incorporate the inherent relationship of concepts by utilizing a nearly orthogonal matrix.

[R3-W3. Solution with heatmaps] Thanks for your suggestions. We will add this comparison in our revised paper.

[R3-W4. Dataset] In our work, the term “disease” refers to skin diseases such as “melanoma”, while “concept” denotes high-level attributes of lesions such as “streaks”. In Derm7pt, we consider 2 diseases (nevus and melanoma) and 12 concepts from the 7-point checklist. These concept categories include pigment network (typical/atypical), blue whitish veil, vascular structures (regular/irregular), pigmentation (REG/IR), streaks (REG/IR), dots and globules (REG/IR), and regression structures. In SkinCon, there are 3 diseases (malignant, benign, and non-neoplastic) and 48 concepts, such as plaque, scale, and erosion. We select 22 concepts with at least 50 representative images. The number of concepts is consistent between the dataset introduction and Fig. 2. We will clarify this.
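
The concept selection rule described above (keep concepts with at least 50 images) can be sketched as a simple count-and-filter step. The annotation format, the image identifiers, and the function name `select_concepts` are assumptions for illustration; the toy example uses a threshold of 2 so that it stays small.

```python
from collections import Counter

def select_concepts(image_concepts, min_images=50):
    """Keep concepts that are annotated on at least `min_images` images.

    `image_concepts` maps an image id to the list of concept labels on it.
    """
    counts = Counter(
        concept
        for concepts in image_concepts.values()
        for concept in set(concepts)  # count each concept once per image
    )
    return sorted(c for c, n in counts.items() if n >= min_images)

# Toy annotations (hypothetical image ids; concept names taken from the text).
annotations = {
    "img1": ["plaque", "scale"],
    "img2": ["plaque"],
    "img3": ["erosion", "plaque"],
    "img4": ["scale"],
}
selected = select_concepts(annotations, min_images=2)
print(selected)
```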

[R4-W1. Model interpretability] By replacing the BatchNorm layer with our CAW layer, we not only conduct feature normalization but also enhance the feature interpretability. Please refer to the response to [R3-W1] for a more detailed explanation.

[R4-W2. Concept mask] The optimal solution involves utilizing ground-truth concept masks to facilitate concept alignment. However, existing datasets lack such fine-grained annotations. To address this limitation, we employ a weakly-supervised learning approach to generate pseudo concept masks. Table 2 demonstrates the superiority of our concept masks in both disease diagnosis and concept detection compared to other alternatives in such a weakly supervised setting.
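
One common way to turn an activation map from a concept classifier into a pseudo mask is min-max normalization followed by thresholding, as sketched below. This follows the reviewer's description of binarized class-activation maps; the threshold of 0.5, the toy activation values, and the function name are assumptions, and the paper's actual mask generator may differ.

```python
def binarize_cam(cam, threshold=0.5):
    """Min-max normalize a 2D activation map and threshold it into a 0/1 mask."""
    flat = [v for row in cam for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        # A constant map carries no localization signal; return an empty mask.
        return [[0 for _ in row] for row in cam]
    return [[1 if (v - low) / span >= threshold else 0 for v in row] for row in cam]

# Made-up 3x3 activation map for a single concept.
cam = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.7],
    [0.1, 0.4, 0.2],
]
mask = binarize_cam(cam)
for row in mask:
    print(row)
```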

[R4-W3. Concept annotation] Our approach uses concept annotations for training in two aspects: 1) We use concept annotations to pre-train a concept classification network, which is then employed for generating concept masks. 2) At the disease classification model training stage, we rely on the concept dataset to conduct concept alignment, which is constructed by grouping the images with the same concept label.

[R4-W4. Classification performance] Table 1 shows that all compared XAI methods consistently exhibit results inferior to those of the black-box ResNet. The empirical finding on the accuracy-interpretability trade-off [1,2] suggests that more complex models, such as DNNs, sacrifice interpretability for higher accuracy, while interpretable models often exhibit inferior performance. Remarkably, our CAW method achieves comparable performance and even surpasses the black-box ResNet on Derm7pt (ACC, F1) and SkinCon (ACC), highlighting our model’s ability to improve interpretability while maintaining accuracy. [1] DARPA’s explainable artificial intelligence (XAI) program, 2019. [2] Explainable artificial intelligence: a comprehensive review, 2022.

[R4-9. Code] We will release the code.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a concept-attention whitening method for interpretable skin lesion diagnosis. The paper received borderline reviews (1 weak accept and 2 weak rejects), but only one reviewer participated in the post-rebuttal evaluation. The reviews suggest that the paper is well written. However, there are concerns about a weakness of the method, namely performance degradation. Although it is challenging to achieve both accuracy and interpretability, the paper should discuss in more depth the strengths of the proposed method compared with post-hoc interpretation methods, which do not suffer from performance degradation.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received borderline reviews, and unfortunately two reviewers did not engage in the rebuttal phase. After reviewing their comments and the authors’ responses, I am recommending acceptance, since all reviewers mentioned that the contribution is interesting and important, and the rebuttal seems reasonable to me. I strongly suggest that the authors incorporate answers to the reviewers’ questions in the revised version whenever possible.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I notice that two of the reviewers have not looked at the rebuttal or have chosen not to respond. The paper addresses a still-emerging area of explainable AI and as such should be included in the accepted papers. The basic idea of gravitating from features to concepts is important, and most methods have tried to use a decision tree to explain the choices.



