Abstract

Counterfactual generation is used to address the lack of interpretability and the shortage of data in deep diagnostic models. By synthesizing counterfactual images with an image-to-image generation model trained on unpaired data, we can interpret the output of a classification model with respect to a hypothetical class and enhance the training dataset. Recent counterfactual generation approaches based on autoencoders or generative adversarial models are difficult to train or have difficulty producing realistic images because of the trade-off between image similarity and class difference. In this paper, we propose a new counterfactual generation method based on diffusion models. Our method combines the class-condition control from classifier-free guidance and the reference-image control with attention injection to transform input images with unknown labels into a hypothesis class. Our method can flexibly adjust the generation trade-off in the inference stage instead of the training stage, providing controllable visual explanations consistent with medical knowledge for clinicians. We demonstrate the effectiveness of our method on the ADNI structural MRI dataset for Alzheimer’s disease diagnosis and conditional 3D image2image generation tasks. Our code can be found at https://github.com/ladderlab-xjtu/ControlCG.
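The class-condition control mentioned in the abstract builds on classifier-free guidance. As a rough illustration only (not the authors' implementation), the sketch below shows one common convention for blending the conditional and unconditional noise predictions of a denoiser with a guidance scale w at inference time. The function name and the toy arrays are hypothetical, and the paper's exact formulation, including the attention-injection step, is not reproduced here.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, w):
    """Blend two noise predictions with classifier-free guidance.

    Assumed convention: eps = eps_uncond + w * (eps_cond - eps_uncond).
    Other papers express the same idea with a shifted scale.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for a denoiser's two outputs (hypothetical values).
eps_uncond = np.array([0.0, 1.0, 2.0])
eps_cond = np.array([1.0, 1.0, 0.0])

# w = 0 ignores the class, w = 1 recovers the conditional prediction,
# and larger w pushes further toward the target class.
for w in (0.0, 1.0, 3.0):
    print(w, guided_noise(eps_uncond, eps_cond, w))
```

Because w enters only at sampling time in this sketch, it can be changed without retraining, which is the kind of inference-stage control the abstract describes.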

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1911_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/ladderlab-xjtu/ControlCG

Link to the Dataset(s)

https://adni.loni.usc.edu/

BibTex

@InProceedings{Liu_Controllable_MICCAI2024,
        author = { Liu, Shiyu and Wang, Fan and Ren, Zehua and Lian, Chunfeng and Ma, Jianhua},
        title = { { Controllable Counterfactual Generation for Interpretable Medical Image Classification } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    A method for interpretable deep image classification is proposed, based on generating counterfactual explanatory images.

    The method is tested on 3D brain MRI from the Alzheimer’s Disease Neuroimaging Initiative (ADNI)

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Interpretation of classification is an important and relevant task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty is low: a grab-bag of methods (diffusion models, classifier-free guidance, attention injection) is combined with no particular justification. The paper states “We only modify the inference formulation” relative to previous methods.

    Ultimately, the results of the method are unconvincing. Specifically:

    • The comparisons in Tables 1 and 2 show no clear advantage over other methods.

    • For Figure 3, the authors state that the visual explanations are consistent with anatomical knowledge; this is simply not true. The hallmark of AD in 3D brain anatomy is atrophy around the hippocampus, and none is shown here.

    For clear visual examples of AD features in 3D brain MRI, specifically hippocampal atrophy, see examples [a] Toews, M., et al. (2010). Feature-based morphometry: Discovering group-related anatomical patterns. NeuroImage, 49(3), 2318-2327. [b] Chen, Yifan, et al. “Evaluating the association between brain atrophy, hypometabolism, and cognitive decline in Alzheimer’s disease: A PET/MRI study.” Aging (Albany NY) 13.5 (2021): 7228.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See above comments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Results in Figure 3 are not consistent with the known literature on AD in the human brain, contrary to the authors’ statements.
    • Low novelty and insufficient justification of the methodology
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors developed a new method based on diffusion models to generate visual counterfactual explanations for classifying Alzheimer’s disease vs. cognitively normal subjects from 3D MRI images in the ADNI dataset. They evaluated (1) the quality of the counterfactual images in terms of image similarity, class difference, and realism using quantitative metrics (functionally grounded evaluation) and (2) the impact on classification performance of data augmentation with the synthetic images. Compared with alternative counterfactual generation methods, they observed that the proposed counterfactuals achieved (1) a better trade-off among image similarity, class difference, and realism and (2) a larger gain in classification performance when used for data augmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors proposed a novel algorithm to generate counterfactual images to address the lack of interpretability, which is crucial in high-stakes scenarios such as medical imaging. The algorithm is based on diffusion models, an approach that is still relatively unexplored and at an early stage of research for counterfactual generation. https://doi.org/10.1016/j.inffus.2024.102301

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    MAJOR:
    1. The authors need to clarify the data splitting procedure, in particular the level at which they selected the first-visit image. They reported that they used the first-visit images of the ADNI1 dataset for training and validation, and ADNI2 for testing. As ADNI is a longitudinal study, there are multiple images of the same subjects in both “ADNI1” and “ADNI2”. If the authors selected the first-visit image separately on ADNI1 and ADNI2, there might be different images of the same subjects in the training and test sets, and the reported classification performance might be highly biased due to possible data leakage.
    2. The authors did not perform k-fold cross-validation, which in this scenario (a medical dataset of limited size) might be crucial to assess the model's generalization capabilities.
    3. The authors did not mention any limitations or future developments of the work.
    4. The authors did not provide strong evidence for the claimed alignment of the explanations with medical knowledge (qualitative evaluation on a randomly picked sample; no details on the expertise, years of experience, etc. of whoever performed this assessment).
    MINOR:
    1. Several acronym definitions are missing (e.g. DDPM, SSIM, MSE, FID).
    2. Fig. 4 is missing (but this might be a typo; according to the context, the text should refer to Fig. 3).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    PRO: The authors employed a public dataset and provided details on the implemented model and the optimization. CON: The final number of images in the training/validation/test sets is not reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors limited the evaluation of the counterfactuals to a few functionally grounded explanation metrics (SSIM, MSE, FID). However, the evaluation of explainability is a multi-domain and multidisciplinary matter. Evaluating the explainability of the system using a systemic framework might be fundamental to fostering XAI applications in high-stakes scenarios. Future research might support the evaluation of the counterfactuals with a user study.

    https://doi.org/10.1016/j.inffus.2024.102301 https://doi.org/10.1145/3583558

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Methodology: The authors' description of the dataset splitting procedure suggests possible data leakage, and they did not perform k-fold cross-validation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My concerns about data splitting procedure have been clarified.



Review #3

  • Please describe the contribution of the paper

    This work proposes a new counterfactual generation method based on diffusion models for interpretable medical image classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work can flexibly adjust the generation trade-off in the inference stage instead of the training stage, providing controllable visual explanations consistent with medical knowledge for clinicians.
    2. The authors demonstrate the effectiveness of the method on the ADNI structural MRI dataset for Alzheimer’s disease diagnosis and conditional 3D image2image generation tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The number of subjects used from the ADNI dataset is unclear in the Dataset and Preprocessing section.
    2. There is no hyperparameter sensitivity analysis.
    3. The quality of Figures 2 and 3 is low. It is hard to see the difference between your method and the baselines.
    4. The classification results should also report sensitivity and specificity.
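For reference, the two metrics requested in item 4 can be computed from binary predictions as in the minimal sketch below. The label encoding (1 for AD, 0 for CN) and the example predictions are hypothetical and are not results from the paper.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical labels: 1 = AD, 0 = CN.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```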
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    see above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The figures should be of higher resolution. A hyperparameter sensitivity analysis is needed to demonstrate robustness.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank the reviewers for their time and effort in reviewing our manuscript. We appreciate the constructive feedback which has significantly contributed to improving our work. Below, we have addressed each comment in detail.

[R1]-Low novelty, justification of methodology: We acknowledge Reviewer 1’s concern regarding novelty and have clarified our methodology. We revised the statement “We only modify the inference formulation” to “We control the generation results in the inference stage by tuning the guidance scale instead of hyperparameter tuning for weights of the regularization terms in the loss function during the training stage.” This change is intended to emphasize our focus on controllability rather than simplicity, consistent with our title “Controllable Counterfactual Generation for Interpretable Medical Image Classification”. Additionally, we expanded the Introduction and Conclusion sections to better articulate our justification, emphasizing our method’s ability to handle the trade-off between image similarity and class difference and to manage both global and local structural changes in counterfactual generation, which other methods struggle with. These revisions are reflected in the updated manuscript.

[R1]-Unconvincing results in Tables 1 and 2: We have clarified the purpose of Tables 1 and 2. Table 2 demonstrates that only our method achieves a promising trade-off between image similarity and class difference. We will mark the worst metrics for each method with strikethrough and red color to emphasize that these methods fail to achieve a balance. In Table 1, we demonstrate the controllability of our method by adjusting the guidance scale (w). For example, w=6.0 yields the best image similarity, while w=4.0 is optimal for class difference.

[R1,R4,R5]-Inconsistent results with anatomical knowledge in Figure 3 and low quality in Figures 2 and 3: We understand and appreciate the reviewers’ concern regarding the reliability of our results. We realize that this may be due to our inappropriate presentation of the figures. We have reviewed the examples provided in the paper by Toews et al. (2010) and observe consistent anatomical features in our results, including the anterior poles of the enlarged lateral ventricles, regions of atrophied white matter, and enlarged temporal poles of the lateral ventricles. We will replace Figures 2 and 3 with higher-resolution versions to ensure these features are clearly visible. The new Figure 3 will include 4 rows, for the original images and the results from our method and two competing methods, and 4 columns for significant AD-related features. Figure 2 will have 3 rows for guidance scales of 1, 3, and 5, and 4 columns for the same features.

[R4,R5]-Data splitting procedure: We acknowledge the concern regarding dataset splitting and have clarified this in the manuscript. After removing subjects appearing in both ADNI-1 and ADNI-2 from ADNI-2, there are 200 AD and 231 CN subjects in ADNI-1, and 159 AD and 205 CN in ADNI-2. This revision ensures there is no data leakage.
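The subject-level filtering described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the helper name and the subject identifiers are hypothetical.

```python
def drop_overlapping_subjects(train_ids, test_ids):
    """Return the test subject IDs that do not also occur in the training cohort."""
    seen = set(train_ids)
    return [sid for sid in test_ids if sid not in seen]

# Hypothetical subject identifiers, for illustration only.
adni1_subjects = ["s001", "s002", "s003"]
adni2_subjects = ["s003", "s004", "s005"]

adni2_clean = drop_overlapping_subjects(adni1_subjects, adni2_subjects)

# After filtering, no subject contributes images to both cohorts.
assert not set(adni1_subjects) & set(adni2_clean)
print(adni2_clean)  # ['s004', 's005']
```

Filtering on subject identifiers rather than image identifiers is what prevents different visits of the same person from landing on both sides of the split.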

[R4]-Classification results: We agree that classification experiments on small datasets should be conducted under k-fold cross-validation. Our method focuses on conditional image-to-image generation and does not modify the DenseNet3D backbone for classification. Previous works with DenseNet3D on the ADNI dataset support its generalization and robustness. Although we cannot provide new experimental results due to rebuttal guidelines, we will offer a subject-level k-fold cross-validation in future experiments.
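One way to realize a subject-level k-fold split is sketched below in plain Python. The function name, fold count, and toy data are hypothetical, and nothing here reflects experiments from the paper. scikit-learn's `GroupKFold` offers comparable grouped splitting.

```python
import random

def subject_level_folds(scan_subjects, k, seed=0):
    """Split scan indices into k folds so all scans of a subject share one fold.

    scan_subjects[i] is the subject ID of scan i.
    """
    subjects = sorted(set(scan_subjects))
    random.Random(seed).shuffle(subjects)
    # Assign shuffled subjects to folds in round-robin order.
    fold_of = {sid: i % k for i, sid in enumerate(subjects)}
    folds = [[] for _ in range(k)]
    for idx, sid in enumerate(scan_subjects):
        folds[fold_of[sid]].append(idx)
    return folds

# Hypothetical data: six scans from four subjects.
scan_subjects = ["a", "a", "b", "c", "c", "d"]
folds = subject_level_folds(scan_subjects, k=2)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    train_subjects = {scan_subjects[i] for i in train_idx}
    # No held-out scan belongs to a subject seen during training.
    assert all(scan_subjects[i] not in train_subjects for i in test_idx)
```

Splitting by subject rather than by scan matters mainly when a subject has several scans. With one first-visit image per subject the two coincide, but grouping by subject stays safe if more visits are added later.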

[R5]-Hyperparameter sensitivity analysis: We appreciate the concern regarding hyperparameter sensitivity analysis. We have conducted an ablation study comprising a qualitative comparison of generation performance in Figure 2, a quantitative comparison in Table 1, and a downstream data augmentation task in Table 3. We will include this classification hyperparameter sensitivity analysis in an appendix of the updated manuscript.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the authors provided feedback, one reviewer increased their score. However, despite this improvement, there is still a recommendation for rejection, and the overall score remains low. Therefore, I suggest rejection.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After the authors provided feedback, one reviewer increased their score. However, despite this improvement, there is still a recommendation for rejection, and the overall score remains low. Therefore, I suggest rejection.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There are mixed reviews (1 reject, 1 weak reject->accept, 1 weak accept). Two reviewers did not participate in the post-rebuttal evaluation. The reviewer recommending rejection does not provide strong justification for the decision (information supporting the low-novelty comments is missing). The authors have addressed the major concerns in the rebuttal. This paper introduces a novel algorithm to generate counterfactual images to improve interpretability. It would be valuable to discuss the paper at MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    There are mixed reviews (1 reject, 1 weak reject->accept, 1 weak accept). Two reviewers did not participate in the post-rebuttal evaluation. The reviewer recommending rejection does not provide strong justification for the decision (information supporting the low-novelty comments is missing). The authors have addressed the major concerns in the rebuttal. This paper introduces a novel algorithm to generate counterfactual images to improve interpretability. It would be valuable to discuss the paper at MICCAI.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I believe the authors did a good job in the rebuttal and the paper deserves to be accepted at MICCAI due to the novel methodological approach.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    I believe the authors did a good job in the rebuttal and the paper deserves to be accepted at MICCAI due to the novel methodological approach.


