Abstract

Causal generative modelling is gaining interest in medical imaging due to its ability to answer interventional and counterfactual queries. Most work focuses on generating counterfactual images that look plausible, using auxiliary classifiers to enforce effectiveness of simulated interventions. We investigate pitfalls in this approach, discovering the issue of attribute amplification, where unrelated attributes are spuriously affected during interventions, leading to biases across protected characteristics and disease status. We show that attribute amplification is caused by the use of hard labels in the counterfactual training process and propose soft counterfactual fine-tuning to mitigate this issue. Our method substantially reduces the amplification effect while maintaining effectiveness of generated images, demonstrated on a large chest X-ray dataset. Our work makes an important advancement towards more faithful and unbiased causal modelling in medical imaging. Code available at https://github.com/biomedia-mira/attribute-amplification.
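The core idea, using soft classifier outputs instead of one-hot labels as targets for unintervened attributes during counterfactual fine-tuning, can be illustrated with a minimal sketch. This is a hypothetical toy illustration, not the authors' implementation; the probability values are made up for demonstration.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Classifier output for the "disease" attribute on a generated counterfactual
# where only "sex" was intervened on (illustrative values).
pred_on_cf = [0.55, 0.45]   # [P(healthy), P(diseased)]

# Hard-label CFT: the unintervened attribute is pushed towards its one-hot
# label, which amplifies it (a mildly diseased patient drifts towards
# "very diseased").
hard_target = [0.0, 1.0]

# Soft CFT: the target is the classifier's soft prediction on the *original*
# image, so the counterfactual only needs to preserve the attribute's
# strength, not saturate it.
soft_target = [0.4, 0.6]    # classifier's output on the real image

loss_hard = cross_entropy(hard_target, pred_on_cf)
loss_soft = cross_entropy(soft_target, pred_on_cf)

# The hard-label loss keeps pulling the unintervened attribute towards its
# extreme even when the counterfactual already matches the original image.
assert loss_hard > loss_soft
```

Under this sketch, the hard-label objective continues to penalise the model even when the counterfactual faithfully preserves the original attribute strength, which is the amplification mechanism the paper identifies.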

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1002_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1002_supp.pdf

Link to the Code Repository

https://github.com/biomedia-mira/attribute-amplification

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xia_Mitigating_MICCAI2024,
        author = { Xia, Tian and Roschewitz, Mélanie and De Sousa Ribeiro, Fabio and Jones, Charles and Glocker, Ben},
        title = { { Mitigating attribute amplification in counterfactual image generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work analyses the effect of utilizing hard labels in counterfactual fine-tuning (CFT). The authors empirically show that hard labels cause attribute amplification, an undesired effect. To reduce attribute amplification, they propose using soft labels obtained from a predictor network for CFT. They evaluate this on the MIMIC-CXR dataset in three scenarios: (1) AUC of predictor models trained on real data and tested on CFs, (2) AUC of predictor models trained on CFs and tested on real data, (3) visualizing the distribution shift using PCA mode 3.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well-written and easy to follow. Specifically, the introduction is enjoyable to read.

    • Counterfactual image generation is a well-motivated problem.

    • There are many discussions and analyses regarding the multi-attribute effects.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There is not much methodological novelty. This work is an extension of a previous work [3], where hard labels are utilized for counterfactual fine-tuning. Here, soft labels obtained from a prediction model are employed instead.

    • How is it verified that the prediction models are able to correctly predict the attributes given only a 2D chest X-ray? Has there been any verification by physicians to confirm whether the models are actually reliable for evaluating CF images?

    • There is a discussion on page 4, regarding why attribute amplification is undesirable. However, I do not find this convincing. The example provided in the paper is where changing the gender of a patient from male to female increases the healthiness aspect of the patient. However, there may be some unknown causes (in the genome, for example) that are not explicitly modeled by the causal graph, but may actually be a cause. Is it really desirable to suppress such correlations by enforcing the model to explicitly follow the causal graph?

    • In [3], the results for the age attribute are also provided. Is there a reason why they were omitted in this work? Also, the MAE is not computed.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The color coding of the results provided in the tables makes it a bit difficult to easily understand the results. Blue denotes both “intervened” and “effectiveness”, while red corresponds to both “unintervened” and “amplification”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides interesting insights and analysis in counterfactual image generation for chest x-ray. However, my main concerns are the methodological novelty and the evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I thank the authors for the feedback. The rebuttal addresses my concerns; therefore, I increase my rating. It would be great if the authors add the responses as part of the discussion to the final version, if the paper is accepted.



Review #2

  • Please describe the contribution of the paper

    The paper addresses the issue associated with attribute amplification during counterfactual image synthesis. The main cause of attribute amplification, where spurious attributes are affected during intervention, is the use of hard labels during training, and the authors propose to mitigate this with soft counterfactual fine-tuning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and easy to follow. Additionally, the authors address an important concern: spurious attributes can be affected during counterfactual synthesis.
    2. The proposed method shows significant improvement over the existing method DSCM. Fig. 1, with intervention do(disease=Pleural Effusion), clearly shows this effectiveness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors mention ‘when intervening on biological sex for a healthy male patient, a generated female counterfactual may appear healthier than the real image.’ What quantifies “healthier”? Is this via visual inspection or the logit value from the classifier?
    2. On page 4, the authors mention ‘an assumed causal graph’. This should be included as part of the main manuscript, as it is important for understanding the underlying assumptions.
    3. The discussion on using counterfactual images for downstream tasks needs to be detailed (or could be included in the appendix). Is this new predictor trained only on the synthetic samples? Also, how many samples were used, and how does this number compare to the real training samples used in Table 1?
    4. Given the shortcomings of the counterfactual approach, the absence of ground truth makes validation really difficult. How do the authors plan to justify that the direct effect obtained after intervention on attributes such as sex and race is correct?
    5. In Figure 1, the intervention made on sex seems to change the identity of the person, i.e., the image appears to have changed significantly. Is this behaviour common? I would recommend the authors add some more samples in the appendix, including some failure cases. Additionally, it would be useful to check the null case, i.e. do(no intervention).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to my comments above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    NA

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for clarifications.



Review #3

  • Please describe the contribution of the paper

    The authors address the issue of accidental attribute amplification in counterfactual (medical) image generation. The issue is the following: when generating counterfactual images - say, generating a ‘male version’ of a female chest x-ray - a previously proposed method based on deep structural causal modeling tended to spuriously also amplify the expression of other attributes than the one being intervened upon. In the example, the model might, besides making the patient male as intended, also make the patient “really sick”, if they were just a little sick in the original image. In this contribution, the authors trace this back to an issue with the training of the (HVAE component of) the structural causal model. They propose a better training method (‘soft counterfactual fine-tuning’) and demonstrate conclusively that it indeed fixes the problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses the very important topic of causal medical image modeling. While the authors focus on counterfactual image generation here, the issue is deeper and relates to the general quest for teasing apart different causes and (health) effects in medical image analysis. The method proposed by the authors appears very plausible, and their experiments demonstrate quite conclusively that it works very well. (These are some of the most convincing counterfactual chest x-ray images I have seen to date.) The paper is very well-written and well-structured, and all arguments are made very clearly despite the highly technical nature of the issue. This was a pleasure to read!

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this piece is that it is quite incremental in nature: it iterates / improves upon one very specific (and rather small) aspect of one very specific previously proposed method. With that said, I do consider this an important improvement of the method, and the experiments and general exposition are excellent.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. I was initially a little unconvinced by the argument that attribute amplification is an actual problem (section 3). I could find no details about the attribute predictor used for Table 1 (Which model, trained on which dataset?), and the fact that some attribute classifier that might suffer from strong dataset-specific shortcut learning / non-(adversarial) robustness does not work on the generated counterfactuals anymore is not immediately concerning to me: the generated images might still be very realistic. With that said, the illustrations in Fig. 1 did convince me that there is an actual problem here, and I would recommend referring to this figure a bit earlier.

    2. The fact that attributes are intermingled in the latent space of the VAE - and thus tend to be changed in tandem - is not very surprising, is it? This is a known issue, see e.g. Klys et al., Yang et al., Weng et al.: https://arxiv.org/abs/1812.06190 https://arxiv.org/abs/2211.15231 https://arxiv.org/abs/2312.14223

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-written, technically excellent paper on an important topic of current interest. As I noted above, the only downside is that it is a bit incremental in nature.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    In light of the other reviews and the authors’ rebuttal, I stand by my initial judgment: this is an excellent, if incremental, contribution on a topic of strong current interest.




Author Feedback

We are grateful for the comments, highlighting that our paper is “well written and motivated” (R3,R4,R5), addresses a “very important concern/topic” (R4,R5), showing the “most convincing counterfactuals (CFs)” (R5) with “significant/important improvements” (R4,R5), and “excellent experiments and exposition” (R5).

[Novelty/contribution, R3,R5] A key contribution of our work is the identification of the attribute amplification problem. While spurious correlations are widely studied in the literature, we are not aware of any works that have previously discussed the amplification problem in counterfactual image synthesis. It was the discovery of the problem that allowed us to devise a simple yet effective solution. As such, we would encourage reviewers to consider the novelty and contribution of this paper as a combination of problem discovery, analysis, and effective mitigation of attribute amplification.

[Why attribute amplification is undesirable, R3] Attribute amplification creates spurious correlations between attributes that were assumed to be independent and hence violates the causal assumptions of the data-generating process. The reviewer is right that there can be unknown factors and confounders, in which case the assumed graph might be wrong for the application at hand. Defining an appropriate graph is the fundamental problem of causal modelling. But once a graph is specified for a given problem (typically taking domain expertise and prior knowledge into account), the generative model should obey these assumptions, and therefore attribute amplification is highly undesirable.

[Details about attribute predictors, R3,R5] The attribute predictors (ResNet34) used for assessing CFs are trained on real data with ground truth labels in a supervised fashion. Their reliability is confirmed in terms of high prediction accuracy on real test data (see Table 1). As suggested by R5, we will refer to Fig. 1 earlier to highlight the importance of attribute amplification. We will add details.

[Age attribute, R3] Attribute amplification concerns variables that are assumed to be independent, however, age and disease are (causally) related. Therefore we do not consider age in this work.

[Measuring amplification, R4] A change in attribute, such as being “healthier” after intervention on sex, is quantified by an increase in the attribute prediction accuracy for disease (Table 1) together with a visual distribution shift in PCA embeddings (Fig. A3). Inspecting logits, as suggested, is indeed another possibility that we have explored (but not included due to space limitations).

[CFs in downstream tasks, R4] For Table 2, the attribute predictors are trained with CFs only. We randomly generate one CF per subject per attribute. We then use the same number of CFs for training as for the predictors trained on real images. We will clarify this.

[CF validation, R4] The reviewer is right that evaluating CFs is challenging and an open research problem. We opt to use attribute predictors to assess the quality, but visual assessment by experts is part of future work. If CFs are only used for training (e.g., data augmentation), their value may be assessed by the downstream task performance.

[Identity preservation, R4] This is a common challenge in CF synthesis. We argue that our CFs are of high quality (see also comment from R5), preserving many subtle details and anatomical features (ribs, clavicles) after intervention. But some interventions, such as sex, will naturally affect the perceived patient identity. Additional visuals are in the supplement, and we will add more in our GitHub repository.

[Adding assumed graph, R4] We agree it is helpful to have the causal graph on page 4. We will try to move it to the main text.

[Shortcut learning papers, R5] Thanks, will add those. We agree that spurious correlations are well studied in shortcut learning literature. However, the link to attribute amplification in CF synthesis was not previously discussed.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    High quality paper that discovers, investigates, and mitigates a “new” problem in causal generative models. I would encourage the authors to incorporate as much of the feedback from the reviewers if possible when preparing the final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    High quality paper that discovers, investigates, and mitigates a “new” problem in causal generative models. I would encourage the authors to incorporate as much of the feedback from the reviewers if possible when preparing the final version.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


