Abstract

Counterfactuals in medical imaging are synthetic representations of how an individual’s medical image might appear under alternative, typically unobservable conditions, which have the potential to address data limitations and enhance interpretability. However, counterfactual images, which can be generated by causal generative models (CGMs), are inherently hypothetical, raising the question of how to properly validate that they are realistic and accurately reflect the intended modifications. A common approach for quantitatively evaluating CGM-generated counterfactuals involves using a discriminative model as a ‘pseudo-oracle’ to assess whether interventions on specific variables are effective. However, this method is not well-suited for in-depth error identification and analysis of CGMs. To address this limitation, we propose to leverage synthetic, ‘ground truth’ counterfactual datasets as a novel approach for debugging and evaluating CGMs. These synthetic datasets enable the computation of global performance metrics and precise localization of CGM failure modes. To further quantify failures, we introduce a novel metric, the Triangulation of Effectiveness and Amplification (TEA), which quantifies both the effectiveness of target variable interventions and the additional amplification of unintended effects. We test and validate our evaluation framework on two state-of-the-art CGMs; the results demonstrate the utility of synthetic datasets for identifying failure modes of CGMs and highlight the potential of the proposed TEA metric as a robust tool for evaluating their performance. Code and data are available at https://github.com/ucalgary-miplab/TEA.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2090_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ucalgary-miplab/TEA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{StaEmm_Synthetic_MICCAI2025,
        author = { Stanley, Emma A. M. and Vigneshwaran, Vibujithan and Ohara, Erik Y. and Vamosi, Finn G. and Forkert, Nils D. and Wilms, Matthias},
        title = { { Synthetic Ground Truth Counterfactuals for Comprehensive Evaluation of Causal Generative Models in Medical Imaging } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},
        pages = {546 -- 555}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • Aims to tackle the problem of a lack of ground truth with which to evaluate counterfactual generations
    • Generates ‘ground truth’ datasets against which to compare counterfactuals generated by causal generative models
    • Devises a new metric to quantify the effectiveness of interventions on desired variables and the amplification of variables that should remain constant
    • Shows that pseudo-oracle evaluations can be misleading
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Motivates the need for this work well
    • Practical use of the SimBA tool
    • Good diagram in Fig 1 (although I have some confusion about the amplification quantity)
    • Good description of experimental setup
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • How do we guarantee that the SimBA-generated ‘ground truth’ counterfactuals do not suffer from the same issues? It would be good to have more details on how these are generated.
    • Notation in the equation for A: should it be just E^2 rather than (E||z||_2)^2?
    • What is the intuition behind the equation for A? It is unclear why the diagram in Fig 1 is necessarily in 2D.
    • Does E not capture amplification too? Since it measures how close the generated and ground truth counterfactuals are, any amplification of unintended effects should be captured by this.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The problem of ground truth counterfactuals is indeed an important one but I do not believe this work helps with it convincingly. It feels like a repurposing of an existing tool, without enough methodological detail. Moreover, the “novel metric” is quite simple and I am not sure that the interpretation of it is correct.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a method of validating causal generative models in medical imaging, using a pseudo-oracle method based on SimBA. SimBA (Simulated Bias in Artificial Medical Images) produces synthetic brain examples based on a subject, disease effects, and bias effects. Using pairs of biased and unbiased images generated from SimBA, they train two causal generative models (MACAW and HVAE) to produce the causally deformed counterfactual images from a baseline image. They then propose a new metric, “Triangulation of Effectiveness and Amplification” (TEA), to evaluate the quality of the generated images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall, this paper attempts to tackle a difficult problem in medical image analysis. The proposed evaluation scheme, TEA, is useful for determining whether the model is making the correct intervention and whether it is amplifying spurious attributes. This provides a simple mechanism for evaluating a CGM beyond an MSE. The fact that MACAW produced better counterfactuals according to the pseudo-oracle but worse according to the TEA is particularly compelling.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major: The quality of the results is dependent on the quality of the SimBA-generated images, and only deformations available in SimBA can be validated. As the proposed metric only applies when a ground truth or pseudo ground truth counterfactual image is available, most situations will not be able to use it.

    Minor: The utility of the TEA is not fully explored. Examples of known spurious attribute amplification would better justify the results.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors attempt to tackle an interesting and relevant problem in validating causally generated images. However, in order to calculate the metric, a pseudo ground truth is required, which severely limits the applicability on a wider scale. Further qualitative exploration of incorrectly synthesized images would help establish the metric’s utility, but I understand the space constraints available for the submission.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper addresses a critical limitation in validating counterfactual images generated by causal generative models (CGMs)—namely, the lack of reliable evaluation metrics. Instead of relying solely on discriminative models as pseudo-oracles, the authors propose an innovative and flexible evaluation framework that leverages synthetic, ground-truth counterfactual datasets for more effective debugging and assessment of CGMs. A particularly noteworthy contribution is the introduction of the Triangulation of Effectiveness and Amplification (TEA) metric, which provides a more comprehensive evaluation than existing metrics by simultaneously quantifying the effectiveness of target variable interventions and the extent of unintended effects.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a clearly defined set of research assumptions and a well-structured approach. The design and processing of the dataset, the experimental setup, and the analysis are all thoughtfully conducted to support the underlying assumptions, resulting in an overall compelling and well-substantiated study.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) While the use of synthetic datasets provides a controlled environment for evaluation, it would have further strengthened the study if the proposed framework had also been validated on real-world medical imaging data. 2) In addition, the evaluation of target effects appears to be conducted in a unidirectional manner. Extending the analysis to include bidirectional interventions could offer deeper insights into the causal behavior of the model and further validate the effectiveness of the proposed TEA metric.

    [Minor comment] 1) A minor typographical error was found on page 6, where “quantitiative” should be corrected to “quantitative.”

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see the comments

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for the thoughtful comments and appreciate their recognition of the utility of our novel framework for evaluating causal generative models (CGMs) using true counterfactuals. We see our work as a timely contribution to a field that is increasingly focused on causal AI and counterfactual generation, yet still primarily relies on pseudo-validation methods. As proof of concept, we applied our framework to two state-of-the-art CGMs, revealing important, previously unknown differences in their counterfactual generation, particularly in terms of effectiveness and amplification. While we used T1-weighted brain MRI data in this work, the framework is broadly applicable to other medical imaging modalities.

We would like to clarify that the fundamental benefit of using synthetic, yet realistic SimBA data in our setup lies in the ability to generate reliable, true counterfactuals, which is not typically achievable in a real imaging setup. Thus, attempting to validate the proposed TEA metric on real images would not be helpful in our initial exploration of this framework, where the purpose is to demonstrate how we can use ground truth counterfactuals to (1) help identify failure modes of CGMs early in the development process, (2) ensure CGMs work as intended, and (3) facilitate insightful and objective method comparisons. While we see SimBA+TEA working perfectly in concert in our paper, we would like to highlight that it is indeed possible to use TEA independently of synthetic data, assuming access to a strong pseudo-ground truth/baseline as mentioned in our conclusion (e.g., travelling subjects [3]).

We appreciate R2’s comment about being limited to morphological deformations in SimBA, but would like to point out that the flexibility of SimBA as a data generation tool also allows for other effects (e.g., intensity-based [14]) to be tested and validated within the TEA framework. In response to R3’s concern about ground truth counterfactuals, SimBA images are generated with spatially localized, independent morphological effects that are introduced to specific regions of interest with a controlled degree of deformation (via a mathematically justified scheme using the Log-Euclidean framework [16]). These “target effects” can be either included in or excluded from a given image; thus, we can be confident that our counterfactuals represent ground truths, since this is precisely how we define them. This also enabled us to conduct experiments with both addition and removal of target effects (bidirectionality, mentioned by R1). Due to space limitations, we chose to focus on only one direction to illustrate the applicability of our method; however, the results we obtained for the opposite direction are similar.

Regarding R3’s comments about the equations/figures of the TEA metric, we would like to clarify that they represent operations between vectors in the high-dimensional space (192x192 = 36,864) of the associated images. By taking the vector difference between the original image and its true counterfactual as the target-effect direction, we assume orthogonality between wanted and unwanted effects. Fig 1 represents this in 2D to facilitate reader comprehension. E does not capture amplification, since target effects are exclusively parameterized in SimBA by the vector connecting the ground truth counterfactuals; hence, the error component orthogonal to this line represents all unwanted effects (amplification).
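As a concrete illustration of this geometric decomposition, the minimal Python sketch below splits the error of a generated counterfactual into a component along the factual-to-true-counterfactual direction and an orthogonal remainder. It is an illustrative approximation based on the description above, not the paper’s exact TEA formulation, and all variable names are hypothetical.

import numpy as np

def decompose_counterfactual_error(factual, true_cf, gen_cf):
    """Split the generated counterfactual's deviation from the ground truth
    counterfactual into a component along the target-effect direction and an
    orthogonal remainder (a proxy for unintended, amplified effects)."""
    x = factual.reshape(-1).astype(np.float64)       # original image as a vector
    x_star = true_cf.reshape(-1).astype(np.float64)  # ground truth counterfactual
    x_hat = gen_cf.reshape(-1).astype(np.float64)    # CGM-generated counterfactual

    z = x_star - x      # target-effect direction (factual -> true counterfactual)
    e = x_hat - x_star  # residual of the generated counterfactual
    z_unit = z / (np.linalg.norm(z) + 1e-12)

    total_error = np.linalg.norm(e)           # distance from the ground truth counterfactual
    along_target = float(np.dot(e, z_unit))   # signed over-/undershoot of the target effect
    orthogonal = np.sqrt(max(total_error**2 - along_target**2, 0.0))  # unintended changes
    return total_error, along_target, orthogonal

# Toy usage with 192x192 arrays (the resolution mentioned above); real use would
# pass a factual image, its ground truth counterfactual, and the CGM output.
rng = np.random.default_rng(0)
factual = rng.normal(size=(192, 192))
true_cf = factual + 0.1 * rng.normal(size=(192, 192))
gen_cf = true_cf + 0.05 * rng.normal(size=(192, 192))
print(decompose_counterfactual_error(factual, true_cf, gen_cf))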

Moreover, while SimBA is an existing tool (for bias analyses in medical AI) as R3 correctly points out, we repurpose it in a fully novel application of comprehensively evaluating CGMs. We also see the criticized simplicity of TEA as a desirable feature, as it is intuitive and does not involve training external models with additional data (as done for pseudo-oracles) to evaluate the properties of counterfactual interventions in CGMs. Hence, our approach successfully circumvents common problems and challenges of previous methods.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I agree with all the reviewers that this paper addresses a very important and critical problem: ground-truth counterfactual evaluation of causal generative models in medical imaging. However, I also share the concern of Reviewer #2 and Reviewer #3 that a pseudo ground truth is required to compute the proposed metric, which is often not available and limits its broader applicability in real-world applications. The rebuttal clarifies that the purpose of the ground truth counterfactuals is to (1) help identify failure modes of CGMs early in the development process, (2) ensure CGMs work as intended, and (3) facilitate insightful and objective method comparisons. However, most CGMs are designed for specific tasks. How can we guarantee that the failure modes identified using synthetic SimBA data also apply to other tasks? Moreover, even if SimBA+TEA determines the best CGM model objectively, the best CGM model is not guaranteed to be the best model for other tasks and datasets. I think the rebuttal does not clearly address these questions regarding applicability and scalability. In summary, the problem the paper tries to tackle is interesting and critical, but the solution is immature.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This rebuttal does not fully address all the reviewers’ concerns.


