Abstract

Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient’s age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1947_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{XiaTia_Segmentorguided_MICCAI2025,
        author = { Xia, Tian and Sinclair, Matthew and Schuh, Andreas and De Sousa Ribeiro, Fabio and Mehta, Raghav and Rasal, Rajat and Puyol-Antón, Esther and Gerber, Samuel and Petersen, Kersten and Schaap, Michiel and Glocker, Ben},
        title = { { Segmentor-guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {525 -- 535}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method for improved counterfactual image generation by using a segmentation model to guide the fine-tuning process. The approach is particularly useful in medical image processing, where the focus is on observing local changes in specific morphological structures rather than global changes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Proposes a novel method for fine-tuning counterfactual generation models using an external segmentation model as guidance.

    2. The writing is clear, well-structured, and effectively motivated.

    3. The comparison with No-CFT and Reg-CFT baselines is insightful and adds significant value to the paper.

    4. The paper presents an interesting use case involving coronary artery disease.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. No causal DAG is provided, and the method assumes that parent variables are independent. This weakens the causal interpretation of the model and aligns it more closely with conditional generation approaches.

    2. It is unclear which model was used as the pseudo-oracle for measuring effectiveness. If the same segmentation model used during fine-tuning was also used to evaluate effectiveness, this could introduce bias and be unfair to the baseline comparisons. This point should be clarified in the paper.

    3. Incorporating additional evaluation metrics, such as realism and composition, would strengthen the analysis and provide a more comprehensive assessment of the model’s performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    See weakness above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel technique for fine-tuning counterfactual generation; however, certain weaknesses must be addressed to ensure the validity of the claims before acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal clarifies several points that were previously unclear to me. I believe it meets the standards for acceptance at MICCAI.



Review #2

  • Please describe the contribution of the paper

    The authors describe an improvement of a previously proposed causal counterfactual image generation method (deep structural causal models, DSCMs). The improvement concerns the use of an improved and more robust guidance model used for counterfactual fine-tuning, which ensures that the causal model produces coherent counterfactual images. The method is evaluated in two case studies: modifying the size/area of the left/right lung or heart in frontal chest X-ray images, and similar structural changes in coronary computed tomography angiography (CCTA), with good empirical results.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is generally well-written and describes a clear and actionable improvement to a specific counterfactual image generation method. Empirically, the proposed approach appears to work very well, as would be expected, and seems essentially universally superior to the prior approach using a ‘vanilla’ classifier for counterfactual fine-tuning.

    The dual problems of finely targeted counterfactual image generation and causal image modeling are important, and any advances in these directions - such as the one described here - are generally highly welcome.

    Two interesting - though small - case studies in region-specific counterfactual image modification are presented.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    My primary concern with the manuscript is with its relatively limited novelty. If I am not missing anything, this is exactly the method from reference [6] in the manuscript, except that the vanilla regression/classification model used for counterfactual fine-tuning is exchanged for a segmentation model instead. As such, it seems to me that this is basically a rediscovery and highly specific instance of the more general finding that classification and regression models are more robust when a previous segmentation step restricts the model’s input to relevant regions of the input image. This has been reported several times in the literature [1-4]; since the authors are currently not citing any of these works, I suspect that they are unaware of these previous findings. (Notice that this is not particularly surprising: a binary classification label provides literally 1 bit of information to the training procedure, leaving much room for shortcut learning to occur, whereas a ground-truth segmentation mask contains much more information, thereby usefully constraining the learning process.)

    Now I generally welcome any advances in this area, since I personally consider it a promising and under-appreciated avenue for improving model robustness. I am, however, doubtful about whether this rather simple tweak to a previously described method is really enough for a full publication.

    My second main concern is with the lack of comparison to any baseline methods. Counterfactual fine-tuning essentially seems to me like a method to enforce separation / disentanglement of concepts in the latent space of a generative model [5]. This has been treated extensively in the (non-causal) literature. As it currently stands, the practical benefits of the (technically rather complex) causal approach presented here remain to be demonstrated. Would the counterfactual images be (in which sense?) less useful if generated by any of the various other proposed approaches (e.g. [6, 7])? What about a simple conditional diffusion model (e.g. [8, 9]) with segmentation-based guidance?

    References:
    [1] Hooper et al., A case for reframing automated medical image classification as segmentation, https://proceedings.neurips.cc/paper_files/paper/2023/hash/ad6a3bd12095fdca71c306871bdec400-Abstract-Conference.html
    [2] Luo et al., Rethinking Annotation Granularity for Overcoming Shortcuts in Deep Learning–based Radiograph Diagnosis: A Multicenter Study, https://pubs.rsna.org/doi/full/10.1148/ryai.210299
    [3] Saad et al., Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity, https://proceedings.mlr.press/v182/saab22a.html
    [4] Aslani et al., Optimising Chest X-Rays for Image Analysis by Identifying and Removing Confounding Factors, https://link.springer.com/chapter/10.1007/978-981-16-6775-6_20
    [5] Klys et al., Learning Latent Subspaces in Variational Autoencoders, https://proceedings.neurips.cc/paper_files/paper/2018/file/73e5080f0f3804cb9cf470a8ce895dac-Paper.pdf
    [6] Sun et al., Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals, https://proceedings.mlr.press/v227/sun24a/sun24a.pdf
    [7] Cetin et al., Attri-VAE: Attribute-based interpretable representations of medical images with variational autoencoders, https://doi.org/10.1016/j.compmedimag.2022.102158
    [8] Weng et al., Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/11674.pdf
    [9] Sobieski et al., Rethinking Visual Counterfactual Explanations Through Region Constraint, https://openreview.net/forum?id=gqeXXrIMr0

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The introduction is quite dense to read; an overview figure that clarifies the differences between the proposed approach and some prior approaches might be helpful.

    The paragraph on the bottom of page 4 was difficult for me to follow. “Counterfactual parents” would include e.g. the changed lung area “parent” of the image? I think one issue with this paragraph is that it starts out formulating things in a very general framing, but towards the end it only applies to the highly specific case where the parents are features quantifying a structure’s area. The meaning and purpose of pa_hat_x are not explicitly introduced. Figure 1 is missing the key information that all of this is only used for CFT. (The figure shows how pa_hat_x is obtained, but not what is done with it, nor that essentially pa_hat_x == structure area.)

    The “loss function l” is mentioned but never specified.

    “We manually selected around 85k subjects” - it seems highly unlikely that this was indeed done manually?

    Many of the references are wrongly capitalized (e.g. “vae”, “gan”, etc.).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In principle, this is a good, well-written contribution. However, due to the combination of 1) limited novelty (it is known that segmentation-guided classification is generally more robust; this is not discussed appropriately), and 2) lack of comparison to any non-causal baseline strategies to achieve the same aim, I am currently tending towards rejection.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed most of my concerns in their rebuttal. I still think the novelty is somewhat limited, but if they appropriately discuss this in their final version, I think the translation to the generative setting of the earlier findings on the improved robustness of segmentation-guided models is a very useful contribution. This is a good paper.



Review #3

  • Please describe the contribution of the paper

    The paper introduces Seg‑CFT, a counterfactual fine‑tuning scheme for Deep Structural Causal Models (DSCMs). Instead of the usual regressor/classifier guidance, the authors plug in a frozen segmentation network during training: a generated image is segmented; simple scalars (e.g. lung area) are computed from that mask; and the loss forces those scalars toward desired counterfactual targets. This preserves the convenience of scalar interventions while injecting spatial awareness, leading to locally correct edits. Experiments on chest X‑rays (PadChest) and CCTA show lower MAPE/MAE and fewer unintended global changes than the state‑of‑the‑art Reg‑CFT.
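    The "segment, measure, compare" loop described above can be sketched in a few lines. The following is a minimal, hypothetical PyTorch sketch; the function name `seg_cft_loss`, the soft-area computation, and the L1 penalty are illustrative assumptions, not the authors' exact implementation:

    ```python
    import torch
    import torch.nn.functional as F

    def seg_cft_loss(cf_image, target_areas, segmentor):
        """Hypothetical sketch of a segmentor-guided fine-tuning loss.

        cf_image:     (B, 1, H, W) counterfactual images from the generative model.
        target_areas: (B, K) desired relative areas for K structures,
                      e.g. left lung, right lung, heart.
        segmentor:    frozen segmentation network returning (B, K, H, W) logits.
        """
        # The segmentor's weights stay frozen, but its forward pass remains
        # differentiable, so gradients flow back into the generative model.
        probs = torch.sigmoid(segmentor(cf_image))
        # Soft relative area per structure: mean foreground probability.
        pred_areas = probs.mean(dim=(2, 3))
        # Penalise deviation of the measured scalars from the targets.
        return F.l1_loss(pred_areas, target_areas)
    ```

    In practice the segmentor's parameters would be frozen beforehand (e.g. `p.requires_grad_(False)` for each parameter) so that only the generative model is updated by this loss.
    
    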

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Simple but elegant idea: turn any structure‑specific scalar into a spatially grounded supervision signal by letting a segmentor close the loop.
    2. Evaluation on two quite different medical datasets; both quantitative (Table 1) and qualitative (Figs. 2–3) results show consistent gains in performance.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The three structure‑specific variables in each dataset are assumed independent “for simplicity”. That ignores obvious anatomical dependencies and may inflate effectiveness numbers. The authors do, however, acknowledge this in the conclusion.
    2. The PadChest masks come from an external model (torchxrayvision). If those masks are biased, Seg‑CFT may simply inherit the bias. An ablation with noisier/stronger segmentors would strengthen the claim.
    3. Only Reg‑CFT is compared as a baseline. Related diffusion‑based editing approaches are cited but not benchmarked.
    4. Only single‑attribute interventions are tested; multi‑attribute interactions are not.
    5. Occasional typos (“diease”, “meanining”)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The segmentor-guided counterfactual image synthesis approach is novel.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I am satisfied with the rebuttal.




Author Feedback

We thank the reviewers for their constructive feedback. We are encouraged that R1 and R3 find our work “novel” (R1, R3), “simple but elegant” (R3), and “insightful” (R1). We believe R6 makes several unjustified statements, which we address below, among other points.

(1) Novelty of Seg-CFT (R6): Thank you for providing citations [1–4]. We initially did not cite these, as they focus on discriminative rather than generative models. We appreciate, however, that these works motivate our design choices, and we will include them. While our method is simple by design, it introduces novel segmentation-derived, structure-specific scalars as intervenable parent variables in a DSCM, bringing a distinct form of supervision and causal control. This novelty has been highlighted by the other reviewers.

(2) Comparison to Non-Causal or Diffusion Methods (R3, R6): Our comparisons to No-CFT and Reg-CFT are deliberate: all methods share the same generative backbone and differ only in supervision, allowing us to isolate the effect of segmentation-guided fine-tuning. Since the variables are conditionally independent, please note that No-CFT effectively acts as the conditional baseline suggested by R6. We did not benchmark diffusion-based methods, as our focus is on guidance strategies, not the generative architecture. A recent benchmark (Melistas et al., NeurIPS 2024) shows that HVAEs outperform diffusion models and highlights the benefit of causal over non-causal editing. Yulun Wu et al. (ICLR 2025; VCI) also argue that diffusion models violate abduction by resampling exogenous noise. We agree, however, that diffusion models are worth considering in future work.

(3) Causal Graph Clarification (R1, R3, R6): In fact, we did use a more complex DAG in our PadChest implementation, with sex and age as parents of LLA, RLA, and HA. The figure was removed due to space constraints, and unfortunately we neglected to clarify this in the text. We will add this in the revision.

(4) Evaluation Segmentor (R1): We confirm that the segmentor used for evaluation is not the same as the one used for fine-tuning; each is trained separately. We agree with R1 that this is important to avoid biasing the results.

(5) Evaluation Metrics (R1): We agree that adding further metrics such as realism and composition would add value. These metrics are easy to add, as they are already implemented. We would expect them to further confirm the benefit of Seg-CFT, which is already superior on the most relevant metric, effectiveness.

(6) Multi-Attribute Interventions (R3): Our method fully supports multi-attribute interventions, but we focused on single-variable edits for clarity, as multi-variable evaluation is more complex, especially under nonlinear interactions. We agree this is a valuable direction for future work.

(7) Segmentor Bias (R3): Excellent suggestion. We agree that Seg-CFT may inherit bias from the pretrained segmentor; robustness to segmentor quality is an important direction for further investigation that we will highlight in the discussion.

(8) Dataset Scale and Manual Curation (R6): R6 questioned whether we manually checked 85k images. We confirm that the PadChest dataset was manually curated. We developed a bespoke and efficient interface displaying 50 images at a time in a 5×10 grid, allowing annotators to quickly identify and flag low-quality images. The whole process took around 3 weeks at ~3 hours per day. We disagree with the reviewer's comment that the scale of our experiments is small, given the size of the dataset and the considerable effort required to curate 85k samples.

Other minor comments will be addressed accordingly.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There is a consensus to Accept after the rebuttal. An effective rebuttal.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors received constructive criticism, responded thoughtfully and effectively, and ultimately convinced all reviewers of their work’s merit. The final consensus is that this is a well-written paper with a useful and novel contribution for the MICCAI community.


