Abstract

Segmenting anatomical structures in medical images plays an important role in the quantitative assessment of various diseases. However, accurate segmentation becomes significantly more challenging in the presence of disease. Disease patterns can alter the appearance of surrounding healthy tissues, introduce ambiguous boundaries, or even obscure critical anatomical structures. As such, segmentation models trained on real-world datasets may struggle to provide good anatomical segmentation, leading to potential misdiagnosis. In this paper, we generate counterfactual (CF) images to simulate how the same anatomy would appear in the absence of disease without altering the underlying structure. We then use these CF images to segment structures of interest, without requiring any changes to the underlying segmentation model. Our experiments on two real-world clinical chest X-ray datasets show that the use of counterfactual images improves anatomical segmentation, thereby aiding downstream clinical decision-making.
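The pipeline the abstract describes — generate a pseudo-healthy counterfactual, then run an unchanged segmentation model on it — can be sketched as follows. This is a minimal, illustrative stand-in: the paper's actual components are a deep structural causal model (DSCM) for counterfactual generation and a pretrained U-Net, whereas the functions below are hypothetical toys.

```python
import numpy as np

# Hypothetical stand-in for the paper's counterfactual generator (a DSCM in
# the actual work). Here, "removing" disease is simulated by clipping
# high-intensity opacity values — illustrative only.
def generate_counterfactual(image, do_disease=0):
    if do_disease == 0:
        return np.minimum(image, np.percentile(image, 90))
    return image

# Hypothetical stand-in for the pretrained segmentation model (a U-Net in
# the actual work). The key point is that this model is left unchanged.
def segment(image, threshold=0.5):
    return (image > threshold * image.max()).astype(np.uint8)

def cf_seg(image):
    """CF-Seg wiring: segment the pseudo-healthy counterfactual rather than
    the original diseased image; the segmentation model itself is untouched."""
    cf_image = generate_counterfactual(image, do_disease=0)
    return segment(cf_image)
```

The design choice worth noting is that CF-Seg changes the *input* to the segmenter, not the segmenter, which is why it can be combined with off-the-shelf models.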

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4597_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/biomedia-mira/CF-Seg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{MehRag_CFSeg_MICCAI2025,
        author = { Mehta, Raghav and De Sousa Ribeiro, Fabio and Xia, Tian and Roschewitz, Mélanie and Santhirasekaram, Ainkaran and Marshall, Dominic C. and Glocker, Ben},
        title = { { CF-Seg: Counterfactuals meet Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present CF-Seg, a framework that applies a pretrained counterfactual image generator to produce pseudo-healthy chest X-rays from images with pleural effusion, improving lung segmentation performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of generating pseudo-normal chest X-rays from X-rays with pathology is sound. The framework can be applied with various off-the-shelf models, which is a plus.

    2. The resulting lung segmentations are both quantitatively and qualitatively closer to expert-provided lung segmentations. This is shown by CF-Seg’s ability to avoid under-segmenting lungs affected by pleural effusion.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. No justification is provided for using only chest X-rays with pleural effusion. The authors simply mention that “any of the many other diseases associated with lung opacities could have been considered”. As the choice of lung pathology is critical to this study, more justification for selecting pleural effusion must be provided. For example, pneumothorax could be considered, which is usually visible in the outer regions of the lung; alternatively, what about lung nodules, which can be too small to detect easily? Showing the efficacy of CF-Seg across multiple scenarios would make this paper much stronger.

    2. CheXMask is known to under-segment the lungs, as the authors themselves point out. The U-Net baseline is trained using CheXMask, so it provides little insight beyond being similar to the CheXMask predictions. A stronger and more insightful baseline would be a U-Net model trained with human annotations.

    3. Technical novelty is limited: the work is a good application of existing methods that generate pseudo-normal X-rays, such as [31], which the authors cite.

    4. Another key set of evaluations is missing: comparison to other generative models. The authors claim that using counterfactuals helps segment anatomy, but what about other approaches for synthesizing normal X-rays?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The biggest criticism is the weak evaluation set up that focuses only on one pathology (pleural effusion) paired with one anatomy (lung). The experimental results do not adequately support claims made in the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors claim their approach has significant clinical potential for applications such as brain tissue segmentation in the presence of lesions (to assess aging-related changes) and coronary artery segmentation in the presence of plaque (to quantify arterial blockage). This is a fair claim but is currently not supported by any empirical evidence.

    Regarding the training method, I now agree more with the authors’ claimed value of training on silver-standard ground truth to prevent under-segmentation of the lungs.

    Regarding the authors’ comment that the paper’s focus is its clinical impact rather than its methodology, and that the paper “aligns more closely with the MICCAI community than with the broader ML community”: the main results provided in the paper are 1) a preference study with human experts and 2) a quantitative analysis of segmentations using metrics such as Dice, including comparisons to human experts. I do not think that claiming the paper’s main focus is its clinical impact increases its contribution: if this were the case, would it not further emphasize the need for results showing that the method generalizes to other applications? If not, the claim actually limits the scope of the paper to PE on frontal PA/AP images only.

    After going through the authors’ response, my verdict is slightly more in favor of the paper, but between accept and reject, I still lean towards reject.



Review #2

  • Please describe the contribution of the paper

    The authors propose using counterfactual images to improve segmentation in pleural effusion cases, and conduct a diverse validation based on user studies, segmentation performance, and population studies.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem addressed is a significant issue in CXR lung segmentation, where opacities often hinder correct automatic lung segmentation.
    • The validation method is quite diverse and interesting, ranging from user studies to segmentation performance and population distributions.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The fact that this is such a specific problem of CXR lung segmentation means that it might be difficult to find parallels where this technique is equally needed and has such a high impact.
    • The authors limit themselves to either normal or PE images (it is unclear whether these images are PE-only or also have other findings), so there remains a question of model performance on the remaining images, which would be changed by the DSCM. The removal of cardiomegaly may be particularly problematic, as it would change the lung–cardiac silhouette outline.
    • As much as I like the diversity of the evaluation, I believe a comparison to more recent segmentation models is crucial. U-Net is a workhorse, but more recent methods (nnU-Net or transformer-based) probably struggle less with PE cases. It would be important to show that the DSCM still plays a beneficial role for more recent, more complex segmentation models.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    There is a repetition on page 2: “anatomical anatomical”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting use of counterfactual images for improved segmentation but validation could be improved to strengthen the case for the proposed method

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel counterfactual generation method, based on structural causal models, to improve segmentation performance in the presence of disease pathology.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The study is extremely well-motivated and clinically relevant as the presence of disease pathology (e.g., lung opacities) can impact segmentation performance.
    • While DSCMs have been used for counterfactual generation in prior work, their utility in improving segmentation performance in presence of disease pathology is novel and highly relevant.
    • The inclusion of a two-radiologist reader study further strengthens the findings of the paper.
    • Validating segmentation performance using lung volumes as a proxy has strong clinical utility. Particularly, in experiment 3, where ground-truth annotations are not available, the authors go above and beyond to validate their proposed method across a larger, unannotated dataset.
    • The paper is well-written and clear. The figures and tables are well-described and easy to interpret.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The authors focus their experiments on pleural effusion, which results in visible and clearly defined lung opacities. A more comprehensive analysis of other disease pathologies (e.g., pneumothorax) would demonstrate the generalizability of the proposed method to more complex tasks.
    • In experiment 1, the reader study can be improved by including inter-rater variability and Cohen’s kappa to discuss if there were any differences in preferences between both radiologists.
    • In experiment 2, including sub-analysis of lung volumes split by pleural effusion vs no finding would strengthen conclusions made in experiment 3. Additionally, what is the reason behind the bimodal distribution of expert annotations (Fig. 4, left lung, pad chest)?
    • Inclusion of statistical comparisons in all three experiments (e.g., t-test) would strengthen the findings.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-motivated, is clinically relevant, and has a potential for high impact. Counterfactual generation is an active (and rapidly growing) research area in medical imaging and the authors propose a novel application of causality and generative models to improve segmentation performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed all of my concerns, providing clarity on their choice of a single pathology (pleural effusion). While this remains a limitation, the use of a DSCM to generate counterfactuals that improve the generalizability of segmentation models in the presence of pathology is a strong application of generative AI and highly relevant to the challenges in medical image analysis. Therefore, I recommend acceptance.




Author Feedback

Application of CF-Seg (R1, R2, R3): We primarily applied our method to lung segmentation in the presence of PE, the disease of interest for our clinical collaboration. To demonstrate its generalizability, we evaluated our method on two publicly available, large-scale datasets. While the method is applicable to other lung conditions, such as pneumothorax and lung nodules, these conditions and applications are currently being explored as part of future work. Given the promising results obtained for PE, we believe it is timely to present these results to the community, which, together with the release of our code and manual annotations, will open many new applications. Our approach has significant clinical potential for segmenting anatomical structures affected by diseases that do not change the underlying structure of interest. For instance, brain tissue segmentation in the presence of lesions aids in assessing aging-related changes, and coronary artery segmentation in the presence of plaque supports arterial blockage quantification.

Baseline and training method (R1, R2): nnU-Net could have been used instead of the standard U-Net, but its advantages—mainly in data augmentation and architecture optimization—do not address the issue of training on image/label pairs where the structure of interest is undersegmented due to pathology. As noted by R2 and confirmed in Experiment 1 and Figures 4 and 6, CheXMask, a “silver standard” dataset, tends to undersegment lungs in diseased cases. Our model is trained on CheXMask, the only dataset with segmentation masks for PE images. Therefore, replacing U-Net with nnU-Net would not necessarily lead to improvements due to the limitations of this dataset. This underscores the strength of our method, which produces clinically useful segmentation masks despite being trained on undersegmenting “ground truth”, regardless of the segmentation architecture. A potential improvement would require a large-scale dataset with expert-annotated lung segmentations for PE images, though such annotations are time and resource-intensive.

Model Performance on images with other diseases (R1): Thank you for asking a question about the effect of DSCM on images with diseases other than PE. To clarify, our set of PE test images includes images with multiple findings (e.g., PE + cardiomegaly). This is important as it demonstrates the robustness of CF-Seg to the presence of other findings. To ensure that the DSCM does not change the structure of interest in the presence of other diseases (e.g., lung-cardiac silhouette outline in cardiomegaly), we trained the model using exclusively healthy images and images with PE only (i.e., no other disease).

Technical Novelty (R2): Our method demonstrates strong performance in clinically relevant lung segmentation for PE, despite being trained solely on the silver-standard CheXMask dataset without access to ground-truth segmentations. This underscores the potential clinical impact when combining existing ML techniques. Accordingly, our focus was on clinical evaluation rather than benchmarking various methods for healthy image synthesis. We believe our work aligns more closely with the MICCAI community than with the broader ML community.

Statistical Tests (R3): For Table 1 in Experiment 2, p-values (between U-Net and CF-Seg) are < 0.05 for the right lung (column 1) and both lungs (column 3), but not for the left lung (column 2). All reported results in Experiment 2 are for PE images only. For Experiment 3, the difference in mean volume between PE and NF for CheXMask (MIMIC: 859, PadChest: 1177) and U-Net (MIMIC: 893, PadChest: 1189) is considerably higher than for CF-Seg (MIMIC: 398, PadChest: 306). Mean volume differences are significant for all three (CheXMask, U-Net, and CF-Seg). We could not calculate Cohen’s kappa for Experiment 1 within the rebuttal period, due to the unavailability of clinicians. We will add this to the revised manuscript.
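The paired significance test described above (U-Net vs. CF-Seg on the same test images) could be computed along the lines below. The per-image Dice scores here are synthetic placeholders, not the paper's data; the paper does not state which test was used, so a paired t-test is assumed for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-image Dice scores for the same 100 test images under both
# methods (placeholders only, not the paper's actual values).
dice_unet = rng.normal(0.88, 0.04, size=100)
dice_cfseg = dice_unet + rng.normal(0.02, 0.02, size=100)

# Paired test: the two score arrays are matched per image, so a paired
# t-test on the score differences is appropriate, not an unpaired one.
t_stat, p_value = stats.ttest_rel(dice_cfseg, dice_unet)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.2e}")
```

A paired design is the natural choice here because both segmentation methods are evaluated on identical images, so per-image variability cancels out in the differences.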




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


