Abstract

Counterfactual medical image generation enables clinicians to explore clinical hypotheses, such as predicting disease progression, facilitating their decision-making. While existing methods can generate visually plausible images from disease progression prompts, they produce silent predictions that lack interpretation to verify how the generation reflects the hypothesized progression—a critical gap for medical applications that require traceable reasoning. In this paper, we propose Interpretable Counterfactual Generation (ICG), a novel task requiring the joint generation of counterfactual images that reflect the clinical hypothesis and interpretation texts that outline the visual changes induced by the hypothesis. To enable ICG, we present ICG-CXR, the first dataset pairing longitudinal medical images with hypothetical progression prompts and textual interpretations. We further introduce ProgEmu, an autoregressive model that unifies the generation of counterfactual images and textual interpretations. Extensive experimental results demonstrate the superiority of ProgEmu in generating progression-aligned counterfactuals and interpretations, showing significant potential in enhancing clinical decision support and medical education. Project resources are available at: https://progemu.github.io

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1872_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1872_supp.zip

Link to the Code Repository

https://progemu.github.io

Link to the Dataset(s)

https://progemu.github.io

BibTex

@InProceedings{MaChe_Towards_MICCAI2025,
        author = { Ma, Chenglong and Ji, Yuanfeng and Ye, Jin and Zhang, Lu and Chen, Ying and Li, Tianbin and Li, Mingjie and He, Junjun and Shan, Hongming},
        title = { { Towards Interpretable Counterfactual Generation via Multimodal Autoregression } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {610 -- 620}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel task called Interpretable Counterfactual Generation (ICG), which requires jointly generating counterfactual medical images and textual interpretations that explain the changes induced by a given disease progression prompt. The authors first curate a large-scale longitudinal chest X-ray dataset (ICG-CXR) with paired progression prompts and interpretation texts to enable ICG. They then propose ProgEmu, a multimodal autoregressive model that unifies the generation of counterfactual images and interpretation texts within a single framework, ensuring causal alignment between the generated images and texts.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Addresses a critical gap in existing counterfactual generation methods, which produce silent predictions without interpretations to verify the generated changes, by introducing the ICG task and the ICG-CXR dataset
    • Proposes ProgEmu, a unified multimodal autoregressive approach that jointly learns image and text generation
    • Provides quantitative and qualitative evaluations that demonstrate the effectiveness of ProgEmu in generating high-quality counterfactual images and interpretations aligned with the input progression prompts
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Using only GPT-4 may introduce biases; a comparative study with other domain-specific or general VLMs is needed
    • Experiments are limited to the chest X-ray modality; other medical domains could also be explored
    • Lacks validation through real-world studies or feedback from medical professionals
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors claim they will make the dataset public after acceptance. However, the lack of publicly available code or pre-trained models may hinder immediate reproducibility.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major strengths of the paper are the proposal of a novel task and dataset, and the introduction of an effective unified multimodal autoregressive model that achieves state-of-the-art performance in generating interpretable counterfactuals. It has the potential to enhance clinical decision-making and medical education.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper has two main contributions. First, it introduces a large-scale longitudinal CXR dataset for counterfactual generation with both images and text. Second, it develops a multimodal autoregressive model based on this dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strength: 1) Novelty. This paper introduces longitudinal generation of both CXR images and paired text. Such a tool could be more meaningful for clinical use.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There are several weaknesses in this work. 1) The longitudinal dataset was curated from MIMIC-CXR and CheXpertPlus. Although the train/test split ensures no overlap, there is still a mixture of data sources; to demonstrate generalizability, the paper should evaluate on at least one additional test dataset. 2) The evaluation metrics for the generated text are limited to traditional BLEU-3, METEOR, and ROUGE-L, which may not adequately capture the content of the generated texts. Metrics such as LLM-based ratings or human reader studies would be more suitable for assessing whether the generated text contains meaningful information.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper needs to improve its model evaluation. Although it proposes a novel method for longitudinal generation, a more holistic evaluation is required to support its hypothesis.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This work primarily contributes to the interpretability of counterfactual medical image generation, which helps clinicians better understand and trust such systems. Interpretable counterfactual generation provides a multimodal solution for disease diagnosis along with an explanation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes ICG, a task requiring the joint generation of counterfactual images and interpretation texts; curates the ICG-CXR dataset to address the data scarcity for the ICG task; and introduces ProgEmu, a multimodal autoregressive model that unifies image and text generation for interpretable counterfactual generation. It reports state-of-the-art results for generating progression-aligned counterfactual images and interpretation texts, highlighting its potential for counterfactual medical analysis.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lower AUROC than existing methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Curation of the ICG-CXR dataset, addressing the data scarcity for the ICG task.
    2. Introduction of ProgEmu, a multimodal autoregressive model that unifies image and text generation for interpretable counterfactual generation.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their thoughtful and constructive feedback. They found our contributions novel and valuable. Below, we address the key concerns.

  1. Model Performance (Reviewer 1) Thanks for the careful note on the performance comparison. We’d like to clarify that while BiomedJourney achieves marginally higher AUROC (our 0.7921 vs. 0.8385), our method excels in all other metrics while also providing interpretations. This small trade-off is compensated by better visual quality (FID↓: our 29.21 vs. 36.62), prompt alignment (CLIP-I↑: 35.24 vs. 34.59), and balanced pathological feature preservation (F1 score↑: our 0.8914 vs. 0.8411). We believe this represents a favorable balance between fidelity and interpretability.

  2. Use of GPT-4 and Potential Bias (Reviewer 2) Great question! We’d like to emphasize that, in our pipeline, GPT is only used to extract and summarize disease progression and radiological differences from existing radiology reports written by human experts, without directly comparing the images. This design ensures that the curated data are grounded in clinically validated information rather than being solely determined by a model. As such, potential bias mainly originates from the source reports themselves. Nonetheless, we agree that employing multiple LLMs/VLMs may enhance the quality control for dataset curation.

  3. Modality and Dataset Limitations (Reviewer 2, 3) We appreciate both reviewers’ constructive comments regarding dataset and modality extension. While our train/test split ensures no patient overlap, we acknowledge that cross-dataset validation would strengthen our work. However, the scarcity of longitudinal medical data with paired interpretations presents significant challenges. The creation and release of ICG-CXR itself addresses a critical data gap in the field. We focused on chest X-rays due to their clinical prevalence and relatively abundant longitudinal data, though our framework is modality-agnostic and can be adapted to other modalities in principle. As noted in Section 5, further extension and validation are valuable directions and will be part of our future work.

  4. Clinical Validation (Reviewer 2, 3) We used standard metrics to align with prior works [1], ensuring fair benchmarking. We agree these do not fully capture clinical relevance. To partially address this, we included qualitative examples (Figs. 3, 4) with attention visualization to show semantic consistency between image and text. We agree that evaluation via human or LLM-based ratings is important, and we will emphasize it in the final version as a key direction for our future work.

We thank all reviewers again for their recognition of our work, and hope our clarifications address the remaining concerns. The ICG-CXR dataset and the codebase will be made publicly accessible to the community.

[1] CXR-IRGen: an integrated vision and language model for the generation of clinically accurate chest x-ray image-report pairs.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


