Abstract
Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering “what-if” questions. However, existing approaches face two fundamental limitations. First, they fail to prevent unintended modifications, resulting in collateral changes to demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations, as validated by experienced radiologists. Our code and dataset are publicly available at https://github.com/hgminn/InstructX2X.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1216_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/hgminn/InstructX2X
Link to the Dataset(s)
N/A
BibTex
@InProceedings{MinHyu_InstructX2X_MICCAI2025,
author = { Min, Hyungi and You, Taeseung and Lee, Hangyeul and Cho, Yeongjae and Cho, Sungzoon},
title = { { InstructX2X: An Interpretable Local Editing Model for Counterfactual Medical Image Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15975},
month = {September},
pages = {280--290}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces InstructX2X, a method for generating counterfactual medical images via a region-specific editing approach. The method uses editing instructions and guidance maps, derived from relevance maps and anatomical pseudo-masks, to prevent unintended changes outside the region of interest. Additionally, the authors release the MIMIC-EDIT-INSTRUCTION dataset, built from the expert-verified medical VQA pairs of MIMIC-Diff-VQA, to assess instruction-based editing for counterfactual medical image generation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Relevance map formulation is easy to understand, and the method of combining it with dataset-derived bounding boxes per pathology is intuitive.
- The paper is well scoped and well motivated. As scoped, the motivation for interpretable local editing for counterfactual generation in the X-ray domain makes sense.
- MIMIC-EDIT-INSTRUCTION can be a good contribution to the community, and repurposing MIMIC-DIFF-VQA into MIMIC-EDIT-INSTRUCTION makes sense.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Important details regarding how expert-verified MIMIC-DIFF-VQA texts are converted to MIMIC-EDIT-INSTRUCTION texts are missing. How are the three ops (add/remove/change-the-level) applied to convert the original MIMIC-DIFF-VQA texts? Is it done via LLMs, regex, or an NLP pipeline? The dataset is a key contribution of the paper, but this important detail is missing.
- The construction of guidance maps derives only from bounding boxes from one dataset. It remains unclear how this method will generalize beyond this domain and this dataset.
- The radiologist assessment is limited and does not provide meaningful insights. The two radiologists evaluated 40 image pairs, but how the 40 cases were sampled is not documented, and why only five findings out of all findings were selected remains unclear. Likert scores of 3.59 and 3.45 for performance/interpretability do not seem like very strong results either and fail to provide clear insights.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The region-specific editing mechanism is well motivated and technically simple to understand. The dataset to validate counterfactual generation is a strong contribution of the paper, but important details of its construction are missing. Moreover, the evaluation is weak: domain generalization beyond this dataset remains unclear, and the radiologist assessment does not reveal particularly clear strengths of the approach. As the dataset is potentially the strongest contribution of the paper, without the details of how it is constructed, it is difficult to accept the paper.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors state they will share code that reproduces the dataset.
Regarding the method, the authors claim it supports user-defined mask specifications. This is true, and while the method can generalize across datasets, whether the benefits claimed in this paper will generalize to other datasets remains unknown.
Regarding the qualitative assessment, the authors explained how the dataset is constructed in the rebuttal. This is an important detail that should be included in the manuscript. If this paper is to be published, I still want this detail to be included in the paper.
Review #2
- Please describe the contribution of the paper
This paper proposes InstructX2X, a novel interpretable local editing model for counterfactual medical image generation that addresses two key limitations of existing methods: unintended modifications and lack of interpretability. It uses a Region-Specific Editing approach that restricts modifications to specific regions, preventing unintended changes while providing a Guidance Map for visual explanations of the editing process. Additionally, MIMIC-EDIT-INSTRUCTION, a new dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs, is introduced for the experiments.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Uses Region-Specific Editing approach that prevents unintended modifications by precisely editing target regions, addressing a critical limitation of existing methods
- Provides a Guidance Map that adds interpretability by directly revealing the decision mechanism, eliminating the need for post-hoc explanations of uncertain reliability
- Prepares MIMIC-EDIT-INSTRUCTION dataset which is derived from expert-verified medical VQA pairs, ensuring clinical precision and reliable editing descriptions
- Claims to achieve state-of-the-art performance across multiple evaluation metrics, including CMIG, KL divergence, and FID
- Model’s capabilities and interpretability are validated by experienced radiologists through qualitative assessments
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Does not provide a detailed comparison or analysis of the model’s performance on different types of editing instructions (e.g., add, remove, change level)
- Experiments and metrics focus primarily on chest X-ray images, and the model’s generalizability to other medical imaging modalities is not explored
- Does not discuss the potential limitations or failure cases of the Region-Specific Editing approach, such as scenarios where the target region is not well-defined or overlaps with other anatomical structures
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The major strength of the paper is its novel Region-Specific Editing approach, which addresses the critical limitations of unintended modifications and lack of interpretability in counterfactual medical image generation. The paper also introduces the MIMIC-EDIT-INSTRUCTION dataset, derived from expert-verified medical VQA pairs, which provides a reliable foundation for future work in this area. Although there are some limitations and areas for further exploration, this work may be helpful to the field of counterfactual medical image generation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors clarified some of the concerns and acknowledged that others will be considered in future work. The authors should include these concerns and limitations clearly in the manuscript.
Author Feedback
We thank the reviewers for their constructive feedback. We appreciate this opportunity for clarification.
R2-W1: Performance across editing types
We appreciate this suggestion. Our dataset’s complex multi-operation editing instructions make per-operation analysis challenging with current image-based evaluation metrics. We will explore this valuable direction in future work.
R2-W2: Generalizability concerns
InstructX2X is a domain-agnostic counterfactual image editing framework that could be applied to other medical contexts with high-quality counterfactual datasets. The primary constraint is the scarcity of appropriate evaluation datasets in other imaging modalities (e.g., brain MRI, histopathology). Our approach would be beneficial in diverse medical imaging domains by providing interpretable visualization while preserving unrelated structures, a property that is critical across medical imaging. See R3-W2 for more details.
R2-W3: Potential failure case
Our approach aggregates dataset-wide bounding boxes to create standardized pseudo masks for each pathology, offering improved attribute preservation. However, it may potentially miss atypical presentations occurring outside conventional anatomical locations. Future work will explore patient-specific anatomical guidance while maintaining region-specific editing benefits.
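As a concrete illustration of this aggregation step, the following is a minimal sketch in Python. The annotation format, image size, and coverage threshold are assumptions made for illustration; the paper's actual implementation may differ.

from collections import defaultdict

import numpy as np

def build_pseudo_masks(annotations, img_size=(512, 512), threshold=0.05):
    """Aggregate per-pathology bounding boxes into binary pseudo-masks.

    annotations: iterable of (pathology, (x0, y0, x1, y1)) tuples with
    coordinates normalized to [0, 1] (a hypothetical format).
    """
    heatmaps = defaultdict(lambda: np.zeros(img_size, dtype=np.float32))
    counts = defaultdict(int)
    for pathology, (x0, y0, x1, y1) in annotations:
        h, w = img_size
        # Accumulate how often each pixel falls inside a box for this pathology.
        heatmaps[pathology][int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] += 1.0
        counts[pathology] += 1
    # Keep pixels covered by at least `threshold` of the pathology's boxes.
    return {p: (hm / counts[p]) >= threshold for p, hm in heatmaps.items()}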
R3-W1: Dataset construction details
We converted MIMIC-DIFF-VQA to MIMIC-EDIT-INSTRUCTION using a rule-based approach. MIMIC-DIFF-VQA already follows a rule-based templated structure (e.g., “main image has additional finding of X than reference image”), making it straightforward to transform “difference” answers into our three intuitive operations (add/remove/change). This approach preserves clinical validity without introducing potential inaccuracies common in LLM-generated instructions (as noted in Section 3.2). We will release our conversion code with the dataset.
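Since this page does not spell out the exact templates or parsing rules, the following is a hedged sketch of what such a rule-based conversion could look like; the regex patterns and output phrasing are assumptions, not the authors' actual conversion code.

import re

# Hypothetical template patterns; MIMIC-Diff-VQA's real templates may differ.
RULES = [
    (re.compile(r"main image has additional finding(?:s)? of (.+?) than reference image"),
     lambda m: "add " + m.group(1)),
    (re.compile(r"main image is missing (?:the )?finding(?:s)? of (.+?) than reference image"),
     lambda m: "remove " + m.group(1)),
    (re.compile(r"the level of (.+?) has (improved|worsened)"),
     lambda m: "change the level of " + m.group(1)),
]

def answer_to_instructions(answer):
    """Map a templated 'difference' answer to a list of edit operations."""
    ops = []
    for clause in answer.split(","):
        for pattern, render in RULES:
            match = pattern.search(clause.strip())
            if match:
                ops.append(render(match))
                break
    return ops

print(answer_to_instructions(
    "main image has additional finding of pleural effusion than reference image"))
# -> ['add pleural effusion']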
R3-W2: Guidance map generalizability
As noted in R2-W2, Region-Specific Editing is not bound to a single dataset or domain. Our method explicitly “supports user-defined mask specifications” (Sec. 3.3) and is designed for broader applicability. This design principle suggests cross-domain adaptability without architectural modifications, directly addressing the generalizability concerns. User-defined masks allow anatomical customization for specific datasets, enabling integration of domain knowledge from various medical fields. Our approach could be extended to other CXR datasets by substituting the anatomical annotations (e.g., NIH-CXR, VinDr-CXR), though these datasets currently lack the counterfactual images necessary for evaluation. It could also extend to different medical imaging domains by adapting the anatomical guidance to domain-specific requirements.
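To illustrate what user-defined mask support implies in practice, below is a minimal sketch of region-specific editing as masked blending. It assumes the guidance map is a spatial mask in [0, 1]; the function name and the pixel-space formulation are illustrative assumptions, not the paper's exact diffusion-time mechanism.

import numpy as np

def region_specific_blend(source, edited, guidance_map):
    """Keep edits inside the guidance map; preserve the source elsewhere.

    source, edited: (H, W) or (H, W, C) arrays in the same value range.
    guidance_map: (H, W) float mask in [0, 1]; 1 marks the editable region.
    """
    mask = np.clip(guidance_map, 0.0, 1.0)
    if edited.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over channels
    return mask * edited + (1.0 - mask) * source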
R3-W3: Radiologist assessment details
Our radiologist evaluation yielded Likert scores of 3.59 and 3.45, indicating that our model generally adheres to intended modifications. This is meaningful given the challenging nature of medical image generation, where even moderate expert approval signifies progress. These scores are consistent with benchmark evaluations in the field; for example, RoentGen reported mean radiologist assessment scores of 3.41 and 3.29 (on a 5-point Likert scale) for its text-image alignment task.
For the radiologist assessment, we selected 40 samples by: (1) filtering the holdout set to match instruction-mentioned findings with CheXpert labels for reliability, (2) selecting cases with the five key pathologies, and (3) sampling to maintain a representative distribution across edit complexity (23 single, 17 complex-mixed), finding types (22 atelectasis, 21 pleural effusion, 8 cardiomegaly, 5 edema, 2 pneumothorax), and edit operations (34 add, 22 remove, 2 change).
The 5 findings were selected following established practice in CXR research (Bluethgen et al., 2025; Gu et al., 2023). These pathologies represent prevalent findings and are standard for evaluation, enabling direct comparability with benchmark models.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers were satisfied with the rebuttal. The authors should address the remaining concerns in the camera-ready version.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Reviews are positive and I think the approach is in general an interesting contribution to the field of counterfactual image generation and making the dataset available is a huge plus. However, critical causal baselines are missing and the technical novelty is limited.