Abstract

Multi-modal imaging plays a crucial role in clinical diagnosis and medical research. However, its widespread adoption is hindered by significant time and hardware costs. Medical image translation, which aims to synthesize missing modalities from available data, presents a promising solution. Nevertheless, existing models often struggle to maintain the structural consistency required for clinical applications. To address these challenges, we introduce DisDiff, a novel disentangled adversarial diffusion framework designed to preserve anatomical structure while enhancing synthesis quality. DisDiff incorporates a Disentangled module that decouples content and style factors within image features, thereby enabling the generation of anatomically precise images. By utilizing these disentangled representations as conditional inputs, DisDiff not only accelerates learning compared with traditional diffusion-based models but also improves image quality and training efficiency. Furthermore, we propose a content discriminator module to enforce anatomical consistency, addressing the lack of explicit structural guidance in conventional diffusion models. Experimental evaluations on multi-contrast MRI translation demonstrate that DisDiff substantially outperforms existing methods in both image quality and structural preservation, positioning it as a promising solution for real-world clinical applications.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1733_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaYip_DisDiff_MICCAI2025,
        author = { Zhang, Yipin and Yu, Ziqi and Zhang, Xiange and Zhang, Shengjie and Chen, Xiang and Yang, Haibo and Zhang, Xiao-Yong},
        title = { { DisDiff: Disentanglement Diffusion Network for MR Imaging Translation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {152 -- 162}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces DisDiff, a novel disentangled adversarial diffusion framework for medical image translation, focusing on multi-contrast MRI tasks. The model aims to address challenges like structural consistency and inefficiency in traditional diffusion and GAN-based approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript is clearly structured, with a logical flow and sufficient technical detail for replication.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The novelty is moderate. The core idea is actually replacing SynDiff’s CycleGAN with MUNIT, which does not represent a significant conceptual leap.
    2. Ambiguous Dataset Splits. The method claims validation set evaluation (every 5 epochs) in “Implementation Details”, but only training/test splits are explicitly defined, raising reproducibility concerns.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    please see strengths and weaknesses above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    please see strengths and weaknesses above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a method to guide diffusion models by decoupling image information into content and style components. Building on this, the paper introduces a content discriminator to constrain structural consistency, and a cycle-consistency mechanism is formed through the cycle reconstruction process. The method shows significant performance advantages over many SOTA image translation methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This paper proposes a novel method for medical image translation. The motivation for enforcing cyclic consistency of image content is well illustrated and offers inspiration for other image-translation tasks. (2) Sufficient comparative experiments and interesting results. The comparisons include GAN-, Transformer-, and diffusion-based methods, which provide strong support for the results. In addition, in Figure 3 the authors visually demonstrate the content consistency of images across modalities, which is a nice touch.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. There are some inconsistencies in this paper. For example, the authors mention that Dc is a discriminator used for content consistency, but in Figure 1 the module connected to Dc is Em, which is actually the attribute encoder described in the text.
    2. In addition, as shown in the diffusion module of Figure 1, the inputs for training the diffusion model are the noisy m_{j→i} or m_{i→j}, and the outputs are the reconstructed original images m_j and m_i. When using this module for inference, the input should be the noisy original image and the output should be the translated image. There seems to be an ambiguity in the training and inference logic here; the authors should clarify this point.
    3. This paper mentions several times that the decoupling method guiding the diffusion model can improve learning efficiency, and also states that DisDiff outperforms other state-of-the-art methods in terms of both image quality and efficiency. However, the only experimental evaluation metrics are PSNR and SSIM, both of which are objective measures of generated-image quality rather than efficiency metrics.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has certain novelty, but it contains some inconsistencies and ambiguities. Moreover, an evaluation of the model's inference speed and verification of the claimed efficiency improvement are not reported.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a diffusion-model-based method for medical image translation. It uses a disentanglement module to separately extract anatomical features and modality-specific features, improving translation results and the interpretability of the approach. An adversarial diffusion model is used to generate target-modality images given the features extracted by the disentanglement module. The proposed approach is tested on two brain MRI datasets for T1 and T2 translation and shows competitive performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method is compared with a wide range of baseline and SOTA methods to demonstrate its effectiveness.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The method description is relatively unclear. Many notations are undefined, some notations are used inconsistently across the paper, and the description of the diffusion model for image translation is a bit confusing.

    Section 2.3: E_{\phi_A} and E_{\phi_B} in (8) and D_{\phi_A} and D_{\phi_B} in (9) are all undefined. It seems E represents encoder and D the decoder. What do \phi_A and \phi_B refer to?

    In (8), what do z_A and m_A stand for, respectively? Which denotes content features and which denotes attribute features? Please clarify.

    The method for image translation is quite confusing. The paper states, "At each time step t, the generator G_{\theta_A}, G_{\theta_B} produces denoised estimates \tilde{x}_A^0, \tilde{x}_B^0 of the target modality images." The task of image translation is to predict one modality from the other: the algorithm should take one modality as input and output the other. However, the above sentence suggests that the model predicts two modalities simultaneously, which is confusing.

    Furthermore, the paper states “cross-modal synthesis is achieved by applying the disentanglement decoder with content features exchanged” and “The diffusion module reconstructs target modality images from content features provided by the disentanglement module.” The former sentence suggests that the target modality image is generated by the disentanglement decoder, while the latter, as well as the rest of the paper, indicates that the output is generated by the diffusion model. Which is the correct understanding? Please clarify.

    Inconsistent usage of notations:

    • In Section 2.2, z_c and z_a are used to denote content and attribute features, but in Section 2.3, z and m are used for such purpose.
    • In Section 2.3, A and B are used to represent two different modalities, but in Section 2.2, i and j are used for each modality.

    Fig 2: The images are too small to distinguish the details. It might be better to show just a few selected methods.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the method is novel and interesting and the experiments are extensive, the method description and use of mathematical symbols can be improved in clarity.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my questions and promised to correct the inconsistent notations in the final version.




Author Feedback

We want to thank the reviewers for the constructive comments. The following is our point-by-point response to the comments.

Reviewer 1: The novelty is moderate & ambiguous dataset splits

Response: We respectfully disagree that our contribution is merely a "swap" of SynDiff's CycleGAN with MUNIT. DisDiff contributes a task-specific, structure-preserving disentangled representation learning framework for diffusion-based medical image synthesis, introduces a novel conditioning paradigm, and enforces explicit anatomical consistency: (1) Anatomy-aware disentanglement and a structure-based conditional generation paradigm within a diffusion framework. SynDiff, like other SOTA methods, treats content and appearance implicitly in a shared latent space. DisDiff instead (i) learns an explicit content-style separation guided by the content discriminator D_c, and (ii) uses the extracted content code alone to condition the generator. This design delivers modality-agnostic conditioning: anatomy-supervised disentanglement constrains the latent representation to faithfully preserve geometric detail. (2) In contrast to natural-image tasks, anatomical consistency is crucial in medical imaging. D_c is trained to enforce alignment between the synthesized and input content domains, introducing an explicit structure-preserving constraint during training that is absent in prior diffusion-based frameworks. (3) The improvements achieved by DisDiff go beyond a simple integration of SynDiff and MUNIT, as evidenced by extensive numerical results and ablation studies, which further demonstrate our novelty.
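
To make the conditioning paradigm described above concrete, here is a minimal PyTorch-style sketch of the flow (a content encoder extracts an anatomy code, the generator is conditioned on that code alone, and a content discriminator D_c penalizes content drift in the synthesized image). All module names, architectures, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: content-conditioned synthesis with a content
# discriminator, following the high-level description in the rebuttal.
# Module names, shapes, and loss weights are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    """Maps an image to a spatial content (anatomy) code."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Synthesizes a target-modality image conditioned only on the content code."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )
    def forward(self, z_content):
        return self.net(z_content)

class ContentDiscriminator(nn.Module):
    """D_c: judges whether a content code comes from a real or synthesized image."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, 1, 3, stride=2, padding=1),
        )
    def forward(self, z_content):
        return self.net(z_content)

E_c, G, D_c = ContentEncoder(), Generator(), ContentDiscriminator()

def generator_step(x_src, x_tgt):
    """One hypothetical generator update: synthesize the target modality from
    the source content code and penalize content drift via D_c."""
    z_src = E_c(x_src)              # anatomy-only representation of the source
    x_fake = G(z_src)               # target-modality estimate from content alone
    rec = F.l1_loss(x_fake, x_tgt)  # pixel-level synthesis loss
    z_fake = E_c(x_fake)            # content extracted from the synthesized image
    logits = D_c(z_fake)
    # Structure-preserving constraint: D_c should not be able to tell the
    # synthesized content code apart from a real one.
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return rec + 0.1 * adv

loss = generator_step(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```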

For dataset splits, we randomly split the dataset into 90%/10% for training and validation. We will clarify this in the revised version.

Reviewer 2: Inconsistencies in Figure 1 & ambiguity in training and inference logic & efficiency

Response: We will correct the inconsistent labeling of D_c in Fig. 1 in the final version.

During training, the diffusion model takes the cross-domain image m_{i→j} as input and learns to reconstruct the image m_i, thereby learning to translate images from domain #j to domain #i. During inference, we input an image from domain #j and apply the trained diffusion model to generate the corresponding image in domain #i, referred to as cross-domain translation. We will revise the manuscript accordingly in the final version.
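
For clarity, a minimal sketch of this training/inference asymmetry follows; `disentangle_translate` and `denoise` are hypothetical placeholders standing in for the disentanglement module and the trained reverse-diffusion step, and none of this is the paper's actual code.

```python
# Illustrative sketch of the training vs. inference flow described above.
# `disentangle_translate` and `denoise` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def disentangle_translate(m_i):
    """Placeholder: disentanglement module produces a cross-domain image m_{i->j}."""
    return m_i  # identity stand-in for illustration only

def denoise(x_t, cond, t):
    """Placeholder: one reverse-diffusion step conditioned on `cond`."""
    return x_t - 0.01 * t * torch.randn_like(x_t)

def training_step(m_i, num_steps=4):
    """Training: corrupt the cross-domain image m_{i->j} with noise and learn to
    recover the original image m_i, i.e. the j -> i translation direction."""
    m_ij = disentangle_translate(m_i)            # intermediate cross-domain image
    x_t = m_ij + torch.randn_like(m_ij)          # noisy input to the diffusion module
    for t in reversed(range(num_steps)):
        x_t = denoise(x_t, cond=m_ij, t=t)
    return F.l1_loss(x_t, m_i)                   # reconstruction target is m_i

def inference(m_j, num_steps=4):
    """Inference: start from a noisy domain-j image and run the trained reverse
    process to obtain the translated domain-i image."""
    x_t = m_j + torch.randn_like(m_j)
    for t in reversed(range(num_steps)):
        x_t = denoise(x_t, cond=m_j, t=t)
    return x_t
```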

Regarding efficiency, both methods share the same diffusion-based inference; thus, we demonstrate efficiency from the training perspective: with a batch size of 4, DisDiff converges in ~1.2 days, while SynDiff needs ~2.3 days for similar performance.

Reviewer 3: Undefined and inconsistent notations & Cross-modal image generation & Figure 2 is too small

Response: We clarify the notations as follows. In Eqs. (8) and (9), E_{\phi_A}/D_{\phi_A} and E_{\phi_B}/D_{\phi_B} denote the encoders/decoders for domains A and B, respectively; the subscripts \phi_A and \phi_B are the network parameters. z_A/z_B denote content features, while m_A/m_B denote attribute (style) features. The sentence "the generator G_{\theta_A}, G_{\theta_B} produces denoised estimates \tilde{x}_A^0, \tilde{x}_B^0" is intended to describe the A→B and B→A tasks together, not translation in both directions simultaneously. We will clarify this in the final version.
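
For readability, below is one plausible LaTeX rendering of the encode/decode relations implied by this clarification (content and attribute codes extracted per domain, then recombined with content exchanged); it is an assumed form for illustration, not a verbatim reproduction of Eqs. (8)-(9).

```latex
% Plausible (assumed) form of the disentanglement relations implied by the
% response; not a verbatim reproduction of Eqs. (8)-(9) in the paper.
\begin{align}
  (z_A, m_A) &= E_{\phi_A}(x_A), & (z_B, m_B) &= E_{\phi_B}(x_B), \\
  \hat{x}_{A \to B} &= D_{\phi_B}(z_A, m_B), & \hat{x}_{B \to A} &= D_{\phi_A}(z_B, m_A),
\end{align}
```

where z denotes content (anatomy) features and m denotes attribute (style) features.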

The sentence "cross-modal synthesis is achieved by applying the disentanglement decoder" refers to the training phase of the disentanglement module, where a cross-domain image (e.g., m_{i→j}) is generated as an intermediate result. This image is unlabeled and is not used as the final output. Instead, it is passed to the diffusion module, which reconstructs the source-domain image m_i from it. This process helps the model learn to translate modality #j to modality #i while preserving content. Therefore, while the disentanglement decoder acts as an initial translator, the final image generation is performed by the diffusion module.

We will resolve the notation inconsistencies and replace the low-resolution figure in the final version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal generally addressed the concerns of R1, and all reviewers now seem to agree to accept this work. However, I believe this paper could still be improved, potentially with different assessment criteria and an evaluation of efficiency. The multi-/dual-network information disentanglement framework was already well explored several years ago, and I cannot take the statement "The code will be released if the paper is accepted" at face value, as code can be shared anonymously through platforms such as anonymous.4open.science. Although this is a borderline case that could be considered for publication at MICCAI, due to the large volume of submissions we cannot accept all such papers.


