Abstract

Multi-modal magnetic resonance imaging (MRI) provides rich, complementary information for analyzing diseases. However, the practical challenges of acquiring multiple MRI modalities, such as cost, scan time, and safety considerations, often result in incomplete datasets. This affects both the quality of diagnosis and the performance of deep learning models trained on such data. Recent advancements in generative adversarial networks (GANs) and denoising diffusion models have shown promise in natural and medical image-to-image translation tasks. However, the complexity of training GANs and the computational expense associated with diffusion models hinder their development and application in this task. To address these issues, we introduce a Cross-conditioned Diffusion Model (CDM) for medical image-to-image translation. The core idea of CDM is to use the distribution of target modalities as guidance to improve synthesis quality, while achieving higher generation efficiency compared to conventional diffusion models. First, we propose a Modality-specific Representation Model (MRM) to model the distribution of target modalities. Then, we design a Modality-decoupled Diffusion Network (MDN) to efficiently and effectively learn the distribution from MRM. Finally, a Cross-conditioned UNet (C-UNet) with a Condition Embedding module is designed to synthesize the target modalities with the source modalities as input and the target distribution for guidance. Extensive experiments conducted on the BraTS2023 and UPenn-GBM benchmark datasets demonstrate the superiority of our method.
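The inference-time data flow described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: `translate`, `denoise_step`, `generator`, `latent_dim`, and the toy stand-ins are hypothetical names, and images are represented as flat lists of floats to keep the example dependency-free.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def translate(source: Sequence[float],
              denoise_step: Callable[[Vector, int], Vector],
              generator: Callable[[Sequence[float], Vector], Vector],
              latent_dim: int,
              sampling_number: int,
              rng: random.Random) -> Vector:
    """Noise -> latent condition (MDN role) -> target image (C-UNet role)."""
    # Start from Gaussian noise in the low-dimensional latent space.
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # MDN role: iteratively refine the noise into a target-modality latent.
    for step in reversed(range(sampling_number)):
        latent = denoise_step(latent, step)
    # C-UNet role: map the source to the target, guided by the latent.
    return generator(source, latent)


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_denoise_step(latent: Vector, step: int) -> Vector:
    return [0.5 * value for value in latent]  # shrink toward zero


def toy_generator(source: Sequence[float], latent: Vector) -> Vector:
    bias = sum(latent) / len(latent)
    return [pixel + bias for pixel in source]


out = translate([0.1, 0.2, 0.3], toy_denoise_step, toy_generator,
                latent_dim=4, sampling_number=30, rng=random.Random(0))
```

The point of the structure is that the iterative diffusion loop runs only over the small latent vector, while the image-sized network is called once.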

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0714_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0714_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xin_Crossconditioned_MICCAI2024,
        author = { Xing, Zhaohu and Yang, Sicheng and Chen, Sixiang and Ye, Tian and Yang, Yijun and Qin, Jing and Zhu, Lei},
        title = { { Cross-conditioned Diffusion Model for Medical Image to Image Translation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a Cross-conditioned Diffusion Model (CDM) for medical image-to-image translation. CDM uses the distribution of the target modalities in latent variable space as guidance to translate the source image into the target image.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The figures shown in the paper are well done.

    2. Comparisons were conducted on rich data sets.

    3. Although there are logical errors, the writing highlights the key points of the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. An important ablation experiment is missing to demonstrate the role of the diffusion model. The ablation is as follows: after traversing all target data, the FE module in Modality-specific Representation Model Training generates latent variables for all target images; uniformly sample from this set of latent variables and feed the sampled results into the Condition Embedding in Cross-conditioned UNet Training. Then report the experimental results.

    2. The expression: “Instead of directly sampling the target modalities like the conventional diffusion model, C… as input.” After reading the paper, I understand the intended meaning to be: instead of directly sampling the target modalities in the image domain like the conventional diffusion model, CDM first samples the distribution of target modalities in latent variable space; this distribution is then used as a condition to generate the target modalities in the image domain. I suggest that the authors revise the statement so that it contains no semantic or logical errors.

    3. The motivation for the random crop design is not explained. The authors could justify it with mathematical theory, cite references, or design ablation experiments showing that random crop performs better than directly inputting the original image. The random crop appears to borrow from: He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 16000-16009. It would be best to support it with an ablation experiment. An example ablation for reference: input the uncropped image directly into the encoder and then decode it. Use the diffusion model to learn the distribution of variables in the low-dimensional space produced by the encoder. Then use diffusion to sample (generate) a point from the distribution, use the sampled point to guide the translation of the source image into the target image, and report the experimental results.

    4. As a generative model, the diffusion model is sensitive to hyperparameters and its generated results are uncertain. The results in Table 1 and Table 2 should therefore report the mean and variance, not a single numerical value.

    5. The sample number exists in both the training and testing phases, and it influences the final results. The paper does not distinguish the impact of the sample number in the training phase from its impact in the testing phase.

    6. The description in Section 2.1, “the target distribution y0 is predicted by MDN from a normal Gaussian noise”, is not clear. A diffusion model generates a distribution of normal images and samples from this distribution. The sentence should state that the distribution to which y0 belongs is predicted, rather than that the distribution y0 is predicted.
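The ablation proposed in weakness 1 (replacing the diffusion model with uniform sampling from the stored training latents) could be set up along these lines. This is a minimal sketch under stated assumptions: `build_latent_bank`, `sample_condition`, and `toy_encode` are hypothetical names, and the toy encoder only stands in for the paper's FE module.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]


def build_latent_bank(encode: Callable[[Sequence[float]], Latent],
                      target_images: Sequence[Sequence[float]]) -> List[Latent]:
    """Encode every training target image once and keep all latents."""
    return [encode(image) for image in target_images]


def sample_condition(bank: Sequence[Latent], rng: random.Random) -> Latent:
    """Uniformly pick one stored latent to use as the condition."""
    return list(rng.choice(bank))


# Toy stand-in for the feature extractor (assumption): mean and range.
def toy_encode(image: Sequence[float]) -> Latent:
    return [sum(image) / len(image), max(image) - min(image)]


targets = [[0.1, 0.3, 0.5], [0.2, 0.2, 0.8], [0.9, 0.7, 0.4]]
bank = build_latent_bank(toy_encode, targets)
condition = sample_condition(bank, random.Random(0))
```

In this variant the condition is always one of the training latents, so comparing it with the diffusion-sampled condition isolates what the diffusion model contributes.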

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. There is a logical error in the expression: “Instead of directly sampling the target modalities like the conventional diffusion model, CDM first samples the distribution of target modalities. This distribution is then used as a condition to generate the target modalities, with the source modalities as input.” After reading the paper, I understand the intended meaning to be: instead of directly sampling the target modalities in the image domain like the conventional diffusion model, CDM first samples the distribution of target modalities in latent variable space; this distribution is then used as a condition to generate the target modalities in the image domain.

    2. Please explain the motivation for the random crop. It would be best to support it with an ablation experiment. An example ablation for reference: input the uncropped image directly into the encoder and then decode it. Use the diffusion model to learn the distribution of variables in the low-dimensional space produced by the encoder. Then use diffusion to sample (generate) a point from the distribution, use the sampled point to guide the translation of the source image into the target image, and report the experimental results.

    3. The authors need to explain separately, and in detail, what the sample number is during training and what it is during testing.

    Please also see the answer to “6. Please list the main weaknesses of the paper.” in this review.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The description in this paper is not rigorous, sometimes ambiguous.
    2. There is a lack of important ablation experiments on the effectiveness of diffusion models.
    3. Lack of explanation of the motivations for certain sub-steps in the design.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I think the authors resolved some of my concerns in the rebuttal, so I raised my score.



Review #2

  • Please describe the contribution of the paper

    This work proposes a new deep learning network, with components including modality-specific representation model and modality-decoupled capability, to improve MRI image generation of another contrast.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed network incorporates a novel representation model for learning the data distribution.
    2. The experiments were performed on two datasets and sufficient experiments were conducted to validate the proposed technique.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The technique is promising in terms of the metrics used in this work. However, as with many generative models, image fidelity is a major problem. Fig. 4, especially the zoomed-in patches in the second column, clearly shows a loss of detail. This problem should be discussed, and the metrics used in this work (PSNR, SSIM and MAE) may fail to capture this effect (loss of fine details).
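The concern about pixel-wise metrics can be illustrated with a small constructed example. This is a toy 1D signal, not data from the paper, and it covers only MAE and PSNR (SSIM is not computed here). A reconstruction that flattens all fine detail and one that keeps the detail but adds a uniform intensity offset receive the same scores.

```python
import math


def mae(reference, estimate):
    """Mean absolute error between two equally sized signals."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, data_range]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Target with fine alternating texture.
target = [0.4, 0.6] * 8
# Reconstruction A: texture completely smoothed away.
blurred = [0.5] * 16
# Reconstruction B: texture preserved, uniform intensity offset.
shifted = [value + 0.1 for value in target]

mae_blurred, mae_shifted = mae(target, blurred), mae(target, shifted)
psnr_blurred, psnr_shifted = psnr(target, blurred), psnr(target, shifted)
```

Up to floating-point rounding, both reconstructions have an MAE of 0.1 and a PSNR of 20 dB, although only one of them retains the texture. A perceptual or structure-aware metric is needed to tell them apart.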
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The “Avgerage scores” in Table 1 and Table 2 might not make sense and are not necessary. T1c and T2f have different contrasts. Reporting their results respectively should be fine.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    An interesting model to learn data distribution has been proposed. The authors tried to perform extensive experiments to validate their technique.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a Cross-conditioned Diffusion Model for medical image-to-image translation. First, the model learns the distribution of target modalities through a Modality-specific Representation Model (MRM). Then, the target distribution is modeled from the output of the MRM using a light diffusion network, called the Modality-decoupled Diffusion Network (MDN). Finally, a Cross-conditioned UNet (C-UNet) with a condition embedding module receives the source modalities and the distribution sampled by the MDN to generate the target modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper uses a novel approach to learn the distribution of the target modalities with the MRM module, then model it with the MDN module, and finally generate the target modalities with C-UNet.
    2. This essay is well-structured, logical, fluent, and relatively informative.
    3. The proposed method performs well and has a small number of parameters.
    4. In the experimental section, the comparison and ablation experiments are complete and the workload is substantial.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The introduction section lacks a summary of the contribution of the work in this paper.
    2. The ‘Modality-decoupled Diffusion Network (MDN) Training’ part lacks more training details and loss functions.
    3. Figure 4 is not clear enough and could be more aesthetically pleasing.
    4. The sample number is explored in the ablation experiment, but this concept does not appear in the preceding methodology, and the reason for the optimal sample number is not analyzed in the ablation experiment.
    5. The text ‘The T1 and T2 modalities are utilized to generate the T1c and T2f modalities’ does not specify whether it is one-to-one image generation or many-to-many image generation.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The method section needs to be written in more detail.
    2. The principles and formulas of the diffusion model are suggested to be written more specifically.
    3. The innovations of CDM are suggested to be written in a more specific and focused way.
    4. Complement the article contributions in the last paragraph of the introduction.
    5. The language and writing style of the article needs to be more authentic.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is evaluated in three main aspects: innovation, usefulness and completeness, and then scored in relation to the strengths and weaknesses of the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The paper is evaluated in three main aspects: innovation, usefulness and completeness, and then scored in relation to the strengths and weaknesses of the work.




Author Feedback

We appreciate the favorable comments on the novelty of our method (Reviewer#1, Reviewer#2) and the high-quality experiments (Reviewer#1, Reviewer#2, Reviewer#3). Below, we clarify the main issues raised by the reviewers.

Reviewer#1 Q1: Lack of a summary. Thanks for your suggestion. We will add a summary to the introduction section. Q2: The MDN Training lacks training details and loss functions. We use the same settings (input size=256, learning rate=1e-4, and batch size=12) for MRM, MDN, and C-UNet training, with L2 loss for MDN. Q3: Figure 4 is not clear. We will re-plot this figure. Q4: The sampling number is not mentioned in the methodology. Like conventional diffusion models, MDN predicts the clear target distribution from random noise through iterative sampling. While a larger sampling number increases accuracy, it also raises the computational cost. We found that once the sampling number exceeds 30, the improvement is marginal, so we set the sampling number to 30 to balance performance and speed. Q5: Is CDM a one-to-one or a many-to-many image generation method? Our CDM is a many-to-many image generation method. Q6: About writing. Thanks for your constructive comments. We will carefully revise all related sections.
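The rebuttal does not state which sampler is used or how many training timesteps T the MDN has. One common way to run a diffusion sampler with a reduced sampling number is to visit an evenly spaced subset of the training timesteps. The sketch below assumes that approach and an illustrative T = 1000; `sampling_schedule` is a hypothetical helper, not the authors' code.

```python
from typing import List


def sampling_schedule(num_train_steps: int, sampling_number: int) -> List[int]:
    """Evenly spaced timesteps, ordered from most noisy to least noisy."""
    if not 1 <= sampling_number <= num_train_steps:
        raise ValueError("sampling_number must be in [1, num_train_steps]")
    # Integer arithmetic keeps the spacing exact and the steps distinct.
    ascending = [(i * num_train_steps) // sampling_number
                 for i in range(sampling_number)]
    return ascending[::-1]


# Assumed values: T = 1000 training steps, sampling number 30 as in Q4.
schedule = sampling_schedule(num_train_steps=1000, sampling_number=30)
```

Each entry corresponds to one network evaluation, so the sampling cost grows linearly with the sampling number. This is the trade-off against accuracy that Q4 describes.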

Reviewer#3 Q1: Current metrics may have problems catching loss of details (image fidelity). Following the TMI22 paper “ResViT: Residual Vision Transformers for Multimodal Medical Image Synthesis,” we use PSNR, SSIM, and MAE for comprehensive assessment. Based on your suggestion, we introduce additional metrics such as LPIPS for more effective evaluation. For example, on the BraTS2023 dataset, our method achieves (T1c: 0.054; T2f: 0.055), whereas Diffusion attains (T1c: 0.079; T2f: 0.066). A lower LPIPS represents better generative quality. Q2: The “Avgerage scores” are not necessary. Following your suggestion, we will remove the average scores.

Reviewer#4 Q1: Proving the role of the diffusion model. Thanks for your valuable guidance. Following your suggestions, we have ablated the role of the diffusion model. The results (PSNR, SSIM, MAE) on the BraTS2023 dataset are (T1c: 32.02, 0.928, 0.0106; T2f: 29.38, 0.922, 0.0152), which are worse than our SOTA results. We think the reason is that a domain gap exists between the training and testing target images, and the diffusion model can adaptively learn a representation closer to the testing target images. Q2&Q6: Logical inconsistencies. We will meticulously review and revise all the highlighted sentences. Q3: About the random mask in MRM. The motivation for the random mask is that it enhances the model’s semantic understanding of the image by reconstructing the original image from its masked version. Following your suggestion, we removed the random mask operator in MRM and trained the corresponding MDN and C-UNet. The results on the BraTS2023 dataset are (T1c: 32.52, 0.940, 0.0104; T2f: 30.22, 0.926, 0.0142), demonstrating that the random mask improves the representation learning capability. We have also cited the relevant reference. Q4: Including variance. We have saved all the predictions, which allows us to easily derive the variance data and include it in the manuscript. Due to page limitations, we report a subset of the results (three metrics for two methods). For the BraTS2023 dataset, our method achieves (T1c: 33.08±1.69, 0.948±0.008, 0.0098±0.002; T2f: 30.76±1.21, 0.934±0.014, 0.0136±0.002). Diffusion attains (T1c: 31.98±1.75, 0.930±0.01, 0.0109±0.004; T2f: 29.22±1.27, 0.921±0.018, 0.0155±0.005). These outcomes demonstrate the stable performance of our method across each generation process. Q5: Sampling number in training and testing phases. For MDN training, the sampling number is selected randomly from the range [0, T] to add varying noise levels to the latent variables of the target images. Once the MDN is trained, it is frozen and applied to the C-UNet training and inference processes. Hence, we maintain a consistent sampling number during C-UNet training and inference. Following your suggestion, we will add more settings for the sampling number.
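The training-time noising described in Q5 (a randomly chosen step controls how much noise is added to a target latent) matches the standard forward process of denoising diffusion models, z_t = sqrt(ᾱ_t) z_0 + sqrt(1 − ᾱ_t) ε. The sketch below assumes a linear beta schedule with T = 1000 and a noise-prediction target for the L2 loss; none of these choices are stated in the rebuttal, and all function names are hypothetical.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def noisy_latent(latent, t, bars, rng):
    """Sample z_t from z_0 and return it with the noise that was added."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    keep, spread = math.sqrt(bars[t]), math.sqrt(1.0 - bars[t])
    z_t = [keep * z + spread * e for z, e in zip(latent, noise)]
    return z_t, noise


def l2_loss(prediction, target):
    """Mean squared error, used here as the L2 training loss."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
bars = alpha_bars(1000)          # assumed T = 1000
z0 = [0.3, -1.2, 0.8, 0.0]       # toy target latent
t = rng.randrange(len(bars))     # random step index in 0..T-1
z_t, eps = noisy_latent(z0, t, bars, rng)
```

During training, a denoising network would receive `z_t` and `t` and be penalized with `l2_loss` against `eps`. At inference the step is no longer random, since the sampler walks through a fixed schedule.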




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After rebuttal, the three reviewers all agree to (weak) accept this paper. I generally hold positive opinions on this paper: novelty, code release and promising results.

    However, the literature overview is not well done, for example regarding cross-modality medical image synthesis/translation works. Also, the details of how the target distribution is leveraged to affect the Cross-conditioned UNet are not clear to me. For example, does the (target) distribution differ across subjects? How does it ultimately influence the final synthesized target modality?

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


