Abstract

Time-of-flight magnetic resonance angiography (TOF-MRA) is widely recognized as the gold standard for non-invasive assessment of cerebrovascular lesions. However, its long scanning times and susceptibility to motion artifacts often result in image blurring and loss of diagnostic information. To address these limitations, the synthesis of TOF-MRA images from multi-modal MR images has emerged as an effective solution. In this paper, we propose a novel Multi-Modal Diffusion Model (MMDM) for TOF-MRA image synthesis, which fully leverages complementary anatomical and pathological information from multi-modal MR images to enhance synthesis performance. Specifically, we introduce modality-specific diffusion modules, each of which independently models the deterministic mapping from a source domain to the target domain, preserving modality-specific prior knowledge. Then, we propose a cross-modal dynamic fusion module to integrate multi-path diffusion features. Additionally, we present a Maximum Intensity Projection (MIP) loss, which constrains the consistency of adjacent slices in the maximum intensity projection space, addressing the issue of vascular discontinuities caused by 2D training. Finally, we propose a Noise-adaptive Weighting Strategy (NAWS) that dynamically balances the multi-objective loss weights based on the data distribution of the diffusion model, ensuring stable convergence during training. Experimental results demonstrate that our method significantly outperforms existing approaches on both the original images and MIP images. Our code is available at https://github.com/taozh2017/MMDM-Syn.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0751_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YuTia_Diffusionbased_MICCAI2025,
        author = { Yu, Tianen and Song, Xinyu and Xiang, Lei and Zhou, Tao},
        title = { { Diffusion-based Multi-modal MR Fusion for TOF-MRA Image Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {172 -- 182}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a diffusion-based pipeline for the synthesis of TOF-MRA images from multiple MR contrasts. They propose a cross-modality framework to integrate multi-modal MR images to enhance image-to-image translation performance. Further, they propose the integration of neighboring 2D slices to improve slice consistency along the z-axis. Lastly, they propose an adaptive loss weighting strategy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation of the work is clearly stated, and the abstract and the introduction give a good overview. The research gap is well pointed out. Figure 1 gives a generally good overview of the overall pipeline. The authors compare against a range of other image-to-image translation methods, and the ablation study is extensive.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper lacks some methodological descriptions necessary to understand the proposed framework. More specifically, the authors should address the following points:

    1. In Figure 1, it is not clear how the “cross-modal dynamic fusion” block works. It is also unclear how x_0^hat is computed from the diagram on the top right of the figure.
    2. Section 2.1 is not described in enough detail. How are delta^tilde_t, c_1, c_2, and c_3 defined? It is not clear what mu^tilde^theta_t is, nor how x_t-1 is computed. Furthermore, the architecture and the input to the model epsilon_theta are not described. It is unclear how the source modality slice x^m is passed as an input condition. The cross-modal dynamic fusion module epsilon^theta_0 is not described.
    3. In equation 4, each slice should be in R^(1xWxH).
    4. In Section 2.2, the authors state that x^hat_0 and x_0 are reversed along the channel dimension. What is the motivation to do so, and what is the impact of this tweak?
    5. In Equation 6, it is unclear how x_0^hat was obtained. I suggest adding an equation where this is explicitly stated.
    6. In equation 7, sigma_m is not defined.
    7. Among the compared methods, the authors include DDPM. However, the basic DDPM is an unconditional image synthesis model. How was this model adapted to solve the image-to-image translation task?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper is well-motivated and the experiments as well as the ablation studies are well defined, the methodological description lacks important details crucial to understanding the pipeline and ensuring reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel multi-modality framework for synthesizing TOF-MRA images from T1, T2, and FLAIR MRI scans. The authors introduce a design that employs three separate Brownian Bridge Diffusion Models (BBDMs), each dedicated to a specific input modality, in order to preserve modality-specific priors. These outputs are later combined through a dynamic fusion module. To address the issue of vascular discontinuity in the generated images, the authors propose a MIP-based loss function. Additionally, to stabilize the training of the BBDMs, they introduce a Noise-Adaptive Weighting Strategy (NAWS), which adjusts the loss contribution based on the noise distribution across the diffusion process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper addresses a clinically valuable task—generating TOF-MRA scans from routine structural MRI modalities (T1, T2, FLAIR)—potentially reducing reliance on time-consuming angiographic acquisitions.
    2. The method demonstrates strong performance in both native TOF-MRA views and MIP projections, with improved continuity and vascular detail compared to conventional GAN and diffusion baselines.
    3. Comprehensive ablation studies are provided to validate the individual contributions of MIP loss, NAWS, and the fusion strategy, offering good insight into the value of each design component.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors use three separate diffusion models (one per modality), which increases computational complexity and assumes all modalities are present at inference. Recent work like AMM-Diff [1] has shown that a single diffusion model with a shared fusion encoder can flexibly handle variable inputs and generate high-quality outputs. It would be valuable to discuss why this approach was not considered, and whether a single multi-channel input model (e.g., stacking T1, T2, FLAIR) could achieve similar performance while still preserving modality-specific priors. This is especially relevant given the clinical reality of missing or incomplete modality sets.
    2. The MIP loss is presented as a novel contribution, yet a closely related idea was previously proposed in SPOCKMIP [2], where MIP projections were used to guide vessel segmentation via multi-axis supervision. While the application domains differ (segmentation vs. synthesis), both rely on MIP as a structural prior to improve vascular continuity. This prior work should be cited and discussed, especially as it employs MIP in a more comprehensive, multi-view manner.
    3. The paper mentions a dynamic fusion module that combines the outputs from the three modality-specific diffusion models, but this component is under-explained. There is no architectural or mathematical detail, nor any ablation evaluating different fusion strategies. Given that fusion plays a central role in balancing modality-specific information, can authors explain a bit more on that part?
    4. The use of three separate diffusion models likely increases computational cost, but runtime or efficiency details are missing. Can authors comment on that?

    [1] Kebaili, Aghiles, et al. “AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation.” arXiv preprint arXiv:2501.12840 (2025).

    [2] Radhakrishna, C., Chintalapati, K. V., Kumar, S. C. H. R., Sutrave, R., Mattern, H., Speck, O., … & Chatterjee, S. (2024). “SPOCKMIP: Segmentation of Vessels in MRAs with Enhanced Continuity using Maximum Intensity Projection as Loss.” arXiv preprint arXiv:2407.08655.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave a weak accept as the method is promising and performs well, but key components like fusion and efficiency need more clarity, and the novelty of the MIP loss should be better positioned against prior work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors of this paper propose a method to address the challenges in acquiring TOF-MRA images, which are often hard to obtain due to motion artifacts and prolonged scanning times. The method, primarily based on diffusion models, generates these images using multimodal MRI data as guidance. By using a Brownian bridge backbone and a dynamic fusion module, it effectively fuses representations of these modalities to produce TOF-MRA images. It uses a dual-loss approach, combining the maximum intensity projection loss and the diffusion model loss, both of which are optimized using a noise-adaptive weighting strategy. This method allows for the generation of anatomically continuous vascular structures in the rendered 3D volumes, resulting in high-resolution and detailed volumes. The loss weighting strategy improves the stability of the model’s training process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The framework uses modality-specific deterministic modules, built on the Brownian Bridge Diffusion Model, to extract latent space representations of the TOF-MRA modality. These are subsequently fused within a cross-modal fusion module, operating directly in the TOF-MRA space instead of learning representations that would only increase accumulated errors. These deterministic modules effectively preserve vascular details and incorporate prior knowledge by operating directly in the TOF-MRA space. The use of the Maximum Intensity Projection loss reduces artifacts in the vessels of the 3D volumes, resulting in high anatomical coherence. Additionally, the paper includes comprehensive experiments, with thorough baseline comparisons and both quantitative and qualitative evaluations. The loss terms have also been validated via ablations, showing improved performance with their final proposed method. This work shows promise for future clinical adoption, where structurally sound MRI images can be used to generate TOF-MRA images.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The proposed method has shown very little improvement in performance metrics as compared to baselines such as BBDM and ResViT. The drawbacks of using a BBDM backbone for the modality-specific diffusion modules, such as scalability and training stability, have not been discussed in detail.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Here are additional comments that I would prefer be addressed:

    What are the computational implications of employing multiple Brownian Bridge diffusion modules? If these modules are close to deterministic, could they be replaced with a computationally less expensive image translation task without compromising performance?

    What are some of the challenges one might face from using the noise-adaptive weighting strategy, particularly the second term in Equation 10, when handling a large number of modalities?

    In equation 10, if the first term pushes the output of the multimodal fusion module to be close to x_0 and the second term pushes each of the individual modalities’ representations \hat{x}_0 to be close to x_0, how advantageous is the dynamic fusion module? The operation performed by this module on the three representations to produce \hat{x}_0 needs to be explicitly stated and explained.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Though there are some details missing in the paper, the proposed framework provides a novel approach to approximating and integrating intermediate representations of multi-modal data and shows promise for future work involving a wide range of data types.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Thanks for the constructive comments; we will release the code to provide more implementation details.

Common comment Q1: Cross-modal dynamic fusion A1: It contains three independent Restormer blocks. Each block takes a modality’s TOF prediction (\hat{x}^m_0) and the original images of the other two modalities as input. The outputs from these blocks are combined and input into a UNet, producing a joint TOF prediction (\hat{x}_0). This design dynamically fuses the three predictions to avoid poor results from missing information. In Fig. 1 (top right), the inputs to the Restormer blocks are highlighted with red, green, and blue ellipses.
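
The input wiring described in A1 can be sketched as below. This is only an illustration of how each branch's input is assembled; array shapes, modality names, and the use of plain numpy stacking are assumptions, and the Restormer and UNet blocks themselves are omitted.

```python
import numpy as np

# Hypothetical shapes: each slice is (H, W); three source modalities.
H, W = 64, 64
mods = ["T1", "T2", "FLAIR"]
rng = np.random.default_rng(0)
x0_hat = {m: rng.random((H, W)) for m in mods}  # per-modality TOF predictions
orig = {m: rng.random((H, W)) for m in mods}    # original source slices

# Input wiring for the three fusion branches (Restormer blocks in the rebuttal):
# branch m sees its own TOF prediction plus the *other* two original images.
branch_inputs = {}
for m in mods:
    others = [orig[o] for o in mods if o != m]
    branch_inputs[m] = np.stack([x0_hat[m]] + others)  # shape (3, H, W)
# The three branch outputs would then be combined and fed to a UNet
# to produce the joint prediction \hat{x}_0.
```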

To R1: Q1: Symbolic representation A1: Sec. 2.1 up to Eq. 3 is a summary of BBDM. The constants \tilde\delta_t, c_1, c_2, and c_3 are related to the time step but are not given due to space constraints; more details can be found in Ref. 12. From Eq. 2, we have x_{t-1} = \tilde\mu_t + \tilde\delta_t * \epsilon, but x_0 in \tilde\mu_t is unknown. Therefore, we fit \tilde\mu_t by \tilde\mu^\theta_t using the model \epsilon_\theta, and then compute x_{t-1}. In Ref. 12, the architecture is a UNet and the inputs are (x_t, t). In our method, the source-modality slice x^m is concatenated with x_t and used as input.

Q2: Dimension reversal in MIP loss A2: Different projection directions may lose different information. The channel dimension is reversed here to fully capture the high signal of finer blood vessels.
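
For readers unfamiliar with MIP-based supervision, a minimal numpy sketch of the core idea follows: project both the predicted and ground-truth slice stacks by a max along the slice axis and penalize the L1 difference of the projections. The function name and axis convention are illustrative; the paper's actual loss additionally constrains adjacent-slice groups and applies the channel reversal discussed above.

```python
import numpy as np

def mip_l1_loss(pred, target, axis=0):
    """L1 distance between maximum-intensity projections along `axis`.

    pred, target: stacks of slices with shape (S, H, W); axis=0 projects
    through the slice dimension, as in a standard MIP view.
    """
    mip_pred = pred.max(axis=axis)
    mip_target = target.max(axis=axis)
    return float(np.abs(mip_pred - mip_target).mean())
```

Because the max operator picks out the brightest voxel along each ray, this loss directly penalizes gaps in high-intensity vessel paths that a per-slice loss can miss.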

Q3: Definitions of variables A3: We will provide more details for the variables in the final version.

Q4: Experiment detail A4: As described in Sec. 3.1, we use a gating mechanism to integrate multimodal data. For DDPM, we concatenate the integrated data as a condition and input it with x_t.

For R2:
Q1: Performance improvement A1: In clinical applications, MIP views are commonly used for diagnosing vascular diseases. Our MIP images, whether in terms of quantitative metrics or qualitative image results, far outperform other models.

Q2: Scalability and stability A2: Our method can be extended by adding more BBDM modules. Regarding training stability, our experience shows that other loss function balancing strategies in multi-objective problems (such as direct summation, setting a primary loss and aligning other losses with it) result in slower convergence and less stability compared to our strategy.

Q3: Meaning of BBDM A3: The use of multiple BBDM modules aims to fully integrate the information from multimodal data; other methods do not perform as well.

Q4: Large number of modalities’ challenge A4: As the number of modalities increases, the complexity of balancing the loss function correspondingly rises. This can lead to a swift decline in the coefficient of the second term, which in turn magnifies the influence of data bias on the coefficient, thereby making the model more challenging to fine-tune. In most practical scenarios, we typically deal with no more than four modalities. Therefore, the complexity and impact remain within acceptable limits.

Q5: Fusion advantage A5: The ablation study confirms that the generation results of three modalities outperform those of two modalities. Our experiments also demonstrate that the results of independent predictions from three modalities are relatively poorer.

For R3: Q1: Shared encoder A1: The application scenario in this study is different from AMM-Diff, which performs poorly when multiple modalities are missing. The other comparison methods, which concatenate multi-channel inputs, have already shown that models relying solely on a single multi-channel input tend to underperform.

Q2: SPOCKMIP A2: We will cite it in the final paper.

Q3: Computational cost A3: By flexibly adjusting the model depth and employing more efficient parallel computation strategies, our model does not have a disadvantage in runtime compared to other diffusion models.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


