Abstract

Deep learning models have emerged as a powerful tool for various medical applications. However, their success depends on large, high-quality datasets that are challenging to obtain due to privacy concerns and costly annotation. Generative models, such as diffusion models, offer a potential solution by synthesizing medical images, but their practical adoption is hindered by long inference times. In this paper, we propose the use of an optimal transport flow matching approach to accelerate image generation. By introducing a straighter mapping between the source and target distribution, our method significantly reduces inference time while preserving and further enhancing the quality of the outputs. Furthermore, this approach is highly adaptable, supporting various medical imaging modalities, conditioning mechanisms (such as class labels and masks), and different spatial dimensions, including 2D and 3D. Beyond image generation, it can also be applied to related tasks such as image enhancement. Our results demonstrate the efficiency and versatility of this framework, making it a promising advancement for medical imaging applications. Code is available at: https://github.com/milad1378yz/MOTFM
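
The "straighter mapping" mentioned in the abstract is commonly formalized as below. This is the standard optimal transport conditional flow matching setup from the general literature, given as an assumption about what the abstract refers to; the symbols x_0 (noise sample), x_1 (data sample), and v_\theta (learned velocity network) are illustrative, and the paper's exact notation and formulation may differ.

```latex
% Straight-line interpolation between a noise sample and a data sample
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

% Regression of the velocity network onto the constant path velocity
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```

Because the regression target x_1 - x_0 is constant along each path, an ideal model would move every sample along a straight line, which is the usual argument for why few integration steps may suffice at inference time.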

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1056_paper.pdf

SharedIt Link: https://rdcu.be/eHxeh

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05325-1_21

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/milad1378yz/MOTFM

Link to the Dataset(s)

https://github.com/milad1378yz/MOTFM

BibTex

@InProceedings{YazMil_Flow_MICCAI2025,
        author = { Yazdani, Milad AND Medghalchi, Yasamin AND Ashrafian, Pooria AND Hacihaliloglu, Ilker AND Shahriari, Dena},
        title = { { Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {216 -- 226}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes a novel method titled Medical Optimal Transport Flow Matching (MOTFM), which adapts the Optimal Transport Flow Matching (OTFM) to the medical imaging domain. Unlike stochastic sampling approaches like DDPM, OTFM employs a deterministic, nearly optimal mapping from pure Gaussian noise to the data distribution, which is argued to improve image quality. Additionally, the authors introduce a dual-UNet architecture to incorporate class and mask conditions, claiming superior performance over conditioning methods such as SPADE and ControlNet. The method is evaluated on both echocardiographic and brain MRI datasets, demonstrating its applicability across different generative settings: unconditional, class-conditional, and mask-conditional.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is clearly written and well-organized, making it accessible to readers.
- It provides a detailed analysis of the results, with both qualitative and quantitative evaluations.
- The methodology is conceptually straightforward and easy to follow.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Limited novelty: While the adaptation of OTFM to medical imaging is novel in terms of application, the core advantages of OTFM over DDPM have already been explored in previous works. The paper does not sufficiently argue why flow matching is particularly beneficial for medical images or how it addresses challenges unique to this domain, such as data scarcity, high precision requirements, or domain-specific priors. Without such analysis, the application appears as a straightforward extension rather than a principled innovation.
2. Unfair or Incomplete Comparison to SPADE: The comparison between the proposed dual-UNet architecture and SPADE is potentially flawed. It is unclear whether SPADE was used in its original form (as a spatially adaptive normalization layer within a CNN) or as part of a GAN framework. The core contribution of SPADE is a CNN module, which can be used quite effectively within other generative models, e.g., in [1]. Moreover, training a lightweight SPADE-based model might be more efficient than a dual-UNet setup. A fair comparison would require applying both conditioning strategies under the same backbone and sampling procedure (e.g., within DDPM or OTFM), which is not clearly demonstrated.
Similar issues arise in the comparison with ControlNet. The evaluation lacks a controlled setup where the only variable is the conditioning mechanism. To properly support the claim of improved conditioning, the authors should perform two orthogonal comparisons:

(1) Evaluate MOTFM vs. DDPM under the same conditioning strategy (e.g., concatenation-based or attention conditioning).

(2) Evaluate different conditioning methods (e.g., dual-UNet vs. SPADE vs. ControlNet) under the same generative backbone (e.g., MOTFM or DDPM). Without these decoupled evaluations, it is difficult to attribute performance gains to either the generative model or the conditioning scheme.
[1] Kim, Jonghun, and Hyunjin Park. “Adaptive latent diffusion model for 3d medical image to image translation: Multi-modal magnetic resonance imaging study.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Validating a well-known method in medical imaging is interesting, but the lack of novelty and the unfair comparison hinder the paper significantly.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper presents a generative framework for medical image synthesis using optimal transport flow matching, which improves inference speed compared to diffusion models. It supports versatile conditioning strategies (unconditional, class-conditional, and mask-conditional) for generation, enabling flexibility across different generative tasks. The paper demonstrates the applicability of the method to 2D (ultrasound) and 3D (MRI) medical imaging. Comparative results show that the proposed method outperforms or matches diffusion baselines such as DDPM and ControlNet, as well as conditional GANs such as SPADE, in image quality, inference time, and downstream task performance.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) Novel application: The proposed work is one of the first to introduce optimal transport flow matching for medical image synthesis, offering a novel generative approach that is faster than diffusion models and has a simpler training objective than GANs. This approach also has the potential for better mode coverage and improved diversity, and it scales better to high dimensions (such as 3D medical data).
2) The method significantly reduces the number of inference steps compared to standard diffusion models, addressing a common bottleneck in generative modeling.
3) The framework supports multiple conditioning strategies, which in turn makes it adaptable to various image generation methods and to a wide range of data (prior) or label availability for synthesis.
4) The framework is applied to both 2D and 3D data, making it generalizable across modalities and dimensionalities.
5) The paper includes quite a comprehensive quantitative and qualitative evaluation against several baselines with multiple evaluation metrics, as well as applications to downstream classification and segmentation tasks.
6) While not a primary focus, the authors show that the method is broadly applicable through the task of speckle noise removal in ultrasound, demonstrating potential beyond image synthesis.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Weaknesses:
1) The authors state that “flow matching directly maximized the likelihood”; however, this is only partially true: it holds for some flow matching models, while other models minimize a distance. The authors should be more precise about which flavour of flow matching they are referencing.
2) While the benefits of faster inference are highlighted, the paper does not clearly articulate why this is particularly important for medical imaging. For example, how does reduced inference time support clinical integration or deployment?
3) The paper does not mention known limitations of flow matching approaches, such as lower sample diversity compared to diffusion models in some contexts, or the sensitivity to conditioning noise and training stability. Can the authors comment on whether they observed reduced sample diversity and how they mitigated it, and can a brief limitations paragraph be included to contextualize any trade-offs compared to GANs or diffusion models?
4) The paper introduces a separate UNet for mask encoding but provides no justification or architectural details. It’s unclear if the architecture matches the main network, if weights are shared, or how spatial alignment is managed. Additionally, the term “zero convolutions” is unclear: do the authors mean layers initialized to zero?
5) Some statements such as “recover X1 from X0 in one step” assume ideal conditions (linear optimal transport paths, perfect velocity estimation), which are not always the case in practice. While later acknowledged, these oversimplifications could mislead readers, and the assumptions should be stated more directly.
6) When extending to 3D synthesis, the only adaptation mentioned is replacing 2D layers with 3D layers. There’s no information about model depth, memory handling, or architectural changes needed for high-quality 3D generation.
7) It’s unclear whether all baselines (e.g., ControlNet, SPADE, DDPM) were trained under equal conditions or simply used off-the-shelf. This affects the fairness and interpretability of comparative results. Moreover, what inception-like model is used for 3D-FID: was it a model pretrained on 3D medical data, or something else?
8) Can the authors comment on whether the proposed model can generate plausible outputs for anatomy not well represented in the training data?
9) Since MSD contains scans from 19 centers, did the authors examine domain generalization to different sites and scanners?
10) Have the authors looked into how training on synthetic + real data compares to only real data or only synthetic data for downstream tasks?
11) The class-conditioning method is vaguely described. It’s mentioned to involve cross-attention, but no details are given.
12) While MOTFM is faster, it’s not discussed whether additional steps beyond 1 or 10 yield diminishing returns in quality. Also, it’s not clear how deterministic the outputs are, which is important for reproducibility. How stable are the outputs given the same or slightly varied inputs, especially in the mask-conditioning setting?
13) How does the model behave with incomplete or corrupted conditioning inputs, such as partial masks? In addition, can the authors comment on any common failure cases, such as low-quality generated images or artifacts?
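
Points 5) and 12) above can be illustrated with a toy one-dimensional example. The sketch below is not the authors' sampler: it assumes plain fixed-step Euler integration, a single hypothetical target value X1, and a hypothetical constant BIAS standing in for velocity-estimation error. With the exact straight-line field, one Euler step lands on the target; once the field is imperfect, additional steps reduce the final error.

```python
def euler_sample(velocity, x_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x_start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # largest t visited is (num_steps - 1) / num_steps < 1
        x = x + dt * velocity(x, t)
    return x


X1 = 3.0    # hypothetical target sample (illustration only)
BIAS = 0.5  # hypothetical constant error in the estimated velocity


def ideal_velocity(x, t):
    # Exact field for a straight path toward X1: equals X1 - x_start along the path.
    return (X1 - x) / (1.0 - t)


def biased_velocity(x, t):
    # Ideal field plus a constant error, standing in for an imperfect network.
    return ideal_velocity(x, t) + BIAS


x_start = -1.0  # plays the role of the noise sample X0
one_step_ideal = euler_sample(ideal_velocity, x_start, 1)
err_1 = abs(euler_sample(biased_velocity, x_start, 1) - X1)
err_10 = abs(euler_sample(biased_velocity, x_start, 10) - X1)
print(one_step_ideal, err_1, err_10)
```

In this toy setup the one-step error equals the bias, and ten steps shrink it by roughly a factor of ten. Real learned velocity fields are not this simple, so the numbers only illustrate the qualitative trade-off the reviewer raises.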
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a novel application of flow matching generation for medical image synthesis with a moderately strong empirical evaluation and a promise of versatility - generalizing across modalities with support for diverse conditioning schemes and downstream applications. However, there are some technical inaccuracies and a lack of clarity when it comes to some methodological choices. The authors also do not discuss any limitations.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

This paper presents a valuable and well-executed domain-specific contribution, while the existing weaknesses are addressable in future work and not fundamental. The proposed work is a first-time application of OTFM to medical imaging, showcasing clear benefits for faster inference, anatomical fidelity, and training scalability. Moreover, the authors addressed nearly all major points in a constructive manner: they clarified the architecture and the training pipeline, answered the questions on inference time, agreed to revise oversimplified claims, and acknowledged limitations and robustness issues, with plans to expand and comment on these in the camera-ready version of the paper. While some ablation studies are not performed (conditioning vs. backbone disentanglement), key trends remain consistent, and model performance is retained under fair settings. The authors are encouraged to address any limitations and include some comments on failure case analysis in the final version of the paper.

Review #3

Please describe the contribution of the paper

The authors present a Flow Matching generative model for generating echocardiography and brain MRI images. By leveraging Flow Matching, they achieve good generative metrics and beat previous SOTA methods on a downstream segmentation task.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

State-of-the-art approach (Flow Matching). Good performance metrics. Method evaluated on 2 different datasets. Code and trained models will be made public.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Major oversight in the related works regarding the field of echocardiogram generation, especially [a] which covers the same motivation and similar methods, while extending the modelling process to videos (but the paper also covers image generation).

[a] Reynaud H, Meng Q, Dombrowski M, Ghosh A, Day T, Gomez A, Leeson P, Kainz B. Echonet-synthetic: Privacy-preserving video generation for safe medical data sharing. In International Conference on Medical Image Computing and Computer-Assisted Intervention 2024 Oct 3 (pp. 285-295). Cham: Springer Nature Switzerland.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

The series of works [a, b, c] appear highly relevant and should be incorporated into the related works section. These are peer-reviewed contributions that offer context and technical parallels more appropriate than some currently cited non-peer-reviewed sources (e.g., [2]). In particular, [a] seems especially pertinent to the present submission and warrants direct discussion.

The omission of these works represents a significant gap in the paper’s positioning within the existing literature. Some of the claims may need to be revised once the authors are more familiar with this body of work. I will reconsider my evaluation accordingly.

Additionally, the following sentence requires clarification: “For efficiency, we used the DDIM scheduler [15] during DDPM sampling.” It is unclear whether this refers to replacing the DDPM sampler entirely or only for inference, as opposed to training.

Finally, Table 2 shows that the best results do not consistently correspond to the same model configuration. This variability is neither acknowledged nor explained, and the paper would benefit from a brief discussion or justification.

[a] Reynaud H, Meng Q, Dombrowski M, Ghosh A, Day T, Gomez A, Leeson P, Kainz B. Echonet-synthetic: Privacy-preserving video generation for safe medical data sharing. In International Conference on Medical Image Computing and Computer-Assisted Intervention 2024 Oct 3 (pp. 285-295). Cham: Springer Nature Switzerland.
[b] Reynaud H, Qiao M, Dombrowski M, Day T, Razavi R, Gomez A, Leeson P, Kainz B. Feature-conditioned cascaded video diffusion models for precise echocardiogram synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention 2023 Oct 1 (pp. 142-152). Cham: Springer Nature Switzerland.
[c] Reynaud H, Vlontzos A, Dombrowski M, Gilligan Lee C, Beqiri A, Leeson P, Kainz B. D’artagnan: Counterfactual video generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 2022 Sep 16 (pp. 599-609). Cham: Springer Nature Switzerland.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is sound, and the experiments are (1) well documented and (2) demonstrate good performance. However, the authors do not compare their work with several highly relevant prior studies. While the application is interesting, the methods themselves are not novel per se, only their use in this particular context.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We appreciate the thoughtful, positive feedback. We have carefully considered and addressed the comments.

Common comment on novelty and inference time (R1.2 & R3.1): While OTFM has been explored before, our work presents a principled and domain-specific first-time adaptation to medical imaging, where inference time, anatomical fidelity, and training scalability are critical. Unlike prior uses in natural images, we demonstrate that fast, high-quality synthesis enables scalable generation of diverse medical data, which is essential for addressing data scarcity and supporting robust AI model development. Medical imaging requires high spatial precision and domain alignment; we show that MOTFM provides more anatomically coherent outputs, especially in conditional tasks. These improvements are not just computational; they directly address medical imaging challenges that DDPMs struggle with due to their slow, unstable sampling. Our framework is modality-agnostic, supports 2D/3D, and eliminates post-processing, making it deployable in real-world pipelines. DDPMs have also been applied to medical image denoising tasks; in Fig. 4, we show that our method achieves this in 10 steps or fewer, highlighting a use case in point-of-care ultrasound where fast inference is especially valuable. Moreover, fast generation is critical for interactive model development, simulation-based training, and future applications such as on-device or near-real-time enhancement tools in decentralized or point-of-care settings. In these workflows, reduced latency supports smoother integration into clinical systems, particularly in environments with limited computational resources or where responsiveness is essential. We will revise the manuscript to emphasize these points.

R1 1, 3, 5: Writing, references, and limitations can be revised. 4: The second U-Net has the same architecture and is trained end-to-end with separate weights. Zero convolution refers to a zero-initialized layer. 6: In the 3D version, we use 3D kernels, maintain U-Net depth and width (32-512 channels), add residual blocks, and use 8 cross-attention transformer layers with flash attention, all trainable on a single GPU (RTX 4090, 24 GB). 7: All baselines were retrained under the same conditions. For 3D-FID, we use a 3D ResNet pretrained on natural data. 8, 13: Performance can be limited by incomplete masks; future work will address partial inputs. Our method remains stable under noisy masks. 9: Domain generalization will be studied in the future. 10: Adding synthetic data improved accuracy for both DDPM and MOTFM, with greater improvement for MOTFM. This was not included due to the page limit. 11: Class labels (one-hot) are incorporated via cross-attention. 12: Beyond 10 sampling steps, quality plateaus (Table 1/Fig. 2). All images are sampled from Gaussian noise, introducing natural diversity. Results vary slightly across samples but remain aligned with the conditioning mask.
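
The "zero-initialized layer" in answer 4 can be pictured with a minimal sketch. The rebuttal only states that zero convolution means a zero-initialized layer and that the second U-Net has separate weights; the additive fusion below, the per-pixel 1x1 projection, and the names make_zero_conv, apply_1x1, and fuse are all assumptions made for illustration (in the spirit of ControlNet-style conditioning), not the authors' implementation.

```python
def make_zero_conv(channels):
    """Hypothetical 1x1 projection: a channels x channels weight matrix and a bias, all zeros."""
    weight = [[0.0] * channels for _ in range(channels)]
    bias = [0.0] * channels
    return weight, bias


def apply_1x1(layer, features):
    """Apply the 1x1 projection to the feature vector of a single pixel."""
    weight, bias = layer
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight, bias)]


def fuse(main_features, mask_features, zero_conv):
    """Assumed fusion rule: add projected mask-branch features to main-branch features."""
    projected = apply_1x1(zero_conv, mask_features)
    return [m + p for m, p in zip(main_features, projected)]


layer = make_zero_conv(3)
main = [1.0, 2.0, 3.0]   # features from the main U-Net at one pixel
mask = [9.0, 9.0, 9.0]   # features from the mask U-Net at the same pixel

at_init = fuse(main, mask, layer)        # mask branch has no effect yet
layer[0][0][0] = 0.5                     # pretend one weight moved during training
after_update = fuse(main, mask, layer)   # mask branch now shifts channel 0
print(at_init, after_update)
```

The design rationale usually given for zero initialization is that the conditioning branch starts as a no-op, so training begins from the behavior of the unconditioned backbone and the mask signal is blended in gradually as the weights move away from zero.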

R2 The suggested references will be added and discussed. The DDIM scheduler is used only for sampling at inference time, while noise is added during training using the DDPM scheduler. Table 2 results are consistent, with MOTFM outperforming DDPM in all scenarios.
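
The split described above, DDPM noising during training and DDIM sampling at inference, can be sketched as follows. This is a generic illustration rather than the authors' code: the linear beta schedule (1e-4 to 0.02 over 1000 steps) is the common default from the DDPM literature and is an assumption here, the deterministic (eta = 0) DDIM update is used, and oracle_eps_model is a hypothetical stand-in for a trained noise-prediction network.

```python
import math


def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def ddpm_add_noise(x0, eps, alpha_bar_t):
    """Training-time forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic (eta = 0) DDIM update from timestep t to an earlier timestep."""
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


def ddim_sample(x_T, eps_model, alpha_bars, num_inference_steps):
    """Inference-time sampling that visits only a strided subset of training timesteps."""
    total = len(alpha_bars)
    stride = total // num_inference_steps
    timesteps = list(range(total - 1, -1, -stride))
    x = x_T
    for i, t in enumerate(timesteps):
        # After the last visited timestep, jump to the clean sample (alpha_bar = 1).
        prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], prev)
    return x


def oracle_eps_model(x0, alpha_bars):
    """Hypothetical stand-in for a trained network: returns the exact noise for a known x0."""
    def eps_model(x_t, t):
        a = alpha_bars[t]
        return (x_t - math.sqrt(a) * x0) / math.sqrt(1.0 - a)
    return eps_model


alpha_bars = make_alpha_bars()
x0_true, eps_true = 0.7, 1.3
x_T = ddpm_add_noise(x0_true, eps_true, alpha_bars[-1])  # noising as done in training
x0_rec = ddim_sample(x_T, oracle_eps_model(x0_true, alpha_bars), alpha_bars,
                     num_inference_steps=10)             # 10 steps instead of 1000
print(abs(x0_rec - x0_true) < 1e-6)
```

The point of the sketch is that the training objective and noise schedule are untouched; only the inference loop changes, which is why a DDIM sampler can be attached to a model trained as a DDPM.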

R3

Ref. [1] integrates SPADE into a DDPM framework. In our work, SPADE is used in its original GAN-based form as a baseline. While SPADE can be integrated into DDPMs, our primary goal is not to benchmark conditioning methods but to evaluate the impact of switching from diffusion to flow matching while holding conditioning constant. Dual UNet is a modular component and could be replaced by SPADE or other conditioning approaches.

In Table 1, we compare DDPM and MOTFM using the same backbone and cross-attention conditioning to isolate the effect of the generative process. For point (2), evaluation of all conditioning methods (e.g., SPADE vs. dual-UNet) under both MOTFM and DDPM can be future work.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Reject
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The paper proposes MOTFM as a faster alternative to diffusion-based medical image synthesis, but it does not provide sufficient empirical support for its core claim of improved sampling efficiency. The comparisons rely on DDPM with DDIM sampling, yet omit more competitive fast diffusion baselines such as DPM-Solver or Consistency Models, which are specifically designed for high-quality generation in very few steps. This omission, along with reviewer-raised concerns about novelty, conditioning design, and missing prior work, limits the strength of the contribution. As a result, the paper does not convincingly demonstrate superiority over existing fast generative approaches.

Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality

Author(s):