Abstract
Current deep learning approaches for medical image synthesis require training multiple specialized models for different modality conversions, leading to inefficient parameter utilization.
In this work, we propose a unified text-conditioned latent diffusion framework that achieves one-to-many medical image synthesis through two key innovations:
(1) A shared latent space is constructed from pre-trained modality-specific encoders and combined with text-guided dynamic gating, reducing model parameters compared with training several separate models.
(2) An adaptive hybrid frequency processor combining wavelet decomposition and Fourier analysis is designed to preserve both local textures and global anatomical structures.
Our comprehensive experimental evaluation on various datasets validates that this framework can transform a single medical imaging modality into multiple target modalities using only one model, surpassing existing methods based on Generative Adversarial Networks and diffusion models in terms of generation quality.
The success of this work establishes a new paradigm for efficient multi-modal medical image synthesis through latent space unification and frequency-aware diffusion, significantly advancing the practicality of virtual medical image generation systems.
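To give a concrete sense of the text-guided dynamic gating mentioned in the abstract, below is a minimal illustrative sketch assuming a PyTorch/Hugging Face setup. The class name `TextGatedFusion`, the choice of `bert-base-uncased`, and the channel-wise sigmoid gate are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch: a frozen BERT embeds a modality label (e.g., "PET"), and the
# embedding is turned into a channel-wise gate over latent features.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TextGatedFusion(nn.Module):
    def __init__(self, latent_channels, bert_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name).eval()
        for p in self.bert.parameters():      # keep the text encoder frozen
            p.requires_grad_(False)
        self.gate = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, latent_channels),
            nn.Sigmoid(),
        )

    def forward(self, latent, prompts):
        # latent: (B, C, H, W); prompts: list of modality labels, e.g. ["PET", "FLAIR"]
        tokens = self.tokenizer(prompts, return_tensors="pt", padding=True)
        text_emb = self.bert(**tokens).pooler_output      # (B, hidden)
        gate = self.gate(text_emb)[:, :, None, None]      # (B, C, 1, 1)
        return latent * gate                              # channel-wise gating


# Example usage:
# m = TextGatedFusion(latent_channels=4)
# out = m(torch.randn(2, 4, 64, 64), ["PET", "FLAIR"])   # out: (2, 4, 64, 64)
```

Because the gate depends only on the text embedding, the diffusion U-Net itself stays independent of the number of target modalities, which is the parameter-efficiency argument the abstract makes.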
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1178_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/zyj15416/One-to-Many-Medical-Image-Synthesis
Link to the Dataset(s)
CiM Dataset: https://doi.org/10.1186/s13550-021-00830-6
BraTS 2019 Dataset: https://www.med.upenn.edu/cbica/brats-2019/
BibTex
@InProceedings{ZhaYou_HighFidelity_MICCAI2025,
author = { Zhang, Youjian and Huang, Jian and Wang, Jie and Li, Zezhou and Wang, Zhongya and Zhou, Guanqun and Zhang, Zhicheng and Yu, Gang},
title = { { High-Fidelity Unified One-to-Many Medical Image Synthesis via Text-Conditioned Latent Diffusion } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15975},
month = {September},
pages = {258 -- 268}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a unified text-conditioned latent diffusion framework for one-to-many medical image synthesis. The framework leverages text-guided dynamic gating and pre-trained modality-specific encoders to construct a shared latent space. Additionally, an adaptive hybrid frequency processor, which combines wavelet decomposition and Fourier analysis, is introduced to simultaneously preserve local textures and global anatomical structures during image synthesis.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The most notable innovation of the paper lies in the design of the adaptive hybrid frequency processor, which effectively combines wavelet decomposition and Fourier transform to retain both local textures and global structural information. This presents an effective solution for capturing multi-scale features in medical imaging.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The text-guidance mechanism in this paper is relatively weak. It relies solely on simple modality labels as text inputs, which lack semantic richness and do not represent meaningful natural language conditioning. While the idea of text-guided generation is promising, the current implementation does not fully explore or demonstrate the potential of semantic conditioning in a novel way. The paper reports improvements in image synthesis metrics, but does not include statistical significance testing. As a result, it is unclear whether the observed gains are statistically meaningful or fall within the margin of error.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the proposed frequency fusion mechanism is technically sound and contributes to performance improvements, the overall novelty of the paper is limited—particularly regarding the text-conditioning strategy, which is underdeveloped and lacks semantic depth. Furthermore, the absence of statistical significance testing weakens the experimental rigor. Therefore, despite encouraging quantitative results, I recommend rejection of the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my questions and concerns.
Review #2
- Please describe the contribution of the paper
This paper offers a diffusion-model-based one-to-many image translation approach, employing text conditioning and adaptive frequency processing.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- To my knowledge, the paper offers a novel approach by combining text conditioning and frequency processing in the latent space. This is methodologically interesting and enables image translation to multiple target domains simultaneously.
- The evaluation of the introduced novel components using an ablation study is done well and helps understand the importance of different parts of the method, such as the BERT encoding.
- The performance of the method is compared to multiple approaches and a thorough evaluation is carried out (quantitative on one dataset, qualitative on two).
- The paper is well-written. Code is available online.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The related work section falls very short. In this part the authors essentially only discuss diffusion models versus other generative approaches. However, I would argue that using a diffusion model is not the main contribution of the work. A survey of other frequency-analysis-based approaches or of text conditioning for image translation would help to assess the level of novelty of the work.
- Some details are not stated very clearly: it appears that multiple encoder-decoder structures are trained, one for each modality, rather than a single one for all modalities (this seems to be the case in Fig. 1). Also, why is the WT applied to E0 and the FFT to E1? What is the intuition behind this? Overall, more details on the frequency analysis are needed. The authors also claim substantial novelty for their one-to-many approach; however, many one-to-many approaches already exist. Finally, is the method only applicable to 2D slices of the 3D images?
- Evaluation: I wonder why no quantitative results are given for the BraTS dataset, even though the synthesized images look good qualitatively. Also, using Pix2Pix is a rather unfair comparison for a GAN-based approach, since the authors use a VQ-GAN in their latent architecture themselves. Comparing to a plain VQ-GAN might be fairer.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper offers a well-evaluated and well-performing one-to-many translation approach, introducing a good amount of novelty. The mentioned weaknesses should, however, be addressed in the rebuttal.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I was already leaning towards accept before the rebuttal. Many of my questions were addressed in the rebuttal.
Review #3
- Please describe the contribution of the paper
This paper proposes a unified diffusion-based model to synthesize multiple medical imaging modalities (MRI, PET, etc.) from a single modality via a shared latent space and frequency-aware processing. It uses text-conditioned prompts for flexible image synthesis.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel application: This paper proposes to use a unified diffusion-based model to synthesize multiple medical imaging modalities from a single source modality.
- Adaptive frequency fusion module: The paper combines the conditional and noisy images in a novel way, processing them separately in the frequency domain and fusing them with an attention mechanism.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not mentioned which encoder is used to map CT to the latent space.
- The modalities used in the experiments, i.e., the model inputs and outputs, are not clearly described before the experimental results are presented.
- There are three critical components in the adaptive frequency fusion module: the FFT, the WT, and the attention mechanism. The ablation study currently only shows the combined result of removing all of them; it would be better to have an ablation for each of them.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The method has novelty and the experimental results are good. Minor clarity issues exist but are easily correctable.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
There is still an inconsistency in the manuscript, and a newly identified concern:
- The manuscript states that “modality-specific VQ-GAN models were pre-trained, except for the CT modality, which required only encoding without reconstruction.” However, in the rebuttal, the authors clarify that the CT encoder was also pre-trained and used in a frozen state. This contradicts the original description and introduces ambiguity. A precise clarification of this point is essential to support reproducibility.
- A newly identified concern arose upon further reading. The manuscript states that all image data underwent Z-score standardization, which typically scales values to a small range (e.g., -3 to 3). However, the reported Mean Absolute Error (MAE) values are relatively high: as high as 19.8 in the worst case and still 5.40 in the best case. These values seem inconsistent with the expected scale of normalized data. I hope the authors can clarify this aspect in a future revision.
Author Feedback
We express our sincere appreciation for all comments and questions from the reviewers.

Q1: More details (R2, R3). 1) For each target modality (T), our model uses an independent frozen VQ-GAN with pre-trained weights. 2) For the source modality (S), only its frozen encoder with pre-trained weights is used. 3) The experimental modalities are: CiM (S=CT, T={T1, FLAIR, PET}) and BraTS (S=T1, T={T1CE, FLAIR, T2}). 4) Further details will be clarified in the revised version.

Q2: Rationale for the choice of WT and FFT (R2, R3). We appreciate the reviewers' attention to our frequency mechanism and agree that ablation of individual components is valuable. 1) Inspired by prior work (WFTNet, ICASSP 2024), we apply the Wavelet Transform (WT) to the noisy input (E0) to preserve local details and the Fast Fourier Transform (FFT) to the conditional input (E1) to capture global structure (a minimal illustrative sketch follows this feedback). 2) Detailed ablation studies on each component will be included in the revised version.

Q3: Quantitative evaluation (R2, R4). 1) Due to space limitations, we have already reported basic quantitative metrics (MAE/SSIM) for the BraTS dataset below the respective modality images in Fig. 3. 2) We have performed Wilcoxon signed-rank tests with Holm-Bonferroni correction; our method achieves statistically significant improvements over the comparison methods on both datasets (p < 0.05). Detailed metrics and statistical tests will be included in the supplementary material.

Q4: Text guidance mechanism (R4). We appreciate the reviewer's insightful comments on our text-conditioned strategy. 1) Although the current work uses concise modality labels as text guidance, this is intended to validate the feasibility of a highly extensible text-to-feature framework rather than to define its semantic boundaries. 2) Ablation studies (Table 2) show that BERT greatly improves one-to-many generation compared with simpler encodings. 3) Our text-conditioning mechanism decouples the U-Net's computational graph from the complexity of the input text; when BERT is used to parse richer medical descriptions (e.g., lesions, anatomical sites), the U-Net can seamlessly integrate this enhanced semantic guidance, demonstrating its inherent scalability to more complex, text-driven synthesis tasks.

Q5: Novelty of the proposed one-to-many approach (R2). 1) Existing one-to-many approaches often rely on multi-head decoders (e.g., StarGAN-v2) or modality-specific layers (e.g., ALDM, WACV 2024), so their parameter count scales with the number of modalities. 2) In contrast, by leveraging a VQ latent space and a text-gating mechanism, we introduce no structural redundancy into the diffusion U-Net, so the parameter count of our core diffusion-based generative model remains constant regardless of the number of target modalities.

Q6: 2D/3D applicability (R2). 1) Due to limited computing resources, our method is currently applied to 2D images. 2) However, the framework's dimension-agnostic design allows direct extension to 3D volumetric synthesis.

Q7: Fairness of the Pix2Pix comparison (R2). We appreciate the reviewer's comments on the VQ-GAN usage and the Pix2Pix comparison. 1) Our architecture employs frozen VQ-GAN components solely for latent-space representation; the conditional translation itself is performed by our diffusion model, which is distinct from GAN-based methods such as Pix2Pix. 2) A standard VQ-GAN, without modification, is not inherently suited to direct conditional image-to-image translation. 3) We chose Pix2Pix as a comparison method because it is a GAN-based method designed for conditional translation, providing a fair and relevant benchmark.

Q8: Related work (R2). We appreciate the reviewer's comments and will supplement the related work section with a discussion of frequency-analysis-based and text-conditioned image translation in the revised version.
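To make the rationale in Q2 more concrete, the following is a minimal sketch, assuming a PyTorch latent-diffusion setup, of a hybrid frequency fusion block: a Haar wavelet branch on the noisy latent E0, an FFT branch on the conditional latent E1, and a simple channel-attention fusion. The module name, sub-band handling, and attention design are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the hybrid frequency fusion described in Q2 of the rebuttal.
# Wavelet branch on the noisy latent (local detail), Fourier branch on the
# conditional latent (global structure), fused with channel attention.
import torch
import torch.nn as nn


def haar_dwt2(x):
    """Single-level 2D Haar wavelet transform; returns (LL, LH, HL, HH) sub-bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


class HybridFrequencyFusion(nn.Module):
    """WT on the noisy latent E0, FFT on the conditional latent E1, attention fusion."""

    def __init__(self, channels):
        super().__init__()
        self.wave_proj = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.freq_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, e0, e1):
        # Wavelet branch: stack the four Haar sub-bands of the noisy latent.
        ll, lh, hl, hh = haar_dwt2(e0)
        wave_feat = self.wave_proj(torch.cat([ll, lh, hl, hh], dim=1))
        wave_feat = nn.functional.interpolate(
            wave_feat, size=e0.shape[-2:], mode="bilinear", align_corners=False)

        # Fourier branch: amplitude and phase of the conditional latent.
        spec = torch.fft.fft2(e1, norm="ortho")
        freq_feat = self.freq_proj(torch.cat([spec.abs(), spec.angle()], dim=1))

        # Channel-attention fusion of the two branches.
        fused = torch.cat([wave_feat, freq_feat], dim=1)
        return self.out(fused * self.attn(fused))


# Example usage:
# m = HybridFrequencyFusion(channels=4)
# out = m(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64))  # (2, 4, 64, 64)
```

The split mirrors the stated intuition: wavelet sub-bands localize texture in the noisy branch, while the Fourier spectrum summarizes global anatomy from the conditional branch before the attention weights decide how much each contributes.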
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is a borderline paper. The concerns from Reviewer #3 are not resolved:
- The manuscript states that “modality-specific VQ-GAN models were pre-trained, except for the CT modality, which required only encoding without reconstruction.” However, in the rebuttal, the authors clarify that the CT encoder was also pre-trained and used in a frozen state. This contradicts the original description and introduces ambiguity. A precise clarification of this point is essential to support reproducibility.
- The manuscript states that all image data underwent Z-score standardization, which typically scales values to a small range (e.g., -3 to 3). However, the reported Mean Absolute Error (MAE) values are relatively high: as high as 19.8 in the worst case and still 5.40 in the best case. These values seem inconsistent with the expected scale of normalized data. I hope the authors can clarify this aspect in a future revision.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
I recommend 'accept' considering all reviews.
R3 raised new concerns during the rebuttal. However, I think these concerns can be clarified in the camera-ready version. Specifically, the authors did not explicitly state that the CT encoder is used in a frozen state, and the MAE is probably high because a different intensity scale was used during the evaluation stage.