Abstract
Multi-modal brain imaging, such as MRI, CT, and PET, is widely used in clinical practice and has significantly advanced our understanding of cognition and neurological diseases. However, due to scan time and cost, these imaging modalities are not always available. Cross-modality synthesis can alleviate this issue, but existing synthesis methods are typically task- or modality-specific, leading to performance degradation when applied to heterogeneous real-world imaging data. Here, we propose UniSyn, a unified framework capable of synthesizing target imaging modalities with specified acquisition parameters from any available modality. Specifically, UniSyn first learns robust metadata representations through image-text alignment on large-scale multimodal neuroimaging datasets. A cross-modality synthesis framework then leverages the learned metadata representations to guide the generation of metadata-specified target images. To enhance interpretable, metadata-driven control over image synthesis across diverse imaging modalities, we introduce a dual-parameter arithmetic operation that transforms text features into two key parameters capturing contrast variations and intensity shifts induced by different modalities and subjects, respectively. Extensive experiments on multi-institutional brain imaging datasets demonstrate that UniSyn surpasses existing cross-modality synthesis approaches in synthesis fidelity and accuracy, enabling the generation of missing imaging modalities tailored to specific clinical and research needs.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2865_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WanYul_Unisyn_MICCAI2025,
author = { Wang, Yulin and Xiong, Honglin and Sun, Kaicong and Liu, Jiameng and Lin, Xin and Chen, Ziyi and He, Yuanzhe and Wang, Qian and Shen, Dinggang},
title = { { UniSyn: A Generative Foundation Model for Universal Medical Image Synthesis across MRI, CT and PET } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15962},
month = {September},
pages = {682 -- 692}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of this work lies in the integration of metadata into the image generation process, where a dual-parameter arithmetic operation is employed to convert textual information into two key parameters that are subsequently utilized in the generation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This study incorporates metadata, such as patient information and image descriptions, into the image generation process. By employing a contrastive learning framework, joint training of images and textual information is conducted to extract more discriminative cross-modal features. Furthermore, the textual information is decomposed into two parameters representing contrast variation and intensity shift, which are then integrated into the generation process to enhance the quality and diversity of the synthesized images.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) The paper does not provide a detailed explanation of how the parameters α and β are obtained, nor does it elaborate on the actual significance of these two parameters. Furthermore, it lacks a thorough justification for why the transformation αf + β can effectively convert features into the target modality. 2) The paper does not include ablation studies on the contrastive learning pretraining to validate its effectiveness.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The idea of incorporating metadata into image generation proposed in the paper is novel, and the use of CLIP in the training stage is also commendable. However, during the image generation phase, the paper lacks a detailed explanation of the parameters α and β, making it difficult to understand how metadata positively contributes to the generation process. A more comprehensive clarification from the authors would be appreciated. Overall, while the methodology and perspective are innovative, the paper suffers from insufficient explanation in key areas, thus I recommend a weak accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Thanks for the authors’ reply. I am inclined to accept this paper.
Review #2
- Please describe the contribution of the paper
This paper presents UniSyn, a cross-modality image synthesis framework that is guided by both image features of the source modality and text features of the source and target modalities. The authors propose a two-step training approach in which they (1) employ NeuroCLIP, a contrastive learning-based framework used to jointly train image and text encoders with aligned features, and (2) employ a generative model that encodes the source image and the source and target text prompts, processes them using a dual-parameter arithmetic operation, and decodes the target image. The authors claim that their method is capable of generating a variety of target modality images from any source image, outperforming existing methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is generally well written and quite easy to follow. It additionally tackles a topic relevant to the MICCAI community, namely cross-modality image synthesis, and addresses a generalization issue faced by most previous methods that are trained for specific modality pairs. The main strengths of this paper are:
- The idea of using text features from the text encoder trained in the first stage to (1) remove modality-specific features (contrast) of the source modality to obtain a modality-agnostic representation (content) and (2) add a new contrast using the text features of the target modality is interesting. The proposed dual-parameter arithmetic operation seems to fit that task quite well (as demonstrated in the ablation).
- The presented method seems to outperform the comparing methods in both quantitative and qualitative results.
- The presentation of the method is well supported by really helpful pictures that make the paper easy to understand.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While I generally like the idea presented in this paper, it also has some flaws:
- I really miss some information on the preprocessing of the data used. Simply stating that all images were skull-stripped and registered to the T1w image is not enough. What is the resolution of the images, and do they all have the same resolution? How were the different images normalized?
- I find it difficult to understand the dimensionality of the different representations (image and text encodings). I think it would really help the reader of this paper to (a) include this information in a figure or (b) explicitly state it in the text.
- It remained unclear to me how the text for the target modality image is composed. While I understand that age and sex remain similar, I did not really understand how other parameters such as TR, TE, TI and FA (for MRI) or TV and TC (for CT) are set. While you can simply use the parameters of the target image in a paired evaluation setting, how would these parameters be chosen in a case where this is not available? By a human expert?
- It would be interesting to add an ablation showing whether changing, e.g., the repetition or echo time for the same scanner leads to realistic results. This could also give some insight into how meaningful the learned text encodings are, something I am currently missing information about.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I have some additional comments:
- Even though the authors don’t state to release any code, I think they did a good job of trying to provide enough information about the network architectures used.
- I didn’t really understand the choice of the “Generator Decoder” network architecture. Can you add some rationale for this design choice?
- Making the best scores bold in Tables 1 and 2 would further improve readability.
- In Table 2, the authors report mappings such as CT-PET or T1-PET, but it is not specified which PET images are synthesized (if I understood correctly, the dataset contains AV45-PET and FDG-PET scans). Did you always synthesize the PET scan for which the ground truth was available?
- There is a grammar mistake in Section 3.2. It is “trained from scratch” not “trained from the scratch”.
- Please cite the Adam paper when stating that you use this optimizer.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, I think this is a well-written paper that presents a method that has potential for even more applications than presented. While the novelty of this method is limited, it cleverly combines existing approaches to provide a framework for an interesting and relevant problem, effectively generalizing cross-modality image generation across multiple modalities. While this paper is not yet perfect and could be further improved, I think it could be accepted if some of my concerns could be addressed and some minor information could be added to the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My main concerns have been properly addressed in the rebuttal, and the clarifications can easily be added in a camera-ready version of the manuscript. I’d be happy to see this paper at MICCAI.
Review #3
- Please describe the contribution of the paper
This paper introduces a CLIP-based text-guided cross-modality synthesis framework comprising two key stages: first, training NeuroCLIP on large-scale image-text pairs, and second, training a Generative Foundation Model (GFM) guided by the learned text embeddings to generate anatomically and contrast-consistent target modality images. The proposed model demonstrates superior performance over comparison methods in cross-modality translation tasks between CT, PET, and MRI.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is clearly structured, and the methods are well-described and straightforward to follow.
The method leverages extensive cross-institutional brain imaging datasets, contributing to large-scale cross-modality synthesis.
The dual-parameter arithmetic operation cleverly and effectively guides modality translation, ensuring generation results are consistent with target text metadata while preserving the anatomical structure of source images.
Comprehensive and thorough evaluations on the modality translation tasks (CT, PET, MRI) effectively demonstrate the robustness and validity of the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The description of the test datasets lacks sufficient clarity and detail.
Quantitative comparisons in Tables 1 and 2 do not clearly indicate statistical significance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The color choices for different segments within the same pie chart in Fig. 2 are very similar. Consider using more distinctive colors to enhance visual clarity and differentiation.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed approach provides an effective and well-reasoned solution for text-guided cross-modality synthesis.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have provided a reasonable rebuttal addressing the main concerns, particularly regarding the dataset details and statistical significance of results.
Author Feedback
We thank the reviewers for their insightful feedback. We have considered all suggestions and provide our responses and clarifications below.
Response to R1 and R2
- Core Method: Text Parameters (α, β), Dimensionality, and Contrastive Learning (Responses to R1.1, R2.2, R1.2)
- Acquisition and Significance of α, β: The text prompt is encoded into a 1536D representation; its first half (768D) is α and the second half is β. Source image features f (768D) represent structural content. The transformation αf + β is an affine feature-space transformation: α scales the features to modulate contrast, while β shifts them to adjust intensity. This direct feature manipulation disentangles modality-specific contrast/intensity (via α, β) from anatomical content (f), enabling precise, interpretable conversion of source features to the target modality (see the code sketch below).
- Dimensionality Clarification: The text encoding is 1536D (α, β are 768D each); source image features f are 768D. Figure 1(a) and the methods section illustrate this.
- Necessity of Contrastive Learning: Contrastive learning is crucial for the text encoder to capture imaging parameters. These text representations enable cross-modality synthesis and ensure generated images match specified parameters. Thus, it’s foundational; its effectiveness is shown by the synthesized images’ quality and parameter-specificity. A direct ablation would compromise our core objective.
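Below is a minimal PyTorch-style sketch of the dual-parameter arithmetic operation as described above (the 1536D text embedding split into α and β and applied as an affine transform to the 768D source features). The function name, tensor shapes, and usage are illustrative assumptions, not the authors’ exact implementation.

```python
import torch

def dual_parameter_transform(text_emb: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
    """Affine feature-space transform alpha * f + beta.

    text_emb: (B, 1536) text embedding for the target acquisition; per the
        description above, the first 768 dimensions act as alpha (contrast
        scaling) and the last 768 as beta (intensity shift).
    img_feat: (B, 768) source-image content features f.
    Returns (B, 768) features modulated toward the target modality.
    """
    alpha, beta = text_emb[:, :768], text_emb[:, 768:]
    return alpha * img_feat + beta

# Illustrative usage with random tensors standing in for encoder outputs.
text_emb = torch.randn(4, 1536)
img_feat = torch.randn(4, 768)
out = dual_parameter_transform(text_emb, img_feat)
print(out.shape)  # torch.Size([4, 768])
```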
Response to R2 and R3
- Dataset Processing and Description (Responses to R2.1, R3.1)
- Data Preprocessing:
- Skull stripping was applied to all images.
- Modalities were registered to the subject’s T1w image, matching its original resolution.
- Training intensities were truncated (0.01-99.95 percentiles) and normalized to [0, 1].
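As a purely illustrative aid, here is a minimal NumPy sketch of the intensity normalization step described above (percentile truncation to the 0.01–99.95 range followed by rescaling to [0, 1]); the percentiles follow the description, while the function name and volume shape are assumptions.

```python
import numpy as np

def normalize_intensities(volume: np.ndarray,
                          lo_pct: float = 0.01,
                          hi_pct: float = 99.95) -> np.ndarray:
    """Truncate intensities to the given percentiles and rescale to [0, 1]."""
    lo, hi = np.percentile(volume, [lo_pct, hi_pct])
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-8)

# Example on a random volume standing in for a skull-stripped, registered scan.
vol = np.random.rand(160, 192, 160) * 1000.0
vol_norm = normalize_intensities(vol)
assert 0.0 <= vol_norm.min() and vol_norm.max() <= 1.0
```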
- Test Dataset Clarity: We apologize for the confusion caused by the original subsection titles. The training, validation, and test datasets (7:1:2 split) derive from the GFM and NeuroCLIP datasets (the GFM dataset is a subset of the NeuroCLIP dataset). The original titles, “GFM training dataset” and “NeuroCLIP training dataset”, described the overall datasets used for training, validation, and testing.
- Imaging Parameter Selection and Text Prompt Construction (Responses to R2.3, R2.4)
- Parameter Selection: Imaging parameters in text prompts were chosen with radiologists to select those significantly influencing contrast and commonly adjusted clinically.
- Parameter Setting: For paired evaluations, target ground truth parameters were used. In unpaired applications, experts (e.g., radiologists) can specify parameters as needed.
- Ablation Study on Single Parameter Variation: Our dataset lacks paired images varying only a single imaging parameter. However, our overall evaluation (with simultaneous multi-parameter variations) shows model sensitivity to parameter combinations and ability to synthesize images consistent with text descriptions, indirectly supporting text encoding effectiveness.
- Responses to Other Comments (Responses to R2 and R3’s additional comments, R3.2)
- Generator Decoder Architecture: Chosen for its proven effectiveness in image generation, especially reconstructing details from modulated features.
- Table and Figure Readability: We agree that bolding the best scores in the tables and changing the color choices in Fig. 2 can improve readability. We will adopt these changes in the revised version.
- PET Image Specificity (Table I): For the PET image synthesis tasks, we synthesized FDG-PET images for the ZS dataset and AV45-PET images for the HS dataset, as ground truth was available for these modalities. However, by modifying the textual imaging parameters, our method can generate any desired target modality even in the absence of ground truth.
- Grammar and Citation: We appreciate the correction (“trained from scratch”) and the Adam optimizer citation suggestion.
- Statistical Significance (Tables 1, 2): We understand the importance of statistical significance. Our analysis confirms our method’s statistically significant improvements (p ≤ 0.05) over competitors in key metrics (PSNR, SSIM).
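For readers interested in how such a claim could be checked, the sketch below runs a paired Wilcoxon signed-rank test on hypothetical per-subject PSNR values; the numbers are placeholders for illustration only, not the authors’ reported results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-subject PSNR values (paired: same test subjects for both
# methods). Real values would come from the evaluation behind Tables 1-2.
psnr_unisyn = np.array([29.8, 30.4, 28.9, 31.2, 30.1, 29.5, 30.8, 29.9])
psnr_baseline = np.array([28.6, 29.1, 28.2, 30.0, 29.3, 28.4, 29.7, 28.8])

stat, p_value = wilcoxon(psnr_unisyn, psnr_baseline)
print(f"Wilcoxon statistic = {stat:.3f}, p = {p_value:.4f}")
print("significant at p <= 0.05" if p_value <= 0.05 else "not significant at p <= 0.05")
```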
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Responsible reviewers and a well-written rebuttal; I’d like to accept this work.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A