Abstract

Separating shared and independent features is crucial for multi-phase contrast-enhanced (CE) MRI synthesis. However, existing methods use deep autoencoder generators with low parameter efficiency and lack interpretable training strategies. In this paper, we propose Flip Distribution Alignment Variational Autoencoder (FDA-VAE), a lightweight feature-decoupled VAE model for multi-phase CE MRI synthesis. Our method encodes input and target images into two latent distributions that are symmetric with respect to a standard normal distribution, effectively separating shared and independent features. A Y-shaped bidirectional training strategy further enhances the interpretability of feature separation. Experimental results show that, compared to existing deep autoencoder-based end-to-end synthesis methods, FDA-VAE significantly reduces model parameters and inference time while effectively improving synthesis quality. The source code is publicly available at https://github.com/QianMuXiao/FDA-VAE
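As a reading aid, here is a minimal sketch of the flip-alignment idea described in the abstract, assuming a standard VAE-style encoder that outputs a mean and log-variance for each phase; the function names, loss form, and weights are illustrative assumptions, not the paper's implementation.

    # Illustrative sketch only: loss form and weights are assumptions, not the authors' code.
    import torch
    import torch.nn.functional as F

    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, diag(sigma^2)) || N(0, I) ), averaged over the batch
        return 0.5 * torch.mean(torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1))

    def flip_alignment_loss(mu_a, logvar_a, mu_b, logvar_b, w_kl=1.0, w_flip=1.0):
        # Keep both phase posteriors close to the standard normal prior ...
        kl = kl_to_standard_normal(mu_a, logvar_a) + kl_to_standard_normal(mu_b, logvar_b)
        # ... while encouraging the means to mirror each other (mu_b ~ -mu_a) and the
        # variances to match, so the two latent distributions are symmetric about the
        # origin instead of collapsing onto a single shared representation.
        flip = F.mse_loss(mu_b, -mu_a) + F.mse_loss(logvar_b, logvar_a)
        return w_kl * kl + w_flip * flip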

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1536_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/QianMuXiao/FDA-VAE

Link to the Dataset(s)

LLD-MMRI dataset: https://github.com/LMMMEng/LLD-MMRI-Dataset

BibTex

@InProceedings{KuiXia_Flip_MICCAI2025,
        author = { Kui, Xiaoyan and Xiao, Qianmu and Li, Qinsong and Ji, Zexin and Zhang, Jielin and Zou, Beiji},
        title = { { Flip Distribution Alignment VAE for Multi-Phase MRI Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {215 -- 225}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a flip distribution alignment VAE for multi-phase contrast-enhanced MRI synthesis. The method trains on paired pre-/post-contrast MRIs of the same anatomy. The core idea is to encourage their latent distributions to be symmetric around the origin of the latent space for disentanglement. The paper is overall well-written.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. I commend the authors for placing a strong emphasis on computational efficiency. Deep learning models are becoming increasingly complex and resource-intensive, so it is refreshing to see a method that is both lightweight and well designed. This framework shows that meaningful contributions in medical image synthesis can still be achieved through elegant model design rather than sheer model size. This is particularly important for clinical applications where computational resources and inference time may be limited.

    2. The paper is clearly written and easy to follow. Despite some methodological ambiguities that I noted below, the overall narrative flow is coherent and well-motivated. The clarity of presentation is a strength.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. I find the idea of enforcing latent space symmetry interesting and intriguing. However, the paper does not provide sufficient evidence or in-depth analysis to demonstrate why this form of alignment is necessary or beneficial. The authors argue that, without explicit constraints, the latent distributions of input-output image pairs misalign early in training and eventually collapse into shared representations, suppressing modality-specific features. While this to me is a compelling hypothesis, there are no targeted experiments or visualizations to support it. For example, it would be valuable to show how the latent distributions evolve during training with and without the proposed constraint. As it stands, it’s unclear whether the flip-alignment strategy truly resolves the issue or simply functions as a heuristic regularizer.

    2. Figure 5(b) also raises concerns. The overlap between the pre-contrast and CV phase distributions suggests that the model may not be effectively separating shared and independent features as intended. This casts doubt on whether the training converges as expected and whether the model achieves the feature disentanglement and symmetry that is central to its motivation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. A central motivation of the paper is that enforcing symmetry in the latent space improves feature alignment and disentanglement. However, this hypothesis remains unvalidated to me. I strongly encourage the authors to perform an in-depth analysis of the latent space (or provide further insights) to better support their claims. For example, visualizing the latent distributions using t-SNE or PCA, comparing the proposed method to a standard VAE, would help illustrate whether the desired structural properties (e.g., symmetry, separation of shared and unique features) are actually achieved during training (see the sketch after this list).

    2. The paper reports improvements in synthesis quality only in terms of mean values for each metric. The results would be strengthened by reporting standard deviations and applying statistical tests, e.g., paired t-tests or Wilcoxon signed-rank tests (see the example after this list).

    3. To my mind, the most interesting property of the learned latent space is the magnitude and sign of the latent means. Given the current setup, does the absolute value of the latent mean encode shared features, while the sign captures contrast-specific differences? This is an intriguing formulation, but it also raises questions. For example, if the pre- and post-contrast images are similar, both latent representations will tend toward the origin. In that case, does the distance from the origin also carry meaningful information? Could the magnitude of deviation encode something about contrast intensity or dynamic range? These are important conceptual questions that, if investigated, could deepen our understanding of how the model encodes shared and unique information. While this may fall beyond the current scope, it could form a valuable direction for future work.
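    Minimal sketches of the analyses suggested in points 1 and 2 above: a latent-space projection to check symmetry, and paired significance tests on per-case metrics. The encoder means and PSNR arrays below are random stand-ins (illustrative assumptions); only the analysis calls themselves are the point.

        import numpy as np
        from sklearn.decomposition import PCA
        import matplotlib.pyplot as plt
        from scipy import stats

        # (1) Check whether the latent means of the two phases are roughly symmetric
        #     about the origin by projecting both sets with a PCA fit on their union.
        mu_pre = np.random.randn(200, 64)                     # stand-in for encoder means (pre-contrast)
        mu_post = -mu_pre + 0.1 * np.random.randn(200, 64)    # stand-in for the target phase
        pca = PCA(n_components=2).fit(np.vstack([mu_pre, mu_post]))
        p_pre, p_post = pca.transform(mu_pre), pca.transform(mu_post)
        plt.scatter(p_pre[:, 0], p_pre[:, 1], s=5, label="pre-contrast")
        plt.scatter(p_post[:, 0], p_post[:, 1], s=5, label="target phase")
        plt.legend(); plt.title("Latent means (PCA projection)"); plt.savefig("latent_pca.png")

        # (2) Paired significance tests on per-case metrics of two competing models.
        psnr_ours = np.random.normal(30.0, 1.0, size=100)            # stand-in per-case PSNR values
        psnr_baseline = psnr_ours - np.random.normal(0.5, 0.3, 100)  # stand-in baseline PSNR values
        t_stat, p_t = stats.ttest_rel(psnr_ours, psnr_baseline)      # paired t-test
        w_stat, p_w = stats.wilcoxon(psnr_ours, psnr_baseline)       # Wilcoxon signed-rank test
        print(f"paired t-test p={p_t:.4g}, Wilcoxon p={p_w:.4g}")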

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an intriguing approach to representation learning and image translation using a lightweight VAE architecture. The idea of enforcing latent space symmetry is novel and conceptually interesting, and the paper is overall clearly written. However, the work falls somewhat short in validating the core assumptions behind its design (see comments in weaknesses). While the empirical performance is promising, it remains unclear what aspects of the model are primarily driving these improvements. Nonetheless, I believe the method offers a valuable and fresh perspective, and merits further exploration.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contributions of this paper are the introduction of Flip Distribution Alignment (FDA) and Y-shaped bidirectional training, which are intended to separate shared and independent imaging features. The authors assessed the proposed framework, FDA-VAE, in the context of multi-phase MRI synthesis. They compared its performance against established image synthesis models, including Pix2Pix, ResVit, TransUnet, PTNet, and I2I-Mamba. The results demonstrated that FDA-VAE outperformed these models across various synthesis tasks involving four T1 contrast-enhanced phases, all while maintaining a relatively lightweight model size. Additionally, an ablation study revealed that the combination of VAE, FDA, and bidirectional training (referred to as FDA-VAE) outperformed VAE with FDA, which, in turn, outperformed the plain VAE.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel formulation (FDA & bidirectional training): The Flip Distribution Alignment (FDA) is a simple-to-understand yet efficient constraint. By explicitly (dis)aligning the latent distributions of different MRI phases, the model can better separate the shared and independent imaging features. The Y-shaped bidirectional training introduces two independent decoders, which was shown to further enhance the overall synthesis quality.
    • Comprehensive evaluation: The paper presents a comprehensive evaluation of FDA-VAE against popular state-of-the-art methods (Pix2Pix, ResVit, TransUnet, PTNet, and I2I-Mamba) across six different tasks. The use of multiple metrics (PSNR, SSIM, and LPIPS) provides a robust assessment of the model’s performance. The comparison of parameter size and inference time further highlights the efficiency of FDA-VAE.
    • Lightweight model: The authors demonstrate that FDA-VAE achieves comparable, and often superior, synthesis quality while using significantly fewer parameters (11.78M) compared to models like ResVit and TransUnet (over 100M parameters). This makes it a more practical solution for applications with limited storage and computational resources.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Statements: — “Compared to existing deep AE generators, our backbone has fewer layers and a narrower model width.”: The paper does not provide hyperparameter information for the proposed backbone, leaving this statement unsubstantiated. — “This misalignment disrupts feature correspondence and increases divergence, making feature transformation difficult and degrading synthesis quality. As training progresses, the lack of alignment causes the distributions to collapse onto each other, overemphasizing shared features while suppressing modality-specific information, reducing independent feature distinctiveness.”: This statement should be elaborated in more detail or supported with a citation.
    • Other architectures: The FDA and bidirectional training strategy are implemented only in the VAE backbone. Implementing them in other architectures could give insight into their broader potential.
    • GAN loss function: It is unclear how the GAN loss function is used in the bidirectional training strategy, as the standard GAN loss function requires a separate discriminator (classification) model.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The paper is generally well-written and organized. The introduction clearly outlines the problem and the proposed solution. The method section provides a detailed explanation of the FDA-VAE framework. The experimental setup and results are presented in a clear and concise manner. However, some sections could benefit from further clarification. For example, the explanation of the Y-shaped bidirectional training could be more intuitive. Overall, the paper is easy to follow and understand, but a few minor improvements could enhance its clarity and organization.
    • Illustrations: The figures and table are clear and helpful in understanding the paper.
    • FDA: Could the FDA loss function be improved by enforcing \mu_A \gg 0?
    • Metrics: It would be valuable to add one of the most intuitive metrics, mean absolute error (MAE).
    • Experiment setups: It is unclear which checkpoint was selected: the last epoch (40th) or the best epoch? If it was selected from the best epoch, then how was the best epoch determined? Also, were 40 epochs sufficient, or could the model have benefited from training for more epochs?
    • LPIPS evaluation metric: By default, LPIPS is implemented for natural, 3-channel RGB images. How did the authors implement LPIPS for single-channel grayscale images?
    • Figure 4: The error heatmaps in Figure 4 could benefit from a colorbar.
    • Writing: — Page 2: Change “To address this problem, we propose …” to “To address these problems, we propose …”. — Page 2: Change “It’s” to “It is”. — Page 5: Title of Subsection 3.1, change “Experience Setups” to “Experiment Setups”. — Page 8: Remove “(SOTA)”, as the abbreviation is not used in the remainder of the paper.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clear and well-written, presenting a sound theoretical foundation and a proposed framework that holds promise for future research. However, several points could be improved.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The primary contribution of this paper is the introduction of a novel, lightweight, and interpretable framework called Flip Distribution Alignment Variational Autoencoder (FDA-VAE) for multi-phase MRI synthesis. The authors propose a unique Flip Distribution Alignment (FDA) strategy that enforces symmetric alignment in the latent space by flipping the mean vectors and aligning the variances of the input and target image distributions. This approach effectively disentangles shared and modality-specific features, enhancing the quality of synthesized images. Additionally, the paper introduces a Y-shaped bidirectional training scheme that simultaneously performs self-reconstruction and cross-phase synthesis, improving the stability and interpretability of the latent space. Compared to existing state-of-the-art methods, FDA-VAE achieves superior synthesis quality with significantly fewer parameters and faster inference times, demonstrating its potential for efficient and high-quality medical image synthesis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Here are the major strengths of the paper:

    1. Innovative Flip Distribution Alignment (FDA) Strategy: The paper introduces a novel FDA mechanism that enforces symmetric alignment in the latent space by flipping the mean vectors and aligning the variances of the input and target image distributions. This approach effectively disentangles shared and modality-specific features, enhancing the quality of synthesized images. In addition, this approach improves generation efficiency compared to methods that explicitly model modality feature correspondences.

    2. Lightweight and Efficient Model Architecture: FDA-VAE is designed with a lightweight architecture, utilizing only 11.78M parameters. This is significantly fewer than many state-of-the-art models, resulting in faster inference times without compromising image quality.

    3. Comprehensive Evaluation and Theoretical Analysis: The paper provides evaluations of FDA-VAE, demonstrating superior synthesis quality and efficiency compared to existing methods. The article also provides theoretical and interpretability analyses of the FDA.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Here are the major weaknesses of the paper:

    1. Lack of explanation for the design choices of the image encoder in the model: While the authors state that the use of two phase-specific decoders is intended to enhance feature disentanglement, it is unclear why a similar approach was not applied to the encoder, e.g., by adopting phase-specific encoders (many methods have adopted separate designs, e.g., ‘Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders’). This aspect also appears to be missing from the experimental discussion.

    2. The ablation study is not comprehensive enough: Both FDA and the Y-shaped bidirectional training strategy in FDA-VAE contribute to feature disentanglement, and the relative strength of their effects is an interesting point. The paper appears to lack a discussion of the VAE + KL + Y-shaped configuration.

    3. Lack of a visual comparison illustrating the distribution collapse caused by relying solely on KL-divergence regularization: The authors state in the paper that ‘the lack of alignment causes the distributions to collapse onto each other’. However, Fig. 5b only shows the case of successful feature disentanglement, and a visualization illustrating distribution collapse due to misalignment is missing. Such a visualization would be important for highlighting the feature disentanglement capability of the proposed method.

    4. Insufficient Comparison with Recent State-of-the-Art Methods: The paper lacks comprehensive comparisons with recent state-of-the-art methods in multi-phase MRI synthesis. For example, approaches utilizing latent diffusion models for cross-modality 3D brain MRI synthesis have demonstrated promising results. Without benchmarking against such methods, it is challenging to assess the relative performance and advantages of the proposed FDA-VAE framework.

    5. Limited Novelty in Latent Space Alignment: The proposed Flip Distribution Alignment (FDA) strategy, which enforces symmetric alignment in the latent space by flipping mean vectors and aligning variances, is conceptually similar to existing methods that aim to disentangle shared and modality-specific features in medical image synthesis. For instance, hierarchical latent variable models have been employed in multi-modal MRI synthesis to capture complex feature representations (‘Unified Brain MR-Ultrasound Synthesis Using Multi-modal Hierarchical Representations’, MICCAI 2023).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A comprehensive evaluation of the paper based on the above strengths and weaknesses.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their constructive feedback, which helps us improve the clarity of the paper. Below we respond point-by-point.

Reviewer #1 R1-C1 Need evidence that FDA prevents latent collapse: We agree that relevant experiments and visualizations would help support the role of the FDA constraint; due to space constraints, we will introduce such comparisons in subsequent studies. R1-C2 Overlap in Fig. 5(b): In Fig. 5 we used an extreme slice in which the Pre and CV phases differ in texture only at the edges of vessels and lesions, so the latent distributions overlap more heavily; nevertheless, a symmetric, non-overlapping portion can still be observed. R1 Additional comments: We appreciate these insightful suggestions and will consider them in future work.

Reviewer #3 R3-C1 Model depth / hyper-parameters: As shown in Fig. 2, our backbone has one reparameterization block (FDA), two attention layers, and three up/down-sampling stages, which is significantly shallower than ResVit. Complete hyper-parameter files will be released with the code upon camera-ready. R3-C2 Applicability of FDA to other architectures: FDA is designed for backbones without extensive skip connections; applying it to U-Net-like networks may conflict with such links, which we leave for future exploration. R3-C3 GAN loss details: One discriminator is used in single-direction FDA; two independent PatchGAN discriminators are employed in the Y-shaped training, following CycleGAN. R3-C4 Setting \mu_A \gg 0 to improve FDA: The FDA is designed to keep the latent distributions of the two phases maximally separated while still fitting the prior distribution; forcing a particular mean into a fixed range seems difficult to optimize toward this goal. R3-C5 Training epochs: We used about 24K slice pairs per task for each model, with the batch size set to 24, so 40 epochs corresponds to roughly 40K iterations per model, which is a reasonable training budget. This choice also balances model performance against computational resources. R3-C6 LPIPS computation: Each grayscale image is duplicated to three channels and evaluated with the official LPIPS (AlexNet) implementation.
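For reference, a minimal sketch of the LPIPS protocol described in R3-C6, using the official lpips package with the AlexNet backbone; the wrapper function and the [0, 1] input convention are illustrative assumptions, not the authors' code.

    import torch
    import lpips

    loss_fn = lpips.LPIPS(net='alex')  # official implementation, AlexNet backbone

    def lpips_grayscale(img_a, img_b):
        # img_a, img_b: (H, W) grayscale tensors scaled to [0, 1] (assumed convention)
        def to_rgb(x):
            x = x.unsqueeze(0).unsqueeze(0)   # (1, 1, H, W)
            x = x.repeat(1, 3, 1, 1)          # duplicate the single channel to 3 channels
            return x * 2.0 - 1.0              # LPIPS expects inputs in [-1, 1] by default
        return loss_fn(to_rgb(img_a), to_rgb(img_b)).item()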

Reviewer #4 R4-C1 Shared encoder design: A single encoder minimizes parameters and provides a unified latent basis, which is crucial for symmetric alignment in FDA. R4-C2 VAE + KL + Y-shape motivation: Bidirectional synthesis, analogous to CycleGAN, stabilizes GAN training and prevents mode collapse, which we observe empirically in our experiments. R4-C3 Comparison with latent diffusion models: Due to page limits we did not include LDM baselines; assessing them is part of our planned future work. R4-C4 Reference methods update: We will add the suggested citations in the final version.
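To make the Y-shaped bidirectional step more concrete (cf. R4-C2 and the reviewers' description of a shared encoder with two phase-specific decoders), here is a hypothetical sketch. The way the cross-phase branch maps between the mirrored latent codes (negating the sampled code before decoding with the other phase's decoder) is an assumption of this sketch, and all names and losses are illustrative rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    def y_shaped_step(encoder, decoder_a, decoder_b, x_a, x_b):
        # Shared encoder produces a Gaussian posterior for each phase.
        mu_a, logvar_a = encoder(x_a)
        mu_b, logvar_b = encoder(x_b)
        z_a = mu_a + torch.randn_like(mu_a) * (0.5 * logvar_a).exp()  # reparameterization
        z_b = mu_b + torch.randn_like(mu_b) * (0.5 * logvar_b).exp()
        # Self-reconstruction branch: each phase-specific decoder rebuilds its own phase.
        rec = F.l1_loss(decoder_a(z_a), x_a) + F.l1_loss(decoder_b(z_b), x_b)
        # Cross-phase synthesis branch (assumption: the mirrored code -z is fed to the
        # other phase's decoder, consistent with the flipped, symmetric latent layout).
        syn = F.l1_loss(decoder_b(-z_a), x_b) + F.l1_loss(decoder_a(-z_b), x_a)
        return rec, syn, (mu_a, logvar_a, mu_b, logvar_b)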




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


