Abstract

Despite the success of deep learning in medical image segmentation, domain shifts caused by variations in scanners and imaging protocols often degrade performance, limiting real-world clinical deployment. Domain generalization (DG) aims to address this issue by learning robust models that generalize well across different domains. While existing DG methods based on feature-space domain randomization have shown promise, they suffer from a limited and unordered search space of feature styles. In this work, we propose MixStyleFlow, a novel DG approach that utilizes normalizing flows to explicitly model the distribution of domain feature styles. By sampling domain feature styles from the learned normalizing flows and mixing them with original feature statistics along the feature channel dimension, our method effectively expands and diversifies domain features in a controllable manner. We evaluate MixStyleFlow on two medical segmentation tasks—prostate MRI and fundus imaging—demonstrating superior generalization performance on unseen target-domain data. Our results highlight the potential of normalizing flows for improving domain generalization in medical image segmentation, paving the way for more robust deep learning models capable of handling diverse clinical scenarios. The code is available at https://github.com/Reza-Safdari/MixStyleFlow.
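The mixing step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the `lam` interpolation weight, and the convex-combination rule are assumptions made for illustration; the abstract only states that sampled styles are mixed with the original feature statistics along the channel dimension.

```python
import numpy as np

def channel_stats(x, eps=1e-6):
    """Per-sample, per-channel mean and std of a feature map shaped (N, C, H, W)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return mu, sigma

def mix_style(x, new_mu, new_sigma, lam):
    """Re-style x by interpolating its statistics with externally supplied ones.

    lam has shape (N, C, 1, 1): lam = 1 keeps the original style and
    lam = 0 fully adopts the supplied style. This mixing rule is an
    assumption for illustration, not taken from the paper.
    """
    mu, sigma = channel_stats(x)
    normalized = (x - mu) / sigma                  # remove the original style
    mixed_mu = lam * mu + (1.0 - lam) * new_mu
    mixed_sigma = lam * sigma + (1.0 - lam) * new_sigma
    return normalized * mixed_sigma + mixed_mu     # apply the mixed style

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8))                  # toy feature map
new_mu = rng.normal(size=(2, 4, 1, 1))             # stand-in for flow samples
new_sigma = np.abs(rng.normal(size=(2, 4, 1, 1))) + 0.5
lam = rng.uniform(size=(2, 4, 1, 1))               # per-channel mixing weight
y = mix_style(x, new_mu, new_sigma, lam)
```

In this sketch the style statistics are drawn at random; in the method described above they would come from the learned normalizing flows.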

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3460_paper.pdf

SharedIt Link: https://rdcu.be/eHwPW

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04947-6_36

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Reza-Safdari/MixStyleFlow

Link to the Dataset(s)

Prostate segmentation: https://liuquande.github.io/SAML/

OD/OC segmentation: https://zenodo.org/records/8009107

BibTex

@InProceedings{SafRez_MixStyleFlow_MICCAI2025,
        author = { Safdari, Reza AND Nikouei Mahani, Mohammad-Ali AND Koohi-Moghadam, Mohamad AND Bae, Kyongtae Tyler},
        title = { { MixStyleFlow: Domain Generalization in Medical Image Segmentation using Normalizing Flows } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {376 -- 385}
}

Reviews

Review #1

Please describe the contribution of the paper

The main contribution of the paper is the proposal of MixStyleFlow, a novel domain generalization (DG) approach for medical image segmentation. MixStyleFlow leverages normalizing flows to model the distribution of domain feature styles explicitly. By sampling diverse feature styles from these flows and integrating them with original feature statistics using a mixstyle approach along the feature channel dimension, the method enhances the robustness and generalization of deep learning models across varied domains, such as different scanners and imaging protocols. The approach is evaluated on prostate MRI and fundus imaging tasks, demonstrating improved performance on unseen target-domain data.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Novel Methodology: The integration of normalizing flows with a mixstyle approach for feature style generation is a fresh perspective in domain generalization. Normalizing flows provide a controllable and explicit way to model feature style distributions, which is a departure from traditional feature-space domain randomization methods. This is interesting because it allows for a more systematic diversification of domain features, potentially improving model robustness in clinical settings.
2. Relevant Application: The application to medical image segmentation, specifically for prostate MRI and fundus imaging, addresses a critical challenge in clinical deployment where domain shifts are common. The focus on real-world clinical scenarios enhances the practical relevance of the work.
3. Clear Evaluation: The paper evaluates MixStyleFlow on two distinct medical segmentation tasks, providing evidence of its generalization capability on unseen domains. This evaluation on diverse datasets strengthens the claim of improved robustness.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Limited Justification for Normalizing Flows: The paper does not adequately justify the choice of normalizing flows over other generative methods (e.g., VAEs). Without a clear explanation of why normalizing flows are superior or particularly suited for this task, the methodological choice lacks strong motivation. For instance, prior work like Goodfellow et al. (2014) on GANs or Kingma and Welling (2013) on VAEs could have been referenced to contextualize this decision.
2. Outdated and Narrow Comparison Methods: The baseline methods used for comparison are limited to papers from 2021–2022, which reduces the persuasiveness of the results given the rapid advancements in the field. Additionally, the comparisons are confined to mixstyle-like style augmentation methods, omitting other DG approaches, such as those based on meta-learning (e.g., Li et al., 2018) or adversarial training (e.g., Vu et al., 2019). This narrow scope weakens the evaluation.
3. Limited Contribution Scope: The contribution is primarily centered on applying normalizing flows for feature generation and mixing, which feels incremental. The paper lacks additional innovations, such as novel loss functions or architectural improvements, to broaden its impact.
4. Presentation Issues: The writing quality is average, with insufficient detail in the introduction about the proposed method, making it challenging to grasp the full scope of MixStyleFlow. Visuals, such as Figure 1 (model comparison), are blurry, and result tables (Tables 1, 2, and 3) do not adhere to standard conference formatting, reducing readability. Furthermore, the formatting of equations and references requires improvement to meet publication standards.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I recommend a weak reject for this paper due to several critical factors. While the proposed MixStyleFlow method is novel in combining normalizing flows with mixstyle for domain generalization, the paper falls short in several areas. First, the lack of justification for choosing normalizing flows over other generative methods undermines the methodological rigor. Second, the comparison methods are outdated (2021–2022) and limited to a single category (style augmentation), which diminishes the robustness of the evaluation. Third, the contribution is relatively narrow, focusing solely on feature generation and mixing without broader innovations. Additionally, presentation issues—such as an underdeveloped introduction, blurry figures, non-standard table formats, and suboptimal equation and reference formatting—hinder clarity and professionalism. These weaknesses collectively outweigh the strengths of novelty and clinical relevance, leading to my recommendation.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

MixStyleFlow is an innovative framework for domain generalization that leverages normalizing flows to directly represent the distribution of styles in domain features.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

A new framework for domain generalization has been developed that employs normalizing flows to effectively model and modify feature styles in a structured and expressive manner. It represents the distribution of domain feature styles, allowing for controlled and varied style augmentations.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The paper lacks discussion of, and comparative evaluation against, the most recent similar methodologies: DOI:10.1007/978-3-031-43901-8_2, DOI:10.1609/aaai.v37i2.25332, DOI:10.1016/j.bspc.2024.106801.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

(1) The experimental material requires additional detail. Providing adequate and comprehensive details about the experiments would make the model easier to understand and reproduce. (2) References to preprints should be replaced with their officially published editions (if any). (3) Despite the variations in the validation datasets and tasks, a more in-depth discussion and comparative evaluation with recent methods would help emphasize the contributions of this research.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper lacks discussion of, and comparative evaluation against, recent similar methods. Although the datasets show some variations, they all address the domain shift problem in DG for medical image segmentation. Essentially, these studies use different strategies to explicitly model the distribution of domain feature styles. To learn domain-invariant representations, their variants manipulate the style information in feature maps to simulate domain shifts. What unique features does the suggested MixStyleFlow possess? Without a discussion and comparison with recent similar methods, how can the advantages of the proposed model be shown (the methods of references 18-22 are outdated)? A full discussion and comparison would enrich the narrative and provide readers with a broader perspective on data augmentation and DG.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper proposes MixStyleFlow for DG in medical image segmentation. It utilizes normalizing flows to model and perturb feature styles in a structured manner. The method aims to address limitations of existing feature-space domain randomization techniques by explicitly capturing the distribution of domain feature styles, enabling more diverse and controllable style augmentations. Experiments on prostate MRI and fundus image segmentation tasks demonstrate superior generalization performance compared to existing DG methods.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The use of normalizing flows to model feature style distributions is innovative and addresses a clear gap in existing DG methods that rely on limited or unordered style perturbations.
2. The method is well-grounded in theory, with clear explanations of normalizing flows and how they are integrated into the segmentation framework.
3. The inclusion of experiments with limited training data (10% and 30%) adds practical value.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The paper lacks a comparison of the training time or resource requirements of the proposed method.
2. I suggest that the authors add a discussion of the limitations and potential improvements of their method.
3. The paper lacks an ablation study to demonstrate the contribution of individual components (e.g., the dual-decoder architecture, normalizing flow parameters).
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents an approach to DG in medical image segmentation. The experimental results are compelling, demonstrating clear improvements over existing methods.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #4

Please describe the contribution of the paper

The authors introduce MixStyleFlow, a novel method for domain generalization in medical image segmentation that leverages normalizing flows. The approach involves initially training a segmentation network with two decoders: one for predicting semantic segmentation masks and another for reconstructing the input images. Normalizing flows are then used to model the distribution of feature statistics (mean and standard deviation) from selected layers of the reconstruction decoder. In the second stage, the model is retrained using augmented samples generated by interpolating in-domain feature statistics with out-of-domain statistics sampled from the learned normalizing flow models.

The proposed method is evaluated against several state-of-the-art baselines on two multi-domain datasets, which are derived from existing public datasets. Experimental results demonstrate that MixStyleFlow achieves superior performance in many cases.
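The "model the statistics, then sample" step in the summary above can be sketched as follows. This is a toy illustration only: a single elementwise affine transform stands in for the paper's normalizing flows, and the class name `AffineFlow`, the `temperature` argument, and the synthetic statistics are all hypothetical. Real flows stack many invertible layers, but the fit-by-likelihood and sample-from-base-noise pattern is the same.

```python
import numpy as np

class AffineFlow:
    """Elementwise affine flow s = a * z + b with base noise z ~ N(0, I).

    A deliberately tiny stand-in for a real normalizing flow.
    """

    def fit(self, stats):
        # For this flow, maximum likelihood has a closed form:
        # b is the sample mean and a is the (biased) sample std.
        # The small constant only guards against a zero scale.
        self.b = stats.mean(axis=0)
        self.a = stats.std(axis=0) + 1e-6
        return self

    def log_prob(self, stats):
        z = (stats - self.b) / self.a                      # inverse pass
        base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)
        # Change of variables: subtract log|det ds/dz| = sum(log a).
        return base - np.log(self.a).sum()

    def sample(self, n, temperature=1.0, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # A larger temperature spreads samples further from the
        # statistics seen during training.
        z = temperature * rng.standard_normal((n, self.a.shape[0]))
        return self.a * z + self.b                         # forward pass

rng = np.random.default_rng(0)
# Synthetic "feature statistics" (e.g. a mean and a std per sample).
stats = rng.normal(loc=[1.0, 2.0], scale=[0.5, 0.1], size=(500, 2))
flow = AffineFlow().fit(stats)
hot = flow.sample(1000, temperature=2.0, rng=rng)
cold = flow.sample(1000, temperature=0.5, rng=rng)
```

Samples such as `hot` would then be interpolated with the in-domain statistics to produce augmented features for the second training stage.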
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper presents a novel application of normalizing flows in the context of medical image segmentation, specifically for domain generalization—an underexplored area where such probabilistic modeling has strong potential.

The datasets used are derived from publicly available sources, making the experiments traceable and, in principle, reproducible by third parties.

The proposed method demonstrates consistent performance improvements over state-of-the-art alternatives across both evaluated datasets, highlighting its practical effectiveness.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The paper lacks essential architectural details of the proposed model, which significantly limits its reproducibility. While the model is described as an encoder–dual-decoder architecture, no specifics are provided regarding the backbone, decoder structures, or other implementation details. One may assume it is U-Net-based, but this should be explicitly stated and described in the main text or supplementary materials.

The rationale behind selecting specific encoder layers (layers 2, 3, and 4) for applying normalizing flows is not discussed. Without this justification, it is unclear whether these choices are empirically driven, arbitrary, or based on prior work.

The paper does not describe how the baseline methods were trained and optimized. This omission makes it difficult to assess the fairness of the comparisons or to reproduce the experimental results.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

In Figure 1, the encoder and decoder components are visually represented as frozen, which is misleading since they are in fact trained during the first phase of the proposed framework. Additionally, the figure does not clearly illustrate that the second phase involves training with augmented images. Updating the diagram to more accurately reflect the process would improve clarity.

In all results tables, I recommend including a final column showing the average performance across all domains. This would make it easier to assess overall effectiveness and compare methods at a glance.

Reproducibility and evaluation practices should be strengthened. The manuscript currently lacks important details such as the hyperparameter search space for each model (including the baselines), the number of training and evaluation runs, and the validation strategy. I encourage the authors to follow established best practices for experimental reporting, such as those described in Dodge et al., “Show Your Work: Improved Reporting of Experimental Results” (2019).
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I believe the paper introduces an interesting and novel approach to domain generalization in medical image segmentation by leveraging normalizing flows to model feature statistics and generate domain-diverse augmentations. The idea is well motivated, and the results demonstrate promising performance gains over competitive baselines on two benchmark datasets.

However, several key aspects of the paper currently limit its clarity and reproducibility. In particular, important architectural details (e.g., the segmentation backbone and decoder structure) are missing, and the rationale for selecting specific layers for flow modeling is not discussed. Furthermore, the paper does not specify how the baselines were trained or optimized, and important experimental reporting details—such as validation strategies, hyperparameter search, and the number of runs—are not provided.

Overall, while the core idea is solid and potentially impactful, the manuscript would benefit from clarifications and additional information. I consider the paper to be a weak accept, pending a rebuttal that addresses the concerns above and strengthens the reproducibility and transparency of the work.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

Thank you, Reviewers 1–4, for your insightful feedback that strengthened our manuscript. (R[reviewer number]Q[question number]P[part number]):

Regarding the reviewers’ concerns about outdated and narrow comparison methods (R1Q7P1, R1Q10P3, R3Q7P2), our comparison methods were chosen based on their relevance to feature-space domain randomization and style augmentation, aligning closely with the MixStyleFlow framework. Although our baselines from 2021–2022 remain widely recognized and relevant in the medical imaging DG community, as evidenced by their continued use in recent literature (e.g., TriD [11], 2023), we acknowledge that newer and alternative methods, such as meta-learning or adversarial training, exist. We will include a detailed discussion to position our paper against the suggested methods in the “Related Work” section and plan to provide quantitative comparisons in future work.

We will address the replacement of preprint references (R1Q10P2), the presentation issues (R3Q7P4), the updates to Figure 1 to clearly reflect the training phases (R4Q10P1), and the addition of an average performance column to the results tables (R4Q10P2) in the final manuscript.

Regarding the reviewers’ concerns about experimental clarity, baseline training details, and reproducibility practices (R1Q10P1, R4Q7P3, R4Q10P3), we note that the ‘Implementation Details’ section provides brief training information, while full details are available in the linked GitHub repository due to space limitations. This ensures transparency and facilitates reproducibility of our approach.

R2Q7P1: As outlined in the “Implementation Details” section, we provide hardware specifications and training durations for both the normalizing flow models and segmentation tasks. In future work, we plan to include direct wall-clock and resource usage comparisons.

R2Q7P2: In the “Conclusion” section, we noted the computational overhead of normalizing flows as a key limitation. In the final paper, we will expand on this by suggesting joint training with the segmentation model to improve efficiency, and using trained flows for test-time adaptation to enhance generalization.

R2Q7P3: The dual-decoder architecture was adopted from MaxStyle [8], where it was shown to be suitable for interpreting feature statistics manipulation, and we retained it unchanged for fair comparison. For normalizing flows, we used simple, small models as described in the “Implementation Details” section, without structural optimization. We acknowledge the value of an ablation study and plan to explore this in future work.

R3Q7P1: We chose normalizing flows for their ability to model complex, high-dimensional data distributions with exact likelihood computation, unlike VAEs, which rely on approximations, thus enabling precise density estimation for realistic and diverse medical image samples. Their invertibility enables direct data–latent space mapping for generative modeling. Compared to GANs, normalizing flows avoid mode collapse and instability, improving reliability with limited data. In future work, we plan to provide a more detailed comparison with VAEs and GANs. However, the revised manuscript includes a brief discussion justifying our choice.

R4Q7P1: We’ve revised Section 3.2 to include additional architectural details of our model. The encoder–dual-decoder structure is adopted from [8], which describes it fully.

R4Q7P2: The choice of decoder layers 2, 3, and 4 and their temperature values was empirically determined to balance low-level details and high-level semantics for style augmentation. While originally omitted due to space limits, this explanation is now included in the revised Section 4.2 (“Implementation Details”).
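For readers unfamiliar with the "exact likelihood" argument in the response to R3Q7P1, the standard textbook identities are sketched below. The notation is generic rather than taken from the paper: f denotes an invertible flow mapping data x to latent z = f(x), p_Z is the base density, and q(z | x) is a VAE's approximate posterior.

```latex
% Normalizing flow: exact log-likelihood via the change-of-variables formula
\log p_X(x) = \log p_Z\bigl(f(x)\bigr)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|

% VAE: only a lower bound (the ELBO) on the log-likelihood is optimized
\log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\bigl[\log p(x \mid z)\bigr]
          - \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z)\bigr)
```

The first line is an equality that can be evaluated directly whenever the Jacobian determinant of f is tractable, whereas the second is an inequality; this is the contrast the authors draw between flows and VAEs.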

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A

MixStyleFlow: Domain Generalization in Medical Image Segmentation using Normalizing Flows

Author(s):