Abstract
Medical anomaly detection aims at identifying samples that deviate from normal patterns and localizing specific anomalous regions, playing a critical role in early detection and intervention of diseases. Reconstruction methods based on generative models are a key category among current methods for medical anomaly detection. However, a common challenge for them is achieving accurate reconstruction of normal regions while suppressing the reconstruction of anomalous regions. StyleGAN, with its powerful generative capability and the ability to perform controllable image modifications, has shown huge potential for medical image anomaly detection. However, the latent space of StyleGAN still requires further exploration and utilization. In this paper, we propose a StyleGAN-based latent Code Retrieval and Partial Swap (SCRPS) method for brain image anomaly detection. We construct a healthy image latent code repository by leveraging GAN inversion in StyleGAN’s latent space. We then design a coarse-to-fine latent code retrieval mechanism to select the normal images most similar to the test image. We also introduce a partial latent code swap strategy that replaces anomalous latent codes with linear combinations of normal latent codes, and we employ a perceptual score to perform anomaly localization. Comprehensive experiments on brain tumor and stroke lesion datasets show that our method outperforms several state-of-the-art approaches, with improvements of 3.12 and 7.14 percentage points in average volume-level AUROC and maximum achievable Dice score, respectively.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2271_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
BraTS2020 dataset: https://www.med.upenn.edu/cbica/brats2020/data.html
ATLAS2.0 dataset: https://fcon_1000.projects.nitrc.org/indi/retro/atlas.html
OpenBHB dataset: https://baobablab.github.io/bhb/dataset
BibTex
@InProceedings{WeiJie_StyleGANbased_MICCAI2025,
author = { Wei, Jie and Hu, Xiaofei and Zhang, Shaoting and Wang, Guotai},
title = { { StyleGAN-based Brain MRI Anomaly Detection via Latent Code Retrieval and Partial Swap } },
booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15961},
month = {September},
pages = {568--578}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a latent-swapping strategy that replaces fine-grained anomalous details with normal ones, thereby improving reconstruction-based anomaly detection for the StyleGAN2 architecture.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- While StyleGAN-based anomaly detection has shown promise in medical imaging due to its inability to reconstruct anomalies (https://ieeexplore.ieee.org/abstract/document/9434141), its progress has been hindered due to insufficient differentiation between poor normal and anomalous image region reconstructions (https://link.springer.com/chapter/10.1007/978-3-030-59520-3_18). The latent-swapping strategy employed by this paper improves this differentiation by replacing anomalous details with normal ones, thereby improving downstream StyleGAN2-based anomaly detection performance (Fig. 3b).
- The paper contains multiple aspects of a robust evaluation: (1) compares to eight competing methods, (2) performs an ablation study to demonstrate the benefits of using the coarse-to-fine swapping, (3) uses standard deviation as a measure of uncertainty, and (4) validates performance improvements with statistical testing.
- The manuscript is well-written.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The first contribution that the authors “introduced StyleGAN into anomaly detection” is a vast overstatement. StyleGAN has been used in anomaly detection for a multitude of applications for many years. I first saw it used in medical imaging in 2021 (https://ieeexplore.ieee.org/abstract/document/9434141). See also: (1) https://www.nature.com/articles/s41598-023-29521-z, (2) https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12933/129330N/Automated-anomaly-detection-in-histology-images-using-deep-learning/10.1117/12.3006224.short, (3) https://link.springer.com/chapter/10.1007/978-3-030-59520-3_18. The scope of this claim needs to be narrowed. The abstract and the fourth paragraph in the introduction containing the claim that StyleGAN is largely unexplored in anomaly detection should also be updated.
- I am concerned about the relevance of this work to the clinical application due to its computational expense, an expense which is not delineated in the manuscript. Backpropagation on StyleGAN2’s latent space likely takes several minutes per slice on a single GPU. The latent-swapping likely adds to this expense. With a longer page limit, I would have liked to have seen presented: hardware utilization in Methods, computational expenses in Results, and a Discussion section containing information on clinical feasibility and algorithmic limitations.
- While the proposed method outperformed eight models, I wouldn’t necessarily count the compared models as the most state-of-the-art models. For instance, f-AnoGAN and GANomaly are outdated with a plethora of algorithms outperforming them. I would have liked to have seen a comparison to diffusion models, as they seem to be replacing GANs in the reconstruction-based anomaly detection space. For example, this MICCAI 2024 paper (https://link.springer.com/chapter/10.1007/978-3-031-72120-5_37/tables/1) utilizing diffusion models had a 64.72 average Dice on BraTS, compared to this paper’s 36.04 (with the caveat that one is BraTS 2020 and the other is BraTS 2021). The performance on ATLAS is more similar (26.67 compared to this manuscript’s 24.14).
- I am concerned about the use of AUROC as an evaluation metric as it is highly affected by class imbalance, which is the case for tumor/lesion binary maps. AUPRC may have been more appropriate to use. Additionally, max achievable Dice doesn’t necessarily tell us the performance on the entire 3D image (depending on the metric definition).
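To illustrate the class-imbalance concern, here is a minimal Python sketch. It is not taken from the paper: the toy pixel counts (10 lesion pixels among 1,000), the scores, and the function names `auroc` and `average_precision` are assumptions made for illustration, and average precision is used here as a common estimate of AUPRC.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 990 healthy pixels and 10 lesion pixels.
# Exactly 50 healthy pixels receive a higher anomaly score than every lesion pixel.
labels = [0] * 990 + [1] * 10
scores = [i / 1000 for i in range(990)] + [0.9395 + j * 1e-5 for j in range(10)]

print(f"AUROC             = {auroc(labels, scores):.3f}")
print(f"average precision = {average_precision(labels, scores):.3f}")
```

On this toy data AUROC stays close to 0.95 while average precision falls below 0.10: the 50 false positives are a small fraction of the 990 healthy pixels, but they outnumber the 10 lesion pixels five to one.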
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
If extended to a journal with more space, I would like to see:
- The searches that led to the hyperparameter selection (Fig. 2b was great for K2). I’d especially be interested in the percentage of the training data necessary for the algorithm to perform well, as the expense of extracting latent codes for the training data concerns me.
- A more precise definition of what is meant by volume-level AUROC and max achievable Dice. I assume the A_per map was binarized and then AUROC/Dice were calculated between the segmentation maps? With AUROC calculated over all slices and Dice reported for the image slice with the highest Dice?
Minor fixes:
- In Section 3.2, include -> including.
- On the odd pages, the title is not in the header due to excessive length. Adding a running title will help.
- What statistical test was used for Table 1?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the algorithm in its current form may not be computationally feasible for a clinical setting (nor perform well enough due to its unsupervised nature), the theoretical contribution of editing StyleGAN’s latent space to improve GAN-based anomaly detection will likely be of interest to the community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The authors of the paper propose an anomaly detection framework based on a StyleGAN model trained for image generation and inversion on healthy brain MR images. Their main methodological contribution is that they draw latent codes from healthy images for diseased images in a specific way during inference.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novelty: To the best of my knowledge, their method of post-hoc StyleGAN latent space queries of healthy images for diseased images at inference time seems to be a novel and effective idea.
- Methods: Extension of the StyleGAN architecture during inference, making it versatile for different datasets.
- Evaluation/Performance: I am not very familiar with the state of the art in locating anomalies, but Table 1 seems very comprehensive to me, and the performance gaps are evident. Also, the ablation studies help to understand the dynamics during inference.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not clear to me to which number the authors set L (the number of generator layers). They only provide L1=5 (number of layers for “coarse latent code matching”) and L2=9 (starting layer for “partial latent code swapping”).
- When I first read the paper, I found it difficult to understand all the steps of querying the latent space during inference. Please add some more explicit information from the text to Figure 1 for better understanding (e.g. L1, L2, K1, K2).
- The method requires several consecutive steps: First, the GAN is trained on healthy images, then the GAN inversion is trained on healthy images, and then inference latent codes are queried for diseased images using three heuristics. In addition, each of the three heuristics has hyperparameters (e.g. L1, L2, K1, K2) that probably need to be adjusted for different datasets.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Minor comment that the authors talk about latent codes s_t with t as a subscript in the text but have t as a superscript in Figure 1.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors propose an anomaly detection framework based on a StyleGAN and draw latent codes from healthy images in a novel way during inference. However, I still have some open questions in the weaknesses section and think that they should better explain the latent space queries during inference by improving the connection between the text and Figure 1.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper proposes a novel anomaly detection method based on StyleGAN. It is a reconstruction-based method that exploits StyleGAN’s capacity to represent different levels of appearance variability to achieve more specific and localized anomaly detection results. Results indicate that the method outperforms state-of-the-art approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper presents a novel method for anomaly detection.
Qualitative and quantitative results suggest that this approach yields much more localized and specific markers for anomalies in imaging data.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The explanation of the method is ok, but could be clarified even more. Can you provide a point by point run-down from the input examples for training to the model, and then how the model is used on new data?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
p.4.: Sec.2.1 first line: adversarial network consists -> adversarial network that consists ….
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes new methodology, and results suggest that it outperforms the state of the art. The qualitative results in particular demonstrate that it is able to produce more localized and specific markers of anomalies in imaging data.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We appreciate the reviewers’ recognition of our work, noting that it “presents a novel method” (R1&R3), “contains robust evaluation” (R2), and is “of interest to the community” (R2). Here, we address the reviewers’ concerns and clarify certain misunderstandings.
Novelty of Using StyleGAN in Medical Anomaly Detection (R2): We acknowledge that our claim regarding the novelty of introducing StyleGAN to medical anomaly detection was inaccurate. In fact, many previous methods have already utilized StyleGAN as a generative model and employed optimization-based GAN inversion for image reconstruction. Our actual contribution lies in the further exploration and utilization of the latent space of StyleGAN. We will revise our statement accordingly in the camera-ready manuscript.
- Evaluation Metrics (R2)
- A_per is a probability map rather than a binarized map. AUROC is a threshold-independent metric and can be computed directly on A_per.
- The Max Achievable Dice is calculated by uniformly selecting 100 thresholds from the pixel anomaly scores sorted in ascending order and choosing the threshold that yields the highest Dice score.
- The statistical test we used is a t-test.
- As for AUPRC, we computed it during our experiments and confirmed that our method also outperforms the compared methods at the pixel level in terms of AUPRC. Due to space limitations, the AUPRC results are not presented in this version but will be provided in the extended version.
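The Max Achievable Dice procedure described above can be sketched in Python as follows. This is one possible reading, not the authors' code: the function names, the inclusive `>=` comparison, the inclusion of both the lowest and highest score among the sampled thresholds, and the convention that two empty masks score 1.0 are all assumptions. Whether the scores are pooled per volume or over the whole test set is left to the caller.

```python
def dice(pred, truth):
    """Dice overlap between two binary masks given as flat sequences."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    denom = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2.0 * inter / denom if denom else 1.0  # assumed convention for empty masks


def max_achievable_dice(scores, truth, n_thresholds=100):
    """Return (best_dice, best_threshold) over thresholds sampled from the sorted scores."""
    if not scores or n_thresholds < 2:
        raise ValueError("need at least one score and two thresholds")
    ranked = sorted(scores)  # pixel anomaly scores in ascending order
    last = len(ranked) - 1
    # Evenly spaced positions in the sorted scores; duplicates collapse for small inputs.
    positions = sorted({round(i * last / (n_thresholds - 1)) for i in range(n_thresholds)})
    best_dice, best_threshold = 0.0, None
    for pos in positions:
        threshold = ranked[pos]
        pred = [s >= threshold for s in scores]  # binarize the anomaly map
        d = dice(pred, truth)
        if d > best_dice:
            best_dice, best_threshold = d, threshold
    return best_dice, best_threshold


# Hypothetical five-pixel example.
print(max_achievable_dice([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 1, 0, 1]))
```

Because the threshold is chosen with access to the ground truth, the result is an upper bound on the Dice obtainable by thresholding, not the performance of a deployed operating point.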
- Compared Methods (R2)
- In our comparative experiments, we included recent 2024 methods such as MediCLIP and RealNet. While reconstruction-based methods like f-AnoGAN and GANomaly are outdated, they remain competitive with currently popular feature-based approaches when appropriate anomaly localization scores, such as the perceptual score, are considered.
- We plan to include methods using diffusion models as the generative backbone in future work.
- Regarding the discrepancy with the MICCAI 2024 paper (the Dice on BraTS is 64.72 vs. 36.04), aside from the difference between the BraTS 2020 and BraTS 2021 datasets, we believe the primary reason for this performance gap lies in the imaging modality used: the referenced paper used the T2 modality for anomaly detection on BraTS, while we used T1. These two modalities naturally exhibit differences in performance.
Computational Expense (R2): The optimization-based GAN inversion takes approximately 2 minutes per slice on a single NVIDIA 2080 GPU with our setting of 2,000 iterations. This may limit its applicability in clinical settings. We plan to include a detailed analysis of hardware utilization and computational expenses in future work. Additionally, we are actively working on improving efficiency through encoder-based GAN inversion methods.
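As a rough back-of-the-envelope check on these figures, the Python sketch below scales the quoted per-slice cost to the two repository sizes mentioned elsewhere in this rebuttal (about 150 and about 11,000 slices). It assumes, which the rebuttal does not state, that every repository slice is inverted with the same optimization at the same cost, sequentially on one GPU; the function name `inversion_hours` is hypothetical.

```python
# Figures quoted in the rebuttal (both approximate).
MINUTES_PER_SLICE = 2.0
ITERATIONS_PER_SLICE = 2000


def inversion_hours(n_slices, minutes_per_slice=MINUTES_PER_SLICE):
    """Single-GPU wall-clock hours to invert n_slices one after another."""
    return n_slices * minutes_per_slice / 60.0


# Implied cost of one optimization step.
seconds_per_iteration = MINUTES_PER_SLICE * 60.0 / ITERATIONS_PER_SLICE

for n_slices in (150, 11_000):
    hours = inversion_hours(n_slices)
    print(f"{n_slices:>6} slices: {hours:8.1f} h ({hours / 24:.1f} days)")
print(f"{seconds_per_iteration:.2f} s per optimization step")
```

Under these assumptions, 150 slices take about 5 hours and 11,000 slices about 15 days, which gives a sense of why a smaller repository or encoder-based inversion matters for cost.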
- Hyperparameter Sensitivity (R2&R3)
- As shown in Figure 2, our method is not sensitive to K₁ and K₂ (the numbers of candidates retrieved in the first and second stages, respectively).
- L₁ is set to half of the total number of layers (L) and represents the number of layers used for the first-stage retrieval. L is related to image resolution; in our case, L=12, as mentioned in Section 2.1.
- L₂ (starting layer for “partial latent code swapping”) requires more careful tuning across datasets, which is related to the size of internal structures across different organs.
- Due to space limitations, we omitted the ablation study on the size of the latent code repository. However, our method is not sensitive to it. We observed that performance remained stable when the repository was reduced from around 11,000 slices to around 150 slices. In future work, we plan to further reduce the number of reference slices and explore whether selecting a small yet representative subset from diverse anatomical regions can ensure stable performance.
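To make the roles of K₁, K₂, L₁ and L₂ concrete, the following Python sketch shows one way a coarse-to-fine retrieval followed by a partial swap could be organized. It is an illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the use of the remaining layers in the fine stage, the uniform combination weights, the 0-based layer indexing, and the function names `layer_distance`, `retrieve` and `partial_swap` are all hypothetical. A latent code is represented as a list of per-layer vectors.

```python
import math


def layer_distance(code_a, code_b, start, stop):
    """Euclidean distance over layers [start, stop) of two latent codes (assumed metric)."""
    return math.sqrt(sum(
        (a - b) ** 2
        for layer_a, layer_b in zip(code_a[start:stop], code_b[start:stop])
        for a, b in zip(layer_a, layer_b)
    ))


def retrieve(test_code, repository, l1, k1, k2):
    """Coarse stage keeps K1 codes by the first L1 layers; fine stage keeps K2 by the rest."""
    n_layers = len(test_code)
    coarse = sorted(repository, key=lambda c: layer_distance(test_code, c, 0, l1))[:k1]
    return sorted(coarse, key=lambda c: layer_distance(test_code, c, l1, n_layers))[:k2]


def partial_swap(test_code, neighbours, l2, weights=None):
    """Replace layers l2 and above with a linear combination of the retrieved normal codes."""
    if weights is None:  # assumed: uniform weights
        weights = [1.0 / len(neighbours)] * len(neighbours)
    swapped = [list(layer) for layer in test_code]  # layers below l2 stay untouched
    for layer in range(l2, len(test_code)):
        dim = len(test_code[layer])
        swapped[layer] = [
            sum(w * nb[layer][d] for w, nb in zip(weights, neighbours))
            for d in range(dim)
        ]
    return swapped


# Hypothetical toy setting: 4 layers of dimension 2 instead of L=12.
test = [[0, 0], [0, 0], [5, 5], [5, 5]]
repo = [
    [[0, 0], [0, 0], [1, 1], [1, 1]],
    [[0, 1], [0, 0], [3, 3], [3, 3]],
    [[9, 9], [9, 9], [5, 5], [5, 5]],
]
nearest = retrieve(test, repo, l1=2, k1=2, k2=2)
print(partial_swap(test, nearest, l2=3))
```

Only the layers from `l2` upward are swapped, so the coarse structure of the test image is preserved while the finer layers are drawn from normal codes; the reconstruction from the swapped code would then be compared with the input to localize anomalies.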
- Figures and Writing (R1&R2&R3): We apologize for the typos and the omissions of some parameters in the figures. We will make the necessary revisions in the camera-ready version.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The reviewers agree that the method is conceptually interesting and clearly described, and all lean towards accepting the paper. However, there are shared concerns: the claim of novelty is overstated given prior use of StyleGAN in medical anomaly detection, and comparisons to stronger, more recent SOTA methods—particularly diffusion models using healing concepts in latent or image space, some of which have been evaluated on similar datasets—are missing.