Abstract

Accurate abnormal region detection in medical images is critical for early diagnosis. Unlike supervised and self-supervised methods, unsupervised methods require no annotated training data and generalize well to unseen abnormalities. Such advantages are achieved by detecting abnormal regions from the differences between an input image and a generated pseudo-normal image, which is similar to the input image but excludes abnormal regions. However, existing unsupervised methods often suffer from high false positive rates at test time due to poor pixel-level matching between the normal regions of the input image and the pseudo-normal image. To address this challenge, we propose MatchGen, a novel plug-and-play framework that enhances the detection performance of existing unsupervised methods by optimizing the pseudo-normal image at test time. This produces an optimized pseudo-normal image that accurately matches the normal regions of the input while maintaining a clear distinction from the abnormal regions, which significantly improves detection performance. Extensive experiments on four real-world datasets demonstrate the outstanding effectiveness of MatchGen.
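The detection paradigm described above amounts to computing a per-pixel residual between the input and its pseudo-normal reconstruction and thresholding it. The following is a minimal sketch of that generic pipeline, assuming a pretrained `generator` and an illustrative `threshold`; it is not the authors' released implementation.

```python
# Minimal, illustrative sketch of reconstruction-based abnormal region detection.
# `generator` stands for any pretrained pseudo-normal generator (AE/VAE/GAN); the
# names and threshold value are assumptions, not the paper's actual code.
import torch


@torch.no_grad()
def detect_abnormal_regions(generator: torch.nn.Module,
                            image: torch.Tensor,
                            threshold: float = 0.2) -> torch.Tensor:
    """Return a binary abnormality mask from the input / pseudo-normal difference."""
    pseudo_normal = generator(image)          # pseudo-normal reconstruction of the input
    diff_map = (image - pseudo_normal).abs()  # per-pixel residual
    return diff_map > threshold               # large residual => flagged as abnormal
```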

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0400_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/lele0007/MatchGen

Link to the Dataset(s)

BraTS2021: brain MRI scans of adult brain glioma patients. https://www.cancerimagingarchive.net/analysis-result/rsna-asnr-miccai-brats-2021/

BTCV: liver tumor segmentation. https://www.synapse.org/Synapse:syn3193805/wiki/89480

RESC: retinal OCT edema segmentation dataset. https://challenger.ai/competition/fl2018

IDRiD: diabetic retinopathy lesions and normal retinal structures annotated at the pixel level. https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid

BibTex

@InProceedings{MaXin_MatchGen_MICCAI2025,
        author = { Ma, Xinyu and Ma, Jinhui and He, Shiqi and Che, Xin and So, Hon Yiu and Chu, Lingyang},
        title = { { MatchGen: Detecting Medical Abnormal Region by Generating Matched Normal Regions } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {323--333}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes MatchGen, a plug-and-play framework designed to enhance existing unsupervised medical anomaly detection methods by introducing a test-time optimization step. This step refines the generated pseudo-normal image to better match the normal regions of the input image while maintaining distinction from abnormal areas, aiming to reduce false positives.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Problem Focus: Directly targets the common issue of high false positive rates in reconstruction-based unsupervised methods.
    • Plug-and-Play: Designed as a general framework that can be applied on top of existing AE-based and GAN-based unsupervised methods.
    • Demonstrated Improvement: Shows consistent quantitative improvements (Dice, APpix) over various base unsupervised models across four different medical imaging datasets (BraTS2021, BTCV, RESC, IDRID).
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Insufficient Comparison to Prior Art: The paper fails to compare against or discuss highly relevant prior works, such as those employing normative priors (e.g., Chen et al., “Unsupervised Lesion Detection via Image Restoration with a Normative Prior”) or morphing after reconstruction (e.g., Bercea et al., “What Do AEs Learn?…”). This omission makes it difficult to assess the true novelty and relative performance of the proposed method.
    • Hyperparameter Tuning on Validation Set: The method introduces new hyperparameters (ϵ, τ) and relies on a validation set containing annotated abnormal images to tune them. This practice is problematic for evaluating a truly unsupervised method, as it potentially overestimates performance and biases results, especially when more hyperparameters are involved.
    • Lack of Runtime Analysis: The paper does not report or discuss the computational overhead introduced by the test-time optimization procedure. This is a significant drawback, as test-time optimization can be computationally expensive, potentially limiting practical applicability.  
    • Limited Scope (2D only): The experiments and implementation are restricted to 2D images, while many medical imaging modalities are volumetric (3D). Furthermore, the subdivision of the datasets is not clear, and slice-level evaluation can lead to issues (it is not common in segmentation and is potentially not aligned with the medical task).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors leading to this recommendation are the significant weaknesses identified:

    • Insufficient Comparison: The failure to benchmark against key related works significantly weakens the paper’s claims of novelty and state-of-the-art performance.
    • Methodological Concerns: Using an annotated validation set to tune hyperparameters contradicts the unsupervised premise and likely inflates the reported results, making the performance gains questionable under realistic conditions.

    While MatchGen presents an interesting idea for refining pseudo-normal images at test time and shows empirical gains over its chosen baselines, the methodological flaws in evaluation and the lack of comparison to highly relevant work prevent a positive recommendation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.
    • I appreciate the authors’ efforts in designing the experimental setup. However, the current evaluation, which involves fine-tuning on a validation set and slice-wise splitting, might not fully align with the typical goals of unsupervised anomaly detection. In a real-world scenario, labeled validation data is often unavailable.

    • I’d recommend including a discussion of highly related work, even if you have reservations about its applicability or performance (be it the non-universal nature or the reported performance of an approach).

    • To provide a more complete picture of the method’s practical utility, I strongly suggest including runtime comparisons in the main paper.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors present a technique called MatchGen to improve unsupervised anomaly detection performance on reconstruction-based architectures using VAEs or GANs. Specifically, MatchGen aims to identify at test time a better embedding to generate pseudo-healthy reconstructions of images with anomalies. Experiments on 4 different datasets suggest the effectiveness of MatchGen.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The methods proposed are interesting and novel. The idea of “test time adaptation” in the context of UAD is creative.
    2. The experimental set-up is sensible and involves a rather large number of methods and datasets.
    3. The paper is well-written and clear.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Doubts about the experimental settings: I have a few doubts regarding the experimental settings. I understand that the validation and test sets consist only of anomalous images: if so, why? I find this problematic on both counts: if the validation set is used to define the threshold for “anomaly” in the difference maps, then including images from healthy subjects would be helpful. On the other hand, a test set containing only anomalous images can give a biased assessment of performance. Additionally: how is the training set of anomaly-free images created? My understanding is that BraTS, for instance, does not contain data from healthy subjects. If this is the case, training data can only be selected by using slices without anomalies. This, however, creates a bias during training that can produce untrustworthy results. Please comment on these crucial aspects.
    2. Unclear implementation of other techniques: were the other competing methods implemented and optimised from scratch, or were the results obtained from the literature? Please specify. If the former, please indicate how the tuning was carried out. When comparing to the results reported in “MedIAnomaly: A comparative study of anomaly detection in medical images” by Cai et al, the results reported in this paper seem quite different. Were the settings different? Please comment and specify in the paper.
    3. Lack of (some) relevant literature: in the past years, MICCAI has seen a lot of publications in this field. While it is stated that MatchGen cannot deal with diffusion-based architectures, it would have been good to see them included in the quantitative comparison. I recommend at least citing them in the paper as recent advances in unsupervised anomaly detection. Some suggestions: a) Wolleb et al., “Binary Noise for Binary Tasks: Masked Bernoulli Diffusion for Unsupervised Anomaly Detection”, b) Naval-Marimont et al., “Ensembled Cold-Diffusion Restorations for Unsupervised Anomaly Detection”, both published at MICCAI 2024.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. “The search scope for MatchGen is ε ∈ [0,0.6] and τ ∈ [0,1]”: can you clarify how the optimisation was carried out? I.e. how were these hyper-parameters determined, and did the optimal values change much between different tasks/methods?
    2. Fig. 2 is very, very difficult to read. I would suggest moving this to an appendix and including in the main paper a version with fewer comparisons (maybe showing fewer base methods). If possible, please pick a range showing best and worst cases, to show the real impact of MatchGen.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology is interesting and novel. The paper is very well-written. The experiments are extensive. A few more details are needed to make sure the experimental settings are fair, but I am strongly leaning towards “accept”.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I thank the authors for addressing most of my comments. While several concerns are still outstanding, I believe the paper should be accepted at MICCAI.



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors propose a novel approach to generate better pseudo-healthy images at test time. It consists of minimising the l1 norm between the image reconstructed by the model and the input image, while keeping a regularisation term in the latent space to ensure pseudo-healthy generation.
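    As a rough illustration of this test-time refinement idea (not the authors' actual code), the sketch below optimises a latent code with an l1 reconstruction loss plus a simple latent-space regulariser; the specific regulariser form, step count, and learning rate are assumptions.

```python
# Rough sketch of the test-time refinement described above (assumptions, not the
# authors' code): optimise a latent code so the decoded image matches the input
# under an l1 loss, while a latent-space regulariser keeps the output pseudo-healthy.
import torch


def refine_pseudo_normal(encoder, decoder, image, steps=1000, lr=1e-2, reg_weight=0.1):
    z = encoder(image).detach().clone().requires_grad_(True)  # start from the model's embedding
    z_init = z.detach().clone()
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        recon = decoder(z)
        fit = (recon - image).abs().mean()       # l1 fit to the input image
        reg = (z - z_init).pow(2).mean()         # stay close to the initial (healthy) embedding
        loss = fit + reg_weight * reg
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder(z).detach()                   # refined pseudo-normal image
```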

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is novel because it focuses on inference rather than training to overcome main issues of unsupervised methods. The figure really helps to understand the method.

    The proposed approach is usable with several training procedures, which is a real plus.

    The approach has been implemented on top of several existing anomaly detection methods (autoencoders and GANs), tested on four datasets, and compared with several baselines. The proposed add-on seems to consistently improve the results of the baseline methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The method needs to be fine-tuned during inference, but there is no mention in the paper of the resources needed for such optimisation. How many steps are needed, and is this consistent across different samples and different datasets? How much time does it take to generate the pseudo-healthy scan using MatchGen?

    It is not clear to me how this framework works with GANs, as there is no encoder-decoder but rather a generator-discriminator pair.

    The authors' main claim is that the method reduces the number of false positives. The paper lacks a (simple) metric to highlight this, such as the FP rate or precision.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution of the paper is nice, and the proposed method explores new solutions, but it lacks information concerning the usability of the method (inference time). This needs to be clarified by the authors.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors clarified the points raised during the review phase, especially the information about time cost and usage. This should be added to the final version for more clarity and higher impact.




Author Feedback

We thank the reviewers for their comments, which will be addressed in our final paper. Reference numbers refer to our paper or the reviewers' comments.

1. Hyper-parameter Tuning

Hyperparameter tuning is necessary for most unsupervised methods in this field. The key difference is how it is performed: (a) use an annotated validation set containing abnormal images, as in [Bercea et al., 5,7,14,30,32,33] and our work; or (b) use a normal-only validation set, as in [Chen et al.]. Based on internal experiments, our method is robust to its hyperparameters. To compare fairly with the baselines, we adopt (a), which performs a grid search over given value ranges to select the hyperparameters with the best validation results. Final results are reported on held-out test sets to avoid inflated results. Table 1 shows our dataset subdivision.
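For concreteness, this kind of grid search can be sketched as follows; `evaluate_dice` is a hypothetical helper that runs detection on the annotated validation set for a given (ε, τ) and returns the validation Dice, and the grid resolution is an assumption.

```python
# Illustrative sketch of the grid search described above (not the actual tuning code):
# hyper-parameters are chosen by the best validation Dice over fixed value ranges.
import itertools
import numpy as np


def grid_search(evaluate_dice, eps_range=(0.0, 0.6), tau_range=(0.0, 1.0), n_steps=7):
    eps_grid = np.linspace(*eps_range, n_steps)
    tau_grid = np.linspace(*tau_range, n_steps)
    best_params, best_dice = None, -1.0
    for eps, tau in itertools.product(eps_grid, tau_grid):
        dice = evaluate_dice(eps, tau)        # validation Dice for this (eps, tau)
        if dice > best_dice:
            best_params, best_dice = (eps, tau), dice
    return best_params, best_dice
```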

2. Use Slices Without Anomalies

It is a well-established practice [7,9,14] to select slices without anomalies from abnormal 3D scans for training. The resulting noisy distribution is a common challenge for all methods. However, slice-based datasets remain effective for benchmarking 2D methods [7,8,30,32], especially when evaluation is done fairly on a held-out test set. The strength of our method, a typical 2D method, is also validated by its performance on inherently 2D data such as retinal fundus [20] and OCT [12] images.

3. Distribution Shift

Distribution shifts across slice locations are inherent to slice-based datasets. This reflects real-world scenarios where tumors appear at different anatomical locations. Following prior works [5,7,8,14,30], we do not control for slice locations. The key idea is to model the normal image distribution marginalized over locations rather than conditioned on locations, which promotes generalization w.r.t. slice locations. All methods share this setup to ensure a fair comparison.

4. Related Works

MatchGen (MG) differs clearly from the related works. [Bercea et al., Wolleb et al., Naval et al.] adopt a different paradigm that uses fixed reconstructions without test-time optimization (TTOPT) and does not explicitly reduce false positives (FP), whereas MG performs TTOPT to explicitly reduce FP by matching normal pixels. [Chen et al.] also tries to reduce FP, but it does not match pixels, and it requires a prior only available in models trained via the ELBO, limiting its compatibility with other models, whereas MG is better suited to plug-and-play use. Although [Chen et al.] is not compared with MG directly, [A,B,C] show that [Chen et al.] is consistently inferior to DAE [14], and DAE+MG outperforms DAE in Table 2. [A] Wijanarko et al., Tri-VAE: … Tumor MRI, CVPR 2024 [B] Marimont et al., DISYRE: Diffusion … Detection, ISBI 2024 [C] Pinto et al., Benchmarking … in Cardiac MRI, AS 2025

5. Time Cost and Usage

Our method finishes in under 10 seconds per image on a 4090 GPU. This is practical for clinics, where image analysis occurs alongside other diagnostics. The efficiency comes from fast gradient-based optimization of a simple loss function computed on a single image, which often converges within 1000 steps. Handling one image at a time also allows parallelization in batch-processing scenarios, such as batch analysis in a radiology department, a central server serving many clients in clinics, and automated data annotation.
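Because the refinement loss is computed independently per image, a set of test images can be refined in one pass by stacking them along the batch dimension. The toy sketch below illustrates this and reuses the hypothetical `refine_pseudo_normal` helper sketched under Review #3; it is an assumption, not the released code.

```python
# Toy sketch of batched test-time refinement: per-image losses are independent,
# so stacking images along the batch dimension refines all of them in one run.
import torch


def refine_batch(encoder, decoder, images):
    # images: list of (C, H, W) tensors or a single (N, C, H, W) tensor
    batch = torch.stack(images) if isinstance(images, (list, tuple)) else images
    return refine_pseudo_normal(encoder, decoder, batch)  # one parallel refinement pass
```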

6. Use Only Abnormal Images

Our dataset design, which uses only abnormal images for validation and testing, follows prior works [6,7,14,30,32]. This is appropriate because our task aims to distinguish abnormal pixels from normal ones. Since abnormal images contain both classes of pixels, using them for validation and testing accounts for both classes, so the test set does not lead to a biased evaluation. Using healthy images is not helpful, as they contain only normal pixels.

7. Others

(1) The “Implementation Details” section of our paper gives the code sources and parameter tuning of the baselines. The results in [Cai et al.] differ because theirs are reported on the validation set, whereas ours are on the test set. (2) MatchGen uses the encoder-decoder-based generator of the GAN, not the discriminator.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    This submission proposes MatchGen, a plug-and-play test-time refinement strategy designed to reduce false positives in reconstruction-based unsupervised anomaly detection (UAD). Reviewers appreciated the originality of the idea and its practical applicability across several datasets. While the method is promising, a number of key concerns require clarification before a final decision can be made.

    The most pressing issue is the use of an annotated validation set containing abnormal images for hyperparameter tuning. This setup contradicts the premise of unsupervised anomaly detection. The training procedure also warrants attention: selecting slices without visible tumors from BraTS does not guarantee healthy anatomy, as such slices may still exhibit mass effect or other tumor-induced deformations. This introduces potential bias and undermines the assumption of a clean training distribution. Furthermore, training and testing on different anatomical slice locations can lead to distribution shifts that are not acknowledged or controlled for.

    Another major point raised by reviewers is the lack of comparison to several conceptually related approaches—particularly those leveraging morphing-based refinement or normative priors. Since no additional experiments can be requested at this stage, it becomes all the more important that the authors clearly explain how MatchGen differs from these existing methods, both in terms of architecture and test-time optimization philosophy.

    Finally, the paper does not discuss the computational cost introduced by the test-time optimization step. Even without precise runtimes, a brief qualitative assessment of its efficiency and feasibility in clinical or real-world settings would help contextualize the method’s applicability.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed scores during the initial round of reviews. The authors addressed most of the concerns raised, and I therefore recommend the acceptance of this work, despite several weaknesses identified by the reviewers (which I believe could be addressed in the final camera-ready version). I thus strongly encourage the authors to consider the key points suggested by the reviewers (e.g., running times or an extended discussion of existing methods). From my side, this approach consistently improves over several existing methods, so comparison to other SoTA approaches is not necessary (though it could be interesting to explore in a potential journal extension).



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    MatchGen is a test-time plug-in that refines any reconstruction-based unsupervised anomaly detector by optimising a latent code so that the reconstruction fits normal pixels yet stays on the healthy manifold. On four public datasets and five VAE/DAE/GAN baselines it boosts Dice and halves false-positive pixels, adding ≈10 s per image (parallelisable).

    During the first review round the scores were 1 (Weak Accept), 3 (Weak Reject), 4 (Weak Accept). After rebuttal, R1 revised to 1 (Accept), R3 to 1 (Accept); R2 kept the original Weak Reject. The rebuttal clarified that (i) the two hyper-parameters are tuned exactly as in previous unsupervised-detection papers that also rely on annotated validation sets, so fairness is preserved; (ii) the extra optimisation takes about 10 s and converges in ≤1000 gradient steps; (iii) plug-and-play compatibility with diffusion-based detectors is future work, but MatchGen already improves five classical baselines.

    With two Accepts, one remaining Weak Reject rooted in presentation rather than methodology, and no outstanding technical objections, the AC recommends acceptance (poster). The camera-ready should add the timing figures, cite recent diffusion-based UAD methods, and detail the validation-set tuning, but no further experiments are required.


