Abstract

In this work, we leverage informative embeddings from foundation models for unsupervised anomaly detection in medical imaging. For small datasets, a memory bank of normative features can be used directly for anomaly detection, as has been demonstrated recently. However, this is unsuitable for large medical datasets, as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture Model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as an anomaly score to create a coarse anomaly segmentation mask. Our experiments show that, combined with the DPMM, DINOv2 embeddings achieve very competitive anomaly detection performance on medical imaging benchmarks, despite DINOv2 being trained on natural images, while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally better aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them well-suited representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
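To make the scoring step concrete, here is a minimal sketch of patch-level anomaly scoring against DPMM component centers, assuming embeddings and centers are already extracted and L2-normalized; the function name and interface are illustrative, not the repository's API.

```python
import numpy as np

def anomaly_map(embeddings, centers, grid_hw):
    """Coarse anomaly map from patch embeddings and DPMM component centers.

    embeddings: (N, D) DINOv2 patch embeddings of one image, L2-normalized.
    centers:    (K, D) DPMM component centers, L2-normalized.
    grid_hw:    (h, w) patch-grid shape with h * w == N.
    """
    sim = embeddings @ centers.T        # (N, K) cosine similarities
    # A patch is normal if it is close to at least one component center,
    # so its anomaly score is one minus its best similarity.
    score = 1.0 - sim.max(axis=1)       # (N,)
    return score.reshape(grid_hw)       # coarse (h, w) anomaly map
```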

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2425_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/NicoSchulthess/anomalydino-dpmm

Link to the Dataset(s)

https://github.com/DorisBao/BMAD

BibTeX

@InProceedings{SchNic_Anomaly_MICCAI2025,
        author = { Schulthess, Nico and Konukoglu, Ender},
        title = { { Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {45 -- 55}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel approach that uses embeddings extracted from a foundation model, specifically DINOv2, to model the normality of various types of medical images. This is achieved through a DPMM fitted on feature patches from normal data. At test time, the cosine similarity between the DPMM cluster centers and the new embeddings is computed to generate an anomaly score for each patch, thus creating a coarse anomaly segmentation map. Finally, a threshold-based method is applied to obtain the final segmentation of the anomalies.
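    As a companion to the scoring sketch above, the thresholding step the reviewer describes can be pictured as follows; the upsampling choice and the default threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def segment_anomalies(coarse_map, image_hw, threshold=0.5):
    """Binarize a coarse patch-level anomaly map into a pixel-level mask.

    coarse_map: (h, w) patch-grid anomaly scores.
    image_hw:   (H, W) target image resolution.
    threshold:  score cut-off; the value 0.5 is illustrative only.
    """
    scale = (image_hw[0] / coarse_map.shape[0],
             image_hw[1] / coarse_map.shape[1])
    upsampled = zoom(coarse_map, scale, order=1)  # bilinear upsampling
    return upsampled > threshold                  # boolean segmentation mask
```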

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Substantially reduces memory consumption compared to well-known memory-based approaches.
    • Achieves AUROC results in line with the state of the art, while maintaining faster execution.
    • Effectively reduces the number of prototypes used, achieving better performance compared to one-shot settings of other approaches.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • AUPR scores indicate that methods such as PatchCore are significantly more accurate for the positive class.
    • Dice scores are reported only in comparison with AnomalyDINO.
    • Image-level metrics are not reported.
    • The evaluation lacks qualitative analysis. See the detailed comments section below for specifics on each point.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The proposed approach extends AnomalyDINO by using a DPMM instead of a memory bank to store a small set of prototypes; the results are better than SOTA approaches in the “similar-sized coreset” setting (using around 150 prototypes), with a competitive runtime and similar memory requirements. However, several missing aspects in the evaluation require further analysis. The AUPR scores reveal a substantial performance gap between the proposed method and other SOTA approaches such as PatchCore, with a loss of over 23% compared to the best-performing version of PatchCore; this remains true in the few-shot settings as well, where all variants of PatchCore obtain better results. This is really important, since the background-to-foreground imbalance typical of this domain can make results measured with AUROC misleading, and it suggests that PatchCore is far more effective at correctly identifying anomalous regions. This cannot be directly compared using Dice scores, since they are reported only in comparison with AnomalyDINO, limiting the scope of the segmentation performance evaluation. Furthermore, the analysis lacks image-level metrics such as Image-AUROC, so we cannot assess the ability to actually separate normal images from anomalous ones. I think that, since the anomaly segmentation is derived from the anomaly map, it would be relevant to show some qualitative examples of segmented regions to better understand how well this approach performs anomaly segmentation. Finally, for the γt and tπ parameters, there is no explanation of how they were selected. Minor comment: I suggest clearly stating the contributions of this work and the differences with respect to AnomalyDINO in the introduction.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As previously mentioned, the evaluation lacks several key aspects. Furthermore, the gains in execution time and memory efficiency do not justify the loss in AUPR.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I believe that the concerns raised during the first review stage were not adequately addressed. Most of the points were dismissed citing space limitations, rather than being substantively resolved. Therefore, my evaluation remains unchanged.



Review #2

  • Please describe the contribution of the paper

    The paper addresses the challenge of unsupervised anomaly detection in medical images. The proposed method builds on AnomalyDINO, an existing memory bank-based approach that struggles to scale efficiently for large 3D medical image datasets. To overcome this limitation, the authors introduce an efficient modification: rather than storing a memory bank of patch embeddings for the entire dataset, their method trains a Dirichlet process mixture model (DPMM) to approximate the distribution of normal patch embeddings. This adaptation significantly reduces runtime and memory usage while maintaining high anomaly detection performance.
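    The fitting step the reviewer summarizes can be approximated with off-the-shelf tooling. The paper's own DPMM implementation differs, but as a rough stand-in, scikit-learn's BayesianGaussianMixture with a Dirichlet process prior illustrates how truncated variational inference deactivates unneeded components; the input file and the weight cut-off below are hypothetical.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical dump of DINOv2 patch embeddings from normal training images.
patch_embeddings = np.load("normal_dino_patches.npy")  # (N, D)

# Truncated variational DPMM: n_components is an upper bound, and the
# Dirichlet process prior drives the weights of unneeded components to zero.
dpmm = BayesianGaussianMixture(
    n_components=256,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
)
dpmm.fit(patch_embeddings)

# Effective prototypes: centers of components with non-negligible weight.
active = dpmm.weights_ > 1e-3          # illustrative cut-off
prototypes = dpmm.means_[active]
print(f"{active.sum()} active components out of {len(active)}")
```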

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using DPMM is a reasonable approach to improve efficiency of AnomalyDINO.
    • Authors evaluate their method on a well-established BMAD benchmark.
    • The paper demonstrates strong results of AnomalyDINO and the proposed modification in medical anomaly detection, which is an interesting proof-of-concept experiment showing that DINOv2 features of medical images are to some extent meaningful even without any fine-tuning.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I have a general concern about the evaluation of unsupervised models for medical anomaly detection (AD). Both my experience and common sense suggest that supervised models, when trained on sufficient labeled data, significantly outperform all unsupervised approaches. Given this, what practical benefits do unsupervised anomaly detection models offer?

    In my view, a meaningful evaluation of unsupervised AD models — one with real-world relevance — should involve distilling their knowledge into a conventional neural network (e.g., a UNet), fine-tuning it in a supervised manner, and then comparing its performance to other supervised baselines.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a novel method that demonstrates improved runtime and memory efficiency while maintaining competitive performance with state-of-the-art (SOTA) approaches for medical anomaly detection. However, I am not sure about the practical significance of this work (as elaborated in my comment regarding its major weaknesses).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper performs anomaly detection by fitting a Dirichlet Process Mixture model (DPMM) to “normal” DINOv2 embeddings and calculating the cosine similarity between test embeddings and the nearest component center. While DINOv2 embeddings have previously been proposed for anomaly detection (on non-medical images), this work utilizes the DPMM to decrease the computational requirements compared to the full-shot setting and increase performance compared to the one-shot setting.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed methodology provides a low-cost and accessible approach to anomaly detection.
    2. The utilization of foundation model embeddings for medical image anomaly detection is an important research direction and will be of interest to the community.
    3. I appreciated that the authors used a DPMM instead of modeling the entire distribution as a Gaussian, which is a common practice in out-of-distribution detection. This is not only much less expensive at inference, but also (likely) models the training distribution more effectively.
    4. As the proposed algorithm is patch-based, it is able to provide anomaly segmentation maps, in addition to the anomaly detection.
    5. The manuscript provides a robust literature search, a nicely-composed graphical abstract, compares their work to 15 different baselines (some derivatives of others) on 3 public datasets, performs an ablation study on distances, and considers the computational expense of the algorithms.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The manuscript does not convincingly demonstrate the benefits of the proposed methodology over previous methodologies. While it does improve the computational expense over full-shot AnomalyDINO, full-shot AnomalyDINO’s expense is not exorbitant (e.g., 33GB RAM and ~7 seconds/image for inference on BraTS) and it has better performance, which is arguably more important in a clinical setting. Additionally, other compared methods outperform the proposed methodology while having only relatively higher computational expense.
    2. The manuscript does not provide a measure of uncertainty and does not support its claims in the results with statistical analyses.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Table 1 is hard to parse in its current format due to the large amount of data presented. Any visual cues you can add to help the reader interpret the table would be helpful, such as bolding the best values.
    2. In addition, the runtime in Table 1 is hard to interpret because it is reported over the entire testing dataset. Including the size of the testing datasets or presenting this information per image would be helpful; the runtime for one image is of most interest in a clinical setting.
    3. Please indent all the paragraphs after the first to improve readability, especially on page 2.
    4. While most of the manuscript was well written, there were several sentences that could be made more clear/grammatically correct (e.g., “The comparison with large memory banks usually leads to accurate anomaly detections, however at the cost of substantially long runtimes and high memory utilization.”).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite the lack of convincing performance and robust evaluation (measures of uncertainty/statistical analyses), the use of DINO embeddings would be of interest to the anomaly and out-of-distribution detection communities within MICCAI. For relying on a pre-trained natural imaging extractor, the algorithm performs fairly well in terms of both performance and computational expense.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    While the evaluation of the results is lacking, the manuscript presents a novel formulation to improve the memory requirements of AnomalyDINO. Due to its use of foundation model embeddings, it will be of interest to the out-of-distribution detection and unsupervised anomaly detection communities within MICCAI.

    This decision is in line with MICCAI’s call for papers, which states that the conference is looking for innovative methodologies of interest to the community with clear contributions over previous methodologies (in this case, those contributions are memory-related) and that the evaluation of these methodologies may be limited.




Author Feedback

Dear reviewers,

We appreciate your valuable feedback and the time invested in reviewing our work. Below, we address the raised concerns.

Performance gap to SOTA (R1, R3): The reviewers correctly point out a substantial performance gap between our method and the best-performing versions of PatchCore and AnomalyDINO, particularly in terms of AUPR. However, in the “few prototypes” setting, we (120-150 prototypes) outperform PatchCore with 150 prototypes and AnomalyDINO with 1024 prototypes. This makes our approach more prototype-efficient, which is especially important when heavily compressing the memory bank. Furthermore, our approach adapts the number of prototypes to the data and does not require a hyperparameter search (which is also not possible in a strictly UAD setting). We agree with R3 on the importance of achieving high AD performance in clinical settings, but the computational load should also be limited. The size of AnomalyDINO’s memory bank is proportional to the dataset size. The BraTS dataset used here contains 7500 normal 2D images for training. Datasets in practice will be much larger to improve robustness against contrast changes and patient variability, which would substantially increase the computational load for AnomalyDINO and PatchCore. For clarity, we calculated inference times for a single 2D image and for a volume consisting of 200 slices (for BraTS):
  • STFPM: 122 ms / 24.4 s
  • PatchCore (full): 202 ms / 40.4 s
  • AnomalyDINO (full): 422 ms / 84.4 s
  • Ours: 39 ms / 7.8 s

Lack of image-level metrics (R1): Not reporting image-level metrics was a deliberate design choice. Due to space restrictions, we chose to focus on pixel-level metrics because we believe they pose the harder challenge for UAD models.

Evaluation of anomaly segmentation (R1): The reviewer points out that the segmentation evaluation shows limited insights as we only compare with AnomalyDINO and do not provide anomaly maps. As our work is an improvement over AnomalyDINO, we only reported baseline segmentation scores for AnomalyDINO. Due to space restrictions, we did not add other baselines. A qualitative comparison of anomaly maps would have provided an intuitive understanding of the segmentation capabilities. Due to space constraints, we decided to report average segmentation metrics instead of cherry-picked anomaly maps.

UAD concern (R2): We acknowledge that the performance of UAD methods falls behind that of finetuned models. Firstly, we have a scientific motivation for exploring UAD methods: for humans, it is very easy to identify abnormalities, whereas neural networks struggle greatly with this task. Thus, reaching human-like performance would be a huge leap for the field of machine learning. Secondly, our practical motivation: finetuning is only possible for well-known diseases with existing annotated data, while anomaly detection can go beyond known conditions. It is not feasible to obtain labels for every rare disease to finetune a distilled model.

Lack of uncertainty analysis (R3): This is an omission on our part due to time constraints. We are keen to include this in the final version.

Selection of hyperparameters (R1): From experiments on a synthetic toy dataset, we determined that γt should be chosen sufficiently small when not all modes of the distribution are represented in each batch. Setting γt to 0.2 yielded a good trade-off between stability and convergence. On real data, visualizations of the data distribution and the obtained model indicated a good fit. After convergence, the distribution of πk was bimodal, with one mode between 1e-1 and 1e-4 and the other below 1e-40. tπ was selected to split off the mode below 1e-40.
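In code, the tπ selection described above amounts to discarding components whose converged weight falls in the lower mode; a minimal sketch, with an illustrative threshold rather than the authors' exact value:

```python
import numpy as np

def prune_components(centers, weights, t_pi=1e-20):
    """Drop DPMM components whose converged mixture weight pi_k is negligible.

    With a bimodal weight distribution (one mode between 1e-1 and 1e-4,
    the other below 1e-40), any t_pi between the two modes separates used
    from unused components; 1e-20 is an illustrative choice, not the paper's.
    """
    keep = weights > t_pi
    return centers[keep], weights[keep]
```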

Remarks on readability and clarity (R1, R3): We thank you for the suggestions to improve the presentation of the results in Tab. 1, to indent the paragraphs, and to fix the grammatical issues! We will definitely implement these in the final version.

We hope that our responses sufficiently address your concerns.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The reviewers appreciate the general direction of the paper—leveraging DINOv2 embeddings and replacing a memory bank with a Dirichlet Process Mixture Model (DPMM)—as a promising step toward more scalable anomaly detection in medical imaging.

    However, key concerns were consistently raised across reviews. Most notably, the method’s benefits over prior work (including AnomalyDINO and PatchCore) are not convincingly demonstrated. While the proposed approach improves computational efficiency, this comes at a notable cost in detection performance. The authors should clearly justify the trade-offs between efficiency and performance and explain where this method is expected to be most beneficial. Clarifying how hyperparameters were selected and providing qualitative visualizations of anomaly maps would also help strengthen the submission. Addressing these points constructively in the rebuttal will help clarify the method’s contribution.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers agree that the paper introduces a clearly novel and technically sound substitution of AnomalyDINO’s large memory bank with a Dirichlet-process mixture, cutting inference time and GPU memory by an order of magnitude while preserving competitive AUROC. They praise the method’s principled formulation, literature positioning, and relevance to unsupervised medical anomaly detection. However, two substantive evaluation gaps persist after rebuttal. First, AUCPR drops by up to 25 pp relative to PatchCore in small-lesion scenarios, a clinically critical metric that deserves deeper analysis rather than a brief runtime trade-off argument. Second, only pixel-level metrics are reported; study-level AUROC/AUCPR and example maps would make clinical utility clearer. Minor concerns include the absence of uncertainty analysis, limited segmentation comparisons, and lack of released code. Balancing clear methodological novelty against open performance questions, I recommend accept, and strongly encourage the authors to expand evaluation and release code in the camera-ready or a future version of the work.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The validation in the experimental results is too weak to support the claimed strengths of the proposed method. First, pixel-level AUC is not a strong metric for evaluating pixel-level anomaly detection in medical imaging, since the classes are highly unbalanced. Also, the proposed method is not SOTA and there are considerable performance gaps. Thus, I think the impact on the MICCAI community is limited.


