Abstract
An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link textual and visual features.
Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at https://github.com/mahshid1998/MadCLIP.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1787_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mahshid1998/MadCLIP
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ShiMah_MadCLIP_MICCAI2025,
author = { Shiri, Mahshid and Beyan, Cigdem and Murino, Vittorio},
title = { { MadCLIP: Few-shot Medical Anomaly Detection with CLIP } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {424--434}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces MadCLIP for few-shot anomaly detection. The authors propose to use a pretrained ViT with two sets of adapters, for healthy and abnormal images respectively. Each adapter is attached to a pre-selected layer of the ViT and consists of linear layers followed by an activation, with the rest of the ViT frozen. The trainable layers are fine-tuned alongside learnable text embeddings for each class. The main contribution is the use of the sigmoid-based SigLIP loss in place of the standard softmax-based CLIP loss. The authors report promising few-shot performance in image- and pixel-level AUC-ROC; however, the reproducibility of these results is uncertain.
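For context on the distinction this review draws, below is a minimal sketch contrasting a softmax-based CLIP-style loss with a sigmoid-based SigLIP-style loss; tensor shapes, temperature/bias values, and function names are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # Softmax contrastive loss: each image is matched to exactly one text
    # in the batch (and vice versa), so unpaired prompts are hard to handle.
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def siglip_style_loss(img_emb, txt_emb, pair_labels, t=10.0, b=-10.0):
    # Sigmoid pairwise loss: every image-text pair is scored independently,
    # so several images can legitimately match the same prompt (many-to-one).
    # pair_labels is a (B, T) matrix of +1 for matching pairs and -1 otherwise.
    logits = img_emb @ txt_emb.t() * t + b                  # (B, T)
    return -F.logsigmoid(pair_labels * logits).mean()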
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The methodology is well described and clear enough to follow for reimplementation
- Strong few-shot performance
- Relatively elegant loss for the dual branches (difference of pairwise dot-products)
- While I am unsure this is indeed the first use of SigLIP across the entire field of anomaly detection in medical imaging, this work is still a helpful case study of the empirical performance of SigLIP.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The experiments section is not clearly described, and the results seem to come from single runs with no standard errors/deviations, which makes reproducibility questionable.
- The ablation metrics also do not have any confidence associated with the performance drops, making it difficult to ascertain the consistency of the results.
- Metrics are limited. The authors provide image-level and pixel-level AUC but no standard deviations across multiple runs. There are myriad other metrics for segmentation in anomaly detection, such as the AUC of the Per-Region Overlap (PRO) or of the Precision-Recall curve. These metrics are better suited to the class-imbalance scenario where the image contains mostly background pixels (see the small illustration below).
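To illustrate the reviewer's point about class imbalance, a small self-contained comparison on synthetic scores (not the paper's data) showing how AUROC can look flattering while average precision exposes the imbalance:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
# 1% "anomalous" pixels with only mildly separated scores.
labels = np.concatenate([np.zeros(9900), np.ones(100)])
scores = np.concatenate([rng.normal(0.0, 1.0, 9900), rng.normal(1.5, 1.0, 100)])

print("pixel AUROC:", roc_auc_score(labels, scores))             # fairly high
print("pixel AUPRC:", average_precision_score(labels, scores))   # much lower under imbalance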
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Section 2 - Dual-branch Architecture: In the last paragraph, the authors aggregate the scores and weight them by $1/|i|$. What is $i$? I am guessing it refers to the spatial norm of the grid size (or something to that effect). Please clarify this for the reader.
- A table for the ablation analysis would be greatly helpful for the reader.
- The method is clearly described and reimplementing it should be straightforward, but the experiments may be more difficult to reproduce, as we do not know how the few-shot samples were collected from the datasets. Are they random? A fixed subset?
- Additionally, since multiple runs were not made, it is unclear whether the results reported in this paper will be reproducible at all.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is generally well written, with the methodology clearly described. However, the experiments/results sections fall a bit short. The authors could add more details about how the data were selected and how the experiments were performed. The authors should also clarify whether multiple runs were performed for the main few-shot analysis in Tables 1 and 2. Ideally, time permitting, confidence intervals/standard deviations would be needed for this to be a strong accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
(1) A novel few-shot AD architecture with multilevel adapters, each focusing on either normal or abnormal instances, enhanced by a dual optimization objective utilizing learnable text embeddings for better separation. This approach does not require extensive synthetic data or memory banks, unlike SOTA methods. (2) This is the first application of SigLIP loss in medical AD, proving its effectiveness. (3) Strong generalization and improved performance are demonstrated through extensive validation and cross-dataset evaluation across diverse medical modalities and anatomical areas.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- A dual-branch architecture is introduced to independently extract normal and abnormal features using learnable adapters within the CLIP vision encoder.
- To enhance semantic alignment, learnable text prompts are employed.
- The SigLIP loss is utilized to efficiently address the many-to-one relationship between images and unpaired text prompts.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Tables 1 and 2, it seems that APRIL-GAN exceeds MadCLIP in all the settings. Please explain why.
- The limitations of MadCLIP should be discussed.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper introduces MadCLIP, a novel few-shot anomaly detection (AD) architecture that employs multilevel adapters to effectively capture both normal and abnormal visual features. Additionally, it utilizes learnable text embeddings to represent the distributions of these features. MadCLIP adopts a dual optimization approach, modeling normal and abnormal representations separately to enhance multimodal (text and vision) similarity within each class while minimizing it across classes. This approach simplifies decision-making by subtracting learned feature representations, resulting in improved class separation. Extensive validation and cross-dataset evaluations across diverse medical modalities and anatomical regions demonstrate MadCLIP’s strong generalization capabilities and superior performance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This work adapts the CLIP vision encoder for medical anomaly detection by applying sets of linear layers with ReLU to the frozen CLIP output. The approach uses two separate vision encoders to capture normal and abnormal features, aligning them with learnable text prompts via a cosine similarity loss to enforce feature space separation between normal and abnormal features. Anomaly maps and scores are optimized using Dice and Focal loss for segmentation, and SigLIP loss for classification. The method is evaluated against recent unsupervised and few-shot baselines, demonstrating strong performance.
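As a rough illustration of how such a dual-branch design can turn patch features into an anomaly map, here is a minimal sketch; the adapter shape, the softmax over the two similarities, and all names are assumptions rather than the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    # Bottleneck adapter on top of frozen CLIP features: linear -> ReLU -> linear.
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def anomaly_map(patch_feats, normal_adapter, abnormal_adapter, txt_normal, txt_abnormal):
    # patch_feats: (N, D) frozen patch tokens; txt_*: (D,) text embeddings.
    # Returns a per-patch anomaly probability in [0, 1].
    f_n = F.normalize(normal_adapter(patch_feats), dim=-1)
    f_a = F.normalize(abnormal_adapter(patch_feats), dim=-1)
    sim_n = f_n @ F.normalize(txt_normal, dim=0)    # similarity to the "normal" prompt
    sim_a = f_a @ F.normalize(txt_abnormal, dim=0)  # similarity to the "abnormal" prompt
    return torch.softmax(torch.stack([sim_n, sim_a], dim=-1), dim=-1)[..., 1]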
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This submission presents an interesting approach to anomaly detection by adapting vision and text encoders for medical data and aligning the learned representations. The method demonstrates robustness across multiple datasets and often outperforms baseline approaches.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Some aspects of the paper could be clarified:
- The specific cohorts used for training and testing are unclear. Could the authors provide sample numbers?
- Clarification on which elements of the adaptation method are novel and which are adapted from previous works would be helpful—e.g., are the adapters the same as those used in MVFA?
- There is limited information on the learnable text prompts for normal and abnormal samples. Could the authors provide more details on these?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The methods are interesting and the results are strong, but the clarity of the presentation could be improved, as outlined in the weaknesses.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive feedback. Below, we address their concerns, which will be incorporated into the camera-ready version.
Experimental Clarity & Metrics (R1): We follow the evaluation strategy of SOTA, e.g., MVFA [13], MediCLIP [35], DRA [7], therefore, we use the same metrics ensuring fair and consistent comparisons. While we acknowledge that metrics like PRO and Precision-Recall can be informative too, we prioritize consistency with prior work to ensure comparability.
Results’ reproducibility and splits (R1&R3): The exact sample indices (selected following the strategy of MVFA [13]) will be released alongside our code, enabling reproducibility. We will also provide trained model checkpoints and all necessary data splits. The number of training samples is reported in the tables or their captions, i.e., 2, 4, 8, and 16, while each dataset's test set is fixed and shared by all methods.
Definition of |i| (R1): The variable $i$ indexes the feature levels of the visual encoder to which adapters are added, and $|i|$ denotes the total number of adapters in the architecture, as defined in “Vision Adapters” (Sec. 2). We will further clarify this in the camera-ready.
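For concreteness, the kind of level-wise averaging described here could look like the sketch below; names are hypothetical, and only the 1/|i| weighting is taken from the rebuttal.

import torch

def aggregate_levels(level_maps):
    # level_maps: list of (H, W) anomaly maps, one per adapted feature level i.
    # The final map is their sum weighted by 1/|i|, i.e. the mean over levels.
    return torch.stack(level_maps, dim=0).mean(dim=0)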
Ablation Study & Presentation (R1): We conducted six ablation experiments across all six datasets to evaluate the individual contributions of each model component. Due to space limitations, we summarized the results in-text. We agree that presenting this in tabular form would enhance clarity.
Confidence intervals (R1): Thanks for the suggestion. The experimental study strictly follows and replicates the implementations of prior works (i.e., the methods listed in the tables) for fair comparison. Per MICCAI rules, we cannot provide additional results in this rebuttal without risking desk rejection.
MadCLIP vs. APRIL-GAN [5] (R2): MadCLIP outperforms APRIL-GAN on all datasets and five modalities—except OCT17—in both AC and AS metrics. E.g., with 16 training samples, MadCLIP achieves on average 5.31% higher AC AUC and 0.96% higher AS AUC. On OCT17, APRIL-GAN performs slightly better in AC (by 0.22%), though both models exceed 99% and are thus satisfactory. We attribute APRIL-GAN’s marginal advantage on OCT17 to the dataset’s low variability, which may benefit memory bank-based methods that compare training and test data directly.
Limitation of MadCLIP (R2): Our method assumes that learnable adapters and prompts are sufficient to bridge the gap between medical images and textual descriptions. However, the well-known modality gap between vision and language remains unaddressed explicitly. This may limit alignment quality and generalization across datasets. Future work will focus on this possible issue.
Novelty and Architecture Clarification vs. MVFA [13] (R3): Unlike MVFA’s separate adapters for AC and AS, we propose a shared adaptation layer over CLIP embeddings, followed by lightweight task-specific heads. This design enables shared representation learning while preserving task-specific distinctions. Additionally, our dual-branch architecture does not rely on an external memory bank, a core dependency in MVFA, thereby simplifying the pipeline and reducing computational overhead. We are the first to address the many-to-one text–image alignment challenge using the SigLIP loss formulation. While MVFA relies on a fixed set of handcrafted prompts to align visual and textual features, we use learnable prompts (details are below).
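A minimal sketch of the shared-adapter-plus-heads idea described above (layer sizes and names are assumptions; the actual design is in the paper and released code):

import torch.nn as nn

class SharedAdapterWithHeads(nn.Module):
    # One shared adaptation layer over CLIP embeddings, followed by
    # lightweight heads for classification (AC) and segmentation (AS).
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.head_ac = nn.Linear(dim, dim)  # image-level features
        self.head_as = nn.Linear(dim, dim)  # pixel-level features

    def forward(self, x):
        z = self.shared(x)
        return self.head_ac(z), self.head_as(z)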
Learnable text prompts (R3): Definitions are in Sec. 2; here, we provide illustrative examples. Prompts follow the format: [learnable tokens][CLS][Objective], where [CLS] is a fixed descriptor (e.g., flawless for normal, diseased for abnormal) and [Objective] specifies the modality (e.g., brain). We use multiple synonyms (e.g., perfect for flawless, damaged for diseased, to be released with the code) and aggregate the resulting prompts to enhance AD.
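To make the prompt format concrete, a CoOp-style sketch of how [learnable tokens][CLS][Objective] might be assembled at the embedding level (token counts, initialization, and names are illustrative assumptions):

import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    # Learnable context vectors prepended to the token embeddings of a fixed
    # descriptor (e.g. "flawless" / "diseased") and a modality word (e.g. "brain").
    def __init__(self, n_ctx, embed_dim):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, descriptor_emb, objective_emb):
        # descriptor_emb: (L1, D), objective_emb: (L2, D) token embeddings of the fixed words.
        return torch.cat([self.ctx, descriptor_emb, objective_emb], dim=0)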
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A