Abstract
Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1250_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhoYue_UltraAD_MICCAI2025,
author = { Zhou, Yue and Bi, Yuan and Tong, Wenjuan and Wang, Wei and Navab, Nassir and Jiang, Zhongliang},
title = { { UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {628--638}
}
Reviews
Review #1
- Please describe the contribution of the paper
The proposed method uses pretrained image and text encoders in a vision-language few-shot anomaly detection task for ultrasound data. The authors also use few-shot adaptation for finer discrimination between different types of anomalies.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Challenging task and good performance
- Uses diverse data sources
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors do not seem to mention or compare to methods such as MediCLIP [1], which also performs few-shot anomaly detection for ultrasound data.
- As far as I understand, the task in the proposed method is actually closer to few-shot supervised learning (as opposed to anomaly detection, which is generally considered unsupervised), since it uses different classes of real anomalies and their label masks (?). Other few-shot anomaly detection approaches such as MediCLIP use only healthy data, even in adaptation.
- Because there are so many elements, some explanations and the core contributions are a bit unclear. There are a few typos throughout (e.g. Lable, anormaly).
[1] Zhang, Ximiao, et al. “MediCLIP: Adapting CLIP for few-shot medical image anomaly detection.” MICCAI, 2024.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty seems somewhat limited when compared to other methods such as MediCLIP, which also use CLIP for few-shot anomaly detection in ultrasound data. The descriptions of all the elements and their combination could also be a bit more clear.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors state that existing methods, specifically MediCLIP, are only used for in-domain tasks, whereas the proposed method can perform cross-domain tasks (better generalization). However, in the original MediCLIP paper there are some experiments where the training and test tasks involve different modalities. These are hard to compare directly, but it appears the authors have run new experiments showing that the proposed method outperforms MediCLIP. They also agree to clarify the differences between their work and MediCLIP.
Review #2
- Please describe the contribution of the paper
The paper proposes UltraAD, a few-shot approach to adapt VLMs like CLIP for both anomaly classification and localization using learnable prompts and multiscale features. The proposed method is trained with one public breast ultrasound dataset and tested on unseen breast ultrasound scans with domain variations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The paper is well-written and easy to follow. 2) The proposed components are simple but effective, as shown by the ablation study. 3) The proposed method outperforms state-of-the-art anomaly detection and classification methods tailored for VLMs.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) The proposed method is only evaluated on one type of anatomy (breast ultrasound scans), whereas other methods like AnomalyCLIP propose a more generalized model capable of localizing different medical anomalies in a zero-shot manner. 2) The learnable class token conditioned on the image features highly resembles CoCoOp (1), yet it is not referenced directly or indirectly in the manuscript. Moreover, it is not clearly explained why this would be better than having a trainable context (i.e. “a photo of a”) like CoOp. 3) What is MiniNet composed of? Is it an MLP or a single layer? More details concerning this should be given. 4) The parameter cost of the proposed method should be elaborated. For example, the framework includes a learnable adapter for the patch tokens at each of the selected layers, while other methods use the patch tokens as is. The higher parameter cost might be the reason why the method performs better than the others. 5) The methodology for engineering text prompts related to ultrasound pathological features was not clearly described.
(1) Zhou, Kaiyang, et al. “Conditional prompt learning for vision-language models.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1) There’s a typo in Fig. 1 (“lable” –> “label”). 2) AnomalyCLIP is mentioned twice in Table 2. 3) The scores for the final method mentioned in Section 3.3 seem to be inconsistent with the scores presented in Tables 2 and 3 (i.e. the pixel AUROC in Sec. 3.3 is 93.4 while in Table 3 it is 93.9). 4) The authors mention using an in-house breast ultrasound dataset; however, they do not mention whether it will be made publicly available for reproducibility.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
My overall score is based on concerns regarding the limited generalizability, unclear methodological choices, and lack of detail in key components. The evaluation is restricted to a single anatomy, similar ideas from prior work like CoCoOp are not properly cited, and important design elements such as MiniNet and the prompt engineering process are not clearly explained. These issues limit the clarity, novelty, and broader impact of the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed all of my original comments, and their clarifications will significantly improve the paper’s clarity and contribution. Overall, the work tackles the critical challenge of domain variations in ultrasound, which is a highly important aspect in clinical settings where models must generalize across diverse imaging conditions and sources.
Review #3
- Please describe the contribution of the paper
They propose a novel method for few-shot anomaly classification and localisation, using both the intermediate and final outputs of a CLIP model together with various adaptation mechanisms to distinguish between normal ultrasound images, those with benign tumours, and those with malignant tumours. They validate the method in a challenging setting, taking both the training data and the few-shot examples from a different dataset collected with a different scanner model.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
S1. The wide availability of medical ultrasound makes it an important modality to research, particularly for methods such as few-shot classification and segmentation, as they can more easily be applied to new classes (without as high a data requirement).
S2. The realistic, challenging evaluation using test data from a different scanner data distribution makes the results more representative of the method’s clinical use.
S3. The adaptation of mainstream computer vision methods to the medical domain, such as the integration of the segmentation component of the VAND ‘23 challenge-winning method APRIL-GAN, is important, as much of the work involving vision-language models is limited to natural images.
S4. The results are strong, consistently ahead of other methods in both anomaly detection and classification. In segmentation the proposed method is generally ahead, only being beaten in the 4-shot case for one model by 0.2 AUROC (scored out of 100); moreover, in 4 of the 5 cases where the proposed method is best, it exceeds the second-best method by more than this margin.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
W1. There are multiple typos or errors which hinder readability, particularly in the tables (listed here in the order they appear):
- In a number of places in the document the authors write ‘fewshot’ instead of ‘few-shot’, most obviously at the top of Fig 1: “Pre-Load Fewshot Features”
- At the bottom of the preload figure section it says “Onehot Lable” instead of “Onehot Label”
- In the anomaly detection section of Table 2, ClipAdapter has the citation for AdaCLIP. I’m not sure which way round the error is here; I would assume the name is wrong, as changing it to AdaCLIP would make the methods list match the anomaly localization methods in Table 3.
- Also in the anomaly detection section of Table 2, two methods are labelled “AnomalyCLIP”. I assume the second is the error, as it has the citation for MVFA.
- The final paragraph says “anormaly” instead of “anomaly”
W2. Some parts of the method are not fully described. The most important is the “mask-guided post-processing technique” mentioned at the end of the “Anomaly Detection and Classification” experiments subsection. You mention that this improves results, but there is no description of what it is. The method diagram also does not indicate any further computation on the segmentation maps after prediction using cosine similarity. More minor: the training section mentions use of both Dice and focal loss but does not detail the weighting.
W3. Using AUROC for highly imbalanced data, such as pixel-level localisation, is not ideal, as methods can obtain a high score by being biased towards the majority class even when they fail to accurately segment objects. This causes AUROC to saturate, making it harder to distinguish between methods. Presuming that AUROC was chosen because it does not require choosing a threshold, you could follow existing work and instead use metrics such as the area under the precision-recall curve (sklearn.metrics has a good implementation) or the optimal Dice score (the best Dice score over all possible thresholds); a brief sketch of both is given after this list.
W4. There is an unlabelled dark yellow box to the right of ‘MiniNet’ in Fig 1; does it correspond to an operation? If so, provide its meaning in the key. If not, and it actually refers to f’, consider putting ` f’ ` inside the box (currently there are two annotations for f’ next to the arrows). If you choose this, I would also put the ` f ` before the MiniNet inside its own box (then both would be consistent with the visualisations of u, w_n and w_a). I would also recommend having another pass at the alignment, as some spacings are inconsistent (vertical alignment of w_n and w_a, the two L_{seg} on the right-hand side).
Here are some further points which are not major issues but could help further increase clarity:
- The key for Figure 1 is very helpful, more so than if the same information were provided in a caption. Putting a box around it would help to separate it from the rest of the diagram (it’s quite close to the visualisation of the logits).
- Label the different shot experiments with their shot number instead of M1/M2/M3. Currently the indirect names make the reader hesitate and look back to their definitions, which seems unnecessary. Given that you already have two cases where you remind the reader that M3 means 4-shot, it would be easier to always refer directly to each as the X-shot experiment. If you are hesitant because putting ‘Shot Number’ (or similar) in the table’s left column may confuse readers, since it would appear above the method names, then I would suggest briefly explaining it in the caption. In a similar vein, when comparing the different experiments it would match other work more closely to go from lowest shot to highest.
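As a concrete illustration of the two threshold-free metrics suggested in W3, a minimal sketch using scikit-learn and NumPy is given below. The function names and the flattened score/mask inputs are illustrative assumptions, not code from the paper.

```python
# Minimal sketch (not from the paper): pixel-wise AUPRC and optimal Dice
# computed from flattened anomaly scores and binary ground-truth masks.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def pixel_auprc(gt_masks: np.ndarray, anomaly_maps: np.ndarray) -> float:
    """Area under the precision-recall curve over all pixels."""
    return average_precision_score(gt_masks.ravel(), anomaly_maps.ravel())

def optimal_dice(gt_masks: np.ndarray, anomaly_maps: np.ndarray) -> float:
    """Best Dice score over all thresholds; for binary masks, Dice equals F1,
    so it can be read off the precision-recall curve as 2PR / (P + R)."""
    precision, recall, _ = precision_recall_curve(gt_masks.ravel(), anomaly_maps.ravel())
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return float(f1.max())
```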
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I’ve listed my suggestions in the weaknesses section. The only extra one is that, personally, I would move the “Image Feature Adapter” subsection into the “memory-boosted few-shot adaptation” section, as it seems unusual to have it before the few-shot section when it states that it is trained during the few-shot adaptation process.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although the paper requires a second pass to correct the text and table errors listed (and a careful review in case there are others we have all missed), once these errors are fixed (which seems achievable in the rebuttal phase) I do not see a significant blocker to the publication of this work. The method is novel, the evaluation is thorough, and the evaluation setup (distribution shift between training and test data) clearly demonstrates that the method’s ability would translate to a clinical setting.
I am currently limiting my recommendation to “Weak Acceptance” rather than “Acceptance” due to the missing details regarding the “mask-guided post-processing technique”, please explain this thoroughly in the rebuttal (both the technique and how you will better explain it in the paper).
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my main issues, namely the typos, the clarity issues, and the explanation of the mask-guided post-processing component, so I am confident that the paper is suitable for acceptance. The thorough evaluation using data from domains notably different from the training data is a great example for other anomaly detection work, as domain-shift problems are a key challenge in anomaly detection.
As many of the changes are promised but not entirely described (which is understandable given the character limit of the rebuttal), I would suggest the authors ask a colleague who is not familiar with the project to review the rewritten sections, as they may have assumed knowledge that a reader of the paper would not have. For instance, in the original paper it was not clear that the mask-guided post-processing was applied to the final classification prediction rather than the anomaly map, let alone how it was actually computed.
My only concern not mentioned in the rebuttal is the naming of the shot settings as M1/M2/M3 for 16/8/4; I would again strongly suggest the authors change this and directly use the shot counts in the text and tables, as it confused me again when I was rereading the paper.
Finally, my point about using AUROC was not to suggest that the proposed work would fall below the baselines when measured with AUPRC, but rather that AUPRC gives a better indication of how ‘solved’ the problem is (as AUROC saturates quickly in imbalanced-data cases). I would therefore suggest including pixel-wise AUPRC results in the final manuscript if space allows, replacing pixel-wise AUROC if necessary.
Overall, my concerns have been sufficiently met; this is a strong paper that should be accepted, particularly because of its strong, clinically relevant evaluation.
Author Feedback
We thank the reviewers for the positive feedback. (R2) noted that “the method is novel, evaluation is thorough, would translate to a clinical setting.” (R1/R2/R3) acknowledged our strong performance on challenging tasks. (R3) noted the effectiveness of our simple design.
Comparison to MediCLIP (R1): Although MediCLIP is trained on few-shot normal images with synthetic anomalies, it is limited to 1) binary anomaly detection and 2) in-domain tasks, differing from our work in both novelty and scope. In contrast, our method targets fine-grained classification (normal, benign, malignant) to better support clinical diagnosis. The differences will be discussed. Notably, we introduce a learnable few-shot feature bank to boost classification and a PIF module for deep fusion to enhance pixel-level localization. Superior performance is achieved and demonstrated on multiple breast US datasets. E.g., in 8-shot cross-domain tests, our method outperforms MediCLIP (AUROC: image-level 90.7 vs. 76.1; pixel-level 91.5 vs. 88.5).
Comparison to CoCoOp (R3): CoCoOp [24] uses class names with learnable context tokens to improve prompt quality for classification. We instead target class-agnostic anomaly detection, replacing class names with a single token conditioned on the [CLS] token for better generalization via global features (unlike CoOp). This design follows [26]. Our novelty is the first VLM-boosted method for fine-grained US anomaly detection, better addressing clinical needs. See our prior response for the new features.
Generalization across varying anatomies (R3): AnomalyCLIP [26] is trained on large-scale industrial data and tested on medical data, incl. US. But without seeing medical data, its performance drops. Like AnomalyCLIP, our method also supports zero-shot learning and has better generalization ability. E.g., AnomalyCLIP achieves a pixel-level AUROC of 81.5 on unseen TN3K thyroid data, while our method reaches 89.2 on this unseen anatomy (from breast to thyroid). In medical use, precision is critical. Our approach improves on AnomalyCLIP for clinical translation via few-shot learning with a few anatomy-specific examples. While broader validation is limited by scarce real patient US data, our method generalizes better across devices and settings (Tab. 2/3). Despite the data limits, strong results on cross-domain breast data suggest good potential for other anatomies.
Anomaly Seg Map (R2): Patch-wise cosine similarity between text and image features is upsampled to the image size to form the anomaly maps Y1 and Y2 (see Fig. 1). The final anomaly map is obtained by averaging the two maps.
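Purely for illustration, a minimal sketch of this map construction is given below, assuming normalized patch tokens, a normal/abnormal text-embedding pair, and bilinear upsampling; all tensor names and shapes are assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of the anomaly-map construction described in the rebuttal:
# patch-wise cosine similarity between patch tokens and normal/abnormal text
# embeddings, softmax over the two classes (temperature omitted for brevity),
# and bilinear upsampling to image size. The final map is the average of Y1 and Y2.
import torch
import torch.nn.functional as F

def anomaly_map(patch_tokens: torch.Tensor,   # (B, H*W, D) patch features from one layer
                text_embeds: torch.Tensor,    # (2, D) [normal, abnormal] text embeddings
                image_size: int) -> torch.Tensor:
    patch = F.normalize(patch_tokens, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = patch @ text.t()                       # (B, H*W, 2) cosine similarities
    probs = logits.softmax(dim=-1)[..., 1]          # abnormal probability per patch
    side = int(probs.shape[1] ** 0.5)               # assumes a square patch grid
    probs = probs.view(-1, 1, side, side)
    return F.interpolate(probs, size=image_size, mode="bilinear", align_corners=False)

# Final anomaly map: Y = 0.5 * (Y1 + Y2), with Y1 and Y2 from two feature levels.
```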
Mask-guided post-processing (R2): This is a widely adopted strategy in recent AD methods (e.g., WinCLIP), where segmentation results are used to boost classification. Specifically, the final prediction is computed as hat{y}' = 0.5 * ( max[ (Y1 + Y2)/2 ] + hat{y} ), i.e., the maximum of the averaged anomaly map is fused with the image-level prediction. The paper has been updated.
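Under this reading of the rebuttal formula, a minimal sketch of the score fusion could look as follows; the function name and tensor shapes are assumptions.

```python
# Minimal sketch (assumed reading of the rebuttal formula): fuse the maximum of
# the averaged anomaly map with the image-level prediction for the final score.
import torch

def mask_guided_score(y_hat: torch.Tensor,   # (B,) image-level anomaly score
                      y1: torch.Tensor,      # (B, 1, H, W) anomaly map 1
                      y2: torch.Tensor       # (B, 1, H, W) anomaly map 2
                      ) -> torch.Tensor:
    seg_score = ((y1 + y2) / 2).amax(dim=(-2, -1)).squeeze(1)  # max over spatial dims
    return 0.5 * (seg_score + y_hat)
```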
Parameter cost (R3): AdaCLIP (10.7M), AnomalyCLIP (5.6M), VCP (7M), MVFA (12M), vs. ours (7.2M). Our superior performance is not due to having the largest model. Moreover, we observe that larger models may overfit in few-shot settings, bringing no performance gain.
Inconsistency of AUROC scores (R3): Tables 2/3 show average scores from three runs (Sec. 3.1); the ablation study reports single-run results.
Evaluation metric (R2): We also evaluated using AUPRC, where our method remains superior.
Detailed implementation & data/code (R1/R2/R3): The implementation details will be updated (R1): MiniNet is a 1-channel CNN with 3×3 kernels (R3), and the Dice/focal loss weights are both 1 (R2). Besides the 300 in-house patient images, all other data are public. Code & model will be released (R3).
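As a hedged illustration of the stated loss weighting (Dice and focal terms each weighted by 1), a sketch is given below; the focal-loss alpha/gamma values and all function names are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): equally weighted
# Dice + focal segmentation loss, matching the rebuttal's "weights are 1".
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    pred = pred.sigmoid()
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # alpha/gamma are common defaults, assumed here
    bce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    p_t = torch.exp(-bce)                              # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def seg_loss(pred_logits: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    # Dice and focal terms each weighted by 1, as stated in the rebuttal.
    return dice_loss(pred_logits, target_mask) + focal_loss(pred_logits, target_mask)
```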
Clarity of text description & figure visualization (R1/R2/R3): Thanks! We have fixed all typos and reviewed the text with a native speaker. The baseline name issues in Tab. 2 are corrected (R2/R3): row 2 is now “AdaCLIP” and row 5 “MVFA”. Prompts are manually crafted per category using clinical descriptions (R3). The dark yellow box is the adapted [CLS] token f′ after MiniNet (R2); this will be updated in Fig. 1.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While this paper received mixed scores, the rebuttal addressed the reviewers’ concerns. After going through the rebuttal answers and the reviewers’ comments, I do not have any major concerns that would go against the reviewers’ scores, and thus I recommend acceptance of this work.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A