

Abstract

Automated Breast Ultrasound (ABUS) provides three-dimensional volumetric imaging that improves breast lesion detection without radiation exposure and reduces operator dependency. However, the resulting high data volume poses significant challenges for radiologists in localizing lesions accurately and distinguishing benign from malignant cases, challenges that can directly impact early diagnosis and treatment outcomes. To tackle these critical issues, we propose SAMASK-CLTR (Spatial-Aware Mask Prompting with Convolutional Transformer Architecture), a hybrid framework that combines the feature extraction power of CNNs with the global modeling capability of Transformers. In our approach, ResNet-50 extracts hierarchical, multi-scale features that are refined by a Transformer encoder-decoder to capture global context. Crucially, during decoding, a mask prompt enhanced with 3D positional encoding guides the network to focus on key tumor regions, directly addressing the challenges of precise localization and classification. Experiments on 7,073 ABUS images (6,973 clinical cases from internal datasets and 100 cases from the public ABUS Challenge Cup) demonstrate that SAMASK-CLTR achieves AUCs of 88.45% and 70.46% on internal and external datasets, respectively. These results highlight the potential of our framework to significantly enhance breast cancer diagnosis by improving the accuracy and reliability of lesion classification.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2653_paper.pdf

SharedIt Link: https://rdcu.be/eHwLQ

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04927-8_54

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/SAMASK-CLTR/Code

Link to the Dataset(s)

N/A

BibTex

@InProceedings{XuPei_SAMASKCLTR_MICCAI2025,
        author = { Xu, Peirong AND Zhu, Luoqian AND Chen, Jingkun AND Qian, Xin AND Sun, Yue AND Bao, Lingyun AND Tan, Tao},
        title = { { SAMASK-CLTR: A spatial-aware mask guided learning model for benign and malignant tumor classification in ABUS } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {567--577}
}

Reviews

Review #1

Please describe the contribution of the paper
1. Propose a hybrid CNN-Transformer model that integrates spatial-aware mask prompts for direct benign-malignant classification, achieving substantial improvements over conventional CNNs.
2. Evaluate multiple input modes and analyze their impact on classification performance.
3. Conduct extensive experiments on large-scale clinical and public datasets, validating the cross-dataset generalization capability of SAMASK-CLTR.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Propose a hybrid CNN-Transformer model for direct benign-malignant classification, demonstrating superior performance compared to traditional CNNs.
2. Enhance the network’s ability to focus on critical tumor regions by incorporating a mask prompt augmented with 3D positional encoding, effectively tackling the challenges of accurate localization and classification.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The novelty of this paper is somewhat limited, as it primarily integrates existing techniques into its framework, such as ResNet and MSDeformable Attention. The concept of combining CNNs and Transformers is relatively straightforward, and the implementation of components like 3D Position Embedding and Spatial-aware Mask Prompts is fairly intuitive.
2. To improve clarity, an overview of the proposed model should be presented at the beginning of the Methods section, rather than placing Fig. 2 at the end. Additionally, Fig. 2 requires a more detailed explanation. For instance, the roles of Q and K in the Decoder should be explicitly clarified.
3. Furthermore, the comparison experiments would benefit from the inclusion of more state-of-the-art methods to provide a comprehensive evaluation of the proposed approach.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Other issues
1. Figures and tables are not cited in the paper.
2. What is the loss function?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper introduces Spatial-Aware Mask Prompting with Convolutional Transformer Architecture (SAMASK-CLTR), a framework that integrates the feature extraction strengths of convolutional neural networks (CNNs) with the global modeling capabilities of Transformers for classifying benign and malignant tumors in automated breast ultrasound (ABUS). However, the novelty of this paper is limited.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Reject
[Post rebuttal] Please justify your final decision from above.

The novelty of this paper is limited as the idea of combining CNNs and Transformers is relatively straightforward.

Review #2

Please describe the contribution of the paper

This paper combines a CNN with a Transformer to propose the SAMASK-CLTR model, which the paper claims can significantly improve the diagnosis of breast cancer by improving the accuracy and reliability of lesion classification. The authors validated these claims on a publicly available automated breast ultrasound dataset and on internal datasets, showing advantages over existing models.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper presents strong experimental results showing that the proposed method outperforms the baseline on both internal and external datasets.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

AP, LAT, and MED are not explained in the description of Figure 1, and the text refers to MED while the figure is labeled MAD. The paper does not explain some terms, such as ABVS: it is not stated what it is or whether it is related to ABUS. None of the figures are referenced in the text. The paper mentions the use of oversampling to address data imbalance but does not provide specific details or a rationale. This omission undermines the credibility of the results and the scientific rigor of the paper. The experimental section does not introduce the evaluation metrics.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method shows some experimental results that are superior to existing frameworks, but there are problems with some details of the paper. The figures and tables are not referenced in the text, and the evaluation metrics used in the experiments are not introduced. These defects make the paper less rigorous.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The paper presents strong experimental results showing that the proposed method outperforms the baseline on both internal and external datasets. External challenge datasets were used to improve the rigor of the evaluation. By combining mask prompts with 3D position encoding, the ability of the network to focus on tumor areas was enhanced.

Review #3

Please describe the contribution of the paper

This paper proposes a hybrid CNN-Transformer for classifying breast lesions as benign or malignant from 3D ABUS data. The core contribution is the introduction of spatially aware mask prompts to directly help with the pathological assessment of lesions from imaging. The method is tested on images from private and public datasets, reporting accuracy, AUC, sensitivity, and specificity. Ablation studies are also included.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The experiment and method description of the paper is really thorough and well written
- I appreciate the use of the external challenge dataset to improve the evaluation rigor
- The ablation studies make sense and show that there has been thoughtful investigation into what makes the model work well
- There are a large number of images in the evaluation sets which is great
- I think this is an excellent application of spatially aware mask prompting that we’re seeing in general DL literature to a highly relevant clinical problem
- The literature review is very extensive and well written
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The organization of the methods is a bit unintuitive: the paper goes directly into a description of deformable attention, which is confusing when its use is not brought up until the CLTR section later. Consider starting with an overview of the network (Figure 2) and a simplification of the core components, and then breaking into the specific details. This will really help the reader navigate what you’ve done.
- Minor: page 8 “itguides” instead of it guides
- In Figure 2, I don’t totally understand the top right block (specifically the * 8)
- This sentence needs a reference: “Consequently, ResNet101, DenseNet121, and SwinUnetr were established as the state-of-the-art (SOTA) baseline models for their respective input modes.” I may have missed it if it was mentioned earlier.
- The breakdown of data distribution (into training, validation, test) is unclear - maybe you could add to table 1 to explain what was used for experiments
- There are no standard deviations in the results table (would be good to see) particularly if you did some sort of cross validation
- “A comprehensive analysis indicates that our method significantly outperforms existing models in cross-dataset generalization and overall performance, further validating its robustness.” - do you have statistical analysis to back this up?
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

The comments that I have outlined in the strengths and weaknesses should help guide revisions.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I chose weak accept because the weaknesses that I have outlined are related to presentation and organization. I think the methods section needs some serious restructuring to clarify the network which is why I said “weak” because I’d like to see this in the rebuttal. Otherwise, as I’ve pointed out in the strengths, this paper is a thorough application of mask prompting to a highly relevant problem. The validation is well thought out and rigorous - although there are some important things to address (cross validation, dataset breakdown, significant claims) - and I think that overall this would make a nice addition to the MICCAI conference.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have clearly addressed my concerns, and I believe they have also sufficiently addressed the concerns about novelty. I think this will make a very nice addition to the conference with the proposed changes.

Author Feedback

We gratefully acknowledge the reviewers’ valuable feedback and have addressed comments as follows.

Q1: Experiment Reproducibility (R1,R2,R3) We will make our code publicly available on GitHub. Furthermore, pending approval from our collaborating hospital, we will also release the dataset.

Q2: Typos and Missing References (R1, R2, R3) We will add or update citations for Figure 1 (Background, p. 2), Figure 2 (CLTR and Mask sections, p. 4), Table 1 (Dataset, p. 5), and Table 3 (Ablation Studies, p. 7). In response to R2’s suggestion regarding missing SOTA model references, we will update Table 2 (p. 6) with the appropriate SOTA citations.

Q3: Definition of Abbreviations (R1,R2) We will update abbreviation definitions including AP (Anteroposterior), LAT (Lateral), and MED (Medial). We will also clarify that ABUS and ABVS are equivalent terms from different manufacturers, both common in the literature. The ‘*8’ in Figure 2, indicating batch size, has been removed to avoid confusion.

Q4: Evaluation Metrics and Loss Function (R1,R3) For evaluation, we chose AUC (area under the ROC curve), ACC (accuracy), SEN (sensitivity), and SPE (specificity), which are common metrics for classification tasks. We train the model by minimizing binary cross-entropy loss.
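For readers unfamiliar with these quantities, the following is a minimal sketch of how the four metrics and the binary cross-entropy loss are commonly computed. It is not taken from the authors' code: the function names, the 0.5 decision threshold, and the clipping constant `eps` are assumptions made for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Return (ACC, SEN, SPE) at a fixed threshold (0.5 is an assumption)."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn) if (tp + fn) else float("nan")  # malignant recall
    spe = tn / (tn + fp) if (tn + fp) else float("nan")  # benign recall
    return acc, sen, spe


def roc_auc(y_true, y_prob):
    """AUC as the chance a random positive outranks a random negative.

    Ties count as half a win. This O(P*N) form is for clarity, not speed.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Unlike ACC, SEN, and SPE, the AUC does not depend on the chosen threshold, which is one reason it is usually reported alongside them on imbalanced data.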

Q5: Organization of Section 2 (R2,R3) In our revision, we will restructure Section 2 as follows: first present Figure 2 as an overall model overview, then sequentially describe CLTR construction, the spatial-aware mask prompt, the multi-head attention mechanism and 3D positional encoding. Within the CLTR subsection, we will explicitly clarify that, in the decoder, the query (Q) is integrated with the spatial-aware mask prompt output, thereby embedding tumor and spatial cues to direct attention toward lesion regions while the key (K) propagates feature representations across decoder layers, culminating in the final output after iterative decoding.

Q6: Explanation of Oversampling (R1) To address the 5:1 benign-malignant imbalance, we used random oversampling to increase the number of malignant samples, ensuring balanced training.
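Random oversampling of this kind can be sketched as follows. The rebuttal does not give the target ratio or implementation details, so this example assumes a 1:1 target and duplication with replacement; `random_oversample` and its parameters are hypothetical names.

```python
import random


def random_oversample(samples, labels, minority_label=1, seed=0):
    """Duplicate minority-class items at random until classes are balanced.

    Assumes a 1:1 target ratio; the authors' actual target is not stated.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority-class samples to duplicate")
    deficit = max(len(majority) - len(minority), 0)
    # Draw with replacement so any minority case may be repeated.
    extra = [rng.choice(minority) for _ in range(deficit)]
    indices = majority + minority + extra
    rng.shuffle(indices)
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Usage: a toy 5:1 benign (0) to malignant (1) training set.
cases = [f"case_{i}" for i in range(12)]
targets = [0] * 10 + [1] * 2
balanced_cases, balanced_targets = random_oversample(cases, targets)
```

Oversampling should be applied to the training split only. Duplicating cases before splitting would leak copies of the same volume into the test set.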

Q7: Statistical Analysis (R2) All comparisons between our method and the baselines achieved statistical significance (p < 0.01), and we will state this in the revised version.
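The rebuttal does not name the statistical test behind the reported p-values. As one possible illustration only, a paired sign-flip permutation test on per-case correctness could look like the sketch below; the function name, the number of permutations, and the choice of test are all assumptions rather than the authors' procedure.

```python
import random


def paired_permutation_test(correct_a, correct_b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-case correctness.

    correct_a / correct_b hold 1 if the model classified the case correctly
    and 0 otherwise, for the same test cases in the same order.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same cases")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the two model labels are exchangeable,
        # so each paired difference keeps or flips its sign with equal odds.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

For comparing AUCs specifically, DeLong's test or a bootstrap over test cases are common alternatives.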

Q8: Data Distribution and Results Tables (R2) We used a 7:1 train/test split and will clarify standard deviations in data distribution as suggested.

Q9: Comparison with State-of-the-Art Models (R3) In Table 2, we benchmark against the backbones and architectures employed in leading studies: DenseNet-121 as used by Yang et al. (IWBI 2024), ResNet-18/50/101 as adopted by Wang et al. (IEEE TMI 2024), and advanced frameworks such as SwinUNETR (Oh et al., ARM 2024) and 3D DETR (Tao et al., MEDIA 2024). Each of these has demonstrated top performance on medical image classification and detection tasks. As suggested, we will also incorporate the latest 2025 developments in our journal extension.

Q10: Main Contributions (R3) The spatial-aware decoding prompt we introduce for tumor classification offers several advantages over existing methods. First, to reduce the cost of 3D feature processing, we downsample the combined mask image and positional encoding before fusing them into the classification query, whereas most current SOTA methods either treat the mask as a separate channel or employ cross-attention, both of which can inflate 3D computation and dilute mask features. Our Spatial-Aware Mask Prompt (SAMP) preserves rich internal and margin details within a densely encoded feature space, as demonstrated by our ablation results in Table 3 and our SOTA comparisons in Table 2. We rigorously validated this approach on both a large internal clinical cohort (6,973 ABUS volumes) and the public ABUS Challenge Cup dataset (100 volumes). In addition, during the decoding phase, our model focuses more on pixels contributing to classification rather than the entire tumor region, which may enhance explainability.
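To make the idea of fusing a downsampled mask and a 3D positional encoding into the classification query more concrete, here is a toy sketch in plain Python. It is not the authors' SAMP implementation: the sinusoidal form of the encoding, average pooling as the downsampling step, the occupancy-weighted mean, and additive fusion into the query are all assumptions chosen for simplicity, and every function name is hypothetical.

```python
import math


def sinusoidal_3d(z, y, x, dim=12, base=10000.0):
    """Sinusoidal encoding of a 3D coordinate; each axis gets dim // 3 channels."""
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (3 axes x sin/cos pairs)")
    per_axis = dim // 3
    enc = []
    for coord in (z, y, x):
        for k in range(per_axis // 2):
            freq = 1.0 / (base ** (2 * k / per_axis))
            enc.append(math.sin(coord * freq))
            enc.append(math.cos(coord * freq))
    return enc


def pooled_mask(mask, pool):
    """Average-pool a binary D x H x W mask (nested lists) over pool^3 blocks."""
    d, h, w = len(mask), len(mask[0]), len(mask[0][0])
    blocks = {}
    for bz in range(d // pool):
        for by in range(h // pool):
            for bx in range(w // pool):
                total = sum(
                    mask[bz * pool + i][by * pool + j][bx * pool + k]
                    for i in range(pool)
                    for j in range(pool)
                    for k in range(pool)
                )
                blocks[(bz, by, bx)] = total / pool ** 3
    return blocks


def mask_prompt(mask, pool=2, dim=12):
    """Occupancy-weighted mean of block encodings, giving one prompt vector."""
    blocks = pooled_mask(mask, pool)
    weight_sum = sum(blocks.values())
    prompt = [0.0] * dim
    if weight_sum == 0:
        return prompt  # empty mask: no spatial cue is injected
    for (bz, by, bx), weight in blocks.items():
        if weight == 0:
            continue
        enc = sinusoidal_3d(bz, by, bx, dim)
        for c in range(dim):
            prompt[c] += weight * enc[c] / weight_sum
    return prompt


def prompt_query(query, mask, pool=2):
    """Add the mask prompt to a classification query of the same width."""
    prompt = mask_prompt(mask, pool, dim=len(query))
    return [q + p for q, p in zip(query, prompt)]
```

In this toy version the prompt costs one vector per volume, independent of volume size after pooling, which illustrates why downsampling before fusion keeps the 3D computation small.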

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The paper presented interesting ideas of a hybrid CNN transformer from 3D ABUS data for breast lesion classification. Reviewers agreed with strong validation in both private and public datasets and most concerns were well addressed.


SAMASK-CLTR: A spatial-aware mask guided learning model for benign and malignant tumor classification in ABUS

Author(s):