Abstract

We focus on the problem of Unsupervised Domain Adaptation (UDA) for breast cancer detection from mammograms (BCDM). Recent advancements have shown that masked image modeling serves as a robust pretext task for UDA. However, when applied to cross-domain BCDM, these techniques struggle with breast abnormalities such as masses, asymmetries, and micro-calcifications, in part because the regions of interest are typically much smaller than in natural images. This often results in more false positives per image (FPI) and significant noise in the pseudo-labels typically used to bootstrap such techniques. Recognizing these challenges, we introduce a transformer-based Domain-invariant Mask Annealed Student Teacher autoencoder (D-MASTER) framework. D-MASTER adaptively masks and reconstructs multi-scale feature maps, enhancing the model’s ability to capture reliable target-domain features. D-MASTER also includes adaptive confidence refinement to filter pseudo-labels, ensuring that only high-quality detections are considered. We also provide a bounding-box-annotated subset of 1000 mammograms from the RSNA Breast Screening Dataset (referred to as RSNA-BSD1K) to support further research in BCDM. We evaluate D-MASTER on multiple BCDM datasets acquired from diverse domains. Experimental results show significant improvements of 9% and 13% in sensitivity at 0.3 FPI over state-of-the-art UDA techniques on the publicly available INBreast and DDSM benchmark datasets, respectively. We also report improvements of 11% and 17% on the In-house and RSNA-BSD1K datasets, respectively. The source code, the pre-trained D-MASTER model, and the RSNA-BSD1K dataset annotations are available at https://dmaster-iitd.github.io/webpage.
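To make the masked-feature idea above concrete, the following is a minimal, hypothetical sketch of annealed feature-map masking, not the authors' implementation (D-MASTER learns its masks rather than sampling them randomly); the tensor shapes, the linear schedule, and the function names are assumptions made for illustration.

```python
# Hypothetical sketch of annealed feature-map masking (not the authors' code).
# Assumes a PyTorch feature map of shape (B, C, H, W) and a linearly annealed
# masking ratio; D-MASTER itself uses a learnable masking strategy.
import torch


def annealed_mask_ratio(step: int, total_steps: int,
                        start: float = 0.25, end: float = 0.75) -> float:
    """Linearly anneal the fraction of masked feature-map cells over training."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + t * (end - start)


def mask_feature_map(feat: torch.Tensor, ratio: float):
    """Zero out a random subset of spatial locations in a (B, C, H, W) feature map."""
    b, _, h, w = feat.shape
    keep = (torch.rand(b, 1, h, w, device=feat.device) > ratio).float()
    return feat * keep, keep  # masked features and the binary keep-mask


# Usage: mask target-domain features, then train a decoder to reconstruct the
# original map; the reconstruction objective encourages domain-robust features.
feat = torch.randn(2, 256, 32, 32)                      # e.g. a backbone feature map
ratio = annealed_mask_ratio(step=500, total_steps=1000)
masked, keep = mask_feature_map(feat, ratio)
# A decoder (omitted) would map `masked` to a reconstruction `recon`; the loss is
# typically an MSE restricted to the masked positions:
# loss = (((recon - feat) ** 2) * (1 - keep)).mean()
```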

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1343_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1343_supp.pdf

Link to the Code Repository

https://github.com/Tajamul21/D-MASTER

Link to the Dataset(s)

https://dmaster-iitd.github.io/webpage/

BibTex

@InProceedings{Ash_DMASTER_MICCAI2024,
        author = { Ashraf, Tajamul and Rangarajan, Krithika and Gambhir, Mohit and Gauba, Richa and Arora, Chetan},
        title = { { D-MASTER: Mask Annealed Transformer for Unsupervised Domain Adaptation in Breast Cancer Detection from Mammograms } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a transformer-based unsupervised domain adaptation technique that integrates mask annealing and adaptive confidence refinement for pseudo-labels. The proposed method, evaluated on four different breast mammogram datasets, achieves better results than several existing methods in detecting breast cancer. Additionally, a natural-image experiment is performed with the Cityscapes and Sim10K datasets, and the reported results suggest improvements by the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The integration of mask annealing and adaptive confidence refinement for filtering pseudo-labels is very interesting to me.

    2. The proposed method exhibits improved performance compared to several other unsupervised domain adaptation methods with different combinations of source and target.

    3. Good sets of experiments including ablation studies are presented to convey the contributions of the paper.

    4. As promised by the authors, it will be a great contribution to release the RSNA-BSD1K annotations towards furthering research in breast cancer detection.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper needs significant improvement in its structure and organization. Figures and tables are included but not discussed well.

    2. The experimental setup is problematic. Even though the target-domain datasets are used without their corresponding labels in UDA, the train and test sets should not be the same.

    3. Detection results and an ablation study are presented without any discussion.

    4. Missing details on how the different hyperparameters were selected.

    5. Some concerns regarding the datasets. The paper mentions generalizability across different geographies, machines, techniques, and image-acquisition protocols. A discussion of how the datasets used in the experiments differ, how much domain shift exists, etc., needs to be added. Not much information is provided about the In-house dataset except the number of mammograms.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors also claimed to release their pre-trained model along with RSNA-BSD1K annotations.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors are suggested to add a discussion comparing the #parameters and training/inference time of different methods to their proposed one.

    2. The tables and figures need significant improvement. Fig. 2 is hard to follow without clear labeling; for example, it is unclear what the blue blocks after the Backbone represent or what is fed into the Def-DETR decoders. Table 1 and Fig. 4 need to be improved, particularly the fonts. The visualization in Table 2 should be presented as a separate figure with proper annotation. Table 1 (supplemental) should be part of the main paper. The algorithm in Fig. 3 should be presented as a proper algorithm.

    3. The reported breast cancer detection results need to be discussed. Instead of showing the natural image results, it would be better to discuss the BCDM results.

    4. Can the authors discuss any limitations or failure cases of their method?

    5. How were the augmentation techniques for weak and strong augmentation selected? How crucial are the particular augmentations chosen? What parameters were used for the augmentations?

    6. How many times did the authors repeat these experiments? What significance test was performed?

    7. It is not viable for clinical applications to have test data available during training. I wonder whether D-MASTER’s performance would be consistent if evaluated on totally unseen test sets.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Problematic experimentation, missing important data and experimental details, and a lack of discussion of the results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors for submitting their responses. Unfortunately, the practicality of the proposed method/experiments remains questionable. I doubt the model would perform like this if the assumption of access to test data during training were removed, and such access is not feasible for clinical applications. None of the provided references uses medical images; it would be better to see medical-imaging-specific references. Furthermore, the paper needs serious revision with a thorough analysis of the results and findings. I stick with my previous rating.



Review #2

  • Please describe the contribution of the paper

    This study presents a transformer-based Domain-invariant Mask Annealed Student Teacher autoencoder (D-MASTER) framework for solving the unsupervised domain adaptation (UDA) problem of detecting breast cancer from mammograms. The framework is designed to enhance the model’s ability to capture breast abnormalities, especially small regions of interest such as masses, asymmetries, and micro-calcifications. The investigators provide a bounding-box-annotated subset of 1000 mammograms from the RSNA Breast Screening Dataset (RSNA-BSD1K) to support further research in this area.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Introducing a transformer-based Domain-invariant Mask Annealed Student Teacher Autoencoder Framework, which integrates a novel mask-annealing technique and adaptive confidence refinement module. Unlike previous approaches that leverage massive datasets for pretraining and fine-tuning, D-MASTER employs a learnable masking technique for the Mask Autoencoder (MAE) branch to generate masks of varying complexities, enhancing domain-invariant feature learning.
    2. Proposing an adaptive confidence refinement module within the Teacher-Student model to filter pseudo-labels effectively. This module progressively restricts the confidence metric for pseudo-label filtering, allowing more pseudo-labels during the initial adaptation phase and prioritizing more reliable pseudo-labels as confidence increases, thereby improving detection accuracy.
    3. Providing a bounding box annotated subset of 1000 mammograms from the RSNA Breast Screening Dataset (RSNA-BSD1K), facilitating further research in BCDM.
    4. Establishing a new state-of-the-art (SOTA) in detection accuracy for UDA settings. Significant performance gains are reported, with a sensitivity of 0.74 on INBreast and 0.51 on DDSM at 0.3 False Positives per Image (FPI), compared to 0.61 and 0.44 using current SOTA, respectively. Similarly, significant improvements are observed on in-house and RSNA-BSD1K datasets.
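
    As an illustration of the adaptive confidence refinement described in point 2 above, the following is a minimal, hypothetical sketch of a progressively tightening pseudo-label threshold; the linear schedule, the function names, and the numeric values are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of adaptive confidence refinement for pseudo-labels.
# The threshold starts permissive and tightens as adaptation progresses;
# the linear schedule and all names below are illustrative assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def confidence_threshold(step: int, total_steps: int,
                         tau_min: float = 0.3, tau_max: float = 0.8) -> float:
    """Low threshold early (more pseudo-labels), high threshold later (more reliable ones)."""
    t = min(step / max(total_steps, 1), 1.0)
    return tau_min + t * (tau_max - tau_min)


def filter_pseudo_labels(detections: List[Tuple[Box, float]],
                         step: int, total_steps: int) -> List[Box]:
    """Keep teacher detections whose score exceeds the current threshold."""
    tau = confidence_threshold(step, total_steps)
    return [box for box, score in detections if score >= tau]


# Example: a 0.45-confidence detection is kept early in adaptation
# but discarded late in adaptation.
dets = [((10, 20, 60, 80), 0.45), ((100, 40, 150, 90), 0.9)]
print(len(filter_pseudo_labels(dets, step=0, total_steps=1000)))     # 2
print(len(filter_pseudo_labels(dets, step=1000, total_steps=1000)))  # 1
```
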
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The article describes the design of the D-MASTER framework, but the specific implementation of the key techniques and the explanation of their principles are rather brief.
    2. In the results section, the authors report the performance improvements of D-MASTER on different datasets but lack a detailed analysis and discussion of the comparative experiments.
    3. There are instances of unclear terminology and overly complex sentence structure in the article.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is suggested that the authors provide more detailed algorithm descriptions and mathematical derivations in the methodology section so that readers can deeply understand the working principles and innovations of the framework.
    2. It is recommended that the authors provide a more in-depth analysis of the experimental results, including comparisons with existing techniques, the effects of different parameter settings, and the limitations of the model’s performance, in order to demonstrate the method’s advantages and applicability.
    3. It is recommended that the authors further simplify and clarify the expressions in the text to improve readers’ understanding and reading experience.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method proposed in the paper exhibits some level of innovation, and the experimental results effectively demonstrate the superiority of the approach. Additionally, the authors’ commitment to open-sourcing the code and data will facilitate further research in the field. However, there are still some areas that could be further optimized, and I hope the authors will address these issues.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Based on the authors’ feedback, some issues have been further clarified, and I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a transformer-based Domain-invariant Mask Annealed Student Teacher autoencoder (D-MASTER) framework. D-MASTER adaptively masks and reconstructs multi-scale feature maps, enhancing the model’s ability to capture reliable target-domain features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Experimental results showed a significant improvement of 9% and 13% in sensitivity at 0.3 FPI over state-of-the-art UDA techniques on the publicly available INBreast and DDSM benchmark datasets, respectively.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    NA

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Thank you for the interesting paper; I have no suggestions for modifying the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Methodology and Experimental results

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their thoughtful comments and for describing our work as “interesting” (R1, R4) and “innovative” (R2), with a “good set of experiments including ablation studies” (R1) that “effectively demonstrate the superiority of the approach” (R2). The reviewers also acknowledge that releasing the annotated dataset will “facilitate further research in BCDM” (R2) and is a “great contribution” (R1).

(R1) “The experimental setup is problematic…in a UDA, the train and test sets should not be the same… I wonder if the D-MASTER performance will be consistent … on totally unseen test sets.”

We firmly believe that our experimental setup is not problematic. Standard UDA methods such as DANN [1], ALDA [2], SAFN [3], MDD+IA [4], CADA-P [5], GVB-GD [6], HDAN [7], SPL [8], SRDC [9], HDMI [10], and SHOT [11] all follow the same setting by adapting on the whole target dataset in an unsupervised way. The concern regarding performance on unseen test splits is also unfounded: Table 3 in the main paper already shows results on the “unseen” Cityscapes test split. Still, the suggested experiments on a custom split can easily be added in the camera-ready version.

(R1) “how much domain shift exists”.

We utilized four BCDM datasets with train and test splits, ensuring that no patient is common to the two splits. First, we trained a model on the source (train split) and tested it on the source (test split), establishing the source model’s reliability. Then the reliable source model was tested directly on the target (test split), which we call “source only”. Next, the source model was fine-tuned on the target (train split) and tested on the target (test split), which we call the “skyline”. A significant performance gap between the source-only and skyline models indicates a domain shift.
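
For readers unfamiliar with this protocol, the sketch below summarizes the source-only/skyline comparison; the train, finetune, and evaluate callables are placeholders passed in by the caller, not part of the released code.

```python
# Hypothetical sketch of the "source only" vs. "skyline" protocol used to
# quantify domain shift; the callables are placeholders, not the released code.
from typing import Any, Callable


def measure_domain_shift(train: Callable[[Any], Any],
                         finetune: Callable[[Any, Any], Any],
                         evaluate: Callable[[Any, Any], float],
                         source_train, source_test, target_train, target_test):
    model = train(source_train)                 # supervised training on the source split
    src_perf = evaluate(model, source_test)     # sanity check: source-model reliability
    source_only = evaluate(model, target_test)  # direct transfer, no adaptation
    skyline = evaluate(finetune(model, target_train), target_test)  # supervised upper bound
    # A large (skyline - source_only) gap indicates a substantial domain shift
    # that a UDA method is expected to narrow.
    return src_perf, source_only, skyline
```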

Clinically, the domain gap across the four datasets was verified by three expert radiologists. The DDSM dataset is composed of screen film mammography images, while our In-house hospital data consists of digital mammography images. Although some RSNA and INBreast images also originate from digital mammography, they exhibit noticeable differences in image quality. The RSNA-BSD1K dataset, notably, includes data from ten different vendors, contributing to its diversity. Furthermore, DDSM, INBreast, and RSNA-BSD1K datasets are derived from screening mammography distributions, whereas our In-house hospital data is sourced from diagnostic mammography distributions. This distinction underscores the variability and domain shifts present among the datasets, highlighting the robustness and adaptability required for effective cross-domain breast cancer detection.

(R2) “Limitations of model performance” and (R2) “Failure cases”.

If small datasets such as INBreast are used as the source, the trained model has severe generalization issues, and our method struggles in such scenarios. Using single views (CC and MLO) is another limitation, but one that can be addressed easily.

(R1) Weak and strong augmentations.

Please refer to Section 3 (implementation details). We followed MT [12] and AT [13] in studying the different augmentation parameters.

(R1) “Missing details on how the different hyperparameters were selected”

We tuned our hyperparameters using a systematic grid search guided by empirical validation performance. Additionally, we considered domain-specific knowledge and insights from previous UDA works [12, 13, 11, 2, 6] to inform our choices. We will add further details in the supplementary material corresponding to Table S1.
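
As an illustration of such a tuning loop (the actual search space is not specified here), a minimal grid-search sketch with hypothetical hyperparameter names and ranges might look like the following.

```python
# Minimal grid-search sketch; the hyperparameter names, ranges, and the
# train_and_validate callable are illustrative assumptions only.
from itertools import product


def grid_search(train_and_validate, grid: dict):
    """Return the configuration with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_validate(cfg)  # e.g. sensitivity at 0.3 FPI on a validation split
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Example search space (hypothetical values):
search_space = {"learning_rate": [1e-4, 5e-5], "mask_ratio_max": [0.5, 0.75],
                "pseudo_label_tau": [0.3, 0.5]}
```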

[1] https://arxiv.org/abs/1409.7495
[2] https://arxiv.org/abs/2001.01046
[3] https://arxiv.org/abs/1502.02791
[4] https://arxiv.org/abs/2006.04996
[5] https://arxiv.org/abs/1906.03502v2
[6] https://arxiv.org/abs/2003.13183
[7] https://arxiv.org/abs/2011.14540
[8] https://arxiv.org/abs/1706.07522
[9] https://arxiv.org/abs/2003.08607
[10] https://arxiv.org/pdf/2012.08072
[11] https://arxiv.org/abs/2002.08546
[12] https://arxiv.org/abs/1703.01780
[13] https://arxiv.org/abs/2111.13216




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The decision is split: Reject (R), Weak Accept (WA), and Accept (A). Reviewer 1 is concerned about the problem setting, specifically the assumption of access to test data during training. According to the AC’s knowledge, UDA has two setups: 1) adaptation for the unlabeled training target data and testing for the unseen target data, and 2) adaptation for the test data. The latter setup is called test-time adaptation. Therefore, the AC considers that if the author clearly states this assumption in the early part (introduction), it is acceptable. Overall, based on the positive comments by two reviewers, the AC recommends accepting this paper if there is some space. If this paper is accepted, please clearly explain the problem setup.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    As mentioned by R1, there is a problem in the UDA setting. This paper trains the D-MASTER framework on both source and the whole target datasets (Fig. 2), and tests the performance on the whole target dataset (Page 7), which is problematic. Although the author mentioned in rebuttal that other UDA adopted the same setting, SRDC made clear in the paper that they followed a 50%/50% split and divided the target domain into training and testing sets. In summary, though the idea of this paper is interesting, considering the setting problem, this paper cannot be accepted at this stage.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    One reviewer did not give detailed reasons to support his positive recommendation for this paper, and thus the remaining two reviewers’ ratings are divergent. I agree that there are flaws in the presentation of the paper; however, it seems that the technical contributions of the paper are generally recognized by two reviewers (R1, R3). The authors should clearly state whether all methods in Table I are tested by training on Dataset-A and testing on Dataset-B to ensure a fair comparison of the individual methods. Overall, I would recommend weak acceptance for this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



