Abstract

Detecting and classifying lesions in breast ultrasound images is a promising application of artificial intelligence (AI) for reducing the burden of cancer in regions with limited access to mammography. Such AI systems are more likely to be useful in a clinical setting if their predictions can be explained. This work proposes an explainable AI model that provides interpretable predictions using a standard lexicon from the American College of Radiology’s Breast Imaging and Reporting Data System (BI-RADS). The model is a deep neural network which predicts BI-RADS features in a concept bottleneck layer for cancer classification. This architecture enables radiologists to interpret the predictions of the AI system from the concepts and potentially fix errors in real time by modifying the concept predictions. In experiments, a model is developed on 8,854 images from 994 women with expert annotations and histological cancer labels. The model outperforms state-of-the-art lesion detection frameworks with 48.9 average precision on the held-out testing set. For cancer classification, concept intervention increases performance from 0.876 to 0.885 area under the receiver operating characteristic curve. Training and evaluation code is available at https://github.com/hawaii-ai/bus-cbm.
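
The concept bottleneck and concept intervention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above); the concept names, feature vector, weights, and function names below are all hypothetical and chosen only to show the mechanism: the cancer head sees nothing but the concept probabilities, so a reader who overrides one concept changes the cancer prediction directly.

```python
import math
from typing import Dict, List

# Hypothetical concept names, loosely modeled on BI-RADS mass descriptors.
CONCEPTS = ["irregular_shape", "not_parallel_orientation", "not_circumscribed_margin"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_concepts(features: List[float], concept_weights: List[List[float]]) -> List[float]:
    """Bottleneck layer: one probability per concept, computed from image features."""
    return [sigmoid(sum(w * f for w, f in zip(row, features))) for row in concept_weights]


def predict_cancer(concepts: List[float], head_weights: List[float], head_bias: float) -> float:
    """Cancer head: uses only the concept probabilities, never the raw features."""
    return sigmoid(sum(w * c for w, c in zip(head_weights, concepts)) + head_bias)


def intervene(concepts: List[float], corrections: Dict[str, float]) -> List[float]:
    """Replace selected concept predictions with values supplied by a reader."""
    return [corrections.get(name, p) for name, p in zip(CONCEPTS, concepts)]


# Toy numbers (assumptions for illustration, not taken from the paper).
features = [0.2, -1.0]
concept_weights = [[1.0, 0.5], [-0.5, 1.0], [0.3, 0.3]]
head_weights = [2.0, 1.5, 2.5]
head_bias = -3.0

predicted = predict_concepts(features, concept_weights)
# The reader marks the lesion as not parallel, overriding the model's concept output.
corrected = intervene(predicted, {"not_parallel_orientation": 1.0})

p_before = predict_cancer(predicted, head_weights, head_bias)
p_after = predict_cancer(corrected, head_weights, head_bias)
print(f"cancer probability before: {p_before:.3f}, after intervention: {p_after:.3f}")
```

Because the toy head weights are all positive, raising a concept to 1.0 can only raise the cancer probability here; in a trained model the direction and size of the change depend on the learned head.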


Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/4008_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/4008_supp.pdf

Link to the Code Repository

https://github.com/hawaii-ai/bus-cbm

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Bun_Learning_MICCAI2024,
        author = { Bunnell, Arianna and Glaser, Yannik and Valdez, Dustin and Wolfgruber, Thomas and Altamirano, Aleen and Zamora González, Carol and Hernandez, Brenda Y. and Sadowski, Peter and Shepherd, John A.},
        title = { { Learning a Clinically-Relevant Concept Bottleneck for Lesion Detection in Breast Ultrasound } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes an explainable AI model for detecting and classifying breast lesions in ultrasound images. The model incorporates a standardized lexicon and includes a concept bottleneck layer to predict known diagnostic features. This allows radiologists to review and modify the AI system’s predictions in real-time. Experimental results show that the model outperforms existing frameworks, achieving higher precision and improved cancer classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method in this paper demonstrates a notable level of interpretability, greatly enhancing its applicability.
    2. The authors validate the model using a relatively large amount of data (8,854 images from 994 women), ensuring robustness and reliability.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper’s structure is not well-organized. It devotes excessive space to describing the dataset, while providing insufficient detail on the methodology. Many crucial implementation details are missing, making it difficult for readers to grasp the specifics of the proposed method.

    2. The paper lacks any visualized results, such as the outcomes of lesion detection. The absence of visual illustrations hinders the readers’ understanding and assessment of the effectiveness of the proposed approach.

    3. Several important details are missing. For instance, it is unclear why each woman has an average of about nine images, and how those nine images differ: were they taken at different times or from different angles?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It would be beneficial to enhance the understanding of the BI-RADS masses lexicon by providing concrete examples with corresponding ultrasound images. This visual representation will assist readers in comprehending the specific features and characteristics used for classification.

    2. To improve clarity, consider presenting a flowchart or diagram illustrating the inclusion and exclusion criteria for the dataset. This visual representation will provide a clear overview of the data selection process and help readers better grasp the criteria used for dataset composition.

    3. It is advisable to provide more visualized results to demonstrate the advantages of the model’s interpretability. For example, including visualizations of the model’s predictions, such as lesion detection results, will enable readers to assess the model’s performance and understand its strengths in a more intuitive manner. Additionally, visualizing concept predictions or highlighting important features identified by the model can further emphasize the interpretability of the proposed approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper’s writing style is a significant weakness as it fails to provide a sufficiently clear description of the proposed method. Without a clear and comprehensive explanation, it becomes challenging for readers to evaluate the method’s feasibility and reasonableness.

    2. The paper lacks the presentation of visualized results, such as illustrating the outcomes of lesion detection or highlighting the interpretability of the model. Visualized results play a crucial role in demonstrating the effectiveness and interpretability of the proposed method.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The main contribution of this work is in the area of interpretability, but the authors have not provided any content related to interpretability or visualization in the experimental section.



Review #2

  • Please describe the contribution of the paper

    This paper proposes an explainable AI model to detect and classify lesions in breast ultrasound images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strong points of this paper are:

    • It introduces an explainable AI model with two clear objectives: (1) to allow the system to improve performance by taking clinical feedback into account; (2) to explain to the clinicians why the choices that appear as output are made, eliminating the notion of a “black box” and increasing clinicians’ acceptance of systems like these.
    • It is very well written; the information is clear and well articulated in general.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The work’s weak points are:

    • There is a lack of information about the data (which is very important and decisive in interpreting the results).
    • There are some choices made that are not justified
    • The tables are a bit confusing
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • In the abstract the authors mentioned: “Training and evaluation code is available at ****”.
    • The submission does not mention open access to data

    Although the data is private, with the availability of the code it is my understanding that the model can be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper deals with the very interesting topic of explainable AI. Some points are not clear:

    • What is the examination time added by using this AI model?
    • The last two sentences of the abstract are written in a confusing way, they should be rewritten.
    • Why did the authors use Resnet-101 as the FPN?
    • Why did the authors use Mask RCNN when there are other more evolved detection networks with better performance (Fast R-CNN, Faster RCNN, etc.)?
    • What are the results by BI-RADS category? What is the distribution of BI-RADS levels in training and testing? This can influence or bias the result.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical part of this paper is very interesting (explainable AI on breast ultrasounds to detect and classify lesions), and the models are well described and reproducible. However, some aspects related to the choice of the base networks and the distribution of the data are not clear. The abstract also has confusing sentences.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors applied a Concept Bottleneck Network (CBN) to develop an AI model that performs interpretable lesion prediction from breast ultrasound (BUS) images by predicting BI-RADS features before the final cancer classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Novel application of an XAI technique (CBN for cancer prediction from BUS); 2) The AI model reported promising results: 2.1) it achieved better results compared to a standard black-box architecture; 2.2) it was applied to real data in a clinically relevant scenario.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The authors did not perform k-fold cross-validation, which in this scenario (an internal medical dataset of limited size) might be crucial for assessing the model’s generalization capabilities. 2) No limitations are mentioned.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    CBM models offer advantages in terms of interpretability, but their development is limited by the availability of concept-annotated datasets. Future research might pursue the automatic discovery of concepts, e.g.: https://openreview.net/forum?id=FlCg47MNvBA

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is an interesting showcase of how multidisciplinary collaboration between radiologists and AI scientists can foster the successful development of XAI systems. In particular, the present work was made feasible by the creation of an annotated dataset with lesion delineations and the BI-RADS standard lexicon.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors added a limitation section as requested.




Author Feedback

We appreciate the comments of all the reviewers (R3, R4, and R5) on all review points (Q4-Q14). We have tried to address all feedback and believe our manuscript has been strengthened as a result. Please find the authors’ (A) detailed responses below.

Q6: “There is a lack of information about the data…” (R3), “The paper’s structure is not well-organized. It devotes excessive space to describing the dataset, while providing insufficient detail on the methodology. Many crucial implementation details are missing, making it difficult for readers to grasp the specifics of the proposed method.” (R4) (A) Supplemental Table 1 provides additional descriptive statistics. Section 2 describes the architecture and weight freezing. Section 3.2 describes the data split. Section 3.3 describes learning rate schedule, loss, and augmentation. Subsections of 3.3 describe concept correction and cancer head architectures. Section 4 describes hyperparameter optimization and Supplemental Table 2 provides the search space. The planned code release will also provide clarity.

Q6: Critique of the lack of visualized predictions and BI-RADS lexicon labels. (R4) (A) We have created an additional figure with concrete predictions, annotations, BI-RADS lexicons, and biopsy labels on the testing set which addresses this comment, as well as R4’s comments in Q10 and Q12 vis-à-vis visualization.

Q6: Critique of the lack of k-fold cross validation (CV) for estimation of generalization. (R5) (A) We have a relatively large dataset (R4) and thus felt CV was not necessary to obtain accurate performance estimates. This work would require nested CV and a full hyperparameter search, presenting computational constraints. We are collecting more data for future work showing generalizability.
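
To make the computational concern in this response concrete, here is a minimal sketch of nested cross-validation with folds grouped by woman, so that images from one woman never appear on both sides of a split. It is an illustration under stated assumptions, not the authors' procedure: `group_folds`, `nested_cv`, the fold counts, the candidate hyperparameter values, and the `toy_fit_and_score` callback (which stands in for training and scoring a real model) are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def group_folds(groups: Sequence[str], k: int, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds so that each group stays in a single fold."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of = {g: i % k for i, g in enumerate(unique)}
    folds: List[List[int]] = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of[g]].append(idx)
    return folds


def nested_cv(
    groups: Sequence[str],
    candidates: Sequence[float],
    fit_and_score: Callable[[float, List[int], List[int]], float],
    outer_k: int = 3,
    inner_k: int = 2,
) -> List[float]:
    """Return one held-out score per outer fold, tuning only on the remaining data."""
    outer_scores = []
    for test_idx in group_folds(groups, outer_k, seed=0):
        held_out = set(test_idx)
        dev_idx = [i for i in range(len(groups)) if i not in held_out]
        dev_groups = [groups[i] for i in dev_idx]

        def inner_score(h: float) -> float:
            # Inner loop: average validation score of candidate h on development data.
            scores = []
            for val_pos in group_folds(dev_groups, inner_k, seed=1):
                val_set = set(val_pos)
                train = [dev_idx[p] for p in range(len(dev_idx)) if p not in val_set]
                val = [dev_idx[p] for p in val_pos]
                scores.append(fit_and_score(h, train, val))
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_score)
        # Refit with the selected candidate and score once on the untouched outer fold.
        outer_scores.append(fit_and_score(best, dev_idx, test_idx))
    return outer_scores


# Toy data: one entry per image, labeled with a hypothetical woman identifier.
women = ["w1", "w1", "w1", "w2", "w2", "w3", "w4", "w4", "w5", "w6"]
fit_calls: List[float] = []


def toy_fit_and_score(h: float, train_idx: List[int], eval_idx: List[int]) -> float:
    """Stand-in for model training; records the call and prefers h close to 0.1."""
    fit_calls.append(h)
    return -abs(h - 0.1)


outer_scores = nested_cv(women, [0.01, 0.1, 1.0], toy_fit_and_score)
print(f"model fits: {len(fit_calls)}, outer scores: {outer_scores}")
```

The number of model fits grows as outer_k × (inner_k × number of candidates + 1), which is why a full hyperparameter search inside nested cross-validation becomes expensive for a detection network.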

Q6: Requested explanation for having several images per woman. (R4) (A) Our clinical data are collected opportunistically. The examining sonographer captures images they feel are necessary. These images may be at different angles and/or positions.

Q6: Requested limitations statement be added. (R5) (A) We have added a limitations statement. Briefly, the limitations are: limited demographic information, lack of evaluation alongside an expert reader, and lack of a geographically-distinct testing set.

Q10: Requested justification for choice of ResNet-101 FPN and Mask RCNN. (R3) Suggested future work in automatic concept discovery. (R5) (A) ResNet-101 is presented in the Mask RCNN paper and is a standard pre-trained FPN. Mask RCNN extends Faster RCNN to perform both detection and object segmentation and is familiar to the authors. Future work could extend to more advanced models. We thank R5 for the suggestion; though it is beyond the current scope, it is planned for future work.

Q10: Request for information on BUS examination time added by model use. (R3) (A) Model predictions are available to the examining sonographer instantaneously; optional concept correction may add negligible additional time (we estimate 5-15 seconds).

Q12: Requested additional information on data distribution and architecture. Requested distribution and performance disaggregated along the BI-RADS lexicon. Requested data inclusion/exclusion flowchart. (R5) (A) See Q6 responses for architecture clarification. Supplemental Table 1 provides BI-RADS lexicon distribution in each data split. Splitting along case-control groups helps to maintain balance between “malignant-looking” and “benign-looking” lesions. We do not present subgroup performance because the limited sample size in some subgroups (such as not parallel lesions) leaves the analysis underpowered. Unfortunately, due to page limitations, we are unable to provide a visualization of the inclusion/exclusion process.

Q*: General critique of manuscript and abstract writing style, citing lack of clarity. (A) We have revised the abstract and description of the model architecture as well as training procedure in Section 2 to enhance clarity.

We thank the reviewers for their comments.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two reviewers favor acceptance, while reviewer #4 expects illustrations or visualizations of interpretability, which I believe is a less relevant comment given that the work does not aim to provide interpretability through visualization. Overall, I recommend acceptance.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces an explainable AI model, based on a standard lexicon and a concept bottleneck layer, to detect and classify lesions in breast ultrasound images. The paper received (weak accept -> no reassessment, weak reject -> weak reject, accept -> accept) scores (before -> after rebuttal). The main strengths identified were the following: the paper is well written, with a well-formulated explainable AI model, a good level of interpretability, a novel application, and promising results. The reviewers also raised the following weaknesses: lack of information about the data, missing implementation details, and lack of qualitative visual results demonstrating interpretability. The main weakness is the last one in this list (i.e., lack of qualitative results), which is particularly important in an XAI method. This paper has positive and negative aspects, but the lack of visual results is a major issue, particularly for an explainable AI model.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Proposes an explainable AI model to detect and classify lesions in breast ultrasound images with large-scale validation. Reviewer concerns were lack of data description, not well justified choices, lack of implementation details, no visualizations for interpretability. All of these points have been addressed per rebuttal including an additional figure to be included for interpretability. Seems reasonable to accept.



