Abstract

The lack of large and diverse training data for Computer-Aided Diagnosis (CAD) in breast cancer detection has been one of the concerns impeding the adoption of such systems. Recently, pre-training with large-scale image-text datasets via Vision-Language Models (VLMs) such as CLIP has partially addressed the issues of robustness and data efficiency in computer vision (CV). This paper proposes Mammo-CLIP, the first VLM pre-trained on a substantial amount of screening mammogram-report pairs, addressing the challenges of dataset diversity and size. Our experiments on two public datasets demonstrate strong performance in classifying and localizing various mammographic attributes crucial for breast cancer detection, showcasing data efficiency and robustness similar to CLIP in CV. We also propose Mammo-FActOR, a novel feature attribution method, to provide a spatial interpretation of the learned representations with sentence-level granularity within mammography reports. Code is available publicly (we will release the model checkpoints upon decision): https://github.com/annonymous-vision/miccai

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0926_paper.pdf

SharedIt Link: https://rdcu.be/dY6kX

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_59

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0926_supp.pdf

Link to the Code Repository

https://github.com/batmanlab/Mammo-CLIP

Link to the Dataset(s)

https://www.kaggle.com/competitions/rsna-breast-cancer-detection

https://vindr.ai/datasets/mammo

BibTex

@InProceedings{Gho_MammoCLIP_MICCAI2024,
        author = { Ghosh, Shantanu and Poynton, Clare B. and Visweswaran, Shyam and Batmanghelich, Kayhan},
        title = { { Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {632--642}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new method called Mammo-FActOR that focuses on learning attribute-level similarity between text and images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The topic is interesting. The paper is easy to follow. The proposed method is sound.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Please see the detailed comments

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The competing methods in Table 1 are only baseline models, without any other methods from recent literature; it reads more like an ablation study. Can the authors explain more about the contrastive loss in Mammo-FActOR? Why minimize the contrastive loss between images with and without attributes? Are x and x~ fixed as CC and MLO? Were the experiments conducted only on CC images?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty; domain knowledge

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces Mammo-CLIP, the first Vision-Language Model (VLM) applied to mammography. The authors construct an in-house dataset pairing mammograms with corresponding reports for pre-training. To address data insufficiency, they employ Multi-View Supervision (MVS) and report synthesis for data augmentation. Mammo-CLIP demonstrates superior performance across various settings (zero-shot, linear probing, fine-tuning) and clinical tasks. Additionally, the paper presents a feature attribution method to provide interpretability for the VLM in medical imaging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper integrates various off-the-shelf techniques to introduce the first Vision-Language Model (VLM) in mammography. They design several data augmentation methods to address the common issue of data insufficiency in the medical field. Their proposed model demonstrates impressive performance across different settings and clinical tasks compared to the CLIP baseline. Additionally, they propose a feature attribution method, which serves as an interpretability tool for medical VLMs. Their work establishes a specialized foundation model in mammography, potentially influencing the field and inspiring VLM research in other medical domains.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The paper lacks a comprehensive performance comparison with other VLMs, including those not specifically designed for mammography or x-rays [1,2,3]. Comparing specialist and generalist medical foundation models would provide valuable insights.
    • The authors should discuss the potential impact of their work on other domains within x-ray imaging or medical imaging modalities beyond x-ray.
    • Certain components adopted in the paper, such as MVS and report synthesis, lack ablation studies. A detailed examination of each technique’s necessity would enhance the understanding of their contributions.
    • The paper should include in-depth discussions on VLMs in medical imaging. Topics such as the preference for specialist versus generalist models in different settings, the required amount of paired data for specialist medical VLM development, the influence of domain shifts (different machines/hospitals) on VLM performance, and the key differences between mammography VLMs and other x-ray VLMs warrant exploration. Insightful analysis and discussions on the broader medical VLM field would enrich the paper.

    [1] Zhang, Sheng, et al. “Large-scale domain-specific pretraining for biomedical vision-language processing.” arXiv preprint arXiv:2303.00915 (2023).
    [2] Eslami, Sedigheh, Christoph Meinel, and Gerard de Melo. “PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?” Findings of the Association for Computational Linguistics: EACL 2023. 2023.
    [3] Lin, Weixiong, et al. “PMC-CLIP: Contrastive language-image pre-training using biomedical documents.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces Mammo-CLIP, the first VLM for mammography, integrating various off-the-shelf techniques to enhance performance compared to the CLIP baseline. However, it lacks a comprehensive performance comparison with other VLMs and omits ablation studies. Moreover, the paper lacks insightful analysis and discussions on the broader medical VLM landscape.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors build the Mammo-CLIP VLM from a large dataset formed from in-house, VinDr, and RSNA datasets. MVS, image transformations, translation between two languages, and report generation from attributes plus randomly selected prompts are used to augment the data. During validation, CLIP is used as a baseline, and ResNet and EfficientNet are used as backbones. The models are evaluated for their classification performance (AUC for calcification, mass, and malignancy; accuracy for density) and localization performance (mAP) on the VinDr and RSNA datasets. The proposed VLM is shown to improve over the baseline. Mammo-FActOR is proposed as a method to map the visual representation to textual attributes by learning an MLP over the image encoder using contrastive learning. Localization of mass and calcification is done using the feature maps from Mammo-FActOR (without any boxes during training).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper builds a mammography VLM (Mammo-CLIP) using a large scale dataset. Quantitative validation on various tasks show improvements over the CLIP baseline. Creation of a large dataset like this is very useful for the community (Will the dataset be made available to the public?).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The following need to be clarified:
    - Should there be a negative sign in Eq. 1?
    - The summation notation in Eq. 2 is not clear. What is the summation over?
    - What is the form of the function pi? Is it the dot product between the MLP output and t?

    Although the description of the method is easy to understand, it is very hard to read and follow the “Results” section because Tables 1-3 are placed far earlier in the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    https://github.com/annonymous-vision/miccai.

    Model checkpoints to be released after decision

    Experimental details in 3.2

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors build a VLM using mammography images and text, trained contrastively on a large dataset. One of the main weaknesses is the “Results” section, whose readability can be improved further. Tables 1-3 could be moved closer to the Results section so that readers do not have to scroll up many pages while reading it. The column order in Table 2 should follow the same ZS, LP, FT order as in Table 1, for consistency and to make it easier for the reader.

    Finally, is there any reason for not using vision transformer backbones? The authors say that this is a future direction, but in general, the SOTA architectures for building foundation models are vision transformers due to their better expressivity during self-supervised training.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses clinically important tasks in mammography and produces the respective models on large datasets to solve these tasks. Quantitative validation shows improvements over the CLIP baseline. Although the initial part of the paper is easy to follow, the results section is not that easy.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers for their valuable feedback and suggestions. Adhering to the guidelines, we will not add any new experiments to the paper. We will add the “-” sign in Eq. 1 and move the results tables closer to the Results section. As some of the reviewers asked for comparisons with PubMed-based pre-trained models, we ran these experiments but will not include them in the paper; we briefly discuss them in this rebuttal.

Reviewer 1

  • On Mammo-Factor

    Mammo-FActOR aims to find the channel units that encode a certain attribute. For example, our aim is to identify the channel units that encode mass. First, from the report, we retrieve the sentences mentioning mass and obtain their embeddings. Next, our loss pulls the representations of images where mass is present closer to these embeddings and pushes apart the representations of images where mass is not present. This effectively extracts the units that encode the mass information.
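    For concreteness, here is a minimal PyTorch-style sketch of this idea, written only for illustration; the function and variable names (attribute_contrastive_loss, img_feats_pos, sent_emb, etc.) are hypothetical and not taken from the released implementation:

        import torch
        import torch.nn.functional as F

        def attribute_contrastive_loss(mlp, img_feats_pos, img_feats_neg, sent_emb, tau=0.1):
            """Pull MLP-projected features of images WITH the attribute (e.g., mass)
            toward the embedding of report sentences mentioning that attribute,
            and push features of images WITHOUT the attribute away from it."""
            z_pos = F.normalize(mlp(img_feats_pos), dim=-1)    # (N_pos, d)
            z_neg = F.normalize(mlp(img_feats_neg), dim=-1)    # (N_neg, d)
            t = F.normalize(sent_emb, dim=-1)                  # (d,) attribute sentence embedding

            sim_pos = z_pos @ t / tau                          # similarity when attribute is present
            sim_neg = z_neg @ t / tau                          # similarity when attribute is absent

            # InfoNCE-style objective: each positive competes against all negatives
            logits = torch.cat([sim_pos.unsqueeze(1),
                                sim_neg.unsqueeze(0).expand(len(sim_pos), -1)], dim=1)
            labels = torch.zeros(len(sim_pos), dtype=torch.long)  # index 0 is the positive
            return F.cross_entropy(logits, labels)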

  • Use of CC and MLO Views

The paper uses both CC and MLO images for training with the following strategy: x^I and x^I~ are the original image and an augmented variant of the image. If a patient has both CC and MLO views, x^I is the CC view and x^I~ is the MLO view. If a patient has only a CC or only an MLO view, x^I~ is an augmented variant of x^I. We describe the augmentations in detail in the paper.
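As an illustrative sketch of this pairing rule (our own rendering; the helper augment and the dictionary keys are assumptions, not taken from the released code):

    def make_image_pair(views, augment):
        """Return (x_I, x_I_tilde) for one study.
        views: dict possibly containing 'CC' and/or 'MLO' images.
        augment: callable applying the augmentations described in the paper."""
        if 'CC' in views and 'MLO' in views:
            # both views available: treat CC and MLO as the two views of the study
            return views['CC'], views['MLO']
        # only one view available: pair the image with an augmented copy of itself
        x = views['CC'] if 'CC' in views else views['MLO']
        return x, augment(x)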

Reviewer 2

  • Comprehensive Performance Comparison with Other VLMs

We appreciate the reviewer’s suggestion. We compared against PMC-CLIP during the rebuttal period. In accordance with the guidelines, which do not allow adding a new set of experiments, we do not include the updated results here. Our model outperforms PMC-CLIP by at least 16% on every metric, as PMC-CLIP is trained on PubMed documents with low-quality images, whereas we pre-train our model on high-quality real-world patient data.

  • Impact on Other Domains Within X-ray or Beyond:

    While we developed Mammo-CLIP for mammograms only, it can easily be extended to other x-ray imaging domains.

  • Ablation Studies on MVS and Report Synthesis:

    We did not synthesize any reports in our work; we only use the reports during pre-training. Adhering to the guidelines, we are not allowed to add a new set of experiments. We performed an ablation study using only the image-text pair (x^I, x^T) instead of the augmentations: this reduces object detection performance (with fine-tuning of the image encoder) by 15% and classification performance (with fine-tuning of the image encoder) by 13%.

  • In-depth Discussion on VLMs in Medical Imaging:

We acknowledge the reviewer’s request for a more comprehensive discussion on the nuances of applying VLMs in medical imaging. Specialist VLMs are tailored for specific medical tasks or domains, such as mammography or pathology. These models are trained with highly curated datasets that focus on particular types of imaging or diseases. The primary advantage of specialist models is their enhanced accuracy and reliability in specific contexts due to their tailored training, which deeply encodes domain-specific nuances in their parameters. Generalist VLMs, on the other hand, are trained with more diverse datasets that cover a broader spectrum of medical conditions and imaging types. These models aim to provide a more flexible and scalable approach, capable of handling various tasks without needing retraining for each new application. While they offer greater versatility, they might not reach the same level of precision as specialist models in certain specific tasks.

Reviewer 3

  • Should there be a negative sign in Eq 1?

Yes, we will correct this.

  • Eq 2 Summation Notation

The summation notation in Equation 2 is intended to convey the comprehensive self-supervision across all combinations of the original and augmented pairs. To clarify, the summation should cover all possible pairs of original and augmented image and text representations, excluding identical pairs.
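Written out, one plausible reading of this summation (our own notation, assuming a symmetric CLIP-style contrastive term \ell applied to each image-text combination across the original and augmented views) is:

    \mathcal{L}_{\mathrm{MVS}} = \sum_{u \in \{I, \tilde{I}\}} \sum_{v \in \{T, \tilde{T}\}} \ell(x^{u}, x^{v}),

where x^{I}, x^{\tilde{I}} denote the original and augmented (or second-view) images and x^{T}, x^{\tilde{T}} the original and augmented texts. If same-modality pairs are also summed over, the identical pairs (e.g., (x^{I}, x^{I})) are excluded, as stated above.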

  • Form of Function \pi

The function \pi in the Mammo-FActOR module measures the similarity between the MLP output and the attribute representation t_k. As the notation suggests, \pi is a similarity function; in our case, it is the dot product between the output of the MLP applied to the image representation and the textual representation of the attribute.
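In symbols, a minimal sketch of this similarity (our own notation, with h_\psi denoting the MLP applied to the image representation z^{I}):

    \pi(h_\psi(z^{I}), t_k) = \langle h_\psi(z^{I}), t_k \rangle,

i.e., the dot product between the MLP-projected image representation and the text embedding t_k of the k-th attribute.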




Meta-Review

Meta-review not available (early accepted paper).


