Abstract

A vision-language model (VLM) pre-trained on natural image-text pairs faces a significant barrier when applied to medical contexts due to domain shift. Yet adapting or fine-tuning these VLMs for medical use presents considerable hurdles, including domain misalignment, limited access to extensive datasets, and high class imbalance. Hence, there is a pressing need for strategies to effectively adapt these VLMs to the medical domain, as such adaptations would prove immensely valuable in healthcare applications. In this study, we propose a framework designed to adeptly tailor VLMs to the medical domain, employing selective sampling and hard-negative mining techniques for enhanced performance in retrieval tasks. We validate the efficacy of our proposed approach by implementing it across two distinct VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess the performance of these models both in their original off-the-shelf state and after undergoing our proposed training strategies, using two extensive datasets containing mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. Through our approach, we observe a notable enhancement in Recall@K performance for the image-text retrieval task.
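
For reference, a short sketch of how Recall@K is commonly computed for image-text retrieval from a similarity matrix; this is an illustrative reading of the metric, not the authors' evaluation code, and all names below are hypothetical:

    import numpy as np

    def recall_at_k(sim, k):
        """Fraction of queries whose true match (assumed to sit on the
        diagonal) appears among the top-K retrieved candidates."""
        order = (-sim).argsort(axis=1)  # best-scoring candidate first, per row
        hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
        return hits.mean()

    # Toy similarity matrix: rows = image queries, columns = report candidates
    sim = np.random.default_rng(0).random((100, 100))
    print(recall_at_k(sim, k=10))    # image-to-report R@10
    print(recall_at_k(sim.T, k=10))  # report-to-image R@10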

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3702_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3702_supp.pdf

Link to the Code Repository

https://github.com/aurooj/VLM_SS.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Uro_Knowledgegrounded_MICCAI2024,
        author = { Urooj Khan, Aisha and Garrett, John and Bradshaw, Tyler and Salkowski, Lonie and Jeong, Jiwoong and Tariq, Amara and Banerjee, Imon},
        title = { { Knowledge-grounded Adaptation Strategy for Vision-language Models: Building a Unique Case-set for Screening Mammograms for Residents Training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a grouping method for mammograms that overcomes the difficulties contrastive learning encounters on a highly biased sample set by constructing contrastive batches whose members truly differ.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem of selecting samples to contrast from a highly biased dataset in contrastive learning is real, and it is believable that it affects mammography. It is also a strength of this paper that they apply their method to fine-tuning pretrained models both in and out of domain.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The motivation section seems disconnected from the rest of the paper. Many steps are described in insufficient detail, e.g., the reports are “cleaned”, and the breast tissue area is cropped and pixel data from both views are stitched together, originating from the chest wall (how?). The main claim of the paper includes hard-negative mining, but it is not clear where this technique is actually used.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Neither the dataset nor the source code is released. Numerous aspects are described at a very high level and would not be reproducible other than in a general sense.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    mutlimodal > multimodal. p. 2, “increase in recall rates”: more recall naively sounds good, but I know what you mean is a high false-positive rate, which is a problem; maybe this could be worded differently? p. 4: “primarily categorized into 10 groups”, but then 5 are listed. What are the other 5? What does “primarily” mean? Is there a secondary? When considering grouping, one issue is that different groups differ differently: ABC is somehow closer to ABCD than to DEF. It would be interesting to explore that. Table 2: Internel/Externel should end in -al.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is insufficient detail, code or public data to reproduce the work. It is unclear how the motivation section really drives the work. One of the claims, the use of hard negative mining, is not clearly applied.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    In the rebuttal, the authors state that they will release their code. I also better understand the motivation section after reading the rebuttal. Additional detail on the image processing and several unclear sentences were also addressed.



Review #2

  • Please describe the contribution of the paper

    A sampling strategy to train VLMs in the context of mammography retrieval. They use the training scheme on an out-of-domain VLM (ALBEF) and an in-domain VLM (MedCLIP) using a private mammography dataset of 46k patient mammogram-report pairs. However, the sampling strategy does not show improvements over the MedCLIP baseline on the in-domain test set and shows little improvement on out-of-domain test sets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The translation of MedCLIP to mammography retrieval.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper's writing and structure could be improved for clarity.
    • The authors claim the sampling strategy improves retrieval performance for the ALBEF model in the internal evaluation, but not for MedCLIP, which is an in-domain VLM.
    • The out-of-domain improvements are not clear enough, since no statistical analyses are provided to confirm that the differences are significant.
    • The authors claim that more experiments are needed but that they could not run them due to lack of time.
    • Figures are not readable; the text font is too small.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The figures supporting the sampling scheme have very small text, which makes it difficult for them to support the Methodology section.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Statistical tests are needed to validate that the out-of-domain improvement of your method is significant and a strong contribution.

    • The contributions should be listed and highlighted, giving exact percentages and clear justifications.

    • Figure 1: the text in b) is not readable; I would suggest removing subfigure a) and keeping only an enlarged subfigure b).
    • Figure 2: the font is too small.
    • Figure 3 should show a zoom of the identified lesion, which is not visible.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I would recommend improving the paper's structure. Some paragraphs are too long and do not go straight to the point, making the paper a bit difficult to read and understand.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a methodology for fine-tuning vision-language models (VLMs) to a specific medical radiology domain. The approach is based on efficient sampling of contrastive pairs, which also includes oversampling underrepresented groups. The methodology is validated on two mammogram datasets (internal vs. external), used for fine-tuning one general pretrained VLM (ALBEF) and one medical-radiology-specific pretrained VLM (MedCLIP).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I am not aware of other attempts at mini-batch sampling of contrastive pairs, involving oversampling of underrepresented groups, for mammogram retrieval, so I would say that (at least) the application is innovative. The implementation of the proposed methodology and the experiments reported in this work are non-trivial.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Except for the lack of clarity regarding some aspects of the presented methodology and results – which I described in the comments to authors – I don’t see any particular flaws in the paper.

    I did notice, however, that hyperparameter values for model training weren't explored more widely. Training models with preset hyperparameter values may lead to biased conclusions; for example, hypothetically, a chosen optimiser learning rate may work better for “our model” than the same one for “their model”. It would have been better if this were addressed in the experiments.

    Another, slight, drawback is that there is no publicly available code for recreating the experiments. Also, the two mammogram datasets are not publicly available.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Lack of clarity:

    1) Section 2, Knowledge extraction: If I understand correctly, “BIRADS image descriptors” are a set of narrative report guidelines. If so, I suggest using more appropriate wording here, because the term “image descriptor” is usually tied to image properties (image feature extractors), not to text. Next, it is unclear how the positive key concepts were extracted, and how/if the authors dealt with noise in the reports. Moreover, I find the following claim confusing: “abnormal image descriptors are primarily categorized into 10 groups – breast composition, calcification, asymmetry, mass, surgical changes.” I suggest rewriting/clarifying. Finally, the authors should explain how the negative and uncertain findings were detected/decided.

    2) Section 3, Datasets: What is the raw mammogram resolution and colour depth for the bilateral projections? How is the breast tissue area cropped? How is resizing done?

    3) Section 3, Implementation details and Results: What does the acronym ITM stand for? Also, model names vary across the text; naming should be unified.

    There are inconsistencies in variable, metric, and model naming (e.g., R1 vs. R@1) throughout the text, which should be corrected.

    Finally – and this is my personal preference – I suggest using the term “instance” or “example” instead of “sample” when addressing individual cases/data points. A sample (in statistics, primarily) is a set of instances. Moreover, “sample”/”sampling” is also used as a verb. Using the term “sample” for one “instance” makes the text more difficult to follow.

    Section 2, Selective Sampling – The oversampling method doesn’t consider the relative differences between underrepresented groups. I am curious as to why the authors didn’t consider using some kind of probability sampling here to compensate.
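
    For illustration, a minimal sketch of the kind of probability sampling suggested above, under an assumed instance-to-group mapping: groups are drawn with probability inversely proportional to their frequency, so rarer groups are compensated in proportion to how underrepresented they are rather than treated uniformly (all names here are hypothetical, not from the paper):

        import numpy as np

        def sample_group_batch(group_of, batch_size, rng=None):
            """Draw a mini-batch of instances from distinct groups, choosing
            groups with probability inversely proportional to frequency."""
            if rng is None:
                rng = np.random.default_rng()
            groups, counts = np.unique(list(group_of.values()), return_counts=True)
            probs = (1.0 / counts) / (1.0 / counts).sum()
            # batch_size must not exceed the number of distinct groups
            chosen = rng.choice(groups, size=batch_size, replace=False, p=probs)
            # One instance per chosen group, so no two batch members share a
            # group, preserving the paper's true-negative guarantee per batch.
            by_group = {g: [k for k, v in group_of.items() if v == g] for g in groups}
            return [rng.choice(by_group[g]) for g in chosen]

        # Hypothetical mapping: instance -> group (a combination of findings)
        group_of = {"case1": "mass", "case2": "mass", "case3": "mass",
                    "case4": "calcification", "case5": "asymmetry",
                    "case6": "mass+calcification"}
        print(sample_group_batch(group_of, batch_size=3))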

    Supplementary document referenced in the text is missing.

    I suggest proofreading the paper for English grammar and syntax, either with the help of a native speaker or by using an online service.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method of selective contrastive sampling for dealing with the problem of mammogram retrieval should be interesting to MICCAI audience. Although the proposed method is fairly simple, a lot of effort was put in this work - therefore I believe it should be presented at the conference. The presentation of the paper is very good. The experimental setup and the results presented are credible. The conclusions are supported by the results and the discussion.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their valuable feedback and address their major concerns below. We will release the code repo, including preprocessing, and add all feedback and results in the final version.

Motivation (R5, R6): The proposed retrieval model is used to automatically select relevant cases (mammogram-report pairs) from 100,000s of cases for training radiology residents. Hand-picking a set of cases is time-consuming, challenging, can introduce sampling bias, and is unlikely to match the desired distribution needed for adequate training of residents (pg 2, L4-L14 in paper).

Contribution (R6): We propose an 'innovative and non-trivial' (R1) method to train VLMs on radiology data via an efficient mini-batch sampling (SS) approach that addresses the 'real' (R5) challenges of contrastive learning in the medical domain: 1) high class imbalance introduces false negatives within a mini-batch, and 2) rare groups are underrepresented. We extensively validated the method on 'in-domain and out-of-domain VLMs' (R5) under zero-shot, few-shot, and supervised settings. Design choices (batch size, ratio R of freq.:rare groups, #freq. vs. #rare groups, mini-batch shuffling) are extensively studied. MedCLIP-SS (I2R: R@10=23.1, R2I=59.6) works better than MedCLIP (I2R: R@10=9.9, R2I=5.5) with smaller batches (B=8). Applying SS to larger batches (B>64) requires non-trivial solutions and extensive study, and is future work (Q6, C4).

Data preprocessing (R1, R5): We use a binary mask of thresholded pixel values to identify the largest connected component in the image and use its bounding box coordinates to crop the breast tissue area. The cropped bilateral images are concatenated, zero-padded to maintain the aspect ratio, and resized to 512x512 pixels. Reports are cleaned by lowercasing, removing punctuation, and removing extra spacing. The text is then split into sentences, each examined for key concepts: density, calcifications, asymmetry, architectural distortion, mass, and additional features. Negation sentences are ignored. If a sentence contains a key concept, the report is marked accordingly. Each key concept is detected separately, and the detections are combined to form discrete groups. We appreciate the reviewers' suggestion to study the relative distance between groups and will explore this in future work; for simplicity, the current approach treats all groups equally.
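
For concreteness, a minimal sketch of the image preprocessing and report grouping described above, assuming 8-bit grayscale inputs, OpenCV, and simple keyword matching; the threshold value, helper names, and keyword list are illustrative assumptions, not the authors' exact implementation:

    import re
    import cv2
    import numpy as np

    def crop_breast_tissue(img, thresh=20):
        """Crop to the bounding box of the largest connected component of a
        thresholded binary mask (the threshold is an assumed value)."""
        mask = (img > thresh).astype(np.uint8)
        _, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # row 0 = background
        x, y, w, h = stats[largest, :4]
        return img[y:y + h, x:x + w]

    def stitch_and_resize(left, right, size=512):
        """Concatenate cropped bilateral views, zero-pad to a square to keep
        the aspect ratio, and resize to size x size pixels."""
        h = max(left.shape[0], right.shape[0])
        pad = lambda im: np.pad(im, ((0, h - im.shape[0]), (0, 0)))
        stitched = np.concatenate([pad(left), pad(right)], axis=1)
        side = max(stitched.shape)
        canvas = np.zeros((side, side), dtype=stitched.dtype)
        canvas[:stitched.shape[0], :stitched.shape[1]] = stitched
        return cv2.resize(canvas, (size, size))

    # Report grouping: clean text, split into sentences, skip negations,
    # tag key concepts, and combine detections into a discrete group label.
    CONCEPTS = ["density", "calcification", "asymmetry",
                "architectural distortion", "mass"]  # illustrative list

    def report_group(report):
        text = re.sub(r"[^\w\s.]", " ", report.lower())  # lowercase, drop punctuation
        text = re.sub(r"[ \t]+", " ", text)              # remove extra spacing
        found = set()
        for sent in text.split("."):
            if re.search(r"\b(no|without)\b", sent):     # ignore negation sentences
                continue
            found.update(c for c in CONCEPTS if c in sent)
        return "+".join(sorted(found)) or "normal"

    print(report_group("There is a spiculated mass. No suspicious calcification."))
    # -> "mass"
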
R1: ITM stands for image-text matching. We corrected the notation inconsistencies in the paper. The supplementary material was excluded due to formatting issues by the organizers; we apologize for the inconvenience and have addressed the suggestions in the main manuscript.

R5, C2: An increased mammogram recall rate indicates incomplete information for diagnosis, requiring the patient to return for further testing by the expert reader. We will reword the sentence.

R5, Q6: Hard-negative mining: The proposed knowledge-grounded grouping of image-report pairs ensures sampling of true negatives within a mini-batch, i.e., no two examples within a mini-batch come from the same group (Sec. 2.3, L1-L6). It further ensures contrasting against examples from groups close to the anchor's group; e.g., ABC and ABCD are hard negatives for each other. Thus, the proposed grouping lets our sampling approach take care of hard-negative examples.

R5, C3: The 5 primary groups are breast composition, calcification, asymmetry, mass, and surgical change; the secondary groups are architectural distortion, intramammary lymph node, skin lesion, solitary duct, and skin and nipple retraction.

R6: Statistical analysis: We performed a t-test to compare pairwise similarity scores from MedCLIP-SS and MedCLIP on the external test set, with the alternative hypothesis that MedCLIP-SS is better than MedCLIP. We obtained t-statistic=7.47 with one-tailed p-value=5.90E-14 < 0.05 for image-to-report, and t-statistic=10.54 with p-value=1.33E-25 < 0.05 for report-to-image. This supports the significance of our result that the proposed selective sampling helps the model in retrieval.
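
A minimal sketch of this significance test, assuming an independent two-sample t-test over per-pair similarity scores (the rebuttal does not state the exact test variant, and the arrays below are random stand-ins for the real scores):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sim_ss = rng.normal(0.6, 0.1, 1000)    # stand-in: MedCLIP-SS similarity scores
    sim_base = rng.normal(0.5, 0.1, 1000)  # stand-in: MedCLIP similarity scores

    # One-tailed test; alternative hypothesis: MedCLIP-SS scores > MedCLIP scores
    t_stat, p_two = stats.ttest_ind(sim_ss, sim_base)
    p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2
    print(f"t = {t_stat:.2f}, one-tailed p = {p_one:.2e}")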




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a sampling strategy for fine-tuning VLMs for mammogram retrieval. The paper received (accept -> no reassessment, reject -> weak accept, reject -> no reassessment) scores (before -> after rebuttal). The positive aspects of the paper, according to the reviews, are the innovative application, the relevance of the studied problem, and the translation of MedCLIP to mammogram retrieval. The weaknesses of the paper are as follows: poor assessment of hyperparameters and lack of statistical tests. This paper has some pros and cons, but from the reviews and rebuttal it appears that the strengths outweigh the weaknesses.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


