Abstract

Foundation Vision-Language Models (VLMs), trained on large-scale open-domain image-text pairs, have recently been adapted to build Vision-Language Segmentation Models (VLSMs) that accept text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, they could aid medical professionals in many clinical tasks where substantial time must be spent delineating the target structure of interest. Because annotated medical image datasets are scarce, VLSMs for medical images typically fine-tune a base VLM or a VLSM pretrained on open-domain natural image datasets; this fine-tuning is resource-intensive and expensive, as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed for VLMs: the pretrained model is kept frozen and only the adapters are trained during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models built on transformer encoders. Our experiments with widely used CLIP-based segmentation models show that, with only 3 million trainable parameters, the VLSM-Adapter outperforms the state of the art and is comparable to the upper-bound end-to-end fine-tuning. The source code is available at: https://github.com/naamiinepal/vlsm-adapter.
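For readers unfamiliar with adapters, the sketch below illustrates the general idea the abstract refers to: a small bottleneck block is trained while the pretrained backbone stays frozen. The module structure, dimensions, and the `vlsm`/`adapters` names are illustrative assumptions, not the exact VLSM-Adapter architecture.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small trainable block: down-project, non-linearity, up-project, residual add."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


# Typical usage: freeze the pretrained VLSM and train only the adapter parameters.
# ("vlsm" and "vlsm.adapters" are hypothetical names for the model and its inserted blocks.)
# for p in vlsm.parameters():
#     p.requires_grad = False
# for p in vlsm.adapters.parameters():
#     p.requires_grad = True
```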

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/4190_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/4190_supp.pdf

Link to the Code Repository

https://github.com/naamiinepal/vlsm-adapter

Link to the Dataset(s)

https://www.kaggle.com/c/bkai-igh-neopolyp/data
https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset
http://humanheart-project.creatis.insa-lyon.fr/database/#collection/6373703d73e9f0047faa1bc8
https://stanfordaimi.azurewebsites.net/datasets/23c56a0d-15de-405b-87c8-99c30138950c
https://www.kaggle.com/datasets/balraj98/cvcclinicdb
https://challenge.isic-archive.com/data
https://dfu-challenge.github.io/dfuc2022.html
https://datasets.simula.no/kvasir-seg

BibTex

@InProceedings{Dha_VLSMAdapter_MICCAI2024,
        author = { Dhakal, Manish and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
        title = { { VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an adapter model to efficiently fine-tune pre-trained VLSMs on small, domain-specific datasets. The experimental results indicate that, on small datasets, the adapter model outperforms end-to-end fine-tuned models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors explored an adapter-based VLSM for medical image segmentation. The results demonstrate its effectiveness.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The innovation of the paper is limited. There are no significant advancements in the architecture design of the VLSM-Adapter.
    2. Comparison with other segmentation methods is crucial, since performance matters as much as efficient fine-tuning.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It is imperative to compare with other SOTA segmentation methods that use a single modality (image).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The innovation of the VLSM-Adapter is limited.
    2. The authors attempted to introduce a novel task, possibly to circumvent direct comparison with traditional SOTA methods.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The rebuttal did not adequately address my concerns about the limited novelty and the inadequate discussion on other SOTA segmentation methods utilizing single modality (image).



Review #2

  • Please describe the contribution of the paper

    The paper introduces VLSM-Adapter, a novel approach for efficiently fine-tuning pretrained Vision-Language Segmentation Models (VLSMs) for medical image segmentation. By incorporating lightweight adapter modules, the proposed method reduces the computational resources required for fine-tuning, making it more feasible for adapting VLSMs to domain-specific datasets. Experimental results demonstrate the superiority of VLSM-Adapter over existing methods in terms of both performance and efficiency.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    ① VLSM-Adapter introduces a novel method for fine-tuning VLSMs with minimal trainable parameters, addressing the resource-intensive nature of the process. ② The paper provides thorough experimental validation on diverse medical datasets, demonstrating the effectiveness of VLSM-Adapter in achieving state-of-the-art performance. ③ The paper clearly presents the motivation, methodology, experimental results, and contributions of the proposed approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The method combines adapters with CLIPSeg to showcase the effectiveness of efficient training. However, the novelty of the approach is somewhat limited.
    2. It is not clear why using adapters achieves better results than end-to-end fine-tuning. If adapters are the reason, it would be better to report results from incorporating adapters and then performing end-to-end fine-tuning.
    3. The rationale behind selecting λ_d and λ_{ce} as 1.5 and 1, respectively, should be analyzed.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    please refer to the weakness part

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach has limited novelty and lacks certain details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The rebuttal did not adequately address my concerns about the limited novelty and the reasoning behind the effectiveness of the approach.



Review #3

  • Please describe the contribution of the paper

    The paper introduces VLSM-Adapter, a method for fine-tuning vision-language segmentation models. Compared with other VLSM adaptation techniques used in biomedical image analysis, the presented method uses adapters, is more computationally efficient, and still obtains better results. The authors tested different variants of their method and showed competitive results on a wide range of segmentation benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The VLSM-Adapter is an interesting approach to leverage the power of large pre-trained models on small datasets and limited computing resources. 
    • The method is simple and can be easily ported to other architectures. 
    • The authors presented a very convincing set of benchmarks, showing that the VLSM-Adapter performs well even with very limited parameters. 
    • The authors experimented with three different VLSM-Adapter architecture variants and two adapter block variants, showing competitive results across the board.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • An extremely popular method for fine-tuning large language models (but also shown to be extremely successful on large vision models and CLIP) is LoRA*. The LoRA approach is similar in spirit since it adapts a network to a different task without changing the pretrained model parameters. Given the many successful applications of LoRA and its extreme popularity, the lack of mention of it in the manuscript is a bit concerning. The authors should state how their method compares to LoRA. Moreover, including comparative experiments between LoRA adaptation and VLSM-Adapter would have been extremely valuable in supporting the proposed method. *LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022)
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The authors presented only a single variant of their DA and SA blocks (with 3M and 4.2M parameters, respectively). Since the number of trainable parameters is crucial for analyzing the results, it would be extremely interesting to see how the size of the DA and SA blocks affects them. Did the authors experiment with different DA and SA sizes?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method presented showcases very good results. Moreover, their approach’s simplicity can open the field to broader applicability of large pre-trained VL models to biomedical applications.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you to the Authors for including a comparison with LoRA in their manuscript and addressing my questions.

    My assessment of the manuscript is unchanged.




Author Feedback

We thank all the reviewers for their helpful reviews and address their concerns below.

Limited novelty/innovation (R3 & R4): Although adapters were first introduced much earlier (ICML'19), their modifications and adaptations for downstream tasks remain an important area of current research: CLIP-Adapter (IJCV'23) for classification, VL-Adapter (CVPR'22) for VQA, visual reasoning, and image captioning, Meta-Adapter (NeurIPS'23) for few-shot image classification and open-vocabulary object detection, and ViT-Adapter (ICLR'23) for classification and segmentation using vision-only models. Thus, contributions beyond architectural innovation, such as adapting the idea to new tasks or contexts with comprehensive experiments, are equally important. Our contribution and novelty lie in this direction: we extend adapters to segmentation tasks using Vision-Language Segmentation Models (VLSMs), which had not been done before.

No mention of LoRA (R1): Thanks for pointing this out. We will add this in related work: “Unlike previous adapter methods, LoRA (ICLR’22) was designed to adapt large language models (LLMs), achieving zero additional inference latency by merging the pre-trained and adapted weights. In contrast, our method uses the adapted features as a parallel branch to the pre-trained block.” Since exploring LoRA in VLSMs would be interesting, we will add in the conclusion section: “Exploring the application of LoRA for VLSMs will be an interesting future direction.”
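To make the distinction in the quoted passage concrete, below is a hedged sketch contrasting a LoRA-style linear layer, whose low-rank update can be merged into the pretrained weight for zero-overhead inference, with an adapter-style parallel branch that remains at inference time. Class names, ranks, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """LoRA-style layer: a low-rank update on a frozen weight; after training,
    W + B @ A can be folded into W, so inference adds no extra latency."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)       # pretrained, kept frozen
        self.base.weight.requires_grad = False
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

    def merge(self) -> None:
        # Fold the low-rank update into the pretrained weight for zero-overhead inference.
        with torch.no_grad():
            self.base.weight += self.lora_b.weight @ self.lora_a.weight


class ParallelAdapter(nn.Module):
    """Adapter-style parallel branch, as described in the rebuttal: adapted features
    are computed alongside the frozen block and added back, so the extra branch
    is still executed at inference time."""

    def __init__(self, frozen_block: nn.Module, dim: int, bottleneck: int = 64):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.adapter(x)
```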

Effects of changing block size (R1): We experimented with different block sizes on the Kvasir-SEG dataset: block sizes of {0.76M, 1.5M, 3M, 5.9M} for DA gave DSC in the range 87.77 to 89.10, and {1M, 2M, 4.2M, 6M} for SA gave DSC in the range 85.92 to 86.98. We therefore chose block sizes of 3M for DA and 4.2M for SA, corresponding to adapter dimensions of 512 and 64, respectively, as a trade-off between DSC and parameter count. These details were not reported as ablation studies due to page limitations.
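As a rough illustration of how the adapter (bottleneck) dimension drives the trainable-parameter budget discussed above, the snippet below counts parameters for a simple down-projection/up-projection adapter. The hidden width and the number of inserted blocks are placeholder assumptions and will not reproduce the exact 3M/4.2M figures reported in the rebuttal.

```python
def adapter_param_count(hidden_dim: int, adapter_dim: int, num_blocks: int) -> int:
    """Parameters of num_blocks bottleneck adapters: down (W + b) plus up (W + b)."""
    per_block = (hidden_dim * adapter_dim + adapter_dim) + (adapter_dim * hidden_dim + hidden_dim)
    return num_blocks * per_block


# Illustrative sweep over bottleneck widths for a hypothetical 512-d encoder with 6 adapters.
for adapter_dim in (64, 128, 256, 512):
    print(adapter_dim, adapter_param_count(hidden_dim=512, adapter_dim=adapter_dim, num_blocks=6))
```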

Missing comparison to SOTA (R3): The paper aims to use adapter fine-tuning to obtain performance comparable to end-to-end (E2E) fine-tuning. Thus, our experiments focus on comparing these two fine-tuning approaches, as is the practice in similar works such as CLIP-Adapter (which compares CLIP-Adapter, zero-shot CLIP, Linear-Probe CLIP, and CoOp) and VL-Adapter (which compares zero-shot VL models, end-to-end fine-tuning, and various adapter modules). Moreover, the datasets we use come from the benchmark study of Poudel et al. (2024), which compares VLSM results with SOTA vision-only models. We will add the following to the results section: “Performance of the end-to-end fine-tuning of the VLSMs we use for these datasets is comparable to the SOTA vision-only models (Poudel et al. 2024).”

Significance of adapters (R4): While it is not entirely clear why adapters perform better than E2E fine-tuning on certain downstream tasks (both in our work and in related works), the advantage likely comes from a combination of the architectural benefits adapters bring and their being less prone to overfitting on the small datasets where fine-tuning usually happens. E2E fine-tuning after adding adapters increases the number of trainable parameters, which can lead to further overfitting on the downstream task. While this could be one step toward better understanding how adapters work, we believe it requires a more rigorous study with larger datasets.

Reason for not ablating λ_d and λ_ce (R4): Computational constraints prevented us from performing a grid search over λ_d and λ_ce. However, we selected these coefficients using heuristics and preliminary experiments to ensure balanced performance across different tasks and methods, including baselines. We apply the coefficients λ_d = 1.5 and λ_ce = 1 consistently across all methods for easier comparison.
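For reference, here is a minimal sketch of a weighted Dice plus cross-entropy objective with the stated coefficients λ_d = 1.5 and λ_ce = 1. It uses a generic soft-Dice formulation for binary segmentation and is an assumption about the loss form, not necessarily identical to the paper's implementation.

```python
import torch
import torch.nn.functional as F


def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  lambda_d: float = 1.5, lambda_ce: float = 1.0,
                  eps: float = 1e-6) -> torch.Tensor:
    """Weighted sum of soft Dice loss and binary cross-entropy.

    logits, target: tensors of shape (B, 1, H, W); target contains {0, 1}.
    """
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice_loss = 1.0 - (2.0 * inter + eps) / (union + eps)          # per-sample Dice loss
    ce_loss = F.binary_cross_entropy_with_logits(
        logits, target.float(), reduction="none"
    ).mean(dim=(1, 2, 3))                                          # per-sample BCE
    return (lambda_d * dice_loss + lambda_ce * ce_loss).mean()
```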




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal did not adequately address concerns regarding the limited novelty of the paper and the lack of discussion on other state-of-the-art (SOTA) segmentation methods using a single modality (image). In addition, the rebuttal failed to provide sufficient reasoning behind the effectiveness of the proposed approach. I suggest a “Reject.”




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes an adapter model for efficiently fine-tuning pre-trained Vision-Language Models (VLSMs) on domain-specific small datasets. The experimental results show that the adapter model outperforms end-to-end fine-tuned models on small datasets. However, the innovation of the VLSM-Adapter is limited. Additionally, the authors appear to introduce a novel task, possibly to avoid direct comparison with traditional state-of-the-art methods.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews and the criticism relates to novelty and clarity. This meta reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the paper methodology is generally sound, and it presents an interesting method for fine-tuning VLSMs with minimal trainable parameters. Thus, the paper makes a good starting point for further research of using VLSM on small datasets with specific tasks. The authors should clearly highlight limitations in their discussion.



