Abstract

Adapting large pre-trained foundation models, e.g., SAM, for medical image segmentation remains a significant challenge. A crucial step involves the formulation of a series of specialized prompts that incorporate specific clinical instructions. Past works have relied heavily on a single type of prompt for each instance, necessitating the manual input of an ideally correct prompt, which is inefficient. To tackle this issue, we propose to utilize prompts of different granularity, sourced from the original images, to provide a broader scope of clinical insights. However, combining prompts of varying types can pose a challenge due to potential conflicts. In response, we have designed a coarse-to-fine mechanism, referred to as curriculum prompting, that progressively integrates prompts of different types. Through extensive experiments on three public medical datasets across various modalities, we demonstrate the effectiveness of our proposed approach, which not only automates the prompt generation process but also yields superior performance compared to other SAM-based medical image segmentation methods. Code will be available at: https://github.com/AnnaZzz-zxq/Curriculum-Prompting.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2832_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/AnnaZzz-zxq/Curriculum-Prompting

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zhe_Curriculum_MICCAI2024,
        author = { Zheng, Xiuqi and Zhang, Yuhang and Zhang, Haoran and Liang, Hongrui and Bao, Xueqi and Jiang, Zhuqing and Lao, Qicheng},
        title = { { Curriculum Prompting Foundation Models for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a curriculum prompting framework for adapting SAM, a large pre-trained foundation model, for medical image segmentation. Their approach relies on fine-tuning pretrained object detection models with medical image annotations and also fine-tuning SAM’s prompt encoder with ground-truth bounding boxes. The bounding boxes produced by the fine-tuned object detection model are used as an initial prompt to the fine-tuned SAM model to produce initial coarse segmentation results. Subsequently, the coarse segmentation mask is further augmented with additional key-point prompts to produce the final segmentation result. The authors’ claimed contributions are: 1) automated prompt generation for SAM models, 2) a curriculum prompting approach that combines bounding-box prompts and key-point prompts using intermediate coarse segmentation results, and 3) favorable empirical results compared to other state-of-the-art methods.
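    As a rough illustration of this coarse-to-fine flow, a minimal Python sketch follows. The callables passed in (detect_box, detect_points, sam_coarse, sam_fine) are hypothetical placeholders for the fine-tuned detector, keypoint model, and the two SAM variants, not the authors' actual interface.

        from typing import Callable
        import numpy as np

        def curriculum_prompt_segment(
            image: np.ndarray,
            detect_box: Callable,      # fine-tuned object detector (e.g., Grounding-DINO)
            detect_points: Callable,   # fine-tuned keypoint detector
            sam_coarse: Callable,      # SAM fine-tuned with box prompts
            sam_fine: Callable,        # SAM fine-tuned with point + mask prompts
        ) -> np.ndarray:
            # Coarse stage: a box prompt from the detector yields a coarse mask.
            box = detect_box(image)
            coarse_mask = sam_coarse(image, box_prompt=box)
            # Fine stage: key-point prompts plus the coarse mask yield the final mask.
            points = detect_points(image)
            return sam_fine(image, point_prompts=points, mask_prompt=coarse_mask)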

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper focuses on important topics that are related to adapting state-of-the-art large foundation models, like SAM, for medical image segmentation.
    • Interesting approach for combining bounding-box prompts with key-point prompts, using intermediate coarse segmentation results. This idea could be useful for other works related to adapting SAM for medical imaging.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • One of the main claimed contributions of this work is the automated prompt generation for SAM models. This is enabled by fine-tuning the object detection model Grounding-DINO with medical data and also fine-tuning SAM’s prompt encoder with ground-truth bounding boxes. The authors provide very little detail about the fine-tuning process and do not investigate several interesting questions, for example how many annotations would be sufficient to fine-tune SAM in different medical image domains.
    • The empirical comparisons with other methods have some inconsistencies with what was reported in the cited papers. For example, the LViT-TW method from [19] is reported to have lower performance on the QaTa-COV19 dataset compared to what was reported in Table II of reference [19]. This is surprising since the authors claim to use the same data split as in [19]. The difference could be explained by one sentence from the authors’ experiment section: “Note that we standardize the text prompt to the name or a simple description of the target lesion, such as polyp, thyroid nodule, or bilateral pulmonary infection, for fine-tuning LViT-TW [19]”. Does that mean that LViT-TW was trained in a different manner compared to [19]?
    • In the empirical comparison with [19], why was LViT-TW chosen instead of LViT-T? The latter model (LViT-T) is reported to achieve better results in [19].
    • In Table II of [19], one of the baselines, nnUNet, achieves better results on the QaTa-COV19 dataset compared to the proposed framework. The authors could include nnUNet as an additional benchmark method.
    • The authors should report the variance of empirical results where multiple data splits are used (for example, for the TN3K dataset comparisons).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors could improve their work as follows:

    • Providing more insights related to fine-tuning of object detection models like Grounding-DINO. How many labels are needed for medical imaging in order for such models to be useful as initial prompts for SAM-type models? Could this step be validated separately? Are there any engineering challenges related to the fine-tuning process? How does fine-tuning differ for different types of medical image modalities (CT, x-ray, etc.)
    • Providing more insights related to fine-tuning SAM’s prompt encoder with ground-truth bounding boxes. Same questions as above.
    • Clarify the questions related to empirical comparison with other models stated in the ‘weaknesses of the paper’ section.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Lack of analysis for the fine-tuning of SAM’s prompt encoder and Grounding-DINO
    • Questions related to empirical comparison with other methods
  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work proposes using prompts of different granularity on the original medical images to improve the performance of large pre-trained foundation segmentation models. It progressively integrates prompts of different types to avoid possible prompt conflicts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clear method design and structure; adequate control experiments; detailed experimental discussions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited performance improvement compared to SOTA SAM variants; limited novelty of this prompt-optimization work, as similar methods for visual-text binding and prompt tuning have already been proposed extensively in the NLP and CV fields.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This work proposes using prompts of different granularity on the original medical images to improve the performance of large pre-trained foundation segmentation models. It progressively integrates prompts of different types to avoid possible prompt conflicts.

    Strengths: clear method design and structure; adequate control experiments; detailed experimental discussions.

    Weaknesses: limited performance improvement compared to SOTA SAM variants; limited novelty of this prompt-optimization work, as similar methods for visual-text binding and prompt tuning have already been proposed extensively in the NLP and CV fields.

    Generally, this work contributes to the prompt optimization of SAM models, however with limited research novelty.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    WR - Limited research novelty

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors emphasized the performance advantages and innovation of the proposed method, answering my main questions. Based on this, I have updated the final rating.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a curriculum prompting approach to enhance the application of large pre-trained foundation models, specifically SAM, for medical image segmentation. The authors propose an automated method for generating optimal prompts, which are integrated in a coarse-to-fine manner to address segmentation tasks of varying complexity. The method demonstrates superior performance on three medical image datasets, outperforming existing SAM-based methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The curriculum prompting strategy is a novel contribution that effectively utilizes multiple prompt types to improve segmentation performance.
    • The automation of prompt generation is a significant advancement, reducing reliance on manual input and potentially improving consistency.
    • The method’s effectiveness across three diverse medical image datasets indicates broad applicability and robustness.
    • The paper presents strong quantitative and qualitative results, showcasing the method’s ability to outperform current state-of-the-art SAM-based methods.
    • The authors’ commitment to providing the code for replication and further research is commendable.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some details regarding the proposed approach are not presented. Specifically, what is the number of points used in the point prompt?
    • The workflow includes fine-tuning of pretrained object detection and keypoint detection models. Besides, the iterative curriculum prompting sounds time-consuming, too. Hence, I am concerned about the overall computational cost and training efficiency of the proposed method.
    • Discussion on the limitations of the proposed method and insights into future direction would add value to the paper.
    • Implementation details, such as training time (i.e., number of epochs), are missing, among others.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Clarify on the details of point generation as raised in the weakness section.
    • Discussion and comparison on the computation/training cost.
    • Discuss the limitations and possible future improvements.
    • Add more implementation details if space allows.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, the work proposes an innovative curriculum-based approach to assist in leveraging SAM with automated generation of multiple types of prompts. My major concern is that the entire process might be computationally intensive. Hence, I would expect the authors to provide some discussion and comparison regarding this aspect.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have generally addressed my concerns and improved the clarity of the paper.




Author Feedback

We thank the reviewers for appreciating the novelty (R4), the importance of our research (R1), and the adequate experiments (R3). In this rebuttal, we clarify the raised concerns, i.e., the workflow of the finetuning process (R1, R4), the computational cost of our method (R4), and comparisons to nnUNet (R1), where our method still outperforms this benchmark.

Common concerns:

  1. Finetuning details (R1, R4) Our method finetunes four distinct models. We first finetune a keypoint detection model. We then finetune an object detection model and use its boxes to finetune a SAM model to generate coarse masks. The resulting points and coarse masks are used to finetune another SAM model, ensuring each model builds upon the outputs of the previous one (a rough sketch follows).
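A compact sketch of this staged order is given below; the trainer callables are placeholders for illustration, not the released training scripts.

    def staged_finetune(data, train_keypoints, train_detector,
                        train_sam_boxes, train_sam_points_masks):
        # data: iterable of (image, gt_mask, gt_box, gt_points) tuples
        kpt_model = train_keypoints(data)            # stage 1: keypoint detector
        det_model = train_detector(data)             # stage 2: object detector
        sam_box = train_sam_boxes(data, det_model)   # stage 3: SAM with box prompts
        # Coarse masks from stage 3 feed the final SAM, so each model
        # builds upon the outputs of the previous one.
        coarse = [sam_box(img, det_model(img)) for img, *_ in data]
        sam_fine = train_sam_points_masks(data, kpt_model, coarse)  # stage 4
        return det_model, kpt_model, sam_box, sam_fine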

R1:

  1. Sufficient label number for SAM As we use self-generated prompts produced by the finetuned networks, a few labels (e.g., 10) are enough to finetune SAM; e.g., finetuning SAM (box prompts) on the full TN3K data achieves 73.986% IoU, while 10-shot finetuning achieves 72.509%.
  2. LViT-TW Thanks for pointing out the typo. The model we actually compare against is LViT-T, which outperforms LViT-TW. This can be inferred from our text “Note that we standardize the text prompt … for finetuning LViT-TW”, as LViT-TW does not take text input. The lower performance is caused by: 1. to maintain uniformity in comparison across all datasets, we use a different IoU calculation method, widely used in the polyp segmentation literature, instead of the sklearn-based one in [19] (see the sketch after this list); 2. as the reviewer has noticed, LViT-T was trained differently compared to [19].
  3. nnUNet Thanks for suggesting nnUNet. Our model (Kvasir IoU 89.442%, TN3K 76.367%, Qata-COV19 70.265%) still outperforms nnUNet (Kvasir IoU 87.542%, TN3K 74.648%, Qata-COV19 69.450%).
  4. Variance of results (e.g., TN3K) Our method’s performance is stable, as evidenced by five runs with different data splits on TN3K, resulting in an IoU of 76.37% ± 0.3 and a Dice of 84.33% ± 0.4.
  5. More insights on finetuning GroundingDINO Training on Kvasir with 400 labels suffices, as the model reaches 95% of the full-data performance. About 1000 labels are needed for TN3K, and at least 2000 for QaTa-COV19. In our work, the endoscopy dataset requires the fewest labels, while X-ray needs the most.
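To make the IoU point above concrete, a minimal sketch of two conventions that can yield different numbers on the same predictions follows; which exact variant each paper uses is an assumption here.

    import numpy as np

    def iou_per_image_mean(preds, gts, eps=1e-7):
        # Mean of per-image IoU, the convention common in the polyp literature.
        scores = []
        for p, g in zip(preds, gts):
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            scores.append((inter + eps) / (union + eps))
        return float(np.mean(scores))

    def iou_pixels_pooled(preds, gts, eps=1e-7):
        # IoU over all pixels pooled across the dataset (roughly what
        # sklearn's jaccard_score gives on labels flattened over the whole set).
        p = np.concatenate([np.asarray(x).ravel() for x in preds]) > 0
        g = np.concatenate([np.asarray(x).ravel() for x in gts]) > 0
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        return float((inter + eps) / (union + eps))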

R3:

  1. Improvement We would like to emphasize the differences between our work and other SOTA SAM variants. Our first contribution is ‘the design of automation of prompt generation, reducing reliance on manual input’, as summarized and appreciated by R4. This makes our method easier to use compared to other SAM-based models. Moreover, we achieve a Dice score of 84.43% on TN3K, outperforming the best result among the benchmarks (81.60%) by 2.83%, which is a significant improvement.
  2. Novelty We would like to clarify that our method does not focus on visual-text binding, and we believe that our method is different from existing prompt tuning methods. The novelty of our method lies in the automated prompt generation and curriculum prompting SAM which is appreciated by R1 and R4. To our knowledge, there are not many works like curriculum prompting in the field of prompting visual foundation models.

R4:

  1. Point number We use 8 edge points (one plausible sampling scheme is sketched after this list).

  2. Computation cost Thanks for the valuable question! In terms of parameter count, our models are identical to the original versions of each model. The time consumption primarily occurs during the finetuning process. Our model requires training for 9.5h (Grounding DINO 2h, HRNet 2h, SAM with box prompts 4h, and SAM with point and mask prompts 1.5h) on the TN3K dataset, 2.7h (0.8h, 0.6h, 1h, and 0.3h) on Kvasir, and 21.1h (6h, 6h, 6.5h, and 2.6h) on QaTa-COV19. For comparison, the nnSAM model takes 15.2h, 12.5h, and 20.8h, respectively. In most cases our training time is shorter than nnSAM’s, and our method outperforms nnSAM on all three datasets. This demonstrates that, although our training process is somewhat complicated, the training time is acceptable.
  3. Limitation The finetuning process can be optimized (e.g., replacing SAM with MedSAM) to make our method zero-shot.
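For the edge-point prompts mentioned above, one plausible way to sample 8 points from the boundary of a coarse mask is sketched below (OpenCV contour sampling); this is an illustrative guess, not necessarily the exact sampling rule used in the paper.

    import cv2
    import numpy as np

    def sample_edge_points(coarse_mask: np.ndarray, n_points: int = 8) -> np.ndarray:
        # Take the largest contour of the binary coarse mask and pick
        # n_points roughly evenly spaced (x, y) locations along it.
        mask = (coarse_mask > 0).astype(np.uint8)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        if not contours:
            return np.empty((0, 2), dtype=np.int32)
        contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
        idx = np.linspace(0, len(contour) - 1, n_points, dtype=int)
        return contour[idx]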




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have addressed the reviewers’ concerns.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors have addressed the reviewers’ concerns.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes an interesting concept. Although there were some concerns raised by the reviewers, the rebuttal was successful in addressing many of those.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The paper proposes an interesting concept. Although there were some concerns raised by the reviewers, the rebuttal was successful in addressing many of those.


