Abstract

Deep learning models have been successfully developed for various medical image segmentation tasks. However, individual models are commonly developed using task-specific data along with a substantial amount of annotations, ignoring the internal connections between different tasks. To overcome this limitation, we integrate such multi-task processing into a general computed tomography (CT) image segmentation model trained on large-scale data and capable of performing a wide range of segmentation tasks. The rationale is that different segmentation tasks are often correlated, so their joint learning could improve overall segmentation performance. Specifically, the proposed model is designed with a transformer-based encoder-decoder architecture coupled with automatic pathway (AP) modules. It provides a common image encoding and an automatic task-driven decoding pathway for performing different segmentation tasks via task-specific prompts. As a unified model capable of handling multiple tasks, it not only improves performance on seen tasks but also quickly adapts to new, unseen tasks with a relatively small number of training samples while maintaining reasonable performance. Furthermore, the modular design of automatic pathway routing allows parameter pruning to reduce network size during deployment.
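
To make the prompt-driven design concrete, here is a minimal, hypothetical sketch of encoding a task prompt into a CLIP text embedding that could condition the task-driven decoding pathway. The paper releases no code, so the CLIP checkpoint name, the prompt wording, and the usage below are illustrative assumptions, not the authors' implementation:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # Hypothetical prompt; the actual prompt wording used in the paper may differ.
    prompt = ["segmentation of the liver"]

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

    inputs = tokenizer(prompt, padding=True, return_tensors="pt")
    with torch.no_grad():
        task_embedding = text_encoder(**inputs).pooler_output  # shape: (1, 512)
    # task_embedding would then drive pathway selection in the AP modules.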

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0930_paper.pdf

SharedIt Link: https://rdcu.be/dZxd1

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_49

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0930_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Ouy_Promptbased_MICCAI2024,
        author = {Ouyang, Xi and Gu, Dongdong and Li, Xuejian and Zhou, Wenqi and Chen, Qianqian and Zhan, Yiqiang and Zhou, Xiang and Shi, Feng and Xue, Zhong and Shen, Dinggang},
        title = {{Prompt-based Segmentation Model of Anatomical Structures and Lesions in CT Images}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {522--532}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a joint learning framework that integrates 83 segmentation tasks using a shared model and a dynamic decoder. The framework uses embeddings from the CLIP text encoder, generated by feeding it task-specific prompts, to determine the appropriate decoding path for each task. The effectiveness of the proposed method is validated on 15 different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper compiles a large-scale CT dataset comprising 58,499 annotations across 83 tasks, an endeavor that is notably challenging. Evaluating the proposed method on such a comprehensive dataset allows for more generalizable conclusions regarding its effectiveness and facilitates the development of a robust universal segmentation model.
    2. The proposed AP module achieves a significant performance improvement across all datasets, demonstrating its effectiveness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors do not provide detailed insights into several key aspects: the results of the path selection for each task, the dynamic process of path selection during the training process, the unique advantages of using CLIP embeddings, and the impact of varying medical prompts fed into the CLIP text encoder. Further elaboration on these points would strengthen the paper.
    2. The paper lacks ablation studies.
    3. The paper does not include comparisons with other multi-task joint training techniques.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Refer to Weaknesses.
    2. In the Introduction, the paper notes that multi-head frameworks often require extensive parameters in both the encoder and decoder, with the implication that most parameters are essential for executing a single task. Does this suggest that multi-head frameworks generally employ larger models?
    3. The proposed framework utilizes a two-stage training process, with the first stage comprising 3,000 epochs and the second stage 1,000 epochs. Could you clarify the reason behind this approach and whether it still uses joint training? Additionally, it remains unclear if the competing methods were set to a comparable total of 4,000 training epochs.
    4. The manuscript reports results for only 15 out of the 83 tasks. Please provide the reason.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this paper successfully compiles a large-scale segmentation dataset for evaluation, the main concerns lie in the insufficient experimental validation and the lack of clear conclusions. Given these significant issues, I recommend a ‘Weak Reject’ for this submission.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ efforts in addressing the concerns raised. However, several issues remain unaddressed: (1) Lack of comparison with joint learning methods. I understand that no new experimental results can be introduced during the rebuttal phase, yet such comparisons are crucial to establishing the effectiveness of the proposed method, and the absence of this analysis significantly weakens the paper’s contribution. (2) The authors ignored my concern about why results are reported for only 15 of the 83 tasks. (3) The authors did not respond regarding the training epochs used for the proposed and competing methods. Given that authors are not allowed to make substantial changes to the submitted paper if accepted, ‘Reject’ is suggested.



Review #2

  • Please describe the contribution of the paper

    The authors present a foundational model for segmenting structures in CT volumes, based on the Swin UNETR with automatic pathway (AP) modules added to the decoder. These AP modules use Gumbel-softmax layers to dynamically choose between different pathways in the decoder, based on a CLIP text encoding of the target label. The foundation model is trained on a very large dataset comprising both in-house and public datasets. The resulting model is evaluated both on tasks present in the pretraining and on one new task using few-shot learning; in nearly all cases, the model performs as well as or surpasses the SOTA architectures nnU-Net and VB-Net.
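
    To make this routing mechanism concrete, below is a minimal, hypothetical PyTorch sketch of such an AP block — not the authors' implementation; the number of candidate pathways, the 3D conv pathways, and the 512-dimensional task embedding are assumptions. A linear gate maps the CLIP task embedding to logits over the candidate pathways, and a hard Gumbel-softmax selects one with straight-through gradients:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AutomaticPathway(nn.Module):
            # Toy AP block: choose one of `num_paths` candidate decoder pathways
            # from a task embedding via a Gumbel-softmax gate.
            def __init__(self, channels: int, task_dim: int = 512, num_paths: int = 3):
                super().__init__()
                self.paths = nn.ModuleList(
                    nn.Conv3d(channels, channels, kernel_size=3, padding=1)
                    for _ in range(num_paths)
                )
                self.gate = nn.Linear(task_dim, num_paths)  # logits over pathways

            def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
                logits = self.gate(task_emb)                      # (B, K)
                # hard=True gives a one-hot path choice with straight-through gradients
                g = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, K)
                outs = torch.stack([p(x) for p in self.paths], dim=1)  # (B, K, C, D, H, W)
                return (g[:, :, None, None, None, None] * outs).sum(dim=1)

    A hard one-hot gate of this kind is also what would make the pruning mentioned in the abstract possible: pathways never selected for the deployed tasks can simply be dropped after training.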

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very well written, all choices regarding the architecture, training process and evaluation are well motivated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The central idea of using CLIP encodings to steer the training and operation of a multi-task model is very reminiscent of the “CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection” (Liu et al., 2023), which currently ranks first on the open leaderboard of the MSD challenge. It would therefore be important to differentiate the present approach from that of Liu et al.; a direct comparison of the results would also be interesting.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Since the dataset is largely composed of in-house data, it is not possible for people outside of the authors’ organization to repeat their experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While the standard deviations given in the supplementary material allow most differences between models to be assessed, pairwise statistical tests would have been helpful to verify which differences between architectures are statistically significant.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach at hand is valuable, the paper is well written, and the results are clearly communicated. However, the authors need to provide a statement regarding the unexplained similarities to an existing successful approach.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my concerns in their rebuttal. A final version with revisions according to their statements in the rebuttal would be a very valuable publication for MICCAI.



Review #3

  • Please describe the contribution of the paper

    This work focuses on SAM-like prompt-based segmentation models for organs and tumors in CT scans. The authors trained a model on a large-scale dataset covering a wide range of segmentation tasks, using a transformer-based encoder-decoder architecture with automatic pathway modules. The design also provides a parameter-pruning capability to reduce model complexity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper collects and trains on one of the largest CT datasets to date, covering a comprehensive set of anatomical targets.
    2. The training and validation scales are large and adequate.
    3. The performance improvements are impressive and significant.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the introduction, the paper states that a challenge of SAM is its reliance on location clues (points, boxes, and masks), which limits full automation of the segmentation process. This is confusing: aren’t point clicks, boxes, or masks helpful local prompts that assist segmentation, and why do they prevent automation? The authors should highlight the challenges of previous works and clarify the innovation of the proposed method. I would be interested to see the CLIP model used as the encoder for prompt information; the authors could add more discussion and description of the use of the CLIP model and its effectiveness. I believe Figure 2 has already been used in the authors’ other papers; the authors should consider modifying it to make a difference. The automatic pathway routing part needs more clarification in Section 2.2. Ablation studies on CLIP and AP would be very helpful to justify the technical novelties.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors could add a description of whether the code or weights will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. As this is the conference version of the work, the authors could expand the technical description and clarify the novelty, so as to differentiate it from a future journal version.
    2. Some parts of the discussion can be improved, such as the introduction, background, and challenge description.
    3. Figures: do not reuse the same figure across multiple submissions.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The scale of the study makes a remarkable contribution to prompt-based segmentation of CT studies. Training on more than 30,000 annotated CT scans is a big step for the community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    Though the paper is not perfect, the volume and impact of the work are sufficient for the MICCAI society.




Author Feedback

Q1: The paper stated that the challenge of SAM is its reliance on location clues (points, boxes, and masks), which limits full automation of the segmentation process. This is confusing: aren’t point clicks, boxes, or masks helpful local prompts that assist segmentation, and why do they prevent automation? (R1)
A1: In CAD systems, we aim for fully automated processes. Achieving fully automated segmentation with SAM or MedSAM would require developing a corresponding keypoint detection or bounding box localization algorithm. In contrast, our method completes the segmentation process simply from a descriptive prompt of the segmentation target.

Q2: I would be interested to see the CLIP model used as the encoder for prompt information. (R1, R2)
A2: This is a great idea! Once this paper is accepted, we will add comparisons of results when using different prompt descriptions (like “lower limbs arteries” or “a picture of artery vessels of lower limbs”).

Q3: Figure 2: the authors should consider modifying it to make a difference. (R1)
A3: Thanks for the insightful suggestion. Figure 2 is an illustration of our training dataset; we will revise it to better present the dataset.

Q4: The automatic pathway routing part needs more clarification. Some parts of the discussion can be improved, such as the introduction, background, and challenge description. (R1)
A4: Thanks for the comment. We will enhance these sections by explaining the routing mechanism in more detail and by providing a clearer overview of the problem we address, the significance of our work, and the main contributions.

Q5: The authors do not provide detailed insights into several key aspects: the path selected for each task, the dynamic process of path selection during training, the unique advantages of using CLIP embeddings, and the impact of varying medical prompts fed into the CLIP text encoder. (R2)
A5: Thanks for the comment. We will provide the results and the dynamic process of path selection for some tasks once this paper is accepted. The use of CLIP embeddings offers significant advantages for multi-task learning. In traditional multi-task models, each task is given a fixed encoding or output position, losing the correlations between tasks — for example, the relationships between various vascular tasks, or between organs and their corresponding lesions. These correlations can be effectively captured and expressed through CLIP’s text embeddings. We will also add comparisons of results when using different prompt descriptions.

Q6: The paper lacks ablation studies. (R1, R2)
A6: Thank you for the comment. We have shown the performance of the “Sole-path” model in Table 2. This model shares the same structure as our proposed method but without the CLIP embedding input and AP modules, and it performs worse than the final framework.

Q7: The paper does not include comparisons with other multi-task joint training techniques. The central idea of using CLIP encodings to steer the training and operation of a multi-task model is very reminiscent of the “CLIP-Driven Universal Model” (Liu et al., 2023). (R2, R3)
A7: Thank you for the comment. It is indeed important to compare our method with publicly available approaches, and we have thoroughly reviewed the “CLIP-Driven Universal Model” paper. Since the MSD challenge is already closed, it is difficult to compare our model’s performance on the MSD test set. If our paper is accepted, we will consider evaluating their model on our test set or on some public datasets, given that they have open-sourced their code and model.

Q8: Pairwise statistical tests would have been helpful to verify which differences between architectures are statistically significant. (R3)
A8: Thanks! We will add pairwise statistical tests once this paper is accepted.
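
As a quick, hypothetical illustration of the claim in A5 that CLIP text embeddings capture correlations between tasks (the prompts below are made up; only the general behavior — related anatomy scoring higher — is the point), one can compare cosine similarities between prompt embeddings:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # Hypothetical prompts: two related vascular tasks and one unrelated task.
    prompts = ["arteries of the lower limbs",
               "veins of the lower limbs",
               "lobes of the left lung"]

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

    with torch.no_grad():
        emb = text_encoder(**tokenizer(prompts, padding=True, return_tensors="pt")).pooler_output
    emb = torch.nn.functional.normalize(emb, dim=-1)
    print(emb @ emb.T)  # the two vascular prompts should be more similar to each other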




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors utilized a comprehensive annotated dataset for multi-organ segmentation and showed that their proposed method surpasses most state-of-the-art segmentation methods. While it is important to conduct a proper statistical test to determine the significance of the performance improvement, the paper has received two acceptance recommendations from two reviewers and is therefore recommended for acceptance at MICCAI.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed method was validated on 58,499 annotations across 83 tasks. Most of them are comparable or superior to state-of-art methods. Although some experiments were not performed or reported in this conference paper, it is sufficient of interest to the MICCAI community. The concerns raised by the reviewer 2 should be addressed in the future journal paper.



