Abstract

The delineation of the Clinical Target Volume (CTV) is a crucial step in the radiotherapy (RT) planning process for patients with nasopharyngeal carcinoma (NPC). However, manual delineation is labor-intensive, and automatic CTV contouring for NPC is difficult due to the nasopharyngeal complexity, tumor variability, and judgement-based criteria. To address the above-mentioned problems, we introduce SAM-RT, the first large vision model (LVM) designed for CTV contouring in NPC. Given the anatomical dependency required for CTV contouring—which encapsulates the Gross Tumor Volume (GTV) while minimizing exposure to Organs-at-Risk (OAR)—our approach begins with the fine-tuning of the Segment Anything Model (SAM), using a Low-Rank Adaptation (LoRA) strategy for segmenting GTV and OAR across multi-center and multi-modality datasets. This step ensures SAM-RT initially integrates with anatomical prior knowledge for CTV contouring. To optimize the use of previously acquired knowledge, we introduce Sequential LoRA (SeqLoRA) to improve knowledge retention in SAM-RT during the fine-tuning for CTV contouring. We further introduce the Prompt-Visual Cross Merging Attention (ProViCMA) for enhanced image and prompt interaction, and the Gate-Regulated Prompt Adjustment (GaRPA) strategy, utilizing learnable gates to direct prompts for effective CTV task adaptation. Efficient utilization of knowledge across relevant datasets is essential due to sparse labeling of medical images for specific tasks. To achieve this, SAM-RT is trained using an information-querying approach. SAM-RT incorporates various prior knowledge: 1) Reliance of CTV on GTV and OAR, and 2) Eliciting expert knowledge in CTV contouring. Extensive quantitative and qualitative experiments validate our designs.
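The abstract relies on Low-Rank Adaptation (LoRA) without restating it. As a reminder, a minimal NumPy sketch of a standard LoRA linear layer is given here: the pretrained weight stays frozen and only two small matrices are trained. The class name `LoRALinear`, the rank, the scale `alpha`, and the initialization are illustrative assumptions, not values from the paper, and the sequential variant (SeqLoRA) is not reproduced.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # pretrained weight, kept frozen
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                # trainable up-projection, zero at start
        self.scale = alpha / rank                         # assumed scaling convention

    def __call__(self, x):
        # Frozen path plus low-rank path: x (W + scale * B A)^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `rank * (in_dim + out_dim)` parameters are trained.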

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2115_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2115_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Kho_Unified_MICCAI2024,
        author = { Khor, Hee Guan and Yang, Xin and Sun, Yihua and Wang, Jie and Huang, Sijuan and Wang, Shaobin and Lu, Bai and Ma, Longfei and Liao, Hongen},
        title = { { Unified Prompt-Visual Interactive Segmentation of Clinical Target Volume in CT for Nasopharyngeal Carcinoma with Prior Anatomical Information } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors adopt the Segment Anything (SAM) model to segment clinical target volumes (CTVs) in CT imaging of Nasopharynx cancer patients. To provide the model with prior knowledge about tumor and surrounding tissue, the model is pretrained to segment gross tumor volumes (GTVs) and organs at risk (OAR). Two techniques to improve the utilizability of input prompts, i.e. landmarks, bounding boxes and segmentation masks, and a sequential Low-Rank Adaptation (LoRA) are developed. Authors investigate the ability of those techniques in reference to the standard SAM model and compare their model to other state-of-the-art (SOTA) models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The task is a relevant one, with fast segmentation of target volumes needed for further advances in the development of adaptive radiation therapy.
    2. The authors explore new techniques to improve the current SAM architecture, with a special focus on medical imaging.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Overall, the paper is hard to follow: a) there is no proper comparison to other works and no explanation of how the techniques developed here differ from existing methods; b) the developed techniques are not well motivated, and notation is used without any introduction.
    2. Prior knowledge: Based on the results in Table 1, the authors claim that the utilization of prior knowledge, i.e. segmentation of GTV and OAR, enhances the CTV contouring performance. However, this comparison is invalid. It is not possible to directly compare the performance of models trained on different datasets, especially for the small dataset sizes at hand. Differences in the middle row of Table 1 (SAM-RT) are very likely due only to the different datasets used for training, not to the training strategy.
    3. Comparison to SOTA models: The authors do not provide any information about the training strategy used for the SOTA models in section 3.3. Are those models trained in the same fashion as the model at hand, or are they used in a zero-shot manner without any adaptation to the data? From the information provided, it seems that those comparisons are not fair.
    4. The model follows a 2D structure, while CT segmentation should be performed in 3D. How are the 2D slices combined to form a 3D prediction during inference? Are the results of the model given on a slice or a patient basis?
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Test data is partially public, but code is not provided and methods are hard to understand.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper needs revision to make the authors’ contributions clearer and the developed techniques easier to understand. Comparisons between different model architectures and SOTA methods have to be provided in a fair way.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. paper is hard to follow
    2. novel contributions of authors are not fully clear
    3. comparisons/ablations are not fair
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a novel and intricate framework that extends the Segment Anything Model (SAM) to enhance clinical target volume (CTV) segmentation. The key innovation lies in leveraging anatomical knowledge from related tasks, such as gross tumor volume (GTV) segmentation and organs-at-risk (OAR) segmentation, to improve CTV segmentation. The authors utilize both publicly available and private datasets for training and evaluation.

    The main contributions include:

    1. The introduction of a sequential low-rank adaptation (SeqLoRA) method, adapted from the wider machine learning community, to enhance knowledge retention during fine-tuning on the target task.
    2. The development of a prompt-visual cross merging attention module (ProViCMA) coupled with gate-regulated prompt adjustment (GaRPA) to enhance the interaction between images and prompts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors introduce a compelling application of the SAM 2D segmentation model, incorporating innovations from the wider machine learning community. The SeqLoRA mechanism is proposed to enhance knowledge retention during fine-tuning, leveraging the clinical relevance of nearby GTV and OAR locations for CTV segmentation. This is achieved by introducing a moderate number of additional parameters to existing layers, which remain frozen during optimization. The second contribution includes GaRPA and ProViCMA blocks, which enhance network performance by improving interaction between prompts and image features. Despite the complexity of the overall architecture, these blocks are relatively simple and intuitively understood. Statistical testing using the Wilcoxon signed-rank test and relevant segmentation metrics, including both overlap-based and spatial distance-based metrics, are employed for performance comparison. The authors conduct several ablation studies to assess the contribution of each training dataset, the proposed blocks, and hyperparameters. Additionally, the method effectively combines knowledge from multiple datasets containing slightly different modalities, such as CT and contrast-enhanced CT versus CT and MR, demonstrating performance improvements in the end-task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper’s language is generally good, but the excessive use of abbreviations makes it difficult to follow. In subsection “Unified Prompt-Visual Encoder” on page 5, the introduction of various prompts lacks clarity. Specifically, the purpose of the prompt Q_n and how it is obtained are not clearly explained. Additionally, the method for collecting other prompts during training is unclear. Furthermore, there is no description of how the model is used during inference. The authors opt for a 2D Segment Anything Model, but given the 3D segmentation task, a 3D model trained with similar additional modules might be more suitable. The authors should explain why they did not build on top of the 3D models, such as SAM-Med3D. Results for the SAM-RT+LoRA configuration are missing, which would provide a quantitative evaluation of the proposed SeqLoRA block. Including these results would enhance the paper’s completeness and provide valuable insights into the effectiveness of the SeqLoRA block.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The absence of publicly available code is a notable limitation, particularly given the complexity of the architecture comprising multiple modules with numerous parameters. While the authors report some hyperparameters used for training, reproducing the paper’s results may still be challenging without access to the code.

    To ensure reproducibility, the authors are strongly encouraged to share their code. Given the page limit constraints, providing detailed textual descriptions alone may not be sufficient to enable reproducibility. Code sharing would facilitate transparency, allow researchers to understand the implementation details, and replicate the experiments effectively.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should clarify whether they used mean average surface distance (MASD) or average symmetric surface distance (ASSD) as their metric for averaged surface distance (ASD). Mentioning the package used for metrics calculation would improve reproducibility. Regarding empty segmentations when calculating the Hausdorff distance (HD) and ASSD, the authors should explain how they handled these cases and what value was attributed to them. It would be beneficial for the authors to specify whether they used any corrections, such as Bonferroni correction, when performing statistical tests to compare performance. Not all performance comparisons in Tables 1 and 2 are statistically significant. The authors should comment on this. The authors report results for several baseline and SOTA methods but provide limited information on how these methods were trained and what data was used. It would be helpful to know if attempts were made to use the same publicly available datasets for pretraining these methods, as this would provide more objective comparisons. Regarding the prompt denoted with Q, the authors should explain its significance and provide more information on its usage during inference. Additionally, clarification on the usage of prompts during inference would enhance understanding of the methodology.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall idea presented in this paper is clinically well-motivated and demonstrates novelty to a certain extent. The proposed modules are thoroughly evaluated, and statistical tests indicate that the performance of the proposed method is significantly better than competing methods. The reviewer finds the strengths of this paper evident, and there appear to be no major flaws in the experiment design.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents an interactive segmentation model based on the Segment Anything Model (SAM) focusing on CTV contouring of nasopharyngeal carcinoma. The authors fine-tune SAM in two steps: 1) on GTVs and OARs using datasets from different centers and imaging modalities; and 2) on CTVs in a second fine-tuning step. They propose 2 novel components for the second fine-tuning: (1) a Sequential LoRA to retain the learned features from training on GTVs and OARs in the first stage; (2) a novel prompt-image fusion based on attention (ProViCMA) and gating (GaRPA). The authors conduct extensive experiments and ablation studies to justify their design decisions and compare their method to 10 other segmentation approaches, demonstrating promising results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Excellent evaluation: The authors’ experiments are very thorough. In Tab. 1. they explore the impact of fine-tuning on OARs and GTVs on the performance, and how SAM improves with each of the authors’ proposed components. Tables 1-4 in the supplementary also contain multiple ablations regarding the SAM backbone, LoRA projection layers, LoRA rank size, and which combination of prompts works best with their method. They also compare to 10 other methods from previous work. All experiments are also examined with a Wilcoxon signed-rank test which gives the results more credibility and interpretability.

    (2) Well-supported design decisions: The authors propose multiple components to improve SAM’s fine-tuning, e.g., (a) SeqLoRA, (b) GaRPA, and (c) ProViCMA. The authors show that (a), (b), and (c) are all important by alternately excluding them from their model (seen in Tab. 1 and Fig. 2). This gives me a very good impression that all components as well as their combination into one pipeline makes sense.

    (3) Clear and relevant motivation: Incorporating expert knowledge in the domain of CTV contouring is an important topic as the models must be very robust to not only delineate the CTVs but also minimize the radiation damage to potential organs-at-risk. The authors’ interactive model gives experts the control to correct predictions using clicks which is a strong advantage compared to automatic segmentation methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Too many hyperparameters: While the authors’ method makes sense and achieves good results, it also seems to contain multiple hyperparameters that need to be manually set (LoRA rank size, number of clicks in Q_n, and which ViT backbone to use). While the authors ablate this extensively on the task of CTV contouring and suggest an optimal parameter setting, the optimal hyperparameters might differ on another task, limiting the generalizability of their approach.

    (2) Missing click simulation information: The authors do not describe how they simulate the clicks in Q_n. This is an important detail to both understand and reproduce their work as the simulation of clicks can be done in multiple ways (center of largest error, random sampling, etc.). I advise the authors to add a small description regarding the click simulation in their camera-ready manuscript.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The overall reproducibility of the paper is good as it describes the method in great detail. My only criticism is that the authors do not describe how they simulate clicks to prompt their model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Here are more detailed comments in order to improve the paper:

    (1) The authors should describe the click simulation procedure in more detail or reference an existing simulation scheme if they have adopted one from previous work. This is an important detail that readers require to reproduce the authors’ method. This should include how a click is generated in the first iteration as well as how subsequent clicks are used to refine the initial prediction of the model (e.g. center of largest error component). It is also not clear if the authors simulate both foreground and background clicks. If so, Q_10 would actually correspond to 20 clicks.

    (2) It took me a few reads to follow all the equation terms in Section 2, e.g., F^C, T_C, D_C, etc. It is all correct and makes sense after a few reads (natural images -> GTV + OAR -> CTV) but I would add a small notation table (perhaps in the supplementary since there is no space) explaining the notation so readers do not get lost. This would improve the reading flow and readers can always look up notation terms if something is not clear, instead of going back in the text to when the term was introduced.

    (3) I am interested in why the authors did not simply use the most confident output mask of SAM (last output) but opted for an argmax over all its predicted masks. Did they see any improvement in comparison when doing this or how do they justify this design decision? I would add a small justification of why the authors decided to deviate from the norm in the related literature.

    Minor comments that did not influence my decision but would be good to fix:

    • Typo on page 2: for CTV contouring task -> for the CTV contouring task
    • Typo on page 2: The learning of SAM-RT encompass -> The learning of SAM-RT encompasses
    • Typo on page 2: We aim to enhance decision function -> We aim to enhance a decision function
    • Typo on page 4: with limited dataset -> with limited datasets
    • Typo on page 4: We use bilinear upsample -> We use bilinear upsampling
    • Typo on page 5: denote the mask -> denotes the mask
    • Typo on page 6: then multiply with -> then multiplied with
    • Typo in Caption of Fig. 2: with red indicates -> where red indicates
    • Typo on page 7: treat -> treats
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper tackles a relevant problem and presents its methodology and results in a very structured and thorough way. There are some missing methodology details, such as the click simulation, that can be added easily to the camera-ready version, and the authors’ method requires multiple hyperparameters that might not be the same for other tasks. Nevertheless, I believe that the presented work is novel and technically sound. Hence, I would opt for an “accept” as it is a valuable contribution to the MICCAI community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all reviewers (R3, R4, R5) for their valuable feedback.

  • (R4, R5) Prompt Usage: For Point Prompt Q_n, n positive points around the CTV boundary are identified using K-Medoids sampling in each axial plane with a ground-truth (GT) mask of the CTV. These n foreground points are generated without additional clicks. The Bounding Box Prompt duplicates the largest CTV mask’s bounding box across all axial planes with a CTV mask, while the Mask Prompt encircles five centroids of CTV areas with a 20-pixel radius, marking inside pixels as 1 and outside as 0. These prompts, along with CT images, are used by SAM-RT to generate predictions based on learned parameters.

  • (R3, R4) 2D SAM: While 3D CT segmentation ensures spatial continuity, SAM-RT is limited to a 2D structure due to 2D SAM’s pretraining on the SA-1B dataset of over 1 billion natural images. This gives 2D SAM an advantage over models like SAM-Med2D and SAM-Med3D, which used smaller datasets. We applied SAM-RT’s 2D capability to each CT slice independently, then stacked the results to form a 3D volume. Metrics like Dice score (DSC), average surface distance (ASD), and Hausdorff distance (HD) were calculated per patient to evaluate 3D segmentation.

  • (R3, R4) SOTAs’ Training: Traditional models (nnUNet, UNETR, UNetGTV, DDNN, SI-Net) were trained from scratch on 121 clinical cases with consistent preprocessing and uniform hyperparameters. SAM-based models (SAM, SAMed, SAM-Med2D, SAM-Med3D) were fine-tuned using LoRA with pre-trained weights. These strategies ensured fair comparison with our SAM-RT model. Performance differences in Table 2 are due to architectural differences and SAM-RT’s techniques, not data adaptation or training setup.

  • (R3, R4, R5) Paper Structure: Our SAM-RT framework enhances CTV delineation for nasopharyngeal carcinoma by leveraging anatomical knowledge from diverse datasets. Unique to radiotherapy, it employs SAM with techniques like LoRA for fine-tuning, ProViCMA for improved CT and prompt interaction, GaRPA for focused prompts, and SeqLoRA for updating weights while preserving anatomical knowledge. We will include a notation table in the supplementary material and address concerns about hyperparameters by providing guidelines. Abbreviations have been defined at their first occurrence.

  • (R3, R4, R5) Experiment Analysis: (R3) Our paper enhances SAM-RT for CTV contouring by integrating GTV and OAR prior knowledge from multi-center, multi-modality datasets. Using SeqLoRA for fine-tuning, ProViCMA, and GaRPA for dense prompt interaction and task adaptation, SAM-RT leverages CTV dependency on GTV and OAR, guided by expert insights for accurate delineation. Utilizing clinical (121 cases) and public datasets (SegRap2023: 120 cases, HaN-Seg: 42 cases), we plan to expand our dataset for robust comparisons. Table 1 shows the benefits of leveraging prior knowledge, and future experiments with expanded datasets and baseline methods are planned.

(R4) SAM-RT* and SAM-RT in Table 1 represent the SAM-RT+LoRA configuration, which we will clarify in the manuscript. We used ASSD via MONAI for consistent metric computation. If both segmentations are empty, HD and MASD are 0. If one is empty, HD and MASD are set to a large value or positive infinity, indicating unbounded distance. We did not use Bonferroni correction and acknowledge that some comparisons in Tables 1 and 2 were not statistically significant due to inconsistencies in data acquisition and variations in CTV delineation by different oncologists.

(R5) To enhance CTV contouring accuracy, we used an argmax over all predicted masks instead of SAM-RT’s last output. This approach helps handle noisy, incomplete, or ambiguous masks by selecting the one with the highest probability, resulting in cleaner segmentation.
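The prompt construction described under Prompt Usage can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the K-Medoids routine is a small alternating variant written here to avoid external dependencies, "boundary" is taken to mean foreground pixels with a background 4-neighbour, and the centroids for the mask prompt are passed in because the rebuttal does not say how they are chosen.

```python
import numpy as np


def boundary_pixels(mask):
    """(row, col) coordinates of foreground pixels that touch background (4-neighbourhood)."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return np.argwhere(m & ~interior)


def k_medoids(points, k, n_iter=20, seed=0):
    """Small alternating k-medoids; returns k of the input points (requires k <= len(points))."""
    rng = np.random.default_rng(seed)
    pts = points.astype(float)
    dist = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
    medoids = rng.choice(len(pts), size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)          # assign each point to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:                              # member minimizing total in-cluster distance
                new[j] = members[dist[np.ix_(members, members)].sum(axis=0).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return points[medoids]


def largest_slice_bbox(volume):
    """Bounding box (r0, c0, r1, c1) of the axial slice with the largest mask area."""
    areas = volume.reshape(volume.shape[0], -1).sum(axis=1)
    rows, cols = np.nonzero(volume[areas.argmax()])
    return rows.min(), cols.min(), rows.max(), cols.max()


def disc_mask_prompt(shape, centres, radius=20):
    """Binary mask that is 1 inside a disc of the given radius around each centre, else 0."""
    rr, cc = np.ogrid[:shape[0], :shape[1]]
    out = np.zeros(shape, dtype=np.uint8)
    for r, c in centres:
        out[(rr - r) ** 2 + (cc - c) ** 2 <= radius ** 2] = 1
    return out
```

For a ground-truth volume `gt` of shape (slices, height, width), `k_medoids(boundary_pixels(gt[z]), n)` would give the n point prompts for slice `z`, and `largest_slice_bbox(gt)` the box that is reused on every slice containing CTV.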
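The slice-wise inference described under 2D SAM (predict each axial slice independently, stack the results, score per patient) might look like the sketch below. `predict_slice` is a hypothetical stand-in for the 2D model, and returning a Dice of 1.0 when both volumes are empty is an assumption made here, not something stated in the rebuttal.

```python
import numpy as np


def predict_volume(ct_volume, predict_slice):
    """Run a 2D predictor on every axial slice and stack the masks into a 3D volume."""
    return np.stack([predict_slice(s) for s in ct_volume], axis=0)


def dice(pred, gt):
    """Dice score over the whole 3D volume, i.e. one value per patient."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty volumes agree perfectly
    return 2.0 * (pred & gt).sum() / denom
```

Computing the metric on the stacked volume rather than averaging per-slice scores keeps slices without any CTV from dominating the result.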
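The empty-segmentation convention stated in the (R4) response (distance 0 when both masks are empty, an unbounded value when only one is) can be illustrated with a brute-force Hausdorff distance. This is not the MONAI implementation the authors mention; the function name is hypothetical, distances are in pixel units, and for brevity they are taken over all foreground pixels rather than extracted surfaces.

```python
import numpy as np


def hausdorff_with_empty_rule(pred, gt):
    """Symmetric Hausdorff distance with explicit handling of empty masks."""
    a, b = np.argwhere(pred), np.argwhere(gt)
    if len(a) == 0 and len(b) == 0:
        return 0.0                      # both empty: perfect agreement
    if len(a) == 0 or len(b) == 0:
        return float("inf")             # exactly one empty: unbounded distance
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Infinite values need care when averaging over patients; substituting a finite penalty such as the image diagonal is one common alternative, which may be what "a large value" refers to.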
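The (R5) response does not say whether the argmax is taken per pixel across mask channels or over whole masks. The sketch below shows the per-pixel reading next to the "last output only" alternative the reviewer asked about; the function names, the (K, H, W) layout, and the zero threshold are assumptions for illustration.

```python
import numpy as np


def argmax_over_masks(mask_logits):
    """mask_logits: (K, H, W). Per pixel, return the index of the highest-scoring channel."""
    return mask_logits.argmax(axis=0)


def last_mask_only(mask_logits, threshold=0.0):
    """Alternative: binarize only the final predicted mask."""
    return (mask_logits[-1] > threshold).astype(np.uint8)
```

Under this reading, a pixel is labelled by whichever channel scores it highest, so a weak positive response in the last mask can be overridden by a stronger response elsewhere.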




Meta-Review

Meta-review not available, early accepted paper.


