Abstract

Assessment of the glomerular basement membrane (GBM) in transmission electron microscopy (TEM) is crucial for diagnosing chronic kidney disease (CKD). The lack of domain-independent automatic segmentation tools for the GBM necessitates an AI-based solution to automate the process. In this study, we introduce GBMSeg, a training-free framework designed to automatically segment the GBM in TEM images guided only by a one-shot annotated reference. Specifically, GBMSeg first exploits the robust feature-matching capabilities of a pretrained foundation model to generate initial prompt points, then introduces a series of novel automatic prompt engineering techniques across the feature and physical spaces to optimize the prompting scheme. Finally, GBMSeg employs a class-agnostic foundation segmentation model with the generated prompting scheme to obtain accurate segmentation results. Experimental results on our collected 2538 TEM images confirm that GBMSeg achieves superior segmentation performance, with a Dice similarity coefficient (DSC) of 87.27%, using only one labeled reference image in a training-free manner, outperforming recently proposed one-shot and few-shot methods. In summary, GBMSeg introduces a distinctive automatic prompting framework that delivers robust domain-independent segmentation performance without training, particularly advancing the automatic prompting of foundation segmentation models for medical images. Future work involves automating the thickness measurement of the segmented GBM and quantifying pathological indicators, holding significant potential for advancing pathology assessment in clinical applications. The source code is available at https://github.com/SnowRain510/GBMSeg.
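As a rough illustration of the pipeline the abstract describes, the sketch below pairs DINOv2 patch features with SAM point prompts. It is a minimal sketch under stated assumptions: it uses the public torch-hub DINOv2 and segment-anything packages; names such as patch_features and match_prompts are hypothetical; a plain non-overlapping 14-pixel patch grid stands in for the paper's overlapping 16 × 16 sliding windows; and the paper's prompt-refinement steps are omitted.

```python
# Minimal sketch, NOT the authors' code: one-shot prompt generation from a
# labeled reference image, followed by SAM point-prompt segmentation.
import torch
import torch.nn.functional as F

def patch_features(dino, image):
    """L2-normalised DINOv2 patch embeddings for a (1, 3, H, W) image
    whose sides are multiples of the 14-pixel patch size."""
    with torch.no_grad():
        feats = dino.forward_features(image)["x_norm_patchtokens"][0]
    return F.normalize(feats, dim=-1)              # (N_patches, C)

def match_prompts(ref_feats, ref_patch_labels, tgt_feats, grid_w, patch=14):
    """For each reference patch, find its most similar target patch and emit
    that patch centre as a point prompt (label 1 = GBM, 0 = background)."""
    sim = ref_feats @ tgt_feats.T                  # cosine similarity matrix
    nearest = sim.argmax(dim=1)                    # best target patch per ref patch
    points, labels = [], []
    for r, t in enumerate(nearest.tolist()):
        cx = (t % grid_w) * patch + patch // 2     # patch-centre x (pixels)
        cy = (t // grid_w) * patch + patch // 2    # patch-centre y (pixels)
        points.append([cx, cy])
        labels.append(int(ref_patch_labels[r]))    # positive vs. negative prompt
    return points, labels

# Usage outline (checkpoint paths and image loading omitted):
# dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
# from segment_anything import sam_model_registry, SamPredictor
# predictor = SamPredictor(sam_model_registry["vit_l"](checkpoint="sam_vit_l.pth"))
# predictor.set_image(target_rgb)
# masks, _, _ = predictor.predict(point_coords=np.asarray(points),
#                                 point_labels=np.asarray(labels))
```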

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1841_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/SnowRain510/GBMSeg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Liu_Featureprompting_MICCAI2024,
        author = { Liu, Xueyu and Shi, Guangze and Wang, Rui and Lai, Yexin and Zhang, Jianan and Sun, Lele and Yang, Quan and Wu, Yongfei and Li, Ming and Han, Weixia and Zheng, Wen},
        title = { { Feature-prompting GBMSeg: One-Shot Reference Guided Training-Free Prompt Engineering for Glomerular Basement Membrane Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a training-free model that uses a one-shot reference image to precisely segment GBM in TEM images. The approach relies on prompt engineering, with the authors designing several ways to prompt the model. Although the method is effective, the design lacks significant methodological novelty.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The utilization of a single image for GBM segmentation presents an interesting approach, potentially reducing annotation costs.
    2. The simplicity and effectiveness of the presented prompting method are commendable, providing an easy-to-implement solution for one-shot GBM segmentation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper predominantly focuses on ad-hoc engineering tricks for prompting rather than introducing a new methodology, which undermines its contribution and generalisability.
    2. The paper lacks clarity on how hyperparameters such as Dsp and Dex are optimized. Additionally, the robustness of the proposed automatic prompting system to variations in these hyperparameters remains unclear, raising concerns about the presented comparative evaluation.
    3. The results are likely to be highly sensitive to the choice of reference image, yet the analysis presented uses only one reference image. Evaluating the method solely on this single example is insufficient. It would be better to assess the method’s performance across varied reference images, observing how the scores vary, including the standard deviation (which I anticipate to be high) and cases where the prompting fails completely.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors didn’t commit to releasing the dataset, which is necessary for reproducing their results (including both the task data and the chosen reference image). Additionally, they didn’t disclose the hyperparameters Dsp and Dex used in their experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Discuss how the reference image is chosen for GBMSeg. What criteria are used for its selection, and how might performance change if a different image were used? Additionally, there’s no information on the variance of results when using this single reference image.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The innovation is limited by ad-hoc engineering approaches and the addition of various hyperparameters designed to manipulate prompt sampling. Moreover, the analysis lacks results on variance and fails to consider how different reference images might affect performance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I thank the authors for the response and have changed my score, but the paper still lacks methodological novelty and introduces additional hyperparameters that need to be tuned specifically for the experiments.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors designed a training-free segmentation framework named GBMSeg. The paper achieves good results on the segmentation task using only one annotated image in combination with existing released foundation models (such as DINOv2 and SAM), and designs a feature-based approach to automatic prompting that provides reliable segmentation hints for SAM.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposed an interesting auto-prompting strategy for the glomerular basement membrane (GBM) segmentation task. The scheme of pairing an annotated reference image with the target image is relatively ingenious, and this retrieval-like scheme can provide more reliable information than traditional one-shot learning models. The auto-prompt engineering uses only one labeled image and combines multiple simple but effective positive and negative sample allocation strategies to give reliable point-based prompts to SAM.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of clarity in the description of the methodology. The authors state “we divide both into 16 × 16 patches pt and pr using sliding windows with overlap,” but there is no mention of the form and extent of the overlap.
    2. The authors propose in Section 2.2 to assign hard samples as negative samples. Although the experiments show an improvement from this assignment, no reasonable analysis or explanation is given. The authors should carefully analyze what drives this improvement.
    3. Insufficient experimentation. The authors describe the difficulty of the GBM segmentation task in the INTRO section, but do not report the segmentation performance of fully supervised methods (such as UNet or DeepLab) in the experimental results. Moreover, in ref-17 cited by the authors, the segmentation performance reaches up to 97% Dice, which suggests that the task can be accomplished well by supervised methods.
    4. In Table 1, why does ViT-H not perform as well as ViT-L? This is counter-intuitive, and I hope the authors can explain it.
    5. During the testing phase, the criteria for selecting the reference image are not mentioned in the paper. Is the reported performance obtained with one specific image, or is it an average over several different reference images?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors define this task as a one-shot segmentation task, but the INTRO section does not mention any prior one-shot work. We suggest adding related work to describe the differences from, and improvements over, previous work.
    2. The use of automatic feature-based prompting is interesting. Could the authors explore the differences between different feature extractors, and whether a fully supervised level of segmentation can be achieved after fine-tuning on a small number of images?
    3. In engineering the auto-prompt, the authors suggest that too many or too few prompt points can affect segmentation performance. I hope they can give their own understanding of this and prove it experimentally, or cite related work.
    4. Provide an additional experimental basis. The paper lacks the segmentation performance of a fully supervised scheme. Although medical segmentation annotation is painful, it would actually be more helpful to the clinic if a small amount of annotation led to good generalization, rather than abandoning annotation altogether because of its difficulty and pursuing a zero-shot/few-shot scheme to solve the clinical problem.
    5. Figure 3 lacks the necessary analytical explanation. After introducing the sparse sampling strategy, there appear to be significantly fewer correct negative prompt points, yet the segmentation of negative regions becomes significantly better, which is a bit counter-intuitive. I hope the authors can give a reasonable analysis and explanation of these phenomena.
    6. Why should hard samples be used as negative samples? Is it just that the experimental results prove this is better? The authors should dig further into the rationale behind it rather than focusing only on the experimental phenomenon.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper designed an interesting auto-prompting scheme to provide reliable prompt points for SAM. The authors explore a variety of allocation schemes for positive and negative prompt points, which are experimentally proven to be effective in improving segmentation performance. However, the experimental results in the paper are inadequate. Necessary comparative experiments (such as segmentation results of fully supervised methods, and whether different forms of overlap have a large effect on the results) and experimental analysis (why ViT-H does not perform as well as ViT-L) are missing. Finally, the description of the experimental results in question is not clear enough: I am unsure whether there is a trick to the selection of the reference image, and whether the reported results are averaged over multiple randomly chosen reference images.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors give strategies for reference-image selection, but the related experimental analysis is still imperfect. The authors did not deeply analyze why the segmentation result improves when hard samples are assigned as negative samples, and justifying this setting only through the experimental phenomenon is not rigorous enough. As mentioned by R5, the paper lacks explanations of many key hyperparameter settings and selection rules. Although I think the idea of this article is interesting and provides a new way of thinking for the few-shot task, the chosen dataset and the experimental analysis are still lacking. To encourage the authors’ idea, I keep my original score unchanged.



Review #3

  • Please describe the contribution of the paper

    The paper proposed a one-shot prompt engineering framework to achieve glomerular basement membrane segmentation without model training. In the framework, the authors propose an automatic prompting process with (1) DINOv2 feature embedding; (2) key-point feature matching for prompt generation; and (3) prompt-refinement algorithms for final point selection, which obtain membrane-related positive and negative points from a single reference image and its label to provide better prompts for SAM segmentation. The results of the proposed auto prompt engineering are promising compared with other few-shot and one-shot methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The whole pipeline makes good use of current foundation models (DINOv2 and SAM) to achieve one-shot learning without any training process.
    2. The solution paradigm would be interesting for scarce-label learning and automatic segmentation with limited labels.
    3. The detailed ablation study demonstrates the step-by-step performance of each algorithm in the prompt engineering pipeline, while the comparison with several baseline methods shows the capability of the proposed pipeline.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper raises the domain shift problem caused by different digital devices. However, there is no result or discussion showing that the pipeline addresses this problem (either compare DINOv2 features of two domain-shifted images, or compare SAM segmentations of domain-shifted reference/target pairs).
    2. Several processes in the prompt engineering are unclear: (1) when multiple points map to one point (or one point to multiple points) in the forward and backward matching, it is unclear what selection process yields consistent key-point pairs; (2) what if the hard-sampling points should be positive points that lie very close to the membrane boundary? (3) how accurate is the center of each 16 × 16 patch, and is it consistent with the patch features (for example, at the membrane boundary the feature might resemble the membrane while the center is located in the background region)?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In the Fig. 3 sparse sampling step, only negative points are removed, while no positive points are removed. Please clarify or discuss the reason according to the sparse sampling algorithm.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The whole pipeline is interesting in providing training-free one-shot segmentation empowered by foundation models. The experiments are explicit and the results are promising. However, some concerns and questions should be addressed for a better understanding of the pipeline’s details, which leads to a weak accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my questions and provided more clarification regarding my concerns, so I maintain my rating as the final rating.




Author Feedback

We sincerely appreciate the careful reviews and insightful comments from all reviewers and AC, which are invaluable for revising and enhancing our manuscript.

Common questions: Robustness of Reference Selection and Domain Independence: The reference images used in the experiments were chosen randomly, and we acknowledge concerns about their potential impact. To assess robustness to the reference, we randomly selected reference images 10 times; the resulting average performance is 0.8727, with a standard deviation of 0.0381. The low standard deviation suggests relatively stable performance regardless of reference selection. Our method is training-free, and both prompt generation and segmentation are class-agnostic; consequently, the method’s reliance on the chosen reference is relatively minimal, and it is domain independent. Excessive Prompting Impact: We have noticed that over-prompting a specific region can lead SAM to neglect other areas. A dense distribution of negative prompts in regions with heterogeneous background features may degrade SAM’s segmentation performance. Conversely, for targets with homogeneous features, a dense positive prompt distribution does not degrade SAM’s segmentation performance with ViT-L and helps counteract the impact of some erroneous negative prompts.

To R3: Insufficient Experiments: The limited availability of high-quality annotated medical images poses a challenge, leading us to focus on comparing one-shot and few-shot approaches instead of fully supervised methods. Additionally, Table 2 demonstrates that our method outperforms ref-17 in this task. Exploration of Few-Shot: We value the professional suggestion. While our proposed method has shown superiority over recently proposed few-shot methods, our next focus will be on exploring the potential of few-shot data in prompt generation. ViT-H vs. ViT-L Performance: We observed that ViT-H is more sensitive to prompts compared to ViT-L. In cases of GBM, ViT-H may overlook distant parts when many closely spaced prompts are present, while ViT-L is less affected by this. Hard Samples Analysis: We observed that negative prompts in background resembling the target can improve segmentation. We use hard samples to simulate this, aiming to enhance segmentation performance.

To R4: Patch Matching: For both forward and backward matching, we match the patches of the source image and the target image successively. Once a patch is matched, it is excluded from subsequent matches, preventing multiple points from mapping to one point or one point to multiple points. Hard Sampling: The hard sampling focuses only on the negative prompts in our work, due to the heterogeneity of the background relative to the foreground. Patch Center Offset: This is related to the patch size: a large patch size may cause a patch-center offset, while relatively small sizes (as in ours) reduce the probability of offset, allowing the patch feature to be roughly represented by its center point.
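The matching rule described above can be sketched as a greedy one-to-one assignment (a minimal illustration assuming a torch similarity matrix, not the released GBMSeg code): candidate pairs are visited in descending similarity order, and once a patch on either side is matched it leaves the pool.

```python
# Illustrative greedy one-to-one matching (assumption-based sketch, not the
# released GBMSeg code): each reference/target patch is matched at most once.
import torch

def greedy_one_to_one(sim: torch.Tensor):
    """sim: (N_ref, N_tgt) similarity matrix -> list of (ref_idx, tgt_idx)."""
    n_ref, n_tgt = sim.shape
    order = sim.flatten().argsort(descending=True)   # best pairs first
    used_ref, used_tgt, pairs = set(), set(), []
    for flat in order.tolist():
        r, t = divmod(flat, n_tgt)
        if r in used_ref or t in used_tgt:
            continue                                  # patch already matched
        used_ref.add(r)
        used_tgt.add(t)
        pairs.append((r, t))
        if len(pairs) == min(n_ref, n_tgt):
            break
    return pairs
```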

To R5: Novelty Concern: Our approach does not introduce a completely new model; medical image-based foundation models often demand substantial data and computational resources for training. Instead, we focus on autonomously generating prompts to assist SAM in medical image segmentation, especially in low-resource scenarios such as one-shot learning. This tackles specific challenges in medical imaging, reducing reliance on extensive resources while offering practical solutions for clinical applications. Moreover, our automated prompting framework effectively bridges the gap between SAM on medical and natural images, improving segmentation effectiveness and facilitating the integration of medical images with foundation models from other domains in a training-free manner. Hyperparameters: Due to the page limit, hyperparameter selection is not included. Optimal results were achieved when setting the Dsp of negative prompts to 70, the Dsp of positive prompts to 0, and Dex to 140.
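One plausible reading of how these distance thresholds act on the prompt set is sketched below; the roles of Dsp and Dex are inferred from their names and the reported values, not from the released code, and the helper names are hypothetical. Under this reading, Dsp thins prompts of a class so that no two kept points are closer than the threshold, while Dex excludes negative prompts that fall too close to positive ones.

```python
# Hedged sketch of distance-based prompt filtering; the exact roles of Dsp and
# Dex here are an assumption, and sparse_sample/exclude_near_positives are
# hypothetical helpers, not GBMSeg functions.
import math

def sparse_sample(points, d_sp):
    """Greedily keep points that lie at least d_sp pixels from all kept ones;
    d_sp = 0 keeps everything (the reported setting for positive prompts)."""
    kept = []
    for x, y in points:
        if all(math.hypot(x - kx, y - ky) >= d_sp for kx, ky in kept):
            kept.append((x, y))
    return kept

def exclude_near_positives(negatives, positives, d_ex):
    """Drop negative prompts lying within d_ex pixels of any positive prompt."""
    return [(x, y) for x, y in negatives
            if all(math.hypot(x - px, y - py) >= d_ex for px, py in positives)]

# With the rebuttal's values: Dsp(neg)=70, Dsp(pos)=0, Dex=140.
# negatives = exclude_near_positives(sparse_sample(negatives, 70), positives, 140)
```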




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Accepts



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    2 of 3 reviewers agree to accept this paper. The remaining reviewer has increased the score after the rebuttal process.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2 of 3 reviewers agree to accept this paper. The remaining reviewer has increased the score after the rebuttal process.


