Abstract

Recent advances in large foundation models, such as the Segment Anything Model (SAM), have demonstrated considerable promise across various tasks. Despite their progress, these models still encounter challenges in specialized medical image analysis, especially in recognizing subtle inter-class differences in Diabetic Retinopathy (DR) lesion segmentation. In this paper, we propose a novel framework that customizes SAM for text-prompted DR lesion segmentation, termed TP-DRSeg. Our core idea involves exploiting language cues to inject medical prior knowledge into the vision-only segmentation network, thereby combining the advantages of different foundation models and enhancing the credibility of segmentation. Specifically, to unleash the potential of vision-language models in the recognition of medical concepts, we propose an explicit prior encoder that transfers implicit medical concepts into explicit prior knowledge, providing explainable clues to excavate low-level features associated with lesions. Furthermore, we design a prior-aligned injector to inject explicit priors into the segmentation process, which can facilitate knowledge sharing across multi-modality features and allow our framework to be trained in a parameter-efficient fashion. Experimental results demonstrate the superiority of our framework over other traditional models and foundation model variants. The code implementations are accessible at https://github.com/wxliii/TP-DRSeg.
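
To make the abstract's pipeline easier to picture, below is a minimal, hypothetical PyTorch sketch of the two core ideas: a frozen VLM (e.g., CLIP) text embedding of an explicit lesion description is correlated with image patch features to form an explicit prior, and a lightweight cross-attention injector feeds that prior into the SAM image encoder tokens. All class, variable, and dimension choices here are illustrative assumptions, not the authors' released implementation (see the code repository for that).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExplicitPriorEncoder(nn.Module):
        """Illustrative sketch: correlate a CLIP text embedding of an explicit
        lesion description with image patch tokens to build a prior."""
        def __init__(self, text_dim=512, img_dim=768):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, img_dim)

        def forward(self, patch_tokens, text_emb):
            # patch_tokens: (B, N, C) image patch features; text_emb: (B, text_dim)
            t = F.normalize(self.text_proj(text_emb), dim=-1)   # (B, C)
            v = F.normalize(patch_tokens, dim=-1)               # (B, N, C)
            sim = torch.einsum("bnc,bc->bn", v, t)              # per-patch similarity to the description
            return sim.unsqueeze(-1) * patch_tokens             # text-conditioned prior tokens

    class PriorAlignedInjector(nn.Module):
        """Illustrative cross-attention injector: SAM encoder tokens attend to prior tokens."""
        def __init__(self, dim=768, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gamma = nn.Parameter(torch.zeros(1))  # starts at 0, so training begins from plain SAM

        def forward(self, sam_tokens, prior_tokens):
            injected, _ = self.attn(sam_tokens, prior_tokens, prior_tokens)
            return sam_tokens + self.gamma * injected

In a setup like this, only such lightweight modules would be trained while SAM and CLIP stay frozen, which is what makes the parameter-efficient training claimed in the abstract plausible.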

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0014_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0014_supp.pdf

Link to the Code Repository

https://github.com/wxliii/TP-DRSeg

Link to the Dataset(s)

https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid
https://github.com/nkicsl/DDR-dataset

BibTex

@InProceedings{Li_TPDRSeg_MICCAI2024,
        author = { Li, Wenxue and Xiong, Xinyu and Xia, Peng and Ju, Lie and Ge, Zongyuan},
        title = { { TP-DRSeg: Improving Diabetic Retinopathy Lesion Segmentation with Explicit Text-Prompts Assisted SAM } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a method for segmenting lesions in Diabetic Retinopathy (DR) using a SAM-based approach. By utilizing explicit lesion descriptions instead of implicit class names, the proposed method generates interpretable cues for enhanced segmentation performance and explainability. These cues are integrated into the SAM encoder by the proposed injectors, resulting in improved segmentation accuracy and interpretability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-structured and well-written.
    2. The paper uses explicit descriptions to bridge the gap between a VLM trained on natural images and its adaptation to Diabetic Retinopathy segmentation.
    3. The segmentation performance significantly surpasses that of existing methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This method achieves excellent DR segmentation performance. However, I have a few concerns:

    1. Limited innovation: 1) The authors emphasize using explicit descriptions instead of implicit class names to generate explicit priors with a VLM (e.g., CLIP). However, to me it seems more like common sense to feed in fine-grained descriptions of the morphological features of the target area. For example, in CRIS [1], the language example ‘a blond-haired, blue-eyed young boy in a blue jacket’ provides not only the class name but also color/morphological descriptions of the target. 2) The injector module appears to be a cross-attention module, which shares similarities with existing works; for instance, it resembles the spatial feature injector in the ViT-Adapter [2]. [1] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M. and Liu, T., 2022. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11686-11695). [2] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J. and Qiao, Y., 2022. Vision Transformer Adapter for Dense Predictions. arXiv preprint arXiv:2205.08534.
    2. The figure illustration is not clear. For instance, the feature embedding from the SAM encoder is fed into the Class-Specific Prompt Generator, but this is not shown in Fig. 2.
    3. The performance evaluation would be enhanced by comparisons with existing methods built on transformer architectures other than SAM, such as the Swin Transformer [1], as well as CNN-based architectures, to provide a more comprehensive analysis. [1] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. and Guo, B., 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012-10022).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Could you provide further elaboration on the unique aspects of how explicit descriptions were utilized? Specifically, could you detail the process of forming these descriptions? Additionally, I’m curious if there was any clinical guidance involved in this process.
    2. I wonder if you could shed some light on the injector module. It appears to resemble a cross-attention module. Could you clarify whether a feed-forward network was used? I am also interested to know whether injecting into different layers makes a difference; an ablation study on this aspect could provide more information.
    3. What is the motivation behind designing the Class-specific Prompt Generator? Considering that SAM already includes a prompt generator that takes text prompts as input, why not directly provide the text features for the different classes instead of the explicit prior?
    4. Expanding the comparisons to include more state-of-the-art models, covering both representative transformer-based and CNN-based methods, would be great.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Limited innovation; some of the modules are designed without a clear motivation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The submission proposes to use textual cues to guide a SAM model for the segmentation of lesions in DR. For this, they don’t convert the class names into a text embedding directly, but into a prompt derived from the class name that is more “in the words of” a VLM that doesn’t know medical terminology. To use the embedding of this textual description (merged with the embedding of the image the text corresponds to), the authors propose a trainable Injector module that interacts with the SAM encoder blocks. The SAM decoder converts the SAM encoder result into the segmentation, again using a trainable prompt generator that utilizes the same embeddings and converts them into global (sparse) and local (dense) embeddings. In their various evaluations, the authors compare the model against similar approaches, against fine-tuning and other baselines, and in an ablation study also against variants where certain blocks are not used. In all experiments, the proposed method is among the best.
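
    To make the sparse/dense prompt step in this summary concrete, here is a rough, hypothetical sketch of how a class-specific prompt generator could turn prior tokens into the global (sparse) and local (dense) prompt embeddings a SAM-style decoder expects; module names, shapes, and defaults are assumptions rather than the authors' implementation.

        import torch
        import torch.nn as nn

        class ClassSpecificPromptGenerator(nn.Module):
            """Illustrative sketch: map class-conditioned prior tokens to SAM-style prompts."""
            def __init__(self, prior_dim=768, embed_dim=256, num_sparse_tokens=4):
                super().__init__()
                # global (sparse) prompt: a few learned queries that summarise the prior
                self.sparse_queries = nn.Parameter(torch.randn(num_sparse_tokens, embed_dim))
                self.sparse_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
                self.prior_proj = nn.Linear(prior_dim, embed_dim)
                # local (dense) prompt: a small conv head producing a spatial embedding map
                self.dense_head = nn.Sequential(
                    nn.Conv2d(prior_dim, embed_dim, kernel_size=1),
                    nn.GELU(),
                    nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1),
                )

            def forward(self, prior_tokens, h, w):
                # prior_tokens: (B, N, prior_dim) with N == h * w
                b, n, c = prior_tokens.shape
                proj = self.prior_proj(prior_tokens)                    # (B, N, embed_dim)
                q = self.sparse_queries.unsqueeze(0).expand(b, -1, -1)  # (B, T, embed_dim)
                sparse, _ = self.sparse_attn(q, proj, proj)             # global prompt tokens
                dense = self.dense_head(
                    prior_tokens.transpose(1, 2).reshape(b, c, h, w))   # (B, embed_dim, h, w)
                return sparse, dense

    In such a design, the sparse tokens would play the role of SAM's point/box prompt embeddings and the dense map the role of its mask prompt, telling the frozen decoder which class to segment.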

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors try to open a new path for segmentation models by incorporating textual cues into the segmentation algorithm. This is done not only in a small experiment but in a very thorough and systematic way, with implementations that are not mere concatenations of already existing ideas. Therefore, there is a good amount of novelty to the presented approach, both conceptually and technically. Also, the comparison of the proposed approach with other methods is comprehensive and conclusive; in addition, two datasets have been used. The authors apparently try to be explicit about the inner workings of the modules, giving semi-mathematical formulations for some aspects. It is appreciated that they also try to elucidate the dimensionality of tensors in many places, which for example helps one understand better how the textual information is converted into embeddings on the spatial scale of the image.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    To me it seems questionable how a textual prompt should improve the detection/segmentation of findings in images. If I understood the work correctly, the authors use CLIP in a frozen state, i.e., not fine-tuned or adapted to the medical use case. Therefore, CLIP obviously cannot understand the medical terminology describing lesions in DR. This seems to lead the authors to convert the lesion names into descriptions of how the class usually presents in layman’s terms (“small white blobs not too far from contrasted lines” or whatever). This in turn means that the model will not be able to learn by itself which other features might indicate the presence of the disease, since it will be guided by a prompt that comes from a human understanding of “what I see when I diagnose”. We know very well already that diseases are much more complex than visible biomarkers, that diagnostically relevant information can be much more distributed, etc. It is therefore actually a severe restriction on the capacity of a model to let it only look for things that humans can describe. Note that this criticism hinges on the question of how the “explicit prior” is actually generated.

    I guess this is my major point about this submission. A few minor general points follow in the detailed comments below.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • I don’t understand why the explicit textual description is helpful for interpretability.
    • I didn’t get how the explicit prior is derived from the class name. Is this done through the text/image embedding alignment in the Explicit Prior encoder? Fig. 2 seems to suggest that this conversion happens before the VLM text encoder goes to work. How, then, is the lookup of the “explicit description” from the “class names” generated/compiled…?
    • Before Eqn. 1 there are H_s and W_s – what does the subscript s relate to?
    • After Eqn. 1, you claim that the explicit description “unleashes” the potential, offering “credible global guidance”. That is very strong wording that remains to be tested; yet your ablation study does not seem to contrast a class-name-derived text prompt with the “explicit prior” text prompt, thereby stopping short of substantiating those claims.
    • You mention “SAM-CLIP incorporating text” (p7) without a reference and then assess that it falls short of understanding medical terminology. How did you remedy this, when it appears you are just using the pretrained CLIP? Is this why you actually use “explicit descriptions” instead of the correct medical terms?
    • In the cross-modality interaction, you use a 1x1 conv with stride 4? This means you don’t even average, but essentially discard everything between these every-fourth-pixel samples? (See the short snippet after this list.) Some typos: p2: “This explainable cues” -> “These …”; p5, right before Section 2.3: F'' = upsample(F'') must be upsample(F') (only one prime); p6, right after Eqn. 4: the second phi_dense should probably be phi_sparse; p7: “which is show in Table 2” -> “… is shown …”; p8: “degradation when moving the prior-aligned injector)” -> “… when removing …”.
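
    Regarding the stride-4 1x1 convolution questioned in the last bullet, the following few lines of PyTorch contrast the two behaviours (strided subsampling vs. pooling first); this is an illustrative snippet, not code from the paper.

        import torch
        import torch.nn as nn

        x = torch.randn(1, 256, 64, 64)

        # 1x1 conv with stride 4: mixes channels at each kept position, but keeps
        # only every fourth pixel per spatial dimension; the rest is simply dropped.
        strided = nn.Conv2d(256, 256, kernel_size=1, stride=4)

        # Alternative the reviewer alludes to: average 4x4 neighbourhoods first,
        # so the discarded pixels still contribute to the downsampled feature.
        pooled = nn.Sequential(nn.AvgPool2d(kernel_size=4),
                               nn.Conv2d(256, 256, kernel_size=1))

        print(strided(x).shape, pooled(x).shape)  # both torch.Size([1, 256, 16, 16])
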
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I’m not fully convinced by the idea of creating SAM prompts using language models, but it is certainly a path that can be explored. That is why I would like to see the paper presented and discussed. Several points were unclear to me and should be addressed in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work leverages text prompts to inject medical prior knowledge into the segmentation network. It proposes a novel image segmentation architecture consisting of a prior-aligned injector and a class-specific prompt generator.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The performance improvement is notable, and its effectiveness is evaluated on two public datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Additional explanations are required on ‘Explicit description’ and ‘Implicit class name’ (Fig. 2). Also, conducting studies to assess their impact on model performance would be beneficial.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It would be beneficial to provide a more detailed explanation of why the predictions from previous work are considered ‘class-agnostic’ in Fig. 1.
    • In Fig. 4, the differences between methods are not clearly evident. Providing quantitative metric values for each qualitative result would enhance clarity.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good performance and strong motivation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Some issues remain, such as terms or claims that are not convincing, but this work shows significant quantitative performance improvement. I will maintain the current score.




Author Feedback

[R1&R3&R4] Q1: Clarification of the explicit prior. Segmentation models, unlike classification models, focus on detecting predefined lesions from human-provided lesion masks, rather than other biomarkers. Given that the same lesion type consistently shares the same morphology, we can describe that morphology with predefined descriptions. CLIP is pretrained on general-domain datasets and has a natural ability to perceive color, shape, and other morphological features; however, its understanding of medical concepts is limited. To address this limitation, we translate medical concepts (implicit class names) into descriptions that are more understandable to CLIP (explicit descriptions), providing interpretable and additional references for segmentation. The conversion of class names to explicit descriptions occurs before the VLM text encoder: we map each class name to a corresponding predefined explicit description. These descriptions are crafted from the ophthalmology literature and have been validated by relevant experts as well as GPT-4. Moreover, our proposed injector further adapts and aligns CLIP to the medical domain, enabling efficient training with low computational resources; despite being frozen, CLIP can learn to locate DR lesions through the injector. Preliminary ablation experiments have been performed but are not included due to space limitations: without the transfer to explicit descriptions, mDice drops by 3.06% on IDRiD and 2.89% on DDR, as implicit priors introduce more noise during training.

[R4] Q2: Novelty concerns. 1) CRIS uses pre-existing descriptions without needing to handle implicit concepts and cannot be directly applied to our task. Our approach is specifically designed for medical scenarios and addresses the challenges of the medical domain by translating medical concepts into explicit descriptions and adapting them with our injector. 2) Our injector integrates explicit textual descriptions into the segmentation model, specifically tailored for medical applications. Unlike the ViT-Adapter, which fuses visual features from ResNet and ViT backbones, our injector aligns text-based priors with visual features. Additionally, our definitions of query, key, and value differ from the ViT-Adapter, and we do not use a feed-forward network.

[R4] Q3: Ablation study for the injector. The injector facilitates knowledge sharing between explicit priors and multi-level visual features. Our preliminary experiments show that using the injector only in the final encoder layer leads to performance degradation, highlighting the importance of multi-layer adaptation.

[R4] Q4: More comparisons. We compared against FCT [18] in our experiments, a recent SOTA transformer-based method. FCT has proven its effectiveness by outperforming SwinUNet, nnFormer, and other SOTA methods for medical tasks, indirectly highlighting the strength of our approach. Based on your valuable suggestion, we will replace the SAM encoder with the Swin Transformer and include more comparisons in future work.

[R4] Q5: Motivation of the Class-specific Prompt Generator (CPG). Our framework aims to generate masks for specific classes. The CPG acts as a bridge between the prior information and the SAM decoder, telling the decoder which class to segment and supplying the relevant prior information.

[R1&R4] Q6: About SAM's text-prompt generator. It relies solely on the original CLIP, leaving a domain gap between the general and medical domains. We have therefore designed several modules to adapt SAM and VLMs to the DR lesion segmentation task.

[R1] Q7: H_s and W_s. The subscript s refers to the spatial dimensions.

[R1] Q8: Average operation in the injector. Experimental results indicate that the average operation has minimal impact on performance.

[R3] Q9: Detailed explanation of 'class-agnostic'. We will include it in the final version.

Q10: Minor issues. We will correct them in the final version.
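
The class-name-to-description conversion described in Q1 amounts to a fixed lookup applied before the frozen text encoder. The snippet below illustrates the mechanism with placeholder descriptions for the four IDRiD/DDR lesion classes; the wording of the authors' actual descriptions (curated from the ophthalmology literature with expert and GPT-4 validation) is not reproduced here.

    import torch
    import clip  # OpenAI CLIP; any frozen VLM text encoder can be used analogously

    # Hypothetical lookup: implicit class name -> explicit, layman-style description.
    # The descriptions below are placeholders, not the paper's curated prompts.
    EXPLICIT_DESCRIPTIONS = {
        "microaneurysm": "tiny round dark red dots scattered near thin vessels",
        "haemorrhage": "irregular dark red blots larger than the small dots",
        "hard exudate": "small bright yellow-white patches with sharp edges",
        "soft exudate": "fluffy pale white patches with blurred borders",
    }

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/16", device=device)
    model.eval()

    def encode_explicit_prior(class_name: str) -> torch.Tensor:
        """Swap the implicit medical term for its explicit description,
        then embed it with the frozen CLIP text encoder."""
        tokens = clip.tokenize([EXPLICIT_DESCRIPTIONS[class_name]]).to(device)
        with torch.no_grad():
            return model.encode_text(tokens)  # (1, 512) text embedding

    text_emb = encode_explicit_prior("hard exudate")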




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two out of three reviewers recommended acceptance. The paper indicates a path that can be explored.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although two reviewers gave ‘weak accept’ for this work, their confidence is relatively low. The negative reviewer raised a concern about novelty, which should be carefully considered.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have carefully read the submission, reviews, feedback, and other meta-reviews. There are certain aspects of this submission that I do not particularly like. Mostly, I do not think that segmenting retinal lesions makes any sense from a clinical point of view; at most we could talk about detecting them, but the specific border of exudates or micro-aneurysms has no interest for DR diagnosis. I see that the authors used the area under the PR curve; I understand that this works at the lesion level, considering TP, FP, and FN lesions*, which seems to indicate that they share this idea, but then they also reported averaged Dice (ridiculously low, pointing to how useless this metric is for this task) and AUC (I don’t get how they compute AUC in this scenario). In my opinion, the best way to evaluate things in this problem, which has some aspects of segmentation and some of detection, would be following a PICAI-style pipeline (https://github.com/DIAGNijmegen/picai_eval/). I don’t think the authors have been careful enough in considering the way they evaluate performance, as illustrated by the fact that they never even described the metrics they chose and why they chose them. I also share with R4 and R1 the question of how the descriptions were built, and believe this should be explained in the paper.

    This said, R1 made a good case about the paper being well-written, with a good effort regarding communicating what has been done. Two reviewers recommended acceptance, and the only reasons offered by the “rejecting” meta-reviewer were lack of novelty and the low confidence of reviewers. I do not think there are enough arguments to reject this borderline paper; I believe it should be accepted.

    I know you don’t have to do this, but please try to incorporate as much feedback as possible into the camera-ready version, especially the explanation of how you came up with the lesion descriptions, maybe by sacrificing a row in Fig. 4 or so. I would also like to advise the authors to reflect on the way they evaluate performance for this problem in the future, and probably get rid of Dice/AUC.

    * This would require more work to define when a lesion is considered TP, FP, or FN in terms of overlap, etc.


