Abstract

Deep learning-based segmentation models have made remarkable progress in aiding pulmonary disease diagnosis by segmenting lung lesion areas in large amounts of annotated X-ray images. Recently, to alleviate the demand for medical image data and further improve segmentation performance, various studies have extended mono-modal models to incorporate additional modalities, such as diagnostic textual notes. Despite the prevalent utilization of cross-attention mechanisms or their variants to model interactions between visual and textual features, current text-guided medical image segmentation approaches still face limitations. These include a lack of adaptive adjustments for text tokens to accommodate variations in image contexts, as well as a deficiency in exploring and utilizing text-prior information. To mitigate these limitations, we propose Asymmetric Bilateral Prompting (ABP), a novel method tailored for text-guided medical image segmentation. Specifically, we introduce an ABP block preceding each up-sample stage in the image decoder. This block first integrates a symmetric bilateral cross-attention module for both textual and visual branches to model preliminary multi-modal interactions. Then, guided by the opposite modality, two asymmetric operations are employed for further modality-specific refinement. Notably, we utilize attention scores from the image branch as attentiveness rankings to prune and remove redundant text tokens, ensuring that the image features are progressively interacted with more attentive text tokens during up-sampling. Asymmetrically, we integrate attention scores from the text branch as text-prior information to enhance visual representations and target predictions in the visual branch. Experimental results on the QaTa-COV19 dataset validate the superiority of our proposed method.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1674_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

https://www.kaggle.com/datasets/aysendegerli/qatacov19-dataset

BibTex

@InProceedings{Zen_ABP_MICCAI2024,
        author = { Zeng, Xinyi and Zeng, Pinxian and Cui, Jiaqi and Li, Aibing and Liu, Bo and Wang, Chengdi and Wang, Yan},
        title = { { ABP: Asymmetric Bilateral Prompting for Text-guided Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a dual-branch framework, Asymmetric Bilateral Prompting (ABP), for text-guided segmentation of X-ray images. The text branch and the image branch interact with each other through ABP blocks along the decoding (up-sampling) steps and the projection steps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This design helps in two ways: 1. the text branch keeps the important tokens and removes redundant ones; 2. the image branch enhances feature representation and prediction using text-prior information.

    The strengths are the interaction between the text branch and the image branch throughout the decoding steps, and the design of the ABP block. The ABP block has two parts: a text part and an image part. The attention map of the image branch at each step is obtained by applying a Sigmoid function to Q_I x K_T / sqrt(C_I) (the attention map of the text branch is obtained in the same way). Cross-attention ends with the multiplication of A_T with V_I and of A_I with V_T. The text tokens (O_T) are re-ordered using the mean of A_T: [CLS, top-M_(i+2)-2, FUSE]. On the other hand, CLS is extracted, then reshaped, projected, and interpolated to get the text-prior prediction output, which is compared with the ground truth using L_aux. The authors then compare their approach with several mono-modal and multi-modal models. The method outperforms the state-of-the-art Ariadne’s Thread, improving the Dice score by 1.25%, MIoU by 2.08%, and Acc by 0.31%. They also perform an ablation study in which they progressively add bilateral attention, the text-branch operation, and the image-branch operation. The results show that each added component improves performance.
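    The Sigmoid-based cross-attention summarized in this review can be sketched as follows. This is a minimal illustration built only from the description above, not the authors' implementation: the function name, the list-of-lists layout, and the transposed dot product between queries and keys are assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_cross_attention(q, k, v):
    """Cross-attention with a Sigmoid in place of a softmax (assumed form).

    q: (Nq, C) queries from one modality.
    k, v: (Nk, C) keys and values from the other modality.
    Returns the (Nq, Nk) attention map and the (Nq, C) attended output.
    """
    scale = math.sqrt(len(q[0]))  # sqrt of the channel dimension C
    attn = [
        [sigmoid(sum(a * b for a, b in zip(qi, kj)) / scale) for kj in k]
        for qi in q
    ]
    out = [
        [sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
        for row in attn
    ]
    return attn, out


# Hypothetical toy input: one image query attending over two text tokens.
attn, out = sigmoid_cross_attention(
    q=[[0.0, 0.0]],
    k=[[1.0, 2.0], [3.0, 4.0]],
    v=[[2.0, 0.0], [0.0, 4.0]],
)
print(attn, out)
```

    Unlike a softmax, the Sigmoid scores each query-key pair independently, so the weights in a row do not need to sum to one.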

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the state-of-the-art section, I think the authors should review ref. 26, because it is the state of the art for text-guided segmentation on the QaTa-COV19 dataset. Fig. 1 is too small; in print form I could not read the text. I do not think the third contribution point is a contribution; it is a result of the method. The “cascaded projectors” (page 4) need more detail or a citation. At the end of page 4, Q, K, and V need to be explained before they are used; the authors only define them at the beginning of page 5, where Q, K, and V are introduced a second time. Page 6: what is MLP? I am concerned about the choice of gamma. Is it adapted to each dataset, or is there a global gamma that works well for all of them? If it has to be adapted to each dataset, the method will not be practical. The same question applies to M_1, M_2, M_3, and C_1. Is any pre-training or transfer learning used for the text and image encoders? The authors claim “significant improvements” of 1.25% in Dice score, 2.08% in MIoU, and 0.31% in Acc. I do not think those numbers are significant.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Fig. 1 and the methodology section describe the overall idea of the method. However, no details of the architecture are given. It would be great if the authors provided their code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the state-of-the-art section, I think the authors should review ref. 26, because it is the state of the art for text-guided segmentation on the QaTa-COV19 dataset. Fig. 1 is too small; the authors could increase the size of its text so that it is legible in print (I printed it and could not read it). I do not think the third contribution point is a contribution; it is a result of the method. The “cascaded projectors” (page 4) need more detail or a citation. The authors should define Q, K, and V on page 4, where they are first introduced. On page 6, MLP should be explained. Please clarify whether any pre-training or transfer learning is used for the text and image encoders.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method shows a small improvement over the SOTA; however, the authors experimented on only one dataset. The choice of parameters is not explained and appears to be tailored to the QaTa dataset. The writing needs some improvement so that readers can understand the paper more easily (bigger text in the figure, explanation of variables, etc.).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed all of my comments and promise to update them in the final version.



Review #2

  • Please describe the contribution of the paper

    Considering that static text embeddings fail to accommodate different semantic contexts at varied scales of image features, and that text-prior information is neglected, this work proposes Asymmetric Bilateral Prompting (ABP), a novel pipeline for text-guided segmentation of lung lesion areas in X-ray images. Specifically, building upon bilateral cross-attention, the authors use the attention scores from the image branch to rank the importance of the text tokens and remove redundant tokens. Moreover, an auxiliary loss is proposed to predict the text-prior information, and the attention scores from the text branch are incorporated into the image branch to further enhance the feature representations. The experiments on the QaTa-COV19 dataset show the superiority of the proposed method compared to other mono-modal and text-guided methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-organized and written.
    2. The motivation to adaptively adjust the text tokens based on attention scores of the image features makes sense and is novel.
    3. Extensive experiments on the QaTa-COV19 dataset show the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Directly concatenating the text and image tokens does not seem reasonable to me.
    2. In the adaptive adjustment part, the unimportant tokens are fused via a weighted average operation. However, what is the performance of simply omitting them, or of other fusion strategies? Ablation studies should be done to verify the effectiveness of the fusion strategy.
    3. The authors apply bilinear interpolation to obtain the text-prior prediction. Why is it necessary?
    4. There is no ablation study on the token lengths M1, M2, and M3.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Please see both the strengths and weaknesses sections.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see both the strengths and weaknesses sections.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The rebuttal addressed most of my concerns and I tend to vote for acceptance.



Review #3

  • Please describe the contribution of the paper

    The paper proposes Asymmetric Bilateral Prompting, a dual-branch method for text-guided medical image segmentation to extend mono-modal models to incorporate diagnostic textual notes. The authors validate their method on the QaTa-COV19 dataset and compare it with four mono-modal segmentation models and three multi-modal text-guided segmentation models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed approach outperforms the state-of-the-art methods in the segmentation task.
    • Ablation studies show that each module of the proposed approach benefits the segmentation performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method was evaluated only on the QaTa-COV19 dataset, which may not be sufficient to understand its limitations.
    • The proposed approach requires paired images and text descriptions for training, which may not be available for many datasets in real-world clinical applications. The applicability of the proposed approach is therefore limited to such datasets.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    More details on the implementation of the proposed approach and the data preprocessing steps are needed to enable reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • There was Equation (6) but no Equation (5).
    • Please elaborate on token pruning. What was the value chosen when preserving top-(M_(i+1)-2) tokens?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel and it has potential in improving the text-guided medical image segmentation. The experimental results on QaTa-COV19 dataset are promising by outperforming the state-of-the-art text-guided segmentation approaches.

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The manuscript proposes a novel lung lesion segmentation model incorporating chest X-ray imaging and diagnostic textual notes. Compared to existing work on text-guided medical image segmentation, which guides the segmentation with encoded text tokens in a unilateral way, the manuscript introduces another branch that specifically updates the text encoding parameters by evaluating the text’s relevance to image features, with auxiliary supervision from the ground-truth segmentation mask.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The writing of the article is excellent and the illustrative figure of the framework is exquisite, providing a pleasant reading experience. The authors pinpointed the shortcomings of related work, thus clearly elucidating the design motivation and rationale behind the proposed work. Constrained by the page number limit, the author made reasonable trade-offs in granularity across different sections, providing detailed explanations of the methods while also showcasing key performance metrics and comparisons of the models.
    • The bilateral text-image fusion block, termed the ABP block, is novel and theoretically efficient. Existing work mostly utilizes textual information to guide image segmentation in one direction, while the proposal adds a second direction that corrects the text tokens’ attention through an auxiliary segmentation task.
    • The experiments and evaluation are robust. The authors compared the proposal to conventional image-only and unilateral text-guided methods, and the ablation study then justified the contribution of each novel block compared to existing work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Honestly, the paper seems quite sound to me and I can barely find any major concerns about the proposal.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Weighting on the final loss: is it possible to dynamically adjust the weighting parameter 𝛾 in Eq. 6 during the training, e.g., an intuitive idea is to enhance the weight of Laux when Laux is significantly larger than Lcls?
    • Typos: page 5: (eg. attention score map)
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As I mentioned in “Main strengths”, the paper is well organized, the method is novel and sound, and the experiment design is intuitive but persuasive. Therefore, future work on the text-guided medical image segmentation task may benefit from the concept of bilateral prompting introduced in the proposal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my minor concern about the parametrization strategy, so I would like to maintain my recommendation.




Author Feedback

Q1: Performance Improvements (R1, R4) A1: In the research realm of using text prompts for segmentation improvement, AT (MICCAI 2023) currently stands as the SOTA method on QaTa-COV19, surpassing previous methods by 3-5% in Dice. Compared with AT, our method further boosts Dice/MIoU by 1.25%/2.08%, with the Dice score (91.03%) exceeding 90% on this dataset for the first time. We also conducted paired t-tests to verify the significance of our improvements. Results indicate that p-values on three metrics are all less than 0.05, demonstrating the statistical significance of our method.

Q2: Evaluation only on QaTa-COV19 (R4) A2: To ensure a fair comparison, we followed the same setup of AT, conducting experiments solely on QaTa-COV19. In the future, we plan to extend experiments to more datasets to further validate the effectiveness of our method.

Q3: Parameters Setting: gamma, token lengths, channel dimension, and Pretraining (R1, R3, R5) A3: In our experiments, we explored various gamma values ranging from 0 to 1.0 with a step of 0.2, finding 0.4 yields optimal results across three metrics. We also tested the dynamic adjusting strategy suggested by R5, but did not observe performance gains compared to using a fixed value of 0.4. As for token lengths, we referred to LViT and set three candidate values (36, 24, and 18) for M1, and explored two descending ratios to determine M2 and M3: an arithmetic progression (1, 3/4, 2/4) and a geometric progression (1, 1/2, 1/4). We found that the arithmetic progression with M1 set at 24 is optimal. Aligned with AT, the text encoder (CXR BERT) is pretrained and fixed, while the image encoder (ConvNeXt-Tiny) is trained from scratch with its output dimension C1 set to 768. We will add these details in the final paper.
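The search space described in A3 can be enumerated with a few lines of code. This is a minimal sketch, not the authors' tuning script: the helper name `token_schedule` and the rounding down of non-integer lengths are assumptions, since the rebuttal does not say how fractional token counts were handled.

```python
from fractions import Fraction

# Descending ratios quoted in the rebuttal for (M1, M2, M3).
ARITHMETIC = [Fraction(1), Fraction(3, 4), Fraction(2, 4)]
GEOMETRIC = [Fraction(1), Fraction(1, 2), Fraction(1, 4)]


def token_schedule(m1, ratios):
    """Return the token length at each decoder stage for a given M1.

    Assumption: non-integer lengths are rounded down.
    """
    return [int(m1 * r) for r in ratios]


# Candidate loss weights: 0 to 1.0 with a step of 0.2.
gammas = [round(0.2 * i, 1) for i in range(6)]
print("gamma candidates:", gammas)

# Candidate values of M1 and the two progressions explored for M2 and M3.
for m1 in (36, 24, 18):
    print(m1, token_schedule(m1, ARITHMETIC), token_schedule(m1, GEOMETRIC))
```

Under these ratios, the reported best setting (arithmetic progression with M1 = 24) corresponds to token lengths of 24, 18, and 12.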

Q4: Unreasonableness of directly concatenating text and image tokens (R3) A4: In fact, our method does not directly concatenate text and image tokens. Instead, we concatenate the cross-attention map with the interacted image feature (tokens) along the channel dimension. This aims to incorporate text-prior information to enhance representations.

Q5: Token pruning and fusion (R3, R4) A5: For pruning, we sort tokens based on their attention values, which were obtained by averaging the cross-attention map from the image branch along the last dimension. For fusion, we have already conducted an ablation experiment without the fusion of unimportant tokens, as shown by Model-D in Table 2. Experimental results show that without the fusion of such tokens, the performance drops by 0.3%. We also tried average fusion and observed a slight decline compared to our weighted fusion. The above analysis verifies the effectiveness of our fusion strategy.
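The pruning and fusion procedure in A5 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `prune_and_fuse` is hypothetical, the attention map is assumed to hold one row of scores per text token, the first token is assumed to be CLS and always kept, and the fusion weights are assumed to be the normalized attention scores of the dropped tokens.

```python
def prune_and_fuse(tokens, attn, keep):
    """Reduce N text tokens to `keep` tokens: [CLS, top-(keep-2), FUSE].

    tokens: list of N feature vectors; tokens[0] is assumed to be CLS.
    attn:   N rows of attention scores, one row per text token (assumed layout).
    keep:   target token length, with 2 <= keep <= N.
    """
    if not 2 <= keep <= len(tokens):
        raise ValueError("keep must satisfy 2 <= keep <= number of tokens")

    # Attentiveness of each token: mean of its attention row (last dimension).
    scores = [sum(row) / len(row) for row in attn]

    # Rank every non-CLS token from most to least attentive.
    ranked = sorted(range(1, len(tokens)), key=lambda i: scores[i], reverse=True)
    kept, dropped = ranked[: keep - 2], ranked[keep - 2 :]

    # Fuse the dropped tokens into one vector by a score-weighted average.
    total = sum(scores[i] for i in dropped)
    dim = len(tokens[0])
    fused = [sum(scores[i] * tokens[i][d] for i in dropped) / total for d in range(dim)]

    return [tokens[0]] + [tokens[i] for i in kept] + [fused]


# Hypothetical toy input: five 2-D tokens, each attending over two image positions.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]
attn = [[0.5, 0.5], [1.0, 0.5], [0.25, 0.25], [0.5, 0.75], [0.25, 0.25]]
print(prune_and_fuse(tokens, attn, keep=4))
```

Replacing the weighted average with a plain mean gives the "average fusion" variant mentioned in the answer, and dropping the fused vector altogether corresponds to the ablation without fusion.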

Q6: The necessity of bilinear interpolation for text-prior predictions (R3) A6: Our auxiliary loss involves text-prior predictions with multiple spatial scales, necessitating their expansion to align with the scale of the ground truth. Therefore, we employ bilinear interpolation for this alignment purpose.
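The alignment step in A6 can be illustrated with a small pure-Python bilinear resize. This is a sketch only: the function name is hypothetical and corner-aligned sampling is an assumption, since the rebuttal does not state which sampling convention or library routine was used.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2-D list of floats to (out_h, out_w) by bilinear interpolation.

    Assumption: corner-aligned sampling, so input and output corners coincide.
    """
    in_h, in_w = len(grid), len(grid[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = min(int(y), max(in_h - 2, 0))  # upper neighbouring row
        y1 = min(y0 + 1, in_h - 1)          # lower neighbouring row
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = min(int(x), max(in_w - 2, 0))  # left neighbouring column
            x1 = min(x0 + 1, in_w - 1)          # right neighbouring column
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Hypothetical coarse 2x2 text-prior map expanded to a 3x3 mask resolution.
coarse = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_resize(coarse, 3, 3))
```

Once every scale has been expanded to the mask resolution, each text-prior prediction can be compared with the same ground-truth mask in the auxiliary loss.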

Q7: Clarification of MLP and cascaded projector (R1) A7: Sorry for the confusion. MLP stands for multi-layer perceptron. The cascaded projector is a custom module built from multiple MLPs arranged hierarchically, which aligns the channel dimensions of the text tokens with those of the image tokens at each scale.

Q8: Typos of Equation number, Definition of QKV, small Fig.1, and Review ref. 26 (R1, R4, R5) A8: Sorry for the typos. We will fix all the typos in the final version. Also, we will define QKV at their first usage and review Ref. 26 in the introduction. Fig. 1 will be enlarged for better clarity.

Q9: Limitation: Reliance on paired data (R4) A9: We agree with the reviewer that such text-guided works often require paired data. However, radiologists typically provide a medical report for each patient in practice, which naturally pairs images with corresponding texts. Additionally, these texts can also be obtained by medical report captioning methods.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    accepts



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Given that static text embeddings struggle to adapt to different semantic contexts in various scales of image features and often overlook text-prior information, this work proposes Asymmetric Bilateral Prompting (ABP), a novel pipeline for text-guided segmentation of lung lesion areas in X-ray images. Specifically, leveraging bilateral cross-attention, the authors use attention scores from the image branch to rank the importance of text tokens and remove redundant ones. Additionally, they introduce an auxiliary loss to predict text-prior information, incorporating attention scores from the text branch into the image branch to further enhance feature representations. Experiments on the QaTa-COV19 dataset demonstrate the superiority of the proposed method compared to other mono-modal and text-guided approaches.



