Abstract

Pneumonia, recognized as a severe respiratory disease, has attracted widespread attention in the wake of the COVID-19 pandemic, underscoring the critical need for precise diagnosis and effective treatment. Despite significant advancements in the automatic segmentation of lung infection areas using medical imaging, most current approaches rely solely on a large quantity of high-quality images for training, which is not practical in clinical settings. Moreover, the unimodal attention mechanisms adopted in conventional vision-language models encounter challenges in effectively preserving and integrating information across modalities. To alleviate these problems, we introduce the Text-Guided Common Attention Model (TGCAM), a novel method for text-guided medical image segmentation of pneumonia. Here, "text-guided" means that both an image and its corresponding text are fed into the model simultaneously to obtain segmentation results. Specifically, TGCAM introduces Common Attention, a multimodal interaction paradigm between vision and language, applied during the decoding phase. In addition, we present an Iterative Text Enhancement Module that facilitates the progressive refinement of text, thereby augmenting multimodal interactions. Experiments on public CT and X-ray datasets demonstrate that our method outperforms state-of-the-art methods both qualitatively and quantitatively.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2599_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/G-peppa/TGCAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Guo_Common_MICCAI2024,
        author = { Guo, Yunpeng and Zeng, Xinyi and Zeng, Pinxian and Fei, Yuchen and Wen, Lu and Zhou, Jiliu and Wang, Yan},
        title = { { Common Vision-Language Attention for Text-Guided Medical Image Segmentation of Pneumonia } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this manuscript, the author introduced a Text-Guided Common Attention Model (TGCAM) for text-guided medical image segmentation of pneumonia.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This model features a novel multimodal interaction technique called Common Attention and an Iterative Text Enhancement Module that refines text to improve multimodal interactions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is still not clear what the relationship between image and text is.
    2. I am not sure whether the input is the image only or the combination of image and text.
    3. I think the accuracy of lung segmentation (i.e., COVID) is very high, while your Dice performance is 0.9.
    4. Why are you using such baseline works [3-5] for comparison?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In this manuscript, the author introduced a Text-Guided Common Attention Model (TGCAM) for text-guided medical image segmentation of pneumonia. This model features a novel multimodal interaction technique called Common Attention and an Iterative Text Enhancement Module that refines text to improve multimodal interactions. The results of the experiments demonstrate that this approach outperforms existing methods. Overall, the manuscript exhibits a high level of clarity and well-written organization. However, certain aspects of the content require further elucidation. The following comments are provided for the author’s consideration:

    1. Training Splitting: Regarding the QaTa-COV19 dataset, the training sets are split 80% for training and 20% for validation, whereas the MosMedData split is 8:1:1. Could you explain the rationale behind these differing data splits?
    2. CT Image Selection: It appears that both X-ray and CT images were utilized in training the model. Given that X-ray and CT scans show different anatomical details of the lungs, how was the CT data selected? Was only one slice used per patient?
    3. Data Comparison: Additionally, Figure 3 only includes X-ray data; could you clarify this choice? Why the performance on MosMedData was lower than QaTa-COV19 dataset?
    4. Ablation Study: It seems that the ablation study was performed on QaTa-COV19 only. Could consider adding the ablation on the MosMedData.
    5. Experimental Setup and Reproducibility: Details about the implementation, such as the exact parameters used for the text and image encoders, or any preprocessing steps applied to the datasets, are crucial for reproducibility. Adding these specifics will enable other researchers to replicate your study accurately.
    6. Discussion on Model Efficiency: While the model’s performance is highlighted, there is little discussion on its computational efficiency. Given the potential application in clinical settings, understanding the model’s resource requirements and processing time is critical. Please include these metrics to provide a more comprehensive evaluation of the model’s practicality.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please show the Point 6

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for your detailed responses to the review comments. Overall, your revisions and clarifications have addressed the majority of the concerns raised. Here are a few remaining points that need to be considered:

    1. Training Splitting: Although the rationale behind the differing data splits was mentioned, a clearer explanation in the manuscript would enhance understanding.
    2. CT Image Selection: More details on the selection process for CT images would be beneficial to ensure consistency and reproducibility.
    3. Data Comparison and Figure 3: Clarification on the inclusion of only X-ray data in Figure 3 and the performance differences between datasets should be provided in the manuscript.
    4. Ablation Study: Including ablation studies on both datasets in the final manuscript will ensure a comprehensive evaluation of your model.
    5. Experimental Setup and Reproducibility: Additional implementation details, such as specific parameters and preprocessing steps, will aid in reproducibility.
    6. Discussion on Model Efficiency: Including more quantitative metrics on computational efficiency will provide a clearer picture of the model’s practical applicability.

    With these minor revisions, I believe the manuscript is worthy of acceptance.



Review #2

  • Please describe the contribution of the paper

    This work proposes a text-guided medical segmentation algorithm. Through experimentation on two public datasets they achieve a new state of the art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Using text information e.g. from radiology reports to aid image segmentation is an interesting and beneficial approach. Datasets used are adequate in size. The authors compare their method with multiple existing methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Clarity and organization of the paper (see comment below) can be improved. Statistical evaluation of the results is not clear/not done.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors state that the test requires text as well as image. Can the authors explain why that’s a limitation? Wasn’t the intent text-guided segmentation?

    The paper would benefit from explaining the problem well before diving deep into the technical details. For example, provide more details about what these image-text pairs are and what text-guided means at the beginning of the paper. This helps the reader understand the importance of the paper.

    Also it is not entirely clear from the start what modalities the proposed method is useful for. While the authors eventually make this clear, it will be useful to provide enough information to make the methodology more understandable.

    It is not clear if statistical testing is done in the evaluation i.e. are results statistically significant?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is not well organized, there is little justification for tackling this problem (leaving the reader to guess), and it is not clear whether the results are clinically significant.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for the clarifications of the work. Since a major portion of my concerns were around clarity and organization of the paper, I trust that the revision will correct these. Given the strengths of the work that I listed, and the post-rebuttal version, I believe this work could be suitable for publication.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel approach for medical image segmentation of pneumonia through a text-guided model named Text-Guided Common Attention Model (TGCAM). The model integrates vision and language modalities using a common attention mechanism during the decoding phase, alongside an Iterative Text Enhancement Module for progressive text refinement.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The research targets the enhancement of segmentation performance by leveraging multimodal data, specifically addressing the limitations of unimodal attention mechanisms in conventional vision-language models. Experimental evaluations on two public datasets demonstrate that the proposed method outperforms existing state-of-the-art approaches both qualitatively and quantitatively.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Please clarify how the skip connection features are concatenated with the ITEM module shown in Fig. 2. In Fig. 2(c) we can only see F_com and T, but it is not clear how the skip features are integrated.
    2. For the ablation study, it would be more helpful to compare the CAM against a normal attention module.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    NA

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I am inclined to accept this paper, based on its innovation and experimental results.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thanks to all reviewers (R1, R3, R4) for constructive comments.

Q1: Clarification on Text-guided Segmentation(R1,R4) A1: Sorry for any confusion. Here’s further clarification on our task: The essence of text-guided segmentation lies in utilizing additional textual information paired with images to enhance segmentation accuracy. This entails inputting both an image and its corresponding text (termed image-text pairs) into the model simultaneously to obtain segmentation results. We conducted independent experiments using two pneumonia datasets: QaTa-COV19 (QaTa), comprising X-ray slices, and MosMedData (Mos), comprising 2D CT slices. Each slice is paired with text descriptions structured by [10], containing information about infection type, the number of infected areas, and the location of the infection. Detailed explanations will be provided in the final manuscript. Despite superior performance compared to image-only methods, text-guided segmentation rigidly requires complete image-text pairs for both training and testing, which we identify as a limitation in our conclusion. To allow for missing text in partial cases, we plan to enhance our model by addressing missing modalities in the future.
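
The requirement for complete image-text pairs described in A1 can be illustrated with a small sketch. It is not taken from the TGCAM code: the `Sample` class, its field names, the file names, and the example sentence are all hypothetical, chosen only to show how samples with missing text would have to be set aside before training or testing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """One dataset entry; the field names are hypothetical."""
    image_path: str
    text: Optional[str]  # None or blank when no description is available


def split_complete_pairs(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Separate samples that have both image and text from those missing text."""
    complete, missing = [], []
    for s in samples:
        if s.text is not None and s.text.strip():
            complete.append(s)
        else:
            missing.append(s)
    return complete, missing


# Toy entries; the description is an invented example, not dataset content.
samples = [
    Sample("case_001.png", "Bilateral pulmonary infection, two infected areas, lower left lung."),
    Sample("case_002.png", None),
    Sample("case_003.png", "   "),
]
complete, missing = split_complete_pairs(samples)
print(len(complete), len(missing))  # prints "1 2"
```

Under this assumption, only the samples in `complete` could be passed to a text-guided model, which is the limitation the authors acknowledge.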

Q2: Experimental details (R4) A2: Train split: The division into training, validation, and test sets followed the guidelines from [10], ensuring consistency for reliable comparisons. CT Image Selection: In MosMedData, the dataset provider [10] has already selected multiple 2D slices from each patient's 3D CT volume. Data Comparison: As discussed in A1, since our model was trained independently on the two datasets rather than jointly, performance metrics naturally vary due to differences in label quality and modality distribution. Notably, we achieved SOTA results on both datasets.
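
For readers unfamiliar with ratio-based splits such as the 8:1:1 division mentioned by the reviewer, the following is a minimal sketch. It assumes a random split with a fixed seed; the actual splits follow [10] and may be predefined lists rather than a random shuffle. The function name `split_by_ratio` and the slice identifiers are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def split_by_ratio(ids: Sequence[str], ratios: Dict[str, float], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle ids with a fixed seed and cut them into named, disjoint subsets."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    names = list(ratios)
    out: Dict[str, List[str]] = {}
    start = 0
    for i, name in enumerate(names):
        # The last subset takes the remainder so no id is lost to rounding.
        if i == len(names) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratios[name] * len(shuffled))
        out[name] = shuffled[start:end]
        start = end
    return out


# Hypothetical slice identifiers, used only to exercise the function.
ids = [f"slice_{i:04d}" for i in range(100)]
parts = split_by_ratio(ids, {"train": 0.8, "val": 0.1, "test": 0.1})
print({name: len(subset) for name, subset in parts.items()})
```

An 80%/20% training/validation split is the same call with `{"train": 0.8, "val": 0.2}`. Note that when several 2D slices come from one patient, splitting by patient rather than by slice avoids leakage between subsets; whether [10] does so is not stated here.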

Q3: More implementation details and reproducibility(R1,R4) A3: Aligned with LGMS [12], our backbone uses pre-trained encoders (ConvNeXt for image and CXR-BERT for text) and the same simple data processing steps, making it easy to reproduce. We will add more implementation details to the final manuscript and release codes to ensure better reproducibility.

Q4: Comparison experiments and performance (R1,R4) A4: The comparison to unimodal methods [3-5] aims to validate the motivation and benefits of introducing the text modality, as outlined in A1. To make the comparison more comprehensive, we will add CT visualization on MosMedData in Figure 3. Since training on the QaTa dataset typically converges relatively easily, achieving further improvements becomes harder as metrics approach the bottleneck (near 90% Dice). However, our model achieved a Dice score of 90.6% on QaTa, surpassing the 90% threshold for the first time. Besides, remarkable results in MIoU on QaTa and two metrics on Mos underscore the effectiveness of our method. Paired t-tests between our model and the second-best LGMS on QaTa and Mos show p-values for Dice and MIoU less than 0.05, indicating statistical significance.
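
The metrics and the paired test mentioned in A4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: masks are flattened binary lists, IoU is computed per sample (the averaging convention behind the reported MIoU is not specified here), and the per-case scores fed to the test are toy numbers, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Dice and IoU for two flattened binary masks of equal length."""
    inter = sum(p & t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    # Two empty masks are treated as a perfect match (a common convention).
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[i] - b[i], with n - 1 degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
print(round(dice, 4), round(iou, 4))  # prints "0.6667 0.5"

# Toy per-case Dice scores for two hypothetical models on the same four cases.
model_a = [0.91, 0.89, 0.92, 0.90]
model_b = [0.90, 0.87, 0.91, 0.88]
t = paired_t_statistic(model_a, model_b)
print(round(t, 3))
```

Turning the statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` performs both steps if SciPy is available.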

Q5: Model complexity(R4) A5: Compared to some methods [10,13] using Transformer as the image backbone, we opt for a more lightweight ConvNeXt with fewer parameters. Additionally, our proposed multimodal interaction module comprises just a few simple linear layers, contributing minimal parameters. For inference, we only need milliseconds to generate results, thus fitting clinical application scenarios.

Q6: Ablation Studies(R2,R4) A6: Due to space constraints, we only showed ablation results on QaTa, where the ablation effects are more pronounced. To ensure comprehensiveness, we’ll include ablation studies on Mos in the final manuscript. Also, we’ll incorporate normal attention as one ablation version to better validate our attention modules.

Q7: Illustration (R3) A7: In fact, the skip connection concatenates the updated Fcom with the image feature from the corresponding upsampling level, as described in Section 2.4. We will improve our illustration in Fig. 2(c) to avoid misunderstanding.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed initial reviews, and the two negative reviewers raised their scores based on the rebuttal. After reading the paper and reviews, I believe the main initial issue with this paper is clarity. After the rebuttal, the majority of concerns are resolved. Hence, I also recommend acceptance of this work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After carefully reading the authors’ rebuttal, I will vote for acceptance, as the major concerns raised by reviewers have been effectively addressed.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



