Abstract

Segmentation of infected areas in chest X-rays is pivotal for facilitating the accurate delineation of pulmonary structures and pathological anomalies. Recently, multi-modal language-guided image segmentation methods have emerged as a promising solution for chest X-rays, where clinical text reports describing the assessment of the images are used as guidance. Nevertheless, existing language-guided methods require clinical reports alongside the images and hence are not applicable to image segmentation in a decision-support context; rather, they are limited to retrospective image analysis after clinical reporting has been completed. In this study, we propose a self-guided segmentation framework (SGSeg) that leverages language guidance for training (multi-modal) while enabling text-free inference (uni-modal), making it the first framework to enable text-free inference in language-guided segmentation. We exploit the critical location information of both pulmonary and pathological structures depicted in the text reports and introduce a novel localization-enhanced report generation (LERG) module to generate clinical reports for self-guidance. Our LERG integrates an object detector and a location-based attention aggregator, weakly supervised by a location-aware pseudo-label extraction module. Extensive experiments on the well-benchmarked QaTa-COV19 dataset demonstrate that SGSeg achieves superior performance to existing uni-modal segmentation methods and closely matches the state-of-the-art performance of multi-modal language-guided segmentation methods.
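
To make the two operating modes concrete, below is a minimal, hypothetical sketch of such a self-guided forward pass: language guidance (here, report tokens) is consumed during training, while at inference the report is generated from the image features themselves. All module names, shapes, and wiring are illustrative assumptions and simplifications, not the authors' implementation of SGSeg or LERG.

# Conceptual sketch only (not the authors' code): language guidance in training,
# self-generated text guidance at inference.
import torch
import torch.nn as nn

class SGSegSketch(nn.Module):
    def __init__(self, d=256, vocab=1000):
        super().__init__()
        self.image_encoder = nn.Conv2d(1, d, kernel_size=4, stride=4)          # stand-in for the U-Net/transformer encoder
        self.report_generator = nn.Linear(d, vocab)                            # stand-in for the LERG report generator
        self.text_encoder = nn.Embedding(vocab, d)                             # stand-in for a clinical-text encoder
        self.fusion = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # cross-attention fusion of image and text
        self.decoder = nn.Conv2d(d, 1, kernel_size=1)                          # stand-in for the segmentation decoder

    def forward(self, image, report_tokens=None):
        feats = self.image_encoder(image)                      # B x d x h x w
        b, d, h, w = feats.shape
        queries = feats.flatten(2).transpose(1, 2)             # B x (h*w) x d image tokens
        if report_tokens is None:                              # text-free inference: self-guidance
            logits = self.report_generator(queries.mean(dim=1, keepdim=True))
            report_tokens = logits.argmax(dim=-1)              # B x 1 pseudo-report token(s)
        text = self.text_encoder(report_tokens)                # B x T x d
        fused, _ = self.fusion(queries, text, text)            # image tokens attend to text guidance
        fused = fused.transpose(1, 2).reshape(b, d, h, w)
        return torch.sigmoid(self.decoder(fused))              # B x 1 x h x w segmentation mask

# Training (multi-modal):  mask = model(image, report_tokens=tokenized_report)
# Inference (text-free):   mask = model(image)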

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0556_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0556_supp.pdf

Link to the Code Repository

https://github.com/ShuchangYe-bib/SGSeg

Link to the Dataset(s)

https://www.kaggle.com/datasets/aysendegerli/qatacov19-dataset

BibTex

@InProceedings{Ye_Enabling_MICCAI2024,
        author = { Ye, Shuchang and Meng, Mingyuan and Li, Mingjian and Feng, Dagan and Kim, Jinman},
        title = { { Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors of this paper proposed a language-guided segmentation model for chest X-ray segmentation. The method utilizes multi-modal data during the training phase and enables text-free, uni-modal inference through the use of generated texts as self-guidance. The proposed method demonstrates superior performance compared to conventional uni-modal methods and comparable performance to state-of-the-art multi-modal methods for the binary segmentation task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method utilizes multi-modal data during the training phase and enables text-free, uni-modal inference through the use of generated texts as self-guidance.
    2. The proposed method demonstrates superior performance compared to conventional uni-modal methods and comparable performance to state-of-the-art multi-modal methods for the binary segmentation task.
    3. The motivation for the study is clearly presented, and the overall writing regarding the study is clear.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The writing about the methodology is not clear.
    2. The validation of the method is limited to just one dataset, and the generalization of the method is not evident.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The method introduction is not very clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The motivation behind the selection of U-Net-like transformers is not clearly explained, nor are the other components in the framework.
    2. Why do the authors believe that the direct integration of image and text embeddings can be a good way to combine information from different modalities? Are those features naturally aligned?
    3. The training details are very unclear. The proposed framework comprises a few models, but it is unclear how many of them need to be trained. Do the text report generation and detection parts need to be trained or not?
    4. The generalization ability of the method is not demonstrated.
    5. Could human evaluation be more appropriate for evaluating text report generation performance?
    6. While the performance of the method appears impressive, it also utilizes significantly more models than the compared methods. The authors may want to compare the number of parameters and inference times with other methods.
    7. How about evaluating the performance of LanGuideSeg without text (training with text but inference without text), similar to the proposed method’s settings?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, motivation of the study and methodology design, experiments, and paper writing

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a segmentation network that can be trained multi-modally, i.e., with segmentation labels as well as text-based descriptions, and used uni-modally, i.e., with only the image given. The network is studied for lung X-ray images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Description of a network consisting of a language-guided U-Net coupled with a localization-enhanced report generation module.

    The paper provides a comprehensive comparison with different state-of-the-art segmentation methods and shows the results of an ablation study to study the effect of the major components.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses are of a minor character.

    The methods could have been elaborated a bit more, but that would have resulted in shortening the studies and results section, which is also not desired.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code is provided along with the paper, datasets are cited

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Extending the methods section to provide more information about the implementation would help readers understand the paper better.

    Integrating the report generation results from the supplement into the main text would improve readability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Very nice paper with good results and overall high interest to the community

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a self-guided segmentation framework tailored for text-free analysis. The method is applied to chest X-ray analysis and was evaluated on the QaTa-COV19 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • comparative analyses were conducted with uni-modal and multi-modal segmentation models;
    • while the method leverages textual information during training, it does not depend on textual input during inference, presenting comparable performance to methods that rely on image + text for inference.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • since the method was developed using only one COVID dataset, it is not clear if it is limited to COVID infections and how it would perform on other lung findings.
    • there is no comment on limitations of the method
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Paper is well written and technically sound. Authors should discuss the limitations of the paper, considering that it was developed and tested on one specific COVID dataset.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solid paper with very detailed experiments, good figures and relevant results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We are grateful for the Reviewers’ (Rs) constructive comments and valuable insights. We appreciate the acknowledgment by all the Rs of our novelty, where we propose text-free inference in image segmentation, in contrast to current language-guided models that necessitate clinical reports alongside the images. We also thank the Rs for commenting on the technical contribution of our framework, which learns cross-modal knowledge from multi-modal information during training and realizes text-free inference by self-guidance.

We respond to questions from the Rs below:

Method Elaboration (R1, R4-Q3): We were concise in describing our method (including training details) given the page limit, and therefore some sections were abbreviated. We will shorten the description of existing components (e.g., UNet) and provide a more detailed explanation of the proposed methods (e.g., self-guidance). If space permits after adjustment, we will move the tables and figures to the main text while keeping the additional explanatory sections in the supplementary materials.

Generalizability (R3, R4-Q4): We focused on automated X-ray image analysis in this study. To this end, we used the QaTa dataset as it is the only X-ray segmentation dataset with reports. This dataset is widely used as a benchmark for language-guided segmentation tasks (e.g., LanGuideSeg [15] and LViT [14] in the main text). The QaTa dataset provides a substantial, diverse collection that addresses the typical concerns of overfitting associated with smaller datasets and enhances clinical applicability with its wide range of thoracic disease representations (it is not limited to COVID-19 but covers 14 diseases). Future studies will extend our approach to other modalities, such as MosMedData (CT scans). Our model can be generalized to other datasets because the proposed attention mechanism and self-guidance framework (Section 2.3) can autonomously identify and prioritize crucial information from reports (location information in QaTa) regardless of the dataset. We will add a statement regarding this as a limitation and future work.

Method justification/clarification (R4-Q1,2): We used a U-shaped network, which has been demonstrated to be effective in medical image segmentation. Transformers were selected as they can be integrated with cross-attention modules, the most widely used mechanism for multi-modal fusion. The alignment between image and text is not our primary focus; we employed the most effective multi-modal fusion technique, proposed in LanGuideSeg. We will revise our text to emphasize our method justification.

Human Evaluation (R4-Q5): We agree that human evaluation offers different perspectives on the quality of generated reports. In this study, given that our reports are semi-structured, we adopted well-established metrics [1, 2] for assessing the quality of the generated text; these reflect the accuracy of the generated reports and provide a more reproducible and objective evaluation. [1] Vinyals, O. et al., “Show and tell: A neural image caption generator”, CVPR 2015. [2] Chen, Z. et al., “Generating Radiology Reports via Memory-driven Transformer”, EMNLP 2020.

Computational Cost (R4-Q6): The network without self-guidance has 153M parameters, and the self-guidance component has 33M (17.74% of the combined total). Despite this, adding the text generation component does not increase inference time, as it runs in parallel with the UNet’s encoding phase. This ensures the generation is completed before the UNet needs the generated text for decoding guidance.
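
One possible way to realize this overlap (purely illustrative; the rebuttal does not specify the scheduling mechanism, and all function names below are hypothetical) is to launch the report generator on a separate CUDA stream while the deeper encoder stages continue on the default stream, synchronizing just before decoding:

# Hypothetical sketch of overlapping report generation with the encoding phase,
# so that text guidance is ready before decoding without adding latency.
import torch

def text_free_inference(encoder_stages, report_generator, text_encoder, decoder, image):
    gen_stream = torch.cuda.Stream()
    x = encoder_stages[0](image)                            # shallow features on the default stream
    gen_stream.wait_stream(torch.cuda.current_stream())     # make shallow features visible to the side stream
    with torch.cuda.stream(gen_stream):
        report_emb = text_encoder(report_generator(x))      # self-guidance runs concurrently
    for stage in encoder_stages[1:]:
        x = stage(x)                                        # deep encoding continues in parallel
    torch.cuda.current_stream().wait_stream(gen_stream)     # text guidance ready before decoding
    return decoder(x, report_emb)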

Comparison to LanGuideSeg without text (R4-Q7): The LanGuideSeg framework necessitates textual input during the inference phase. We have already experimented with replacing the requisite text with noise representations or randomly selected textual data, but this resulted in a substantial performance degradation (~18%).




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors introduce a language-guided model for chest X-ray segmentation. This model uses multi-modal data during training and facilitates text-free, single-modality inference by employing generated texts for self-guidance. The proposed method outperforms traditional single-modality approaches and achieves results comparable to state-of-the-art multi-modality methods in binary segmentation tasks. Although the study’s motivation and overall writing are clear, the methodology section is not well-explained. Furthermore, the method’s validation is restricted to one dataset, making its generalizability uncertain.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper presents a segmentation network that can be trained on multimodal data, including segmentation labels and text-based descriptions. Upon careful review of the feedback and rebuttal, it appears that most of the concerns have been addressed, despite no change in the reviewers’ ratings. Overall, the technology is deemed acceptable. Therefore, I will vote for its acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



