Abstract

Lesion segmentation in chest images is crucial for AI-assisted diagnosis of pulmonary conditions. Multi-modal approaches, which combine images with text descriptions, have achieved notable performance in medical image segmentation. However, existing methods mainly use the text information to improve the decoder, while the encoder remains unexplored. In this study, we introduce a Multi-Modal Input UNet model, MMI-UNet, which utilizes visual-textual matching (VTM) features to segment infected areas in chest images. These VTM features, which contain the visual features relevant to the text description, are created by combining self-attention and cross-attention mechanisms in a novel Image-Text Matching (ITM) module integrated into the encoder. Extensive evaluations on the QaTa-Cov19 and MosMedData+ datasets demonstrate MMI-UNet's state-of-the-art performance over both uni-modal and previous multi-modal methods. Furthermore, our method outperforms the best uni-modal method even when trained on only 15% of the training data. These findings highlight the interpretability of our vision-language model, advancing the explainable diagnosis of pulmonary diseases and reducing the labeling cost of segmentation tasks in the medical field. The source code is made publicly available on GitHub.
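The core mechanism (self-attention over the visual tokens, followed by cross-attention from the refined visual tokens to the text embedding) can be illustrated with a minimal PyTorch sketch. Dimensions, normalization, and names here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code:

```python
import torch
import torch.nn as nn

class VisualTextualMatching(nn.Module):
    """Illustrative VTM block: self-attention refines the visual tokens,
    then cross-attention keeps the content relevant to the text."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N_patches, dim) flattened features from one encoder stage
        # txt: (B, N_tokens, dim) projected text embeddings (e.g., from CXR-BERT)
        v, _ = self.self_attn(vis, vis, vis)    # refine the visual context
        vis = self.norm1(vis + v)
        m, _ = self.cross_attn(vis, txt, txt)   # visual queries attend to text
        return self.norm2(vis + m)              # "VTM" features for the decoder

vtm = VisualTextualMatching()
out = vtm(torch.randn(2, 196, 256), torch.randn(2, 24, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```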

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2773_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/nguyenpbui/MMI-UNet

Link to the Dataset(s)

https://www.kaggle.com/datasets/aysendegerli/qatacov19-dataset

https://www.kaggle.com/datasets/maedemaftouni/covid19-ct-scan-lesion-segmentation-dataset

BibTex

@InProceedings{Bui_VisualTextual_MICCAI2024,
        author = { Bui, Phuoc-Nguyen and Le, Duc-Tai and Choo, Hyunseung},
        title = { { Visual-Textual Matching Attention for Lesion Segmentation in Chest Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces Multi-Modal Input UNet (MMI-UNet), a model designed to take both images and text as inputs for segmenting lesions in lung X-rays/CT images. The authors report that this approach can outperform state-of-the-art (SOTA) methods with only 15% of the data, demonstrating the effectiveness of their proposed methodology.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The multi-modal approach, which incorporates visual-textual matching (VTM) features, is novel.
    • This method outperforms a large number of competing approaches.
    • The authors provide enough technical detail to make their methods easy to understand and follow.
    • They conduct interesting and extensive experiments to evaluate the impact of varying training data sizes on the model’s segmentation performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The report text seems highly structured, resembling a concatenation of different structured labels. This raises two concerns: (i) At the testing stage, using such structured text as input might lead to label leakage, allowing the model to infer information it shouldn’t, thereby compromising its accuracy and fairness. (ii) During the training stage, this level of structured text could lead to degenerate representations in text embeddings. This could risk collapsing the representation learning and reduce the overall performance of the segmentation tasks.

    • In Table 2, the authors conduct an ablation study on the GuideDecoder module, but there’s limited information about this component in the paper. I suggest that they provide more details to better explain its role and significance.

    • The paper needs ablation studies to explore the performance of the model with and without additional text input. More studies in this area would be helpful.

    • The authors claim that “these findings highlight the interpretability of our vision-language model,” but there are no visualizations of image/text feature maps or attention maps to support this statement.

    • Intuitively, the gap between image and text features seems significant. What is the outcome of aligning these two types of features? The authors should discuss this aspect further to clarify the rationale behind their approach and underscore the motivation for their work.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provide enough technical detail to make their methods easy to understand and follow.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please refer to the comments regarding the weaknesses.
    • The Dice coefficient is somewhat redundant when used alongside the Jaccard coefficient to evaluate segmentation results, since the two are monotonically related (J = D / (2 - D)). It would be better to include distance-based metrics like HD95 for a more comprehensive evaluation (a brief sketch of this metric follows the list).
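For reference, HD95 is the 95th percentile of the symmetric surface distances between the predicted and ground-truth masks. A minimal NumPy/SciPy sketch (illustrative only, not from the paper; assumes non-empty binary masks):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def _surface_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Distances from the surface voxels of mask a to the surface of mask b."""
    a_border = a & ~binary_erosion(a)              # boundary voxels of a
    b_border = b & ~binary_erosion(b)
    dist_to_b = distance_transform_edt(~b_border)  # distance to nearest b-boundary voxel
    return dist_to_b[a_border]

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    d = np.concatenate([_surface_distances(pred, gt),
                        _surface_distances(gt, pred)])
    return float(np.percentile(d, 95))
```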
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty of the multi-modal method and convincing results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces the MMI-UNet model, which leverages visual-textual matching features for lesion segmentation in chest images. By integrating self-attention and cross-attention mechanisms in the Image-Text Matching (ITM) module, the model captures relevant visual elements based on accompanying text descriptions. The study demonstrates state-of-the-art performance on lesion segmentation tasks, surpassing both uni-modal and previous multi-modal methods, even with limited training data. This approach not only advances the explainable diagnosis of pulmonary diseases but also shows potential in reducing the need for extensive and costly data labeling in medical image segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The paper introduces the MMI-UNet model, which integrates visual-textual matching features through the ITM module. This novel approach captures relevant visual elements based on accompanying text descriptions, enhancing the interpretability and performance of the segmentation model.
    • The study demonstrates state-of-the-art performance on lesion segmentation tasks, outperforming both uni-modal and previous multi-modal methods. This highlights the effectiveness of the proposed visual-textual matching approach in improving segmentation accuracy and efficiency.
    • By leveraging textual data from medical records alongside images, the model shows potential in reducing the need for extensive and expensive data labeling. This aspect is particularly strong as it addresses a significant challenge in the medical field and can lead to cost-effective and efficient segmentation solutions.
    • The interpretability of the vision-language model in diagnosing pulmonary diseases is a notable strength. The ability to explain and understand the model’s decisions can enhance clinical feasibility and support healthcare professionals in making accurate diagnoses based on the segmentation results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • While the paper demonstrates state-of-the-art performance, a more comprehensive comparison to existing methods in terms of specific metrics, datasets, and experimental setups could provide a clearer understanding of the model’s strengths and limitations.
    • The paper lacks detailed information on clinical validation or real-world application of the proposed MMI-UNet model. Including insights from medical experts or conducting clinical studies to evaluate the model’s performance in a clinical setting could strengthen the practical relevance of the study.
    • The generalization of the proposed visual-textual matching approach to lesion segmentation tasks beyond chest images is not extensively discussed. Addressing the potential applicability of the model to other medical imaging modalities or datasets would enhance the study’s impact and relevance in broader healthcare contexts.
    • The reliance on textual data from medical records may introduce biases or inaccuracies that could impact the model’s performance. Addressing the quality and diversity of textual data sources, as well as potential biases in annotations, could improve the robustness and reliability of the segmentation results.
    • The paper does not extensively discuss the scalability and computational efficiency of the proposed MMI-UNet model, especially in large-scale or real-time clinical settings. Addressing the model’s scalability and computational requirements could enhance its practical utility in clinical applications.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • MIC (Methodological Innovation and Contribution): Consider providing a more detailed comparison with existing methods, including specific metrics, datasets, and experimental setups, to highlight the unique contributions and limitations of the proposed methodology.
    • CAI (Clinical Applicability and Impact): To enhance clinical applicability, consider incorporating insights from medical experts, conducting clinical validation studies, and discussing the generalizability of the model to other medical imaging modalities beyond chest images.
    • Clinical Translation of Methodology: To facilitate clinical translation, consider addressing scalability and efficiency concerns, potential biases in textual data sources, and conducting real-world validation studies to evaluate the model’s performance in clinical practice.
    • Health Equity: To further promote health equity, consider discussing the implications of the proposed methodology for underserved populations, addressing potential biases in data sources, and ensuring the model’s reliability and fairness across diverse patient demographics.

    Incorporating these suggestions can strengthen the paper’s contributions, enhance its clinical relevance and impact, and promote the advancement of health equity through innovative and ethically sound research practices.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the strengths and the weakness of the paper, I recommend the “Weak Accept” Decision. Addressing the identified areas for improvement through revisions and responses to reviewer feedback would strengthen the paper further and increase its overall impact and contribution to the field of medical image segmentation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors claim that recent research has primarily focused on improving decoders, while encoders have not received enough attention; thus, there is significant room for improvement in AI-based models. Therefore, they propose MMI-UNet, a multi-modal input model based on the UNet architecture. MMI-UNet leverages visual and textual data to generate features that enhance segmentation in chest images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main contribution is the image-text matching module, which combines self-attention and cross-attention mechanisms. The matched features are then passed to four decoders to generate the segmentation. Although the individual attention modules are well known in the community, the idea of combining them to produce a more robust segmentation and reduce the need for large annotated datasets is interesting and innovative. In general, the manuscript is well written and easy to follow, and the results appear technically sound.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no discussion about the impact of each individual attention module, and the ablation and experimental sections are shallow.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • Why is the image input H×W×3? Are there any benefits to using three bands?
    • On page 4, Section 3.1, please add the corresponding acronyms: Dice similarity score (DSC) and Jaccard index (JI).
    • In Tables 1, 2, and 3, be consistent with the terms: “IoU” -> “JI”.
    • Apparently, the proposal tends to generate false negatives, as seen in Fig. 2 and Fig. 3. Can you elaborate on these results?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a contribution that matches text and visual data to enhance segmentation in chest imaging. The manuscript is clear and concise.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers for their valuable time and constructive comments. The reviewers acknowledged the novel Image-Text Matching (ITM) module in the proposed Multi-Modal Input UNet (MMI-UNet), as well as the thorough experimental validation and the clear writing and organization of the paper. We deeply appreciate the reviewers’ insightful suggestions, which have significantly contributed to enhancing the quality of our work. Our detailed responses to the reviewers’ comments are as follows:

Reviewer 1:

  • The ConvNeXt-Tiny model is pretrained on images in an H×W×3 format. Since we utilize this pretrained model to extract visual features from the input images, the input size must also adhere to the H×W×3 format (see the sketch after this list).
  • We will update the manuscript by (1) incorporating additional evaluation metrics, such as HD (Hausdorff distance) and accuracy, (2) providing a more detailed discussion of the qualitative results presented in the figures, and (3) correcting typographical errors (e.g., changing “IoU” to “JI” for the Jaccard index) and adding the missing acronyms.
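In practice, backbones pretrained on ImageNet, such as ConvNeXt-Tiny, expect three input channels, so single-channel chest images are commonly replicated across the channel axis. A minimal PyTorch sketch of this common step; whether MMI-UNet preprocesses its inputs exactly this way is an assumption:

```python
import torch
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

# Hedged sketch: channel replication for a grayscale chest image before an
# ImageNet-pretrained backbone. Not necessarily the authors' exact pipeline.
backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1)
gray = torch.randn(1, 1, 224, 224)    # B x 1 x H x W grayscale input
rgb = gray.expand(-1, 3, -1, -1)      # replicate channel -> B x 3 x H x W
with torch.no_grad():
    feats = backbone.features(rgb)    # B x 768 x H/32 x W/32 feature map
print(feats.shape)                    # torch.Size([1, 768, 7, 7])
```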
Reviewer 2:

  • We would like to thank the reviewer for the helpful comments regarding label leakage in the testing phase and degenerate representations in the training phase. In this study, our goal is to enhance segmentation performance by leveraging text descriptions to learn the similarity between visual and textual features. We will take these suggestions into consideration in our future work, particularly for applications in real-world clinical settings.
  • Originally, GuideDecoder takes the corresponding visual features from the encoder and the textual features generated by CXR-BERT. We strictly follow this implementation and replace the encoder features with the enhanced visual features from the Image-Text Matching (ITM) module; the textual features are obtained in the same way, as we also use CXR-BERT to embed the text descriptions (see the schematic sketch after this list). The code will be made publicly available after it is cleaned up. Additionally, we will include more metrics for a more comprehensive evaluation in the final version of the manuscript.
  • We appreciate the reviewer’s suggestion regarding (1) the performance of the model with and without additional text input and (2) the heatmap-based interpretability of the proposed MMI-UNet. We recognize the importance of these studies in providing a deeper understanding of our model’s performance. We plan to conduct these ablation studies in future work to comprehensively assess the contribution of textual data to the segmentation task.
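The wiring described above can be sketched schematically. The module names and the GuideDecoder signature below are illustrative assumptions, not the actual MMI-UNet code:

```python
import torch.nn as nn

class MMIUNetSketch(nn.Module):
    """Schematic wiring only: ITM-enhanced features replace the plain
    encoder skips before reaching the text-guided decoders."""
    def __init__(self, itm_blocks, guide_decoders, seg_head, text_encoder):
        super().__init__()
        self.itm_blocks = nn.ModuleList(itm_blocks)        # one per encoder stage
        self.guide_decoders = nn.ModuleList(guide_decoders)
        self.seg_head = seg_head
        self.text_encoder = text_encoder                   # e.g., frozen CXR-BERT

    def forward(self, encoder_feats, text_tokens):
        txt = self.text_encoder(text_tokens)               # (B, N_tokens, d)
        # enhanced visual features replace the plain encoder skips
        skips = [itm(f, txt) for itm, f in zip(self.itm_blocks, encoder_feats)]
        x = skips[-1]                                      # deepest stage
        for dec, skip in zip(self.guide_decoders, reversed(skips[:-1])):
            x = dec(x, skip, txt)                          # text-guided upsampling
        return self.seg_head(x)                            # lesion-mask logits
```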
Reviewer 3:

  • We will add the comparison of more metrics and computational requirements for a more comprehensive evaluation in the final version of this manuscript.
  • Regarding the generalization of the proposed method: to the best of our knowledge, QaTa-COV19 and MosMedData+ are the only two datasets with both medical images and corresponding text descriptions of lesions. We hope that our work will encourage more researchers to invest in the construction of medical multi-modal datasets, which would benefit the development of large models in the field of medical image analysis.
  • We would like to thank the reviewer for the suggestions about clinical applications and robustness. Our ultimate goal is to build a model capable of both performing high-quality segmentation and generating image-related reports. The current work is a step toward this goal and remains significant for multi-modal deep learning. In future work, we hope to collaborate with medical experts to develop, validate, and integrate our method into the clinical workflow.




Meta-Review

Meta-review not available, early accepted paper.


