Abstract

Conventional medical image segmentation methods rely solely on images, which implies a need for abundant high-quality labeled images. Text-guided segmentation methods have been widely regarded as a way to break this performance bottleneck. In this study, we introduce a bidirectional Medical Adaptor (MAdapter) in which visual and linguistic features extracted from pre-trained dual encoders undergo interactive fusion. Additionally, a specialized decoder is designed to further align the fused representation with the global textual representation. We also extend the endoscopic polyp datasets with clinically oriented text annotations, following the guidance of medical professionals. Extensive experiments conducted on both the extended endoscopic polyp dataset and additional lung infection datasets demonstrate the superiority of our method.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2097_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2097_supp.pdf

Link to the Code Repository

https://github.com/XShadow22/MAdapter

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_MAdapter_MICCAI2024,
        author = { Zhang, Xu and Ni, Bo and Yang, Yang and Zhang, Lefei},
        title = { { MAdapter: A Better Interaction between Image and Language for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • This paper utilizes text prompts to enhance two types of medical image segmentation tasks.
    • This paper introduces the MAdapter to facilitate bidirectional interactive fusion, and a specialized decoder to facilitate alignment of multi-level visual and linguistic features.
    • This paper evaluates the performance on public medical image-text benchmark datasets and demonstrates the superiority of the proposed method.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper presents a clear motivation for using the bidirectional interaction of visual and language features to mitigate noise interference caused by language features.
    2. A lightweight decoder accurately extracts visual and language features, enhancing language guidance, and can be adapted to any pre-trained model.
    3. The paper conducts comprehensive experiments.
    4. Text-assisted segmentation is a direction worth exploring, and this paper advances that direction, which is valuable.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The parameter definitions in the paper are confusing, with inconsistencies between the figures and the main text, making it difficult to understand the content.

    2. The adapter is an efficient parameter fine-tuning method, and the paper should specify the exact number of parameters required for training the adapter.

    3. The paper contains unclear descriptions in the methods section, such as the size of F_v, the processing of F_v in the MHCA module (whether it is patch-wise like ViT or pixel-wise like Non-local), the number of patches for partitioning an image, and how the reshaping process ensures that the sizes of F_v and F_l are equal.

    4. The paper does not clearly explain how F_g is obtained, nor does it specify the size of F_g. Besides, despite demonstrating the effectiveness of the decoder in Table 3, it fails to convincingly establish the necessity of F_g for the decoder.

    5. The choice of UNet and UNet++ as base architectures for comparison is outdated, as these are older algorithms. The paper lacks comparison with more recent algorithms (such as PraNet, SANet, Polyp-PVT, Polyper), which limits its ability to demonstrate the advantages of the model based on joint image and text features over purely image-based models.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. I am concerned about whether the bidirectional interaction of visual and language features might introduce visual-related noise to the model.

    2. Figure 1 is confusing. Visual and language features should be differentiated with different colors. The arrows from self-attention to cross-attention are confusing; please label them as Q, K, and V.

    3. What are the differences between fg, ft, and fl? If the naming is not a typo, please clarify their distinctions.

    4. After the MAdapter, why are ft and fl not involved in further computations, and decoding is performed using fg?

    5. Why is there only a global textual feature fg and no corresponding global visual feature?

    6. Provide an overall evaluation (combining the five testing sets) in Table 2.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is theoretically feasible. However, the content of the paper needs improvement and clarification to ensure the logical flow of the method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose a bidirectional Medical Adaptor (MAdapter) in which visual and linguistic features extracted from pre-trained dual encoders undergo interactive fusion, and design a specialized decoder to align the fused representation with the global textual representation. Extensive experiments on the extended endoscopic polyp dataset and lung infection datasets demonstrate the superiority of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article has a clear and reasonable structure, with natural transitions between paragraphs and coherent logic. Moreover, the language is clear, concise, and highly readable. The figures and charts in the article are clear, which can effectively illustrate the ideas presented in the article, as well as comparisons between results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the authors incorporate linguistic features into visual features through the proposed MAdapter, some previous research appears to have already done this, so the novelty seems insufficient.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. The authors used pre-trained visual and language encoders; I suggest discussing possible changes in model performance under different encoders or pre-training strategies.
    2. Each formula in the article should be numbered individually, rather than assigning one number per process.
    3. Punctuation marks are used incorrectly, and some are missing, for example at the end of the caption of Figure 1.
    4. The table titles are not correctly capitalized, and the formatting of symbols is irregular in many parts of the article; for example, some brackets lack surrounding spaces.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty appears to be insufficient.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper extends public endoscopic polyp datasets with detailed and clinical-oriented text annotations. It also proposes a novel way to facilitate fusion of multi-level visual and linguistic features to improve medical image segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Major contribution to the research community through creation of text annotations for endoscopic polyp datasets.
    2. The proposed feature fusion technique improves segmentation across different data modalities.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No obvious weaknesses

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It is not apparent from the paper to me why the MAdapter is bidirectional. The term ‘bidirectional’ already refers to using context from both sides of the sequence (BERT). It’s confusing when you use a sequence of MAdapters and call it bidirectional. Consider making a distinction from it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The creation of text annotations for the polyp dataset would enable future research that explores text and image interactions. The paper also proposes a novel feature fusion scheme that yields superior results.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for the thoughtful feedback. We are encouraged by their recognition of the novelty and feasibility of our proposed method (R1, R3). We will incorporate all of the valuable suggestions in the camera-ready version.

R1 We greatly thank you for your appreciation. About ‘bidirectional’, our proposed MAdapter highlights the bidirectional interaction between textual and image features, distinguishing it from previous unidirectional text-guided medical image segmentation methods. We will provide a clearer statement in the next version.

R3 Q1: Method logical flow. We will revise the method section to keep it clear and logical. F_v & F_l: the vision and language features output by the encoders. The size of F_v is H_i×W_i×C_i, where i denotes a given stage. f_v & f_l: features after interaction. We apologize for the typo f_t, which should be corrected to f_l. Before interaction, we perform a resize operation so that f_v (stage i-1) and F_v (stage i) have the same size. Suppose this common size is H1×W1×C1; then f_l is projected to L'×C1 (L' is the number of tokens after projection). This consistency in C1 is necessary for MHCA. The image features, extracted from a CNN-based encoder, are inherently pixel-wise and do not involve patch partitioning. In MHCA, the processing of features consists of matrix multiplication and per-head attention score calculation.
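
To make the shape bookkeeping above concrete, here is a minimal PyTorch sketch of one interaction stage. The module name MAdapterStage, the additive merge of the resized previous-stage feature with F_v, and the linear projection plus adaptive pooling used to obtain the L'×C1 text tokens are illustrative assumptions, not the paper's exact design; only the tensor shapes follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MAdapterStage(nn.Module):
    # Hypothetical sketch of one bidirectional interaction stage;
    # only the shapes (H1 x W1 x C1 vision, L' x C1 text) follow the rebuttal.
    def __init__(self, c1: int, c_text: int, l_proj: int, heads: int = 8):
        super().__init__()
        self.proj_l = nn.Linear(c_text, c1)         # project text channels to C1
        self.pool_l = nn.AdaptiveAvgPool1d(l_proj)  # resample text tokens to L'
        # two cross-attentions give the bidirectional interaction
        self.v_from_l = nn.MultiheadAttention(c1, heads, batch_first=True)
        self.l_from_v = nn.MultiheadAttention(c1, heads, batch_first=True)

    def forward(self, F_v, f_v_prev, F_l):
        # F_v: (B, C1, H1, W1) stage-i vision feature
        # f_v_prev: (B, C1, h, w) fused feature from stage i-1
        # F_l: (B, L, C_text) language tokens
        B, C1, H1, W1 = F_v.shape
        # resize the previous-stage feature to match F_v (the "resize operation")
        f_v_prev = F.interpolate(f_v_prev, size=(H1, W1), mode="bilinear",
                                 align_corners=False)
        v = (F_v + f_v_prev).flatten(2).transpose(1, 2)     # (B, H1*W1, C1), pixel-wise tokens
        l = self.proj_l(F_l)                                # (B, L, C1)
        l = self.pool_l(l.transpose(1, 2)).transpose(1, 2)  # (B, L', C1)
        # MHCA in both directions: text-guided vision, vision-refined text
        f_v, _ = self.v_from_l(query=v, key=l, value=l)
        f_l, _ = self.l_from_v(query=l, key=v, value=v)
        return f_v.transpose(1, 2).reshape(B, C1, H1, W1), f_l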

Q2: Questions about f_g. f_g is created by pooling the multi-stage text sequence into a vector of size C'. f_g is a global representation, providing the sentence-level information that f_l lacks. In our ablation study, we do not use f_g to achieve global alignment. Since this is the only difference between our decoder and a normal segmentation head, the effectiveness of the decoder inherently demonstrates the necessity of f_g. Introducing global visual features does not enhance the recognition of image structural details.
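
As a rough illustration of the global alignment described above, the sketch below pools a text sequence into a sentence-level vector f_g and uses it to modulate the fused feature map entering the decoder. The mean pooling, the linear projection, and the sigmoid gating (class name GlobalTextAlign) are assumptions made for illustration, not the paper's exact decoder design.

import torch
import torch.nn as nn

class GlobalTextAlign(nn.Module):
    # Hypothetical decoder-side alignment with a global text vector f_g.
    def __init__(self, c_text: int, c_out: int):
        super().__init__()
        self.proj = nn.Linear(c_text, c_out)  # map pooled text to a vector of size C'

    def forward(self, text_tokens, fused_map):
        # text_tokens: (B, L, C_text) multi-stage text sequence
        # fused_map:   (B, C', H, W) fused representation entering the decoder
        f_g = self.proj(text_tokens.mean(dim=1))            # (B, C') sentence-level vector
        sim = torch.einsum("bchw,bc->bhw", fused_map, f_g)  # per-pixel similarity to f_g
        return fused_map * sim.unsqueeze(1).sigmoid()       # text-modulated features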

Q3: Comparison methods. UNet and UNet++ are widely used for single-modal medical image segmentation in recent MICCAI papers. Following prior works, we take them as our baselines. The mentioned methods are tailored for polyp segmentation and may not be suitable for the lung infection segmentation task. Moreover, our method significantly outperforms PraNet and SANet, and surpasses Polyp-PVT and Polyper on some datasets, demonstrating the versatility and superiority of our multi-modal method.

Q4: Visual-related noise. Image noise is fine-grained and operates at the pixel level. Typically, low-level visual noise has negligible impact on higher-level text features. In practice, we should focus more on text noise, which derives from the subjectivity of doctors and would introduce substantial bias into segmentation.

Q5: Other questions. Our trainable parameter count is 51.9M. In addition, we will report the average evaluation metrics across the five datasets, and revise the figures to ensure their consistency with the text.

R4 We will carefully revise the writing issues. Q1: Different encoders or pre-training strategies. Our components can be integrated with most pre-trained encoders. We have used other vision and language encoders (ConvNeXt-seg, ViT-B, PubMedBERT, etc.). Under different settings, our MAdapter improves segmentation performance compared to previous feature fusion methods. Our focus is feature interaction during fine-tuning rather than pre-training strategies.

Q2: Novelty. As acknowledged by R1 and R3, our novel feature fusion scheme and substantial annotation work are valuable. Here, we highlight our innovations: the MAdapter addresses the impact of textual noise by facilitating bidirectional interaction between multi-level features; a flexible decoder refines global alignment by introducing sentence-level features; and we introduce a novel annotation scheme and apply it to the polyp datasets. These contributions significantly expand research in this field.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers evaluated the novelty of the method that uses multimodal data in image segmentation.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    accepts


