Abstract

Aside from offering state-of-the-art performance in medical image generation, denoising diffusion probabilistic models (DPM) can also serve as a representation learner to capture semantic information and potentially be used as an image representation for downstream tasks, e.g., segmentation. However, these latent semantic representations rely heavily on labor-intensive pixel-level annotations as supervision, limiting the usability of DPM in medical image segmentation. To address this limitation, we propose an enhanced diffusion segmentation model, called TextDiff, that improves semantic representation through inexpensive medical text annotations, thereby explicitly establishing semantic representation and language correspondence for diffusion models. Concretely, TextDiff extracts intermediate activations of the Markov step of the reverse diffusion process in a pretrained diffusion model on large-scale natural images and learns additional expert knowledge by combining them with complementary and readily available diagnostic text information. TextDiff freezes the dual-branch multi-modal structure and mines the latent alignment of semantic features in diffusion models with diagnostic descriptions by only training the cross-attention mechanism and pixel classifier, making it possible to enhance semantic representation with inexpensive text. Extensive experiments on public QaTa-COVID19 and MoNuSeg datasets show that our TextDiff is significantly superior to the state-of-the-art multi-modal segmentation methods with only a few training samples. Our code and models will be publicly available.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0703_paper.pdf

SharedIt Link: https://rdcu.be/dZxdr

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_24

Supplementary Material: N/A

Link to the Code Repository

https://github.com/chunmeifeng/TextDiff

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Fen_Enhancing_MICCAI2024,
        author = { Feng, Chun-Mei},
        title = { { Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {253--262}
}

Reviews

Review #1

Please describe the contribution of the paper
1. An enhanced label-efficient medical image segmentation method was proposed, termed TextDiff, to reduce the dependence of the diffusion model on pixel-level annotations by learning additional expert knowledge through medical text annotations.
2. Interpreting textual diagnostic annotations and intermediate activations of the Markov step of the reverse diffusion process in DPM, thereby improving visual-semantic representations in diffusion models.
3. Achieving better results than various state-of-the-art multi-modal segmentation methods on COVID and pathological images with very few training samples.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The authors propose a method that merges text and vision information through a cross-modal attention module for better segmentation performance.
2. BioBERT was used to extract text features, which were then fed into the model as supervision information. A diffusion model was also used to extract vision features. These two sets of features were combined by a cross-modal attention module.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The text data is not described in detail.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

Could you release a test demo or code to facilitate the reproducibility before the rebuttal deadline?
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
Introduction No comments

Methodology No comments

Experiments
1. In Table 1, the IoU difference between GLoRIA and UNet should be 25.58 instead of 24.58.
2. Some text examples and related material should be shown in the paper; the details of the text dataset are not clear.
3. The result of LViT in Table 1 is extremely low, which seems abnormal. Is there a problem in the LViT implementation?
4. For UNet, TransUNet, SwinUNet, GLoRIA, and LViT, how did you implement them? Did you use the original code or implement them yourself?
5. The image encoder and text encoder are frozen during training, right?
6. After extracting the image features and text features, did you normalize them? What operations were applied to these features?
7. How did you implement the pixel classifier?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

There are some unclear points in the model implementation and experiment.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The paper proposes a new technique called TextDiff to overcome the limited availability of manually labeled pixel-level data by learning expert knowledge through medical text annotations. The proposed method uses pretrained text and image encoders to produce high-level semantic information; their outputs are fed to a cross-modal attention block followed by a pixel classifier to generate the segmentation result. Experiments comparing against state-of-the-art multi-modal segmentation methods as well as classical segmentation methods are presented.
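The pipeline summarized above (frozen encoders, a cross-modal attention block, then a pixel classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the random matrices stand in for frozen encoder outputs, and the dimensions, the residual fusion, the linear classifier, and all function names are assumptions made only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random matrix used as a placeholder for features or weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cross_attention(pixels, tokens, w_q, w_k, w_v):
    """Pixels act as queries; text tokens supply keys and values."""
    q = matmul(pixels, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Each pixel gets a probability distribution over the text tokens.
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    # Residual fusion (an assumption): add the text context to each pixel feature.
    fused = [[p + a for p, a in zip(p_row, a_row)]
             for p_row, a_row in zip(pixels, attended)]
    return fused, weights


def classify_pixels(fused, w_cls):
    """Hypothetical linear pixel classifier: argmax over per-class logits."""
    logits = matmul(fused, w_cls)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


rng = random.Random(0)
n_pix, n_tok, d_img, d_txt, d_k, n_cls = 6, 4, 8, 5, 3, 2

pixels = rand_matrix(n_pix, d_img, rng)   # placeholder for diffusion features
tokens = rand_matrix(n_tok, d_txt, rng)   # placeholder for text-encoder features

# In the setup the review describes, only these parts would be trained.
w_q = rand_matrix(d_img, d_k, rng)
w_k = rand_matrix(d_txt, d_k, rng)
w_v = rand_matrix(d_txt, d_img, rng)
w_cls = rand_matrix(d_img, n_cls, rng)

fused, weights = cross_attention(pixels, tokens, w_q, w_k, w_v)
labels = classify_pixels(fused, w_cls)
print(labels)
```

Because the encoder outputs are treated as fixed inputs here, the only parameters are the attention projections and the classifier weights, which is consistent with the reviewer's remark about the small number of trainable parameters.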
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed technique is novel in terms of leveraging text annotations to guide the segmentation process by extracting intermediate activations of the Markov step of the reverse diffusion process in a pretrained diffusion model.
- It requires fewer parameters than the discussed SOTA approaches thanks to using pretrained image and text encoders.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Throughout the paper, the proposed method is described as “label-efficient” and text annotations are referred to as “inexpensive”. Text annotations might not be as labor-intensive as pixel-level manual annotations, but they still require an expert reader’s time and effort. Also, in real-world applications, not every modality has text annotations accompanying the images, which could limit the applicability of the method in clinical settings.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
- Please provide representative test cases where text guidance in fact makes a difference in the segmentation result by correctly including a challenging lesion in the result etc. compared to a classical medical segmentation method.
- Why was the pretrained diffusion model in reference [7] chosen for the image encoder? Did you experiment with any other pretrained models on natural images?
- Second paragraph of Section 2.1: What is meant by “Since these texts are generated simultaneously with the diagnostic images…”? This way it sounds like annotations are also generated by the method.
- Please provide comparison with Zhong et al. (MICCAI 2023 DOI: 10.1007/978-3-031-43901-8_69), especially for the QaTa-COVID19 dataset.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method is novel by combining text-guidance and diffusion models. The approach sounds practical in terms of using pretrained models for text and image encoders. Promising results are presented.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The paper proposes a label-efficient segmentation method that enhances segmentation quality and reduces the number of training samples by utilizing text data through cross-attention.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Interesting framework combining imaging and text features. The method uses already pretrained backbones, which saves training cost. It requires only a few training samples to produce good results.
- Study on the segmentation performance investigating usability of different blocks and diffusion steps of the denoising diffusion model. It motivates the choice of the blocks and diffusion steps to select features from. It can be a good starting point for applying the method on other datasets.
- Clear results and good presentation of the paper.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I don’t see any major issues with this paper, rather minor comments:
- In Eq. (3), W_q (\hat t W_k)^T becomes W_q W_k^T \hat t^T. Is there any motivation for having two matrices W_q W_k^T when the product is in fact a single matrix?
- Could you please also highlight that the baseline methods were trained on the same limited number of training samples?
- Could you please add the public SOTA results of the baseline models trained on the entirety of the available data to the results table? This could be included in the discussion, e.g., that with just 1.5% of the training samples the method achieved 90% of SOTA performance in a few minutes.
- Have you tested the proposed method on the full training dataset? If not, do you expect it to scale well? Could you please include this in the discussion?
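The algebraic point in the first comment can be checked numerically. The sketch below assumes the query is a visual feature matrix multiplied by W_q (the feature matrices `v` and `t`, all sizes, and the helper names are hypothetical, not taken from the paper) and confirms that the factored attention scores equal the scores computed with the single merged matrix W_q W_k^T.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def rand_matrix(rows, cols, rng):
    """Random matrix used as a placeholder for features or weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_pix, n_tok, d, d_k = 4, 3, 5, 2

v = rand_matrix(n_pix, d, rng)   # hypothetical visual features (queries)
t = rand_matrix(n_tok, d, rng)   # hypothetical text features (keys)
w_q = rand_matrix(d, d_k, rng)
w_k = rand_matrix(d, d_k, rng)

# Factored form: (v W_q)(t W_k)^T
scores_factored = matmul(matmul(v, w_q), transpose(matmul(t, w_k)))

# Merged form: v (W_q W_k^T) t^T, with one d x d matrix
w_merged = matmul(w_q, transpose(w_k))
scores_merged = matmul(matmul(v, w_merged), transpose(t))

max_diff = max(
    abs(x - y)
    for row_f, row_m in zip(scores_factored, scores_merged)
    for x, y in zip(row_f, row_m)
)
print(max_diff)
```

One general observation, independent of the paper: when the inner dimension d_k is smaller than d, the merged matrix has rank at most d_k, so the two-matrix form acts as a low-rank parameterization with 2·d·d_k parameters instead of d·d. Whether that motivates the authors' choice is not stated in the source.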
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

Please refer to the weaknesses section
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Accept — should be accepted, independent of rebuttal (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This is a well-written paper, though its presentation could be improved a bit to better communicate the strengths and weaknesses of the method. Overall, the method might not be particularly novel, but the experiments and results are solid, and it has a number of applications.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Author Feedback

N/A

Meta-Review

Meta-review not available, early accepted paper.


Enhancing Label-efficient Medical Image Segmentation with Text-guided Diffusion Models

Author(s):