Abstract

Diffusion Probabilistic Models have recently attracted significant attention in the computer vision community due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no work has introduced diffusion models to advance the results of polyp segmentation in videos, which is frequently challenged by polyps’ high camouflage and redundant temporal cues. In this paper, we present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS. We incorporate multi-task supervision into diffusion models to promote the discrimination of diffusion models on pixel-by-pixel segmentation. This integrates the contextual high-level information achieved by the joint classification and detection tasks. To explore the temporal dependency, a Temporal Reasoning Module (TRM) is devised that reasons about and reconstructs the target frame from the previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS achieves state-of-the-art performance. Code is available at https://github.com/lydia-yllu/Diff-VPS.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1334_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1334_supp.pdf

Link to the Code Repository

https://github.com/lydia-yllu/Diff-VPS

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lu_DiffVPS_MICCAI2024,
        author = { Lu, Yingling and Yang, Yijun and Xing, Zhaohu and Wang, Qiong and Zhu, Lei},
        title = { { Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a diffusion-based network for the video polyp segmentation task that incorporates detection, classification, and segmentation tasks into diffusion models to improve pixel-by-pixel segmentation. The authors also evaluate it on a medical video dataset and use an ablation study to demonstrate the effectiveness of the proposed modules. The authors mention that they will release the code upon acceptance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The novel way to apply diffusion model on medical video data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Can the authors describe how the multi-scale temporal features and spatial features are integrated?

    • In the dataset part, I am a little confused about the definition of “seen” and “unseen”. Can the authors elaborate on it?
    • Do the authors have a validation dataset to avoid overfitting?
    • Currently, the input to the model is 5 frames. Is it possible to extend it to all frames of a video?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to box 6

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method, experiment and results

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work proposes a method for colorectal polyp segmentation in videos using the diffusion model, claiming to be the first to apply diffusion to such a task. Specifically, this approach utilizes the diffusion model to capture image information and constructs a multitask learning framework, leveraging classification and detection tasks to improve the model’s effectiveness in polyp segmentation. Additionally, a temporal reasoning module is introduced to enhance the model’s utilization of information from preceding and subsequent frames in videos. The performance of the proposed method is validated on public datasets and compared with various popular image and video-level object/polyp segmentation methods. The results demonstrate the superiority of this approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel formulation- This work is the first to consider using the diffusion model to improve the segmentation of colorectal polyps in videos. It leverages the diffusion model to improve the extraction of lesion features, thus enhancing the model’s ability to segment lesions. Additionally, the temporal reasoning module facilitates better utilization of information from preceding frames in videos, leading to improved segmentation performance. A particularly strong evaluation- The model was tested on two sub-datasets and divided into seen and unseen categories based on data distribution. This setup effectively verifies the model’s real-world performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Complexity- The addition of diffusion, multitasking, and a generative self-supervised auxiliary task to the segmentation model has led to a highly complex training process for the entire model. The experiments were conducted with a random grouping of training and testing data, but cross-validation is preferable to better avoid randomness.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    This work is rather complex, involving diffusion models, GANs, as well as classification, segmentation, and object detection. All of these aspects contribute to the complexity of model training, making it potentially challenging when the methods are reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Section 3.1, Implementation details part: “We trained our model for 15 epochs with a batch size of 16. A video clip of 5 frames with a patch size of 224×224 was fed into the network.” Can you provide some additional details about the relationship between batch and patch in this section? Additionally, it seems that the formula L_TRM only appears in equation (6). It would be helpful to provide a complete description or algorithmic description to better assist readers in understanding.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is innovative overall, as it utilizes the diffusion model to assist in lesion segmentation. It represents an exploration of the application of diffusion to video lesion segmentation within the medical imaging domain. Additionally, there is extensive validation of model performance through comparative experiments, and the model diagrams are clear.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents and evaluates a new multi-task deep learning-based network for video polyp segmentation (VPS) in colonoscopy videos using temporal information. In general, the authors integrate a Multi-task Diffusion Model (MDM) that mainly predicts the label of the lesion with a Temporal Reasoning Module (TRM) that feeds the MDM with temporal information from the previous frames of the video, also applying an adversarial self-supervised strategy to reconstruct the actual frame using the previous frames. The results demonstrate that the presented method achieved the state-of-the-art results on the publicly available SUN-SEG dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strengths of the paper are:

    • The development of a new multitask deep-learning-based network for video polyp segmentation, specifically the first application of a diffusion-based model to medical video lesions, proving the effectiveness of the method for the task.
    • The comparison with different strategies, proving the added value of using each of the modules (MDM, TRM and ASS).
    • The use of a publicly available dataset that permits the reproducibility of the method and comparison with other state-of-the-art methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is the motivation of the paper. Despite the prevalence of colorectal cancer, it remains unclear how the proposed method can be effectively utilized. While it is mentioned that there is still a high missing rate in the diagnosis and treatment process, it is not explicitly stated how the method can mitigate this issue. It is essential to understand whether the method can assist doctors during colonoscopy procedures in making accurate diagnoses, or if it can be utilized post-diagnosis or in real-time. Furthermore, regarding the treatment process, it is unclear if the method offers any advantages or improvements. The paper lacks a clear explanation of how the proposed method can enhance diagnosis, treatment, or patient outcomes.

    Additionally, the explanation of the dataset provided in the paper lacks sufficient information. This omission necessitates the reader searching for the dataset’s origin to comprehend the data division and its relevance to the study.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    First of all, I would like to congratulate the authors for their work and significant contribution to the field of polyp segmentation in colonoscopy videos. The work presents an innovative and promising approach with the potential to positively impact the diagnosis and treatment of colorectal cancer. However, I would like to provide some suggestions to further enhance the clarity, clinical relevance, and impact of the work.

    The introduction of the paper lacks clarity regarding the primary objective of developing this method. It’s crucial to clearly state the research objectives, such as whether the goal is improving diagnosis, treatment, or patient outcomes. This clarification will help readers understand the significance of the research and its potential impact on healthcare.

    The paper would benefit from reorganizing the references so that they are in chronological order, instead of alphabetical order, and properly cited in the text. Additionally, some references may not be relevant or appropriately placed within the context of the paper. A problem with the reference management software may be responsible. Some examples follow:

    • The references do not seem to be well placed; where statistical data are presented, the source must be cited.
    • Where you have “…to Transformer [20,17,8,10,19,20]”, it would be better to cite a review article or change to the range form [n-N].
    • Where you read “Motivated by the nature of diffusion models, in this paper,” it remains to be said what the paper is.
    • Where you have “the size and morphological features [4,18]” the references are inappropriate.

    The experimental section should include a detailed description of the dataset used and explain why the comparison methods were selected. Some suggestions follow: in Figure 1, it is not clear why the mask image has two dimensions; furthermore, the abbreviations should be explained in the figure legend.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents innovative contributions in the field of video polyp segmentation; however, there is room for improvement in clarifying the research objectives, providing detailed explanations of the methodology, organizing references appropriately, and discussing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Reviewer#1

A1: The MDM generates multi-scale spatial features S_j. Meanwhile, the TRM module extracts the multi-scale features frame-by-frame. They are averaged along the temporal axis and aggregated into multi-scale temporal features R_j, the same dimension as S_j. Then S_j and R_j are integrated by pointwise addition.

A2: The testing dataset is divided into seen and unseen categories based on data distribution. ‘Seen’ represents the visible case and ‘Unseen’ represents the invisible case. There is no intersection of ‘Seen’ and ‘Unseen’. ‘Seen’ is divided into two parts, one part is used as the training set S_tr, and the other part is used as the test set S_te, i.e. case7_2 in the training set and case7_1 in the testing set.

A3: We used some of the data from the easy_seen dataset to prevent overfitting during training.

A4: It’s impractical to extend the input to whole video frames due to insufficient memory. Our experiment uses two NVIDIA 3090 GPUs, each containing 24 GB of memory. It’s just the right amount of training.

Reviewer#2

A1: The ‘batch size of 16’ means that each iteration of training will input 16 samples. The ‘patch size of 224×224’ means that the original image, for example, 1024×1024, is divided into small areas, which is the size of the local area used for feature extraction or analysis in image processing.

A2: L_TRM is the same as L_G. The purpose of adversarial learning is to induce the Generator, which is the TRM module in this paper, to learn the image features of a clip of frames during the reconstruction process.

Reviewer#3

A1: Since there is still a high missing rate in the diagnosis process, this paper aims to mitigate this issue by improving the accuracy of detection. However, it is undeniable that improvements need to be further explored in the clinical setting.

A2: Thanks to the reviewer, we will improve the references in the final version.

A3: For reasons of space, we point to the reference from which the dataset originated and do not further characterize the dataset. The comparison methods chosen are classic for this task. The dimension of the mask image is a small typo and has been amended to 1×H×W.
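
The feature integration described in the answer to Reviewer#1 (A1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fuse_features`, the number of scales, and the channel and spatial sizes are hypothetical choices for illustration. Only the two operations themselves (averaging the per-frame TRM features along the temporal axis to obtain R_j, then adding R_j pointwise to S_j) come from the rebuttal.

```python
import numpy as np


def fuse_features(spatial, temporal_per_frame):
    """Fuse spatial features S_j with per-frame temporal features at each scale j.

    spatial:            list of arrays, each of shape (C_j, H_j, W_j)      -> S_j
    temporal_per_frame: list of arrays, each of shape (T, C_j, H_j, W_j)
    Returns a list of fused arrays, each of shape (C_j, H_j, W_j).
    """
    fused = []
    for s_j, frames_j in zip(spatial, temporal_per_frame):
        # Average along the temporal axis to obtain R_j.
        r_j = frames_j.mean(axis=0)
        if r_j.shape != s_j.shape:
            raise ValueError(f"shape mismatch: {r_j.shape} vs {s_j.shape}")
        # Pointwise addition of S_j and R_j.
        fused.append(s_j + r_j)
    return fused


# Hypothetical sizes: three scales and a clip of T = 5 frames.
rng = np.random.default_rng(0)
scales = [(64, 56), (128, 28), (256, 14)]  # (channels, height = width), assumed
T = 5
spatial = [rng.standard_normal((c, h, h)) for c, h in scales]
temporal = [rng.standard_normal((T, c, h, h)) for c, h in scales]

fused = fuse_features(spatial, temporal)
for f in fused:
    print(f.shape)
```

Because R_j has the same dimension as S_j, the addition needs no projection layer; that is the property the shape check in the sketch enforces.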




Meta-Review

Meta-review not available, early accepted paper.


