Abstract

Endorectal ultrasound (ERUS) is an important imaging modality that provides high reliability for diagnosing the depth and boundary of invasion in colorectal cancer. However, the lack of a large-scale ERUS dataset with high-quality annotations hinders the development of automatic ultrasound diagnostics. In this paper, we collected and annotated the first benchmark dataset that covers diverse ERUS scenarios, i.e., colorectal cancer segmentation, detection, and infiltration depth staging. Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames. Based on this dataset, we further introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR). ASTR is designed based on three considerations: scanning mode discrepancy, temporal information, and low computational complexity. For generalizing to different scanning modes, the adaptive scanning-mode augmentation is proposed to convert between raw sector images and linear scan ones. For mining temporal information, the sparse-context transformer is incorporated to integrate inter-frame local and global features. For reducing computational complexity, the sparse-context block is introduced to extract contextual features from auxiliary frames. Finally, on the benchmark dataset, the proposed ASTR model achieves a 77.6% Dice score in rectal cancer segmentation, largely outperforming previous state-of-the-art methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0073_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0073_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Jia_Towards_MICCAI2024,
        author = { Jiang, Yuncheng and Hu, Yiwen and Zhang, Zixun and Wei, Jun and Feng, Chun-Mei and Tang, Xuemei and Wan, Xiang and Liu, Yong and Cui, Shuguang and Li, Zhen},
        title = { { Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper provides a labeled dataset of endorectal ultrasound videos. The dataset has 77 videos of various patients, 19 performed with a linear-array rectal cavity probe and the rest with a convex vaginal one. 57 of the videos have labels as to the cancer stage.

    Additionally, a segmentation model and an augmentation method are proposed. The augmentation method transforms images acquired with linear probes to look like convex ones and vice versa. The transformation is done through a polar-Cartesian coordinate transform.
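The polar-Cartesian resampling described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sector_to_linear`, the `half_angle_deg` opening-angle parameter, and nearest-neighbour sampling are all assumptions, and the origin is placed at the top-centre pixel as stated in the author feedback.

```python
import math

def sector_to_linear(img, half_angle_deg=45.0):
    """Resample a sector (convex) image onto a rectangular (r, theta) grid.

    img is a list of rows of grayscale values. Row i of the output
    corresponds to radius r = i and column j to an angle theta in
    [-half_angle, +half_angle]. The opening angle is an assumed parameter.
    """
    h, w = len(img), len(img[0])
    xc, yc = (w - 1) / 2.0, 0.0          # origin at the top centre
    half = math.radians(half_angle_deg)
    out = [[0] * w for _ in range(h)]    # pixels outside the image stay 0
    for i in range(h):
        r = i
        for j in range(w):
            theta = -half + 2 * half * j / (w - 1) if w > 1 else 0.0
            x = xc + r * math.sin(theta)
            y = yc + r * math.cos(theta)
            xi, yi = int(round(x)), int(round(y))   # nearest neighbour
            if 0 <= xi < w and 0 <= yi < h:
                out[i][j] = img[yi][xi]
    return out

# Toy 5x5 image whose value encodes its position: 10 * row + column.
toy = [[10 * y + x for x in range(5)] for y in range(5)]
linear = sector_to_linear(toy)
```

In this toy example the whole first output row is filled from the single apex pixel, which illustrates how strongly the region near the probe is stretched by the resampling.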

    The proposed model incorporates temporal information by creating multi-frame contexts, in addition to per-frame context. The model utilizes attention-based learning and the authors reduce the computational complexity by creating sparse contexts.

    They have thorough evaluations comparing against 10 other methods for segmentation and also show ablation studies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a new dataset of endorectal ultrasound videos taken on patients in clinical setting with two different probes. The dataset is further annotated by experts and there is additional information about the staging of the cancer on a subset of it.

    The proposed benchmark model utilizes spatiotemporal information and state-of-the-art attention-based learning. Furthermore, computational complexity is addressed and minimized.

    The evaluations are thorough with a comparison to 10 other methods on 6 metrics. Additionally, there are ablation studies on the different components.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The data augmentation produces images that are not physically accurate. The authors take a convex ultrasound image and geometrically transform it to a linear one and vice versa. This deforms the anatomy into something unrealistic. The transformation is even called anatomic-aware, which it is not. This needs to be addressed. Not every augmentation needs to lead to a realistic image; however, this should be mentioned. From my interpretation of the text, there is an implication that this produces realistic images. It needs to be stated explicitly that the images are not realistic, and the name of the augmentation must be changed.

    In addition to the above comment, the authors apply a flip on the image for augmentation. Flipping left-right makes sense, however, flipping up and down does not - this is once again an image that cannot occur in reality, and thus the benefit of adding such data to the training pipeline remains unclear.

    Another point is the limited information on the dataset, especially in terms of demographics. This should be included upon release.

    A minor weakness is that the auxiliary loss is not described.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The data augmentation with the AAT module does not produce realistic images. Converting a convex image to a linear one geometrically puts real-world points further apart at the top of the image and closer together at the bottom, for example. I would advise changing the name of the transformation and stating this explicitly.

    Further expanding on the geometric transformation, there are a variety of geometries that an image can have. For example, a convex probe can have a different opening angle. Have you considered varying the geometry parameters in the transformation?

    Was the data augmentation performed also for the training of the models used for comparison? This would be fair comparison between the architectures because the augmentation is independent and can be applied in any training pipeline.

    Statistical analysis of the evaluations would be appreciated.

    It would be good to add more information about the dataset upon release: both a detailed description of the demographics, with information such as age, sex, gender, and ethnicity, and some more details about the imaging parameters, if possible. Furthermore, a few words on the limitations of the dataset are well advised. If the industry develops a product based on the data, it would be important to have clearly stated constraints.

    Please, describe the computation of the auxiliary loss.

    Minor comments: I believe there is a mistake in having the FPS of the method in bold. The first three models have better FPS.

    There are minor spelling mistakes like having a capitalization of a word mid-sentence.

    Describing the loss function, I think you mean to say that the lambda_aux is the hyper-parameter weight instead of the alpha.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a strong contribution by providing a clinical labeled dataset and, additionally, proposing a model that takes into account spatiotemporal information and is computationally efficient.

    However, some questions need to be addressed:

    • The wording of the augmentation is misleading and I would strongly advise changing it.
    • Is the augmentation used during the training of the models used for comparison?
    • Can we have a detailed description of the dataset’s demographics?
    • The auxiliary loss needs to be described.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents an endorectal ultrasound dataset with 77 videos and 10,000 annotated frames. Further, a model is proposed for colorectal cancer segmentation using the same dataset. The model comprises modules focused on anatomic-aware transformation and contextual feature extraction from auxiliary frames. Comparison is performed with recent techniques. The results show that the proposed approach outperforms other methods while maintaining a decent frame per second computation rate.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Dataset Development: An endorectal ultrasound dataset has been developed that comprises 77 videos with 10,000 frames covering annotations for colorectal cancer segmentation, detection, and infiltration depth staging.
    • Generalizability across different scanning modes: The method introduces anatomic-aware transformation as a data augmentation technique. This transformation helps to generalize across two different scanning modes and is also used to avoid the data imbalance issue.
    • Considered computational cost: The approach uses a sparse-context block to extract contextual features from auxiliary frames, with the aim of reducing computational complexity.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is mentioned that there are 77 videos, of which 19 were recorded using the linear-array scanning mode and 57 underwent pathological examination. However, no class-wise distribution is provided.
    • The details of ethical approval and image resolution are missing.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    To facilitate easier utilization by the research community, it is essential to provide detailed information regarding the sample count for each category within the dataset. Additionally, to ensure reproducibility and transparency, a more elaborative dataset split description should be provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is well-written and structured. However, there are some concerns that need to be addressed:

    • More details about the dataset should be mentioned. For instance, how many frames contain colorectal lesions? What sample count is under each stage of infiltration depth (T1 to T4)? Also, details about the ethical approval and image resolution should be added.
    • How many frames contain anomalies in the training, validation and test split? More details on the dataset split are required.
    • What is the final sample count after augmentation? Do the methods provided in Table 1 for comparison also use the same data augmentation techniques as the proposed method? There could be some possibility that the enhanced results are mainly due to the augmentation approach. This should be clarified.
    • Minor: (a) The paper initially introduces an acronym for colorectal cancer. However, it is not used in the rest of the sections. (b) ‘c’ is used both in the Cartesian transformation and to denote the channel. It is suggested that different notations be used to reduce ambiguity. (c) Why is only one reduction value green in Table 3? Do the red and green colors denote something? If yes, it is better to mention it. (d) The ‘W’ in ‘We’ should be lowercase in two places: (i) page 2: ‘..benchmark dataset, We further propose…’ (ii) Figure 3 caption: ‘Furthermore, We devise…’
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall paper is well-structured and easy to follow. The paper also presents a dataset which the authors claim to release publicly. It would be beneficial for the research community. However, the paper does not mention a detailed description about the dataset which must be mentioned to facilitate easier utilization.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    There are two main contributions in this paper: 1) collecting and annotating 10,000 ultrasound images of colorectal lesions from 77 videos, where the dataset is claimed to be the first endorectal ultrasound dataset; 2) adopting Res2Net with additional mechanisms (AAT, which augments the dataset; SCB, which pools based on distance in time and a coarse lesion mask).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper claims to collect the first endorectal ultrasound dataset and will release it upon acceptance. The authors collect 77 ultrasound videos and annotate 10,000 images with lesion regions and bounding boxes. In addition, 57 of the 77 videos have tumor infiltration stages. The collected dataset is potentially useful for evaluating segmentation, detection, and classification tasks for colon lesions.

    2. Given the characteristics of ultrasound images collected with different probes (linear and convex), this paper uses the polar transformation and inverse polar transformation to augment the dataset.

    3. This paper proposes an adaptive pooling strategy for representation selection. The authors assign a larger pooling kernel to frames farther from the current frame along the time axis to reduce the number of representations, and use an additional light decoder to generate a coarse tumor mask to further reduce the representations outside the tumor region. As a result, the computational cost can be reduced.

    4. Fair evaluation and ablation study (Table 2, Table 3).
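The representation selection summarized in strength 3 (larger pooling kernels for temporally distant frames, plus pruning by a coarse tumor mask) can be sketched roughly as below. This is a hedged illustration, not the paper's SCB: the helper names `avg_pool` and `sparse_tokens`, the kernel rule k = 2 ** |dt|, the 0.5 mask threshold, and the single-channel 2D feature map are all assumptions made for brevity.

```python
def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling on a 2D list (H, W divisible by k)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[y][x] for y in range(i, i + k) for x in range(j, j + k)) / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]

def sparse_tokens(fmap, coarse_mask, dt, thresh=0.5):
    """Select context tokens from one auxiliary frame.

    dt is the temporal distance to the reference frame. The kernel rule
    k = 2 ** |dt| is an assumption for illustration, not the paper's schedule.
    """
    k = 2 ** abs(dt)
    pooled = avg_pool(fmap, k)
    pooled_mask = avg_pool(coarse_mask, k)
    # Keep only pooled positions that the coarse mask marks as lesion.
    return [
        v
        for row, mrow in zip(pooled, pooled_mask)
        for v, m in zip(row, mrow)
        if m >= thresh
    ]

# Toy 4x4 feature map; the coarse mask marks the left half as lesion.
features = [[float(4 * y + x) for x in range(4)] for y in range(4)]
mask = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
near = sparse_tokens(features, mask, dt=0)   # 8 tokens survive
far = sparse_tokens(features, mask, dt=1)    # 2 tokens survive
```

Because attention cost grows with the number of context tokens, shrinking the set for distant frames and for regions outside the coarse mask is what reduces the computation in this kind of design.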

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Simply transforming an ultrasound image collected with a linear probe to polar coordinates (or vice versa) does not close the gap between linear and convex modes, since the frequency and details might still differ.

    2. Fig. 2 and Sec. 2.1 describe the coordinate transformation. (r_c, \theta_c) is set to the center of the top, but the exact location of (x_c, y_c) is not mentioned (from Fig. 2 it also seems to be the center of the top).

    3. If possible, please also mention the number of cases within the 77 videos (does any patient have multiple videos?).

    4. There is no definition of the representation dimension d in Equations 3 and 4.

    5. Did the authors evaluate the time cost of the coarse map decoder when generating a mask?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    If possible, please provide the split of the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minors:

    If possible, please also mention the number of cases within the 77 videos (does any patient have multiple videos?).

    Better to define the representation dimension d in equation 3 and 4.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The dataset can help advance the development of segmentation, detection, and classification methods for colon tumors.

    2. Combining coarse mask and time distance to select representation seems interesting.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We really appreciate your positive evaluations of our paper.

Common Question:

To Reviewer 1 and Reviewer 4: Inaccurate description of AAT: We apologize for the confusion caused by our description. The original purpose of our AAT was to address the issues of data insufficiency and scanning mode discrepancy. The transformation simulates the appearance of lesions and surrounding tissues in both scanning modes, but it does not generate real images. We will consider renaming the method in the future.

To Reviewer 1, Reviewer 3, Reviewer 4:

  1. Fair comparison: In the method comparison, AAT is considered part of our overall ASTR framework. For a fair comparison, other methods are trained using the optimal strategies proposed in their original papers. However, in Table 3, we demonstrate that our proposed AAT has plug-and-play capabilities, effectively enhancing the performance of the segmentation model. Importantly, our method still outperforms the second-best model even when AAT is added to it (SLT-Net w/AAT). Additionally, Table 2 shows that each of our proposed strategies brings a positive improvement to performance, with the magnitude of improvement being similar.

  2. Dataset details: During an endorectal ultrasound examination, the sonographer typically moves the probe around the lesion after it is detected to observe the lesion’s shape and depth of infiltration. Thus, in our dataset, each frame contains rectal lesions. We are collecting new normal cases and considering extending the dataset in the future. We will consider including other detailed descriptions and demographics in the final version.

Individual Question:

To Reviewer 1: Auxiliary loss: The auxiliary loss uses the ground truth annotations of the reference frame to supervise SCB in providing a more accurate coarse mask during training. The loss function used is the sum of cross-entropy, dice, and mean absolute error.
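A loss of the form described above (cross-entropy plus Dice plus mean absolute error on the coarse mask) might look like the following sketch. The function name `auxiliary_loss`, the equal weighting of the three terms, the soft-Dice formulation, and the `eps` smoothing constant are assumptions; the rebuttal only states that the three terms are summed.

```python
import math

def auxiliary_loss(pred, target, eps=1e-7):
    """Sum of binary cross-entropy, soft Dice loss, and mean absolute error.

    pred: flat list of predicted foreground probabilities in [0, 1].
    target: flat list of 0/1 ground-truth labels of the reference frame.
    Equal weighting of the three terms is an assumption.
    """
    n = len(pred)
    # Clip probabilities so that log() never receives 0.
    clipped = [min(max(p, eps), 1.0 - eps) for p in pred]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, target)
    ) / n
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n
    return bce + dice + mae

gt = [1, 0, 1, 0]
perfect = auxiliary_loss([1.0, 0.0, 1.0, 0.0], gt)    # close to zero
uncertain = auxiliary_loss([0.5, 0.5, 0.5, 0.5], gt)  # about ln(2) + 1
```

In a full training setup this term would be added to the main segmentation loss with a weighting hyper-parameter, which the reviews refer to as lambda_aux.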

To Reviewer 3: Minors: Thank you for your suggestion. We will carefully revise the sentences and correct any spelling errors in the final submission.

To Reviewer 4:

  1. Coordinate center: Apologies for the misunderstanding. The center points for both polar coordinates and Cartesian coordinates are at the top center of the image.

  2. Definition of d: Thank you for the reminder. The dimension d equals H times W. We will include this definition in the final submission.




Meta-Review

Meta-review not available, early accepted paper.


