Abstract

Early diagnosis of rectal cancer is essential to improve patient survival. Existing diagnostic methods mainly rely on complex MRI as well as pathology-level co-diagnosis. In contrast, in this paper, we collect and annotate for the first time a rectal cancer ultrasound endoscopy video dataset containing 207 patients for rectal cancer video risk assessment. Additionally, we introduce the Rectal Cancer Video Risk Assessment Network (RCVA-Net), a temporal logic-based framework designed to tackle the classification of rectal cancer ultrasound endoscopy videos. In RCVA-Net, we propose a novel adjacent frames fusion module that effectively integrates the temporal local features from the original video with the global features of the sampled video frames. The intra-video fusion module is employed to capture and learn the temporal dynamics between neighbouring video frames, enhancing the network’s ability to discern subtle nuances in video sequences. Furthermore, we enhance the classification of rectal cancer by randomly incorporating video-level features extracted from the original videos, thereby significantly boosting the performance of rectal cancer classification using ultrasound endoscopic videos. Experimental results on our labelled dataset show that our RCVA-Net can serve as a scalable baseline model with leading performance. The code of this paper can be accessed at https://github.com/JsongZhang/RCVA-Net.
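
As a rough illustration of the fusion idea sketched in this abstract (combining per-frame global features with local temporal features from adjacent frames), here is a minimal, hypothetical PyTorch sketch. The module name, dimensions, and design are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class AdjacentFrameFusion(nn.Module):
    # Hypothetical sketch: mix each sampled frame's global feature with
    # local temporal context drawn from its neighbouring frames.
    def __init__(self, dim: int = 768):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # 3-frame window
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), one global feature per frame
        local = self.local(frame_feats.transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([frame_feats, local], dim=-1))

feats = torch.randn(2, 16, 768)             # 2 clips, 16 sampled frames each
print(AdjacentFrameFusion()(feats).shape)   # torch.Size([2, 16, 768])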

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/4098_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_ANew_MICCAI2024,
        author = { Zhang, Jiansong and Wu, Shengnan and Liu, Peizhong and Shen, Linlin},
        title = { { A New Dataset and Baseline Model for Rectal Cancer Risk Assessment in Endoscopic Ultrasound Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper offers a new ultrasound (US) dataset for detecting rectal cancer, inventories the results of several standard models on this dataset, and proposes a new model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces (sort of - see below) an important new medical imaging use case, along with a dataset to enable work on the use case.

    The inventory of results of various models is a valuable starting point for ML work in this space (though to properly interpret the results more detail is needed).

    This could potentially be an excellent paper in the “open datasets” workshop at MICCAI 2024 (https://conferences.miccai.org/2024/en/OPEN-DATA.html). It would need:

    1. More space and citations to describe the (non-ML) medical use case and the characteristics of the images.
    2. Less space devoted to model description.
    3. Table 1 is a good baseline summary, if the train-test structure is well-described and is patient-level (not image-level).
    4. The overall goal would be: introduce the medical use case, the ML relevance to the use case, the dataset, and the baseline inventory of ML approaches. Then a researcher can use the paper and dataset as a solid basis for new work in this area.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a large amount of low-information fill in the text, while conversely some crucial details are absent (eg train-test data structure for image-based vs video-based models).

    The paper provides only minimal detail as to what this use case consists of. Considerable non-ML literature exists that describes the use case in detail. This context is vital for any meaningful ML work. This literature should be included, along with an in-text summary of the ML-relevant aspects of the use case. Examples: Siddiqui “The role of endoscopic ultrasound in the evaluation of rectal cancer” 2006; Uberoi “Has the role of EUS in rectal cancer staging changed in the last decade” 2018.

    Table 1 appears to show that video-based models are inferior. While there is vague language implying that the nuances and complexities of videos are useful, this is not supported by evidence. If the literature or this research has found diagnostic value in videos vs still images, please present this specific detail.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Details of the train/val splits (eg image-level, patient-level?) are incomplete. Code for the model will be available.
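
    For reference, a patient-level split keeps every clip from a patient on one side of the split, avoiding the leakage an image-level split can introduce. A minimal scikit-learn sketch (the data below are dummy values, purely illustrative):

    from sklearn.model_selection import GroupShuffleSplit

    # Dummy data: 10 clips from 4 patients, with T0-T4 style grades (0-4).
    clips = list(range(10))
    grades = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
    patient_ids = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]

    # Every clip from a given patient lands entirely in train OR test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    train_idx, test_idx = next(splitter.split(clips, grades, groups=patient_ids))
    print(sorted(train_idx), sorted(test_idx))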

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. There seems to be a lot of low-information and repetitive text which would be best removed to make room for high-information content, eg concrete details about the subject matter and specific evidence for the claims. Examples:

    Abstract: “seamlessly integrates … subtle nuances”: this is not supported concretely in the text, and reads like advertising copy.

    Image quality and interpretability: much of this section consists of generalities and tautologies (eg "video data introduces … technically simpler").

    Section 3.1: The multiple mentions of N frames. Perhaps it could be said, once, “A video contains N frames, each 224 x 224 pixels. N differs by video, averaging —.” Also describe sampling rate, pixel scale, what is shown, depth of penetration, variations in all these seen in different device outputs, etc.

    3.2: “Specifically, the random …the time dimension.” Repetition

    4.1: “this is because static …may lead to lower accuracy”.

    4.2: “RSEM complements … in the temporal dimension.” This is general but not supported by evidence

    2. Miscellaneous comments:

    “Whereas, for variable … also be more relevant.” Why is the temporal dimension relevant to the medical diagnostic use case? Please provide specific detail as to the added value.

    Section 2 last sentence: Is this saying that pathology provides a trustworthy ground truth for this dataset?

    Section 2.1: What is the import of the various levels T0-T4? What kinds of misclassifications are less or more costly? A section on the medical use case (independent of ML) enables the reader to understand the essential task.

    “based on this be able”: typo

    “3x224x224 to 196x768”: unclear.

    “no playback sampling”: unclear

    “which is more relevant to real-world scenarios”: This is crucial information. Please add concrete evidence as to exactly why (in lit review or this research) videos add diagnostic value.

    “while accurately capturing key features”: what are these features? Please add concrete evidence.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper raises but does not properly introduce an important new medical imaging use case.

    The paper needs substantial editing to increase information density.

    It could be an excellent paper for the open data workshop.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Reject — must be rejected due to major flaws (1)

  • [Post rebuttal] Please justify your decision

    The authors’ responses focus mostly on model details, and do not address the issues I raised. So I retain my prior view that this could be a fine paper in the data workshop, if the authors put more detail into describing the clinical use case and ML’s role therein. Certainly the use case is novel and valuable to share. I’ll also note that Reviewer #3 gave insightful and valuable comments. I hope the authors respond carefully to those comments.



Review #2

  • Please describe the contribution of the paper

    They introduced a rectal cancer endoscopic ultrasound video dataset, RCEUV-207, for computer-aided diagnosis.

    Building on this benchmark, they propose RCVA-Net, a novel risk-assessment network for analyzing rectal cancer ultrasound videos.

    Experimental results demonstrate the research potential of RCEUV-207, with RCVA-Net achieving leading performance in this area.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors curated a dataset of rectal cancer videos comprising five stages of the disease.

    They employed a transformer model and achieved high precision in classifying rectal cancer.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors have solely relied on the accuracy metric for comparing their study and showcasing improvements, which may not be sufficient. It is recommended that they include other evaluation metrics such as precision, recall, sensitivity, etc., as mentioned in section 4 of the paper, to provide a more comprehensive assessment of their proposed method.

    The results of the proposed method show a slight decrease compared to the studies listed in Table 1. It is important for the authors to provide a justification for this discrepancy, which could involve discussing potential limitations of their approach, differences in experimental setup, or challenges encountered during implementation.

    Since the authors have provided a new dataset, it is imperative for them to include a comparison with existing datasets in the field. This comparison would help establish the characteristics and quality of the newly introduced dataset relative to others, providing valuable insights for researchers and practitioners in the field.
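
    As a companion to the first point above, precision, recall (sensitivity), and related metrics are inexpensive to report alongside accuracy. A minimal scikit-learn sketch with dummy T0-T4 predictions (values are illustrative only):

    from sklearn.metrics import classification_report

    # Dummy ground-truth and predicted T-stages (0..4), for illustration only.
    y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
    y_pred = [0, 1, 2, 3, 3, 0, 2, 2, 3, 4]

    # Per-class precision and recall (recall == sensitivity), plus macro averages.
    print(classification_report(y_true, y_pred,
                                target_names=["T0", "T1", "T2", "T3", "T4"],
                                digits=3))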

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors have solely relied on the accuracy metric for comparing their study and showcasing improvements, which may not be sufficient. It is recommended that they include other evaluation metrics such as precision, recall, sensitivity, etc., as mentioned in section 4 of the paper, to provide a more comprehensive assessment of their proposed method.

    The results of the proposed method show a slight decrease compared to the studies listed in Table 1. It is important for the authors to provide a justification for this discrepancy, which could involve discussing potential limitations of their approach, differences in experimental setup, or challenges encountered during implementation.

    Since the authors have provided a new dataset, it is imperative for them to include a comparison with existing datasets in the field. This comparison would help establish the characteristics and quality of the newly introduced dataset relative to others, providing valuable insights for researchers and practitioners in the field.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Originality and Contribution: The paper’s originality and contribution to the field were evaluated based on the novelty of the proposed method, the significance of the research problem addressed, and the extent to which the paper advances the current state of knowledge in the field.

    Methodological Rigor: The methodological rigor of the study was carefully evaluated, including the experimental design, data collection and preprocessing, choice of evaluation metrics, and statistical analysis.

    Clarity and Coherence: The clarity and coherence of the paper were assessed based on the organization of the content, clarity of presentation, and coherence of arguments and explanations.

    Technical Soundness: The technical soundness of the proposed method, including the validity of assumptions, correctness of algorithms, and appropriateness of methodologies, was thoroughly examined.

    Experimental Results and Analysis: The quality and significance of the experimental results, including the effectiveness of the proposed method, were assessed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors responded to my comments efficiently. However, the paper needs more justification before acceptance, as noted by R5.



Review #3

  • Please describe the contribution of the paper

    This paper proposes the first dataset for endoscopic ultrasound analysis of rectal cancer, covering 207 cases and 5 different categories. On this dataset, the authors summarize three main challenges faced by deep-learning-based grading and classification of rectal cancer endoscopic ultrasound images. The paper also proposes a baseline network, the Rectal Cancer Video Risk Assessment Network (RCVA-Net), to solve the classification problem of rectal cancer endoscopic ultrasound videos. Experimental results demonstrate its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Proposed the first dataset for endoscopic ultrasound analysis of rectal cancer. (2) Designed the Neighbor Sequence Encoding Module (NSEM) and Random Sequence Encoding Module (RSEM) to encode ultrasound video frames.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The dataset's establishment lacks the necessary clinical-application support and further analysis.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No, provided that the complete code and dataset are released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) The title of the paper is risk assessment, the abstract background mentions early diagnosis, and the dataset is a classification of T0-T4 stages. These concepts are related but have some essential differences. It is recommended that the authors clarify what clinical problem they are studying and how it relates to the classification of the dataset.

    (2) The dataset established in the paper is divided into T0-T4, five categories corresponding to the five pathological stages of rectal cancer malignancy. Is there a corresponding medical grading for rectal cancer? If not, why is the dataset divided in this way? It is recommended to analyze this in a clinical sense. The construction of the dataset should not only state how many images each category contains; it is recommended to further analyze the dataset and its characteristics. Generally speaking, papers about new datasets devote nearly 3/4 of their content to the analysis of the dataset itself, while the baseline is relatively minor.

    (3) A common ultrasound examination generally involves the radiologist moving the ultrasound probe in multiple directions to find the most representative ultrasound image(s). Due to the characteristics of ultrasound acquisition (different from CT, MRI, etc.), it is rare to record video for subsequent diagnosis. The dataset provided by the authors specifically emphasizes that it consists of videos. What is the reason for this design, and what special medical and clinical significance does it have? At the same time, in the experimental part, the comparison with image classification methods also fails to show that classification based on ultrasound videos is better ("In terms of accuracy, image-based classification methods are higher than video classification models, …"). Video introduces more unclear, interfering information; is this the reason for the gap between image and video classification performance? Further, is video necessary? Or, being "more relevant to real-world scenarios", can frame-by-frame analysis be used to guide doctors to obtain more effective ultrasound tomographic images? It is recommended that the authors discuss this in depth to reflect the value of their work in actual clinical practice.

    (4) The connection between the core NSEM and RSEM modules of RCVA-Net and the three challenges identified in the analysis has not been made clear. In addition, detailed method descriptions of the "data units" and "data unit integration methods" in 3.1 and 3.2 should be given to further enhance the reproducibility of the method.

    (5) There are a few errors in the writing of the paper. For example, references should start from [1] and be cited in full; abbreviations should be introduced in a standardized way, e.g. "NSEM" should be given together with "Neighbor Sequence Encoding Module" at first mention rather than used directly without definition; and the third point in the challenge analysis is missing ":" after "Neighbor Sequence Encoding Module".

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Proposing a new dataset is meaningful, but it lacks the necessary analysis.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors explained my concerns in detail and indicated that changes would be made in the article, and I agree to its publication.




Author Feedback

The RCEUV-207 was imaged radially at 10-20 MHz by a Hitachi ultrasound imaging system, using pathology biopsy T-grade results as video labels.

1. Differences between frame-by-frame analysis and video analysis: The main distinction lies in the consideration of temporal context information. Karpathy et al. [1] discussed the differences between frame-by-frame and holistic video analysis, noting that while frame-by-frame analysis is simpler, it lacks inter-frame contextual information, which could reduce the robustness of model deployment. Similarly, Litjens et al. [2] noted that conducting video analysis in ultrasound is closer to clinical reality. Recent trends in ultrasound analysis for breast cancer benign-malignant classification have moved from static 2D images to video analysis [3], suggesting that dynamic ultrasound analysis using video formats is more aligned with real-world scenarios and has greater research potential.

2. Method and experimental detail additions: (1) In RCVA-Net, we introduced two modules (NSEM and RSEM) to link temporal context information and enhance the global recognition capabilities of video-based models. In NSEM, we used the ViT patch method for tensor dimension transformation, which can currently be implemented with the Hugging Face transformers library as: inputs = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')(images=dummy_image.permute(0, 2, 3, 1).numpy(), return_tensors='pt'). In RSEM, the tensor size transformation is consistent, but the selected video frame and its adjacent frames (M-1, M+1) are sampled without replacement along the time axis: the selected frame M and the frames immediately before and after it are extracted only once and are not reloaded for training within the same video feature-extraction process. We will add these details to Sections 3.1 and 3.2 to clarify. (2) Addressing the question "Why do image-based classifications seem to perform better than video classifications?": Table 1 shows the baseline capabilities of different data-loading methods and model approaches on RCEUV-207. Apart from the ease of feature extraction and pattern recognition in image-based analysis, where training and testing data are usually divided proportionally by category, video analysis assigns each case as a whole to either training or testing. We believe this is a key reason why image-based classification methods can achieve high performance levels. Video-based analysis methods not only need to consider information from the temporal dimension but also require the model to be robust to unknown cases in order to perform well in tests. Our proposed RCVA-Net achieves leading baseline performance in this regard. We will add additional metrics such as precision, recall, and specificity to Table 1 and include the aforementioned reasons in Section 4.1 to provide a comprehensive understanding of the experiments.

3. Clarifications on other comments: Some comments pointed out the need for a comparison with datasets related to this paper. To the best of our knowledge, this is the first video-analysis dataset proposed in the field of colorectal cancer endoscopic ultrasound. Although there are a few statistical reports, none have disclosed their datasets or corresponding machine-learning methodologies. We are willing and hope to add citations to address any shortcomings in the clinical descriptions objectively.

[1] Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[2] Litjens, Geert, et al. "A survey on deep learning in medical image analysis." Medical Image Analysis 42 (2017): 60-88.
[3] Lin, Zhi, et al. "A new dataset and a baseline model for breast lesion detection in ultrasound videos." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022.
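
As a concrete companion to point 2 of the feedback, the following is a minimal, self-contained PyTorch sketch of the two operations as we read them: the ViT-style patch transformation of a 3x224x224 frame into a 196x768 token sequence, and the sampling of a frame M together with its neighbours (M-1, M+1) without replacement. All names and shapes are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

# ViT-base patch embedding: a 3x224x224 frame becomes 196 tokens of dim 768,
# since 224/16 = 14 and 14*14 = 196 (the "3x224x224 to 196x768" transformation).
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
frame = torch.randn(1, 3, 224, 224)                  # one dummy frame
tokens = patch_embed(frame).flatten(2).transpose(1, 2)
print(tokens.shape)                                  # torch.Size([1, 196, 768])

# Hypothetical neighbour sampling: pick an interior frame index M, then take
# (M-1, M, M+1); the triplet is extracted once per feature-extraction pass.
def sample_neighbour_triplet(video: torch.Tensor) -> torch.Tensor:
    # video: (N, 3, 224, 224) -> (3, 3, 224, 224) triplet of adjacent frames
    n = video.shape[0]
    m = int(torch.randint(1, n - 1, (1,)))           # avoid the endpoints
    return video[m - 1 : m + 2]

video = torch.randn(32, 3, 224, 224)                 # dummy 32-frame clip
print(sample_neighbour_triplet(video).shape)         # torch.Size([3, 3, 224, 224])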




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While there are some weaknesses in presentation highlighted by R5, the contributions overshadow them.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    While there are some weaknesses in presentation highlighted by R5, the contributions overshadow them.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I recommend accept, provided that the authors make the revisions as promised.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    I recommend accept, provided that the authors make the revisions as promised.


