Abstract

Accurate dense depth prediction from monocular endoscopic images is essential for expanding the surgical field and augmenting surgeons' perception of depth. However, it remains challenging because endoscopic videos generally suffer from a limited field of view, illumination variations, and weak texture. This work proposes LGIN, a new architecture with unsupervised learning for accurate dense depth recovery from monocular endoscopic images. Specifically, LGIN creates a hybrid encoder that uses dense convolution and a pyramid vision transformer to extract local textural features and global spatial-temporal features in parallel, and builds a decoder that integrates the local and global features and uses two heads to estimate dense depth and odometry simultaneously. Additionally, we extract structure-valid regions to assist odometry prediction and unsupervised training, improving the accuracy of depth prediction. We evaluated our model on both clinical and synthetic unannotated colonoscopic video images; the experimental results demonstrate that our model achieves a more accurate depth distribution and richer textures. Both the qualitative and quantitative assessment results of our method are better than those of current monocular dense depth estimation models.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2617_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Fan_Simultaneous_MICCAI2024,
        author = { Fan, Wenkang and Jiang, Wenjing and Fang, Hao and Shi, Hong and Chen, Jianhua and Luo, Xiongbiao},
        title = { { Simultaneous Monocular Endoscopic Dense Depth and Odometry Estimation Using Local-Global Integration Networks } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a new architecture named LGIN, which leverages unsupervised learning to estimate dense depth and odometry from monocular endoscopic images. LGIN uses a hybrid encoder that integrates dense convolution and a pyramid vision transformer to extract local textural and global spatial-temporal features in parallel. A decoder merges these features and uses dual heads to estimate depth and odometry simultaneously. Additionally, the model employs structure-valid regions to enhance odometry prediction and to improve depth estimation accuracy during unsupervised training. The proposed model outperforms existing monocular dense depth estimation models in both qualitative and quantitative assessments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main contributions of the paper are as follows:

    1. Hybrid Encoder Architecture: The paper presents a new hybrid encoder that integrates dense convolution and pyramid vision transformers (PVT) to capture both local textural features (via DenseNet) and global spatial-temporal features (via PVT) in parallel, enhancing the depth and odometry estimation accuracy.

    2. Evaluation: The paper demonstrates a good evaluation protocol by testing the model on both clinical and synthetic colonoscopic video images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper presents several strengths, some potential weaknesses or areas for improvement include:

    1. The major contribution is the encoder, which integrates local and global features for depth estimation and camera pose estimation. However, the motivation for introducing global and local features and combining them is not clear, especially from the perspective of the properties of surgical images and of the tasks themselves.
    2. Since the proposed method estimates depth from monocular images, it cannot ensure that the scale of the depth is physically correct for clinical images.
    3. How would the authors comment on using large models (such as Depth Anything or Surgical-DINO) for depth estimation in this scenario? Would they perform better than the proposed method?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Add detailed motivation and analysis explaining why combining local and global features is important for depth estimation and pose estimation.
    2. Discuss the capability of large foundation models in this setting compared with the proposed method.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The relatively good experimental results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel self-supervised architecture for endoscopic depth and odometry estimation. Specifically, this paper designs a decoder to effectively integrate features from CNNs and ViTs and use dual heads to simultaneously output depth and odometry. It also introduces a masking strategy to improve prediction accuracy. Experiments show that the proposed solution achieves superior performance over baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength is the technical novelty. This paper designs a novel architecture to integrate local textural features (from a CNN) and global spatial-temporal features (from a ViT) hierarchically. It simultaneously estimates depth and poses in a feature-shared fashion, whereas previous work tends to estimate them with different networks. The proposed method achieves better performance than other CNN and ViT solutions. Besides, this paper is well organized, with clear visualizations and good writing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Clarification of Self-Supervised and Unsupervised Learning: Self-supervised learning is commonly regarded as a specialized form of unsupervised learning, where the input data itself acts as supervision without relying on external labels. Consequently, it’s not practical to strictly categorize related work into self-supervised and unsupervised, as they are fundamentally self-supervised. A revision of the related work in the introduction is expected.
    2. Warping Process: While this paper extensively covers feature extraction and integration, the warping process (depicted by the sky blue box in Fig. 1) remains unaddressed. Despite its potential simplicity, a brief discussion would aid readers in comprehending the complete pipeline.
    3. Loss Function Clarification: Some symbols in Equation 4, such as Vij, Vji, and the summation, lack explanation. Enhancing clarity on these symbols would improve readers’ understanding of the loss function.
    4. Fig. 4 Title and Contents Misalignment: A discrepancy exists between the title and the contents of Fig. 4. The title refers to “Rows 1~4… and Rows 5~8…”, but only six rows are shown in the picture.
    5. The overall design appears cumbersome due to the substantial presence of CNNs and ViTs compared to other methods. Including evaluation metrics like running time and memory consumption in the experiments would offer insights into the efficiency of different approaches.
    6. Since some metrics are higher-is-better and others are lower-is-better, it is advisable to add up and down arrows next to the metric titles.
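For the warping step raised in point 2, the following is a minimal sketch of the per-pixel reprojection that view-synthesis warping in unsupervised depth and odometry pipelines commonly relies on. It is not taken from the paper: the function name `reproject_pixel`, the pinhole intrinsics `K`, and the pose convention (rotation `R` and translation `t` mapping target-camera coordinates to source-camera coordinates) are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    K: Matrix3,
    R: Matrix3,
    t: Sequence[float],
) -> Tuple[float, float]:
    """Map pixel (u, v) of the target frame into the source frame.

    Assumes a pinhole camera with intrinsics K and a relative pose (R, t)
    that maps target-camera coordinates to source-camera coordinates.
    All names here are hypothetical, chosen for this sketch only.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]

    # 1. Back-project the pixel to a 3D point using the predicted depth.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth

    # 2. Apply the rigid transform predicted by the odometry (pose) head.
    xs = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    ys = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zs = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zs <= 0:
        raise ValueError("point falls behind the source camera")

    # 3. Project the transformed point into the source image plane.
    return fx * xs / zs + cx, fy * ys / zs + cy
```

In a full pipeline, the source image would then be sampled (typically bilinearly) at the returned coordinates to synthesize the target view, and the photometric difference between the synthesized and real target image would drive training.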
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    For reproducibility purposes, it’s advisable to include details about the hardware setup, such as the GPU type, the number of GPUs utilized, and the training time, within Section 3.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper proposes a novel architecture for endoscopic depth and odometry estimation. The feature integration and dual-head estimation modules are innovative and inspiring. Experiments show the effectiveness of the proposed design. The paper is well written and clearly illustrated. However, some parts are unclear and worth rephrasing, especially the related work and the loss function. Besides, the design appears cumbersome, so a discussion of efficiency is expected. Please refer to the weaknesses section for details.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Same as comments

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes an ML approach that integrates a CNN and a pyramid vision transformer to extract local and global features and generate depth maps and camera poses. The method is tested on synthetic and clinical datasets and provides better accuracy than baseline approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors claim that the novelty of the paper is the combination of CNNs and a transformer for parallel local and global feature extraction, and the simultaneous output of depth maps and odometry with the assistance of structure-valid regions. Implementation details, including the parameter settings, are clearly provided, which increases the reproducibility of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no efficiency result provided for the proposed method, e.g. the inference time.
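One way such an efficiency figure could be collected is sketched below. This is a generic wall-clock benchmark using only the Python standard library, not a procedure from the paper; `benchmark` and the stand-in workload are hypothetical names. For a model running on a GPU, the device would additionally need to be synchronized before each timestamp, which this sketch does not do.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and report per-call latency statistics.

    fn is a zero-argument callable wrapping one forward pass (hypothetical;
    any callable works). Warm-up calls are executed first and not timed.
    """
    for _ in range(warmup):
        fn()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.mean(timings_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(timings_ms),
        # Frames per second derived from the mean latency.
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call to the model under test.
    stats = benchmark(lambda: sum(i * i for i in range(10_000)))
    print(stats)
```

Reporting the median alongside the mean helps when a few runs are slowed by unrelated system activity.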

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Using manually selected frames with relatively large camera motion as training data affects the reproducibility of the method on other datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I presume the abbreviation “PH” stands for pose head, and the symbol “M^sc” should be “M^sv”. Regarding “we scale the predicted depth map through a median ratio with the ground truth”: please provide the equation for the scaling. There are typos in the caption of Fig. 4; it should read Rows 1~3 and Rows 4~6. Please proofread the paper and correct any grammatical errors.
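For context on the scaling equation requested above: a convention often used when evaluating scale-ambiguous monocular depth is s = median(D_gt) / median(D_pred) over valid pixels, followed by D_scaled = s * D_pred. The sketch below illustrates that convention only; whether the paper uses exactly this form is an assumption, and `median_scale` and `min_depth` are hypothetical names.

```python
from statistics import median
from typing import List, Sequence, Tuple


def median_scale(
    pred: Sequence[float],
    gt: Sequence[float],
    min_depth: float = 1e-3,
) -> Tuple[List[float], float]:
    """Rescale predicted depths so their median matches the ground truth.

    pred and gt are flattened depth maps of equal length. Pixels where
    either value is at or below min_depth are treated as invalid and
    excluded from the ratio (an assumption of this sketch).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    valid = [(p, g) for p, g in zip(pred, gt) if p > min_depth and g > min_depth]
    if not valid:
        raise ValueError("no valid pixels to compute the scale from")

    # Ratio of medians over valid pixels.
    ratio = median([g for _, g in valid]) / median([p for p, _ in valid])

    # Apply the single global scale to every predicted depth.
    return [p * ratio for p in pred], ratio
```

Because a single global factor is applied, this alignment removes the scale ambiguity for evaluation but does not make the prediction metrically correct on data without ground truth.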

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and results are clearly presented and it shows improvements compared to baseline approaches.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


