Abstract

Monocular endoscopic depth estimation is key to expanding the surgical field and visually navigating the endoscope, augmenting surgeons' perception and reducing inadvertent damage during robotic surgery. Unfortunately, current deep learning methods still suffer from a limited field of view, moving and limited artificial optic-fiber light sources (illumination variations), and weak textures or structures in monocular endoscopic video images collected from complex surgical scenarios; they are also prone to depth overestimation. This work first explores a small deep learning model, a densely convolved pyramid transformer, that simultaneously predicts monocular depth and endoscope pose without using any annotated data. Specifically, this small model employs dense convolution and a hierarchical transformer to encode multiscale local and global features, and uses residual attention to effectively fuse or decode these features. Then, a photometric structure-aware consistency mechanism is introduced to deal with the problems of weak texture and depth overestimation, refining endoscopic depth and pose estimation. We evaluated our methods on both synthetic and clinical colonoscopic video images. The experimental results show that our unsupervised learning methods attain more accurate depth distributions, richer textures, and better qualitative and quantitative results than state-of-the-art monocular depth estimation models.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4981_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{FanWen_Unsupervised_MICCAI2025,
        author = { Fan, Wenkang and Qiu, Enqi and Xu, Hongzhi and Luo, Xiongbiao},
        title = { { Unsupervised Structure-Geometric Consistency for Monocular Endoscopic Depth Overestimation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents an unsupervised learning method that simultaneously estimates dense depth and camera pose from monocular endoscopic data. The paper further addresses the problem of depth overestimation via a photometric structure-aware consistency function and a 3D geometric consistency function which identifies regions depicting negligible movement across endoscopic frames and enforces smooth depth estimates in these regions, respectively.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method identifies a critical problem with monocular depth estimation which is the overestimation of depth in regions that depict negligible movement across endoscopic frames and proposes a solution to this problem. Authors provide details on what various variables were set to.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Authors make several claims throughout the paper that are not immediately obvious to the reader. Authors should try to explain these further to make the intuition behind some of these claims clearer to the reader. For instance, authors mention that “HTB captures.. illumination variations for robust training..”. How HTB is able to capture these variations is unclear. Is this due to the variations present in the dataset? Or is data augmentation used to allow the transformer to observe such global variations in the dataset? Please specify.

    Results could also be presented more clearly.

    Overall presentation of content in the paper could be improved. Authors could consult a non-author to read through the paper for readability.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Section 2, page 3, below Eq. 2: “Note that U is placed before..”. Where is U in any of these equations? Please define U.

    Fig. 2: Please increase the font size of your axis labels and legend.

    Fig. 3: Please identify which row is Row 1 and Row 2. I’m assuming 1 and 2 are top and bottom but please use more informative labels.

    Fig. 4: Is it possible to include predicted depth maps from DCHT1/2 in addition to the error maps? While it is helpful to see the error maps, it is difficult to gauge the scale of the errors and, therefore, the quality of the predicted depth maps.

    Fig. 5: It would also be interesting to see predicted depth maps from DCHT1 in the comparison.

    Table 1: Please indicate via up or down arrows which columns should the reader expect to see increase or decrease in value for better results (e.g., Abs Rel \downarrow).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Authors need to explain their claims better and present their results better (see feedback).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel unsupervised deep learning framework for monocular endoscopic depth estimation that addresses the critical issue of depth overestimation in static regions commonly encountered during robotic-assisted surgery. The authors propose a compact model called the Densely Convolved Hierarchical Transformer (DCHT), which integrates local texture and global spatial-temporal features using dense convolutions and hierarchical transformers, fused through residual attention mechanisms. To overcome the limitations of traditional photometric consistency losses, they introduce a photometric structure-aware consistency (PSC) loss that masks out static and textureless regions, and a 3D geometric consistency loss to ensure smooth and accurate depth prediction. Validated on both synthetic and clinical colonoscopic datasets, their approach significantly outperforms existing unsupervised methods in both qualitative and quantitative metrics, offering improved accuracy, texture detail, and robustness to illumination changes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents several key strengths: it clearly identifies and addresses the problem of depth overestimation in unsupervised monocular endoscopic depth estimation, particularly in static and low-texture regions; introduces a novel and compact Densely Convolved Hierarchical Transformer (DCHT) that effectively integrates dense convolutions and hierarchical transformers to capture local and global features; proposes a robust unsupervised training strategy with photometric structure-aware consistency (PSC) and 3D geometric consistency losses to improve depth accuracy and temporal coherence; demonstrates comprehensive performance gains over state-of-the-art methods through both qualitative and quantitative evaluations on synthetic and real clinical datasets; and offers a fully annotation-free, practical solution that is highly applicable to clinical environments, especially in robotic and minimally invasive surgery.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There is limited analysis on the individual contributions of the model components (e.g., dense convolution vs. transformer blocks), making it harder to assess the relative importance of each. The paper does not provide benchmarks on computational cost, inference time, or real-time applicability, which are crucial for deployment in surgical environments. The approach assumes that some image regions remain static due to consistent endoscope motion, which may not generalize well to more dynamic surgical scenes or diverse organ types.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Static image region assumption should be loosened.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a photometric error to better learn monocular depth estimation along with a comprehensive evaluation and model. They compare to other off the shelf methods quantitatively and qualitatively. The paper specifically presents a loss to help deal with low-motion (far-field) pixels which can cause overestimation of depth. They use a transformer + conv model (named DCHT) to estimate relative depth and pose, and combine a 3D point matching loss with a loss (named Photometric structure-aware consistency, PSC) to deal with the problem of over-estimation. The mask used in PSC down-weights non-useful contributions up to a threshold, better enabling performance in semi-static regions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper provides a clear problem statement (depth over-estimation) along with a clear solution (masking a loss) and numerical measurements backing it up. My primary concerns come from wanting to better understand the math and problem formulation as a whole.

    The paper also provides clear figures to help illustrate the methods and model for depth estimation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I would provide more in-depth discussion of failure modes. For example, why is mask extraction problematic, as you mention in the discussion?

    Better explanation of mask for PSC. Usually I see a mask as 1 or 0-valued, but in this case it seems to be weighing the SSIM with a floating point number. I would recommend to explain this more clearly, since it was a bit confusing to me.

    More detailed explanation of the over-estimation problem. To me, the over-estimation issue seems close (albeit in a dense manner) to the need for ‘close’ and ‘far’ features in ORB-SLAM. Is the primary reason for over-estimation that the pixel locations are rounded to the nearest integer pixel location, so that little motion becomes zero motion? If so, are there other ways to deal with this ‘subpixel’ issue?

    7:3 train/test splits. Are these splits across different cases or frames? For example can one of the 33 sequences occur in both splits?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    For DCHT1 and DCHT2, I would recommend naming DCHT2 more clearly, since the latter is the primary contribution for the overestimation problem (using all losses). Otherwise the reader has to return to the text to remember which is which.

    Math clarity: Some of these terms (U, H, X) were not defined in the text.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper provides a clear problem statement along with a clear solution and numerical measurements backing it up. My primary concerns come from wanting to better understand the math and problem formulation as a whole.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal has addressed all my concerns, and I believe the problem of overestimation, and this paper’s proposed approach are of use to the MICCAI community.




Author Feedback

We thank the reviewers for their constructive comments to strengthen our submission. We are modifying our paper in accordance with all the comments.

Reviewer#1 Q1: Limited analysis on the individual contributions of the model components. A1: We apologize for omitting this analysis. We believe accurate monocular depth prediction depends on precisely extracting local and global structural features and fusing them in endoscopic images. DCB accurately extracts multiscale local texture features, while HTB perceives global spatial and temporal features. The residual attention fusion (RAF) block aggregates the multiscale local features and global spatial-temporal features for decoding depth information. We will add more details in our revision.

Q2: Do not provide benchmarks on computational cost. A2: The computational cost (floating point operations) of DCHT is 186 GFLOPs for an input of three images at a resolution of 256 × 256. The inference speed on an RTX 3090 GPU is 36 frames per second, which meets the real-time requirement in clinical applications. We will add these descriptions in our revision.
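To make the real-time claim easier to check, the two reported figures can be turned into a per-frame latency and a sustained compute rate. The sketch below is only arithmetic on the numbers quoted in the rebuttal. It assumes that each output frame costs one forward pass of 186 GFLOPs, which the rebuttal does not state explicitly, and the 30 fps video rate used as a budget is a hypothetical reference point rather than a figure from the paper.

```python
# Figures quoted in the rebuttal.
GFLOPS_PER_PASS = 186      # three-image input at 256 x 256
FRAMES_PER_SECOND = 36     # reported on an RTX 3090

# Hypothetical reference: a 30 fps endoscopic video stream (assumption).
VIDEO_FPS_BUDGET = 30


def latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given frame rate."""
    return 1000.0 / fps


def sustained_gflops_per_second(gflops_per_pass: float, fps: float) -> float:
    """Compute rate needed if every frame costs one forward pass (assumption)."""
    return gflops_per_pass * fps


per_frame = latency_ms(FRAMES_PER_SECOND)
budget = latency_ms(VIDEO_FPS_BUDGET)
rate = sustained_gflops_per_second(GFLOPS_PER_PASS, FRAMES_PER_SECOND)

print(f"latency per frame: {per_frame:.1f} ms")        # 27.8 ms
print(f"budget at 30 fps:  {budget:.1f} ms")           # 33.3 ms
print(f"sustained compute: {rate:.0f} GFLOP/s")        # 6696 GFLOP/s
print(f"within budget:     {per_frame <= budget}")     # True
```

Under these assumptions the model leaves roughly 5.5 ms of headroom per frame against a 30 fps stream on the reported GPU; other hardware or higher input resolutions would need their own measurement.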

Q3: The assumption may not generalize well to other surgical scenes. A3: We would like to say that the assumption is applicable to other surgical scenarios such as colonoscopy, ureteroscopy, and bronchoscopy, since these are natural orifice translumenal endoscopic surgical procedures in which the endoscope is flown through tubular structures. The overestimation problem occurs in these procedures, and the corresponding solution applies to them as well.

Reviewer#2 Q1: Should provide more in-depth discussion of failure modes. A1: We will add some endoscopic images where our method cannot completely mask out relatively static regions due to photometric variations, artifacts, and occlusions.

Q2: The design of the mask should be explained more clearly. A2: It is true that our mask design uses a floating-point range from 0 to 1 to represent the degree of motion at each position. Therefore, the mask can be seen as a weight map that participates in the photometric loss. We found this way performs better than directly setting a threshold to obtain a binary (0 or 1) mask because this allows relatively static regions to still provide partial gradient information, leading to more stable model training.
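The difference between a floating-point weight map and a thresholded mask can be illustrated with a small NumPy sketch. This is not the paper's PSC mask: the per-pixel `motion` measure, the linear ramp, the threshold `tau`, and all function names are assumptions chosen only to show why a soft mask still passes partial gradient from nearly static pixels.

```python
import numpy as np


def soft_mask(motion: np.ndarray, tau: float) -> np.ndarray:
    """Weight in [0, 1]: grows linearly with motion, saturates at tau (assumed form)."""
    return np.clip(motion / tau, 0.0, 1.0)


def binary_mask(motion: np.ndarray, tau: float) -> np.ndarray:
    """Hard 0/1 mask: a pixel counts only if its motion reaches tau."""
    return (motion >= tau).astype(np.float64)


def masked_loss(error: np.ndarray, mask: np.ndarray) -> float:
    """Mean of mask-weighted per-pixel photometric error.

    The gradient of this loss with respect to error[i] is mask[i] / N,
    so a pixel with mask 0 contributes no gradient at all.
    """
    return float(np.mean(mask * error))


# Four pixels with increasing apparent motion (hypothetical values, in pixels).
motion = np.array([0.0, 0.2, 0.5, 1.5])
error = np.full(4, 0.4)  # identical photometric error everywhere
tau = 1.0

w_soft = soft_mask(motion, tau)      # 0, 0.2, 0.5, 1
w_hard = binary_mask(motion, tau)    # 0, 0, 0, 1

print("soft weights:  ", w_soft)
print("binary weights:", w_hard)
print("soft loss:  ", masked_loss(error, w_soft))
print("binary loss:", masked_loss(error, w_hard))
```

In this toy case the two middle pixels are ignored entirely by the binary mask but keep weights of 0.2 and 0.5 under the soft mask, which matches the rebuttal's point that relatively static regions can still provide partial gradient information.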

Q3: More detailed explanation of the over-estimation problem. A3: Thanks for the constructive comments. We believe that the overestimation problem is complicated. On one hand, it arises because surgeons usually operate the endoscopic camera along the centerlines of tubular organs and barely change its direction. As a result, some regions orthogonal to the direction of the endoscope trajectory (white-square regions in Fig. 1) are relatively static in consecutive frames. On the other hand, using the photometric loss can cause the optimizer to converge locally, which brings about the overestimation problem, as we also prove theoretically in Eq. (7) on page 4. Additionally, we will further investigate this problem within the scientific community.
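One contributing mechanism can be illustrated with plain pinhole geometry; this is a sketch and not a reproduction of the paper's Eq. (7). For a camera that translates forward by t along its optical axis, a point at lateral offset X and depth Z projects to u = f X / Z before the move and u' = f X / (Z - t) after it, so the image shift is f X t / (Z (Z - t)). The focal length, offsets, and depths below are hypothetical values picked for illustration.

```python
def pixel_shift(f_px: float, x_mm: float, z_mm: float, t_mm: float) -> float:
    """Image shift (pixels) of a static point when the camera moves forward by t_mm.

    Pinhole model with pure translation along the optical axis (assumption).
    """
    if z_mm <= t_mm:
        raise ValueError("point must stay in front of the camera after the move")
    before = f_px * x_mm / z_mm
    after = f_px * x_mm / (z_mm - t_mm)
    return after - before


# Hypothetical setup: 200 px focal length, 5 mm lateral offset, 1 mm forward step.
F_PX, X_MM, T_MM = 200.0, 5.0, 1.0

for z in (10.0, 100.0, 200.0):
    shift = pixel_shift(F_PX, X_MM, z, T_MM)
    print(f"depth {z:6.1f} mm -> shift {shift:.3f} px")
```

With these numbers a near point at 10 mm moves by about 11 px, while points at 100 mm and 200 mm both move by roughly a tenth of a pixel or less. Doubling the predicted depth of a far-field pixel therefore barely changes the reprojection, so a photometric loss alone gives little signal against an overestimated depth there.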

Reviewer#3 Q1: How HTB can capture illumination variations is unclear. A1: We apologize for the confusion. More precisely, HTB captures or perceives structural features in endoscopic images under illumination variations. It employs pyramid transformer blocks to extract global spatial and temporal features in the structural regions. Additionally, through interaction among frame-level tokens, the model can perceive photometric variations. Hence, our model remains robust to illumination changes when extracting structural features.

Q2: Results and contents should be presented more clearly. A2: We apologize for the unclear presentation. We have proofread our submission in accordance with your comments to rephrase and clarify unclear claims, fix typos and grammatical errors, and improve readability. We will make our code and data publicly available soon.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The three reviewers agree the targeted problem of depth overestimation in static regions during robotic-assisted surgery is critical (R3) and has been well identified (R1) and clearly stated (R2). There seem to be some novel aspects to the proposed methodology and a good experimental evaluation. Questions regarding the inference time were addressed in the rebuttal, indicating the proposed solution is real-time. Discussion of failure cases and improved explanations and analysis of the individual contributions (ablation) should be added to the final version of the paper. The three reviewers have noted limited reproducibility (R1, R2, R3); we encourage the authors to make their approach available to the community.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have read the manuscript, the review comments, and the rebuttal letter. The reviews are mixed: one reject and two accepts. The rejecting review mainly concerns the presentation of content. The remaining comments have been addressed. Thus, this meta-reviewer believes the authors did a good job and need to modify their manuscript according to the reviews.


