Abstract

Video depth estimation has been applied to various endoscopy tasks, such as reconstruction, navigation, and surgery. Recently, many methods focus on directly applying or adapting depth estimation foundation models to endoscopy scenes. However, these methods do not consider temporal information, leading to inconsistent predictions. We propose Endoscopic Depth Any Video (EndoDAV) to estimate spatially accurate and temporally consistent endoscopic video depth, which significantly expands the usability of depth estimation in downstream tasks. Specifically, we parameter-efficiently finetune a video depth estimation foundation model to endoscopy scenes, utilizing a self-supervised depth estimation framework which simultaneously learns depth and camera pose. Considering the distinct characteristics of endoscopic videos compared to common videos, we further design a novel loss function and a depth alignment inference strategy to enhance the temporal consistency. Experiments on two public endoscopy datasets demonstrate that our method presents superior performance in both spatial accuracy and temporal consistency. Code is available at~\url{https://github.com/Zanue/EndoDAV}.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1355_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1355_supp.zip

Link to the Code Repository

https://github.com/Zanue/EndoDAV

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhoZan_EndoDAV_MICCAI2025,
        author = { Zhou, Zanwei and Yang, Chen and Yang, Piao and Yang, Xiaokang and Shen, Wei},
        title = { { EndoDAV: Depth Any Video in Endoscopy with Spatiotemporal Accuracy } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {192--201}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present an endoscopic depth estimation method that finetunes a Video Depth Anything model in a self-supervised, parameter-efficient manner, using a projection loss and a depth alignment strategy to generate temporally consistent depth estimates for a video sequence.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Authors explain the limitations of the methods they build upon clearly and motivate why their proposed loss function or alignment strategy is necessary.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the authors motivate their contributions well, several details of their method appear to be missing. For instance, Fig. 1 indicates an Image Reconstruction Loss, but this loss is not explained anywhere in the text. Even if it is not a contribution of this paper, the authors should either briefly explain the loss (preferred) or cite a reference for it. Other details, such as what the variables L and T are set to, are also not specified.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • In L_proj (eq. 3), can the authors explain why they don’t instead compute a loss between z_{s->t} and z_t?
    • “With enough overlapped frames, the adjacent depth snippets are aligned accurately” - what is enough overlapped frames? Authors should specify somewhere what variables like L and T are set to.

    Typos:

    • Section 2.2, line 1: finetuning should be changed to finetune, i.e., “We parameter-efficiently finetune Video…”
    • Section 3.2, Quanlitative results, line 5: issue should be tissue, i.e., “Depth of the soft tissue in the left…”
    • Section 3.3, line 2: ith should be with, i.e., “We set four experiments: with/without projection…”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In its current state, there is not enough information to have a complete picture of the implementation described in this paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    Monocular depth prediction for endoscopic scenes. The authors propose a self-supervised depth estimation framework to fine-tune the state-of-the-art Video Depth Anything model on long endoscopic videos. The training process is parameter-efficient: by adding SSB-LoRA layers, they train only 0.17% of the parameters of the pretrained model. They also propose two strategies to keep the predictions temporally consistent: first, a custom projection loss, and second, a depth alignment inference strategy that exploits the fact that the camera during laparoscopic surgery does not move often and, when it does, moves slowly and progressively.
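    To make the parameter-efficiency concrete, below is a minimal, hypothetical PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer, in the spirit of the SSB-LoRA layers described above. Class and parameter names are illustrative only; the authors' SSB-specific design is not detailed on this page.

```python
# Hypothetical sketch of a low-rank adapter (LoRA) around a frozen linear
# layer, illustrating why so few parameters are trained. Names are ours,
# not the authors' SSB-LoRA implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + trainable low-rank update
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one projection and check the trainable fraction.
layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```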

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Adapted fine-tuning framework for a specific use case: endoscopic videos
    • Results are compared with SOTA methods and demonstrate robustness
    • Evaluation metrics test depth estimation, camera pose, and depth sequence alignment
    • The number of fine-tuned parameters is much smaller than in other methods (as they use SSB-LoRA layers)
    • Trained on publicly available datasets
    • Method uses single image sequences to estimate depth maps and camera poses of endoscopic scenes
    • Proposed loss function and depth alignment strategy designed specifically for their use case
    • Positive ablation study on the proposed loss function and alignment strategy, demonstrating that the proposed components have a positive influence on final model performance
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • No mention if the code/models will be made available upon publication
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Well written paper presenting a novel method for a relevant problem.
    • Evaluation on secondary data, i.e. dataset not used for training, shows promising generalizability.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    In my opinion the authors were able to answer the open questions satisfactorily.



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the development of EndoDAV (Depth Any Video in Endoscopy), a self-supervised and parameter-efficient framework for monocular video depth estimation in endoscopic scenes. The method builds upon the Video Depth Anything foundation model (2025) and introduces a fine-tuning strategy using SSB-LoRA, enabling adaptation with 0.17% of trainable parameters. To address the specific challenges of endoscopic video—such as soft tissue deformation and gradual camera motion—the authors incorporate two key components: 1) A novel projection loss that leverages predicted camera poses to enforce temporal consistency by aligning depth maps between adjacent frames. 2) A depth alignment inference strategy that stitches overlapping video snippets, correcting for scale and shift discrepancies to ensure consistent depth predictions over long video sequences. The approach is evaluated on the SCARED and Hamlyn datasets, demonstrating improved spatial accuracy and temporal consistency compared to existing methods, while maintaining efficient inference performance. The proposed framework offers an advance for depth estimation in surgical and endoscopic applications.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Parameter-Efficient Adaptation for Endoscopy: The paper’s fine-tuning strategy using SSB-LoRA, which adapts a foundation video depth model to endoscopic scenes with only 0.17% of trainable parameters, is particularly innovative. This efficient adaptation is impactful in the medical imaging domain, where computational resources are often constrained. 2) Temporal Consistency Enhancement: The introduction of a projection loss that enforces temporal depth consistency by leveraging relative pose and depth information between adjacent frames is a key contribution. This mechanism is crucial in endoscopy, where frame-to-frame continuity supports tasks such as navigation and 3D reconstruction. 3) Inference-Aware Design with Depth Alignment Strategy: The proposed depth alignment strategy, which uses overlapping video snippets and scale-shift normalization to ensure smooth transitions between depth segments, offers a practical and lightweight solution compared to more computationally intensive temporal models or post-processing methods. 4) Strong Experimental Validation: The framework is comprehensively evaluated on two publicly available endoscopic datasets (SCARED and Hamlyn), demonstrating consistent improvements over prior methods across multiple spatial and temporal metrics. The combination of quantitative and qualitative results bolsters the paper’s claims. 5) Applicability: With an inference speed of approximately 16 ms per frame, the method meets the critical real-time requirements for practical deployment in minimally invasive procedures, making it a viable option for future clinical use.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Moderate Concerns:

    • Clinical Impact Evaluation: The paper does not discuss how improved depth estimation could benefit downstream tasks (e.g., tool tracking, organ reconstruction, lesion measurement).
    • Experimental setup: The authors follow established splits and protocols from the literature. However, it might benefit from additional clarification on whether all baseline results were reproduced using the authors’ implementation of pretrained models or if some values were directly taken from the original publications. This detail would further strengthen the reproducibility and transparency of the comparison.
    • Real-World Evaluation: While the method is validated on two public datasets captured under controlled conditions, there is no discussion of performance in real clinical scenarios, where challenging conditions could affect robustness and generalizability.

    2) Minor Concerns:
    • Sensitivity Analysis: Further details on how tuning parameters (e.g., the LoRA percentage) influence performance would support reproducibility and understanding.
    • Reference Renumbering: The references are not arranged in the order they appear in the text (e.g., the introduction begins with “[2]” instead of “[1]”). Renumbering is needed for consistency (see LNCS style).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    On balance, the paper presents a solid and well-executed contribution with clear practical relevance. Its strengths, including efficient use of a foundation model, strong performance improvements in spatial accuracy and temporal consistency, and real-time capability, demonstrate that the authors effectively address the critical issue of temporal inconsistency in endoscopic depth estimation. While the method advances state-of-the-art performance, it does so in an incremental manner rather than redefining the problem, and its immediate clinical applicability is unverified without further evaluation in real-world settings. In summary, the paper is technically sound and shows promising results, justifying a reasonably favorable recommendation. The overall score reflects this balance: the work is a worthwhile contribution to endoscopic vision research based on its efficiency and performance, and with further validation and deeper exploration of clinical integration, it has the potential to make a significant impact in real-world environments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal clearly and convincingly addresses the primary concerns, particularly those related to reproducibility, methodological clarity, and evaluation scope. The authors confirmed that they will release the code and pretrained weights, significantly enhancing the reproducibility of the work. They also clarified how the image reconstruction loss is computed and provided values for key hyperparameters, such as window size and temporal overlap. Additionally, the authors expanded on important question, including the rationale behind the design of the image reconstruction loss, the use of temporal consistency during training and inference, and how improved depth estimation supports downstream clinical tasks, such as organ reconstruction demonstrated through performance improvements on the EndoNeRF and Endo4DGS datasets. Given the quality of the rebuttal, the strength of the technical contributions, and the clinical relevance of the proposed approach, I maintain my initial score and recommend acceptance.




Author Feedback

We sincerely appreciate the reviewers’ insightful comments and constructive feedback. The reviewers have acknowledged the well-organized paper structure (R1, R3), clear motivation (R1, R2, R3) and solid contributions (R1, R3). We provide explanations to address the concerns below (W represents weakness, C represents additional comments):

[R1.W1&R2&R3: Reproducibility] To ensure full reproducibility and contribute to the community, we will open-source both our code and model weights in the near future.

[R2.W1: Image Reconstruction Loss] We use the same loss function as AF-SfMLearner and EndoDAC, which is a combination of an L1 term and a structural similarity (SSIM) term: L_p = \alpha * (1 - SSIM(I_t, I_{s->t})) / 2 + (1 - \alpha) * L1(I_t, I_{s->t}), where I_{s->t} is the image reprojected from I_s to I_t, and \alpha is set to 0.85.
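For concreteness, a minimal PyTorch sketch of this reconstruction loss in the Monodepth-style convention used by AF-SfMLearner; the 3x3 average-pool SSIM window and the clamp are our assumptions, not stated in the rebuttal.

```python
# Minimal sketch of the rebuttal's reconstruction loss L_p (SSIM + L1).
# The 3x3 average-pool SSIM window and the clamp are assumptions.
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-pixel SSIM computed with a 3x3 average-pool window."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def reconstruction_loss(i_t, i_s2t, alpha: float = 0.85):
    """L_p = alpha * (1 - SSIM(I_t, I_{s->t})) / 2 + (1 - alpha) * |I_t - I_{s->t}|."""
    ssim_term = (1 - ssim(i_t, i_s2t)) / 2
    l1_term = (i_t - i_s2t).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()
```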

[R2.W2&C2: What are enough overlapped frames] We use a sliding window of size 32 to process the input video. The frames in each window consist of T=2 keyframes from the previous video snippet and 30 frames of the current video snippet, where the first L=8 frames are the overlapped frames.
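A small, hypothetical NumPy sketch of how the overlapped frames can be used to align adjacent depth snippets via a least-squares scale and shift; the exact estimator and the handling of the T=2 keyframes are not specified here, so this is illustrative only.

```python
# Hypothetical sketch: align adjacent depth snippets on their L overlapped
# frames with a least-squares scale and shift, then stitch them together.
import numpy as np

def align_scale_shift(prev_overlap: np.ndarray, curr_overlap: np.ndarray):
    """Fit s, b so that s * curr_overlap + b best matches prev_overlap."""
    x, y = curr_overlap.reshape(-1), prev_overlap.reshape(-1)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, b

def stitch_snippets(snippets, overlap: int = 8):
    """Stitch depth snippets (each of shape (n_frames, H, W)) into one video."""
    out = list(snippets[0])
    for snip in snippets[1:]:
        s, b = align_scale_shift(np.stack(out[-overlap:]), snip[:overlap])
        aligned = s * snip + b
        out.extend(aligned[overlap:])  # keep only the new, non-overlapped frames
    return np.stack(out)
```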

[R2.C1: Why not compute a loss between z_{s->t} and z_t directly] There is no direct way to calculate a loss between z_{s->t} and z_t: since the depth map consists of discrete pixels, the pixel coordinates of z_{s->t} do not correspond to the pixel coordinates of z_t. Therefore, we need to resample z_t at the projected pixel coordinates u_{s->t}.
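A hypothetical PyTorch sketch of this project-then-resample step; the pinhole-camera model, variable names, and the omission of an out-of-bounds mask are our assumptions, not the authors' code.

```python
# Hypothetical sketch: warp the source depth z_s into the target view with
# pose T_s2t and intrinsics K, then resample z_t at the projected
# coordinates u_{s->t} via grid_sample before comparing.
import torch
import torch.nn.functional as F

def projection_loss(z_s, z_t, K, K_inv, T_s2t):
    """z_s, z_t: (B,1,H,W) depths; K, K_inv: (B,3,3); T_s2t: (B,4,4)."""
    b, _, h, w = z_s.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()  # (3,H,W)
    pix = pix.view(1, 3, -1).expand(b, -1, -1)                   # (B,3,HW)
    cam = (K_inv @ pix) * z_s.view(b, 1, -1)                     # back-project
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)     # homogeneous
    cam_t = (T_s2t @ cam_h)[:, :3]                               # target frame
    z_s2t = cam_t[:, 2:3].view(b, 1, h, w)                       # projected depth
    uv = (K @ cam_t)[:, :2] / cam_t[:, 2:3].clamp(min=1e-6)      # u_{s->t}
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1)
    grid = (2 * grid - 1).view(b, h, w, 2)                       # to [-1, 1]
    z_t_resampled = F.grid_sample(z_t, grid, align_corners=True)
    return (z_s2t - z_t_resampled).abs().mean()
```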

[R3.W1: Clinical Impact Evaluation] The improved depth estimation can benefit downstream tasks such as organ reconstruction, where a more accurate and stable depth estimate enhances reconstruction quality. We tested our method on the Tearing Tissues scene from the EndoNeRF dataset, an endoscopic surgery video: we conducted the dynamic reconstruction task with Endo-4DGS and replaced its depth estimation model with our EndoDAV. The PSNR increased from 30.57 to 30.96.

[R3.W2: Experimental setup] Video Depth Anything is used as a pretrained model without retraining. EndoDAC is trained from scratch under our experimental setup. We will clarify this in Sec. 3.2.

[R3.W3: Real-World Evaluation] Experiments on the SCARED and Hamlyn datasets, both composed of real endoscopic videos, demonstrate our method’s robustness to the conditions they represent. Thanks to the temporal attention module and the depth alignment strategy, our model can still produce relatively stable depth estimates under challenging conditions. Regarding generalizability, we agree that performance under limited fine-tuning data warrants further investigation in broader real clinical scenarios, which will be a dedicated focus of our future research and collaborations.

[R3.W4: Sensitivity Analysis] We did not specifically tune the hyperparameters, generally following previous methods’ default values. For example, following EndoDAC, we also set the SSB-LoRA rank to 4. Furthermore, we will release our code and model weights to ensure reproducibility.

[R2.W3&R3.W5: Typos and reference renumbering]
Thanks for pointing these out! We will address these issues in the camera-ready version of our paper.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have read the manuscript, the review comments, and the rebuttal letter. The reviews are mixed: one reject and two accepts. The two accepting reviews provide few weakness comments, while the rejecting review points out missing method details. None of them noticed that the results in Table 1 are inconsistent with Table 1 of EndoDAC. For example, on the SCARED dataset, Table 1 of EndoDAC reports for EndoDAC (Ours): Abs Rel 0.052; Sq Rel 0.362; RMSE 4.464; RMSE log 0.073; δ 0.979; Total.(M) 99.0; Train.(M) 1.6; Speed (ms) 17.7. Table 1 of this work reports: Abs Rel 0.201; Sq Rel 5.163; RMSE 16.421; RMSE log 0.238; δ 0.653; Total.(M) 99.0; Train.(M) 1.6; Speed (ms) 15.0. This work does not provide any explanation for these differences: EndoDAC has the same model size and number of trained parameters, yet very different results. Based on these concerns, this meta-reviewer recommends rejection.


