Abstract

Accurate depth and camera pose estimation is essential for achieving high-quality 3D visualisations in robotic-assisted surgery. Despite recent advancements in foundation model adaptation to monocular depth estimation of endoscopic scenes via self-supervised learning (SSL), no prior work has explored their use for pose estimation. These methods rely on low rank-based adaptation approaches, which constrain model updates to a low-rank space. We propose Endo-FASt3r, the first monocular SSL depth and pose estimation framework that uses foundation models for both tasks. We extend the Reloc3r relative pose estimation foundation model by designing Reloc3rX, introducing modifications necessary for convergence in SSL. We also present DoMoRA, a novel adaptation technique that enables higher-rank updates and faster convergence. Experiments on the SCARED dataset show that Endo-FASt3r achieves a substantial 10% improvement in pose estimation and a 2% improvement in depth estimation over prior work. Similar performance gains on the Hamlyn and StereoMIS datasets reinforce the generalizability of Endo-FASt3r across different datasets.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4574_paper.pdf

SharedIt Link: https://rdcu.be/eHw5X

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05141-7_12

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Mona-ShZeinoddin/Endo_FASt3r.git

Link to the Dataset(s)

SCARED dataset: https://endovissub2019-scared.grand-challenge.org/

Hamlyn dataset: https://hamlyn.doc.ic.ac.uk/vision/

StereoMIS dataset: https://zenodo.org/records/8154924

BibTex

@InProceedings{SheMon_EndoFASt3r_MICCAI2025,
        author = { Sheikh Zeinoddin, Mona AND Hoque, Mobarak I. AND Tandogdu, Zafer AND Shaw, Greg L. AND Clarkson, Matthew J. AND Mazomenos, Evangelos B. AND Stoyanov, Danail},
        title = { { Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {117 -- 126}
}

Reviews

Review #1

Please describe the contribution of the paper

The main contribution of this paper is the introduction of Endo-FASt3r, the first self-supervised learning (SSL) framework that leverages foundation models for both monocular depth and pose estimation in endoscopic scenes.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Key contributions include (i) Reloc3rX, an adaptation of the Reloc3r pose model tailored for self-supervised settings with axis-angle and scale-aware modifications, and (ii) DoMoRA, a new adaptation technique combining low- and full-rank updates for efficient and expressive fine-tuning. The method shows strong performance across three diverse surgical datasets, achieving up to 10% improvement in pose accuracy and 2% in depth.
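For readers unfamiliar with the axis-angle rotation parameterisation mentioned above, the following is a minimal sketch of how a 3-vector in axis-angle form maps to a rotation matrix via the Rodrigues formula. It is a generic illustration only: the function name `axis_angle_to_matrix` is hypothetical, and nothing here is taken from the Reloc3rX implementation, whose exact pose head is not described on this page.

```python
import numpy as np

def axis_angle_to_matrix(v):
    """Convert an axis-angle vector (direction = axis, norm = angle in radians)
    to a 3x3 rotation matrix using the Rodrigues formula."""
    v = np.asarray(v, dtype=float)
    theta = np.linalg.norm(v)
    if theta < 1e-8:
        # Near-zero rotation: return the identity to avoid dividing by zero.
        return np.eye(3)
    k = v / theta  # unit rotation axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A quarter turn about the z-axis maps the x-axis onto the y-axis.
R = axis_angle_to_matrix([0.0, 0.0, np.pi / 2])
rotated_x = R @ np.array([1.0, 0.0, 0.0])
```

A three-parameter output of this kind has no unit-norm constraint to enforce, which is one commonly cited reason for preferring it as a regression target; whether that is the authors' motivation is an assumption.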
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The paper does not sufficiently address potential limitations or failure scenarios. For instance, there is no discussion of how the model handles extreme occlusions or challenging intraoperative conditions such as smoke, bleeding, or surgical tools. How does the method perform under fast camera motion or dynamic scene changes?

Although DoMoRA is presented as a novel hybrid adaptation mechanism that combines DoRA and MoRA, its added complexity is not justified. The rationale for blending full-rank and low-rank updates is only briefly mentioned, and the reported gains over simpler approaches like DoRA alone are relatively modest. There is no comparison to more recent and competitive PEFT methods, such as QLoRA or adapter tuning techniques widely used with vision transformers. The ablation (Table 2) does not make it clear how DoMoRA interacts with different components of the transformer, such as the Q, K, and V matrices, or its role across layers. A deeper breakdown could help explain why DoMoRA performs better. Overall, the paper reads more as a combination of existing techniques rather than a clearly motivated, novel framework. The design choices feel arbitrary without stronger empirical or theoretical justification.

Despite addressing a structure-from-motion problem, the framework processes only isolated image pairs and lacks any explicit temporal consistency modeling. This is a major omission for surgical navigation, where smooth and stable camera tracking is critical, especially for pose prediction.

While the framework is tested on three datasets, all evaluations are confined to a specific surgical domain: laparoscopic RAS using the da Vinci platform. There is no validation on other endoscopic modalities, such as bronchoscopy, cystoscopy, or colonoscopy. This raises concerns about the model’s generalizability. A method cannot claim to be a general-purpose or “foundation” model for endoscopy if it has only been evaluated on one procedure type.

Key figures (e.g., Fig. 1 and Fig. 2) are dense and visually overloaded, making it difficult to follow the architectural layout and parameter flow. Important components (e.g., transformer adaptations and pose regression details) are buried in cluttered visuals. A cleaner, modular diagram would significantly improve readability and help

While the paper uses established metrics such as AbsRel and ATE, these do not fully capture the clinical or spatial relevance of the predictions. Including scene reconstruction quality metrics (e.g., surface completeness, point cloud alignment) or task-specific performance indicators (e.g., navigation accuracy) would provide more insight into the method’s real-world value.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(2) Reject — should be rejected, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Lack of validation outside the training domain, limited analysis of failure cases, and an overly complex adaptation mechanism without deep justification.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This work presents a method for joint monocular depth and pose estimation for endoscopic images. The method employs existing general-purpose foundation models both for depth and pose while proposing a series of incremental edits to improve their performance when applied to the surgical domain. Benchmarks with SOTA on public datasets as well as ablation studies are reported.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript reads well, and it is well-structured. Figures and tables are of sufficient readability.
- The proposed method presents the first application of a foundation model for pose estimation in endoscopy, providing detailed insights into the limitations of existing alternatives and on the necessary upgrades to adapt SOTA foundation models to the specific application domain.
- Qualitative results for pose estimation (plotted trajectories vs. GT) are significantly better than the compared second-best method, showing the advantage of introducing and adapting the foundation model pose regressor Reloc3rX.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The qualitative and quantitative results for depth estimation show very marginal improvement w.r.t. the second-best compared method (2% SCARED, 1.19% Hamlyn). Even though depth estimation is not the innovative focus of the contribution, the marginal advantages raise doubt about the actual strength of the proposed DoMoRA over LoRA, which the authors nevertheless highlight as the second major contribution of the paper.
- The quantitative pose estimation results (7-10%) on the absolute trajectory metric seem marginal compared with the clearly visible advantage in qualitative trajectory accuracy (in the plotted 3D trajectories against GT), raising concern about the appropriateness of reporting only this metric. Could there be a better metric to quantify the advantage that is clear when looking at the trajectory plots?
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Regarding the two sentences: “…enables higher-rank updates and faster convergence.” and “We also introduce DoMoRA, a novel adaptation technique that enables both low-rank and full-rank updates while benefitting from faster convergence.” I am not sure how to evaluate if that actually leads to a faster convergence. Where is this faster convergence capability shown/proven in the manuscript?

Concerning Table 1, why is inference time reported only for the depth module and not for pose estimation? Given that the pose estimation is the main goal/innovation in the proposed method, why are inference times for Reloc3rX pose estimation omitted? Is that still real-time as the depth module?

Concerning Table 2, why is there no ablation case using Reloc3rX with LoRA? This would have given valuable insight into the individual contributions of Reloc3rX and DoMoRA, and not simply into the marginal contributions of DoRA Reloc3rX vs DARES or the internal combinations of DoRA and MoRA in Reloc3rX.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I am overall happy with the presented work. Even though I do not think it represents revolutionary ideas for the field, I still believe there should be space for incremental work that builds on and adapts existing SOTA methods to specific application domains such as minimally invasive and robot-assisted surgical imaging, provided that the methods are fully justified and backed by solid quantitative results. I think the manuscript can be accepted provided the authors clarify the performance of the presented methods and the metrics used, as requested in my comments and feedback above.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

This paper presents an SSL-based method for depth and camera pose estimation in endoscopic robot-assisted surgery applications. The main contribution of this paper is that both the depth estimation and pose estimation modules are transformer-based, such that both modules can be adapted in the same manner.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
There are two main strengths of the paper:
1. Both the depth estimation and pose estimation modules are transformer-based, such that they can be adapted with a unified adaptation method.
2. The proposed adaptation method, DoMoRA, takes advantage of two existing methods, DoRA and MoRA, for better adaptation.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The main weakness of this paper lies in its technical novelty. Compared to DARES, one of the SOTA methods, the main changes are that the PoseNet is replaced by a revised version of the Reloc3r network and the Vector-LoRA adaptation is replaced by DoMoRA. The evaluation results confirm a slight improvement due to these changes.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. If space permits, I would suggest elaborating on the training details. For example, it’s unclear whether the two modules were trained from scratch or adapted from existing models pretrained with non-RAS data.
2. Please comment on the discrepancy between the quantitative comparison (small difference) and qualitative comparison (huge difference) against the DARES method.
3. In Fig. 4, it would be nice to show the GT depth map for visual comparison if available.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although the technical novelty is not significant, the new combinations of transformer-based depth and pose modules with unified adaptation framework could be of interest in RAS applications.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #4

Please describe the contribution of the paper

The authors present a redesigned version of Reloc3r called Reloc3rX and claim to be the first to use a foundation model for pose estimation in the robot-assisted surgery scenario. They also propose a novel PEFT method, DoMoRA, which is a combination of DoRA and MoRA, and report 7~10% improvements in pose estimation and a 2% improvement in absolute error for depth estimation.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Authors tried to ensure comprehensive evaluation on multiple datasets (SCARED, Hamlyn, and StereoMIS), ensuring the generalizability and reliability of the results across different scenarios.
2. The authors expanded the realm of foundation models, which is a significant effort given the increasing importance of foundation models in general vision tasks
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. As the paper is structured, this is clearly an adaptation paper that repurposes foundation models for a new domain; the core novelty is therefore the PEFT adaptation, and the change in modeling is limited. The dataset selection criteria should therefore be carefully checked to really validate the foundation model performance in the RAS setting.
2. The figures should be redesigned for ease of reading. They need to be self-explanatory, or proper explanations should be provided to fully harness their power.
3. The PEFT novelty is questionable, since I came across a discussion at https://github.com/kongds/MoRA/issues/10 which is essentially a variant of MoRA combined with a DoRA implementation.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Based on the strengths and weaknesses stated above.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We thank the AC and reviewers for their time. We are very glad that the reviewers recognize the novelty (R4-5, R3-5, R2-6, R1-5), the rigor of our experimental design (R4-6, R2-6), the structure (R2-6), the writing quality (R2-5), and the significance of the topic explored (R4-6, R1-6). We would like to provide some clarification regarding the received comments:

R1,R2-How do the performance gains of DoMoRA justify its added complexity over DoRA/LoRA alone?

While the improvement in depth estimation between DoRA+Reloc3rX and DoMoRA+Reloc3rX is modest (~1.9%), the gain in pose estimation is notably larger: 4.5% and 9.9% on T1 and T2, respectively. This discrepancy arises because pose estimation, which involves capturing global scene geometry, benefits more from the higher-rank updates of DoMoRA than depth estimation, which relies on local features.
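To make the notion of "higher-rank updates" concrete, here is a small numerical sketch comparing a LoRA-style low-rank update with a MoRA-style square update of comparable parameter count. It is an illustration under assumptions, not the authors' DoMoRA: the layer size, the rank `r`, and the helper names are hypothetical, and the compression/decompression operators that map a square matrix onto the full weight shape, as well as DoRA's magnitude/direction decomposition, are omitted.

```python
import numpy as np

def lora_update(d_out, d_in, r, rng):
    """LoRA-style update B @ A; its rank can never exceed r."""
    B = rng.standard_normal((d_out, r))
    A = rng.standard_normal((r, d_in))
    return B @ A

def square_update_size(d_out, d_in, r):
    """Side length of the largest square matrix whose parameter count
    does not exceed that of a rank-r LoRA update, (d_out + d_in) * r."""
    return int(np.floor(np.sqrt((d_out + d_in) * r)))

rng = np.random.default_rng(0)
d_out = d_in = 64   # hypothetical layer size
r = 4               # hypothetical LoRA rank

delta_lora = lora_update(d_out, d_in, r, rng)
r_hat = square_update_size(d_out, d_in, r)
M = rng.standard_normal((r_hat, r_hat))  # trainable square matrix

lora_params = (d_out + d_in) * r
square_params = r_hat * r_hat
lora_rank = np.linalg.matrix_rank(delta_lora)
square_rank = np.linalg.matrix_rank(M)
print(f"LoRA:   {lora_params} params, rank {lora_rank}")
print(f"Square: {square_params} params, rank {square_rank}")
```

With these settings the square matrix uses slightly fewer parameters than the low-rank pair yet spans a much higher-dimensional update space, which is the general intuition behind the argument that tasks depending on global structure may benefit more.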

R1,R4-Key figures are dense and need more explanations:

We will add more explanation in the caption of Fig.1 and remove the transformers q,k,v details as we have also explained it in Fig.2.

R1,R2-Use of alternative metrics for evaluation that are better suited to RAS:

We follow standard practice in SSL depth and pose estimation, consistent with prior works such as [22,5,6,29], which rely on AbsRel and ATE. We acknowledge the need for better metrics, but developing them is not the main aim of our study. We will address this in future work.
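For reference, the two metrics named above can be sketched as follows. This is one common convention, written under assumptions: AbsRel is the mean of |pred - gt| / gt over valid pixels, and ATE is computed here as the RMSE of translation error after anchoring both trajectories at their first frame and applying a least-squares scale factor. The function names are hypothetical, and the exact evaluation protocol used in the paper (snippet length, alignment, normalisation) is not stated on this page.

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Absolute relative depth error, averaged over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if mask is None:
        mask = gt > 0  # assume non-positive ground truth marks missing depth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def ate_rmse(pred_xyz, gt_xyz):
    """Trajectory RMSE after first-frame anchoring and scale alignment.

    pred_xyz, gt_xyz: (N, 3) arrays of camera positions.
    """
    pred = np.asarray(pred_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    pred = pred - pred[0]  # anchor both trajectories at the origin
    gt = gt - gt[0]
    # Least-squares scale, since monocular predictions have arbitrary scale.
    scale = np.sum(gt * pred) / np.sum(pred * pred)
    err = gt - scale * pred
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

# Toy example: depth overestimated by 10%, trajectory off only by scale.
depth_err = abs_rel([1.1, 2.2], [1.0, 2.0])
gt_traj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
traj_err = ate_rmse(0.5 * gt_traj, gt_traj)
```

Because the scale is factored out and errors are averaged over the whole trajectory, two trajectories that look quite different in a 3D plot can still yield similar values under this kind of metric.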

R1-Addressing challenging scenarios

While space constraints limited discussion of all scenarios, the evaluation does indeed address some challenging conditions typical in RAS. The StereoMIS dataset was intentionally selected for this reason, as highlighted in Section 3: “…. the StereoMIS dataset [9], which exhibits significant tissue deformation and camera motion in an in vivo setup….”

R1-How is the claim of generalizability or positioning the model as a foundation model for endoscopy justified, given that evaluations are limited to RAS and exclude other endoscopic modalities?

We do not claim to develop a general-purpose foundation model for all endoscopy, but rather “the first SSL depth and pose framework which uses foundation models for both tasks in the RAS domain”. Throughout the text, we emphasize our focus on RAS many times, rather than all possible endoscopic environments.

R2-Pose module inference time:

Consistent with prior works in SSL-based depth and pose estimation [22,29,6], we reported inference time only for the depth module. However, we confirm that Reloc3rX operates in real time (26 ms). We will clarify this in the revised text.

R3-It is unclear whether the two modules were trained from scratch or adapted from non-RAS pretrained models.

As stated in Sec. 2.1, both modules are based on foundation models pretrained on non-RAS data. As detailed in Sec. 2.2, the models are mostly frozen, with updates applied only to the adaptation layers and the specific modules noted. Other training details, such as the number of epochs and the learning rate, can be found in Sec. 3.

R3-Faster convergence properties not explained:

Endo-FASt3r converges after 10 epochs (as noted in the text), while the second-best approach, DARES, converges after 20 epochs. This will be noted in the text.

R3-GT depth maps not provided:

While GT is available for a few frames in the SCARED dataset, most frames rely on extrapolated depth from static scenes, resulting in sparse GT. For this reason, and consistent with prior works [22,29,6], we chose not to include them.

R4-GitHub link to informal discussion on PEFT design:

While we acknowledge informal discussions on similar PEFTs, to our knowledge, no formal implementation or peer-reviewed work exists.

R1,R2-Further ablations on the effect of DoMoRA on each q,k,v component and across layers:

To isolate the impact of DoRA, MoRA, and DoMoRA, we included four ablations. A full breakdown across q,k,v and all layers (12 in DA V2, 24 in Reloc3rX) would greatly expand the scope and is beyond the space available.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

This paper is well-written and clearly structured with solid baseline comparisons. I recommend that the authors clarify some questions raised by the reviewers (mainly in the contribution and limitations) and recompile the tables and figures to improve the readability in the camera-ready version.


Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion
