Abstract

Capturing the hand movements of physicians and their interactions with medical instruments plays a critical role in behavior analysis and surgical skill assessment. However, hand-instrument interaction in medical contexts is far more challenging than in general tasks. The weak texture and reflective properties of surgical instruments frequently result in failures in pose estimation. Moreover, the long, thin shape of the instruments and the sparse points of the reconstructed hand make it difficult to grasp the instrument accurately or may result in spatial penetration during interaction. To address failures in pose estimation, we build 3D models of medical instruments as priors to optimize instrument pose estimation. To resolve inaccurate grasping and minimize spatial penetration, we propose a contact-point-centered interaction module that refines the surface details of the fingers to improve the computation of the hand-instrument relationship. Experiments on medical scenario datasets demonstrate that our method achieves state-of-the-art performance.
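
For intuition only: below is a minimal sketch (not the paper's actual CPCI module) of the kind of penetration handling the abstract describes, assuming a watertight instrument mesh and the trimesh library; the function name and tolerance are illustrative.

```python
import numpy as np
import trimesh

def push_out_penetrating_vertices(hand_verts, instr_mesh, eps=1e-4):
    """Toy illustration: move hand vertices that lie inside the
    instrument mesh back to just outside its surface."""
    pq = trimesh.proximity.ProximityQuery(instr_mesh)
    sd = pq.signed_distance(hand_verts)    # trimesh: positive *inside* the mesh
    inside = sd > 0
    if not inside.any():
        return hand_verts
    closest, _, _ = pq.on_surface(hand_verts[inside])
    out_dir = closest - hand_verts[inside]            # inside -> surface points outward
    out_dir /= np.linalg.norm(out_dir, axis=1, keepdims=True) + 1e-12
    fixed = hand_verts.copy()
    fixed[inside] = closest + eps * out_dir           # nudge just past the surface
    return fixed
```

In practice such a projection step would be coupled with the pose and contact optimization rather than applied as a post-hoc fix.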

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2293_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{XuMia_Reconstructing_MICCAI2025,
        author = { Xu, Miao and Zhu, Xiangyu and Wu, Jinlin and Feng, Ming and Zang, Zelin and Liu, Hongbin and Lei, Zhen},
        title = { { Reconstructing 3D Hand-Instrument Interaction from a Single 2D Image in Medical Scenes } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {423 -- 433}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework capable of estimating hand-instrument interaction and pose. It performs interaction reconstruction and pose estimation on the back end, based on a parametric hand model and 3D models of surgical instruments. The highlight lies in its introduction of a contact-point-centered module, which achieves improved performance across some metrics compared to previous works.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The structure of the paper is clear and identifies an important challenge in surgical reconstruction.
    2. The experimental and evaluation work of the article was done well, comparing multiple existing methods and performing ablation experiments.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Of the three contributions listed by the authors, only the first is clearly articulated. The second—creating 3D models of surgical instruments—seems more like the work of a digital artist and lacks sufficient image sequences and corresponding valid parameters. The third claimed contribution is merely a presentation of experimental results. If these two points are counted as contributions, then in my view the work lacks real highlights. I suggest the authors revise the contribution statement to better showcase the truly noteworthy aspects of this work.

    2. Following up on the previous weakness regarding the insufficient presentation of contributions: Sections 2.1 and 2.2 describe how the MANO parametric hand model and static surgical instrument models are used for initialization within BundleSDF in the specific scenarios considered. The true novelty lies in the Contact-Point-Centered Interaction Module introduced in Section 2.3.

    3. As mentioned in the first point, the surgical instrument model itself is not sufficient to be considered one of the main contributions of a MICCAI paper. Even if it were counted as a contribution, it would likely be seen as a drawback, as it is neither novel nor complete. First, some of the multi-body surgical instruments in this dataset do not have the correct kinematic structure, and, judging from the material presented, the modeling accuracy is perhaps not good enough to achieve the refinement of the results claimed by the authors. Secondly, the authors neglected very similar prior work such as [Li, J., Zhou, Z., Yang, J., et al., MedShapeNet - A Large-Scale Dataset of 3D Medical Shapes for Computer Vision. arXiv preprint, 2023], a publicly available dataset that is more complete and more refined than this work.

    4. The latter part of Section 2.3 is not fully articulated. First, what exactly are the 2D vertices obtained from the rendered 2D images of the two hands and the instrument? Are they contour sampling points, or projections of the original 3D vertices onto the 2D plane? If the former, how are they registered with the 3D vertices? If the latter, many 3D vertices may be occluded after projection—how is registration handled in such cases? The paper mentions that only the overlapping region at the fingertips is calculated; specifically, which region is this, and how is it defined for each frame? The statement that the overlapping region is used to determine whether it is the left or right hand is somewhat confusing—how exactly is the fingertip region information utilized? Additionally, the description of the computation in the latter part is quite vague. While R and T are indeed solved using the least-squares method, the specific solving procedure and steps also need to be explained (a generic sketch of the standard least-squares solution is given after this list).

    5. Although the authors present a deep learning-based framework in Figure 1 and mention it in the keywords, it appears to refer mainly to the application of prior works such as MANO and BundleSDF; the three networks used in the method are simply taken from previous work. The actual computation takes place in Section 2.3, but, as noted in point 4, the computational process is not clearly explained.

    6. The final point is that, although the quantitative metrics reported in the paper are good, the qualitative results are not impressive. The video results show noticeable jitter in the hand and instrument, indicating a high variance in errors across consecutive frames—the performance is clearly suboptimal. This is quite evident in the video materials, although this alone should not serve as the definitive basis for evaluation.
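
For reference on the least-squares question raised in point 4: such a step typically reduces to the standard Kabsch/Procrustes solution for rigid alignment. A minimal sketch, assuming known 3D-3D correspondences (the paper's exact 2D-3D formulation may differ):

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, T) minimizing sum ||R @ p_i + T - q_i||^2,
    for corresponding point sets P, Q of shape (N, 3)."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = q_mean - R @ p_mean
    return R, T
```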

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper proposes a framework for estimating hand-instrument interaction and pose using a parametric hand model and 3D surgical instrument models, it suffers from several critical issues that undermine its contribution and technical clarity. Please refer to Weaknesses. Given these issues, especially the weak contribution statement, lack of novelty, insufficient technical detail, the paper does not meet the standard expected for publication in a top-tier venue like MICCAI. I therefore recommend rejection.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper describes a method to track a surgeon’s hands, together with the instruments the hands are holding, from RGB frames. The method also optimizes the hand-instrument relationship with a contact point interaction module, and it relies on the MANO hand model. Further, the authors build, and intend to release, a dataset of 3D models of surgical instruments. The authors test their method on publicly available datasets, e.g. POV-Surgery, using several metrics such as mean vertex position error and mean per-joint position error.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The introduction and related works section is well done and gives a lot of background information. The contribution is very clearly stated. The evaluation against so many other methods is strong. The results of the method speak for themselves, as it consistently beats the compared methods in almost all metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the Method section gives a lot of information and formulas, I think it is still quite vague; if I were tasked to re-implement this work, important details would be missing. Further, the authors did not indicate that they intend to provide source code for their method. The keywords list “Mixed Reality”. No information is given on compute time: can the method run in real time? The conclusion is very generic and gives almost no further information or summary.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The abstract mentions that the method “achieves state-of-the-art performance”, but the metrics in the tables seem to indicate that the method improves upon the compared state-of-the-art methods. The authors could consider making a bolder statement.

    Keywords: I don’t think “MANO” should be considered a keyword. Further, I don’t see anything related to “Mixed Reality”, as I think this would involve a user interaction system. This paper is about tracking and reconstruction.

    The first paragraph of the introduction is very vague. I think the paper may benefit from shortening this paragraph to make it more concise, freeing space to write more in other sections.

    The abbreviation RGB was not introduced. Further, I think writing about a “video stream” instead of “RGB data” would be clearer.

    What do the authors mean by “The process of estimating the contact surface between the hand and object typically proves to be time-intensive”? The computation time? For which application? Real time?

    Why do the authors mention Tse et al. by name, and none of the other works? At least from the writing, the work of Tse et al. does not seem more related to the current paper than the other references. MANO, on the other hand, could perhaps be explicitly referred to by the authors’ names.

    “presume the availability of a 3D object model.” - I do not understand this statement.

    Can the authors describe the nature of their released dataset of medical instruments in more detail? E.g., how were they created? Are these models based on real instruments? (If yes, is this a copyright problem?)

    I’d be interested in what happens when the instrument objects get deformed. In particular, the tweezers and scissors in the supplementary material should be articulated, which might limit their tracking capabilities.

    Fig. 1: The Pose Regression Network is for the instrument? Could the authors make this clearer?

    I feel the authors should give a bit more detail on how the image features of the left and right hand are disentangled.

    Do the authors by P_c refer to intrinsic or extrinsic camera parameters, or both?

    The way references [23] and [2] are connected at the end of 2.1 makes it sound like [2] is criticizing [23] for its constrained capacity. But [2] is older than [23]?

    “The another branch segments the mask of the instrument” - what do the authors mean by “branch”? Also, I think it should be “The other branch”.

    Does the method choose the fitting 3D model of the instrument automatically, or must this information be provided to the method?

    “Through S, the scale of the instrument is adjusted.” - this is trivial.

    Could the authors give more implementation details, such as version numbers of the libraries used?

    “sota hand-object interaction methods” - either the authors introduce the abbreviation or spell it out.

    Table 1 was so far from its mention in the text that I initially did not find it, and only found it on a second read-through.

    “Our method significantly outperforms other approaches.” - did the authors perform a statistical test? If not, the authors should use a different word.

    “and our method avoids spatial penetration and unable grasping.” - something is wrong with this sentence.

    3.4: How was the finetuning of all the other methods done?

    3.5: But the values for CPCI w/o UP are almost the same as for CPCI? Why do the authors claim that UP is a very important aspect of their method?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I found this paper to be solid overall, without major weaknesses or standout strengths. Most of my comments are editorial in nature, as the core contribution seems technically sound. That said, I’m not a specialist in hand pose estimation, so I can’t fully assess whether all key related work has been covered. In my view, the paper is slightly above the bar for acceptance.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This paper is hard for me to accept or reject; it is very borderline, leaning just slightly towards acceptance. I already mentioned that the paper lacks major strengths, but at the same time it lacks major flaws as well. Therefore, this may not be a very exciting paper, but it should still be considered for acceptance.

    The other reviewer, who recommended rejection, apparently also does not list very major weaknesses. Therefore, I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    The authors present a framework for reconstructing 3D hand-instrument interactions, which has potential applications in surgical skill assessment and training. They introduce a CAD dataset, the MedIns-3D dataset, which they plan to release as an open-source resource. Additionally, the authors design a scale alignment module to ensure consistency between hand and instrument dimensions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The motivations and challenges of translating hand-object interaction techniques to surgical settings are well-described.

    2. The methodology is clearly presented and easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed Contact-Point-Interaction module shows only marginal improvements in evaluation metrics as reported in Table 1 (e.g., IV: 1.14 → 1.12, Penetration Depth: 1.63 → 1.62), which does not strongly support the claim of “effectiveness” or “superiority.” Additionally, the explanation of the contact points in Section 2.3 lacks clarity, making it difficult to fully understand the module’s design and contribution.

    2. The evaluation metrics are introduced in a rather simplistic manner, without sufficient explanation of their physical meaning. This could hinder readers’ understanding, especially for those unfamiliar with the specific context; providing a more detailed explanation—possibly in the supplementary materials—would enhance clarity (a rough sketch of two such metrics is given after this list). Furthermore, in the context of Computer-Assisted Interventions (CAI), the authors should clarify which evaluation metrics are most relevant and why.

    3. The claim of an “innovative framework” would be more convincing if the authors clearly articulated how it differs from existing approaches. Sections 2.1 and 2.2 appear to adapt known methods, while the proposed module in Section 2.3 shows limited empirical impact. To strengthen this claim, the authors could elaborate on whether similar frameworks currently exist in the CAI domain, and what specific efforts were made to tailor this framework for CAI applications. Alternatively, the term “innovative” could be reconsidered to better reflect the work’s actual contributions.

    4. The authors only claimed to open-source the 3D CAD instrument models. As much of this framework is based on existing models and hyperparameter tuning, releasing the whole framework would increase its impact and reproducibility.
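
To make the physical meaning of two of the metrics discussed above concrete: a rough sketch of how penetration depth and intersection volume (IV) are commonly computed in the hand-object interaction literature, assuming watertight meshes and the trimesh library; the paper's exact definitions may differ.

```python
import numpy as np
import trimesh

def penetration_depth(hand_verts, instr_mesh):
    """Deepest hand vertex inside the instrument (in mesh units)."""
    sd = trimesh.proximity.signed_distance(instr_mesh, hand_verts)  # >0 inside
    return float(max(sd.max(), 0.0))

def intersection_volume(hand_mesh, instr_mesh, pitch=0.005):
    """Approximate overlap volume by counting voxels inside both meshes."""
    lo = np.maximum(hand_mesh.bounds[0], instr_mesh.bounds[0])
    hi = np.minimum(hand_mesh.bounds[1], instr_mesh.bounds[1])
    if np.any(lo >= hi):
        return 0.0                            # bounding boxes do not overlap
    axes = [np.arange(l, h, pitch) for l, h in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    if grid.size == 0:
        return 0.0
    inside = hand_mesh.contains(grid) & instr_mesh.contains(grid)
    return float(inside.sum()) * pitch ** 3
```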

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a 3D hand-instrument interaction framework with potential applications in computer-assisted interventions (CAI). However, the evaluation metrics are not clearly introduced in the context of CAI, raising concerns about the validity and relevance of the evaluations. Additionally, the technical contribution appears limited, with only marginal improvements demonstrated, making it difficult to assess the framework’s overall impact.

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors address well of the reviewer’s concerns.




Author Feedback

To Reviewers: We thank all reviewers for their thoughtful feedback and have revised the manuscript accordingly. Below is our point-by-point response.

Reviewer 1:

R1.1 Contributions: We revised our contributions to clearly state: (1) a CPCI-based hand-instrument interaction framework; (2) a hand mesh upsampling strategy for detailed contact modeling; (3) 3D surgical instrument models.

R1.2 Emphasize CPCI: CPCI is now highlighted in the contributions and method sections.

R1.3 Surgical Modeling Not a Contribution: We cite MedShapeNet to reflect related work. Its 3D shapes are an incomplete subset—some instruments such as blades and certain scissors are missing. Our goal is to enrich the shape repository by releasing a more comprehensive and diverse set of instrument models after acceptance to support future research.

R1.4 2D Vertices and Overlap: We clarified that the 2D vertices are projections of coarse 3D contact points. As the hand mesh topology is consistent, the fingertip region can be identified by fixed vertex indices.

R1.5 Optimization Details: Section 2.3 is revised to clarify how (R, T) is estimated from the overlapping regions.

R1.6 Use of Prior Work: We clarified that MANO and BundleSDF serve as a rough initialization step. Our novelty lies in integrating these tools into a new CPCI-based framework.

R1.7 Jitter: We agree that jitter is unacceptable in medical scenarios, but no existing method has addressed stable and precise hand-instrument interaction in medical scenes. Our goal includes achieving jitter-free and accurate results, and we believe this work marks an important step in that direction. We also applied a simple filtering strategy, which effectively reduced the jitter and improved the stability of the interaction; as a result, the P2D metric was lowered to 11.26.

Reviewer 2:

R2.1 We revised the statement “state-of-the-art” in the main text to align with the abstract.

R2.2 “MANO” and “Mixed Reality” were replaced with more relevant keywords.

R2.3 Intro: The first paragraph is now more concise and better motivated.

R2.4 RGB Definition: “RGB” is defined and replaced with “video stream”.

R2.5 Contact Estimation: Prior approaches estimate contact by ray casting between hand and object point clouds, which typically operates under 10 fps and is not real-time.

R2.6 Citations: We standardized all citations, avoiding name-based references.

R2.7 3D Model Clarification: We clarified that our method does not require exact 3D models at runtime.

R2.8 Dataset Details: Our 3D models are constructed in Blender to match real instruments; there are no copyright concerns.

R2.9 Articulated Tools: We now state this as a known limitation and future-work direction.

R2.10 Pose Regression in Fig. 1: The caption and figure now clarify that it estimates instrument pose.

R2.11 Hand Features: We added the related content to the revised manuscript.

R2.12 Camera Parameters: “Pc” now explicitly refers to the extrinsic parameters.

R2.13 Reference Misconnection: We revised the sentence to remove the misleading dependency.

R2.14 Corrected “The another” to “The other”.

R2.15 Instrument Selection: Currently assumed known; mentioned as a limitation.

R2.16 We removed the “Through S…” sentence.

R2.17 Implementation Details: Details added to the supplement.

R2.18 SOTA Spelled Out: “SOTA” now appears in full as “state-of-the-art”.

R2.19 Table Placement Fixed: We moved Table 1 closer to Section 3.4.

R2.20 “Significantly” Adjusted: We changed it to “consistently outperforms”.

R2.21 Grammar Fix: Rephrased to “…reduces spatial penetration and improves grasp.”

R2.22 Finetuning Clarified: Added to Section 3.4.

R2.23 Importance of UP: UP refines the hand mesh for better contact modeling.

Reviewer 3:

R3.1 CPCI Effectiveness: While the numerical gains are small, CPCI improves physical realism; this is emphasized via the videos and figures.

R3.2 Metrics: Detailed metric explanations and their CAI relevance were added to the revised manuscript.

R3.3 Innovation Clarified: Please refer to R1.1.

R3.4 We commit to releasing the full code and dataset after acceptance.
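
As background to R2.5: contact estimation need not rely on ray casting. A generic nearest-neighbor sketch (not the authors' implementation), assuming SciPy, flags hand vertices within a small threshold of the instrument point cloud:

```python
import numpy as np
from scipy.spatial import cKDTree

def contact_points(hand_verts, instr_points, thresh=0.002):
    """Flag hand vertices within `thresh` (e.g. 2 mm) of the instrument
    point cloud as contacts; O(N log M) via a KD-tree, much cheaper
    than per-vertex ray casting."""
    dists, _ = cKDTree(instr_points).query(hand_verts, k=1)
    return np.where(dists < thresh)[0]
```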




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While I acknowledge that this work does show some interesting techniques, I share the concerns of R1. In my view, the contributions of this specific work are severely limited. Firstly, 3D models of surgical tools have been widely explored and employed for AR/VR-based solutions, and in my view, adding additional tools amounts to very little contribution. The experiments are also not extensive, as they lack a multi-fold test. The lack of details on the experimental setup/dataset and the lack of a multi-fold test raise questions about dataset bias. While the paper claims to have included three bloodied textures to test model robustness, it lacks at least a qualitative analysis of the difference in visual features of the bloodied textures and specific quantitative performance on those textures.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A
