Abstract

We present HUP-3D, a 3D multiview multimodal synthetic dataset for hand ultrasound (US) probe pose estimation in the context of obstetric ultrasound. Egocentric markerless 3D joint pose estimation has potential applications in mixed reality medical education. The ability to understand hand and probe movements opens the door to tailored guidance and mentoring applications.
Our dataset consists of over 31k sets of RGB, depth, and segmentation mask frames, including pose-related reference data, with an emphasis on image diversity and complexity. Adopting a camera viewpoint-based sphere concept allows us to capture a variety of views and generate multiple hand grasp poses using a pre-trained network. Additionally, our approach includes a software-based image rendering concept, enhancing diversity with various hand and arm textures, lighting conditions, and background images. We validated our proposed dataset with state-of-the-art learning models, obtaining the lowest hand-object keypoint errors. The supplementary material details the parameters for sphere-based camera view angles and the grasp generation and rendering pipeline configuration. The source code for our grasp generation and rendering pipeline, along with the dataset, is publicly available at https://manuelbirlo.github.io/HUP-3D/.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1531_paper.pdf

SharedIt Link: https://rdcu.be/dVZiE

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_40

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1531_supp.pdf

Link to the Code Repository

https://github.com/manuelbirlo/US_GrabNet_grasp_generation

https://github.com/manuelbirlo/HUP-3D_renderer

https://github.com/razvancaramalau/HUP-3D-model

Link to the Dataset(s)

https://drive.google.com/file/d/1_MDn7AaansvGdU_wd_eiFO4n95R-Ri9L/view?usp=sharing

BibTex

@InProceedings{Bir_HUP3D_MICCAI2024,
        author = { Birlo, Manuel and Caramalau, Razvan and Edwards, Philip J. “Eddie” and Dromey, Brian and Clarkson, Matthew J. and Stoyanov, Danail},
        title = { { HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound-probe pose estimation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {430 -- 436}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    Following the recent trend of investigating synthetic data for machine learning techniques, this paper details the creation of a synthetic dataset for hand/ultrasound-probe pose estimation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Quite a number of synthetic images were created – 31,000 if I’m reading correctly. Synthetic images offer built-in ground truth from 3D models, simulating realistic grasping scenarios with the benefits of easy scalability and generalizability to real images. The paper is clear and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The formulation seems to be based on Euler angles, not quaternions. This drawback manifests in the development of the formulas: there seems to be a factor of 2 in the denominator that should just be unity. The spherical coordinates do not seem to lend themselves to a uniform distribution of circles of equal radius. The authors are advised to use uniform sampling on the unit sphere.
    Since the technique is meant to estimate hand position and grasp configuration, it is not at all clear how to interpret claims such as an 8.65 mm error, when it is the angle error that is more critical for the US probe, and the metrics of grasp should ultimately include the vertices of all joints (for example, Google's MediaPipe tracks almost two dozen vertices).
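
    For reference, uniform sampling on the unit sphere (as the reviewer suggests) can be obtained by normalizing independent Gaussian draws. The snippet below is a minimal illustration in Python/NumPy; the function and parameter names are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def uniform_unit_sphere_samples(n, seed=0):
    """Draw n points uniformly distributed on the unit sphere.

    Uses the standard trick of normalizing i.i.d. Gaussian vectors, which is
    rotation-invariant and therefore uniform over the sphere's surface.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

print(uniform_unit_sphere_samples(5))
```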

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Since the technique is meant to estimate hand position and grasp configuration, it is not at all clear how to interpret claims such as an 8.65 mm error, when it is the angle error that is more critical for the US probe, and the metrics of grasp should ultimately include the vertices of all joints (for example, Google's MediaPipe tracks almost two dozen vertices).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Since the technique is meant to estimate hand position and grasp configuration, it is not at all clear how to interpret claims such as an 8.65 mm error, when it is the angle error that is more critical for the US probe, and the metrics of grasp should ultimately include the vertices of all joints (for example, Google's MediaPipe tracks almost two dozen vertices).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    There are still problems associated with the development of the spherical coverage of views in the "as-is" paper, but they are somewhat addressed by the author rebuttal. I'm somewhat on the fence as to how to proceed, but I am changing to a 'weak accept'.



Review #2

  • Please describe the contribution of the paper
    1. A scalable synthetic multimodal image generation pipeline that can produce a variety of realistic hand-ultrasound-probe grasp frames, without the external data-recording requirements of previous approaches;
    2. A novel camera concept based on a spherical perspective, combining egocentric head-hand distance with non-egocentric camera viewpoints;
    3. A pioneering multi-view 3D hand-object dataset tailored for obstetric ultrasound hand-probe grasps, HUP-3D;
    4. The lowest hand and object 3D pose estimation errors for a synthetic dataset with a trained state-of-the-art model, HOPE-Net.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Synthetic images offer built-in ground truth from 3D models, simulating realistic grasping scenarios with the benefits of easy scalability and generalizability to real images;
    2. Synthetic ground truth usefully handles the mutual occlusions resulting from hand-tool interactions;
    3. Egocentric head-to-hand distances are integrated with non-egocentric camera perspectives, effectively enhancing the generalizability of the dataset when creating training images for pose estimation;
    4. A generative model is used for machine learning-based grasp generation, reducing restrictions on specific hand-tool contact areas and orientations and thereby improving the realism of hand grasps;
    5. This synthetic dataset is the largest multi-view dataset for clinical applications, providing three modalities, RGB-DS (color, depth, and segmentation maps). It has an advantage in terms of the complexity and diversity of its images.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The generalizability of this dataset has not been fully validated, as it has only demonstrated its advantages on a single model and has not undergone extensive testing and comparison on other potential models.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code showing how to use the proposed dataset for future research (such as preprocessing and formatting) and how to reproduce the results reported in the paper could be released along with the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The unique value of the proposed dataset compared with existing ones should be further discussed. This point is currently not very clear from Table 1; e.g., some previous datasets contain more frames.
    2. The dataset should also be tested on other models, with a comprehensive comparison and evaluation to achieve fairer and more objective results.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed dataset is valuable for the research community with its unique features.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    This work is meaningful and the authors have addressed most of the comments from the reviewers. The unique value of the presented dataset with respect to existing ones should be further highlighted in the final paper, as should the generalization study, as promised by the authors.



Review #3

  • Please describe the contribution of the paper

    This paper presents HUP-3D, a 3D multi-view multi-modal synthetic dataset designed for hand-ultrasound probe pose estimation in obstetric ultrasound. The dataset aims to support applications in mixed reality-based medical education by enabling tailored guidance and mentoring through programmatically understanding hand and probe movements. The key contributions of the paper include a scalable synthetic image generation pipeline, a novel sphere-based camera viewpoint concept for enhanced frame generalizability, and the creation of a diverse multimodal synthetic dataset for joint 3D hand and tool pose estimation. The research methodology involves grasp generation and rendering processes, utilizing generative models and Blender software for realistic grasp image generation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents HUP-3D, a 3D multi-view multi-modal synthetic dataset designed for hand-ultrasound probe pose estimation in obstetric ultrasound. The dataset aims to support applications in mixed reality-based medical education by enabling tailored guidance and mentoring through programmatically understanding hand and probe movements. The key contributions of the paper include a scalable synthetic image generation pipeline, a novel sphere-based camera viewpoint concept for enhanced frame generalizability, and the creation of a diverse multimodal synthetic dataset for joint 3D hand and tool pose estimation. The research methodology involves grasp generation and rendering processes, utilizing generative models and Blender software for realistic grasp image generation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Questions and suggestions for improvement:

    1. This is an interesting piece of work, but I didn’t fully grasp its significance. How does it contribute to clinical practice and what practical applications does it have in the clinical setting?
    2. What are the advantages of your data compared to ObMan's data? Also, what does the last column in Table 1 represent?
    3. You provided formulas 1 and 2 to calculate N_latitude_floors and N_circles^(i), but many parameters within these formulas are still unclear to me. Could you provide a more detailed explanation?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper presents HUP-3D, a 3D multi-view multi-modal synthetic dataset designed for hand-ultrasound probe pose estimation in obstetric ultrasound. The dataset aims to support applications in mixed reality-based medical education by enabling tailored guidance and mentoring through programmatically understanding hand and probe movements. The key contributions of the paper include a scalable synthetic image generation pipeline, a novel sphere-based camera viewpoint concept for enhanced frame generalizability, and the creation of a diverse multimodal synthetic dataset for joint 3D hand and tool pose estimation. The research methodology involves grasp generation and rendering processes, utilizing generative models and Blender software for realistic grasp image generation.

    Questions and suggestions for improvement:

    1. This is an interesting piece of work, but I didn’t fully grasp its significance. How does it contribute to clinical practice and what practical applications does it have in the clinical setting?
    2. What are the advantages of your data compared to ObMan's data? Also, what does the last column in Table 1 represent?
    3. You provided formulas 1 and 2 to calculate N_latitude_floors and N_circles^(i), but many parameters within these formulas are still unclear to me. Could you provide a more detailed explanation?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Questions and suggestions for improvement:

    1. This is an interesting piece of work, but I didn’t fully grasp its significance. How does it contribute to clinical practice and what practical applications does it have in the clinical setting?
    2. What are the advantages of your data compared to ObMan's data? Also, what does the last column in Table 1 represent?
    3. You provided formulas 1 and 2 to calculate N_latitude_floors and N_circles^(i), but many parameters within these formulas are still unclear to me. Could you provide a more detailed explanation?
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers for their time and for their helpful and constructive feedback. R1, R3 and R4 highlight the advantages of synthetic data generation: inherent ground truth, scalability, and generalizability to real images. R1 and R3 acknowledge the quantity of synthetic frames generated. R3 notes that our dataset accounts for mutual occlusions between hand and tool, and that our generative model enhances grasp realism. R4 acknowledges our novel sphere-based camera viewpoint generation that includes egocentric and non-egocentric perspectives, which was key to improving the generalizability of our dataset. We address the reviewers' concerns as follows.

R1 (camera view sampling): We opted for Euler angles in our sphere-based camera view angle generation as the most straightforward method for camera orientation. The sphere's division into latitude floors was chosen to control camera placement and minimize frame redundancy, rather than to achieve perfect circle uniformity. The factor of 2 prevents overlap between circles, enhancing visual coverage. Uniform sphere sampling will be considered in subsequent enhancements of our model.

R1 (error metrics): Our hand model is based on a 3D skeleton of 21 joints, as standardised by Erol et al.; the ObMan and HO3D datasets and MediaPipe (https://tinyurl.com/handlandmarker) use similar annotations to measure the 3D joint error. Our task primarily focuses on 3D localisation of the hand and the probe relative to the camera, as in the previously mentioned datasets; however, the angle deviation can still be computed from the 8 probe keypoints, since the error is measured from the 3D distances between predicted and ground-truth keypoints.

R3 (dataset generalizability): We have in fact tested two models: one based on ResNet-50, similar to DeepPrior, and another based on HopeNet. We will clarify this in the final version. HopeNet has already shown strong generalizability on other hand-object datasets such as FHDB and HO3D.

R4 (clinical significance and practical applications): We outline the potential of our work for medical education, particularly in obstetric ultrasound, by utilizing mixed reality technologies for training and skill assessment, which is directly relevant to clinical practice. Improved hand and probe tracking may support the development of standardized training protocols, reduce the learning curve, and aid automated assessment of clinical performance. We will revise this section to clarify potential clinical applications.

R4 (clarifications): Table 1: the last column shows whether the dataset is clinical and lists the number of clinical tools used. Comparison to other datasets: ObMan is not clinical and features grasps on household objects; POV-Surgery and Hein et al. focus solely on egocentric viewpoints. Our dataset includes realistic backgrounds, surgical gloves, and clinical tools. Spherical viewpoint sampling provides a near-uniform distribution around the viewpoint. Compared to ObMan and Hein et al., we provide improved depth images by rendering hand and object in relation to each other. We will amend the text to clarify these aspects.

R4 (explanation of formulas 1 and 2): The terms r_circ and r_sph represent the radii of the surface circles and the sphere, respectively, as illustrated in Fig. 2. We define each parameter and give value ranges in Table 1 of the supplementary material.

Regarding reproducibility, we have provided a link to the HUP-3D dataset. Our manuscript outlines the grasp generation and rendering pipeline in Section 2, complemented by Figs. 1 and 2 for clarity, and builds on well-documented prior works [7, 8] with available code. We will release the complete source code with a detailed installation guide upon paper acceptance, ensuring full reproducibility. Our intention is to facilitate easy reproduction and adaptability of our method, fostering further research in this domain. We are grateful for your insights and guidance, which will improve our work on this paper and beyond.
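
To make the sphere-based viewpoint concept discussed in the rebuttal more concrete, the following is a minimal sketch of dividing a view sphere into latitude floors and packing non-overlapping surface circles of radius r_circ on each floor. The floor range, constant factors, and function name are assumptions made for illustration only; this is not a reimplementation of the paper's Eqs. (1)-(2).

```python
import numpy as np

def sphere_view_circle_centres(r_sph, r_circ):
    """Illustrative latitude-floor sampling of surface-circle centres on a view sphere.

    Assumption: floors span the upper hemisphere, and the factor of 2 in each
    denominator keeps neighbouring circles of radius r_circ from overlapping
    along a meridian or latitude circle.
    """
    # Number of latitude floors along a quarter meridian of length (pi/2) * r_sph.
    n_floors = max(1, int(np.floor((np.pi / 2) * r_sph / (2 * r_circ))))
    centres = []
    for i in range(n_floors):
        theta = (i + 0.5) * (np.pi / 2) / n_floors      # polar angle of floor i
        r_lat = r_sph * np.sin(theta)                   # radius of the latitude circle
        # Number of circles that fit on this floor's circumference without overlap.
        n_circles = max(1, int(np.floor(2 * np.pi * r_lat / (2 * r_circ))))
        for j in range(n_circles):
            phi = 2 * np.pi * j / n_circles             # azimuth of circle centre j
            centres.append((r_lat * np.cos(phi),
                            r_lat * np.sin(phi),
                            r_sph * np.cos(theta)))
    return np.asarray(centres)

# Example: a 0.5 m view sphere with 5 cm surface circles.
print(sphere_view_circle_centres(r_sph=0.5, r_circ=0.05).shape)
```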
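
The rebuttal also notes that both a 3D keypoint error (21 hand joints plus 8 probe keypoints) and an angular deviation of the probe can be derived from predicted versus ground-truth keypoints. Below is a hedged sketch of both computations; the Kabsch (SVD) alignment used for the angle is an assumed choice rather than the authors' documented method, and the array names are hypothetical.

```python
import numpy as np

def mean_keypoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D keypoints.

    'pred' and 'gt' are (N, 3) arrays, e.g. N = 21 hand joints or N = 8 probe
    keypoints, expressed in the same (camera) frame.
    """
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def probe_angle_error_deg(pred_kpts, gt_kpts):
    """Angular deviation of a rigid probe recovered from its keypoints.

    Assumes the probe is rigid, so the residual rotation between predicted and
    ground-truth keypoints can be estimated with a Kabsch (SVD) alignment.
    """
    p = pred_kpts - pred_kpts.mean(axis=0)
    g = gt_kpts - gt_kpts.mean(axis=0)
    H = p.T @ g                                     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection guard
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T         # best-fit rotation pred -> gt
    cos_a = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_a)))

# Example with random keypoints (hypothetical data, for shape checking only).
rng = np.random.default_rng(0)
gt = rng.normal(size=(8, 3))
pred = gt + rng.normal(scale=0.005, size=(8, 3))
print(mean_keypoint_error(pred, gt), probe_angle_error_deg(pred, gt))
```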




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a borderline paper, as clearly identified by the reviewers.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    This is a borderline paper, as clearly identified by the reviewers.


