Abstract

Accurate segmentation of 3D tooth point clouds from intraoral scanner (IOS) data is crucial for orthodontic applications. While current methods show promise, they rely on high-quality labeled datasets that are scarce because of costly annotation, which constrains their practical generalizability. We address this challenge with STEAM, a self-supervised learning framework that learns comprehensive features from large-scale unlabeled tooth point clouds. Built upon the masked autoencoder, our framework incorporates two key innovations: Gradient-guided Adaptive Masking (GAM), which adaptively identifies and prioritizes challenging regions by analyzing local feature variations during training, and Multi-attribute Geometric Reconstruction (MGR), which reconstructs multiple geometric attributes, including point distributions, normals, and curvatures, to capture geometric features at different granularities. Through extensive experiments on public datasets, our approach demonstrates superior performance on downstream segmentation tasks with minimal labeled data, achieving significant improvements over existing methods. The results validate STEAM's effectiveness in maximizing the utility of limited labeled data for practical dental applications.
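
To make the Multi-attribute Geometric Reconstruction objective more concrete, the sketch below combines point, normal, and curvature reconstruction terms into one weighted loss. This is a minimal illustration in PyTorch; the decoder outputs, loss choices, and λ weights are assumptions made here for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a multi-attribute reconstruction objective (assumed form).
import torch
import torch.nn.functional as F


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets of shape (B, N, 3) and (B, M, 3)."""
    d = torch.cdist(pred, gt)                      # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


def multi_attribute_loss(pred_pts, gt_pts, pred_normals, gt_normals,
                         pred_curv, gt_curv, lambdas=(1.0, 0.5, 0.5)):
    """Weighted sum of point, normal, and curvature reconstruction terms."""
    l_pts = chamfer_distance(pred_pts, gt_pts)
    # Normals: penalize angular deviation, ignoring sign flips.
    l_nrm = (1.0 - F.cosine_similarity(pred_normals, gt_normals, dim=-1).abs()).mean()
    # Curvature: simple L1 regression on a scalar value per point.
    l_crv = F.l1_loss(pred_curv, gt_curv)
    return lambdas[0] * l_pts + lambdas[1] * l_nrm + lambdas[2] * l_crv
```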

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3394_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiuYif_STEAM_MICCAI2025,
        author = { Liu, Yifan and Yang, Chen and Yu, Weihao and Liu, Xinyu and Chen, Hui and Meng, Max Q.-H. and Yuan, Yixuan},
        title = { { STEAM: Self-supervised TEeth Analysis and Modeling for Point Cloud Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {545--554}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a method for segmentation of individual teeth from 3D point clouds. The approach begins by pre-training a transformer architecture based on Masked Autoencoders (MAE) using unlabeled point clouds, followed by supervised fine-tuning on labeled data. The main contribution lies in adapting the MAE paradigm to tooth segmentation using two components: Gradient-guided Adaptive Masking (GAM) and Multi-attribute Geometric Reconstruction (MGR). GAM focuses on selecting challenging regions in the point cloud for masking, while MGR enhances segmentation performance by reconstructing fine-grained geometric attributes such as normals and curvatures, in addition to point distributions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper uses clear language and is easy to follow.
    2. The proposed GAM and MGR modules present interesting and thoughtful adaptations of Masked Autoencoders (MAE) for the specific task of teeth segmentation.
    3. The proposed method demonstrates significant improvements over the evaluated baselines.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper extensively discusses the challenges associated with supervised learning for teeth point cloud segmentation and emphasizes the potential benefits of leveraging unlabeled data. However, it does not acknowledge the STSNet [a] paper, which employs contrastive learning-based pretraining for teeth segmentation, in the Introduction or Related Work section. Even though STSNet has been included as a baseline in the experimental comparisons, there is no discussion regarding how the proposed method compares to STSNet in terms of strengths, limitations, or methodological distinctions.

    2. The Geo-Net [b] paper deserves attention, as it bears substantial similarity to the proposed approach. Geo-Net also builds upon masked autoencoders and introduces a specialized patching strategy to select informative patches for masking. However, the paper has not been cited by the authors. A comprehensive comparison with Geo-Net—both theoretically and empirically—is warranted to better assess the contribution of the proposed method.

    [a] Liu, Zuozhu, et al. “Hierarchical self-supervised learning for 3D tooth segmentation in intra-oral mesh scans.” IEEE Transactions on Medical Imaging 42.2 (2022): 467-480.
    [b] Liu, Y., et al. “Geo-Net: Geometry-Guided Pretraining for Tooth Point Cloud Segmentation.” Journal of Dental Research 103.13 (2024): 1358-1364.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. The role of the teacher network lacks clarity. Initially, the paper states in Sec. 2.2 that the teacher network is frozen; however, a subsequent paragraph indicates that the teacher is updated in conjunction with the student network. If the authors intend to imply that the teacher network is updated using an Exponential Moving Average (EMA) of the student, then the update rate (i.e., the EMA decay factor) should be clearly specified (see the minimal EMA sketch after this list).
    2. The values of the λ parameters in Equation (5) appear to be selected empirically. A more detailed ablation or sensitivity analysis exploring the impact of different λ values on the network’s final performance would significantly strengthen the paper.
    3. The learning rate schedule is described as “an initial learning rate of 5e−4 that decays to 5e−2.” This appears to be a typo, since the learning rate typically decreases rather than increases.
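
Regarding point 1 above: if the teacher were indeed updated as an exponential moving average of the student (the rebuttal later clarifies it is a plain frozen copy), the update would look like the following minimal sketch, where the 0.999 decay factor is a placeholder assumption rather than a value from the paper.

```python
# Hypothetical EMA teacher update (illustrative only; decay value assumed).
import torch


@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, decay: float = 0.999):
    # Blend each teacher parameter toward the corresponding student parameter.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```
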
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main reason for my recommendation is the lack of proper comparison (both empirically and methodologically) with related work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My major concern was the lack of comparison with related work. Authors have addressed these comparisons in the rebuttal. I would encourage the authors to clearly state the methodological differences between the proposed and related works in the updated draft.



Review #2

  • Please describe the contribution of the paper

    This work seeks to develop a self-supervised learning framework for analyzing 3D point cloud data from dental scans based on a masked autoencoder model. The key novelties are a sampling approach that preferentially masks patches estimated to have a large reconstruction-loss gradient (i.e., harder-to-reconstruct patches) and decoders that reconstruct the 3D point distribution as well as surface normals and curvatures, which can be informative for the downstream application.
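
As background on the normal and curvature targets mentioned above: one common way to derive such per-point attributes from a raw point cloud is local PCA, where the normal is the eigenvector of the neighbourhood covariance with the smallest eigenvalue and the "surface variation" ratio serves as a curvature proxy. The sketch below illustrates this standard recipe; it is not necessarily how the paper computes its targets (which could also come from the original IOS mesh).

```python
# Sketch: PCA-based normal and curvature estimation for a point cloud (N, 3).
import numpy as np
from scipy.spatial import cKDTree


def estimate_normals_and_curvature(points, k=16):
    """Return per-point normals (N, 3) and a curvature proxy (N,)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)                 # k nearest neighbours per point
    normals = np.zeros_like(points)
    curvature = np.zeros(len(points))
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        cov = patch.T @ patch / k
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        normals[i] = eigvecs[:, 0]                   # direction of least variance
        curvature[i] = eigvals[0] / (eigvals.sum() + 1e-12)  # surface variation
    return normals, curvature
```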

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approaches seem reasonable and well tailored to the specific application.

    There is perhaps the potential for some of the approaches (such as the gradient-guided masking) to be employed in more general applications of MAEs.

    Experimentally, the proposed technique appears to achieve strong performance relative to existing self-supervised learning models for 3D point sets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Some of the technical description of the method is a bit insufficient. For example, what is the metric for ranking the patches with the largest gradient? The L2 norm of the gradient? Likewise, the dimensionality of the grouped patches (G) is presented a bit strangely. G is an M×3K matrix, but is shown as having M columns. What is M, and why is the second dimension 3K?

    2) What is the purpose of the student-teacher configuration for the gradient-guided masking? Namely, why is a teacher network needed to estimate patches with a large gradient, versus simply doing a forward pass through the ‘student’ network? How the weights are related between the student and teacher networks is also a bit unclear. It is mentioned that the weights are shared between the two networks, but also that the teacher network is frozen. Are the weights of the two networks set to be equal at every iteration?

    3) The application experiments appear to be limited to 3D dental scans. While this is understandably the focus of the paper, it would be interesting to see if some of the techniques like the gradient-guided masking could improve self-supervised methods for other point-cloud datasets.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the method appears to make some novel contributions and achieves reasonable experimental performance, but there is some need to improve a few aspects of how the method is presented.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper tackles the problem of point-cloud segmentation, in the context of data coming from intraoral scanners in orthodontic applications.

    Due to a lack of large-scale datasets, the authors turn to self-supervision to pretrain their model. Because of the large number of points that do not represent interesting regions and the detailed nature of tooth surfaces, the authors introduce two components: Gradient-guided Adaptive Masking, which selects the patches to be masked for the MAE instead of choosing them randomly, and Multi-attribute Geometric Reconstruction, which in essence adds additional pre-training tasks.

    The authors report state-of-the-art results compared to previous work, and perform an ablation study on their proposed improvements.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very straightforward and presents the two contributions in an easy-to-follow manner. The two proposed pre-training improvements are well motivated, well presented and both seem to be useful in their own way. The results, both qualitative and quantitative, are convincing.

    With respect to the contributions, the adaptive masking seems the most promising, with definite applications outside of the field.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper may be fairly opaque to readers not acquainted with point cloud segmentation. The paper defers a lot of the contextualization to references and cannot be described as self-contained. The patch generation process, key to MAEs and therefore to the paper, is not easily understood by a reader outside of the field.

    The presented work can also hardly be deemed reproducible, as the code and datasets do not appear to be public. The authors also refer to their architecture as a “standard” or “vanilla” transformer without any more details.

    Finally, the proposed method was only pretrained and trained/tested on one pair of datasets, and its applicability to larger, smaller, or varying-quality datasets, as well as its generalizing capabilities, is unknown.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    What is the size of the architecture? The teacher-student framework for GAM in particular seems compute-heavy. On what hardware was the method trained?

    For 3.2, it seems all the other methods were re-trained for this work. Is that correct? Did you use publicly available implementations or did you reimplement them?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As mentioned above, this is a very straightforward paper. There is nothing inherently flawed with it, although some concerns were raised above. The first contribution (GAM) is exciting in its novelty and applicability to other domains, whereas the second is more specific and really amounts to pretext tasks.

    This is a perfectly fine paper and deserves to be accepted.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have done an OK job of answering the comments. The revised version should clear up some of the confusion present in the initial version. My opinion of the paper stands as a weak accept.




Author Feedback

Dear Reviewers and Area Chairs, We appreciate the reviewers’ valuable feedback and are encouraged by their recognition of our well-tailored approaches for 3D point set learning (R1). We are humbled by their acknowledgment that our paper is straightforward and easy to follow (R2, R3), with well-motivated and clearly presented contributions (R2, R3). The reviewers’ positive remarks on our gradient-guided adaptive masking (GAM) and its potential applications beyond our specific field (R1, R2) are particularly encouraging. We are pleased that our experimental results demonstrate strong performance compared to existing self-supervised learning models (R1, R3), with both qualitative and quantitative results being convincing (R2). Below, we address the raised concerns.

  1. Technical details of patch masking (R1). We use the L2 norm to measure the magnitude of the gradient tensors for ranking patches. Regarding the dimensionality, in the M×3K matrix, M represents the number of patches, K denotes the number of points within each patch, and 3 corresponds to the point coordinates. Each patch is processed through PointNet to obtain patch tokens. To avoid confusion, we will include these detailed descriptions in the revised version.
  2. Model design of the teacher network (R1, R3). The teacher network is a copy of the student network, i.e., all model parameters remain the same at each iteration. We freeze the teacher network's weights to avoid unnecessary gradient computation, as we only require the teacher network to tell us which patches are hard to reconstruct (see the illustrative sketch after this list).
  3. Preliminary information (R2). Sorry for the confusion. We will provide more preliminaries on masked autoencoding in Sec. 2.1 of the revised manuscript.
  4. More implementation details (R2). The standard transformer we use refers to ViT-S/16 as implemented in the original ViT paper. The model size is 120 MB, and the model is trained on one RTX4090Ti. For comparisons, we use publicly available implementations.
  5. Discussion of recent methods (R3). Thanks for the suggestion. STS-Net was the first to investigate self-supervised learning for tooth point cloud analysis, adopting the contrastive learning paradigm. We will include more discussion of STS-Net in the introduction and related work.
  6. Comparison with Geo-Net (R3). Geo-Net uses a curvature score (a geometric metric) to select important/hard patches, while our method uses the network gradient (a learning-based metric). We implemented Geo-Net using its open-source code and evaluated it on the same benchmark. The results show that our method delivers better performance, with a 2.35 IoU improvement, indicating that the learning-based metric may be more robust than the geometric one.
  7. Sensitivity analysis and typos (R3). Thanks for the suggestion. We will provide more experiments with different parameter values and fix the learning rate typo.
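
To visualize points 1 and 2 above, the following sketch shows a frozen teacher that mirrors the student, together with patch ranking by the L2 norm of a reconstruction-loss gradient. It is one interpretation under stated assumptions (in particular, that the gradient is taken with respect to the patch tokens); it is not the authors' code.

```python
# Sketch of a frozen teacher copy and gradient-norm patch ranking (assumed form).
import copy
import torch


def make_frozen_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Create a teacher whose weights receive no gradients."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


def sync_teacher(student: torch.nn.Module, teacher: torch.nn.Module) -> None:
    """Called every iteration so the teacher's parameters stay equal to the student's."""
    teacher.load_state_dict(student.state_dict())


def rank_hard_patches(teacher, patch_tokens, reconstruction_loss_fn, mask_ratio=0.6):
    """patch_tokens: (M, D) tokens, one per patch of K points (i.e., the flattened
    M x 3K patch matrix encoded by a small PointNet). Returns indices of the
    patches with the largest gradient magnitude, which are then masked."""
    tokens = patch_tokens.detach().clone().requires_grad_(True)
    loss = reconstruction_loss_fn(teacher, tokens)     # scalar reconstruction loss
    grads = torch.autograd.grad(loss, tokens)[0]       # (M, D) gradient per patch token
    scores = grads.norm(p=2, dim=-1)                   # L2 norm as a difficulty score
    num_mask = int(mask_ratio * tokens.shape[0])
    return scores.topk(num_mask).indices
```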




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After reading the paper, reviews, and rebuttal, I feel that the paper lacks citation of and comparison with Geo-Net, which is a key baseline that needs to be cited. The authors included Geo-Net results in the rebuttal, which violates the rebuttal policy prohibiting the introduction of new results. Such a violation results in a desk rejection.


