Abstract

The detection of cephalometric landmarks is crucial for orthodontic diagnosis. Current methods mainly focus on utilizing contextual information to detect landmarks while overlooking the challenges posed by domain gaps. In this paper, we propose a contour-guided framework that leverages cranial soft/hard tissue contours as domain-invariant anatomical priors. The method introduces a joint attention module to fuse the topological features corresponding to the contours with contextual features, ensuring the accuracy of landmark positioning. Additionally, we address anisotropic prediction uncertainty in unseen domains through a direction-aware regression module, which incorporates contour geometry to regularize error distributions. Evaluated on multi-domain datasets with five source and three unseen target domains, our framework demonstrates superior robustness to domain shifts while maintaining anatomical plausibility, achieving state-of-the-art cross-domain localization accuracy.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5412_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiaXin_Contour_MICCAI2025,
        author = { Liang, Xinyue and Chen, Runnan and Wei, Guangshun and Zhuang, Shaojie and Zhou, Yuanfeng},
        title = { { Contour Makes It Stronger: Cross-Domain Cephalometric Landmark Detection Based on Contour Priors } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a contour-aware framework that leverages tissue contours as domain-invariant anatomical priors, introducing a joint attention module that generates contour-landmark joint features to model a globally consistent hierarchical structure, and a direction-sensitive regression module that addresses anisotropic prediction uncertainty in unseen domains by incorporating contour geometry to constrain prediction deviations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The proposed method demonstrates clinical feasibility for cephalometric landmark detection. (2) By addressing a critical unmet need in clinical practice, the proposed framework effectively advances the field of cephalometric landmark detection.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (1) The specific definition of the domain gap addressed by the proposed method (e.g., image source or image modality) needs clearer articulation to avoid confusion. For instance, the distinction between gaps arising from different imaging devices should be explicitly stated. (2) The authors should elaborate on the contour generation process, a core component of the method. Key questions include:

    • How many contours are generated?
    • What criteria are used to group landmarks into contours?
    • Are contours consistent across datasets with varying numbers of ground-truth landmarks?

    (3) The authors claim that structural features are extracted via a Vision Transformer (ViT) and contextual features via a CNN. However, both architectures can capture structural and contextual information. Clarification is needed on how these features differ in the proposed framework and why combining them benefits cephalometric landmark detection. (4) The rationale for “alternatively selecting” queries in the joint attention module requires elaboration. Specific criteria or rules governing this selection process should be defined. (5) The total dataset size is reported as 1,100 images (ISBI2023 + ISBI2015), but the training/test split sums to 1,232 images. This discrepancy needs resolution for reproducibility. (6) Details on image preprocessing (e.g., resizing, normalization) should be provided to ensure reproducibility, given the variability in input resolutions. (7) With ~1,200 images, the authors should clarify whether K-fold cross-validation or data augmentation was employed to validate robustness. This information is critical to assess generalizability.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the study introduces novel and clinically relevant ideas with clear potential, critical methodological ambiguities—such as insufficient detail on contour generation criteria, domain gap definitions, and query selection rules—undermine its reproducibility and scientific rigor. Additionally, missing preprocessing steps for varied resolutions, unaddressed validation strategies (e.g., K-fold cross-validation), and unresolved dataset inconsistencies (1,100 vs. 1,232 images) raise concerns about experimental validity and result robustness. These issues collectively weaken the work’s ability to substantiate its clinical feasibility claims, despite its promising conceptual foundation. Addressing these gaps is essential to validate the methodology’s rigor and practical impact.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors adequately addressed all concerns in their rebuttal.



Review #2

  • Please describe the contribution of the paper
    1. Proposed the Contour-aware Joint Learning (CJL) framework, which innovatively leverages cranial soft/hard tissue contours as domain-invariant anatomical priors to enhance robustness and accuracy in cross-domain cephalometric landmark detection. Designed the Joint Attention Module (JAM) to fuse contour and landmark features, creating globally consistent hierarchical structures and improving the ability to model complex anatomical relationships.
    2. Introduced the Direction-sensitive Regression Module (DRM), which utilizes contour geometry to address directional uncertainty in landmark prediction, further enhancing robustness and precision.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. By leveraging contours as domain-invariant geometric priors and combining them with landmark features, the method effectively addresses the reliance on image appearance in traditional approaches, showing strong cross-domain robustness. The direction-sensitive regression module (DRM) reduces anisotropic prediction errors in landmark detection, making the approach particularly suitable for anatomy-related tasks while improving robustness and interpretability.
    2. CJL outperforms all compared methods in the target domain across key metrics, especially excelling in high-precision tasks, such as achieving a 55.25% SDR within the 1mm range. The framework exhibits minimal performance degradation in cross-domain tasks, with only a 0.34mm increase in MRE, demonstrating superior robustness and generalization compared to other methods.
    3. Each component (CAM, JAM, DRM) is well-defined and complementary, with ablation studies verifying their contributions to the overall performance. The JAM module introduces an effective hierarchical consistency mechanism that enhances feature integration, with potential applicability to other anatomical tasks.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The framework employs common architectures, such as cascaded CNNs and multi-scale feature extraction, which, while effective, lack groundbreaking innovations and primarily build upon existing techniques. Although the design is task-specific and well-suited for cephalometric landmark detection, the approach may not generalize as effectively to tasks where contours are not strongly correlated with the target features.

    2. The reliance on high-quality contour extraction could impact the model’s performance in cases where the input image quality is compromised (e.g., noisy, low-resolution, or poorly defined contours). For scenarios where landmarks are far from contours or the contours themselves are ambiguous, the DRM module may struggle to optimize error distributions effectively.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the strengths and weaknesses above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The method is novel and practical, with strong cross-domain results and clear contributions. The authors addressed major concerns in the rebuttal, making the paper suitable for acceptance.



Review #3

  • Please describe the contribution of the paper

    The authors propose a new architecture and loss function for anatomical landmark detection. They aim to improve cross-domain performance by optimising the landmarks to be nearer the contours. To do this, they propose a ViT/CNN hybrid architecture that directly applies attention to contours, together with a novel regression loss. This is compared against 5 prior works that I believe were reimplemented in Table 1. They also qualitatively evaluate the work against 4 previous works in Fig 3 and ablate the effect of the proposed components in Table 2.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Best results on the target domain test set and improved precision on the source domain set (Table 1). The direction regression module is an interesting and innovative step for landmark detection that handles both isotropic and anisotropic cases well. As prior works have different source/target domains and do not evaluate at 1mm, I assume all works have been reimplemented to evaluate their performance. ViT/CNN hybrid architectures are incredibly popular at the moment, and this work does well to apply the contour attention at intermediary ViT layers. Further, accuracy at 1mm is impressive in Table 1. Table 2 ablates the components well and shows clear improvements mainly within the 1mm boundary (2-3mm accuracies are very similar on source/target results).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Several omissions, likely due to space:

    • Sufficient detail is missing from Sec 2.3 that would allow for reimplementation. “From the contour heatmaps Mj generated by the contour-aware module”: M_j is defined as the ground truth (Eq. 3), not as a prediction from the ViT. “We then project the offset between the predicted landmark p and its ground truth p∗ onto the directional vectors”: this implies coordinate space; is this done in heatmap space by calculating $|y_i-\hat{y}_i|$ (for landmark i with predicted heatmap $y_i$ and ground-truth heatmap $\hat{y}_i$) and then determining $\delta t$ and $\delta n$ from this? (One possible coordinate-space reading is sketched after the reference below.)
    • Is L_reg the only loss function involved? The heatmaps in Fig 2 suggest otherwise; they look like large Gaussian maps learned through BCE or MSE.
    • What is the data preprocessing pipeline, including data augmentation and resolution? The high resolution used in “Hyatt-Net is Grand” indicates that performance correlates with resolution due to quantisation of landmarks.

    “Hyatt-Net is Grand” - Zhou, X., Huang, Z., Zhu, H., Yao, Q. and Zhou, S.K., 2024. Hybrid Attention Network: An efficient approach for anatomy-free landmark detection. arXiv preprint arXiv:2412.06499.
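
    [Editorial note] To make the δt/δn question above concrete, the following is a minimal, hedged sketch of one possible coordinate-space reading: the landmark error is decomposed into tangential and normal components relative to the nearest point on a sampled contour, with the tangent estimated by finite differences. The function name and the finite-difference tangent estimate are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def decompose_offset(pred, gt, contour_pts):
            """Split the landmark error (pred - gt) into tangential and normal
            components relative to the nearest point on a sampled contour.

            pred, gt    : (2,) predicted / ground-truth landmark coordinates
            contour_pts : (K, 2) points sampled along the associated contour
            returns (delta_t, delta_n): signed tangential / normal error components
            """
            # nearest contour point to the ground-truth landmark
            idx = int(np.argmin(np.linalg.norm(contour_pts - gt, axis=1)))
            # finite-difference tangent at that point, normalised
            nxt = contour_pts[min(idx + 1, len(contour_pts) - 1)]
            prv = contour_pts[max(idx - 1, 0)]
            tangent = nxt - prv
            tangent = tangent / (np.linalg.norm(tangent) + 1e-8)
            normal = np.array([-tangent[1], tangent[0]])   # 90-degree rotation

            offset = pred - gt
            delta_t = float(offset @ tangent)   # error along the contour
            delta_n = float(offset @ normal)    # error perpendicular to the contour
            return delta_t, delta_n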

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It is noted that Eq 4 or Eq 5 is used; is this with a rate of 50%? What was the motivation for the random selection? Table 1 should cite the models rather than label the past work by year (or do both). Fig 1b shows the contour following the nose; however, no landmark on the nose is highlighted in that figure. Have the additional landmarks from CEPHA29 been used in contour generation? I am under the impression that the spline-based contour generation determines a line of best fit through the landmarks for that contour and therefore should omit the nose if there is no landmark on the nose.
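
    [Editorial note] As a concrete illustration of the spline-based reading in the comment above (not the authors' pipeline), the sketch below fits a parametric B-spline through the ordered landmarks of one contour group with SciPy; such a contour only passes through regions covered by landmarks. The grouping, smoothing parameter, and example coordinates are assumptions.

        import numpy as np
        from scipy.interpolate import splprep, splev

        def contour_from_landmarks(landmarks, n_samples=200, smooth=0.0):
            """Fit a parametric B-spline through the ordered landmarks of one
            contour group and return densely sampled contour points.

            landmarks : (K, 2) array of ordered (x, y) landmark coordinates
            """
            x, y = landmarks[:, 0], landmarks[:, 1]
            # s=0 interpolates the landmarks exactly; k is capped for small groups
            tck, _ = splprep([x, y], s=smooth, k=min(3, len(landmarks) - 1))
            u = np.linspace(0.0, 1.0, n_samples)
            xs, ys = splev(u, tck)
            return np.stack([xs, ys], axis=1)

        # usage with a hypothetical mandibular group (e.g. Gonion, Menton, Pogonion):
        # mandible = contour_from_landmarks(
        #     np.array([[412.0, 780.0], [455.0, 905.0], [520.0, 930.0], [560.0, 910.0]]))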

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solid proposal that I feel is important to the landmark detection community. Would give a higher score but I feel some of the wording and clarity issues can be cleared up in the paragraphs before Eq 7. Plus some indication of the an alternative loss or whether L_reg is the only optimisation involved.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Following the rebuttal, this reviewer has decided to keep their acceptance recommendation. This reviewer mainly had issues with minor clarity points and omissions, likely due to space availability. The authors have indicated that the core issues will be fixed in the final manuscript. Despite this, this reviewer is unsure how well the clarity of Sec 2.3 will be improved and therefore recommends also providing code or taking the extra time to ensure this section is as clear as possible.

    Further, this reviewer and Reviewer 1 brought up similar points on how the availability of ground-truth landmarks across datasets affects the localisation of the contours, i.e., in this reviewer’s example the nose landmark is not available in the ISBI2015 dataset, yet the figures show the contour following the nose. It would therefore have been interesting to see the flipped source/target domain performances and per-domain performances.

    Additionally, the 7-fold cross validation performance and additional loss functions involved should be discussed in the final version. This is indicated in the rebuttal.

    This reviewer mentioned the additional loss functions involved, and it was indicated these would be documented in the final manuscript; any preprocessing of this ground truth should also be mentioned.




Author Feedback

We thank all reviewers (R1, R2, R3) for their constructive feedback. We are glad all reviewers appreciated our work as a valuable contribution to the landmark detection community. We address all concerns below.

Q1 (R1) Criteria for the domain gap: The domain gap in radiographic cephalometry arises within the same imaging modality (i.e., cephalometric X-rays) and is caused by variations in imaging device manufacturers, acquisition parameters, and post-processing pipelines. These differences lead to variations in image appearance (e.g., grayscale histograms, contrast, and noise), as shown in Fig. 1a.

Q2 (R1) Contour generation: Our method involves eight contours (e.g., the maxillary bone outline and the mandibular bone outline). These contours are well-established, anatomically meaningful structures in the cephalometric literature [1] and have been validated by clinical experts. The contour structures and landmark-contour mapping rules (e.g., the points Pogonion, Menton, and Gonion lie on the mandibular bone outline) remain consistent and valid even as the number of landmarks increases. We will include the relevant definitions in the revised version.

Q3 (R1, R2) Architecture design and clarity: The ViT is used to extract contours due to its strength in modeling long-range dependencies, while the CNN is used to capture contextual information because of its local receptive fields. In the Joint Attention Module (JAM), “alternatively selecting” refers to iteratively applying Eq. 4 and Eq. 5 for N rounds, not randomly choosing an operation once. We will rephrase the corresponding part to improve clarity.
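[Editorial note] Purely for illustration, a hedged PyTorch sketch of the “iteratively applying Eq. 4 and Eq. 5 for N rounds” reading above, assuming (as an editorial guess, since the equations are not reproduced here) that the two updates are cross-attention steps in which landmark and contour tokens alternately act as queries; this is not the authors’ code.

    import torch
    import torch.nn as nn

    class AlternatingJointAttention(nn.Module):
        """Illustrative stand-in for the joint attention module: contour and
        landmark token sets alternately act as the query set for N rounds."""
        def __init__(self, dim, heads=8, rounds=4):
            super().__init__()
            self.rounds = rounds
            self.attn_l2c = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.attn_c2l = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, contour_tokens, landmark_tokens):
            c, l = contour_tokens, landmark_tokens
            for _ in range(self.rounds):
                # "Eq. 4"-style step: landmark queries attend to contour features
                l = l + self.attn_l2c(query=l, key=c, value=c)[0]
                # "Eq. 5"-style step: contour queries attend to landmark features
                c = c + self.attn_c2l(query=c, key=l, value=l)[0]
            return c, l

    # e.g. jam = AlternatingJointAttention(dim=256)
    #      c, l = jam(torch.randn(2, 8, 256), torch.randn(2, 19, 256))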

Q4 (R1, R2) Experiment details:

  • Dataset split: Our dataset contains 1,100 images across 8 domains. Among them, 3 domains are entirely unseen during training and used only for testing (127 images); the remaining 5 domains include 427 images for training, 273 for validation, and 273 for testing. We performed 7-fold cross-validation, and the averaged results are reported in Table 1, outperforming prior SOTA methods. We will correct the dataset-related typo.
  • Preprocessing: To address variations in resolution, our preprocessing pipeline includes resizing, padding, and random color jitter. All images are finally resized to 1024×1024.

Q5 (R2) Loss function: The total loss includes the mean logistic loss L_h for supervising landmark heatmaps, the MSE loss L_mse for supervising contour heatmaps, and the anisotropic regression loss L_reg. The final loss is L = L_reg + λ1·L_h + λ2·L_mse, where λ1 = 2.0 and λ2 = 5.0. We will elaborate on this in the revised version (see the sketch below).

Q6 (R2) Nasal contour: The nose region is part of the upper soft-tissue contour. The landmarks Sn and UL’ lie on this contour, and removing it would result in a degradation in MRE.

Q7 (R2) Other clarifications: Thanks for pointing this out. Your understanding of Δt and Δn is correct. The contour heatmaps in Eq. 6 refer to the ViT’s predictions rather than the ground truth. We will fix this typo in the revised version.

Q8 (R3) Innovation: Previous methods overlook the importance of contextual and structural information, which limits their performance in cross-domain scenarios. In contrast, we propose a novel parallel extraction-and-fusion architecture to utilize both types of features, overcoming the weaknesses of prior approaches.

Q9 (R3) Contour robustness: Our dataset includes images from various domains with differing quality levels. The MRE results remain stable across domains, regardless of image quality, as shown in Table 1. Furthermore, the mIoU on low-quality samples (e.g., noisy or low-resolution images) shows only a slight drop, demonstrating the robustness of our contour generation module.

Q10 (R3) Generalization: Our approach leverages stable contour structures to guide landmark localization. This idea is transferable to other non-medical tasks, such as skeleton-based pose estimation, facial landmark detection, and vehicle part localization.

[1] Jacobson, A. (2006). Radiographic Cephalometry: From Basics to 3-D Imaging (2nd ed.). Chicago: Quintessence Publishing.
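
[Editorial note] For reference, a minimal sketch of the loss composition stated in Q5 above. Only the weighting (λ1 = 2.0, λ2 = 5.0) is taken from the rebuttal; the three loss terms are placeholders for the definitions given in the paper.

    import torch

    def total_loss(l_reg, l_h, l_mse, lambda1=2.0, lambda2=5.0):
        """Combine the anisotropic regression loss (L_reg), the mean logistic
        landmark-heatmap loss (L_h), and the contour-heatmap MSE loss (L_mse)
        with the weights reported in the rebuttal."""
        return l_reg + lambda1 * l_h + lambda2 * l_mse

    # e.g. loss = total_loss(torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.05))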




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Authors clarified main weaknesses in the rebuttal. There is strong agreement of reviewers regarding acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


