Abstract
Anatomical landmarks are used for clinical measurements, for screening, and to guide treatment decisions. In this work, we explore the clinical application of landmark-based angle measurements, with the particular aim of screening infants for Developmental Dysplasia of the Hip.
Our automated method uses a simple UNet++ architecture to predict landmark heatmaps, which represent landmark localisation certainty. A Monte Carlo-like approach is then used to approximate an angle distribution from the landmark heatmaps (see the sketch after this abstract), and we propose a confidence metric derived from these angle distributions.
Annotations from multiple clinicians are combined and compared to the machine predictions. The machine-generated angle distribution is verified by confirming that the per-scan mean angle values and standard deviations correlate between the combined clinicians and the machine. The machine's confidence also strongly correlates with the sum of the confidence scores given by the clinicians for each scan.
This work is the first to present a method for estimating the distribution of clinically relevant angles from predicted landmarks. Landmark-based angle confidence can underpin robust methods and increase clinician trust in automated or computer-aided measurements.
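As a rough illustration of the Monte Carlo-like step described above, here is a minimal sketch, not the authors' implementation: each heatmap is treated as a probability mass function over pixels, landmark positions are sampled from it, and the resulting angles are accumulated. A simplified three-landmark vertex angle stands in for the clinical angle, and all names are hypothetical.

```python
import numpy as np

def sample_angle_distribution(heatmap_a, heatmap_b, heatmap_c,
                              n_samples=1000, rng=None):
    """Approximate an angle distribution by sampling landmark positions
    from three predicted heatmaps (vertex at landmark B)."""
    rng = np.random.default_rng() if rng is None else rng

    def sample_points(hm, n):
        p = hm.ravel() / hm.sum()                 # heatmap as a pmf
        idx = rng.choice(p.size, size=n, p=p)     # sample pixel indices
        return np.stack(np.unravel_index(idx, hm.shape), axis=1).astype(float)

    a, b, c = (sample_points(h, n_samples) for h in (heatmap_a, heatmap_b, heatmap_c))
    v1, v2 = a - b, c - b                         # rays from the vertex
    cos = np.einsum('ij,ij->i', v1, v2) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # angle samples
```

The spread of the returned samples (for instance, their standard deviation) is one natural basis for a per-scan confidence score: a narrower distribution implies higher confidence.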
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3981_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{CleAll_Confidence_MICCAI2025,
author = { Clement, Allison and Willoughby, James and Voiculescu, Irina},
title = { { Confidence in Angle Predictions for Clinical Decision Support } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15974},
month = {September},
pages = {116--124}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a UNet-based method to predict landmark heatmaps and estimate angle distributions for the screening of hip dysplasia in infants. Predictions were shown to correlate strongly with clinical annotations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The authors studied a clinically relevant problem and present a method that performs similarly to expert annotations. 2) The authors gathered an extensive dataset to estimate inter-observer variability and derive a confidence measure.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) My main point of concern relates to the way the heatmaps are created, which affects every aspect of the rest of this manuscript: Typically in landmark detection, a wider sigma parameter is chosen, especially when the landmarks can be placed anywhere on a line. You could have also considered some anisotropic heatmap distribution, i.e. a stronger fall-off orthogonal to the bone ridge. Related to that:
- Why was no loss used during training that directly penalizes the landmark locations? And would it make sense to add a loss term for the Graf angle, so as to train end-to-end?
- The Monte Carlo estimation will directly reflect the choice of how big the Gaussian blobs are, so I’m not sure how meaningful this measure really is. As the authors point out themselves, if the heatmaps overlap, arbitrary angle measurements will render the confidence score useless. How about estimating the biggest non-overlapping iso-set of both heatmaps, and then using a subset of these to compute the range of the corresponding line measurement angles? (A sketch of this idea appears after this list.)
- Have you explored other strategies like dropout to estimate confidence scores?
2) Evaluation: For the expert and non-expert annotations: Did you use any weighting for creating the CALD maps, such as the confidence scores estimated by the annotators themselves? And couldn’t you just use the 10 angles directly to estimate the mean, standard deviation, etc.?
- Other papers in the field have reported AUC as a means of using an AI method directly as a diagnostic test; it would be interesting to see how your method fares in this regard.
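A minimal sketch of one possible reading of the iso-set suggestion above, with hypothetical names, a pixel-level overlap test, and an exhaustive pairwise search that is only practical for small iso-sets:

```python
import numpy as np

def isoset_angle_range(h1, h2, levels=50):
    """Lower a shared iso-level until the two heatmaps' super-level sets
    would touch, then report the min/max angle of the line joining any
    point of one set to any point of the other."""
    ts = np.linspace(1.0, 0.0, levels) * min(h1.max(), h2.max())
    best = ts[0]
    for t in ts:                              # descend from the peaks
        if np.any((h1 >= t) & (h2 >= t)):     # sets share a pixel: overlap
            break
        best = t                              # last non-overlapping level
    p1 = np.argwhere(h1 >= best)              # (row, col) coordinates
    p2 = np.argwhere(h2 >= best)
    d = p2[None, :, :] - p1[:, None, :]       # all pairwise displacements
    angles = np.degrees(np.arctan2(d[..., 0], d[..., 1])).ravel()
    return angles.min(), angles.max()
```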
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
“translation (x≤0.1 pixels, y≤0.1 pixels)” hardly counts as augmentation; this is too little.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Important aspects of the method’s validation are incomplete and need to be addressed.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
My comments, in particular on the heatmaps and sigma values, were not sufficiently addressed. I disagree that the downstream task is not relevant, given that “Clinical Decision Support” is even in the paper title.
Review #2
- Please describe the contribution of the paper
This paper proposes a confidence metric for angle prediction based on automated landmarking for DDH screening. Rather than proposing a novel AI model, as most studies do, the work presents an interesting new metric for better scientific evaluation and practical use of AI-based landmarking.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This study is an important step towards increased transparency in AI/ML-based landmark predictions, providing a confidence metric to clinicians so they can make better-informed decisions. The metric will allow clinicians to use their judgement in trusting machine-derived conclusions instead of blindly following AI-provided outcomes.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Typo in the Figure 2 caption.
The paper is missing a discussion of the practical implications and usage of the proposed metric. For example, how will the confidence metric impact the screening of DDH patients? Please provide some evidence of the effect of providing this additional information on patient care.
Some of the terminology used in this paper is unclear and difficult to follow.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Explain the terms and calculations in a different, more digestible way.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Accept if the authors can reword the paper to make it accessible to most readers and clearly explain the impact of their angle-prediction confidence metric.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have brought a unique technical perspective to their work, and with the clinical application justified, I am sure this work can be useful to many in the field.
Review #3
- Please describe the contribution of the paper
This paper proposes a method for obtaining valid confidence measures around angle estimates derived from automatically extracted landmark points in ultrasound images. While the evaluation of this method focused on the application of ultrasound-based screening for developmental dysplasia of the hip, measuring angles from a set of point landmarks on an image is a frequent occurrence in many screening, diagnostic and follow-up procedures and the implications of this paper could be far-reaching.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper presents an extremely useful method, and the method was also very well validated. Validation relied on landmark labels from a relatively large sample of experts as well as their own confidence assessments of these. The results show good correlation between the confidence estimates produced by the automated method and the various confidence measures (inter-rater reliability, but also the confidence assessments provided by individual raters) gathered in the study over a large number of ultrasound images. This is also an extremely well-written, well-organized and well-presented paper.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
One weakness of this paper (though I would certainly not call it major) is that intra-rater variability is not properly addressed. It is alluded to towards the bottom of page 3, and modelled as a Gaussian spread with sigma = 1 (pixel? millimetre? ??? why 1?) around the rater’s label, rather than estimated from repeat measurements. I do not think that this weakness significantly alters the conclusions of the study, but the limitation should perhaps be mentioned in the conclusions.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
What are the units of sigma=1 when using Gaussian distributions to generate heat maps? Pixels? What justifies sigma = 1?
Fig. 2: I believe left and right are mixed up in the caption
“Outliers for the machine confidence in this plot are the same as the outliers identified in Fig. 6”. Did you mean Fig. 5 (left)?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a truly excellent paper with important practical implications on several existing measurement processes. The scientific rigour is exemplary and goes beyond what I normally expect in a conference paper.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
R1-7: We did not find datasets with intra-rater data for this task. Existing work reports classification metrics, which are not currently comparable [1]. Intra-rater datasets are hard to collect. We focused on inter-rater variability to avoid confusion.
We use scikit-image for Gaussian filtering (sigma is the standard deviation, in pixels). Sigma S=1 results in a 9-pixel kernel. Landmarks were placed with a 14-pixel diameter, motivating S=1. The literature reports 1<=S<=7 for landmark detection; [2] shows minimum radial error at S=3, with only small changes across this range. We kept S=1 to be consistent with the CALD maps.
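For concreteness, a minimal sketch of heatmap construction consistent with this description (the helper name is hypothetical). With scikit-image's default truncate=4.0, sigma S=1 gives a kernel radius of 4 pixels, i.e. 9 pixels wide, matching the figure quoted above.

```python
import numpy as np
from skimage.filters import gaussian

def landmark_heatmap(shape, yx, sigma=1.0):
    """Place a delta at the annotated landmark and blur it with a
    Gaussian of standard deviation `sigma` (in pixels)."""
    hm = np.zeros(shape, dtype=float)
    hm[yx] = 1.0
    hm = gaussian(hm, sigma=sigma, truncate=4.0)  # 9-pixel kernel at sigma=1
    return hm / hm.max()                          # peak-normalised heatmap

hm = landmark_heatmap((128, 128), (64, 40), sigma=1.0)
```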
Our training data was labelled by one rater, as is common in most datasets. The model could be trained on combined heatmaps, but we show that only one rater is needed. Requiring many raters would decrease the generality of this method.
R1-10: See R1-7. Typos fixed.
R2-7: While sensible, we lack evidence in the ground truth to support that a landmark can be “placed anywhere on the line”. Across the 10 raters, we found that the distribution was often more bimodal than unimodal when they disagreed.
The Gaussian filter was used to mimic the annotations and represent rater confidence (reducing outwards from the centre). Creating a non-uniform spread from the annotations during training would introduce further assumptions which we do not have the evidence to defend.
See R1-7 heatmap/S choice.
Works have explored class-losses but found minimal impact [1]. Our work focuses on confidence of landmark-based angle measures rather than classification. A class-loss creates a task-specific model and does not allow for generalisation. There is no evidence that the NLL-loss would not learn this relationship between landmarks over iterations. Finally, we know anatomical landmarks have spatial relationships and so we use cross-channel attention to help retain and more quickly learn this association.
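The rebuttal does not specify the form of the cross-channel attention; as an assumption, a squeeze-and-excitation-style block is one common way to let each landmark channel be reweighted using a summary of all channels, sketched here in PyTorch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style cross-channel attention: reweight each landmark channel
    using a global summary of all channels, so spatially related
    landmarks can inform one another."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))             # squeeze: global average pool
        w = self.fc(w)[..., None, None]    # per-channel weights (B, C, 1, 1)
        return x * w                       # excite: reweight channels
```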
The outliers are shown in this work to highlight the ambiguity in this task. Landmarks along the ilium are very inconsistent in location, as their relative position is more important. It is possible to use the ‘iso-set’ of heatmaps for angle ranges; however, this would be task-specific.
We have not used MC dropout for uncertainty in our work to date.
Self-reported confidence was deemed too imprecise as a weighting mechanism. We found that these values were a) poorly calibrated due to self-reporting and the lack of a reference, b) more strongly correlated with seniority than quality, and c) indicative of diagnostic confidence rather than label accuracy (a rater may be confident in placement but feel the diagnosis is unclear).
Using the 10 angles to estimate the mean would only allow us to place the machine in contrast to these 10 raters. The intention behind any confidence metric is to compare the prediction to the underlying population of values from which these 10 are only a sample. In order to give a confidence value we must estimate the population of clinical angles as we try to do here.
AUC here is i) classification-oriented (not our focus) and ii) reliant on a landmark-derived confidence metric to produce an ROC curve (such as the method we aim to validate). By proposing/validating a confidence metric in this work, we can produce an AUC for landmark-derived classification in future but it will only be meaningful with this validation.
R2-9: We can list additional information if specified.
R2-10: The limited translation in augmentation is noted. The 0.1 value refers to 10% of the x/y-axis, not 0.1 pixels.
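For illustration only (the toolchain is not stated in the rebuttal): in torchvision, for example, such a translation augmentation is expressed with fractions of the image axes rather than pixels.

```python
from torchvision import transforms

# translate=(0.1, 0.1) shifts by up to 10% of image width/height,
# not 0.1 pixels; degrees=0 disables rotation.
augment = transforms.RandomAffine(degrees=0, translate=(0.1, 0.1))
```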
R3-7: Fixed.
This work provides a foundation for universal DDH screening. In many countries, screening is limited by time and budget. A confidence score on angle estimates allows automated systems to send only a small sample of cases for review, increasing access to screening while reducing cost.
R3-7/10/12: We have polished the technical/clinical wording further. We are unsure which points need clarification and would require guidance to address this.
[1] Ref. 3 of the paper.
[2] https://ora.ox.ac.uk/objects/uuid:bcce0a2f-b
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While the clinical motivation is sound, the study lacks sufficient novelty and methodological depth to warrant publication.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers strongly support the clinical importance of this work, which outweighs some of the (meaningful) concerns raised by the third reviewer. I am slightly favoring acceptance of this work.