Abstract
Radiographic knee alignment (KA) measurement is important for predicting joint health and surgical outcomes after total knee replacement. Traditional methods for KA measurement are manual, time-consuming, and require long-leg radiographs. This study proposes a deep learning-based method to measure KA in anteroposterior knee radiographs via automatically localized knee anatomical landmarks. Our method builds on hourglass networks and incorporates an attention gate structure to enhance robustness and focus on key anatomical features. To our knowledge, this is the first deep learning-based method to localize over 100 knee anatomical landmarks to fully outline the knee shape while integrating KA measurements on both pre-operative and post-operative images. It provides highly accurate and reliable anatomical varus/valgus KA measurements using the anatomical tibiofemoral angle, achieving mean absolute differences of approximately 1° when compared to clinical ground truth measurements. Agreement between automated and clinical measurements was excellent pre-operatively (intra-class correlation coefficient (ICC) = 0.97) and good post-operatively (ICC = 0.86). Our findings demonstrate that KA assessment can be automated with high accuracy, creating opportunities for digitally enhanced clinical workflows.
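To illustrate the kind of measurement the abstract describes, below is a minimal sketch of how an anatomical tibiofemoral angle could be computed once landmark coordinates have been localized. The two-point axis construction, the function name, and the example coordinates are illustrative assumptions; the paper's exact point definitions and angle convention are described in the full text.

```python
import numpy as np

def axis_angle_deg(femur_prox, femur_dist, tibia_prox, tibia_dist):
    """Angle (degrees) between two anatomical shaft axes, each defined by a
    proximal and a distal landmark given as (x, y) pixel coordinates.
    Hypothetical construction, not the authors' exact aTFA definition."""
    femoral_axis = np.asarray(femur_dist, float) - np.asarray(femur_prox, float)
    tibial_axis = np.asarray(tibia_dist, float) - np.asarray(tibia_prox, float)
    cos_a = np.dot(femoral_axis, tibial_axis) / (
        np.linalg.norm(femoral_axis) * np.linalg.norm(tibial_axis))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# Example with made-up pixel coordinates:
aTFA = axis_angle_deg((250, 40), (238, 300),   # femoral shaft axis points
                      (238, 320), (260, 580))  # tibial shaft axis points
print(f"anatomical tibiofemoral angle ~ {aTFA:.1f} degrees")
```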
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4700_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{HuZhi_Deep_MICCAI2025,
author = { Hu, Zhisen and Cullen, Dominic and Thompson, Peter and Johnson, David and Bian, Chang and Tiulpin, Aleksei and Cootes, Timothy and Lindner, Claudia},
title = { { Deep Learning-based Alignment Measurement in Knee Radiographs } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This study introduces a deep learning-based method for automated knee alignment (KA) measurement from anteroposterior knee radiographs. The method localizes over 100 knee anatomical landmarks to outline the knee shape and integrates KA measurements for both pre-operative and postoperative images. It achieves high accuracy in clinical measurements.
The main contribution is primarily the clinical application; the deep learning model itself is largely based on the computer vision model developed in [8].
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The main strength lies in the clinical application and validation. The evaluations were carried out using three different metrics. The description of the workflow is detailed and easy to follow, enabling readers to replicate the research.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
My primary concern is the lack of novelty. The deep learning model used in this paper is similar to the model in [8]: both use a UNet-like architecture with an hourglass-style encoder-decoder. The only difference is the addition of attention gates within the model, yet there is no ablation study to demonstrate the effectiveness of these attention gates.
The other issues are: Why are two sets of manual annotations used? One is referred to as “manual measurement,” which, according to the description, is a subset of the manually annotated landmark positions in pre-operative and post-operative images from 376 test patients, used as the manual ground truth. The other is “clinical measurements,” which are obtained using commercial software. The question is: what is the difference between the manual measurements and the clinical measurements? This distinction is unclear and needs clarification.
The definition of point-to-curve (rP2C) distances is missing.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper lacks an ablation study, which is necessary to demonstrate whether the addition of attention gates actually improves the model’s performance. The main weakness of this paper is its lack of novelty. Additionally, it does not include any comparisons with existing state-of-the-art deep learning models.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper presents a framework towards automating knee assessments by measuring knee alignment from AP knee radiographs of Total Knee Replacement patients. This is achieved by training an attention-enhanced hourglass model to detect several landmarks in the knee radiograph and computing the anatomical tibiofemoral angle (aTFA). The model is trained on 566 pre-operative and 457 one-year post-operative radiographs, and evaluated on a test set of 376 patients with pre- and post-operative images. 134 landmarks were annotated on pre-operative and 181 landmarks on post-operative images. Quantitative experiments are reported on landmark estimation accuracy. Furthermore, a comparative analysis between the clinical ground truth (obtained from clinical assessment by an orthopedic surgeon) and the algorithmically estimated measurements is performed on a subset of the test dataset; Bland-Altman analysis is used to assess agreement, in addition to correlation analysis and mean absolute deviation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The presented approach towards automating AP knee radiograph assessment would likely be valuable to the medical community given the prevalence of Total Knee Replacements.
- Quantitative comparison with established methods such as Bland Altman analysis, with the clinical ground truth (obtained by averaging multiple assessments) is sound and well explained. Performance analysis is conducted not only for the final measurements but also on the intermediate step of landmark estimation.
- Strong quantitative results (e.g., MAD of ~1 degree w.r.t. clinical ground truth).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Several of the limitations of the evaluation have been acknowledged by the authors in the discussion section, for instance, the lack of experiments studying the generalization of the approach on an independent dataset (i.e., another hospital site, with potentially small variations in the acquisition workflow or exposure settings). The size of the dataset for clinical evaluation seems quite small, given the prevalence of the condition and, hence, the likely variations. If possible, the authors should consider reporting the demographic distribution of the patients to get a better sense of potential bias induced by the small size of the test dataset.
- The need for 134+ landmarks for knee radiographs is unclear. Only a subset of landmarks is used for the angular measurement, so the full set appears to act as regularization, helping to detect the key landmarks with higher accuracy by serving as context. It would be helpful if the authors could add an analysis of which, and how many, landmarks are most suitable, and whether the annotation can be standardized, so that the algorithm can be independently evaluated not only on the aTFA but also on the accuracy of the key landmarks.
- This appears to be primarily an application paper focused on knee radiographs with minor technical novelty; while the results seem valuable and relevant, it is unclear whether the work would serve the broader interest of the MICCAI community given the specificity of the proposed method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is focused on a specific application of AP knee assessment, and while the results are very promising, they are reported on a relatively small dataset where the potential for bias is unclear. If the authors can rebut these concerns, especially regarding potential bias in the evaluation, I would be inclined to recommend the paper for acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This study addresses knee alignment (KA) measurements on radiographs, which are essential for successful treatment in total knee replacement interventions. While traditional KA measurement methods are manual (hence time-consuming) and require long-leg radiographs, automated methods can be applied to standard anteroposterior knee radiographs (which are more commonly used), with reduced costs and improved efficiency. This study proposes a deep learning-based approach to automatically localize knee anatomical landmarks, which are then used to compute KA measurements automatically in a subsequent step.
The proposed method uses a (quite standard) hourglass network architecture combined with an attention gate structure to better focus on joint shapes in knee radiographs. This is the first deep learning-based study to localize over 100 anatomical landmarks in knee radiographs to facilitate KA measurements.
The anatomical tibiofemoral angles (aTFA) estimated by their approach are accurate and agree with manual measures.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper presents a precise idea, in which a DL-based landmark localizer supports a clinical routine. Tests performed on data collected by a hospital, and comparison with both manually and automatically generated results, demonstrate the clinical feasibility and interest of the proposal. The paper is well written and organized.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper is lacking in its presentation of the state of the art, which is important to appreciate the novelty of the proposal. Only papers [6, 7] are mentioned, and the corresponding techniques are not presented in sufficient detail. Paper [13] is the main reference. It has been attached as supplementary material as it has not yet been published and belongs to the same authors. However, what [13] precisely implements is not described in the text. The authors cite [13] in the "contributions" paragraph but do not mention it in the preceding overview of the state of the art.
In some cases, the metrics for the proposed method are not better than the clinical baseline. A deeper analysis of these performance differences is not provided.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The evaluation procedure is only roughly explained and many details are given in the captions. How should we interpret the rP2P and rP2C metrics? What do the different metrics in Table 2 refer to, precisely?
- In Table 2, why do the metrics differ so much between pre- and post-operative cases? In pre-op, the proposed study does not significantly outperform [13]. Is this due to the larger angle, affecting absolute differences? In post-op cases, does the better KA make the aTFA estimation more challenging? If so, the achieved improvement is even more remarkable.
- In Table 2, the agreement between C and M could be interesting, as they are both considered ground-truth data.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think the major strengths are more relevant than the weaknesses, and the proposed scheme can serve as inspiration for other studies involving automatic landmark localization.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my points
Author Feedback
We thank the reviewers for their constructive comments. Note: Reference [13] is now published.

Novelty (R1 & R2): This is an application-focused study aimed at automating knee alignment measurements for clinical practice. Our goal is to demonstrate a robust method of potential clinical significance.

Ablation study for attention gates (R1): Attention gates slightly improved landmark localization accuracy (~0.5–2%). However, as this is an application-focused study on knee alignment measurement, not a technical innovation paper, we excluded ablation results.

Comparison with SOTA deep learning models (R1): The focus of this work is on the application of landmark detection to address a clinical need (i.e., automated alignment measurement). The proposed system is based on KNEEL [8], an hourglass network with greatly improved landmark localization accuracy. However, KNEEL did not outperform RFRV-CLM [6,7] in high-precision tasks. Thus, we compare our alignment results to RFRV-CLM [13], which is SOTA. Given the page limitations, we consider a comparison to alternative deep learning methods beyond scope.

Two sets of ground truth (R1): Manual measurements are point-based, derived from manually annotated landmarks, while clinical measurements are taken directly by an orthopedic surgeon (see Section 2.3). Including both sets of measurements allows us to assess the impact of automation as well as the agreement with clinically established alignment measurements. The latter expands the scope beyond computer vision, enabling us to assess clinical utility and thereby better aligning the work with the objectives of the 'CAI' domain.

Interpretation of rP2P and rP2C (R1 & R3): Curves represent anatomical boundaries (femur, tibia, fibula). P2P is the Euclidean distance between a predicted and a manual ground-truth point; P2C is the distance from a predicted point to the bone boundary defined by the ground-truth points – both averaged over all points per image. As we do not have pixel size information, we report relative distances (rP2P, rP2C) normalized by the tibial shaft width for clinical interpretability. We will clarify this.

Dataset demographics (R2): In the test set (n=376), 58 (15.4%) had unknown gender/ethnicity, and 2 (0.5%) had unknown age. Known cases had: mean age 69.2 ± 8.7 years; 42.1% male; 88.4% White. The clinical subset (n=50) showed similar demographics: mean age 70.3 ± 8.2; 48.8% male; 97.6% White; 9 (18%) unknown gender/ethnicity. As TKR primarily affects older adults, this age distribution reflects real-world demographics. The data were collected without preselection and represent the patient population at the collaborating hospital. We will add these details.

Why 134+ points? (R2): While only a subset is used for the aTFA, we included 134+ points to support future extensions (e.g., other angles, joint space width, joint shape analysis). Using 134+ points also allows a direct comparison with [13], which used the same point set.

Clarifying [13] as SOTA (R3): RFRV-CLM [6,7] and hourglass networks [8] are the SOTA for knee landmark localization. [13] applied RFRV-CLM to detect landmarks and then used several points to measure the aTFA. We appreciate the reviewer's note and will clarify how [6,7] relate to [13].

Metrics in Table 2 (R3): The intra-class correlation coefficient (ICC) assesses agreement between sets of measurements. The mean absolute difference (MAD) quantifies the average error between sets of measurements. Bland-Altman analysis (BAA) allows assessment of both bias and agreement.
Differences in pre- and post-op performance (R3): Our model shows strong agreement with clinical measurements pre-operatively, validating both its reliability and the point-based aTFA definition. Post-operative results were less consistent, likely due to anatomical changes from TKR not captured well by our point-based definitions. However, our method outperformed [13] in post-operative cases. Future work will focus on improving post-operative performance.
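For readers interpreting the rP2P and rP2C definitions given in the rebuttal above, the following is a minimal sketch consistent with those definitions. The function names, the polyline approximation of the bone boundary, and the toy coordinates are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rP2P(pred, gt, tibial_shaft_width):
    """Relative point-to-point error: mean Euclidean distance between
    predicted and ground-truth landmarks (both of shape (N, 2)),
    normalized by the tibial shaft width."""
    return np.linalg.norm(pred - gt, axis=1).mean() / tibial_shaft_width

def rP2C(pred, gt_boundary, tibial_shaft_width):
    """Relative point-to-curve error: mean distance from each predicted
    point (shape (N, 2)) to the nearest point of the ground-truth bone
    boundary, here approximated as a dense polyline (shape (M, 2)),
    normalized by the tibial shaft width."""
    dists = np.linalg.norm(pred[:, None, :] - gt_boundary[None, :, :], axis=2)
    return dists.min(axis=1).mean() / tibial_shaft_width

# Toy usage with made-up pixel coordinates:
pred = np.array([[100.0, 200.0], [150.0, 210.0]])
gt = np.array([[101.0, 198.0], [149.0, 212.0]])
boundary = np.array([[100.0, 199.0], [125.0, 204.0], [150.0, 211.0]])
print(rP2P(pred, gt, tibial_shaft_width=80.0))
print(rP2C(pred, boundary, tibial_shaft_width=80.0))
```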
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The approach is likely to demonstrate clear benefits in clinical application; however, it appears to lack sufficient technical advancement. It may be more suitable for submission to a clinical journal.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Reviewers mention that one of the shortcomings of this work is the lack of a comparison with state-of-the-art landmark localization methods. These are discussed in the related work; however, the decision to use the chosen method is not well justified and is not backed up by evidence in the form of baseline comparisons, except against work from the same authors. The rebuttal does not adequately address this shortcoming. While the clinical aspect is central here, it is nevertheless legitimate to ask whether one of the many recent landmark localization methods would perform better. Since the methodological contribution is limited and the evaluation is not comprehensive in terms of baselines, I tend to reject this paper.