Abstract

We present a novel method for explainable vertebral fracture assessment (XVFA) in low-dose radiographs using deep neural networks, incorporating vertebra detection and keypoint localization with uncertainty estimates. We incorporate Genant’s semi-quantitative criteria as a differentiable rule-based means of classifying both vertebral fracture grade and morphology. Unlike previous work, XVFA provides explainable classifications relatable to current clinical methodology, as well as uncertainty estimations, while at the same time surpassing state-of-the-art methods with a vertebra-level sensitivity of 93% and end-to-end AUC of 97% in a challenging setting. Moreover, we compare intra-reader agreement with model uncertainty estimates, finding model reliability on par with human annotators.
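
For context, Genant’s semi-quantitative criteria grade a vertebra by the relative loss of its anterior, middle, and posterior heights. Below is a minimal hard-threshold sketch of this rule applied to six vertebral keypoints; the keypoint layout, the `genant_grade` function, and the use of the tallest measured height as reference are illustrative assumptions rather than the paper’s implementation, which replaces these hard rules with differentiable fuzzy ones.

```python
import numpy as np

# Minimal sketch of Genant's semi-quantitative (SQ) criteria from six
# vertebral keypoints. Keypoint layout, function name, and the use of the
# tallest measured height as reference are illustrative assumptions; the
# paper replaces these hard thresholds with differentiable ("fuzzy") rules.

def genant_grade(points: np.ndarray) -> tuple[int, str]:
    """points: (6, 2) array of (x, y) keypoints, rows ordered as
    [ant-sup, mid-sup, post-sup, ant-inf, mid-inf, post-inf]."""
    heights = np.linalg.norm(points[:3] - points[3:], axis=1)   # ha, hm, hp
    loss_per_column = 1.0 - heights / heights.max()   # relative height loss
    loss = loss_per_column.max()
    morphology = ["wedge", "biconcave", "crush"][int(loss_per_column.argmax())]
    if loss < 0.20:
        return 0, "normal"        # grade 0: < 20% height loss
    if loss < 0.25:
        return 1, morphology      # grade 1 (mild): 20-25%
    if loss < 0.40:
        return 2, morphology      # grade 2 (moderate): 25-40%
    return 3, morphology          # grade 3 (severe): > 40%
```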

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1871_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1871_supp.pdf

Link to the Code Repository

https://github.com/waahlstrand/xvfa

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wåh_Explainable_MICCAI2024,
        author = { Wåhlstrand Skärström, Victor and Johansson, Lisa and Alvén, Jennifer and Lorentzon, Mattias and Häggström, Ida},
        title = { { Explainable vertebral fracture analysis with uncertainty estimation using differentiable rule-based classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a two-stage, interpretable approach for vertebral fracture analysis in low-dose, high-noise radiographs. Inspired by the clinical Genant Semi-Quantitative (GSQ) scoring system, it leverages existing vertebrae detection and keypoint estimation methods, followed by a combination of fuzzy (based on key points) and neural network (based on visual features) classifiers for fracture prediction. The method was evaluated on a private dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (i) Clinically-motivated interpretability: The paper addresses a challenging problem with a clinically-inspired approach (GSQ) for interpretable fracture analysis on projection images. This focus on explainability is valuable for clinical translation within the MICCAI community.

    (ii) Transparent architecture: The method mimics the clinical workflow, using modular components like keypoint estimation and fuzzy classifiers for interpretability. This design offers transparency into decision-making.

    (iii) Evaluation strategy: The use of AUC, F1, sensitivity, and specificity with Youden’s J statistic is appropriate for imbalanced datasets. Five-fold cross-validation strengthens the evaluation, though the absence of statistical significance testing is a limitation.
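
    For reference, Youden’s J statistic is J = sensitivity + specificity - 1, which on an ROC curve equals TPR - FPR; choosing the threshold that maximizes J balances the two error rates on imbalanced data. A minimal sketch of this threshold selection, assuming scikit-learn and illustrative names (not the paper’s code):

    ```python
    import numpy as np
    from sklearn.metrics import roc_curve

    # Illustrative sketch: pick the operating threshold that maximizes
    # Youden's J = sensitivity + specificity - 1 = TPR - FPR.
    def youden_threshold(y_true, y_score):
        fpr, tpr, thresholds = roc_curve(y_true, y_score)
        return thresholds[np.argmax(tpr - fpr)]
    ```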

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (i) Limited Technical Novelty: The paper claims methodological innovation but primarily relies on existing vertebrae detection and keypoint estimation techniques. The core contribution seems to be the fuzzy-logic classifier integration.

    (ii) Questionable interpretability benefit: If keypoints provide sufficient information for GSQ scoring, why is the fuzzy classifier needed? The authors should investigate whether a simpler approach using keypoints directly could achieve similar interpretability.

    (iii) Missing validation: The paper lacks some important information: (a) is the fuzzy classifier statistically significant in improving performance? (b) are interpretable GSQ scores generated and evaluated for clinical relevance? (c) how does the interpretability impact clinical workflow compared to existing methods?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    (i) While the authors plan to release the source code and trained models upon acceptance, the dataset will not be publicly available. However, the paper provides clear descriptions of data statistics in both the main manuscript and supplementary material.

    (ii) The paper details the training procedures and hyperparameters, with the latter chosen through a grid search strategy, which is a positive aspect.

    (iii) The experiments incorporate cross-validation, but reporting statistical significance of the results would further strengthen the evaluation.

    (iv) Uncertainty quantification experiments lack details on human annotations, including the number of annotators and their qualifications.

    (v) The paper lacks specifics concerning the computational infrastructure (hardware and software) used for all the experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (i) Improve clarity and readability throughout the paper, especially the technical details of the classifier.

    (ii) Compare the proposed method to recent works in interpretability and explainability for regression ([1] for example), highlighting its strengths and weaknesses.

    (iii) Address the missing statistical analysis and report the interpretability of the final results (GSQ scores).

    (iv) Justify the need for the fuzzy classifier by demonstrating its clear benefit in interpretability compared to the standalone neural network and the simpler, direct computation of GSQ scores from keypoint estimates.

    (v) Consider conducting studies in clinical settings with radiologists to evaluate the impact of interpretability on clinical workflow and user trust in future work.

    Reference(s): [1] Toward Explainable Artificial Intelligence for Regression Models: A methodological perspective, IEEE Signal Processing Magazine 2022.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The focus on interpretability and clinical inspiration is appreciable, but the technical contribution and validation are currently insufficient for a methodology paper.

    Addressing the weaknesses, particularly regarding the fuzzy classifier’s value and interpretability benefits, along with stronger validation and clinical evaluation, could strengthen the paper.

    This paper has potential for the MICCAI conference but needs significant improvement. As it stands, this work might be better suited for venues focused on clinical translation, with additional studies demonstrating its usefulness for radiologists.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ clarification on comparison to directly computing GSQ ratios with key points. It would be helpful to see this point addressed more clearly in the manuscript. However, the lack of statistically significant differences between the ablation studies (not reported) – especially considering the methods achieve competitive scores – weakens the justification for the added complexity of the proposed approach compared to the simpler method (with similar explainability benefits) of directly computing the GSQ ratios. If possible, I would recommend adding the statistical significance results to the final manuscript to provide a clearer picture to the audience. That being said, the overall paper’s focus on clinical translation and transparency is motivated by sound clinical principles and therefore, will be of interest to the MICCAI community. Taking these aspects into consideration, I recommend an accept for this paper.



Review #2

  • Please describe the contribution of the paper
    • authors propose an explainable deep learning model for vertebral fracture assessment in radiographs.

    • proposed model was inspired by Genant’s method used by physicians for fracture assessment.

    • proposed method extracts landmarks that can be used to assess the deformation of the vertebral fracture.

    • the proposed framework includes several sub-methods, for instance, a bounding box detection technique.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • paper is well-written and easy to follow.

    • authors achieved good performance, with an AUC value of 0.97.

    • the proposed vertebra classification method, inspired by Genant’s approach, seems interesting and well designed for the task.

    • proposed method features a landmark localisation uncertainty quantification technique.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the proposed framework was built using several components (Fig. 2). However, each component on its own was of limited technical novelty. For instance, the detection method was based on the DETR model. The original contributions of the authors could have been better highlighted across the text.

    • authors state that the CNN and random forest classifier (trained using vertebra features) are not explainable (Table 2 and text). Also, authors mention that the Grad-CAM algorithm cannot explain model reasoning. I believe that such statements might be a little confusing for the readers, and should be rephrased.

    • authors did not generate saliency maps for the CNN model. Such maps would benefit the manuscript.

    • figures could be improved. Visualisations in Fig. 3 should be expanded. The scheme of the framework in Fig. 2 should be better described (be self-explanatory).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • authors stated that “Grad-CAM does not explain model reasoning” and that the CNN and the random forest algorithms (trained using established vertebra features) are not explainable. I think that such statements might be a bit confusing for the readers. First of all, the Grad-CAM algorithm can be used to generate saliency maps pointing out image regions that are important for classification. According to Table 2, the CNN achieved a similar AUC value to the proposed method. However, it is unclear which regions activated the CNN. Second, a single decision forest classifier trained using hand-crafted features is hardly a black-box model. Therefore, I think that it would be beneficial to also present saliency maps obtained for the CNN in Fig. 3. This would illustrate the differences between the methods in generating explanations.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript presents a solid study, clearly demonstrating a well-designed approach for the tackled task. However, the technical novelty of the proposed method is somewhat limited. I believe that authors need to address issues related to explainability before acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Authors have addressed my comment in a satisfactory way.



Review #3

  • Please describe the contribution of the paper

    This study introduces a multi-step deep learning method for detecting and classifying vertebral fractures using low-dose X-ray images. The classification task incorporates explainable criteria, and the detection component is complemented by uncertainty estimation, enhancing the tool’s transparency and reliability. The method demonstrates superior performance compared to existing approaches, as evaluated on a private dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method’s foundation on decision criteria commonly employed in clinical practice enhances experts’ trust in the AI tool while maintaining complete automation from vertebrae detection to final classification, thus expediting decision-making.
    • The method is original by simultaneously achieving high detection and classification accuracy, employing an explainable decision process, and providing an uncertainty map for relevant keypoint detection. This unique combination renders it innovative, trustworthy, and reliable.
    • Rigorous evaluation analysis, including quantitative comparison with state-of-the-art methods, strengthens the study’s credibility.
    • Overall, the manuscript is well-organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of clarity in certain aspects of the methods description, particularly in the definition of loss functions where some terms remain undefined.
    • Visual representation of results could be improved for enhanced clarity.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have provided a detailed method description, and they plan to make the code and the trained model available in a public repository. However, data sharing is restricted due to privacy concerns.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Additional information missing from the method description should be provided:

    • In section 2.1, equation (1), certain terms were not defined.
    • In section 2.2, regarding the description of fracture types, clarification is needed on the reference for vertebral height loss (I suppose it is the loss of one of the estimated vertebral heights compared to any of the others).
    • In Table 2 of the supplementary materials, values of some hyperparameters were not reported (w_c, λ_IOU, λ_l1).

    Figure 3 requires clarification, particularly regarding the visibility of the green points and the contrast of the uncertainty area. Additionally, in Figure 3b, the nature of the points should be clarified in the caption.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The concept appears promising and valuable, particularly with the incorporation of explainable and trustworthy features. Only minor text modifications are needed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have answered most of the comments raised by the reviewers. My original evaluation was positive, and it is confirmed after rebuttal.




Author Feedback

We thank the reviewers for their invaluable help and input to improve the manuscript.

Contributions and novelty: As observed by reviewers #1 and #4, the key contributions of the paper are not the bounding box and keypoint detection methods themselves; we have clarified this further in the paper. Our contribution is their combination with uncertainty estimation and explainable (tractable) classification in a particularly noisy setting.

Neither the original residual log-likelihood paper nor any subsequent paper has used the uncertainty estimates in a substantial way beyond regularizing the keypoint detection. Novel to our model, and as shown, these estimates can be propagated to model classification uncertainty. Moreover, we show that this uncertainty is well calibrated to human experts.

The fuzzy classifier contributes over a hard-condition decision tree (direct computation), as shown in our ablations (rows 1 and 5), by yielding probability estimates that permit loss backpropagation, which informs the keypoint regressor of the classification criteria. We show that this increases the F1-score for the first classification case and improves the second overall. From an explainability perspective, it also enables us to separate the probability contributions of the fuzzy classifier and the image-feature classifier.
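
As a minimal sketch of the contrast described above, assuming PyTorch and illustrative threshold/temperature values (the rule and names are hypothetical, not the paper’s implementation): a hard rule gives no useful gradient, while its sigmoid relaxation yields a probability through which gradients reach the keypoint regressor.

```python
import torch

# A hard Genant rule is a step function with zero gradient almost everywhere;
# replacing the step with a sigmoid ("fuzzy" membership) yields a probability
# whose gradient can flow back to the keypoints. Values are illustrative.

def hard_rule(height_loss, thr=0.20):
    return (height_loss > thr).float()                  # no useful gradient

def fuzzy_rule(height_loss, thr=0.20, temperature=0.02):
    return torch.sigmoid((height_loss - thr) / temperature)  # soft membership

loss_ratio = torch.tensor(0.23, requires_grad=True)     # e.g. 23% height loss
p_fracture = fuzzy_rule(loss_ratio)
p_fracture.backward()
print(loss_ratio.grad)   # nonzero: the criterion informs upstream keypoints
```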

Interpretability vs. explainability: We thank reviewers #1 and #4 for their comments on explainability methods. We will attempt to address some misunderstandings regarding the difference between interpretability and explainability; note that we have deliberately avoided the former term in our paper.

According to e.g. the ISO/IEC TR 29119-11:2020 standard, interpretability is the level of understanding “how the underlying model works”, while explainability reflects “how the model came up with a given result”. Most work addresses post-hoc methods for interpretability, like Grad-CAM, which are not applicable to the explainability desired in this paper. We emphasize that these interpretability methods only rank feature contributions to predictions without providing tractable reasoning on their use in the model. Our model integrates the neural network feature extraction with decision tree transparency, aligning with clinical decision-making like GSQ.

Saliency maps: Visual feature-importance methods, such as the saliency maps proposed by reviewer #1, generally aim to provide visually interpretable insight into the model but cannot be translated into an explained mode of reasoning. Although such methods can help in understanding model behaviour, they do not provide the tools required of decision support systems, and we respectfully disagree that they could be used to accomplish the paper’s objective. However, we agree with reviewer #1 that the discussion should be updated to emphasize this, and that saliency maps would be an interesting addition.

Random forest comparison: While a single decision tree is explainable by construction, a random forest is an ensemble and requires feature-importance methods, which suffer from the same lack of model reasoning as saliency maps. We have clarified this in the paper.

Clinical translation: Concerning the clinical relevance of the interpretability, we thank reviewer #4 for the encouraging suggestions for future work and agree that this is an important direction to pursue, but we believe it is out of scope here and best suited for a paper on pure clinical translation.

Readability and clarity: We thank all reviewers for their comments on the readability and clarity of the figures and methods, which we have made sure to address.

Thanks to reviewer #4, we have also added that there were two clinical annotators with excellent inter-reader agreement, and we have added details on the hardware used. Moreover, we thank reviewer #4 for the important suggestion to improve the statistical analysis; while we provide error estimates, we agree that a significance test would further strengthen the paper, and it will be added if allowed by the guidelines.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors are encouraged to address the concerns of the reviewers especially the statistical significance values (R4) in the camera ready version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors are encouraged to address the concerns of the reviewers especially the statistical significance values (R4) in the camera ready version.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Rebuttal not satisfactory.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Rebuttal not satisfactory.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers have liked the paper, and one of those proposing rejection has changed his/her recommendation to accept after reading the rebuttal. Since the explainable AI area is important to highlight, the paper may be accepted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The reviewers have liked the paper, and one of those proposing rejection has changed his/her recommendation to accept after reading the rebuttal. Since the explainable AI area is important to highlight, the paper may be accepted.


