Abstract

Accurate segmentation of lumbar vertebral endplates is essential for assessing bone density and biomechanical properties in spinal disorders. While quantitative computed tomography (QCT) provides detailed bone density measurements, existing segmentation approaches primarily focus on vertebral bodies and intervertebral discs, often neglecting the precise delineation of endplates. Current deep learning methods perform well in healthy spines but struggle with pathological cases due to the thin and morphologically complex nature of endplates, particularly in the presence of osteophytes and degenerative changes. To address these challenges, we introduce the first publicly available dataset, Endplate3D-QCT, which contains pixel-level annotations of lumbar endplates in clinical QCT scans. Our dataset includes high-precision 3D segmentation masks targeting cortical endplates and subchondral bone, along with an automated evaluation framework for model assessment. We benchmark multiple deep learning models, including EfficientUNet, UNet, VNet, UNETR and SwinUNETR, using nnUNet as the training framework. While these models achieve Dice scores around 0.9, they exhibit inconsistencies in endplate identification, leading to false positives and false negatives. These findings highlight the need for further advancements in endplate segmentation techniques. Our dataset and benchmarks provide a valuable foundation for improving spinal implant design, bone density mapping, and computational modeling of vertebral load distribution. The dataset and the evaluation code are available at https://github.com/yin876705249/Endplate3D-QCT.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4366_paper.pdf

SharedIt Link: https://rdcu.be/eHw50

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05141-7_15

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/yin876705249/Endplate3D-QCT

Link to the Dataset(s)

Endplate3D-QCT dataset: https://github.com/yin876705249/Endplate3D-QCT

BibTex

@InProceedings{YinZix_Endplate3DQCT_MICCAI2025,
        author = { Yin, Zixun AND Zou, Da AND Zhao, Yi AND Zhang, Chenbin AND Li, Weishi AND Wu, Minghui AND Yan, Kun AND Wang, Ping},
        title = { { Endplate3D-QCT: A High-Resolution Dataset and Benchmark for Automated 3D Segmentation of Lumbar Vertebral Endplates in QCT } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {147 -- 156}
}

Reviews

Review #1

Please describe the contribution of the paper

The primary contribution of the paper is the introduction of the Endplate3D-QCT dataset—a high-resolution, large-scale, and publicly available dataset for the automated 3D segmentation of lumbar vertebral endplates in quantitative CT scans. In addition, the authors propose a novel evaluation framework incorporating both volumetric and surface distance metrics, and they provide benchmarking results using established deep learning models (e.g., 2D/3D U-Net, VNet, UNETR, and Swin-UNETR) on both healthy and degenerative cohorts.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The dataset represents the first open-access collection with sub-millimeter precision annotations for lumbar vertebral endplates, including over 1,800 annotated surfaces, which fills a significant gap in the literature.

The proposed evaluation scheme is detailed and incorporates both detection metrics (precision, recall, F1) and segmentation metrics (Dice, Jaccard, HD95, ASD) for instance-specific performance analysis.

Extensive experiments comparing multiple well-known segmentation models are presented, providing valuable baseline performance data across different pathological conditions (healthy vs. degenerative cases).
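As an illustration of the detection and overlap metrics named in the strengths above, the following is a minimal sketch in plain Python. It is not the authors' evaluation code. It assumes each mask is represented as a set of voxel coordinates and that an endplate counts as detected when its label appears in the prediction; the function names and the label strings used in the usage note are hypothetical.

```python
def dice(pred, truth):
    """Dice coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def jaccard(pred, truth):
    """Jaccard index (intersection over union) between two voxel sets."""
    if not pred and not truth:
        return 1.0  # same convention as dice() for empty masks
    return len(pred & truth) / len(pred | truth)


def detection_scores(pred_labels, truth_labels):
    """Precision, recall, and F1 over sets of endplate labels.

    Assumption: a label present in both sets is a true positive; the
    paper's own matching rule may differ.
    """
    tp = len(pred_labels & truth_labels)
    fp = len(pred_labels - truth_labels)
    fn = len(truth_labels - pred_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, calling `detection_scores({"L1_inf", "L2_sup"}, {"L1_inf", "L2_sup", "L3_sup"})` would report full precision but reduced recall, which is the kind of missed-endplate error the reviews discuss for degenerative cases.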
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The manuscript does not sufficiently clarify why segmentation of lumbar vertebral endplates is of critical importance. Specifically, the rationale for focusing on these structures—beyond noting their relevance for assessing bone density and surgical planning—is not convincingly articulated [1].

Although the paper claims that precise endplate characterization can enhance studies on scoliosis and degenerative disc disease due to asymmetrical bone remodeling, it remains unclear whether the dataset actually includes scoliosis cases. There is also a lack of comparative analysis or discussion on performance differences for such specific pathologies [2].

The results are primarily presented via 2D slice illustrations. The absence of comprehensive 3D visualizations (e.g., volumetric renderings or multi-view reconstructions) detracts from the clarity of the segmentation results and limits the reader’s ability to assess the overall segmentation quality in a three-dimensional context.

While the annotation protocol involved multiple experts, the manuscript does not provide quantitative analysis (e.g., inter-rater Dice or kappa statistics) to assess annotation consistency [3].

[1] Mushtaq M, Akram M U, Alghamdi N S, et al. Localization and edge-based segmentation of lumbar spine vertebrae to identify the deformities using deep learning models. Sensors, 2022, 22(4): 1547.
[2] Deng Y, Wang C, Hui Y, et al. CTSpine1K: A large-scale dataset for spinal vertebrae segmentation in computed tomography. arXiv preprint arXiv:2105.14711, 2021.
[3] Yang F, Zamzmi G, Angara S, et al. Assessing inter-annotator agreement for medical image segmentation. IEEE Access, 2023, 11: 21300-21312.
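The inter-rater statistics requested in the last weakness (inter-rater Dice or kappa) could be computed along the following lines. This is a hedged sketch, not part of the paper or its repository: it assumes two annotators' binary masks have been flattened to equal-length lists of 0/1 values, and the function names are illustrative.

```python
def rater_dice(a, b):
    """Dice overlap between two binary masks given as flat 0/1 lists."""
    if len(a) != len(b):
        raise ValueError("masks must have the same number of voxels")
    total = sum(a) + sum(b)
    if total == 0:
        return 1.0  # assumption: two empty masks agree perfectly
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    return 2 * overlap / total


def cohen_kappa(a, b):
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("masks must be non-empty and of equal length")
    # Observed agreement: fraction of voxels labelled identically.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement from each rater's foreground rate.
    fa, fb = sum(a) / n, sum(b) / n
    p_e = fa * fb + (1 - fa) * (1 - fb)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (p_o - p_e) / (1 - p_e)
```

Because endplates occupy a small fraction of each volume, voxel-wise kappa is dominated by background agreement, so reporting it alongside Dice on the foreground gives a more balanced picture.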
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a valuable dataset and evaluation, and its approach to segmenting lumbar vertebral endplates shows promise with potential clinical relevance. However, the motivation behind this segmentation could be explained more clearly, particularly to emphasize its unique clinical benefits. Additionally, a simple quantitative analysis of inter-observer variability would help enhance the credibility of the annotations and overall impact of the work. Therefore, my recommendation is Weak Reject.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Reject
[Post rebuttal] Please justify your final decision from above.

While the authors have provided some clarifications in their rebuttal, I think the following issues remain insufficiently addressed: The annotation consistency of the dataset cannot be assessed, and the dataset is derived from a single institution, which may introduce sampling bias and limit its generalizability to different scanners, patient populations, or acquisition protocols. Additionally, the performance decline in diseased states has not been fully analyzed or explained.

Review #2

Please describe the contribution of the paper

The authors have created a new dataset that will be made open upon publication. The dataset, Endplate3D-QCT, consists of QCT images of the lumbar spine that have had the vertebral endplates manually segmented. The authors then train UNet model variants to automatically perform segmentation using the nnUNet framework to establish benchmark performance for future development of segmentation approaches.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- the authors have created a novel dataset for a novel clinical need. The authors rightly point out that there are no current datasets with endplate segmentations and that most if not all segmentation algorithms for the bony spine are focused on vertebrae and disc segmentation rather than on endplate segmentations. The rationale for the need is strong: being able to quantify endplate changes could be useful for understanding degenerative spine changes.
- the authors have had multiple clinical experts make and review the segmentations created. The segmentations are therefore likely high quality and a good foundation upon which to build new methods for analysing QCT scans of the lumbar spine. The segmentations as well as the clinical need identification are the main contribution of this work.
- The authors have considered many existing models for performing segmentation, providing excellent benchmarks. They compared many existing UNet-type models for performing the segmentation and evaluated the ability to detect and label the endplates correctly as well as segmentation accuracy.
- the authors have stratified performance of benchmark methods for segmenting endplates based on the amount of spine degeneration because of the anatomical changes that are inherent with the disease including osteophytes and changes in endplate thickness and BMD.
- the dataset will contribute to the availability of high quality spine focused medical imaging datasets, allowing greater advances in the development of advanced methods for image analysis including AI based methods.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Insufficient Characterization of Dataset Variability and Human Performance: A central claim of the paper is that the dataset can support the development of improved segmentation algorithms. However, this claim is undermined by the lack of quantitative characterization of human-level performance. Specifically, the authors do not report inter-observer or intra-observer variability. Reporting inter and intra observer performance would establish a target for algorithmic performance. Without this information, it is impossible to determine whether the benchmark models approach the upper bound of performance or whether substantial improvements are feasible. For instance, if the labeling has high variability in the vertebral endplate due to anatomical ambiguity or image quality, then algorithmic performance may already be approaching the practical limit. Conversely, low inter-observer variability would suggest that current algorithms are suboptimal. This distinction is essential, especially because the authors motivate the dataset by stating it will enable the development of improved segmentation methods. The motivation is not substantiated without clearly defining if superior performance is possible in the context of achievable accuracy.
2. Ambiguous Attribution of Error Sources in Diseased States: The paper reports that the benchmark segmentation models perform worse in cases with severe degenerative changes, but the analysis fails to disambiguate the source of this degradation. The reduced performance could stem from several non-exclusive sources:
   - Model limitations, as suggested by the authors
   - Insufficient image quality for segmentation of cases with severe degeneration (e.g., due to osteophytes, artifact, or endplate thickness)
   - Increased labeling difficulty leading to inconsistent or erroneous ground truth annotations
   The authors do not design experiments or provide analyses to distinguish between these possible causes. For example, a reader study or synthetic degradation experiments could provide insight. Without such analyses, the utility of the dataset in benchmarking truly “superior” algorithms is unclear, and the potential clinical impact is speculative.
3. Lack of Detail on Image Acquisition Parameters and Spatial Resolution: The segmentation task centers on the vertebral endplate, a thin anatomical structure whose visibility is highly dependent on acquisition parameters. However, the paper provides minimal detail about the imaging protocol. Important missing parameters include:
   - CT vendor and scanner model
   - Voxel size and slice thickness
   - Reconstruction kernel and algorithm
   - Acquisition mode (e.g., helical vs axial)
   - Any post-processing applied to the images
   The authors do not discuss or quantify the point spread function (PSF) or system resolution, which is important when dealing with thin anatomical structures like the endplate. Without clarity on whether the imaging protocol supports sufficient resolution to distinguish the endplate from adjacent bone and soft tissue, it remains uncertain whether the annotations and downstream models are limited by the imaging system itself. This omission is a significant limitation, especially if the authors intend the dataset to serve as a reference standard for model development.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The clinical need and dataset are novel and would be good contributions to the community. However, there are significant weaknesses in the characterization of the dataset that limit its potential use.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

[1] The primary contribution of this study lies in the introduction of Endplate3D-QCT, the first publicly available high-resolution dataset specifically constructed for the automated 3D segmentation of lumbar vertebral endplates. [2] The authors establish a comprehensive evaluation framework that includes voxel-based and instance-level metrics such as Dice similarity, detection F1-score, HD95, and ASD. [3] They present a benchmark analysis by comparing six state-of-the-art segmentation models—nnUNet, UNet (2D/3D), EfficientUNet, VNet, UNETR, and SwinUNETR—across three clinical conditions (Normal, LDD-Mild, and LDD-Severe).
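The surface distance metrics listed in this summary (HD95 and ASD) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the paper's evaluation framework: surface points are assumed to be already extracted and given as coordinate tuples in millimetres, HD95 is taken as the larger of the two directed 95th-percentile distances, and the percentile uses the nearest-rank rule. Other toolkits pool the two directions or interpolate the percentile, so values may differ slightly.

```python
import math


def directed_distances(src, dst):
    """Distance from each point in src to its nearest point in dst."""
    if not src or not dst:
        raise ValueError("both surfaces must contain at least one point")
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def hd95(surface_a, surface_b):
    """95th-percentile Hausdorff distance (max of both directions)."""
    forward = directed_distances(surface_a, surface_b)
    backward = directed_distances(surface_b, surface_a)
    return max(percentile_nearest_rank(forward, 95),
               percentile_nearest_rank(backward, 95))


def asd(surface_a, surface_b):
    """Average symmetric surface distance over both directions."""
    d = directed_distances(surface_a, surface_b)
    d += directed_distances(surface_b, surface_a)
    return sum(d) / len(d)
```

The brute-force nearest-point search is quadratic in the number of surface points, which is acceptable for a small example but would normally be replaced by a distance transform or a spatial index for full-resolution volumes.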
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

[1] clinical relevance and data diversity, particularly in representing the spectrum of lumbar degenerative disease (LDD) severity. Unlike prior datasets that predominantly consist of healthy spines, Endplate3D-QCT includes scans from patients with mild to severe degenerative changes, enabling robust evaluation of model generalizability across pathological conditions. [2] clinically meaningful, instance-level evaluation metrics, such as endplate detection accuracy and surface distance measures, that go beyond traditional voxel-level metrics and align better with surgical planning requirements. [3] the study offers clear guidance on model robustness and limitations. These strengths collectively enhance the dataset’s value not only for academic benchmarking but also for real-world clinical applicability, especially in preoperative planning of spinal interventions.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

[1] the dataset is derived from a single institution, which may introduce sampling bias and limit its generalizability to different scanners, patient populations, or acquisition protocols. [2] the performance of all evaluated models deteriorates markedly in LDD-Severe subsets, with detection recall dropping to as low as 0.55 for some models. This suggests that current segmentation architectures remain inadequate for handling complex degenerative morphologies, thus limiting immediate clinical deployment. [3] the study does not explicitly address metal artifact handling, which is a common challenge in spinal imaging, especially in post-operative patients. [4] no intraoperative or biomechanical validation is presented, leaving the real-world impact of their method largely inferential. These weaknesses highlight the need for future work focusing on cross-institutional validation, robust artifact handling, and clinical outcome correlation.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents a significant contribution to the field of spinal imaging and medical AI through the development of Endplate3D-QCT. A major strength of the study is its clinical relevance. Endplate segmentation plays a vital role in spine surgery planning, particularly in procedures involving disc replacement and vertebral augmentation. The authors further provide a comprehensive benchmarking framework, evaluating six state-of-the-art segmentation models under three levels of anatomical difficulty.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors further provide a comprehensive benchmarking framework, evaluating six state-of-the-art segmentation models under three levels of anatomical difficulty.

Author Feedback

We appreciate the reviewers’ valuable comments on our work.

Clinical Significance of Endplate Segmentation. Vertebral endplate segmentation enables non-invasive evaluation of endplate bone mineral density (BMD), which has distinct clinical value. First, because the endplate is the contact surface of the interbody fusion cage in surgery, endplate BMD (EBMD) predicted postoperative cage subsidence better than traditional BMD measurements [1], which would help to predict the risk of symptom recurrence and revision surgery preoperatively. Second, in the development of spinal scoliosis, bone sclerosis is usually more severe on the concave side due to asymmetric stress concentration [2]. Because osteophytes are more common in the endplate region, this asymmetric BMD distribution is expected to be more pronounced in the endplate, yet a suitable technique for investigating it is lacking. To address this issue, our dataset includes images of 10 cases of spinal scoliosis. In general, although existing studies indicate the broad potential of EBMD, current methods can only roughly measure EBMD because of the endplate's irregular disc shape. Therefore, our 3D endplate segmentation tool would be essential for further clinical practice and research. Additionally, because we focus on the image segmentation task, intraoperative and biomechanical analyses, although of practical significance, are not discussed further in this paper.

Annotation Consistency. We have implemented multiple measures to ensure annotation consistency and reduce dataset variability. All spine surgeons received uniform training and strictly followed the annotation manual to minimize subjective bias. Through multiple iterations using cross-annotation methods for labeling and feedback, the team conducted numerous discussions to rectify inconsistencies. Additionally, an experienced spine surgeon reviewed and corrected all annotations, further ensuring quality. Although quantitative metrics of human-level performance could not be provided due to rebuttal rules, these multiple safeguards effectively enhanced annotation consistency.

Performance Decline in LDD-Severe. The significant performance drop in LDD-Severe aligns with the highly complex and challenging pathological cases included in this subset. This is consistent with our objectives and expectations. As for the cause, we believe it is primarily that existing models cannot handle the irregularity of the endplates within this subset, rather than erroneous ground truth annotations. By including this subset, we aim to provide a solid foundation for subsequent model development targeting complex pathological situations.

Additional Details. The dataset is curated from two sources: the publicly available CTSpine1K and data collected by our team from a single medical institution. The latter’s single-source nature is a limitation that we acknowledge in the conclusion section. We recommend that future studies consider multi-institutional data to enhance the generalizability. For the data from this medical institution, the CT scanning parameters, including energy peak and spacings, have been presented in the manuscript, and the scanner is DEFINITION, Siemens. The post-processing of images primarily involves the ROI extraction. Although 3D visualization aids in comprehensively displaying the masks, considering space and clarity, we only present results in the sagittal plane, which are sufficient to showcase the research findings, ensuring that the illustrations are clear and easy to read.

[1] Okano I et al. Endplate volumetric bone mineral density measured by quantitative computed tomography as a novel predictive measure of severe cage subsidence after standalone lateral lumbar fusion. DOI: 10.1007/s00586-020-06348-0
[2] Wang H et al. Hounsfield Unit for Assessing Vertebral Bone Quality and Asymmetrical Vertebral Degeneration in Degenerative Lumbar Scoliosis. DOI: 10.1097/brs.0000000000003639

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Endplate3D-QCT: A High-Resolution Dataset and Benchmark for Automated 3D Segmentation of Lumbar Vertebral Endplates in QCT

Author(s):