Abstract
Thyroid nodules are among the most prevalent endocrine disorders, with incidence rates increasing in recent years. Ultrasonography remains the primary method for thyroid nodule diagnosis due to its non-invasive nature and cost-effectiveness; however, the process is subjective and skill-intensive. To assist radiologists, Computer-Aided Diagnosis (CAD) systems have been developed to provide a second opinion. Despite these advancements, the absence of publicly available medical datasets has resulted in inconsistent validation methods, hindering comparability across studies. This paper introduces ThyroidXL, an open benchmark dataset for thyroid nodule classification, segmentation, and detection. With over 11,000 images from more than 4,000 patients, the dataset—collected and annotated by expert radiologists at the Vietnam National Hospital of Endocrinology—stands as the largest publicly available resource for thyroid nodule diagnosis in terms of both patient count and image volume. Additionally, we provide multiple deep-learning baseline models on three key tasks: malignancy classification, thyroid nodule detection, and segmentation. The proposed dataset and benchmark can serve as a foundational resource for advancing CAD system development, fostering reproducible research, and accelerating progress in thyroid nodule diagnosis. Our dataset can be accessed at: https://huggingface.co/datasets/hunglc007/ThyroidXL
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2024_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
https://huggingface.co/datasets/hunglc007/ThyroidXL
BibTex
@InProceedings{DuoVie_ThyroidXL_MICCAI2025,
  author    = {Duong, Viet Hung and Vu, Huan and Phan, Huong Duong and Nguyen, Duc Quyen and Pham, Duc Hao and Le, Quang Toan and Nguyen, Ba Sy and Do, Tien Dung and Dinh, Viet Sang and Nguyen, Tien Cuong and Pham, Huy Hoang and Ngo, Dien Hy},
  title     = {{ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset}},
  booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  volume    = {LNCS 15974},
  month     = {September},
  pages     = {622--632}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a new dataset named ThyroidXL, comprising 11,545 ultrasound images from 4,093 patients, with samples from both benign and malignant classes. Further, they conduct a comparative study using various existing classification models (AlexNet, ResNet-34, ResNet-50, EfficientNet-B6, EfficientNet-B7, VGG-11, VGG-13) and existing object detection models (EfficientNet-B3, Faster R-CNN, YOLOX-S, YOLOX-M, Deformable DETR, CO-DETR R50, and CO-DETR Swin-L), and report the performance of these models on the respective tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The main contribution of this paper is the introduction of a new dataset for thyroid cancer. The three earlier known thyroid datasets are comparatively small: DDTI has 347 images, TN3K has 3,493 images, TG3K has 3,585 images, and TN-SCUI has 3,644 images. The dataset proposed in this paper, with a total of 11,545 images, has far more samples than the existing ones.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The classification and object detection models used in this study are well-known existing models. The models used for thyroid nodule segmentation, namely U-Net and UNet++ with an EfficientNet-B5 backbone, are also existing models. There are no qualitative results for visual comparison to show how different models perform on the segmentation task. There should be an elaborate study of the state-of-the-art results achieved so far on both the segmentation and classification tasks, as there are many works on the existing datasets. Moreover, the authors claim they are introducing a public dataset, so they should include some details on how to access their data.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The work is divided into two parts: thyroid nodule detection and classification. For thyroid nodule detection, various existing classification and object detection models were studied and comparative results were included. For thyroid nodule segmentation, two existing models, namely U-Net and U-Net++, were studied. All the models used in this paper already exist in the literature, and there is no novelty in the methodology part.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Accepting the paper because ThyroidXL has a large number of ultrasound samples and will be very useful for research in this field.
Review #2
- Please describe the contribution of the paper
This paper presents a novel dataset for thyroid nodule classification, segmentation, and detection, called ThyroidXL. The dataset comprises 11,000 images from more than 4,000 patients. All images are annotated by expert radiologists, and the samples are biopsy-confirmed. Furthermore, the dataset includes metadata related to patient demographics, and the authors provide initial results to serve as a benchmark for future research.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The dataset is well-documented, including the processing steps and label acquisition. It is large and can support the development of advanced methods for nodule classification. In terms of the number of images and patients, it is among the largest available.
- Cytological and pathological results, along with patient demographic data, are provided to complete the research records. This comprehensive documentation can have a significant impact on studies focused on fairness and translational research.
- The experimental section includes baselines for a wide range of tasks: classification, detection, and segmentation. All of these tasks are important for diagnosis and prognosis.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors state the following: “Furthermore, several studies also utilized thyroid ultrasound datasets, though they remain inaccessible to the broader research community [5,8,14,15]. The lack of accessibility to these datasets restricts independent validation and benchmarking, underscoring the need for large-scale thyroid ultrasound datasets.” However, some important citations are missing from their discussion, such as:
“An ultrasonography of thyroid nodules dataset with pathological diagnosis annotation for deep learning” — which includes 8,508 images from 842 patients.
“Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules.”
Additionally, the authors do not provide statistical results for the baseline models. As a result, future comparisons and benchmarking efforts may be incomplete or unreliable, especially considering the stochastic nature of many machine learning methods.
A main concern regarding validation is the lack of an external validation set — for instance, one based on a different public dataset. The authors acknowledge limitations such as the malignant-benign ratio and potential generalization issues to different equipment or clinical settings, stating: “…the malignant-benign ratio may influence model performance. Furthermore, the models trained on our dataset may not generalize well to images acquired using different equipment or clinical settings. Hence, future work should explore data augmentation, resampling, and domain adaptation techniques to address these limitations.” Nevertheless, I consider external validation an essential step: it would offer a more accurate assessment of real-world performance and strengthen the significance of the dataset.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1) Regarding the dataset size, I suspect there is a mismatch in the numbers. The paper says: “Our dataset consists of 11,545 images from 4,093 patients, split 80-20 into training and validation sets. The training set includes 9,541 images from 3,275 patients, while the validation set has 2,094 images from 739 patients.” However, from the table: 3275+739=4014. I encourage the authors to revise the final numbers and ensure that everything matches.
2) A point that needs clarification is why the classification results seem better in the detection and segmentation tasks, especially regarding specificity. Are background and benign pixels grouped into the same class? If so, this could significantly affect performance metrics.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The dataset is well-documented and comprises a large number of samples. A strong point is that it includes demographic information. However, the empirical validation could be improved by incorporating statistical analyses and external validation. While this is not critical for the paper, I believe the scientific community will still benefit from both the paper and the dataset. That said, I believe the paper would be in a better position for acceptance if the authors address the two points I have outlined in the comments.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I believe this paper is suitable for publication. I did not have major concerns in my initial review, and the matter regarding the metrics across the different tasks has been correctly addressed by the authors. However, I would like to highlight that the main limitation of the paper is the absence of external validation.
That said, the dataset is well-described, the chosen baselines are appropriate, and the authors report results on multiple tasks. Overall, this makes it a solid dataset paper for MICCAI.
Review #3
- Please describe the contribution of the paper
The paper presents a comprehensive thyroid ultrasound dataset that is significantly larger than existing public datasets both in terms of patient count and image volume. The images were collected by experienced physicians with at least 5 years of experience in thyroid ultrasound, and all diagnoses were confirmed through cytological and pathological examinations. The authors have also benchmarked various deep learning models for three key tasks: malignancy classification, thyroid nodule detection, and segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper addresses a critical gap in medical image analysis by providing a large, high-quality, and publicly available dataset for thyroid nodule diagnosis. The dataset is well-characterized with detailed patient demographics and clinical metadata, making it valuable for developing robust CAD systems. The authors have taken care to ensure the dataset’s integrity by applying strict inclusion criteria and employing experienced radiologists for annotation. Unlike existing datasets such as DDTI and TN3K, ThyroidXL does not contain handwritten diameter indicators that could obscure critical anatomical structures or introduce bias. The comprehensive benchmarking of state-of-the-art deep learning models across three different tasks provides valuable baselines for future research.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
All images were acquired using the same ultrasound equipment (Hitachi Aloka Arietta V70), which may limit the generalizability of models trained on this dataset to images from different machines or clinical settings. A significant omission is the lack of any measure of expert annotation error or inter-observer variability among the radiologists who annotated the dataset. The paper does not specify how many radiologists were involved in the annotation process, whether multiple experts annotated the same images to establish consensus, or how disagreements were resolved. Given the subjective nature of ultrasound image interpretation, metrics quantifying annotation consistency (such as Cohen’s kappa or Dice similarity coefficients between annotators) would provide crucial context for interpreting model performance and establish a human baseline for comparison.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces ThyroidXL, a large-scale dataset for thyroid nodule diagnosis that represents a significant contribution to the field of medical image analysis. The dataset, with 11,545 ultrasound images from 4,093 patients, far exceeds existing public datasets in both size and scope, addressing a critical need in the research community. The strengths of this work include the comprehensive benchmarking of various deep learning models for classification, detection, and segmentation tasks, and the consideration of clinically relevant metrics such as sensitivity and specificity at both image and patient levels. The authors have taken care to avoid issues present in existing datasets, such as handwritten markings that could introduce bias. However, several limitations prevent me from giving a stronger recommendation. Most critically, the paper lacks any measure of expert annotation error or inter-observer variability, making it impossible to establish a human performance baseline or understand the reliability of the ground truth labels. Additional concerns include the dataset imbalance between training and test sets, the use of a single ultrasound equipment type limiting generalizability, and insufficient discussion of techniques to address these limitations.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors clarified my concerns, and I think the paper would be a valuable contribution.
Author Feedback
Overview: We thank the reviewers for their constructive feedback. We are pleased that they recognize the value of our dataset, which surpasses existing ones in both scale and quality (R1, R2, R3), and appreciate our rigorous preprocessing and sample selection methods (R1, R2). We also thank R1 and R2 for highlighting the benefits of our rich demographic and clinical metadata, and for acknowledging our benchmark results as a foundation for future research.
@R3 - Novelty concern: This work presents ThyroidXL, a high-quality dataset for thyroid nodule diagnosis that fills a critical gap in the field. Its novelty lies in its scale, annotation quality, and multi-task labels. We also provide strong baselines using established models, following MICCAI practice for dataset contributions, leaving methodological novelty for future work.
@R2 - Inter-annotator variability: For malignancy classification, labels were directly derived from cytological and pathological findings, eliminating inter-annotator variability. Three board-certified radiologists annotated the images for segmentation. Each image was independently labeled by two annotators (≥5 years of experience), with ~8% of cases resolved by a third senior radiologist (≥30 years of experience) where conflicts were present. This information will be added to Sec. 3.1.
@R1 - External validation: We acknowledge the importance of external validation. Since TN3K and TG3K only have segmentation masks, we attempted to use DDTI, but found the official website of Cim@Lab unreachable. To avoid reporting potentially unreliable results from unofficial sources, we omitted them. We will pursue external validation in future work, as noted in Sec. 5.
@R2 - Single equipment type limits generalizability: We acknowledge this important limitation. All images in the ThyroidXL dataset were collected using Hitachi Aloka Arietta V70 ultrasound systems to ensure consistency. Methods to overcome this limitation have been discussed in Sec. 5.
@R1 - Justification for superior classification results in detection and segmentation tasks: The gap in classification results stems from differences in input size and metrics. Detection and segmentation use higher resolutions (640×640 and 384×480) than classification (224×224), a factor known to enhance accuracy in ultrasound imaging [1]. Moreover, detection and segmentation are evaluated using Average Precision, favoring low false-positive rates and thus high specificity. Meanwhile, classification uses F1, which favors a balance between sensitivity and specificity (Tab. 2, 3, 4).
@R1 - Patient count mismatch: The final number of patients is 4093 (3354 for training, 739 for testing).
@R1, R3 - Additional information: Regarding [2] (R1), although claimed to be public, access details were missing, so we were unsure whether to cite the dataset as public or private. We have added [2] along with two TN3K references to Sec. 2.1. Due to space limits, qualitative segmentation results (R3) were omitted initially. We included a figure showcasing four representative malignant/benign cases in the final version. As for statistical results (R1), we followed the practice in [3] (MICCAI 2022) for classification and [4] (MICCAI 2021) for detection; both papers omitted confidence intervals. To improve reproducibility, we fixed random seeds and enabled PyTorch’s deterministic setting.
@R1, R2, R3 - Code/Data Release: We commit to publicly releasing code, checkpoints, pre- and post-processing pipelines, and the ThyroidXL dataset for community use.
References
[1] Tang et al., The effect of image resolution on convolutional neural networks in breast ultrasound.
[2] Hou et al., An ultrasonography of thyroid nodules dataset with pathological diagnosis annotation for deep learning.
[3] Huang et al., Personalized Diagnostic Tool for Thyroid Cancer Classification using Multi-view Ultrasound.
[4] Shahroudnejad et al., TUN-Det: A Novel Network for Thyroid Ultrasound Nodule Detection.
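The rebuttal's point about metric choice (Average Precision rewarding low false-positive rates vs. F1 balancing sensitivity and specificity) can be made concrete with a short sketch. The confusion-matrix counts below are purely illustrative and do not come from the paper; they only show how sensitivity, specificity, and F1 are derived from the same predictions.

```python
# Hypothetical confusion-matrix counts for a binary malignancy classifier.
# These numbers are invented for illustration, not taken from ThyroidXL results.
tp, fn = 80, 20   # malignant cases: correctly / incorrectly predicted
tn, fp = 150, 50  # benign cases: correctly / incorrectly predicted

sensitivity = tp / (tp + fn)  # recall on the malignant class
specificity = tn / (tn + fp)  # recall on the benign class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

Note how F1 ignores true negatives entirely, so a model can score well on specificity while its F1 remains modest; this is one reason metrics across the three tasks are not directly comparable.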
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Vox Populi: All three reviewers voted in favor of this paper, including one who switched to “for” post-rebuttal. All reviewers were engaged post-rebuttal.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A