Abstract

Recent advances in AI and medical imaging offer transformative potential for emergency head CT interpretation, reducing assessment times and improving accuracy in the face of growing demand for such scans and a global shortage of radiologists. This study introduces a 3D foundation model for detecting diverse neuro-trauma findings with high accuracy and efficiency. Using large language models (LLMs) for automatic labeling, we generated comprehensive multi-label annotations for critical conditions. Our approach involved pretraining neural networks for hemorrhage subtype segmentation and brain anatomy parcellation, which were integrated into a pretrained comprehensive neuro-trauma detection network through multimodal fine-tuning. Performance evaluation against expert annotations and comparison with CT-CLIP demonstrated strong triage accuracy across major neuro-trauma findings, such as hemorrhage and midline shift, as well as less frequent critical conditions such as cerebral edema and arterial hyperdensity. The integration of neuro-specific features significantly enhanced diagnostic capabilities, achieving an average AUC of 0.861 for 16 neuro-trauma conditions. This work advances foundation models in medical imaging, serving as a benchmark for future AI-assisted neuro-trauma diagnostics in emergency radiology.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4374_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YooYou_ANoncontrast_MICCAI2025,
        author = { Yoo, Youngjin and Georgescu, Bogdan and Zhang, Yanbo and Grbic, Sasa and Liu, Han and Aldea, Gabriela D. and Re, Thomas J. and Das, Jyotipriya and Ullaskrishnan, Poikavila and Eibenberger, Eva and Chekkoury, Andrei and Bodanapally, Uttam K. and Nicolaou, Savvas and Sanelli, Pina C. and Schroeppel, Thomas J. and Lui, Yvonne W. and Gibson, Eli},
        title = { { A Non-contrast Head CT Foundation Model for Comprehensive Neuro-Trauma Triage } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a foundation model for head CT imaging, utilizing a private GPT-4 model to automate multi-label annotations from radiology reports, thereby reducing reliance on manual labeling. The framework integrates DeepCNTD-Net with task-specific modules, including a modified 3D Dense U-Net for hemorrhage subtype segmentation and a multi-output U-Net for brain anatomy feature extraction (brainAnatFeat), aiming to detect 16 neurotrauma pathologies in emergency settings. The authors emphasize innovations in combining LLM-based annotation with multi-modal fine-tuning to enhance detection performance.
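
    Neither this summary nor the paper excerpt above specifies the fusion mechanism in detail. As a point of reference only, a minimal late-fusion sketch is given below; the module name, feature dimensions, and the global-pooling assumption are all hypothetical, not taken from the paper.

        import torch
        import torch.nn as nn

        class FusionTriageHead(nn.Module):
            """Hypothetical late-fusion head: pooled features from two pretrained
            branches (hemorrhage segmentation, brain anatomy) are concatenated with
            the main image encoder's features and mapped to 16 multi-label logits.
            All dimensions are illustrative, not taken from the paper."""
            def __init__(self, img_dim=512, hem_dim=256, anat_dim=256, n_findings=16):
                super().__init__()
                self.classifier = nn.Sequential(
                    nn.Linear(img_dim + hem_dim + anat_dim, 256),
                    nn.ReLU(inplace=True),
                    nn.Dropout(0.2),
                    nn.Linear(256, n_findings),  # one logit per neuro-trauma finding
                )

            def forward(self, img_feat, hem_feat, anat_feat):
                fused = torch.cat([img_feat, hem_feat, anat_feat], dim=1)
                return self.classifier(fused)  # raw logits; sigmoid gives probabilities

        # Example: a batch of 2 studies with globally pooled features per branch.
        head = FusionTriageHead()
        logits = head(torch.randn(2, 512), torch.randn(2, 256), torch.randn(2, 256))
        probs = torch.sigmoid(logits)  # independent per-finding probabilities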

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has the following strengths. First, it leverages a private GPT-4 model to automatically extract POS/NEG labels for 16 neurotrauma concepts from radiology reports, significantly reducing manual annotation costs and constructing a large-scale training corpus of 29,395 non-contrast head CT (NCCT) studies. The method independently pretrains a 3D Dense U-Net for hemorrhage subtype segmentation and a multi-output U-Net for brain anatomy segmentation (brainAnatFeat), then fuses these pathological and anatomical features during multi-modal fine-tuning, an approach that is both novel and extensible. Moreover, the dataset spans nine centers across the U.S., Canada, China, and India, covering multiple scanner vendors and protocols, which demonstrates the model’s robustness and generalizability in real-world heterogeneous scenarios. Among these strengths, the use of GPT-4 for dataset generation is particularly compelling: medical imaging annotation traditionally depends heavily on expert judgment, especially for precisely identifying 16 neurotrauma findings. The GPT-4 model’s advanced medical language understanding enables it to emulate expert reasoning for POS/NEG determination, ensuring high annotation quality while alleviating the bottleneck of scarce radiologist resources. By bridging natural language processing with medical image analysis, this work not only crosses technical domains but also introduces an innovative paradigm for future image-report joint modeling.
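
    For readers unfamiliar with this labeling setup, a minimal sketch of report-level POS/NEG extraction follows. The prompt wording, the abbreviated finding list, and the query_llm client are all hypothetical stand-ins; the paper's actual prompt (referenced in its Sect. 2.1) is not reproduced here.

        import json

        # Hypothetical finding list; the paper's exact 16 concepts are not
        # enumerated in this review, so only a few are shown.
        FINDINGS = [
            "intraparenchymal hemorrhage", "subdural hemorrhage",
            "subarachnoid hemorrhage", "midline shift", "mass effect",
            "cerebral edema", "arterial hyperdensity", "ischemia/infarction",
        ]

        PROMPT_TEMPLATE = """For each finding below, answer POS if the radiology
        report asserts its presence and NEG otherwise. Return a JSON object
        mapping each finding to "POS" or "NEG".

        Findings: {findings}

        Report:
        {report}
        """

        def label_report(report_text: str, query_llm) -> dict:
            """query_llm is a placeholder for whatever chat-completion client is
            available; it takes a prompt string and returns the model's reply text."""
            prompt = PROMPT_TEMPLATE.format(findings=", ".join(FINDINGS),
                                            report=report_text)
            labels = json.loads(query_llm(prompt))  # validate/retry in practice
            return {f: labels.get(f, "NEG") for f in FINDINGS}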

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The methodology is poorly explained. Figure 1, which illustrates the workflow, lacks clarity in depicting how Hemorrhage Segmentation Features and brainAnatFeat are extracted, integrated, and utilized by DeepCNTD-Net. Key technical details, such as the architecture modifications to the 3D Dense U-Net or the fusion strategy for multi-modal features, are omitted. This ambiguity hinders reproducibility and raises questions about the validity of the proposed framework.
    2. Experimentally, the work is insufficiently validated. Despite using multi-center data, the authors do not address variations in CT scan parameters (e.g., slice thickness, contrast protocols), which may bias model performance. The LLM achieves only 79% accuracy in labeling ischemia/infarction (Table 1), yet no mitigation strategies (e.g., error correction, uncertainty quantification) are discussed; one possible consensus-labeling scheme is sketched after this list. External validation is limited to the CQ500 dataset and only three pathologies, neglecting rare but critical conditions like cerebral contusion or edema. Additionally, the paper fails to compare with state-of-the-art methods (e.g., FM-CT) on external datasets, and Table 3’s superior performance of CT-CLIP for certain pathologies lacks explanation.
    3. Clinically, the paper overlooks practical deployment requirements. While emphasizing rapid emergency diagnosis, it omits inference speed metrics and hardware specifications, making clinical feasibility claims unsubstantiated. There is no real-world validation (e.g., prospective trials, radiologist comparisons) to demonstrate utility in emergency workflows.
    4. To improve, the authors should: (1) disclose LLM prompt examples and strategies for handling ambiguous reports to enhance reproducibility; (2) expand external validation to cover all 16 pathologies and include benchmarks like FM-CT; (3) revise Figure 1 to clarify the workflow; (4) address annotation errors and CT parameter heterogeneity; (5) quantify runtime efficiency and clinical performance. Without addressing these gaps, the paper’s contributions remain speculative.
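
    As referenced in point 2 above, one hedged sketch of the kind of mitigation the reviewer asks for is majority-vote consensus labeling with an agreement threshold; it reuses the hypothetical label_report helper and FINDINGS list from the earlier sketch and is not the authors' method.

        from collections import Counter

        def consensus_label(report_text: str, query_llm, n_samples: int = 5,
                            min_agreement: float = 0.8) -> dict:
            """Query the labeler several times (e.g., with nonzero temperature),
            keep the majority vote per finding, and flag low-agreement findings
            for manual review instead of silently accepting them."""
            runs = [label_report(report_text, query_llm) for _ in range(n_samples)]
            out = {}
            for finding in FINDINGS:
                votes = Counter(run[finding] for run in runs)
                label, count = votes.most_common(1)[0]
                out[finding] = label if count / n_samples >= min_agreement else "UNCERTAIN"
            return out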
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the integration of automated labeling and task-specific pretraining is novel, the paper suffers from critical methodological and experimental shortcomings that significantly weaken its scientific rigor. Due to insufficient methodological clarity, limited validation, and misalignment with clinical priorities, I recommend Reject.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Clarification of Methodological Details: The authors have committed to adding detailed architectural descriptions in Figure 1, including network specifics, feature map sizes, and the workflow for pretraining and fine-tuning, which substantially improves reproducibility and clarity.

    CT Parameter Heterogeneity: The explicit clarification regarding the distribution of CT parameters (slice thickness, energy types, and preprocessing strategies) effectively addresses concerns about data variability and potential biases. The applied preprocessing and data augmentation strategies appear robust and suitable for reflecting real-world variability.

    Inference Speed and Clinical Feasibility: Providing concrete inference speed metrics (0.89 seconds per CT volume) and hardware requirements (NVIDIA A100 GPU, 40GB) clearly addresses practical deployment concerns, supporting the feasibility of clinical translation.

    Detailed Performance Metrics: The additional specificity, sensitivity, precision, and F1 scores provided across multiple conditions enhance the transparency of the model evaluation, aligning well with clinical priorities emphasizing high sensitivity.

    However, two critical issues require further attention and improvement. LLM Annotation Accuracy and Error Mitigation: while the authors acknowledged potential annotation inaccuracies caused by the LLM, especially concerning subtle findings such as infarcts, they only provided general conceptual strategies for improvement (e.g., tailored prompts, consensus labeling, uncertainty estimation). External Validation and Comparative Analysis: the external validation remains limited, primarily relying on the CQ500 dataset and not fully covering all 16 pathologies. Additionally, comparative analyses with state-of-the-art models (particularly FM-CT) remain insufficient. We strongly recommend conducting further external validations to cover a broader range of pathologies and providing direct, comprehensive comparisons with recent and relevant state-of-the-art methods to strengthen the generalizability claims.

    Given these observations, my final decision is reject.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a specialized 3D foundation model for comprehensive neuro-trauma triage using non-contrast head CT scans. By leveraging large language models (LLMs) to automatically generate multi-label annotations from radiology reports, the authors construct a large-scale, richly labeled dataset. The proposed model, DeepCNTD-Net, integrates pretrained task-specific networks for hemorrhage subtype segmentation and brain anatomy parcellation, and fuses their features via multimodal fine-tuning. This design enables accurate detection of 16 critical neuro-trauma conditions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes a multimodal formulation where a foundation model (DeepCNTD-Net) integrates features from two independently pretrained networks—one for hemorrhage subtype segmentation and another for brain anatomy parcellation. This approach allows the model to capture both pathological and anatomical priors, improving its diagnostic capacity.
    2. The model is trained and evaluated on a large, diverse dataset collected from nine international centers, and also tested externally on the CQ500 dataset. It demonstrates strong generalization across a wide range of critical neuro-trauma findings (average AUC: 0.861 for 16 conditions). The authors also conduct ablation studies to assess the contribution of each module, which strengthens the validity of their design choices.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While the authors introduce two segmentation networks (for hemorrhage subtype and brain anatomy), which are then used to extract features for the main classification model, the paper does not report their standalone performance (e.g., Dice score, IoU, or visual examples). Given that these modules significantly contribute to the final model’s accuracy, their efficacy remains unclear without quantitative or qualitative validation.
    2. The experimental comparisons are mostly restricted to CT-CLIP and FM-CT. The authors should also include comparisons with smaller, task-specific expert classification models. Although foundation models may not always outperform these specialized networks on narrow tasks, such comparisons are essential to contextualize the strengths and trade-offs of using a generalist approach.
    3. The paper does not provide any analysis of the inference time, model size, or hardware requirements of the proposed DeepCNTD-Net. Given its complex architecture and multimodal fusion, these factors are crucial in real-world emergency radiology workflows, where decisions must be made in seconds. The lack of such practical considerations limits the work’s immediate clinical impact.
    4. The paper primarily reports average AUC scores across six major and sixteen total neuro-trauma findings. While AUC provides a broad sense of model discrimination, it is not sufficient on its own—especially in a multi-label clinical setting where class imbalance is common and decision thresholds are critical. Other essential metrics such as precision, recall, F1-score, specificity, and calibration error are completely missing (a computation sketch follows this list). These are crucial for understanding model behavior in real-world settings where false positives may lead to unnecessary interventions and false negatives can delay critical care.
    5. The paper lacks interpretability tools or mechanisms, such as saliency maps or attention heatmaps, which would help radiologists trust and understand model decisions.
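
    As referenced in point 4, here is a self-contained sketch of how such per-class metrics and a simple expected calibration error could be computed for a multi-label model; the function name and binning choice are illustrative, not from the paper.

        import numpy as np
        from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

        def per_class_report(y_true, y_prob, threshold=0.5, n_bins=10):
            """y_true, y_prob: (n_samples, n_classes) arrays for a multi-label model.
            Returns per-class precision/recall/F1/specificity/AUC plus a simple
            equal-width-bin expected calibration error (ECE). Assumes each class
            has both positive and negative examples."""
            y_pred = (y_prob >= threshold).astype(int)
            report = []
            for c in range(y_true.shape[1]):
                t, p, s = y_true[:, c], y_pred[:, c], y_prob[:, c]
                prec, rec, f1, _ = precision_recall_fscore_support(
                    t, p, average="binary", zero_division=0)
                tn = np.sum((t == 0) & (p == 0))
                fp = np.sum((t == 0) & (p == 1))
                spec = tn / (tn + fp)
                # Per-bin |mean confidence - empirical positive rate|, weighted
                # by the fraction of samples falling in the bin.
                bins = np.clip((s * n_bins).astype(int), 0, n_bins - 1)
                ece = sum(np.abs(s[bins == b].mean() - t[bins == b].mean())
                          * np.mean(bins == b)
                          for b in range(n_bins) if np.any(bins == b))
                report.append(dict(precision=prec, recall=rec, f1=f1,
                                   specificity=spec, auc=roc_auc_score(t, s), ece=ece))
            return report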
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a well-structured foundation model for head CT triage and introduces several interesting components, including segmentation-guided feature integration and large-scale label generation using LLMs. However, the most critical concern lies in the evaluation of the model’s actual performance and clinical utility. The results rely almost entirely on average AUC, which, while useful for ranking models, does not reflect clinical decision quality—especially in a multi-label, imbalanced setting like neuro-trauma. Metrics such as sensitivity, specificity, precision, F1-score, and per-class performance are essential to understand how the model would behave in real-world triage scenarios. Additionally, the paper does not provide any threshold-dependent evaluation or calibration analysis, nor does it explore the consequences of false positives and false negatives. Given the high stakes in emergency radiology, this limits the trustworthiness of the reported findings. Unless the authors provide more granular, clinically grounded performance analysis, the practical value of the proposed system remains unclear. A strong rebuttal addressing these concerns could potentially change the recommendation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    While the authors have commendably addressed several points by providing additional data (segmentation performance, extended classification metrics, inference details, and comparisons to internal task-specific models), critical deficiencies remain that significantly impact the paper’s contribution and potential clinical applicability, making it unsuitable for acceptance in its current form.

    While the authors provided additional metrics, the reported precision is low (e.g., 0.225 for “all findings,” 0.449 for “six major findings”). Although high sensitivity is desirable for triage, such low precision implies a very high false positive rate. This could lead to significant “alarm fatigue” and potentially unnecessary follow-up investigations in a high-pressure emergency workflow, undermining the practical utility of the system. The authors’ brief mention of “application-specific operating points” does not sufficiently mitigate this concern within the current manuscript. Calibration error also remains unaddressed.

    Additionally, the segmentation performance for critical tasks like hemorrhage detection is only moderate (Dice 0.60-0.70 with high variance). While the authors claim these features are “meaningful,” the impact of this upstream variability and moderate accuracy on the robustness and reliability of the final multi-label classification is not convincingly established.

    The authors’ framing of their work as establishing “initial reference points” also suggests a preliminary stage of development. While valuable, the serious implications of very low precision, coupled with concerns about the reliability stemming from moderate performance in foundational segmentation tasks, mean the paper does not currently present a sufficiently robust or validated solution for the complex problem it tackles. Therefore, despite the authors’ efforts in the rebuttal, these remaining critical flaws necessitate a rejection.
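
    For context on the operating-point argument above, a hedged, single-finding sketch of how a threshold meeting a target sensitivity could be selected, and its resulting precision reported, is shown below; the function and the target value are illustrative, not from the paper or rebuttal.

        import numpy as np

        def operating_point(y_true, y_prob, target_sensitivity=0.9):
            """Pick the highest threshold whose sensitivity still meets the target,
            then report the precision/specificity the triage system would operate
            at. Single-finding version; assumes both classes are present."""
            pos, neg = (y_true == 1), (y_true == 0)
            for thr in np.unique(y_prob)[::-1]:  # scan thresholds high to low
                pred = y_prob >= thr
                sens = np.sum(pred & pos) / np.sum(pos)
                if sens >= target_sensitivity:
                    return dict(threshold=float(thr),
                                sensitivity=float(sens),
                                precision=float(np.sum(pred & pos) / np.sum(pred)),
                                specificity=float(np.sum(~pred & neg) / np.sum(neg)))
            return None  # target sensitivity unattainable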



Review #3

  • Please describe the contribution of the paper

    The paper introduces a 3D foundation model for detecting diverse neuro-trauma findings in non-contrast head CT scans, aimed at improving emergency triage. The study leverages large language models (LLMs) for automatic labeling, generating comprehensive multi-label annotations for critical conditions. The approach involves pretraining neural networks for hemorrhage subtype segmentation and brain anatomy parcellation, which are integrated into a comprehensive neuro-trauma detection network through multimodal fine-tuning. The model demonstrates strong triage accuracy across major neuro-trauma findings such as hemorrhage and midline shift, achieving an average AUC of 0.861 for 16 neuro-trauma conditions. This work advances foundation models in medical imaging, serving as a benchmark for future AI-assisted neuro-trauma diagnostics in emergency radiology.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel methodology: The integration of LLM-driven automated labeling with task-specific neural networks for hemorrhage segmentation and brain anatomy parcellation.

    Strong evaluation: The model achieves high accuracy across both common and less frequent critical conditions, enhancing performance through multimodal feature integration.

    Comprehensive dataset: The inclusion of data from multiple centers across different countries enhances the model’s generalizability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the paper provides a comprehensive overview of the model’s application, it could benefit from a more detailed explanation of how the foundation model was built, including the specific architectures and training processes used. The paper could also expand on potential limitations or challenges in applying the model across different types of CT scans or varying image quality.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper represents a significant advancement in the application of AI to neuro-trauma diagnostics, offering a robust framework for emergency triage. Future work could focus on real-world validation and seamless clinical integration to enhance patient care.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides a strong methodological contribution with the introduction of a 3D foundation model for neuro-trauma detection. The integration of LLM-driven labeling and multimodal feature fusion enhances the model’s diagnostic capabilities. Despite minor areas for improvement, such as providing more detailed explanations of the model’s construction, the paper’s contributions and findings warrant a high acceptance score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for constructive feedback. Our primary contribution is to establish initial reference points for applying foundation models to the comprehensive neuro-trauma domain. Below, we respond point-by-point to the major comments.

R1 and R3: Clarifications on model architecture and training. We will add architectural details to Fig. 1, including network specifics, feature map sizes, the pretraining and fine-tuning workflow, and how features are passed between networks.

R1 and R3: CT parameter heterogeneity limitations. Our dataset spans a wide range of CT acquisition parameters. The slice-thickness distribution is 6% (0–1mm), 2% (1–2mm), 14% (2–3mm), 78% (3–5mm), and <0.1% (>5mm). More than 15 energy types are represented, with ~25% being dual-energy acquisitions combining various spectra. Preprocessing standardizes matrix size, orientation, and field of view. We apply data augmentation to simulate variability in image quality. All scans are non-contrast CTs, ensuring modality consistency while reflecting real-world heterogeneity.

R2: Performance of pretrained segmentation networks. On 454 cases, hemorrhage segmentation achieved a Dice score of 0.60±0.26 and a volume error of 7.6±20.3 ml; for hemorrhages >5 ml (n=293), Dice improved to 0.70±0.20, with a volume error of 10.1±24.7 ml. Brain segmentation on 61 cases showed robust performance: left/right hemispheres Dice 0.94±0.10, IoU 0.90±0.10, ASSD 1.7±4.5 mm; infratentorial/supratentorial Dice 0.91±0.12, IoU 0.84±0.15, ASSD 2.8±6.5 mm; the other 11 anatomical regions Dice 0.72±0.19, IoU 0.59±0.19, ASSD 3.6±10.6 mm. These results indicate pathologically and anatomically meaningful features supporting classification.

R2: Comparison with task-specific expert models. Internally developed task-specific models achieved AUC 0.97 (sensitivity 0.92, specificity 0.90) for hemorrhage classification (n=3247) and AUC 0.95 (sensitivity 0.89, specificity 0.89) for midline shift detection (n=2545). While expert models generally perform better, the foundation model achieves comparable accuracy, especially for certain findings such as midline shift.

R2 and R3: Inference speed, model size, and deployment. On an NVIDIA A100 GPU (40GB), model inference takes ~0.89 seconds per CT volume. The model has 6,417,007 parameters across 751 layers and requires ~11.8 GB of GPU memory during inference.

R2: Limited evaluation metrics. Across the six major findings, average sensitivity was 0.815±0.105, specificity 0.809±0.070, precision 0.449±0.200, and F1 score 0.558±0.186. Across all findings, sensitivity was 0.827±0.101, specificity 0.800±0.117, precision 0.225±0.222, and F1 score 0.305±0.249. On CQ500 (three findings), sensitivity was 0.850±0.023, specificity 0.857±0.157, precision 0.710±0.226, and F1 score 0.758±0.132. These results will be added to the manuscript. In neuro-trauma triage, high sensitivity is prioritized to minimize missed critical findings. Our model preserves this while maintaining reasonable specificity. Lower precision, especially for rare findings, leads to more false positives and may require additional follow-up. Application-specific operating points can help balance precision and recall.

R3: LLM label noise and reproducibility concerns. We agree that LLM-generated annotations for subtle findings (e.g., infarcts) are affected by variable report language and terminology. To mitigate label noise, we empirically optimized a general-purpose prompt (shown in Sect. 2.1) to minimize labeling errors. Further improvements, such as clinically tailored prompts, LLM consensus labeling, and uncertainty estimation, can enhance consistency and reproducibility. We will add this discussion to Sect. 3.2.

R3: Limited external validation and comparison with SOTA. CQ500 provides labels for hemorrhage subtypes, midline shift, mass effect, and calvarial fractures, but lacks radiology reports. We evaluated all available conditions except calvarial fractures and reported comparative results with FM-CT (see p. 7).
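
To make the precision discussion above concrete, the following illustrative arithmetic shows how Bayes' rule links the rebuttal's aggregate sensitivity/specificity to low precision under class imbalance; the ~6.5% prevalence is an assumption for illustration, not a figure from the paper.

    def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
        """Positive predictive value (precision) via Bayes' rule:
        PPV = sens*prev / (sens*prev + (1-spec)*(1-prev))."""
        tp = sensitivity * prevalence
        fp = (1 - specificity) * (1 - prevalence)
        return tp / (tp + fp)

    # With the rebuttal's aggregate sensitivity/specificity (0.827 / 0.800), an
    # assumed per-finding prevalence of ~6.5% already pushes precision down to
    # roughly the reported 0.225:
    print(round(ppv(0.827, 0.800, 0.065), 3))  # prints 0.223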




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While reviewers appreciate the approach of employing LLM-based label generation to build a 3D CT foundation model, substantial concerns remain among the majority of reviewers (in particular with respect to results and evaluation), leading to a recommendation to reject this paper.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces a 3D foundation model of non-contrast head CT for neuro-trauma triage, with multi-label annotations generated by LLMs from clinical reports and task-specific networks for hemorrhage segmentation and brain anatomy parcellation. Although the effort in curating a large-scale multi-center dataset for building the foundation model is well appreciated, the limited validation on external datasets and LLM label noise still require further work to fully demonstrate the value of the resulting foundation model.


