Abstract

Abdominal trauma is one of the leading causes of death in the elderly population and increasingly poses a global challenge. However, interpreting CT scans for abdominal trauma is considerably challenging for deep learning models. Trauma may occur in various organs, presenting different shapes and morphologies. In addition, a thorough comprehension of visual cues and various types of trauma is essential, demanding a high level of domain expertise. To address these issues, this paper introduces a language-enhanced local-global aggregation network that aims to fully utilize both the global contextual information and the local organ-specific information inherent in images for accurate trauma detection. Furthermore, the network is enhanced by text embeddings from Large Language Models (LLMs). These LLM-based text embeddings carry substantial medical knowledge, enabling the model to capture anatomical relationships of intra-organ and intra-trauma connections. We conducted experiments on the public RSNA Abdominal Trauma Detection (ATD) dataset and one in-house dataset. Compared with existing state-of-the-art methods, the F1-score of organ-level trauma detection improves from 51.4% to 62.5% on the public dataset and from 61.9% to 65.2% on the private cohort, demonstrating the efficacy of our proposed approach for multi-organ trauma detection. Code is available at: https://github.com/med-air/TraumaDet
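
As a purely illustrative aid (not the authors' architecture), the sketch below shows one way such a pipeline could be organized in PyTorch: a global volume encoder, per-organ local encoders, a simple language-enhancement step using precomputed text embeddings, and per-organ trauma heads. All module names, shapes, and the concatenation-based fusion are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TraumaPipelineSketch(nn.Module):
    """Illustrative only: global context + per-organ local features,
    enhanced with precomputed text embeddings, then per-organ heads."""
    def __init__(self, num_organs=4, feat_dim=256, text_dim=512):
        super().__init__()
        def tiny_3d_encoder():
            # Stand-in for a 3D CNN backbone (the paper reports ResNet50-style encoders).
            return nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.global_encoder = tiny_3d_encoder()         # whole CT volume
        self.local_encoder = tiny_3d_encoder()          # organ-specific crops
        self.text_proj = nn.Linear(text_dim, feat_dim)  # project CLIP/LLM text embeddings
        self.head = nn.Linear(2 * feat_dim, 2)          # trauma vs. no trauma per organ

    def forward(self, volume, organ_crops, organ_text_embeds):
        # volume: (B, 1, D, H, W); organ_crops: (B, O, 1, d, h, w); organ_text_embeds: (O, text_dim)
        B, O = organ_crops.shape[:2]
        g = self.global_encoder(volume)                                    # (B, feat_dim)
        l = self.local_encoder(organ_crops.flatten(0, 1)).view(B, O, -1)   # (B, O, feat_dim)
        l = l + self.text_proj(organ_text_embeds).unsqueeze(0)             # language enhancement (illustrative)
        fused = torch.cat([g.unsqueeze(1).expand(-1, O, -1), l], dim=-1)   # naive global-local fusion
        return self.head(fused)                                            # (B, O, 2) per-organ logits

# Toy usage with random tensors
model = TraumaPipelineSketch()
logits = model(torch.randn(2, 1, 32, 64, 64),
               torch.randn(2, 4, 1, 16, 32, 32),
               torch.randn(4, 512))
```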

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1056_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/med-air/TraumaDet

Link to the Dataset(s)

https://www.kaggle.com/competitions/rsna-2023-abdominal-trauma-detection/data

BibTex

@InProceedings{Yu_LanguageEnhanced_MICCAI2024,
        author = { Yu, Jianxun and Hu, Qixin and Jiang, Meirui and Wang, Yaning and Wong, Chin Ting and Wang, Jing and Zhang, Huimao and Dou, Qi},
        title = { { Language-Enhanced Local-Global Aggregation Network for Multi-Organ Trauma Detection } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces a novel language-enhanced local-global aggregation network. The network integrates both global and local visual information and is enhanced by text embeddings from an LLM. The model utilizes a dual attention mechanism in which global and local features serve as keys and values for each other. Additionally, LLM text embeddings are leveraged to incorporate intrinsic anatomical cues into the visual representations.
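
A minimal PyTorch sketch of what such a dual (bidirectional) cross-attention could look like is given below, assuming standard multi-head attention; module names, dimensions, and the residual/normalization choices are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Hypothetical sketch: global and local features each serve as
    keys/values for the other via standard multi-head attention."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_g = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, global_feat, local_feat):
        # global_feat: (B, N_g, dim) tokens from the whole-volume encoder
        # local_feat:  (B, N_l, dim) tokens from organ-specific crops
        # Local tokens query the global context (global acts as K, V).
        l_enh, _ = self.global_to_local(query=local_feat, key=global_feat, value=global_feat)
        # Global tokens query the local details (local acts as K, V).
        g_enh, _ = self.local_to_global(query=global_feat, key=local_feat, value=local_feat)
        return self.norm_g(global_feat + g_enh), self.norm_l(local_feat + l_enh)

# Usage with toy shapes: batch of 2, 64 global tokens, 16 local tokens, dim 256
fusion = BidirectionalFusion(dim=256, num_heads=8)
g_out, l_out = fusion(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
```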

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The overall idea of fusing global and local features is well-motivated, considering the current challenges.
    2. The performance improvements of the proposed method are impressive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The structure of the paper needs to be improved. For instance, breaking down the paragraphs in the introduction would enhance readability. The writing should be improved as well; for instance, in the last paragraph of the introduction, “cans” should be corrected to “scans”.
    2. The visualization is not easy to follow. Specifically, in Fig. 1, the fusion module is claimed to consist of six multi-head self-attention layers, but the module symbol is not shown. This could lead to misunderstanding, as it may appear that the features only go through element-wise multiplication.
    3. The novelty of adding the category-wise prompts in the Language-Enhanced Module appears unclear. From my understanding, the prompt is generated according to the label (as mentioned in Section 2.3). The authors do not discuss how this is utilized during inference, leaving the purpose of these prompts ambiguous.
    4. While the experiments compare the proposed method to a few state-of-the-art models, they lack a comparison regarding the selection of the backbone for the image encoder/text encoder.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Improve the overall writing and paper structure.
    2. Improve Fig. 1 to highlight the module names introduced in each section.
    3. Please explain the inference settings, especially how the category-wise prompts are used during inference.
    4. The equation explanation has a problem: in Eq. (1), the dimension ‘d’ in cross-attention should represent the dimension of the key rather than ‘the dimension of Q, K, and V’ (assuming standard cross-attention; if not, please provide more details). See the standard formulation sketched after this list.
    5. It would be better to provide experiments on the backbone selection, e.g., compare with other LLMs such as BLIP-2 [1]. [1] Li, J., Li, D., Savarese, S. and Hoi, S., 2023, July. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730-19742). PMLR.
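
For reference, the standard scaled dot-product cross-attention that item 4 alludes to is usually written as follows, with d_k denoting the key dimension (this is the textbook formulation, not necessarily the paper's exact Eq. (1)):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q \in \mathbb{R}^{n_q \times d_k},\; K \in \mathbb{R}^{n_k \times d_k},\; V \in \mathbb{R}^{n_k \times d_v}.
```
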
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Unclear novelty; poor paper structure

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a model that integrates both local organ-specific features and global contextual information, enhanced by text embeddings from Large Language Models (LLMs) to capture anatomical relationships within and between organs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of LLM-generated text embeddings to enhance visual feature understanding is innovative and capitalizes on the rich medical knowledge embedded in language models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the model shows improvements on specific datasets, the paper lacks a discussion on its generalizability across different datasets or under varied clinical conditions.
    2. What are the differences between the LLM embeddings used here and existing medical LLM/VLM methods? What type of LLM do the authors use? Some methods, such as LLaVA-Med [1] and LViT [2], have introduced language information into medical image analysis; the authors should clarify the differences. [1] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., … & Gao, J. (2024). LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36. [2] Li, Z., Li, Y., Li, Q., Wang, P., Guo, D., Lu, L., … & Hong, Q. (2023). LViT: Language meets vision transformer in medical image segmentation. IEEE Transactions on Medical Imaging.

    3. Providing statistical analysis (e.g., confidence intervals or p-values) for the performance metrics could also strengthen the validity of the results.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weakness. I would like to see it accepted if my concerns can be resolved during the rebuttal period.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are certain innovative points in the structural design and prompt design, but there is a lack of discussion of related works.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks for the rebuttal.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel approach to multi-organ trauma detection. Specifically, it integrates global contextual information and local organ details for more accurate localization and diagnosis. Additionally, the paper employs a Large Language Model (LLM) to encode the organ and category, enhancing the capture of anatomical relationships between intra-organ and intra-trauma connections.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. As discussed in the contribution section, the proposed method is both logical and innovative.
    2. The experimental results are impressive. The method has been validated on two datasets and compared with existing state-of-the-art methods, effectively addressing the problem.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The methodology is based on 2D scans, which may restrict its practical applications and value.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    n/a

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. See weakness 1.
    2. The text information used in the article is relatively simple compared to the more complex text seen in computer vision applications like CLIP; this paper’s use of high-dimensional embeddings to encode such straightforward information might therefore be somewhat redundant and is worth further discussion.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, from the perspectives of methodology and innovation, I believe the paper is worthy of acceptance. However, there are limitations regarding its applicability that should be further discussed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The author’s reply has resolved my concerns. As stated in my recommendation, I am willing to raise the score and recommend the acceptance of the article.




Author Feedback

We thank the AC and reviewers for their time. Most reviewers are positive and supportive, highlighting that our idea is “well-motivated”, our work is “logical and innovative”, the experiments “effectively address the problem”, and the “performance improvements are impressive”.
To R4: Regarding R4’s concern about the model’s generalizability, our model fuses global contextual and local organ features for trauma detection. Because the design models the relations between global and local features, it can potentially generalize to similar clinical scenarios in which both global and local contexts are important, such as brain age estimation (He et al., IEEE TMI 2021) and reconstruction (Huang et al., ICCV 2021). We will extend our work to more organ types in future work.

R4 asks for a discussion of existing works such as LLaVA-Med and LViT. Our work uses ViT-B-32 with CLIP for text embedding and GPT-3.5-turbo for trauma descriptions. Compared with LLaVA-Med and LViT, our work focuses on 3D medical volumes, whereas they both focus on 2D image-text tasks. Furthermore, LLaVA-Med is instruction-tuned from LLaVA, while our work adopts the idea of text embedding. In particular, compared with LViT, which also uses text embedding, we fuse text embeddings at both local and global scales, which better leverages medical information. In addition, we use the CLIP text encoder to align text and image features more efficiently.

R4 requests statistical analysis. We have provided the mean and standard deviation of the results. Due to space constraints, we did not include p-values in our manuscript. Here, we list the p-values when comparing with the top-2 best baselines (Huang et al. and CBAM) on case and organ accuracy on the private and public datasets, which are 2e-4, 5e-3, and 1e-6, 5e-4, indicating significance. We will include the full results in the final version.

To R6: We thank R6 for the suggestions on writing and figures. We have followed the suggestions to correct the typo and further improve the organization and figure caption. We will further simplify sentences and include these revisions in our final version.

R6 requests clarification of the model inference. The Language-Enhanced Module uses organ-wise and category-wise prompts. During inference, only organ-wise prompts are used, as they are based on organ names, not labels. Category-wise prompts, generated from labels, are only used during training to compute the KL loss and guide training.

Regarding the clarification of Eq. (1): we apologize for the oversight. ‘d’ is the dimension of the key, not of ‘Q, K, and V’. We will correct this in the final version.

R6 requests a discussion of the backbone selection. For the vision encoder, we tested ResNet18, DenseNet121, and DenseNet169, which had lower case accuracy (by 5%, 2%, and 2%) than our chosen ResNet50. We did not present these results due to space constraints. For the text encoder, we selected ViT-B-32 with CLIP for its proven effectiveness and wide applicability (Liu et al., ICCV 2023; Li et al., ICLR 2023). For comparison with other models (e.g., BLIP-2), please note that our work focuses on 3D medical volumes, whereas BLIP-2 focuses on 2D image captioning, which is hard to generalize to 3D data. In this regard, we propose to generate text from labels using GPT-3.5-turbo, which can produce clinical descriptions based on labels, without vision inputs.

To R8: Regarding R8’s question on the input shape, we apologize for any confusion. Our methodology is based on 3D scans, not 2D scans. All vision encoders are 3D models, making our approach suitable for practical 3D applications. R8 asks for a discussion of the information embedding. We clarify that straightforward text has intrinsic semantic relationships, which prior works (Liu et al., ICCV 2023; Li et al., ICLR 2023) have shown to be effective in segmentation tasks. In addition, we fuse organ information with vision features and use trauma semantics to guide training, which our ablation study shows to be more effective.
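
To make the train/inference asymmetry described above concrete, the following is a hypothetical sketch, assuming the Hugging Face openai/clip-vit-base-patch32 checkpoint as a stand-in for the ViT-B-32 CLIP text encoder; the function names, prompt wording, and the exact form of the KL term are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# ViT-B-32 CLIP text encoder (stand-in for the one mentioned in the rebuttal).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def embed_prompts(prompts):
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).text_embeds  # (num_prompts, 512)

# Organ-wise prompts: label-free, so usable at BOTH training and inference.
organ_prompts = ["a CT scan of the liver", "a CT scan of the spleen",
                 "a CT scan of the kidney", "a CT scan of the bowel"]
organ_embeds = embed_prompts(organ_prompts)  # fused with visual features

# Category-wise prompts: generated from ground-truth labels (e.g., LLM descriptions
# of each injury category), so they are only available at TRAINING time.
def category_kl_loss(visual_logits, category_prompt_embeds, organ_features, temperature=0.07):
    """Hypothetical auxiliary loss: align the model's per-organ prediction
    distribution with similarities between (text-dimensional) organ features
    and the label-derived category text embeddings.
    visual_logits: (B, C); category_prompt_embeds: (C, 512); organ_features: (B, 512)."""
    sim = organ_features @ category_prompt_embeds.t() / temperature
    target = F.softmax(sim, dim=-1)
    log_pred = F.log_softmax(visual_logits, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```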




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mostly positive comments (A, WA, and WR), acknowledging the novelty of the proposed method and performance improvement.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The paper received mostly positive comments (A, WA, and WR), acknowledging the novelty of the proposed method and performance improvement.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    One reviewer raised their score and the others remained the same; the paper is above the acceptance bar, and positive comments are visible.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    One reviewer raised their score and the others remained the same; the paper is above the acceptance bar, and positive comments are visible.


