Abstract

Extracting structured labels from radiology reports has been employed to create vision models that detect several types of abnormalities simultaneously. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports due to the more complex anatomy and a wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally run LLM using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that the extraction of urgency labels achieves performance comparable to that of human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a publicly available dataset containing over 1,000 CT volumes.
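
For concreteness, a minimal sketch of the structured output the abstract describes: one certainty-of-presence label and one urgency label per (organ, finding type) pair. The finding types shown are the four named in the reviews below; the label scales and organ names are illustrative placeholders, not the exact schema from the paper.

    from dataclasses import dataclass

    @dataclass
    class FindingLabel:
        certainty: str  # placeholder scale, e.g. "present" / "uncertain" / "absent"
        urgency: int    # placeholder scale, e.g. 0 (no action needed) .. 3 (urgent)

    # organ -> finding type -> label; the paper covers nine organs and seven
    # finding types (four of them, per the reviews below: absent, device,
    # postsurgical, enlarged)
    ReportLabels = dict[str, dict[str, FindingLabel]]

    example: ReportLabels = {
        "liver": {
            "enlarged": FindingLabel(certainty="present", urgency=1),
            "postsurgical": FindingLabel(certainty="absent", urgency=0),
        },
    }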

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1470_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/rsummers11/LEAVS

Link to the Dataset(s)

LLM annotations for reports from AMOS-MM: https://github.com/rsummers11/LEAVS

Human annotations for evaluations: https://github.com/rsummers11/LEAVS


BibTex

@InProceedings{BigRic_LEAVS_MICCAI2025,
        author = { Bigolin Lanfredi, Ricardo and Zhuang, Yan and Finkelstein, Mark and Thoppey Srinivasan Balamuralikrishna, Praveen and Krembs, Luke and Khoury, Brandon and Reddy, Arthi and Mukherjee, Pritam and Rofsky, Neil M. and Summers, Ronald M.},
        title = { { LEAVS: An LLM-based Labeler for Abdominal CT Supervision } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {316 -- 326}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents LEAVS, a zero-shot large language model (LLM)-based system designed to extract structured abnormality and urgency labels from abdominal CT radiology reports. It introduces a four-stage prompting system—sentence filtration, finding type assessment, uncertainty categorization, and urgency scoring—and evaluates its performance against human annotations and rule-based methods. The authors further use the extracted labels to supervise a vision classifier, aiming to build a general abnormality detection system for abdominal CT.
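
    A minimal control-flow sketch of that four-stage system, under the assumption that each stage is a multiple-choice LLM query; the helper names, prompt wording, and label options below are hypothetical illustrations, not code from the LEAVS repository.

        from typing import Callable

        Ask = Callable[[str], str]  # prompt in, model's chosen option out

        def ask_choice(ask: Ask, context: str, question: str, choices: list[str]) -> str:
            # One multiple-choice question grounded in the filtered sentences.
            prompt = (f"Report excerpt: {context}\n{question}\n"
                      f"Options: {', '.join(choices)}\nAnswer with exactly one option.")
            answer = ask(prompt).strip()
            return answer if answer in choices else choices[-1]  # guard against invalid output

        def label_finding(report: str, organ: str, finding_type: str, ask: Ask) -> dict:
            # Stage 1: sentence filtration -- keep only sentences relevant to the organ.
            sentences = [s.strip() for s in report.split(".") if s.strip()]
            relevant = [s for s in sentences
                        if ask_choice(ask, s, f"Does this sentence concern the {organ}?",
                                      ["yes", "no"]) == "yes"]
            if not relevant:
                return {"certainty": "absent", "urgency": None}  # default label
            context = ". ".join(relevant)

            # Stage 2: finding type assessment.
            if ask_choice(ask, context,
                          f"Is a '{finding_type}' finding described for the {organ}?",
                          ["yes", "no"]) == "no":
                return {"certainty": "absent", "urgency": None}

            # Stage 3: uncertainty categorization (placeholder certainty scale).
            certainty = ask_choice(ask, context,
                                   "How certain is the report about this finding?",
                                   ["definitely present", "probably present", "uncertain"])

            # Stage 4: urgency scoring (placeholder urgency scale).
            urgency = ask_choice(ask, context, "How urgent is this finding?",
                                 ["not urgent", "needs follow-up", "urgent"])
            return {"certainty": certainty, "urgency": urgency}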

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Tackles an underexplored domain—structured labeling for abdominal CT reports—which is less studied compared to chest imaging.
    • Proposes a modular, interpretable, and adaptable four-stage prompt system for label extraction using LLMs.
    • Demonstrates performance comparable to or exceeding human annotators in some metrics.
    • Includes comparisons with both rule-based (SARLE) and LLM-based (MAPLEZ) baselines.
    • Offers open-source code and annotations (anonymized link given in the paper), supporting reproducibility.
    • Evaluation includes a vision classifier to show downstream utility of the extracted labels.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Language and clarity are poor, making it difficult to follow the methodology or understand the pipeline, especially in the Methods and Results sections.
    • Several grammatical, typographical, and formatting issues persist throughout the manuscript and references.
    • Incorrect and inconsistent terminology, such as “SAMUAES” instead of “UAE-S”, “Total segmentator” instead of “TotalSegmentator”, and “AMOS-MM” instead of “AMOS”.
    • The vision classifier model is insufficiently described: the architecture, flow, and integration with the labeler are not explained at all.
    • Terms like IRB and Chain-of-Thought (CoT) prompting are not defined.
    • Justification for model choices (e.g., why Qwen2-72B is considered the best) is missing.
    • Unclear annotation process: annotator qualifications are unspecified, labeling range is vague (“100 to 150 reports”), and manual labeling methods are not explained.
    • Baseline comparison with MAPLEZ is inappropriate, as it is designed for chest X-rays, not abdominal CT.
    • Mismatch between described and reported organs in the evaluation section (e.g., Table 2 lists six organs, though seven are mentioned in the fifth paragraph of the Results section).
    • Unclear impact of the four-stage labeling strategy on improving the vision model—this critical link is not substantiated.
    • Inference time (17.2 min per report) is impractical, and no experimental mitigation or analysis is provided.
    • References have stylistic inconsistencies and grammatical mistakes (e.g., varying author-name formats, DOI usage, and capitalization: the first reference lists all author names, while references 6, 10, 25, 26, and 27 give only the first author followed by “et al.”, and the second reference writes the model name as “Uae” in the paper title when it should be “UAE”).
    • Lack of a clear research gap or motivation: The paper does not explicitly outline the specific limitations in existing methods that LEAVS addresses or why a new labeling approach is needed for abdominal CT.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper explores an important and relatively less-studied area—automated label extraction from abdominal CT reports using LLMs. However, to be impactful and practically useful, several core aspects must be significantly improved. Clarify and correct terminology, ensure consistent pipeline description, and improve the explanation of the classifier model’s architecture and its interaction with the labeler. The connection between the structured labels and the downstream vision classifier is particularly underdeveloped. Given the promising direction, a substantial rewrite could help realize its potential.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite a novel and promising idea, the paper lacks clarity in explanation, consistency in terminology, and rigor in methodology reporting. Key aspects of the pipeline—including the classifier, the integration of labels, and the experimental justification—are underexplained or missing. The language issues are pervasive and significantly hinder comprehension. Until these major issues are resolved, the paper does not meet the standard for publication.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper
    1. This paper introduces LEAVS (LLM Extractor for Abdominal Vision Supervision), a novel zero-shot prompt system that leverages large language models (LLMs) to extract structured labels from abdominal CT reports.
    2. LEAVS provides a more comprehensive and fine-grained labeling system for abdominal CT reports compared to existing methods, covering a wide range of findings (e.g., “Absent,” “Device,” “Postsurgical,” “Enlarged,” etc.).
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This approach is innovative compared to traditional rule-based or supervised methods.
    2. This paper demonstrates techniques to speed up inference, such as sentence filtration and reducing the number of required computations, which is crucial for practical applications.
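
    To illustrate the second point, a back-of-the-envelope sketch of how filtration reduces the number of LLM queries: organs with no relevant sentences fall through to default labels and skip every downstream question. The counts, organ and finding names, and the three-questions-per-finding assumption are illustrative, not taken from the paper.

        ORGANS = ["liver", "spleen", "kidneys"]        # the paper covers nine organs
        FINDING_TYPES = ["enlarged", "postsurgical"]   # the paper covers seven types

        def estimate_calls(relevant_sentence_counts: dict[str, int],
                           questions_per_finding: int = 3) -> int:
            # Count LLM queries: organs with no relevant sentences keep their
            # default labels and skip every downstream question.
            calls = 0
            for organ in ORGANS:
                calls += 1  # one filtration pass per organ (batched in practice)
                if relevant_sentence_counts.get(organ, 0) == 0:
                    continue
                calls += len(FINDING_TYPES) * questions_per_finding
            return calls

        print(estimate_calls({"liver": 4}))               # 9: only the liver expands
        print(estimate_calls({o: 4 for o in ORGANS}))     # 21: every organ expands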
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. This paper acknowledges that the inference time for LEAVS (17 minutes per report) is a significant limitation for real-world applications. This could hinder its adoption in clinical settings where rapid processing is essential.
    2. I would like to know how the system handles data privacy.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. This paper introduces LEAVS, a novel zero-shot prompt system that leverages large language models (LLMs) to extract structured labels from abdominal CT reports.
    2. This paper thoroughly evaluates the LEAVS system, comparing it against existing baselines (e.g., MAPLEZ and SARLE) and showing that LEAVS outperforms these methods in terms of F1 scores, precision, recall, and other metrics.
    3. This paper also validates LEAVS on a private dataset and a public dataset (AMOS-MM), showing that the method can generalize across different domains.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the development of a novel system, LEAVS, which uses large language models (LLMs) to extract structured labels from abdominal CT radiology reports. This system aims to improve the annotation process for medical imaging, particularly in the abdominal region, which is more complex than the chest region due to its anatomy and range of pathologies. A generic LLM-based solution is presented for applying structured labels to X-ray and CT datasets. Innovations are:

    • Transfer to CT Reporting Domain: The paper successfully adapts LLMs to the domain of abdominal CT reporting, which has been less explored compared to chest radiology.
    • LLM Sentence Filtration: The system employs a specialized sentence filtration process to focus the LLM on relevant parts of the CT report. This is particularly useful for handling long and detailed medical reports, ensuring that the LLM processes only the most informative sentences (well, medical reports are not THAT long, but anyway).
    • Urgency Assessment: LEAVS includes an urgency assessment feature that categorizes the urgency of findings in the CT reports. This helps in prioritizing medical interventions based on the severity of the abnormalities detected.
    • Flexibility of LLMs: The use of LLMs provides greater flexibility compared to traditional rule-based approaches like SARLE. LLMs can adapt to new abnormalities and keywords without the need for manually crafted rules, making them more versatile and efficient.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel Formulation with Hybrid Approach: The paper presents a hybrid approach combining total organ segmentation with LLM-based labeling, reporting, and anomaly detection. This integration leverages the strengths of both segmentation and language models to enhance the accuracy and comprehensiveness of medical report analysis.
    • Original Use of Data / Generic Approach for abdominal CT Reporting: This is the first time a generic approach has been applied to abdominal CT reporting, making it versatile for various organs and abnormalities. Unlike specific labelers like SARLE, LEAVS can handle a wide range of findings across multiple organs.
    • Demonstration of Clinical Feasibility by outperforming human experts: The results show that LEAVS can sometimes outperform human experts in labeling accuracy, demonstrating its potential for clinical use. This highlights the feasibility of using LLMs in real-world medical settings to assist or even enhance human performance.
    • Novel Application with LLM as Inference Engine/Expert System: Treating LLMs as an inference engine or expert system for deducing type, urgency, and finding uncertainty assessments from full organ reports is a novel application. This approach allows for detailed and nuanced analysis of medical reports.
    • Strong Evaluation due to the chosen validation setup: The validation setup includes majority voting from three human experts, ensuring robust and reliable evaluation of the system’s performance. This method helps to mitigate individual biases and provides a more accurate assessment of the system’s capabilities.
    • Advanced Prompting Techniques: The paper employs state-of-the-art prompting strategies, such as repeating questions and answers and requiring arguments. These techniques help minimize invalid answers and hallucinations, ensuring the reliability of the LLM’s outputs.
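
    As an illustration of that last point, a sketch of a prompt template combining the two techniques the reviewer names: an argument is required first, and the question is then repeated before the answer. The wording is hypothetical, not copied from the LEAVS prompts.

        def build_prompt(context: str, question: str, choices: list[str]) -> str:
            # Require an argument first, then the repeated question and answer.
            options = "\n".join(f"({chr(97 + i)}) {c}" for i, c in enumerate(choices))
            return (
                f"Report excerpt:\n{context}\n\n"
                f"Question: {question}\n{options}\n\n"
                "First, give your argument, quoting the relevant sentence.\n"
                f"Then repeat the question -- \"{question}\" -- and answer it by\n"
                "repeating the full text of exactly one option."
            )

        print(build_prompt("The liver is enlarged.",
                           "Is an 'enlarged' finding described for the liver?",
                           ["yes", "no"]))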
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Weak Argumentation for Novelty: The paper’s claim for novelty in extracting comprehensive findings from abdominal CT reports is somewhat overstated. As the authors themselves mention, there is already a long list of LLMs being utilized for specific findings. The novelty and genericity of LEAVS are based on existing technological knowledge (e.g., MAPLEZ and other domain-specific solutions) adjusted for generic CT reports. This weakens the argument for the uniqueness of the approach.
    • Minor aspects:
      1. Metric labeling: the “Min/R” label (minutes per report) in Table 3 is unconventional. A newly introduced metric should follow standard naming conventions, such as “reports per hour” (RpH; e.g., 17.2 min/report ≈ 3.5 RpH), especially given the length of the reports. This would make the metric more intuitive and easier to understand.
      2. Complexity of the labeling ontology, with a hard-to-follow textual explanation: the description of the labeling ontology in the introduction is difficult to follow. A graphical or tabular representation would convey the information more effectively, helping readers understand the default values and the single-label and multi-label options per sentence.
      3. “The labeling has been traditionally…” should read “Labeling has traditionally been done…”.
      4. Some missing commas, e.g., “We propose LEAVS, a Large Language Model Extractor…”.
      5. Hyphenation of “large language models” should be consistent throughout the paper, i.e., either “Large language models” or “Large-language models” everywhere; cf. the abstract vs. keyword list mismatch.
    • Hypothesis Testing for Sentence Filtration: The conclusion that sentence filtration allows the model to focus on important parts of long reports is speculative. The average report in the AMOS-MM training set has only 16 sentences and 1,400 characters, which is relatively small. The hypothesis that LLMs perform better with more compact and relevant content needs explicit testing. The current argument is based on assumptions rather than empirical evidence.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A long list of strengths: a) an innovative hybrid approach, b) very strong performance compared to other tools and human experts, c) LLMs as an inference engine, a novel application in general, and d) a robust evaluation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for recognizing the strengths of our work and for their insightful comments. Their suggestions will improve the manuscript’s clarity and quality. As detailed below, we believe most concerns can be effectively addressed through clarifications and minor edits.

R1, R2: Long inference (17 min/report): We appreciate the observation. As we discussed, this is a limitation of our implementation. In Section 4, paragraph 4, we presented an experimentally validated method that reduces the runtime to 5 min/report and suggested a future direction involving faster technology. Furthermore, we prioritized labeling quality for this specific work.

R1: Data privacy inquiry: Thank you for raising this point. As stated in the abstract, we ran all LLMs locally. We anonymized the private validation set with an external tool and used an anonymized public dataset.

R2: Incorrect method names and typographical errors: Thank you for pointing these out. We will revise method names to match original publications, define IRB, and fix reference capitalizations. We confirm that AMOS-MM is the correct name, and CoT is defined in Section 2, paragraph 5.

R2: Lack of architecture definition: We appreciate the request for clarification. The vision classifier and its interaction with the structured labels are described in Figure 1, Section 2, paragraph 6, and Section 3, paragraph 5. TotalSegmentator and UAE-S are cited and not described due to space limitations. We provided code for full reproducibility. We will further clarify that the ResNet layers serve as the shallow classifier.

R2: Justification of LLM choice: We agree that justification is essential. It was provided in Section 3, paragraph 1. We will clarify that the F1 score for abnormality type labeling was used for validation.

R2: Lack of annotation details: We acknowledge the need for more details. We will describe the annotation team (two board-certified radiologists, two senior radiology residents, and one postdoctoral MD researcher) and the labeling tool available in our shared code.

R2: Inappropriate baseline: We understand the concern and acknowledge the challenge in selecting baselines in an underexplored domain. We used MAPLEZ and SARLE as the most relevant available ones. Including both provides valuable performance comparisons.

R2: Organ mismatch in methods/results: As mentioned in the paper, our predefined criterion was to include results only for organs with N+>10. We will further clarify that some annotated organs were thus excluded from results.
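
A one-line sketch of that inclusion criterion (organ names and counts invented for illustration):

    positives_per_organ = {"liver": 42, "spleen": 13, "adrenal glands": 6}

    # Keep an organ in the results tables only if it has more than 10
    # positive ground-truth labels (N+ > 10).
    reported = {organ: n for organ, n in positives_per_organ.items() if n > 10}
    print(sorted(reported))  # ['liver', 'spleen'] -- adrenal glands excluded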

R2: Unclear impact of LEAVS on the classifier: We agree that explicitly isolating the effect of the LEAVS prompt on classifier performance would strengthen the paper. While we did not ablate this component, prior work we cited in the paper (e.g., MAPLEZ) has shown that improved LLM labels enhance downstream classification.

R2, R3: Weak research gap/novelty: We appreciate the reviewers’ feedback and would like to clarify that the novelty lies in the application: abnormal organ detection with comprehensive, high-quality labels extracted from CT reports, with thorough evaluations. In contrast, prior work has predominantly used LLMs for narrow tasks or relied on rule-based systems, and has not classified abdominal organs in 3D CT as normal/abnormal.

R3: Lack of ontology clarity: Thank you for this suggestion. To enhance clarity, we will restructure the ontology description from Section 2 as a table or as additional details in Figure 2.

R3: Hypothesis testing needed for sentence filtration: Our ablation study statistically supports improvement with sentence filtration (Table 3, LEAVS vs no filtration). We agree that we do not test if the improvement results from filtration enabling focus on important parts, hence our use of “probably” in Section 4, paragraph 3. In the list item in Section 1, we will mark it as a hypothesis.

R3: Other minor language and format fixes: We will address all suggestions.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


