Abstract

Radiology reports contain free-form text that conveys critical clinical information derived from imaging studies and patient history. However, the unstructured nature of these reports, coupled with the complexity and ambiguity of natural language, poses significant challenges for automated information extraction, particularly in domains with limited labeled data. To address this, we introduce a novel expert-annotated dataset encompassing four new imaging modalities: cardiac magnetic resonance imaging (MRI), abdominal ultrasound, head computed tomography (CT), and CT pulmonary angiography (CTPA). Leveraging this dataset, we developed transformer-based models optimized for entity recognition and relation extraction within specific modalities, enabling the generation of high-quality radiology annotations. Our evaluation of fine-tuning methods demonstrates that modality-specific models achieve a 12.5% macro F1 score improvement in entity recognition and a 28.3% improvement in relation extraction compared to prior approaches. These findings highlight the potential of fine-tuned, modality-specific models for enhancing automated radiology text processing and downstream applications. By releasing the model and datasets, we aim to foster research in medical natural language processing across a broader range of imaging modalities. The code is available at https://github.com/tonikroos7/RadGraph-Multimodality.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4672_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/tonikroos7/RadGraph-Multimodality

Link to the Dataset(s)

N/A

BibTex

@InProceedings{GuaHao_Enhancing_MICCAI2025,
        author = { Guan, Haoyue and Dai, Yuwei and Afyouni, Shadi and Kain, Alec and Hsu, Wen-Chi and Cheng, Jiashu and Yao, Sophie and Wang, Yuli and Wu, Jing and Pulakhandam, Rishitha and Zhao, Lin-mei and Zhu, Chengzhang and Jiao, Zhicheng and Jones, Craig and Bai, Harrison},
        title = { { Enhancing Radiology Report Interpretation through Modality-Specific RadGraph Fine-Tuning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces an expert-annotated dataset encompassing four new imaging modalities: cardiac magnetic resonance imaging (MRI), abdominal ultrasound, head computed tomography (CT), and CT pulmonary angiography (CTPA). Leveraging this dataset, the authors developed transformer-based models optimized for entity recognition and relation extraction within specific modalities, enabling the generation of high-quality radiology annotations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The dataset comprises a total of 800 radiology reports, including: 100 cardiac MRI reports, 200 pulmonary artery CT angiography (CTPA) reports, 300 head CT reports, and 200 abdominal ultrasound reports. This dataset contains 14,813 unique entity-label pairs (entity, label) and 11,179 unique relation triples (entity 1, entity 2, relation). Overall, the scale of this dataset represents a 60% increase compared to the original RadGraph (limited to chest X-rays) and differs from RadGraph-XL (which includes chest CT, abdominal/pelvic CT, brain MR, and chest X-rays).
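
    To make these structures concrete, the minimal Python sketch below illustrates the (entity, label) pairs and (entity 1, entity 2, relation) triples; the report snippet and spans are invented, and the label and relation sets are assumed to follow the original RadGraph schema.

        # Invented example report; labels assume the RadGraph schema:
        # ANAT-*/OBS-* types with presence modifiers DP (definitely present),
        # DA (definitely absent), and U (uncertain).
        report = "Mild cardiomegaly. No pleural effusion."

        # (entity, label) pairs
        entity_label_pairs = [
            ("Mild", "OBS-DP"),
            ("cardiomegaly", "OBS-DP"),
            ("pleural", "ANAT-DP"),
            ("effusion", "OBS-DA"),
        ]

        # (entity 1, entity 2, relation) triples, using the RadGraph
        # relation types: modify, located_at, suggestive_of.
        relation_triples = [
            ("Mild", "cardiomegaly", "modify"),
            ("effusion", "pleural", "located_at"),
        ]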

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of Novelty. No structural innovation in the model architecture: the work employs existing language models (e.g., BERT, BiomedBERT) and established information extraction frameworks (e.g., DYGIE++) without proposing novel model structures or extraction mechanisms, such as a new multi-task framework, attention mechanism, or graph-based modeling approach. Annotation schema derived from RadGraph: while the authors expanded the modalities and disease types, they reused RadGraph’s predefined entity/relation schema, merely extending its coverage rather than introducing a new annotation paradigm or knowledge representation structure.

    2. Insufficient Clarification on Differentiation from RadGraph-XL. RadGraph-XL already supports multiple modalities (e.g., abdominal CT). The authors should emphasize the uniqueness and necessity of the newly added modalities in this work.

    3. Ambiguity in Fine-Tuning Strategy. The paper does not specify whether modality-specific fine-tuning or multi-task learning was applied, leaving the training protocol unclear.

    4. Concerns About Evaluation Fairness (Potential Bias from Shared Data Sources). This work introduces a new expert-annotated multimodal dataset and achieves significant performance gains through modality-specific fine-tuning, demonstrating practical value. However, questions remain regarding the fairness of the experimental evaluation:

    The training and test data both originate from the authors’ custom dataset, a common practice in domain-specific fine-tuning. Yet, the criteria for train-test splitting are unspecified, raising concerns about potential “self-contained benchmark” bias or limited generalizability.

    Furthermore, RadGraph and RadGraph-XL were not fine-tuned on this dataset, making direct comparisons with the proposed method potentially less equitable.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    To enhance the credibility and generalizability of the experimental findings, the authors are advised to consider the following revisions in their manuscript:

    1. Clearly specify the data splitting methodology for the training, validation, and test sets, ensuring strict independence between them to avoid potential data leakage.

    2. Report the zero-shot performance of RadGraph and RadGraph-XL on the test set of this dataset to provide baseline references.

    3. Further evaluate cross-modal generalization capability, such as testing on modalities not included in training, to validate the model’s robustness.

    4. Release all the code and datasets to the public.

    Including these additions would significantly improve the rigor and persuasiveness of the experimental design.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contributions of the paper are: 1) it introduces a new dataset including cardiac MRI, CTPA, head CT, and abdominal ultrasound reports, extending the RadGraph datasets; 2) it demonstrates through experiments the importance of modality-specific models, which consistently outperform out-of-domain models in entity recognition and relation extraction tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A new dataset, which includes cardiac MRI, CTPA, head CT, and abdominal ultrasound reports, is introduced. Compared to the original RadGraph dataset, which only includes chest X-ray images, this expansion broadens the model’s applicability and offers significant benefits to the community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper does not mention whether the new multi-modality dataset will be made publicly available. Since the original RadGraph is based on public reports and imaging, it would be beneficial if the new imaging data were also made public to ensure broader adoption and reproducibility.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is based on a few key factors. 1) The expansion of RadGraph to support multiple imaging modalities is valuable and broadens its applicability. 2) The fine-tuning of modality-specific models is effective, but the lack of clarity on dataset availability raises concerns about reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This is a borderline accept. My concerns about limited novelty and narrow evaluation remain, particularly the lack of broader baselines and downstream validation. However, I acknowledge that this is a dataset paper, and the contribution of a well-annotated, multi-modality extension to RadGraph has clear utility for the MICCAI community.



Review #3

  • Please describe the contribution of the paper
    1. The authors introduce a new, expert-annotated dataset that includes four previously underrepresented imaging types: cardiac MRI, abdominal ultrasound, head CT, and CT pulmonary angiography (CTPA). This significantly expands the diversity of data available for medical NLP tasks.
    2. They build and fine-tune BERT-based models tailored to each imaging modality. These models are designed for named entity recognition (NER) and relation extraction, and they demonstrate substantial performance improvements in both areas.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduce a new, expert-annotated dataset that includes four previously underrepresented imaging types: cardiac MRI, abdominal ultrasound, head CT, and CT pulmonary angiography (CTPA). This significantly expands the diversity of data available for medical NLP tasks.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major Weaknesses:

    1. The paper states, “For each modality in our dataset, we fine-tuned a modality-specific version of RadGraph for performance evaluation.” in Section 3.1. Does this mean there is a separate RadGraph-like model for each modality, resulting in four models in total? If so, are the evaluation metrics in Table 4 (Multi-modality RadGraph) computed based on the outputs of these four models?
    2. The key contribution of this paper is the creation of a novel expert-annotated dataset covering four new imaging modalities: cardiac MRI, abdominal ultrasound, head CT, and CT pulmonary angiography. Therefore, if permissions and ethical considerations allow, it is strongly recommended to release the dataset, along with the corresponding pretrained models and inference code, to foster research in medical NLP across a broader range of imaging modalities. Without such release, the impact and value of the work would be significantly diminished.

    Minor Weakness:

    1. The use of quotation marks is incorrect, as in “Observation”.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper makes a valuable contribution by introducing a diverse, expert-annotated dataset across four underrepresented imaging modalities and demonstrating strong performance gains using modality-specific BERT-based models. However, the lack of dataset and code release — despite being central to the paper’s contribution — limits its potential impact and hinders reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their valuable feedback and constructive suggestions. We appreciate the recognition of our contributions in introducing a new expert-annotated multi-modality dataset (R1, R2, R3), which broadens the scope of data for medical NLP tasks (R3) and offers substantial benefits to the research community (R2). Additionally, we are grateful for the acknowledgment of our focus on modality-specific models (R2) and the strong performance improvements through fine-tuning (R3), which to the best of our knowledge have not been previously explored. In response to all reviewers, we confirm that the source code, trained model weights, and dataset will be made publicly available upon acceptance. We will release the dataset and annotations under a Data Use Agreement (DUA) for non-commercial, research purposes. Our university will host and ensure the long-term preservation of all data.

Reviewer 3: Q: Is there a separate RadGraph-like model for each modality? Which RadGraph model was used for the metric evaluation in Table 4?

We appreciate the reviewer’s attention to the evaluation metrics. We trained four separate RadGraph models, each specific to one imaging modality. The evaluation metrics in Table 4 refer to the multi-modality RadGraph model trained on reports from all four modalities combined. We will revise the manuscript to clearly distinguish the modality-specific models from the multi-modality model.

Reviewer 1: Q: No structural innovation in model / Annotation schema

Our focus is on exploring the impact of fine-tuning RadGraph for modality adaptation, a novel application not previously explored. Using existing language models effectively highlights the performance gains achieved through fine-tuning. We use the existing DYGIE++ framework due to its proven effectiveness in clinical entity extraction tasks across all previous RadGraph works. We adhered to the RadGraph annotation schema to maintain consistency with prior work, as altering the annotation schema would introduce discrepancies in evaluation metrics. We instructed our annotators to provide more granular annotations based on annotation errors observed in prior RadGraph papers, aiming to improve label accuracy and enhance the dataset’s utility for future research.
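
For context, the sketch below shows what one training document could look like in the JSONL format used by the public DYGIE++ implementation, assuming its documented conventions (document-level, inclusive token indices); the doc_key, tokens, and spans are invented for illustration, not taken from the paper.

    import json

    # Hypothetical DYGIE++-style training document (one JSON object per line).
    doc = {
        "doc_key": "headct_0001",   # invented identifier
        "dataset": "radgraph",      # invented dataset tag
        "sentences": [["No", "acute", "intracranial", "hemorrhage", "."]],
        # Per-sentence entity spans as [start_token, end_token, label],
        # assuming document-level token indices with inclusive ends.
        "ner": [[[1, 1, "OBS-DA"], [2, 2, "ANAT-DP"], [3, 3, "OBS-DA"]]],
        # Per-sentence relations as [start1, end1, start2, end2, label].
        "relations": [[[1, 1, 3, 3, "modify"], [3, 3, 2, 2, "located_at"]]],
    }
    print(json.dumps(doc))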

Q: The authors should emphasize the uniqueness and necessity of the newly added modalities.

We acknowledge the need for more clinical detail in Section 2 and will revise it to emphasize the clinical significance of the new modalities. These imaging modalities serve as critical first-line diagnostic and screening tools for cardiovascular, digestive, and neurological disorders, conditions with high morbidity and mortality rates that often necessitate urgent intervention. By incorporating these underrepresented modalities, our dataset introduces clinically significant scenarios that are currently lacking in the medical NLP literature.

Q: The paper does not specify whether modality-specific fine-tuning or multi-task learning was applied; please specify the data-splitting methodology.

We apologize for the confusion regarding fine-tuning and data splitting. The model was pre-trained on the RadGraph dataset and then fine-tuned on our modality-specific data. We described our parameter settings in Section 3.1 and will release the parameter-settings files once accepted. We applied a 70%-15%-15% split for training, validation, and testing, ensuring no overlap between the sets. Training settings will be detailed in the revised manuscript.
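
As a minimal sketch, a 70%-15%-15% report-level split with no overlap could be implemented as follows; the file name and random seed are assumptions for illustration.

    import json
    import random

    # Load one annotated report per line (hypothetical file name).
    with open("reports.jsonl") as f:
        reports = [json.loads(line) for line in f]

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(42).shuffle(reports)

    n = len(reports)
    n_train, n_val = int(0.70 * n), int(0.15 * n)

    train = reports[:n_train]
    val = reports[n_train:n_train + n_val]
    test = reports[n_train + n_val:]  # disjoint slices: no report appears twice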

Q: RadGraph and RadGraph-XL were not fine-tuned on this dataset; the zero-shot performance of RadGraph and RadGraph-XL should be reported.

Since our model is pre-trained on the RadGraph dataset, it is equivalent to a fine-tuned RadGraph model. Given that the RadGraph-XL dataset was not released, it was infeasible to fine-tune and evaluate it under the same conditions as our model. We reported the zero-shot performance of both RadGraph and RadGraph-XL on our dataset in Table 3, showing their out-of-domain performance on the new modalities.
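
For reference, entity-level macro F1 of the kind reported here can be computed by macro-averaging per-label span F1; the sketch below is a minimal illustration with invented data structures, not the paper’s evaluation code.

    def macro_f1(gold, pred, labels):
        """gold/pred: sets of (doc_key, start, end, label) tuples."""
        scores = []
        for lab in labels:
            g = {t for t in gold if t[3] == lab}
            p = {t for t in pred if t[3] == lab}
            tp = len(g & p)  # exact-span, exact-label matches
            prec = tp / len(p) if p else 0.0
            rec = tp / len(g) if g else 0.0
            scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        return sum(scores) / len(scores)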




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Despite the acknowledged value of the new dataset, reviewers have raised critical concerns regarding the limited technical contributions.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This appears to be a useful dataset of multiple modalities worth accepting as also agreed by reviewers.


