Abstract

Developing advanced medical imaging retrieval systems is significantly challenged by the fact that the concept of ‘similar images’ varies across different medical contexts and perspectives, resulting in a pressing lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic way. We build two comprehensive medical imaging retrieval datasets for Chest X-ray and CT, MIMIC-IR and CTRATE-IR, with detailed image-image ranking annotations conditioned on diverse anatomical structures. Finally, we develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which consistently show superior performance in traditional image-image and image-report retrieval tasks, and further enable flexible and effective image retrieval conditioned on specific anatomical structures in text form, achieving state-of-the-art results on 77 out of 78 metrics.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1160_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MAGIC-AI4Med/RadIR

Link to the Dataset(s)

https://huggingface.co/datasets/zzh99/RadIR

BibTex

@InProceedings{ZhaTen_RadIR_MICCAI2025,
        author = { Zhang, Tengfei and Zhao, Ziheng and Wu, Chaoyi and Zhou, Xiao and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
        title = { { RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        page = {510 -- 520}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents RadIR, a novel and scalable framework for medical image retrieval by leveraging automatically mined similarity supervision from radiology reports. It introduces two datasets (MIMIC-IR and CTRATE-IR) and trains two retrieval models (RadIR-CXR and RadIR-ChestCT) that outperform existing vision-language baselines. The methodology includes a decomposition of radiology reports into anatomical structures, followed by a region-aware similarity scoring mechanism. The approach supports both unconditional and anatomy-conditioned retrieval, which is valuable for real-world clinical tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Using report-derived similarity at the anatomy level is smart and clinically meaningful; it can reduce the need for extensive manual annotation.
    - Strong and complete evaluation results across both unconditional and anatomy-conditioned retrieval tasks, with consistent improvements across nearly all metrics.
    - Clear figures and method description.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    - A general discussion of the results is missing; as a reader, you have to draw your own conclusions from the presented tables.
    - Weaknesses of the method are not discussed. For example: why does this method perform very well on Bronchi but not on Thorax?
    - Inclusion of a few qualitative examples would have been valuable.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The discussion of the results is very brief. There are undiscussed insights from the method that are visible in the tables, but should also be mentioned in the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Strong paper that covers both CT and X-ray modalities with an interesting method, that is lacking a bit in the results discussion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents an image retrieval system called RadIR that leverages radiology reports and image content to rank instances in a retrieval set based on their relevance to a given query. It addresses the context-dependent challenge of defining image similarity in medical contexts, which can depend on multiple factors such as global appearance, localized findings, and specific pathologies.

    The main contribution of the paper lies in creating a textual representation of the visual features based on radiologists’ reports, which contain structured information about i) which anatomical structures were examined, ii) what specific findings were observed in each structure, and iii) how these findings relate to potential pathologies.

    The idea of the authors is to create an indirect measurement of image similarity that can operate at multiple granularities, such as i) anatomy-specific, for example to show images with enlarged ventricles, or ii) global, at the level of the whole image, for example to find pneumonia across all scans.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    To achieve their goal of using text-based similarity scores as an indirect measure for the corresponding image similarity and creating automatically generated rankings of image similarity, the authors devised a novel framework based on 3 stages:

    1. Extracting findings related to specific anatomical structures from the radiology reports using tools like RadGraph-XL.
    2. Organizing the extracted findings into a hierarchical framework that captures relationships between anatomical structures.
    3. Assessing the semantic similarity between descriptions of the same anatomical structure across different reports using RaTEScore, a medical language model.
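
    To make this pipeline concrete, below is a minimal, illustrative sketch of step 3: ranking a retrieval corpus by per-anatomy findings similarity. It assumes reports have already been decomposed into anatomy-to-findings mappings and substitutes a toy token-overlap score for RaTEScore; all names, structures, and values here are hypothetical, not the authors' implementation.

    ```python
    # Minimal sketch: report-mined, anatomy-conditioned similarity ranking.
    # Each "report" is assumed to be pre-decomposed into {anatomy: findings}
    # (e.g., by a RadGraph-XL-style extractor); RaTEScore is replaced by a toy
    # token-overlap score purely for illustration.
    Report = dict[str, str]

    def text_similarity(a: str, b: str) -> float:
        """Toy stand-in for RaTEScore: Jaccard overlap of tokens, in [0, 1]."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    def rank_by_anatomy(query: Report, corpus: list[Report], anatomy: str) -> list[tuple[int, float]]:
        """Rank corpus entries by findings similarity for one anatomical structure.
        Entries that do not report the anatomy are skipped (no supervision signal)."""
        q = query.get(anatomy)
        if q is None:
            return []
        scored = [(i, text_similarity(q, r[anatomy])) for i, r in enumerate(corpus) if anatomy in r]
        # Higher report similarity is used as a proxy for higher image similarity.
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    # Example usage with toy findings:
    query = {"lung": "mild opacity in the left lower lobe", "heart": "normal cardiac silhouette"}
    corpus = [
        {"lung": "left lower lobe opacity, likely atelectasis"},
        {"heart": "enlarged cardiac silhouette"},
        {"lung": "lungs are clear", "pleura": "no pleural effusion"},
    ]
    print(rank_by_anatomy(query, corpus, "lung"))
    ```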

    Moreover, they employed a novel two-stage training process where they first pretrained a CLIP-style model with Vision Transformer and BERT encoders to learn global image-report alignments through contrastive learning, then extended this model with a fusion module that combines visual features with text query information to enable retrieval conditioned on specific anatomical queries.
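
    The following PyTorch sketch illustrates one plausible reading of this two-stage design: a CLIP-style dual encoder plus a late-fusion module in which the anatomy query attends over image tokens. Module names, dimensions, and the attention-based fusion are illustrative assumptions, not the authors' exact architecture.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnatomyConditionedRetriever(nn.Module):
        """Illustrative sketch of a CLIP-style dual encoder extended with a
        late-fusion module for anatomy-conditioned retrieval (not the exact RadIR design)."""

        def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, dim: int = 512):
            super().__init__()
            self.image_encoder = image_encoder   # e.g., a ViT returning (B, N, dim) patch tokens
            self.text_encoder = text_encoder     # e.g., a BERT returning (B, M, dim) token embeddings
            self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def encode_image(self, images: torch.Tensor) -> torch.Tensor:
            """Stage 1: unconditional (global) image embedding, as in standard CLIP."""
            tokens = self.image_encoder(images)                 # (B, N, dim)
            return F.normalize(tokens.mean(dim=1), dim=-1)

        def encode_image_conditioned(self, images: torch.Tensor, anatomy_tokens: torch.Tensor) -> torch.Tensor:
            """Stage 2: the anatomy query attends over image tokens (late fusion)."""
            img_tokens = self.image_encoder(images)             # (B, N, dim)
            txt_tokens = self.text_encoder(anatomy_tokens)      # (B, M, dim)
            fused, _ = self.fusion(query=txt_tokens, key=img_tokens, value=img_tokens)
            return F.normalize(self.proj(fused.mean(dim=1)), dim=-1)
    ```

    Under this reading, unconditional retrieval would use `encode_image`, while anatomy-conditioned retrieval would compare `encode_image_conditioned` embeddings of different images for the same anatomy query.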

    The paper also delivers two major outcomes:

    1. The authors extend two widely used datasets (MIMIC-CXR and CTRATE) to create MIMIC-IR and CTRATE-IR, providing detailed annotations of image-image similarity ordering across 90 anatomical structures.

    2. Two specialized systems, RadIR-CXR for chest X-rays and RadIR-ChestCT for CT scans, that can perform traditional image-to-image retrieval while also enabling the clinically valuable capability of retrieving images based on queries about specific anatomical structures.

    Finally, the authors have demonstrated empirically that RadIR-CXR and RadIR-ChestCT achieve superior performance compared to existing methods in traditional image-to-image retrieval, enable fine-grained retrieval conditioned on anatomical structures, and show strong improvements for less frequently mentioned anatomical regions.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Despite RadIR’s innovative approach to medical image retrieval, the paper presents several methodological assumptions and validation gaps that raise important questions about its clinical robustness and utility.

    1. The fundamental idea of RadIR rests on the assumption that text similarity between anatomical findings reliably reflects image similarity. While intuitively reasonable, this relationship is not straightforward in clinical practice: similar visual patterns can be described differently by different radiologists based on their reporting styles, terminology preferences, and level of detail. For example, a visual pattern might be described as “mild opacity” by one radiologist and “subtle consolidation” by another.

    2. The paper provides limited discussion of how report variability (e.g., report quality or incomplete descriptions) affects retrieval performance, yet it assumes that the radiology reports are of good quality and comprehensive. Such variations could significantly impact the extraction of anatomical findings and the subsequent similarity assessments. The lack of analysis of how RadIR handles incomplete or inconsistent reports leaves questions about its robustness in diverse clinical scenarios.

    3. The paper could benefit from an evaluation by radiologists to confirm whether the retrieved cases would be clinically useful.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s methodological contributions are substantial and innovative, with a technical foundation that lays the groundwork for subsequent clinical evaluation. The authors’ proposal helps solve a critical bottleneck in developing medical image retrieval systems and enables scaling that would be practically impossible with manual methods. The empirical results are compelling, with RadIR demonstrating superior performance across nearly all metrics compared to existing state-of-the-art methods. The creation of two comprehensive datasets (MIMIC-IR and CTRATE-IR) and two specialized retrieval systems represents a substantial contribution to the research community, providing resources and benchmarks that will enable further advances.

    Even though the paper does make assumptions about the correlation between text and image similarity that warrant further investigation, these limitations do not diminish the significance of the technical achievement. Moreover, the lack of clinical validation by radiologists is a valid concern but represents an opportunity for future work rather than a fatal flaw.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces RadIR, a scalable retrieval framework for medical imaging that leverages radiology reports to define image-image similarity at anatomy-specific levels. The authors construct two large-scale retrieval datasets, MIMIC-IR (X-ray) and CTRATE-IR (CT), with similarity rankings mined from paired radiology reports. RadIR is trained on these datasets as a retrieval system, supports both unconditional and conditional (anatomy-aware) image retrieval, and achieves high performance across multiple retrieval tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Report decomposition that preserves anatomical structures in a hierarchical format. Their approach is scalable and applicable to other fields as well, and they extract a large amount of findings across anatomies for their training data.
    2) Developed a relevance quantification strategy based on string matching in the form of RaTEScore.
    3) Developed a specific architecture and training procedure for RadIR. Their anatomy-conditioned late fusion between the text and vision encoders is unique, and the use of a triplet loss on this fused representation is thoughtful (see the sketch below). They show strong performance compared to other CLIP approaches trained on MIMIC and CT-RATE.
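
    As a rough illustration of point 3), a triplet objective on fused (anatomy-conditioned) embeddings could look like the sketch below, where the positive/negative assignment follows the report-mined similarity ordering for the queried anatomy. The margin, dimensions, and batch size are placeholders, not the paper's values.

    ```python
    import torch
    import torch.nn as nn

    # Illustrative only: a, p, n are anatomy-conditioned (fused) embeddings of an
    # anchor image, a higher-ranked image, and a lower-ranked image for the same
    # anatomy query, per the report-mined ordering. Margin 0.2 is a placeholder.
    triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)

    def ranking_triplet_loss(a: torch.Tensor, p: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
        """Pull the higher-ranked image closer to the anchor than the lower-ranked one."""
        return triplet_loss(a, p, n)

    # Toy check with random 512-d embeddings for a batch of 4:
    a, p, n = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
    print(ranking_triplet_loss(a, p, n))
    ```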

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Would performance improve with more specialized encoders, such as the CheXagent vision encoder or Clinical Longformer for text? The current model uses a standard ViT and BERT setup; evaluating alternatives tailored to medical imaging and clinical text encoding could improve the results.
    2) One of the components of the work is to establish similarities between images using the reports. However, the reports do not always capture all the information in the images; can the authors quantify the potential noise this may introduce during training?
    3) The rationale for “mask[ing] out the positive elements outside the diagonal” in the InfoNCE loss is unclear. The InfoNCE loss already penalizes off-diagonal elements. This effectively turns CLIP from a weakly supervised signal into a more anatomy-focused supervision signal, and I am not sure about its transferability to downstream tasks other than retrieval.
    4) Very minor: in Section 3.1, “Date Source” is spelled wrong.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a relevant contribution to medical image retrieval, with a thoughtful, scalable, and innovative approach to dataset generation and a retrieval-based training approach that performs consistently well across multiple metrics. The method demonstrates high performance on rarer anatomical queries and introduces a way to bridge text and image similarity in a clinically grounded way. However, the paper may benefit from more specialized encoders and clearer methodological explanations. Despite these limitations, the work represents a meaningful advancement and should be of interest to the MICCAI community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their insightful feedback and generous recognition of our work, including: “solving a critical bottleneck in developing medical image retrieval systems” (R#1), “substantial and innovative methodological contributions” (R#1 & R#3), “high performance across multiple retrieval tasks” (R#1 and R#3), “valuable approach for real-world clinical tasks” (R#2), etc. We have further improved our paper based on their valuable feedback and addressed all the concerns as follows:

[R#1 Q1] Concerns on Report Styles and Synonyms (i) In Equation 3, we employ RaTEScore — a medical language model trained and evaluated on multi-radiologist annotated clinical reports — to quantify text similarity between regional findings. This ensures robustness against variations in clinician styles and terminology preferences. Its effectiveness in nuanced medical language processing has also been experimentally validated in [1]. (ii) Furthermore, our pipeline supports seamless integration of evolving medical language models, allowing continuous improvement in performance and robustness. [1] RaTEScore: A Metric for Radiology Report Generation

[R#1 Q2] Concerns on Report Quality and Incomplete Descriptions (i) Incompleteness: We explicitly exclude unreported anatomies from being sampled as text conditions during both training and evaluation, thereby preventing noise from incomplete annotations. (ii) Incorrectness: While report inaccuracies may introduce noise (this will be discussed as a limitation in the camera-ready version), the effect is mitigated by using MIMIC and CTRATE — both manually verified, widely adopted datasets that ensure high-quality annotations.

[R#1 Q3] Human Evaluation Thanks for the constructive suggestion. We will include this in the camera-ready revision.

[R#2 Q1] More Discussion on Experiment Results We sincerely apologize for this oversight. Space constraints limited our discussion in the current submission. We will incorporate a more detailed discussion in the camera-ready version.

[R#2 Q2] Performance Gap Across Anatomies We hypothesize that the performance gap primarily stems from the greater pathological diversity in the thorax (where we identified three times more abnormality categories than in the bronchi), which likely increases the complexity of modeling regional image-image similarity. We will further discuss this in the camera-ready version.

[R#2 Q3] Qualitative Results Thanks for the suggestion. We will add qualitative results in the camera-ready version.

[R#3 Q1] Domain-Specialized Encoders We apologize for missing this detail in the paper. In practice, we employed domain-specific encoders: BiomedCLIP for CXR and CT-CLIP for CT imaging. We will clarify this in the camera-ready version.

[R#3 Q2] Concerns on Incomplete Descriptions Please refer to R#1 Q2.

[R#3 Q3] Loss Function Modifications. (i) The masked InfoNCE loss handles unpaired yet clinically similar image-text pairs (indicated by their high report similarity) on off-diagonal positions. It is crucial for our superior unconditional retrieval performance over baselines. (ii) Extending RadIR as a pre-trained model to other downstream tasks is treated as future work.
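
A minimal sketch of one way such masking could be implemented is shown below: off-diagonal image-text pairs whose report similarity exceeds a threshold are dropped from the softmax denominator, so clinically similar but unpaired samples are not treated as hard negatives. The thresholding rule, threshold value, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_info_nce(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    report_sim: torch.Tensor,
                    tau: float = 0.07,
                    threshold: float = 0.9) -> torch.Tensor:
    """InfoNCE over L2-normalized embeddings, with off-diagonal pairs whose
    report similarity is high masked out of the denominator (so they are not
    pushed apart as false negatives). Threshold/temperature are illustrative."""
    logits = image_emb @ text_emb.t() / tau                        # (B, B)
    diagonal = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    false_negatives = (report_sim >= threshold) & ~diagonal
    logits = logits.masked_fill(false_negatives, float("-inf"))    # drop from denominator
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage: normalized embeddings and a report-similarity matrix for a batch of 4.
img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
sim = torch.rand(4, 4)
print(masked_info_nce(img, txt, sim))
```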

[R#3 Q4] Typo Thanks for the kind reminder. We will fix this in the camera-ready version.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


