Abstract

The development of AI-based methods to analyze radiology reports could lead to significant advances in medical diagnosis, from improving diagnostic accuracy to enhancing efficiency and reducing workload. However, the lack of interpretability of AI-based methods could hinder their adoption in clinical settings. In this paper, we propose an interpretable-by-design framework for classifying chest radiology reports. First, we extract a set of representative facts from a large set of reports. Then, given a new report, we query whether a small subset of the representative facts is entailed by the report, and predict a diagnosis based on the selected subset of query-answer pairs. The explanation for a prediction is, by construction, the set of selected queries and answers. We use the Information Pursuit framework to select the most informative queries, a natural language inference model to determine if a fact is entailed by the report, and a classifier to predict the disease. Experiments on the MIMIC-CXR dataset demonstrate the effectiveness of the proposed method, highlighting its potential to enhance trust and usability in medical AI. Code is available at: https://github.com/Glourier/MICCAI2025-IP-CRR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0252_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Glourier/MICCAI2025-IP-CRR

Link to the Dataset(s)

MIMIC-CXR-JPG dataset: https://physionet.org/content/mimic-cxr-jpg/2.0.0/

CXR-LT dataset: https://physionet.org/content/cxr-lt-iccv-workshop-cvamd/2.0.0/

BibTex

@InProceedings{GeYuy_IPCRR_MICCAI2025,
        author = { Ge, Yuyan and Chan, Kwan Ho Ryan and Messina, Pablo and Vidal, René},
        title = { { IP-CRR: Information Pursuit for Interpretable Classification of Chest Radiology Reports } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {313 -- 323}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose IP-CRR, an interpretable-by-design model for classifying chest radiology reports. The method builds on the information pursuit (IP) framework, which aims to select an ordered set of queries to maximize information gain until enough evidence is gathered to classify the input report. The classification pipeline consists of three main steps (see the sketch after this list):

    • A set of interpretable queries is defined by mining a large-scale radiology report dataset. These queries are then clustered to reduce redundancy and keep only the most informative ones.
    • The answers to these queries are weakly generated using LLMs, with each answer taking one of three values: -1 (absence), 0 (uncertainty), or +1 (presence).
    • A variational information pursuit approach is used to make the final classification based on the generated query responses.
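
    To make the three steps above concrete, the following is a minimal, self-contained Python sketch of the sequential query-selection loop. All names (QUERIES, answer_query, querier, classifier, ip_classify) and the toy scoring heuristics are hypothetical stand-ins added for illustration; the paper's actual pipeline uses a mined query set, an LLM/NLI answering module, and learned variational IP networks.

        import numpy as np

        # Hypothetical mini query set; the paper mines hundreds of facts from reports.
        QUERIES = ["there is lung opacity", "the heart is enlarged", "there is pleural effusion"]

        def answer_query(report: str, fact: str) -> int:
            """Toy stand-in for the answering module: +1 presence, -1 absence, 0 unknown."""
            key = fact.split()[-1]
            if "no " + key in report.lower():
                return -1
            return +1 if key in report.lower() else 0

        def querier(history: np.ndarray) -> np.ndarray:
            """Toy stand-in for the learned querier (a real V-IP querier conditions on the history)."""
            return np.random.default_rng(0).random(len(QUERIES))

        def classifier(history: np.ndarray) -> float:
            """Toy stand-in for the learned classifier: P(disease | query-answer history)."""
            return float(1.0 / (1.0 + np.exp(-history.sum())))

        def ip_classify(report: str, eps: float = 0.9, max_steps: int = 10):
            history = np.zeros(len(QUERIES))              # 0 = not asked yet
            asked = np.zeros(len(QUERIES), dtype=bool)
            explanation, prob = [], 0.5
            for _ in range(min(max_steps, len(QUERIES))):
                scores = querier(history)
                scores[asked] = -np.inf                   # never re-ask a query
                q = int(np.argmax(scores))
                a = answer_query(report, QUERIES[q])
                history[q], asked[q] = a, True
                explanation.append((QUERIES[q], a))       # the explanation is the Q/A history
                prob = classifier(history)
                if max(prob, 1.0 - prob) >= eps:          # stop once the prediction is confident
                    break
            return prob, explanation

        prob, expl = ip_classify("Findings: increased opacity in the left lung base. No effusion.")
        print(round(prob, 3), expl)
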
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using a clustering approach to reduce the size of the query space is well-motivated and adds efficiency to the overall model.
    • Applying the information pursuit framework to chest radiology reports is an interesting and novel direction.
    • Experimental results demonstrate that the proposed model can achieve competitive accuracy using only a subset of queries, highlighting its potential for efficient and interpretable decision-making.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • In section 3.2, it would be beneficial if the authors provide a short recap of the stop condition and how the value of L is determined.
    • While Fig. 3 demonstrates that the model can maintain high accuracy with a small set of queries, the results in Table 1 raise concerns. In particular, for some tasks (e.g., LO), the performance gap between the black-box baseline (CXR-BERT (FT-All)) and IP-CRR is small, which aligns with the typical minor trade-off seen in other interpretable models. However, for tasks such as CA, CM, and PN, the performance degradation is substantial. This raises concerns about the practical applicability of IP-CRR in clinical settings, where such drops in accuracy may not be acceptable. A similar trend was observed with CBM, suggesting that both models may be limited in their ability to retain predictive performance while incorporating interpretability. It would be helpful if the authors could discuss these trade-offs more explicitly and offer potential explanations for the significant performance drops in certain tasks.
    • Including qualitative results for CBM in Fig. 4 would improve the comparison between IP-CRR and CBM. Specifically, visualizing the most informative queries (IP-CRR) alongside the learned concepts (CBM) could provide valuable insight into the interpretability mechanisms of each model.
    • An inference time comparison between IP-CRR and the baseline models would be valuable, particularly given that the model sequentially evaluates queries. This could help evaluate the method’s feasibility for real-time or large-scale deployment.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The proposed method introduces an interesting and potentially impactful approach for interpretable classification of radiology reports. However, some aspects of the model require further clarification—particularly regarding the trade-off between interpretability and performance. Additional discussion, especially around the more challenging tasks and efficiency metrics, would strengthen the manuscript and its applicability to real-world scenarios. There are also some minor typos that should be corrected.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method introduces an interesting and potentially impactful approach for interpretable classification of radiology reports. However, some aspects of the model require further clarification—particularly regarding the trade-off between interpretability and performance. Additional discussion, especially around the more challenging tasks and efficiency metrics, would strengthen the manuscript and its applicability to real-world scenarios. There are also some minor typos that should be corrected.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors carefully answered the raised concerns.



Review #2

  • Please describe the contribution of the paper

    The core idea in the paper is to make predictions based on a set of the most informative facts extracted from the report. The selected facts themselves serve as the explanation for the diagnosis. It adapts the Information Pursuit (IP) framework to sequentially select the most informative queries and integrates NLI-based query answering, including the handling of “unknown” answers. Results are shown for CXR reports and several labels.
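
    As a rough illustration of what NLI-based query answering with an “unknown” option can look like, here is a minimal sketch using a generic off-the-shelf MNLI model from Hugging Face transformers; the model choice, the argmax decision rule, and the mapping to {-1, 0, +1} are illustrative assumptions, not the exact components used in the paper.

        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        MODEL = "roberta-large-mnli"   # assumed generic NLI model, not necessarily the paper's
        tok = AutoTokenizer.from_pretrained(MODEL)
        nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

        def answer(report: str, fact: str) -> int:
            """Map NLI labels to the paper's answer space: +1 presence, -1 absence, 0 unknown."""
            inputs = tok(report, fact, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = nli(**inputs).logits.softmax(-1).squeeze(0)
            label = nli.config.id2label[int(probs.argmax())].lower()
            if label == "entailment":
                return +1                  # the fact is stated by the report
            if label == "contradiction":
                return -1                  # the report rules the fact out
            return 0                       # neutral -> "unknown"

        print(answer("The lungs are clear. No pleural effusion.", "pleural effusion is present"))  # typically -1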

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approach, which yields more transparent results within the realm of CXRs, is novel. The additional confidence reported is a novel application.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Figure 1 appears to have a serious error. “heart size is stable” requires the prior report to determine if the heart is enlarged or not, yet the shown method fails with a 99% confidence.

    Under related work, the authors state that it is “hard for radiologists to trust their predictions in clinical use”. This should be either deleted or a reference should be provided to the product this is referring to. The reviewer is not aware of any radiologist using any such product clinically.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach but appears to have serious errors (see Figure 1). The motivations can also be clarified (see above).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The example in Figure 1 and the author rebuttal raise further flags in terms of the quality of the method’s results. Upon further review, q5 should be unknown. Temporal information is necessary to derive any such conclusions.



Review #3

  • Please describe the contribution of the paper

    Proposed an interpretable-by-design framework for Chest Radiology Reports (CRR) classification, which extends the Information Pursuit (IP) framework for explainable image classification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses the lack of interpretability, a critical limitation in medical AI, by proposing a model that is inherently explainable. This significantly enhances trust and clinical applicability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the paper proposes an interpretable approach, the idea of extracting structured information (e.g., facts, entities, or key phrases) from radiology reports and using it for classification has already been explored in prior work, such as https://doi.org/10.1200/CCI.22.00139.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) The novelty of the proposed method is not clearly distinguished from prior work; similar frameworks exist that extract structured information from radiology reports. 2) The evaluation is limited to MIMIC-CXR without external validation or cross-institutional testing. 3) A comparison with stronger baseline models and recent interpretable methods is missing and should be added.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important problem in medical AI by proposing an interpretable framework for radiology report classification.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The paper addresses the lack of interpretability, a critical limitation in medical AI, by proposing a model that is inherently explainable.




Author Feedback

We thank the reviewers for finding that our paper addresses an important problem in medical AI (R1) with an interesting, potentially impactful, and novel direction (R2) that improves model transparency (R4). In what follows, we address the major concerns and clarify misunderstandings.

  1. Novelty and Motivation. (R1, R4) We respectfully disagree that the novelty of our work is diminished due to prior work on extracting structured information from radiology reports for classification. On the contrary, we think our work makes a meaningful and distinct contribution to this broadly relevant problem. Our key innovation is an interpretable-by-design classification framework that automatically selects what information to extract. Our approach learns to extract and summarize relevant facts automatically, while the prior work cited by the reviewer relies on a rule-based system. We will revise the manuscript to clarify these distinctions and more explicitly highlight our unique contributions.

  2. Trade-off Between Interpretability and Performance. (R2) We appreciate this concern. The main reasons for the performance degradation are the size of the trainable model (109 MB for CXR-BERT-FT-All vs. 12 MB for IP-CRR) and the design of the query set. For example, only 5/520 queries mention “aorta”, compared to 53/520 queries that mention “opacity”, so the performance on “CA” (Calcification of the Aorta) is worse than that on “LO” (Lung Opacity). This could be alleviated by a better design of the query set (e.g., by increasing the number of queries for under-represented classes) and by using better answering models for more challenging tasks.

  3. More Experimental Results about Comparison Study, External Validation, Efficiency, and Qualitative Results. (R1, R2) Thank you for these thoughtful suggestions. 1) We compared to a SOTA black-box model (CXR-BERT) and an interpretable baseline (CBM). We also explored a sparse CBM variant [arXiv 2024: 2404.03323] under multiple sparsity constraints, but the best performance did not exceed that of IP-CRR (e.g., the AP scores on LO are 0.935 for Sparse CBM vs. 0.972 for IP-CRR). 2) While external evaluation is valuable, it is complicated by differences in labeling conventions across datasets, which can lead to label shift. We believe this is an important direction that warrants further investigation and plan to explore it in future work. 3) For efficiency, the total inference cost = (querying + answering + classifying) * steps. On average, IP-CRR needs 1.2 ms/step for querying and classifying, and 34 ms/step for answering (Flan-T5) on one A5000 GPU. For the “LO” task, with 30 steps, IP-CRR needs 1.056 s/sample, while CXR-BERT takes 8 ms/sample. The inference cost is reasonable given the added interpretability and remains compatible with real-time use. 4) We provide CBM qualitative results in our GitHub repository, including two examples comparing IP-CRR and CBM interpretations.
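
     As a quick check of the timing figures above, the per-sample cost decomposes as per-step cost times number of steps (numbers taken from the rebuttal; the variable names are ours):

        per_step_ms = 1.2 + 34.0        # querying + classifying (1.2 ms) plus answering (34 ms)
        steps = 30                      # IP steps reported for the "LO" task
        total_ms = per_step_ms * steps  # = 1056 ms, i.e. ~1.056 s/sample vs. ~8 ms for CXR-BERT
        print(total_ms)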

  4. Stopping Condition. (R2) IP terminates when the remaining queries provide zero information gain. In practice, this is approximated using the class probabilities given the query-answer history: IP stops once the predicted class probability exceeds a pre-determined threshold ε. We will clarify this further in the final manuscript.
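
     A minimal sketch of this stopping rule, under the assumption that the relevant quantity is the classifier’s largest class probability given the query-answer history (the function name, the vector input, and the threshold value are illustrative):

        import numpy as np

        def should_stop(class_probs: np.ndarray, eps: float = 0.95) -> bool:
            """Stop querying once the posterior given the history is confident enough.
            eps stands in for the pre-determined threshold; the value here is arbitrary."""
            return float(np.max(class_probs)) >= eps

        print(should_stop(np.array([0.02, 0.97, 0.01])))  # True: one class already exceeds eps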

  5. Clarification of Figure 1. (R4) The reviewer is correct that “heart size is stable” implies a temporal comparison. However, this information can be inferred from one report because of the way in which the CXR-LT dataset [MedIA: 2024.103224] is labeled: if a disease is not reported as positive then it is treated as negative. In the future, we plan to extend our model to support longitudinal analysis.

  6. Clarification of Clinical Trust Statement. (R4) There appears to be a misunderstanding. We did not mean that doctors are using such models in clinical practice. Our intended point is that lack of interpretability remains a key barrier to clinical adoption of AI [npj 2025: s44401-025-00016-5, JMIR: 2024/1/e53207].




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers appear to be even less convinced of the paper’s viability for acceptance to MICCAI. The authors have offered clarifications, based on which further issues have been uncovered. In light of this, the paper may be rejected.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal did not address the concerns raised by the reviewers.


