Abstract

Vision-language models have proven highly beneficial for medical image analysis since they learn rich semantics from both images and reports. Prior efforts have focused on better alignment of image and text representations to enhance image understanding. However, although explicit reference to a prior image is common in chest X-ray (CXR) reports, aligning progression descriptions with the semantic differences in image pairs remains under-explored. In this work, we propose two components to address this issue. (1) A CXR report processing pipeline to extract temporal structure: it processes reports with a large language model (LLM) to separate the description and comparison contexts, and extracts fine-grained annotations from reports. (2) A contrastive captioner model for CXR, namely CoCa-CXR, which learns to describe both images and their temporal progression. CoCa-CXR incorporates a novel regional cross-attention module to identify local differences between paired CXR images. Extensive experiments show the superiority of CoCa-CXR over previous methods on both progression analysis and report generation. Notably, on MS-CXR-T progression classification, CoCa-CXR obtains 65.0% average test accuracy on five pulmonary conditions, outperforming the previous state-of-the-art (SOTA) model BioViL-T by 4.8%. It also achieves a RadGraph F1 of 24.2% on MIMIC-CXR, comparable to the Med-Gemini foundation model.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2248_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CheYix_CoCaCXR_MICCAI2025,
        author = { Chen, Yixiong and Xu, Shawn and Sellergren, Andrew and Matias, Yossi and Hassidim, Avinatan and Shetty, Shravya and Golden, Daniel and Yuille, Alan L. and Yang, Lin},
        title = { { CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {78--88}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • This paper proposes CoCa-CXR, which incorporates both prior and current CXRs for progression prediction.

    • This paper curated a CXR-4 dataset with data from MIMIC-CXR and external knowledge from Gemini and ImaGenome.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposed a local cross-attention module to integrate features of a chest X-ray pair and a scheme to process chest X-ray datasets with temporal information.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The curation of CXR-4 involves four steps, described in Section 2. However, the text only describes what the authors did without explaining what problem each step addresses. The authors should improve the description to clearly explain the motivation behind each step.

    • Regarding progression prediction on MS-CXR-T, the authors should provide more details on the inference setting. For example, whether the prediction is conditioned on the prior image only or both prior and current images.

    • This paper uses a standalone section before the method section to describe the data curation process, which gives it undue emphasis. The authors should consider moving this content to a subsection of the experiments and results section.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method, which incorporates a new regional cross-attention module, is incremental relative to CoCa. The motivations for the data curation processes are unclear.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors claimed that the dataset is “foundational”. However, the high-level idea of each data curation stage is still unclear. For example, it is difficult to understand how the “clean image-report pair” dataset can address “the challenges of learning temporal alignment”.



Review #2

  • Please describe the contribution of the paper

    The authors propose a framework for multi-stage vision-language contrastive learning that enables disease outcome prediction and report generation. They evaluate the method on paired chest X-ray data and introduce four new sub-datasets cleaned and filtered from MIMIC-CXR.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strong performance paired with a nice description makes the paper easy to read and enables a quick understanding of the topic.

    High-quality figures and tables combined with a clear description and execution of all experiments.

    A high-quality ablation study motivates all parts of the proposed architecture.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Figure 4, which feels like a really important part of the paper as it introduces the prompts and, therefore, how CXR-4 was generated, is missing. This severely reduces the understandability of the paper and the dataset.

    The authors claim to have introduced the regional cross-attention module, but masking out attention regions is a very common approach, integral to training language models, for example. I believe this part should be presented more as background.

    region(i) was not properly introduced, making it hard to understand. If i stands for a token, I assume you could also use different masks for different diseases in the same report. How are such cases handled?

    There is no mention of a link to the dataset that you introduce.

    (Writing error on page 7: “pretraininginitializing”)

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe the paper is well executed, but multiple points (as mentioned above) are not satisfactory in their current state. However, I believe these points can be addressed in a rebuttal, which is why I chose weak accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors apply CoCa-style joint contrastive and captioning pretraining to the task of longitudinal chest X-ray interpretation, focusing on progression classification and report generation. They extend the CoCa framework with a regional cross-attention mechanism that aligns regions between prior and current X-rays, enabling the model to focus on spatial changes. To support this, they introduce CXR-4, a new dataset derived from MIMIC-CXR, structured for temporal analysis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper targets a clinically important but underexplored problem: detecting and describing temporal changes in chest X-rays. This has clear relevance for disease progression monitoring.

    The combination of contrastive learning and captioning for longitudinal imaging is novel in this context.

    The proposed regional cross-attention mechanism is intuitive and effective, allowing explicit modeling of spatial change between timepoints.

    The experimental setup is strong, including evaluation on MS-CXR-T for classification and MIMIC-CXR for report generation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The evaluation of generated reports does not include standard NLP metrics such as CE (clinical efficacy), BERTScore, BLEU, or ROUGE. This limits the comparability of the results to prior work in report generation.

    Although it is still a preprint, it would have been interesting to see how the model compares to the cited MAIRA-2 model, since it seems highly relevant.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    How are the spatial masks of the regional cross-attention for current and prior image defined at test time?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work addresses a clinically relevant task, presents a novel approach, and demonstrates its effectiveness through comprehensive experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The paper addresses clinically relevant tasks, presents a novel regional cross-attention extension to CoCa, and provides strong experimental validation.




Author Feedback

Reviewer #3

Comment: The evaluation of generated reports does not include standard NLP metrics such as BLEU or ROUGE. Comparison to MAIRA-2 would be helpful.

Response: Thank you for your constructive comments. We agree that including BLEU and ROUGE is valuable for completeness. We have added BLEU and ROUGE scores to the final result table to supplement RadGraph F1, which has been shown to align well with clinical semantics (Yu et al., "Evaluating progress in automatic chest X-ray radiology report generation," Patterns 4(9), 2023).

As for MAIRA-2, we appreciate the suggestion. We note that MAIRA-2 uses the prior report as an input modality, which differs substantially from our setting, where only image pairs are used. Moreover, MAIRA-2 only reports results for the Findings section, while our work tackles the more comprehensive task of Findings + Impression generation. Nonetheless, we will include a discussion and numbers from MAIRA-2 in our final comparison table for completeness.

Reviewer #2

Comment: Missing prompts for Gemini; masking not sufficiently distinguished from common practices in NLP models; unclear definition of region(i); dataset not linked.

Response: We apologize for this oversight. The prompts used for Gemini to extract comparison and localization descriptions (Figure 4) will be clearly included in the camera-ready version.

Regarding your concern on masking: in contrast to the token masking used in language-model pretraining, our regional masking is designed to introduce a spatial prior for temporal alignment in CXR progression modeling. Specifically, each region in the current image is only allowed to attend to spatially corresponding regions in the prior image. This masking is applied during both training and inference, and is thus fundamental to enforcing temporal locality, a key inductive bias in this domain.

region(i) refers to a local square window around position i in the image token grid. While different abnormalities may have overlapping regions, our model does not rely on disease-specific masking but applies a consistent spatial restriction to encourage localized comparison, independent of the disease category. We will clarify this notation in the final version.
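To make this concrete, below is a minimal sketch of how such a regional mask could be built, assuming a square image-token grid and a fixed window size; the function name, grid dimensions, and window radius are illustrative assumptions, not our exact implementation:

    import torch

    def regional_attention_mask(grid_h: int, grid_w: int, window: int = 3) -> torch.Tensor:
        # Binary mask over (current tokens x prior tokens): entry (i, j) is True
        # iff prior token j lies inside the local square window region(i)
        # centered at the grid position of current token i.
        n = grid_h * grid_w
        mask = torch.zeros(n, n, dtype=torch.bool)
        half = window // 2
        for i in range(n):
            r, c = divmod(i, grid_w)
            for rr in range(max(0, r - half), min(grid_h, r + half + 1)):
                for cc in range(max(0, c - half), min(grid_w, c + half + 1)):
                    mask[i, rr * grid_w + cc] = True
        return mask

    # Usage: block out-of-window attention before the softmax, e.g.
    # scores = scores.masked_fill(~regional_attention_mask(h, w), float("-inf"))

The same mask is applied at training and test time, so no disease-specific information is needed to construct it.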

Dataset sharing is subject to institutional clearance and will be released upon approval.

Reviewer #1

Comment: Data curation steps lack motivation; inference setup not explained; dataset section should be moved to experiments.

Response: Thank you for your feedback. We respectfully disagree on several points:

Our dataset curation pipeline is one of the main technical contributions of the paper. Existing CXR datasets do not provide explicitly structured progression annotations between image pairs. Our four-stage construction of CXR-4 addresses the challenges of learning temporal alignment (Stage 1), modeling progression (Stage 2–3), and enabling region-level attention (Stage 4). We will revise Section 2 to make the motivation of each sub-dataset more explicit.

The inference process for temporal classification is indeed described in Sec. 4.2: it involves prompting the decoder with “[condition] is” and selecting the most likely progression label from the vocabulary. The model uses both the current and prior image as input to the regional cross-attention module during inference, consistent with training. We will make this clearer.
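For illustration, this label-scoring procedure can be sketched as follows; the label set, tokenizer, and model interface are generic placeholders rather than the exact CoCa-CXR API:

    import torch

    PROGRESSION_LABELS = ["improving", "stable", "worsening"]  # assumed label set

    @torch.no_grad()
    def classify_progression(model, tokenizer, curr_img, prior_img, condition):
        # Score each label as the continuation of the prompt "[condition] is"
        # and return the label with the highest summed token log-probability.
        prompt_ids = tokenizer.encode(f"{condition} is")
        best_label, best_score = None, float("-inf")
        for label in PROGRESSION_LABELS:
            label_ids = tokenizer.encode(f" {label}")
            ids = torch.tensor([prompt_ids + label_ids])
            logits = model(curr_img, prior_img, ids)   # (1, seq_len, vocab)
            log_probs = logits.log_softmax(dim=-1)
            # The token at position t is predicted from position t - 1.
            score = sum(
                log_probs[0, len(prompt_ids) + k - 1, tok].item()
                for k, tok in enumerate(label_ids)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label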

We appreciate the suggestion to reorganize the dataset section. However, we intentionally provide dataset curation details early, as it is foundational to both model design and training strategy. Presenting this first helps contextualize the technical contributions that follow.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


