Abstract
Medical vision-language pre-training shows great potential for learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, lesions, which contain intricate structures, are spatially sparse. Moreover, the complex and implicit relationships between the pathological descriptions in each sentence of a report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework for chest CT, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment pre-trains the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multi-modal information across the instance level and the word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance on multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-report dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Code and models are available at https://github.com/ToniChopp/SimCroP.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0629_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ToniChopp/SimCroP
Link to the Dataset(s)
CT-Rate dataset: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE
RadChestCT dataset: https://cvit.duke.edu/resource/rad-chestct-dataset
CC-CCII dataset: https://www.kaggle.com/datasets/fakaframe082/cc-ccii
LUNA16 dataset: https://luna16.grand-challenge.org/Data/
BibTex
@InProceedings{WanRon_SimCroP_MICCAI2025,
author = { Wang, Rongsheng and Tang, Fenghe and Yao, Qingsong and Yan, Rui and Zhang, Xu and Huang, Zhen and Lai, Haoran and He, Zhiyang and Tao, Xiaodong and Jiang, Zihang and Zhou, S. Kevin},
title = { { SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {565--575}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of the paper is the development of SimCroP (Similarity-driven Cross-granularity Pre-training), a novel medical vision-language pre-training framework specifically designed for radiograph representation learning in 3D chest CTs.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Similarity-Driven Alignment (SA): The paper introduces a novel similarity-driven alignment module that adaptively associates descriptive sentences in radiology reports with the most semantically relevant sub-regions (patches) in 3D CT volumes. Unlike prior contrastive learning approaches that rely on global image-text alignment, this fine-grained alignment operates at the sentence-patch level without requiring explicit spatial annotations. This is particularly interesting because it mirrors the clinical reasoning process, where radiologists describe local findings in specific anatomical contexts.
Cross-Granularity Fusion Module: The proposed cross-granularity fusion mechanism innovatively combines instance-level visual representations with word-patch-level cross-modal features. This dual-level integration allows the model to capture both global context and localized pathology information, which is critical for effective interpretation of complex and spatially sparse radiographs such as chest CTs. This design supports downstream tasks like segmentation and classification with improved granularity and precision.
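As one hypothetical reading of this dual-level design (the function names, attention form, and concatenation choice are assumptions for illustration, not the paper's implementation), the fusion of instance-level and word-patch-level features could be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_granularity_fusion(instance_feat, word_embs, patch_embs):
    """Illustrative sketch: word-patch-level features come from
    cross-attention of report words over CT patches, and are combined
    with the instance-level (global) visual feature.

    instance_feat: (d,) global visual feature of the CT volume
    word_embs:     (n_words, d) word embeddings from the report
    patch_embs:    (n_patches, d) patch embeddings from the volume
    returns:       (n_words, 2*d) fused per-word representation
    """
    d = patch_embs.shape[1]
    # word -> patch attention weights, scaled dot-product style
    attn = softmax(word_embs @ patch_embs.T / np.sqrt(d), axis=1)
    word_patch = attn @ patch_embs                           # (n_words, d)
    inst = np.broadcast_to(instance_feat, word_patch.shape)  # repeat global feature
    return np.concatenate([inst, word_patch], axis=1)        # dual-level fusion
```

Concatenation is the simplest fusion choice; the actual module may use gating or additional transformer layers.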
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Insufficient Insight into Text Decoder Outputs at Inference Time: While the paper introduces a text decoder to reconstruct masked tokens as part of the multi-modal pretraining objective, it lacks qualitative or quantitative analysis of the decoder’s outputs during inference. Providing example reconstructions or illustrating how well the decoder captures clinically relevant concepts would offer better intuition on the model’s language understanding capabilities and the efficacy of the cross-modal learning.
Limited Analysis of Failure Cases or Limitations: The paper does not explore or analyze situations where the model fails or underperforms.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Besides Fig. 4, please provide more insights into the Text Decoder.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper proposes SimCroP, a new pretraining framework that enhances CT radiograph representation learning by aligning descriptive sentences from radiology reports with corresponding image sub-regions. It addresses challenges in medical self-supervised learning—sparse lesion distribution and hierarchical report structure—through three objectives: masked image modeling, similarity-driven sentence–subregion alignment without manual annotations, and cross-granularity masked report modeling that combines instance-level and word-patch-level features for improved report reconstruction.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript is well written, with a clear and logical presentation of the model architecture, training objectives, and experimental results.
- The proposed sentence–subregion alignment module is novel, by selecting the sentence level as the granularity for supervision, the method closely mimics clinical reporting practices, which strengthens its interpretability and practical relevance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The loss term L_MLM introduced in Equation (6) is not defined in the manuscript.
- The term “cross-granularity fusion” is misleading, as the method only combines instance-level and word-patch level features without actual interaction or cross-level reasoning. A more appropriate term would be “multi-granularity fusion.”
- The use of a fixed hyperparameter K in selecting top-matching regions poses limitations, especially given the variability in ROI size and distribution. While an ablation study is provided, it does not fully address concerns about the adaptability of the method to diverse clinical scenarios, such as large nodules versus small calcifications.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces a novel and clinically inspired method with clear writing and strong empirical results. While there are some issues, such as missing definitions and terminology, the overall contribution is good and shows potential for impact in medical image pretraining.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed the concerns on L_MLM loss definition and terminology. However, the response on Top-K adaptability, though supported by lesion distribution data, lacks experimental validation across lesion types. The intent to explore adaptive strategies is noted but insufficient. I recommend acceptance with minor revision, with further validation on lesion-specific performance.
Review #3
- Please describe the contribution of the paper
The paper presents SimCroP (Similarity-Driven Cross-Granularity Pre-training), a framework for medical vision-language pre-training specifically designed for 3D chest CT scans.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper introduces a cross-granularity fusion module that integrates multi-modal information across instance level and word-patch level, helping the model better capture key pathology structures in sparse radiographs.
- The similarity-driven alignment module pre-trains the encoder to adaptively select and align the correct patches corresponding to each sentence in reports without requiring explicit spatial annotations.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- As acknowledged by the authors in the conclusion, “the absence of instance-level cross-modal alignment hinders the zero-shot performance of our approach.” This limits the model’s ability to generalize to completely unseen scenarios.
- I would like to know whether the proposed method can be applied to 2D medical images such as X-rays and fundus images. Li et al. (2024), “VisionUnite: A vision-language foundation model for ophthalmology enhanced with clinical knowledge,” also uses a multi-granularity approach for medical image understanding, though in a different domain.
- Since the authors use masked image modeling, I want to know how the case is handled where a masked image region corresponds to a part of the report. Did the authors consider this problem?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a well-executed approach to medical vision-language pre-training that addresses important challenges in medical image analysis.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The paper is good. But the rebuttal did not address my previous concerns. My final score is borderline accept, and I will consider the opinions of other reviewers.
Author Feedback
We appreciate the reviewers’ comments and insightful suggestions, especially some thought-provoking proposals (R1, R4). We thank the reviewers for acknowledging the interest (R1, R2), novelty (R1, R4), and the organization and impressive experimental results (R1, R2, R4) of our paper. Our responses are as follows:
LOSS DEFINITION (R1 Q1): The loss term L_MLM refers to standard masked language modeling loss introduced in BERT, computed using cross-entropy loss over the masked tokens. We will revise the manuscript to explicitly define the loss term for clarity and completeness.
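As a minimal illustration of this loss (a sketch, not the authors' code), cross-entropy restricted to masked token positions can be written in NumPy as follows; the function name and the `-100` ignore-index convention follow common BERT implementations:

```python
import numpy as np

def mlm_loss(logits, labels, ignore_index=-100):
    """BERT-style masked language modeling loss: mean cross-entropy
    over masked positions only (positions where labels == ignore_index
    are skipped).

    logits: (seq_len, vocab_size) unnormalized decoder scores
    labels: (seq_len,) original token ids at masked positions,
            ignore_index everywhere else
    """
    mask = labels != ignore_index
    if not mask.any():
        return 0.0
    x = logits[mask]
    # log-softmax with max-subtraction for numerical stability
    x = x - x.max(axis=1, keepdims=True)
    log_probs = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(mask.sum()), labels[mask]].mean())
```

In a framework like PyTorch this reduces to `F.cross_entropy(logits, labels, ignore_index=-100)`.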
TERM MISLEADING (R1 Q2): We appreciate this insightful comment and will revise the term as requested.
TOP-K ADAPTABILITY (R1 Q3): We acknowledge the reviewer’s concern regarding the use of a fixed Top-K for selecting top-matching patches, especially considering the variability in lesion sizes and distributions. During the experiment stage of our approach, we had 5 medical students annotate ten distinct lesion types for 2,000 volumes. With a patch size of 16×16×8, the lesion counts and average numbers of affected patches are:
Arterial wall calcification: 1131 cases, 8.55 patches
Bronchiectasis: 178 cases, 9.56 patches
Lymphadenopathy: 357 cases, 1.86 patches
Pericardial effusion: 163 cases, 19.52 patches
Atelectasis: 165 cases, 19.21 patches
Consolidation: 228 cases, 14.74 patches
Coronary artery wall calcification: 895 cases, 3.97 patches
Lung opacity: 897 cases, 44.10 patches
Pleural effusion: 8 cases, 74.69 patches
Hiatal hernia: 84 cases, 6.63 patches
Our method employs Top-K = 64, which has proven sufficient to cover the majority of lesion-affected regions, enabling effective similarity-driven alignment between patches and textual descriptions. Despite the fixed K, this design balances computational efficiency and clinical adaptability, allowing the model to robustly accommodate a wide range of lesion sizes and appearances across diverse clinical scenarios. We will explore adaptive or attention-based alternatives to better capture fine-grained features, which we believe would further improve performance on these challenging tasks.
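The Top-K selection described above can be sketched as follows; the function name, the use of cosine similarity, and the mean pooling of selected patches are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def topk_patch_alignment(sentence_emb, patch_embs, k=64):
    """For one report sentence, pick the k patches whose embeddings are
    most similar (cosine) to the sentence embedding; the sentence can
    then be aligned against a pooled feature of those patches.

    sentence_emb: (d,) sentence embedding from the text encoder
    patch_embs:   (n_patches, d) patch embeddings from the image encoder
    returns:      (indices of the top-k patches, pooled patch feature)
    """
    s = sentence_emb / np.linalg.norm(sentence_emb)
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    sims = p @ s                           # cosine similarity per patch
    k = min(k, len(sims))
    top = np.argsort(sims)[::-1][:k]       # indices of the k best-matching patches
    pooled = patch_embs[top].mean(axis=0)  # simple mean pooling of selected patches
    return top, pooled
```

With K = 64 and the lesion statistics above, the selected set comfortably covers most lesion footprints (e.g. 44.10 patches on average for lung opacity), though very large findings such as pleural effusion (74.69 patches) can exceed it.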
ZERO SHOT LIMITATION (R2 Q1): We will further enhance zero-shot transferability of our approach to unseen categories or conditions.
TRANSFERABLE APPLICATION (R2 Q2): We believe our method can show adaptability, which opens promising directions for cross-domain transfer in Med-VLP.
MASKED IMAGE-TEXT ALIGNMENT (R2 Q3): While masking may hide regions mentioned in the report, our method leverages unmasked patches for cross-modal alignment. Though we don’t explicitly ensure alignment for masked lesions, our multi-granularity strategy helps maintain semantic consistency. We believe report-guided masking strategies can further enhance explainability of our approach.
INSIGHTS OF TEXT DECODER (R4 Q1): We thank the reviewer for highlighting the need for further analysis of the text decoder’s outputs. While our approach achieves significantly higher MLM accuracy than CXRBERT (SimCroP: 83.08%, CXRBERT: 55.38%) on the validation set of CT-RATE, our primary objective is not to attain SOTA cloze performance. Rather, we leverage the MLM task as a means of injecting rich text supervision to guide the pre-training of the image encoder and enhance multi-modal alignment. We manually examined reconstructed outputs and observed that disease-related terms are accurately predicted, indicating that the decoder captures clinically relevant concepts.
MORE ANALYSIS OF FAILURE CASES (R4 Q2): As shown in Tables 1, 2, and Fig. 2, our method underperforms on fine-grained tasks, including lung nodule and pericardial effusion classification, and lung segmentation, where M3AE achieves comparable or slightly better performance. These tasks require precise localization and subtle feature discrimination. While our model leverages both instance-level and word-patch-level supervision, the use of a fixed Top-K patch selection limits its adaptability to spatially localized regions—particularly critical in detecting small nodules or subtle effusions.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers suggest accept, and the 3rd reviewer did not respond in the rebuttal stage. I checked the rebuttal and think the authors addressed the questions from the 3rd reviewer. The idea of similarity-driven alignment and the cross-granularity fusion module makes sense to me, and the experimental results are quite promising. As a result, I suggest accepting this paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A