Abstract
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this “silver-standard” dataset achieves performance comparable to encoders trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50 = 53.7% for image-image, Recall@100 = 52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SigmaLDC/More-performant-and-scalable.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3788_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/SigmaLDC/More-performant-and-scalable
Link to the Dataset(s)
CT-RATE dataset: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE
RAD-ChestCT dataset: https://zenodo.org/records/6406114
BibTex
@InProceedings{LiYin_More_MICCAI2025,
author = { Li, Yingtai and Lai, Haoran and Zhou, Xiaoqian and Ming, Shuai and Ma, Wenxin and Wei, Wei and Zhou, S. Kevin},
title = { { More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {350 -- 360}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a framework that uses LLMs to extract diagnostic labels from radiology reports, enabling large-scale supervised pre-training of vision encoders at minimal cost. The authors demonstrate that LLM-extracted labels are of high quality (AUC >96%), and that models trained on them perform comparably or better than those trained on BERT-extracted labels. Furthermore, the authors show that supervised pre-training improves vision-language alignment in a contrastive learning setting, achieving new state-of-the-art results in zero-shot classification and retrieval tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses the critical bottleneck of medical vision-language pre-training: the lack of high-quality labeled data. Using LLMs for automatic label extraction is relevant and effective, as shown by recent prior work.
- Results are reported across multiple benchmarks and LLMs, achieving state-of-the-art performance in zero-shot diagnosis and retrieval with a relatively lightweight architecture (3D ResNet-18) and minimal training data.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The LLMs used in this paper are not specifically trained on medical data. It is unclear whether the labels are extracted accurately and whether the agreement with BERT-extracted labels is merely coincidental. This should have been discussed more extensively in the paper, and it weakens the paper's foundation.
- From a methodological point of view, the contribution is a framework that already exists in the general domain. Since it does not introduce domain-specific adjustments, the technical contribution is limited.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Fig. 1: The spider chart and line plot axes are not readable. Consider changing the font from Comic Sans to a better (academic/professional) font.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper aims to solve an interesting domain-specific problem, but offers a generic solution that does not take the specifics of the domain into account. The application of generic LLMs to a highly specialized domain, potentially outside these models' training distribution, is not discussed. Other than that, the execution of the method and the presentation of the results are good.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The rebuttal does not sufficiently answer the questions posed in the reviews, and it introduces further statements and claims that are not substantiated by the paper or the existing literature. Unfortunately, the rebuttal is not well organized.
Review #2
- Please describe the contribution of the paper
This paper proposes a framework that leverages LLMs to extract diagnostic labels from radiology reports, enabling low-cost, large-scale supervised pre-training for medical image encoders. These LLM-derived labels are used to train a 3D ResNet-18 with binary cross-entropy (BCE), with enhancements such as label smoothing and auxiliary segmentation supervision. The paper shows that this supervised pre-training improves downstream vision-language alignment when followed by contrastive CLIP-style training, outperforming prior approaches in zero-shot classification and retrieval tasks, even with limited data.
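For illustration, a minimal sketch of the supervised objective this paragraph describes — multi-label BCE over LLM-extracted labels with label smoothing on a 3D ResNet-18 — under assumed settings (torchvision video backbone, 18 labels, smoothing of 0.1, CT volumes replicated to 3 channels) that are not taken from the paper:

```python
# Illustrative sketch only; backbone source, label count, and smoothing factor are assumptions.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_LABELS = 18   # one logit per extracted abnormality label (assumed)
SMOOTHING = 0.1   # assumed label-smoothing factor

backbone = r3d_18(weights=None)  # "from scratch"; Kinetics-400 weights for "video pre-trained"
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_LABELS)
bce = nn.BCEWithLogitsLoss()

def supervised_step(volume: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """volume: (B, 3, D, H, W) CT clip replicated to 3 channels (assumption);
    labels: (B, 18) binary LLM-extracted labels."""
    logits = backbone(volume)
    smoothed = labels * (1 - SMOOTHING) + 0.5 * SMOOTHING  # push targets away from {0, 1}
    return bce(logits, smoothed)
```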
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) Cost-effective dataset curation using LLMs. The authors can generate a “silver-standard” dataset at relatively low cost. 2) Separating vision representation learning from the actual vision-language alignment. This has been done before, but they show it here with a supervised backbone. 3) Strong performance even in few-shot regimes. Given 10% of the training data, their performance is still high even with a relatively small 3D ResNet-18. 4) Strong ablations. Table 3 is definitely a nice table to have, as it shows the improvement from each methodological addition.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Not in any specific order: 1) Superficial label quality evaluation. The label quality from LLMs was compared to BERT-extracted labels from CT-RATE, not to human-verified ground truth. 2) The authors' language is superfluous and distracts from the main technological innovation. The term “democratize” and several other qualifiers are unnecessary for this work. 3) A 3D ResNet-18 is relatively small compared to other 3D SSL frameworks. I wonder whether the results will generalize to parameter-scaled approaches. Perhaps it would be good to consider ViT approaches as well. 4) The main drawback of this work is that the authors do not benchmark against vision-only SSL methods like SimCLR or DINOv2. Such comparisons would affirm whether supervised strategies really benefit CLIP pre-training. There has been SSL work, even with DINOv2 [1], showing that vision-only SSL followed by CLIP yields marked improvements in performance. 5) Small batch size for CLIP training. A batch size of 10 was used, which is fairly small and probably led to unstable training dynamics in that data regime.
[1] Cijo Jose, Théo Moutakanni, et al. DINOv2 Meets Text: A Unified Framework for Image-and Pixel-Level Vision-Language Alignment. arXiv:2412.16334. 2024.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The technical novelty of the work is modest, the evaluation of label quality is indirect, and some modeling choices (e.g., batch size, architecture specificity) limit the strength of the conclusions. With additional validation and broader benchmarking, this work could be more promising.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors surprisingly addressed most of my concerns. Their use of LLM-extracted labels for training and the original labels for evaluation reassured me and shows the value of using an LLM to generate labels. Further, this work is timely: given the recent open-sourcing of several LLMs, it is valuable to discuss alternative training strategies that leverage them. Regarding my comment about batch size, the literature provided does show the advantages of supervised pre-training over vision-only SSL models in the natural image domain.
Personally, I still think the authors' language is too flowery and detracts from the work, but the main contributions are clear and would be valuable to the MICCAI community.
Review #3
- Please describe the contribution of the paper
The paper presents a scalable framework that leverages large language models (LLMs) to automatically extract diagnostic labels from radiology reports, which are then used to construct a multi-label “silver-standard” dataset for training a CLIP-style vision-language model. The primary contribution lies in demonstrating a cost-effective pre-training strategy that achieves competitive, and in some cases state-of-the-art, performance within a standard CLIP training pipeline, while requiring less annotated data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed pipeline is simple and demonstrates strong performance. The image encoder trained using diagnostic labels generated by LLMs outperforms standard labels across multiple evaluation metrics. The approach achieves new state-of-the-art results on three tasks on CT-RATE and RAD-ChestCT datasets.
- The paper is well-structured, and the methodology is clearly presented and easy to follow. Most figures and tables support the narrative. The experimental design is comprehensive, including scaling law analysis and ablation studies.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The current experimental setup is limited by its exclusive focus on two CT datasets, without assessing zero-shot or domain-adaptation performance on datasets from other widely used imaging modalities, such as MIMIC-CXR or CheXpert. Expanding the evaluation to include additional modalities is needed to enhance the credibility of the results.
- The paper does not provide sufficient detail regarding the prompt templates used to guide the LLM in extracting abnormality labels from radiology reports. The implementation of the auxiliary segmentation supervision is not adequately described.
- Figure 2(a) is not clear. The terms “from scratch” and “video pre-trained” are not clearly defined in the manuscript. Formula 3 is unclear: does z_i refer to the image features after global average pooling?
- The selection of LLMs might not be the best choice. LLMs such as GPT-4o, Claude 3, and LLaMA 3 should be studied.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The methodology demonstrates strong empirical performance. However, there is no test of generalization or zero-shot performance on other imaging modalities, and the paper suffers from insufficient discussion of the segmentation supervision, unclear labeling, and ambiguous notation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
They have addressed my primary concerns regarding the prompt templates and clarified several points of confusion in the original manuscript. The authors also acknowledged the limitations related to the absence of GPT-4 evaluation and explained their constraints in accessing the model. It remains strongly recommended to include GPT-4-based evaluation in future work to strengthen the credibility of results. In addition, questions raised by other reviewers, including those regarding label trustworthiness and the influence of batch size, were also addressed with adequate clarification.
Author Feedback
We thank all reviewers for their valuable feedback. Reviewers appreciate that our work tackles the annotation bottleneck in medical imaging by using LLMs to extract diagnostic labels, creating a low-cost “silver-standard” dataset that enables supervised pre-training and advances vision-language alignment. They note that, despite the simplicity and low cost of our pipeline, it achieves SOTA performance on two large-scale datasets. The extensive experimental design, including scaling-law analyses and detailed ablations, is recognized for clearly demonstrating the contribution of each component. Reviewers also remark that the manuscript is clear, well organized, and supported by effective figures and tables. Below we group the principal comments and provide concise clarifications.
1. Are LLM labels trustworthy? (R2, R3) CT-RATE's BERT labels serve as our noisy-but-human-audited reference. Though imperfect, this indirect comparison establishes a lower bound for label quality. Importantly, LLM labels are used only for training, with all evaluation using the original test labels, preventing any inflation of reported metrics. As shown in Table 2, even with potentially higher noise, LLM labels produce models with performance comparable to BERT labels, demonstrating their value.
2. Novelty & why the contribution is important (R2, R3) While the framework [26] exists in the general domain, its adoption in radiology has been blocked by the prohibitive cost of obtaining large-scale annotations and the lack of effective supervised pre-training, pushing the community toward sophisticated alignment methods over the past few years. We show for the first time that LLMs can bridge this gap by dramatically reducing labeling costs and enabling effective large-scale supervised pre-training, outperforming carefully designed pipelines with plain CLIP training. Additionally, to our knowledge we are the first to reveal the substantial performance gains obtained by removing L2-normalization during CLIP fine-tuning.
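As a minimal sketch (an illustration under our assumptions, not the authors' released code), the standard CLIP objective computes cosine similarity between L2-normalized image and text embeddings, while the variant mentioned above simply skips the normalization step:

```python
# Illustrative sketch only; embedding dimensions and temperature are placeholders.
import torch
import torch.nn.functional as F

def clip_logits(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                temperature: float = 0.07, l2_normalize: bool = True) -> torch.Tensor:
    """img_emb, txt_emb: (B, D) embeddings from the image and text towers."""
    if l2_normalize:  # vanilla CLIP: cosine similarity
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, l2_normalize: bool = True) -> torch.Tensor:
    """Symmetric InfoNCE loss; l2_normalize=False corresponds to the variant discussed above."""
    logits = clip_logits(img_emb, txt_emb, l2_normalize=l2_normalize)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```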
3. More modalities & choice of LLMs (R1) We agree that LLMs like GPT-4o should be studied. However, due to their restrictive policy, we were unable to access their API. Our method does not rely on CT-specific assumptions and is, in principle, modality-agnostic. We plan to experiment with more modalities and LLMs in the future.
4. Notation clarity, prompt template & auxiliary segmentation supervision (R1) “From scratch” refers to randomly initialized weights; “video pre-trained” indicates weights from the torchvision library pre-trained on Kinetics-400. “z_i” should be “v_i”; it refers to the image feature after global average pooling.
We are sorry for the unclear description of the prompt template due to space constraints. Our system prompt is: “You are a medical report analyzer. Your task is to classify chest CT reports into 18 specific categories. For each report, determine the presence(1)/absence(0) of each condition. Categories to classify: 1.Medical material … 18.Interlobular septal thickening Output format should be exactly 18 comma-separated binary values(0 or 1), one for each category in order listed above. Example output:0,1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0 No other text is allowed for the output.”
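A minimal sketch of how such a system prompt could drive label extraction; the OpenAI-compatible client, the model name, and the output parsing are illustrative assumptions rather than the authors' actual pipeline, and the 18-category list remains elided as in the quoted prompt:

```python
# Illustrative sketch only; client, model name, and parsing are assumptions.
from openai import OpenAI

# The full system prompt quoted in the rebuttal goes here (18-category list elided).
SYSTEM_PROMPT = (
    "You are a medical report analyzer. Your task is to classify chest CT reports into "
    "18 specific categories. ... Output format should be exactly 18 comma-separated "
    "binary values(0 or 1), one for each category in order listed above. "
    "No other text is allowed for the output."
)

client = OpenAI()  # assumes an API key is configured in the environment

def extract_labels(report_text: str, model: str = "gpt-4o-mini") -> list[int]:
    """Return 18 binary abnormality labels for a single radiology report."""
    response = client.chat.completions.create(
        model=model,  # hypothetical choice; the paper evaluated other LLMs
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
        temperature=0,
    )
    values = [int(v) for v in response.choices[0].message.content.strip().split(",")]
    if len(values) != 18:
        raise ValueError("Malformed LLM output; re-query or discard this report.")
    return values
```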
Auxiliary segmentation supervision is implemented as patch-level classification, as described in [16].
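A heavily hedged sketch of one common form of patch-level classification (per-location logits from the backbone feature map supervised by a downsampled mask); the head, the mask source, and the loss are our assumptions, and [16] should be consulted for the actual formulation:

```python
# Illustrative sketch only; not the formulation from [16].
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, num_classes, kernel_size=1)  # 1x1x1 conv per patch

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """feat: (B, C, d, h, w) backbone feature map; mask: (B, D, H, W) integer label map."""
        logits = self.proj(feat)
        # Downsample the mask to the feature-map resolution and classify each patch location.
        target = F.interpolate(mask[:, None].float(), size=feat.shape[2:], mode="nearest")
        return F.cross_entropy(logits, target.squeeze(1).long())
```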
5. Batch size, architecture choice & comparison to vision-only SSL (R3) Larger batch sizes up to 160 yield worse results. For datasets with <100k samples, large batch sizes reduce stochasticity, affecting generalization. The batch size of 10 aligns with comparable methods and is common in medical imaging. ViT backbones would yield similar results. We respectfully argue that such a comparison is not necessary: as discussed in [26], while vision-only SSL performs well broadly, supervised backbones excel at zero-shot classification.
We again thank reviewers for their insightful feedback and are confident that our clarified contributions will meaningfully benefit the MICCAI community.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Despite some methodological simplicity and limited novelty in framework design, this paper clearly contributes a valuable empirical insight: LLM-generated silver labels can be effectively used for medical vision-language pre-training, with significant cost savings and strong downstream performance. This is a practical and scalable strategy, especially in an era of increasing LLM accessibility. While Reviewer #2 remains unconvinced, the other two raised their scores. I believe the community can benefit from these findings.