Abstract
The rapid advancement of medical foundation models creates unprecedented demand for large-scale training data, yet existing medical repositories remain contaminated by heterogeneous mixtures of high- and low-quality image-text pairs: a severe data pollution problem that bottlenecks model performance and optimization. While manual curation could in principle ensure quality, it is impractical at the scale of modern datasets. To address this challenge, we introduce RefineNet, a scalable framework that systematically refines data quality by distilling multimodal large language model (MLLM) insights into an offline reward model.
RefineNet decouples human quality judgment into two key dimensions: image-text fidelity and semantic consistency. By strategically filtering and curating datasets along these dimensions, RefineNet demonstrates substantial performance improvements across diagnostic tasks. Specifically, the 50% high-quality subsets selected by our method outperform full-data baselines, yielding a 9.15% gain in Recall@10 (retrieval), 85.59 AUC (classification), and 72.59% accuracy (visual question answering). Moreover, RefineNet achieves notable agreement with human expert judgments (Pearson's r = 0.67), providing clinicians with an auditable bridge between automated curation and validation.
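For intuition, the curation step described above can be sketched as reward-score filtering. The following minimal Python sketch is illustrative only, not the authors' code: it assumes each image-text pair has already been scored by the offline reward model and simply keeps the top 50% by score.

import torch

@torch.no_grad()
def select_top_fraction(scores: torch.Tensor, fraction: float = 0.5) -> torch.Tensor:
    # Return indices of the highest-scoring `fraction` of image-text pairs.
    k = max(1, int(scores.numel() * fraction))
    return torch.topk(scores, k).indices

# scores[i] stands in for the reward model's quality score of pair i
# (random placeholder values, not real RefineNet outputs).
scores = torch.rand(1000)
keep_idx = select_top_fraction(scores)  # indices of the retained 50% subset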
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3325_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhaNin_RefineNet_MICCAI2025,
author = { Zhang, Ningyi and Gao, Yuan and Wang, Xin and Chan, Ka-Hou and Wu, Jian and Lam, Chan-Tong and Wang, Shanshan and Sun, Yue and Im, Sio-Kei and Tan, Tao},
title = { { RefineNet: Elevating Medical Foundation Models through Quality-Centric Data Curation by MLLM-Annotated Proxy Distillation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
pages = {502 -- 512}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper filters low-quality image/text pairs by distilling the preferences of SoTA vision-language models (e.g., Gemini-1.5). The authors first generate a proxy dataset from ~10k Gemini-annotated image/text pairs and train a reward model (RefineNet) with a margin-based ranking loss. CLIP trained on 50% of the RefineNet-filtered data gives significant performance gains over CLIP trained on 100% of the data. The authors also compare the proposed approach with baseline data-filtering methods (e.g., ClipScore) and show it is consistently better.
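As a rough illustration of the margin-based ranking loss described above, here is a minimal PyTorch sketch; the reward-head architecture, embedding dimension, and margin value are assumptions for illustration, not details taken from the paper.

import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # Hypothetical scoring head: maps a fused image-text embedding to a scalar.
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.mlp(fused).squeeze(-1)  # (batch,) quality scores

def margin_ranking_loss(chosen: torch.Tensor, rejected: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    # Push each chosen pair's score above its rejected counterpart by >= margin.
    return torch.clamp(margin - (chosen - rejected), min=0).mean()

# Toy batch: random embeddings stand in for MLLM-annotated chosen/rejected pairs.
head = RewardHead()
loss = margin_ranking_loss(head(torch.randn(8, 512)), head(torch.randn(8, 512)))
loss.backward()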
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Well-written: This paper is very well-written. The text is very clear and simple to understand.
- Interesting Research Direction: This paper focuses on a very interesting research direction, MLLM-aided data quality assessment and filtering. I believe this is an exciting direction, and this work is a step along it.
- Strong Empirical Results: The reported performance gain of the CLIP model trained on the quality-filtered dataset is substantial, although the gain is somewhat limited when compared with other data-filtering approaches (e.g., those based on ClipScore).
- Proper Evaluation of Reward Model: I appreciate that the authors evaluate the reward model using both static image/text metrics and alignment with human raters.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Additional evaluation with alternative SoTA MLLMs beyond Gemini would be useful to guide researchers in picking the right MLLM for data filtering.
- Limited Generalizability Evidence: This paper demonstrates strong results on the PMC-OA dataset, but the generalizability of RefineNet, and of the data-filtering approach more broadly, to other datasets is not extensively explored. Additional experiments on diverse datasets would improve the impact of the work.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper introduces RefineNet, a novel approach using MLLM preference distillation to filter low-quality image/text pairs, demonstrating improved CLIP performance with filtered data and strong alignment with human raters. The paper is well-written and addresses an interesting research direction, but it would benefit from evaluating alternative MLLMs and providing more generalizability evidence beyond the PMC-OA dataset. Therefore, I recommend a weak accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper contributes a novel offline quality-assessment strategy for medical datasets, improving dataset quality and thereby the performance of medical foundation models.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The work is novel. The authors propose an offline quality assessment designed specifically for medical data that matches human evaluation while balancing clinical validity, scalability, and cost efficiency, a combination that is hard to achieve in related work.
- The authors attend to data security. By using a proxy dataset and an offline model, the method preserves medical data privacy.
- The proposed work shows that a CLIP model trained on half the data, selected by their method, can outperform the same model trained on the full dataset. This reduces training time and improves efficiency.
- The authors conducted a strong evaluation. Beyond comparisons with other models, they also include human experts in the evaluation. This is very important because it shows how closely the model behaves like a human expert.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors should also evaluate their method on different datasets, such as: Ikezogwo, W., et al. (2023). Quilt-1M: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36, 37995-38017; and Subramanian, S., et al. (2020). MedICaT: A dataset of medical images, captions, and textual references. arXiv preprint arXiv:2010.06000.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The authors' proposed work is novel. It addresses the lack of data curation for medical repositories and also addresses data privacy, currently a main concern when using LLMs.
- The structure and clarity of the paper are good, with a strong evaluation procedure that shows the impact of the work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose RefineNet, a medical image-text quality assessment pipeline designed to retain high-quality data with strong clinical relevance. Medical vision-language models are trained on large-scale datasets that often contain samples with limited clinical utility, which can hinder performance due to the presence of low-quality data. To address this issue, the authors introduce a contrastive learning objective that enables a reward model to distinguish between informative and less reliable samples. Experimental results demonstrate that a CLIP model trained on RefineNet-filtered data achieves superior downstream performance across various alignment metrics, with the reward scores showing a strong correlation with human alignment ratings.
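To make the human-alignment claim concrete: the agreement between reward scores and expert ratings is a plain Pearson correlation, as in this small sketch (illustrative values only, not the study's data).

from scipy.stats import pearsonr

reward_scores = [0.92, 0.31, 0.74, 0.48, 0.85, 0.22]  # hypothetical model scores
expert_ratings = [5, 2, 4, 3, 5, 1]                   # hypothetical human ratings
r, p = pearsonr(reward_scores, expert_ratings)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")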
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This study presents a novel approach to reducing training costs for vision-language pre-training (VLP) on large-scale biomedical data while maintaining the clinical efficacy of the data. RefineNet enables effective quality assessment, which not only preserves high overall image and text quality but also enhances performance across various downstream applications, such as retrieval, visual question answering, and disease classification. Given that the proposed approach targets data filtering for VLP, it is further extendable to a broader range of medical imaging tasks. Moreover, the method could be used to assess the quality of large-scale medical image-text datasets, including generated data, although this is not explored in this study.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Reproducibility concerns: This study distills ratings from a closed-source MLLM, Gemini, into a reward model, which may limit reproducibility and, given privacy concerns, hinder real-world use. In contrast, CLIPScore and BLIPScore are fully automated and do not rely on closed-source models (a minimal CLIPScore sketch follows this list). The observed performance gain may partly result from access to a proprietary MLLM, and it remains unclear how the method would perform with underperforming, open-source alternatives. Releasing the code or dataset and exploring distillation from open models would improve accessibility and broaden the method's impact.
- Detailed information for proxy data collection: Although Section 3.1 provides some explanation regarding proxy data curation, offering a more detailed account of each step would help readers better follow the overall process. For instance, it would be helpful to clarify the motivation of corruption, how it works (e.g. rule-based?), the role of stratified levels in the following process, how human experts are involved in rating, and the criteria applied to determine whether samples are labeled as chosen or rejected.
- Interpretability of analysis figure: The intention behind the analysis figure (Fig. 3) is mixed and unclear. A clearer organization of subfigures and a more detailed caption would convey its implications more effectively. In particular, the distribution under the "User Message" section in Fig. 3(b) is difficult to interpret due to the small font size. As a potential future direction, incorporating cross-modal attention into RefineNet could strengthen interpretability; for example, it may help visualize how the scores in Fig. 4 are derived by providing a more intuitive alignment between image regions and text.
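As referenced in the first weakness above, the CLIPScore baseline is fully reproducible from open components. A minimal sketch following Hessel et al. (2021), assuming the public OpenAI CLIP checkpoint as the backbone:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    # CLIPScore = w * max(cos(image_embedding, text_embedding), 0)
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return w * max(cos, 0.0)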
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This study presents a novel approach to reducing training costs for vision-language pre-training on large-scale biomedical data. RefineNet enables effective quality assessment, which not only retains overall image and text quality but also enhances performance across various downstream applications. The method is straightforward and broadly extendable to a wide range of medical imaging and quality assessment tasks.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A