Abstract

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2328_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/vectorInstitute/pmc-data-extraction/

Link to the Dataset(s)

Dataset: https://huggingface.co/datasets/vector-institute/open-pmc

Model: https://huggingface.co/vector-institute/open-pmc-clip

BibTex

@InProceedings{BagNeg_Advancing_MICCAI2025,
        author = { Baghbanzadeh, Negin and Fallahpour, Adibvafa and Parhizkar, Yasaman and Ogidi, Franklin and Roy, Shuvendu and Ashkezari, Sajad and Khazaie, Vahid Reza and Colacci, Michael and Etemad, Ali and Afkanpour, Arash and Dolatabadi, Elham},
        title = { { Advancing Medical Representation Learning Through High-Quality Data } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        page = {23 -- 33}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of the paper is the introduction of Open-PMC, a curated biomedical image-text dataset that extracts and cleans subfigures from scientific articles and enriches their captions using both original descriptions and in-text references. The paper shows that this careful curation improves representation learning, leading to better performance on downstream tasks compared to existing large-scale datasets like BIOMEDICA.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has several strengths: 1) it introduces a novel use of subfigures, extracting them from full figures to create cleaner and more focused image-text pairs, which enhances modality-specific learning; 2) it enriches captions with in-text references, adding contextual information often overlooked in prior datasets, leading to better semantic alignment; 3) it presents strong evaluation and ablation studies that convincingly demonstrate the benefits of these curation strategies, showing improved performance on multiple downstream tasks compared to existing datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major weaknesses of the paper include: 1) the dataset itself is not entirely novel, as it is derived from the same PMC-OA source used by previous works like BIOMEDICA; 2) the motivation for proposing a new dataset is not clearly articulated, making it unclear why this version adds enough value to warrant a separate release.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend accepting the paper due to its innovative dataset curation and strong experimental evaluation. The use of subfigures and enriched captions improves data quality, which enhances performance on downstream tasks. The well-conducted experiments clearly show the value of these improvements, making the dataset potentially impactful for the biomedical AI community. Although the motivation could be clearer, the novel approach and solid results justify acceptance.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The contribution of the paper is the curation of Open-PMC, a high-quality dataset of 2.2 million medical image-text pairs extracted from PubMed Central articles. It decomposes compound images into subfigures, pairs each subfigure with its caption, and adds in-text references to enrich the descriptions. Experimental studies show that a high-quality dataset can improve representation learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Improved dataset quality compared with existing public datasets.
    2. Extensive experiments to test the dataset quality and benchmarking against much larger datasets. Ablation studies demonstrate the value of image decomposition and contextualized captions.
    3. Open access and reproducibility.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The image decomposition model is trained specifically for radiology, limiting its applicability to broader data modalities.
    2. Some clarifications are needed to help the reader fully understand the details.
       (1) In 3.2 Image Modality Assignment, why are reviewers needed to assess images? The data are crawled from PubMed articles, and articles usually describe the modality, so there should be ground-truth labels for these images at curation time.
       (2) In 4.1 Pretraining, why are the "training durations for other datasets adjusted to ensure all models train on the same total number of examples"? This doesn't make sense; please explain.
       (3) In Table 2, what do the numbers in the second column (Text-to-Image) and third column (Image-to-Text) represent? There is no explanation in the table caption or table header.
       (4) In 4.2, how is the confidence range computed?
       (5) In Table 4, why does the Quilt dataset's performance decline with in-text references? And again, please explain in the caption or header what the data in the columns represent.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces Open-PMC, a curated dataset of 2.2 million medical image-text pairs, and shows that careful attention to data quality can improve the performance of medical vision-language models, even when using fewer data points than larger datasets. The experimental results demonstrate that models trained on Open-PMC perform competitively across retrieval and zero-shot classification tasks. The dataset and code release adds value for the research community.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper focuses on data curation quality in medical multi-modal deep learning model training.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The study shows that although representation learning on medical images has been challenging due to limited data quantity compared to natural images, data curation quality alone can improve model performance. This is evident from experimental results across various modalities. Additionally, reproducibility is supported by the public release of the trained models and code along with Open-PMC.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although it is mentioned as a limitation, data curation and processing are optimized for the radiology modality, so the dataset shows somewhat weaker performance on other modalities. However, it achieves performance comparable to large-scale datasets and outperforms PMC-OA, which is of similar size. This is itself evidence that data quality affects representation learning.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is a well-validated work showing that data curation quality plays a critical role in representation learning.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

N/A




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


