Abstract

Vision-language models have demonstrated remarkable success in general medical image analysis, yet their application in pediatric imaging remains significantly underexplored. These models show limited performance on pediatric datasets, primarily due to domain gaps stemming from anatomical differences, lower radiation doses, and pediatric-specific diseases. To this end, we present the first pediatric vision-language pre-training framework, dubbed PedCLIP, trained on a comprehensive pediatric imaging dataset comprising 404,670 X-rays of pediatric patients across diverse anatomical regions. To address anatomical diversity, we introduce a Mixture of Body part Experts design, with each expert specializing in learning features from distinct anatomical regions. Experimental evaluation across eleven downstream tasks demonstrates that our model significantly outperforms current state-of-the-art vision-language models, achieving superior diagnostic accuracy in challenging pediatric conditions, including rare diseases such as pediatric inflammatory arthritis. Code is available: https://github.com/tadeephuy/PedCLIP
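
A minimal sketch of the Mixture of Body part Experts idea described above, assuming a ViT-style encoder in which each expert is an MLP and a linear gate applied to the global image token produces soft mixing weights; module and variable names are illustrative, and the wiring follows the authors' description in the rebuttal further down this page rather than released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoBEBlock(nn.Module):
        """Illustrative Mixture of Body part Experts block (a sketch, not the official code).

        One MLP expert per anatomical region; a linear gate on the global image token
        yields soft routing weights, so rare regions can still borrow features from
        related experts."""

        def __init__(self, dim=768, num_experts=23, hidden=3072):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            )
            self.gate = nn.Linear(dim, num_experts)  # e.g. a 768x23 linear gate

        def forward(self, tokens, bodypart_labels=None):
            # tokens: (B, N, dim); token 0 is assumed to be the global [IMG] token.
            gate_logits = self.gate(tokens[:, 0])            # route on the global token only
            weights = gate_logits.softmax(dim=-1)            # soft routing over experts
            expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (B, E, N, dim)
            mixed = (weights[:, :, None, None] * expert_out).sum(dim=1)         # (B, N, dim)
            # Optional auxiliary supervision of the gate with body-part labels (cross-entropy).
            aux_loss = (F.cross_entropy(gate_logits, bodypart_labels)
                        if bodypart_labels is not None else tokens.new_zeros(()))
            return mixed, aux_loss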

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1342_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/tadeephuy/PedCLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuyTa_PedCLIP_MICCAI2025,
        author = { Huy, Ta Duc and Shoby, Abin and Tran, Sen and Xie, Yutong and Chen, Qi and Nguyen, Phi Le and Gole, Akshay and Liu, Lingqiao and Perperidis, Antonios and Friswell, Mark and Linke, Rebecca and Glynn, Andrea and To, Minh-Son and Hengel, Anton van den and Verjans, Johan and Liao, Zhibin and Phan, Minh Hieu},
        title = { { PedCLIP: A Vision-Language model for Pediatric X-rays with Mixture of Body part Experts } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {489--499}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the limitations of general-domain pretrained models in medical imaging and the domain gap between different anatomical regions in X-ray analysis. To improve vision-language pretraining in this context, the authors propose integrating a Mixture of Body Part Experts (MoBE) block into a transformer-based visual encoder.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper provides a thoughtful analysis of the limitations of existing vision-language pretraining approaches when applied to pediatric X-ray imaging, highlighting the domain-specific challenges posed by anatomical variability.
    2. Experimental results demonstrate that the proposed MoBE-enhanced model achieves superior performance compared to general medical vision-language models across multiple downstream tasks, supporting the effectiveness of the proposed architecture.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited novelty. The core idea of using a Mixture of Experts has already been well-established in machine learning. For instance: Eigen, David, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv:1312.4314 (2013). Moreover, incorporating MoE modules in place of feed-forward networks within transformer layers has been widely explored. Relevant works should be discussed in the Introduction, such as: [1] Fedus, William, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR 23.120 (2022): 1–39. [2] Lin, Bin, et al. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. CoRR (2024).

    2. Ambiguity in Method Description. The term “anatomy” in the triplet (XR,text,anatomy) is ambiguous. Does it refer to specific anatomical regions (e.g., hip, knee)? This should be explicitly defined. Furthermore, it is unclear whether the BERT encoder and the image encoder are pretrained, and if so, what datasets were used for their respective pretraining.

    3. Unclear Evaluation Design in Table 1. It is not specified whether the “original adult-pretrained” vision-language models (VLMs) share the same pretraining dataset. Additionally, the performance variations of VLMs pretrained on the pediatric-specific PeXR dataset across different evaluation sets are not well explained. For example, such models show lower zero-shot classification accuracy on the CHD dataset compared to adult-pretrained models, yet higher accuracy on the GRAZ dataset.
    4. Inconsistency Between Fine-Tuning and Pretraining Descriptions (Table 2). Table 2 is said to report performance after “fine-tuning”, but Section 3.1 describes the models as “pretrained on our large-scale Pediatric PeXR dataset.” The paper should clarify whether the reported results are based on fine-tuning after pretraining and, if so, how the fine-tuning was conducted.

    5. Figure Caption Inconsistency (Figure 5). The caption for Figure 5 refers to a setting labeled “MLP-multi,” but this configuration does not appear in the figure’s histograms. This inconsistency should be corrected or clarified.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the topic is relevant and the proposed solution is empirically evaluated, the originality of the method is limited and several aspects of the methodology and experiments require clarification. Therefore, I recommend a weak reject, although a strong rebuttal may justify reconsideration.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    This paper proposes a task-specific approach that integrates a Mixture-of-Experts (MoE) mechanism into vision-language model (VLM) pretraining, tailored to pediatric X-ray imaging. While MoE is presented as one of the main contributions, the novelty of using soft routing over existing methods such as top-1 routing is neither sufficiently justified nor empirically validated. In particular, the use of a cross-entropy classification loss on the gating weights inherently promotes one-hot expert selection, which contradicts the intended benefit of soft routing. Moreover, the lack of comparison against state-of-the-art MoE-based architectures limits the strength of the empirical claims. Additionally, the reliance on the pretraining dataset, PeXR, which is not publicly available, harms the reproducibility and transparency of the proposed method.

    With a more rigorous evaluation, clearer justification of methodological choices, and use of publicly accessible datasets, the work could potentially contribute to the field in the future.



Review #2

  • Please describe the contribution of the paper

    This paper introduces PedCLIP, the first vision-language model (VLM) pre-trained specifically on a large-scale pediatric X-ray dataset (PeXR) comprising over 400K X-rays spanning 23 anatomical regions. To address the challenge of anatomical heterogeneity, the authors propose a novel Mixture of Body Experts (MoBE) architecture, which integrates expert-specialized (body parts) MLPs within the transformer layers, gated by anatomical context. The model is trained using a multi-modal contrastive loss and evaluated across 11 downstream pediatric imaging tasks, including classification, regression, segmentation, and concept alignment. PedCLIP outperforms existing adult and generalist medical VLMs in both zero-shot and fine-tuned settings.
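
    For reference, a generic CLIP-style symmetric image-text contrastive objective of the kind mentioned here; this is only an assumed formulation, since the paper's exact multi-modal loss is not reproduced on this page.

        import torch
        import torch.nn.functional as F

        def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
            """Symmetric image-text InfoNCE loss (a common formulation, assumed here).

            img_emb, txt_emb: (B, D) paired embeddings from the image and text encoders."""
            img_emb = F.normalize(img_emb, dim=-1)
            txt_emb = F.normalize(txt_emb, dim=-1)
            logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
            targets = torch.arange(img_emb.size(0), device=img_emb.device)
            loss_i2t = F.cross_entropy(logits, targets)             # image -> matching text
            loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> matching image
            return 0.5 * (loss_i2t + loss_t2i)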

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Relevance: The development of pediatric-specific foundational models addresses a gap in current medical AI literature, where most VLMs are trained on adult chest X-rays. 2) Architecture – MoBE: The Mixture of Body Experts (MoBE) module is a well-motivated and effective solution to the problem of multi-body-part interference. 3) Large-Scale Pediatric Dataset (PeXR): The authors curate and utilize a large and diverse pediatric dataset covering 23 anatomical regions, enabling generalization across tasks. 4) Solid Visualizations: The paper’s figures, tables, and graphs demonstrate significant attention to detail. 5) Strong Empirical Results: PedCLIP demonstrates consistent improvements across zero-shot and fine-tuned tasks, outperforming strong baselines. The paper also provides strong and insightful linear probing results.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Ambiguity in Metric Interpretation: While accuracy is reported, it’s unclear how models were prompted (e.g., examples would be beneficial) and whether variations in output format (e.g., verbose vs. short responses) influenced correctness scoring. 2) Reproducibility: There is no mention of which vision transformer is used or its hyperparameters. There is also no mention of other architectural/parameter-count differences between baselines apart from using MoBE. This is important for understanding where else performance discrepancies come from. 3) PedCLIP Generalization: It would be illuminating to see how PedCLIP pre-trained using adult XR data performs in zero-shot and supervised settings. Some methods do not benefit much from pre-training on pediatric data (GLoRIA-zeroshot), while others benefit significantly (MGCA-zeroshot). This would provide a more general and impactful argument for the efficacy of the MoBE strategy.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave this paper a score of 4 (Accept) due to its relevant contribution to an underexplored but important area of medical AI—pediatric imaging. The design decisions of PedCLIP are well motivated and demonstrate strong empirical results across zero-shot and fine-tuned tasks. The paper also has excellent visualizations. However, the score is moderated by a few important concerns. Specifically, ambiguities in reproducibility (e.g., lack of architectural details and prompting strategies) and missing ablations on cross-domain pretraining (e.g., PedCLIP trained on adult vs. pediatric data). Addressing these concerns would improve both clarity and generalizability of the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Given the authors’ responses clarifying the backbone, parameter count, and prompt evaluation, I would move the rating up.



Review #3

  • Please describe the contribution of the paper

    This paper introduces PedCLIP, the first foundational vision-language model (VLM) specifically designed for pediatric medical imaging. To address the domain gap between adult-trained models and pediatric data, the authors curate PeXR, a large-scale dataset of over 400,000 pediatric X-rays covering 23 anatomical regions. To handle the anatomical diversity, they propose a novel Mixture of Body Experts (MoBE) architecture, where each expert specializes in features from a specific body part, and a gating mechanism dynamically selects the appropriate expert per image. They also leverage LLaMA-3.1 to extract body-part-specific report text, ensuring accurate vision-language alignment. PedCLIP significantly outperforms existing generalist and adult-trained VLMs across multiple pediatric tasks, including classification, regression, and segmentation, and shows strong alignment with clinically meaningful concepts.
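
    The contribution above mentions using LLaMA-3.1 to isolate body-part-specific report text; a hypothetical sketch of such an extraction step is shown below, where the prompt wording and the generate helper are assumptions rather than the authors' actual pipeline.

        def build_extraction_prompt(report_text, body_part):
            """Assemble a hypothetical instruction for an LLM (e.g., LLaMA-3.1) that keeps
            only the sentences of a radiology report describing one anatomical region."""
            return (
                "You are given a pediatric radiology report. "
                f"Return only the findings that describe the {body_part}, verbatim, "
                "and omit everything else.\n\n"
                f"Report:\n{report_text}\n\nFindings for {body_part}:"
            )

        # Usage (assuming generate() wraps whichever LLaMA-3.1 inference API is available):
        # wrist_text = generate(build_extraction_prompt(report, "wrist"))
        # The extracted text is then paired with the corresponding wrist X-ray for pre-training.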

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Pediatric-specific model design: The development of PedCLIP fills a significant gap by creating the first vision-language model tailored specifically for pediatric imaging, addressing anatomical and procedural differences from adult data. (2) Large and diverse dataset (PeXR): The paper introduces a comprehensive pediatric dataset of over 400,000 X-rays covering 23 anatomical regions, enhancing generalizability across multiple body parts and clinical tasks. (3) Mixture of Body Experts (MoBE) architecture: The MoBE module effectively mitigates feature interference in multi-anatomy training by allowing each expert to specialize in a specific body part, leading to improved learning efficiency and performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (1) Unclear novelty of MoBE vs. related medical MoE mechanisms: The paper uses a gating mechanism to assign inputs to body-part-specific experts, but does not clearly distinguish this design from existing medical MoE approaches (e.g., M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts). The lack of discussion on architectural similarities or differences limits the clarity and perceived novelty of the proposed method. (2) No comparison with existing MoE architectures: The paper introduces MoBE but does not compare it with existing mixture-of-experts (MoE) methods and medical MoE designs (e.g., DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets; Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models). This makes it difficult to assess whether MoBE brings genuine architectural improvements or just benefits from the pediatric-specific setup. (3) Lack of justification and efficiency analysis for gating design:

    • The gating network is trained using CE loss on body-part labels, which pushes the output toward a near one-hot distribution. This raises the question: what is the actual benefit of using a learned gating mechanism over directly assigning experts by body-part labels? And the paper lacks an ablation without gating and fixed expert selection using body-part labels.
    • The gating mechanism introduces extra parameters and computational overhead, but the authors do not report its impact on model size or inference speed.

    (4) Unclear dataset accessibility and imbalance in PeXR: The proposed PeXR dataset is not explicitly stated as public or private, and the paper does not provide the sample distribution across the 23 anatomical regions. If the dataset follows a long-tail distribution, some body-part experts in the MoBE module may receive insufficient training, leading to performance imbalance across regions.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a valuable contribution by introducing a pediatric-specific vision-language model and a large-scale multi-anatomy dataset, I recommend weak accept due to several concerns regarding the novelty and evaluation of the proposed Mixture of Body Experts (MoBE) architecture. First, the paper does not clearly distinguish MoBE from existing medical MoE approaches (e.g., M4oE), limiting the perceived novelty. Second, it lacks direct comparisons with prior MoE methods such as DAMEX and Med-MoE, making it unclear whether the observed gains are due to architectural improvements or simply the pediatric-specific data setup. Third, the gating mechanism is trained with cross-entropy loss on body-part labels, pushing the output toward one-hot distributions—yet the paper does not justify the advantage of this learned gating over direct expert assignment, nor does it provide efficiency comparisons or ablations without gating. These omissions weaken the methodological rigor, though the overall contribution remains meaningful.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The work addresses an important gap in pediatric imaging and proposes the MoBE architecture to handle multi-anatomy learning. While the idea is meaningful, especially for pediatric data, the novelty and effectiveness of MoBE are not well-validated due to missing comparisons and ablations. Despite these issues, the direction is valuable. I recommend an accept.




Author Feedback

We’re glad reviewers recognized the clinical value of PedCLIP (R3), its well-motivated MoBE (R1, R2, R3), thoughtful analysis (R2), strong results (R1, R2), clear visuals (R1), and comprehensive benchmarks on diverse pediatric datasets (R1, R3).

R1.1: Prompt format. We use CheXzero-style paired positive/negative (±) prompts of the form “XR shows (no) …”. The AUC is computed from the softmax over the (±) prompt-XR similarity scores, so output length does not affect results.

R1.2, R2.2: Backbone. We will clarify that “anatomy” refers to anatomical regions (e.g., hip, knee). PedCLIP uses a ViT-B/16 pretrained on ImageNet and BioClinicalBERT pretrained on MIMIC notes, totaling 187M parameters. Other models (in M): GLoRIA: 134, MedCLIP: 137, MGCA: 173, Carzero: 221, PRIOR: 281, UMedCLIP: 414; PedCLIP ranks 4/7. Besides MoBE, there are no architectural changes, showing that the gains stem from MoBE and not from model size.

R1.3: PedCLIP on adult XR. Our goal is to build a pediatric VLM across body parts, while adult pre-training data is chest-only. Applying PedCLIP to adult XR would underutilize it and misalign with the MoBE design goal. We will discuss where MoBE is limited in the text.

R2.1, R3.1, R3.2: Novelty of MoBE. We appreciate the reviewers’ comments. First, we emphasize that this work focuses on a macro framework for a foundation VLM across diverse body parts via MoBE, rather than introducing a new MoE. To our knowledge, this is the first work to apply a soft MoE to anatomically diverse pediatric data. Second, our MoE design differs. While DAMEX and Med-MoE also use explicit gating like MoBE, other MoEs (Switch, DeepMoE, M4oE) rely on implicit gating with load balancing for uniform expert usage, which conflicts with the long-tailed nature of PeXR: routing rare body parts to underutilized experts reduces specialization. DAMEX uses top-1 routing based on the dataset, limiting expert sharing and generalization. In contrast, MoBE uses soft routing, which is important when diseases span multiple regions (see R3.3). DAMEX also routes all image tokens, increasing compute and ambiguity, especially from background patches. Med-MoE redundantly routes both image and text tokens describing the same part. MoBE routes efficiently using only the global [IMG] token, preventing patch tokens from focusing on body-part discrimination and neglecting disease features. Finally, we stress that PedCLIP, our pediatric VLM, is the main contribution; MoBE is a component to address anatomical heterogeneity in pediatric data. We will expand the related comparisons to support the MoBE design.

R2.3: Tab. 1. We use the original checkpoints for the adult-pretrained VLMs, which can differ in pre-training data (e.g., GLoRIA: CheXpert; MedCLIP: MIMIC). Adult models are better on CHD because they are chest-pretrained, and the mixed body parts in PeXR interfere with the chest features of the PeXR-pretrained VLMs (MoBE fixes this; see Fig. 5). PeXR-pretrained VLMs are better on GRAZ because wrist X-rays are absent in adult data. This will be noted in the revision.

R2.4, R2.5: Inconsistencies. Fine-tuning: we see the concern and confirm there is no inconsistency: Sec. 3.1 covers PeXR pre-training, while Tab. 2 reports results after fine-tuning the pretrained model. To fine-tune, we freeze the image encoder and train a linear head (classification/regression) or a mask decoder (segmentation) for 50 epochs. Fig. 5: “MLP” is the correct label and we will update the caption.

R3.3: Gating design. A learned soft gate lets findings features be shared across body parts: fracture cues learned on the arm can help detect fractures in the leg. Importantly, it selects experts automatically without needing body-part labels at inference. The suggestion to use fixed expert assignment without gating is valuable; that said, we expect it to underperform, as it relies on precise body-part labels and prevents experts from sharing features. We will include this in the revision. The gate is a single linear layer (768×23) with minimal impact on runtime.

R3.4: PeXR. PeXR is not publishable at this stage, but we will first release our pediatric foundation model checkpoint, PedCLIP. PeXR is imbalanced, with a skewness of 2.6 and the top three regions (chest, wrist, forearm) accounting for 46% of samples. Besides balanced sampling, MoBE’s soft routing lets rare regions benefit from features shared by the other experts. We will add the distribution to the text.
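
To make the R1.1 answer concrete, here is a hedged sketch of the described CheXzero-style (±) prompt scoring; encode_image and encode_text are placeholders for the model's encoders, and only the softmax-over-similarity logic is taken from the rebuttal.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_score(image, finding, encode_image, encode_text, temperature=0.07):
        """Score one finding with paired positive/negative prompts, CheXzero-style.

        encode_image / encode_text are assumed to return embeddings of shape (1, D)
        and (num_prompts, D), respectively."""
        prompts = [f"XR shows {finding}", f"XR shows no {finding}"]   # (+) and (-) prompts
        img = F.normalize(encode_image(image), dim=-1)                # (1, D)
        txt = F.normalize(encode_text(prompts), dim=-1)               # (2, D)
        sims = (img @ txt.t()) / temperature                          # (1, 2)
        probs = sims.softmax(dim=-1)                                  # softmax over (+/-) prompts
        return probs[0, 0].item()   # probability of the positive prompt; used for AUC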




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a divided group of reviewers. Two of the reviewers were moved to support the paper more strongly after the rebuttal. The one who had originally recommended rejection still does not recommend acceptance, but the new comments seem general and do not refute the rebuttal. Given two strong Accept decisions, I side with Accept.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


