Abstract

Segmentation foundation models (SFMs) hold promise for medical image analysis, but their direct clinical application is limited by computational cost, potentially suboptimal accuracy, and fairness concerns. In this paper, we propose a novel framework to address these challenges by distilling knowledge from a heterogeneous ensemble of pre-trained SFMs, generating specialized, high-performance models for domain-specific medical image segmentation. Unlike existing single-SFM approaches, our methodology leverages the collective intelligence of diverse SFMs to enhance accuracy, fairness, and efficiency. A key contribution is a knowledge distillation strategy using the ensemble’s aggregate predictions on unlabeled data to minimize reliance on manual annotation. Evaluated on a large, diverse dataset of CT and MRI scans from 702 individuals, our distilled model significantly outperforms individual SFMs and their ensemble average, achieving state-of-the-art segmentation accuracy, improved fairness across demographics (sex, age, BMI), and substantially reduced computational cost. These results offer a practical paradigm for leveraging foundation models in real-world clinical settings, mitigating key SFM limitations.
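For a concrete picture of the ensemble-to-student distillation idea described above, a minimal sketch is given below. It assumes PyTorch-style models; the simple averaging of teacher probabilities and the soft-Dice plus cross-entropy objective are illustrative choices, not necessarily the paper's exact formulation.

# Minimal sketch of ensemble pseudo-label distillation. The averaging rule and
# the loss are assumptions for illustration, not the authors' exact method.
import torch
import torch.nn.functional as F

def aggregate_teacher_masks(teachers, image):
    """Run each frozen SFM teacher and average their soft foreground masks."""
    with torch.no_grad():
        probs = [torch.sigmoid(t(image)) for t in teachers]   # each (B, 1, H, W)
    return torch.stack(probs, dim=0).mean(dim=0)              # consensus pseudo-label

def soft_dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum(dim=(2, 3))
    denom = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def distillation_step(student, teachers, image, optimizer):
    """One optimization step: the student mimics the ensemble consensus."""
    pseudo = aggregate_teacher_masks(teachers, image)
    pred = torch.sigmoid(student(image))
    loss = soft_dice_loss(pred, pseudo) + F.binary_cross_entropy(pred, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the pseudo-labels come only from the frozen teachers, no manual segmentation masks enter this training loop, which is the sense in which the distillation is described as ground-truth-free.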

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2167_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiQin_From_MICCAI2025,
        author = { Li, Qing and Zhang, Yizhe and Yang, Shengxiao and Li, Qirong and Wang, Zian and Liu, Junhong and Zhang, Haoyang and Wang, Shuo and Wang, Chengyan},
        title = { { From Generalist to Specialist: Distilling a Mixture of Foundation Models for Domain-specific Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {195 -- 204}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose to leverage an ensemble of foundation segmentation models to generate segmentation masks for distilling a student network on unlabelled imaging data. The authors also propose a fairness-aware training strategy to handle the model's bias introduced by the various FMs.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation for this paper is clear and straightforward. The use of ensemble FMs is a sound approach to handle the limited data scenario in clinical settings. The proposed fairness-oriented sampling method to reduce demographic biases is also straightforward and seems to work well.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The method's performance and fairness characteristics do not account for the imaging data's characteristics (for example, whether the ratio of CT vs. MRI scans impacts the model's performance on each modality, or whether different scanners affect performance). This is surprising to me because the distribution shift between imaging modalities would also be a major factor impacting the model's performance in real-world cases [1]. The authors also do not mention this in the dataset section. Similarly, the method's performance and fairness characteristics for other anatomies or imaging modalities (e.g., ultrasound, PET) are also unexplored.

    The paper claims that no explicit groundtruth information is necessary when distilling their student from various FM teachers. However, models such as SAT would require very explicit text prompts (same as its training prompt), and prompt-driven methods such as SAM or SAM2 also need some ground truth information to get the appropriate visual prompts [2]. How does this method deal with this situation when no ground truth is given at all?

    The authors claim that their ensemble of FMs mitigates the challenge of missing ground-truth labels. However, if a specific organ was not covered in the training of any of the FMs used, would the authors expect their framework to still work well? In the experimental design, the authors have only tested the approach on very common organs at which each FM should excel. However, in real clinical scenarios, some structures such as neurovascular bundles have very little or no annotation at all [3], so even the most powerful FM cannot perform direct segmentation since it was not specifically trained on such annotations.

    Reference:

    1. Kilim, Oz, et al. “Physical imaging parameter variation drives domain shift.” Scientific Reports 12.1 (2022): 21302.
    2. Mazurowski, Maciej A., et al. “Segment anything model for medical image analysis: an experimental study.” Medical Image Analysis 89 (2023): 102918.
    3. Li, Zhen, et al. “A deep learning-based self-adapting ensemble method for segmentation in gynecological brachytherapy.” Radiation Oncology 17.1 (2022): 152.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the overall idea is straightforward and clear, the limited experiment on a single private dataset cannot convince me that this idea would work on more complex clinical scenarios (where you may have a distribution shift between your training and testing). At least the authors should try to mimic some complex scenarios by bootstrapping samples within their private dataset to show the advantage of their proposed learning method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Although the authors state their method does not require explicit ground truth, they reveal reliance on semi-automatically generated bounding boxes by radiologists during distillation. This undermines their assertion of a ground-truth-free framework.

    Also, the authors have not convincingly demonstrated the robustness or applicability of their approach for less frequently annotated structures in the paper. Their current validation is limited to common and easily identifiable organs, which may not work well for challenging, less-annotated anatomical structures.

    With these reasons, I recommend rejection of the paper.



Review #2

  • Please describe the contribution of the paper

    The manuscript introduces a ground-truth-free knowledge distillation framework that distills an ensemble of segmentation foundation models (SFMs) into specialized medical imaging models, integrating fairness-aware batch sampling to mitigate demographic biases. Combining deterministic (UNet/HSNet) and probabilistic (PUNet) strategies, the approach achieves DSC scores (0.900 overall) and shows good fairness across sex, age, and BMI groups on a 702-scan CT/MRI dataset. Distilled models reduce parameters compared to SFMs while maintaining performance, leveraging mask selection and balanced sampling to address bias and computational overhead.
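    For readers wanting a concrete picture of the fairness-aware batch sampling summarized above, a minimal sketch follows. The stratification keys (sex, an age bin, a BMI bin), the cut-offs, and the per-subgroup round-robin draw are illustrative assumptions, not the authors' implementation.

    # Illustrative fairness-aware batch sampler: each batch draws an equal number
    # of cases from every demographic subgroup (sex x age bin x BMI bin).
    import random
    from collections import defaultdict

    def build_subgroups(metadata):
        """metadata: list of dicts with keys 'id', 'sex', 'age', 'bmi'."""
        groups = defaultdict(list)
        for m in metadata:
            age_bin = "young" if m["age"] < 40 else "old"      # assumed cut-off
            bmi_bin = "normal" if m["bmi"] < 25 else "high"    # assumed cut-off
            groups[(m["sex"], age_bin, bmi_bin)].append(m["id"])
        return groups

    def fair_batches(groups, batch_size, num_batches):
        keys = list(groups.keys())
        per_group = max(1, batch_size // len(keys))
        for _ in range(num_batches):
            batch = []
            for k in keys:
                batch.extend(random.sample(groups[k], min(per_group, len(groups[k]))))
            yield batch[:batch_size]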

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This manuscript has following advantages,

    1. It eliminates dependency on manual annotations using SFM ensemble pseudo-labels
    2. It integrates fairness as a first-class objective through stratification of sex/age/BMI
    3. Achieves a better mean DSC than the best SFM, with fewer parameters
    4. Probabilistic distillation handles SFM disagreement through latent space modeling
    5. Demonstrates scalability across CNN and Transformer architectures (for student models)
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Few Questions / Comments:

    1. The references cited in the introduction section start at number 3. Can the authors explain why references 1 and 2 are missing from the introduction? Was this an issue with the numerical ordering of the references?
    2. It would be great if the authors could spell out the acronym "SAT". Can a definition of SAT be included where it is first mentioned in the introduction section?
    3. The specific inclusion/exclusion criteria governing the 702-scan cohort selection are not mentioned in the manuscript. It would be great if the authors could state the criteria used to select subjects for the study.
    4. Can the authors explain how the bounding boxes were generated for the SAM-based models? If the process was automated, what ensured anatomical correctness without a ground-truth mask?
    5. The dataset distribution with respect to the different modalities is also not provided in the manuscript. It would be good if the authors added the MRI/CT distribution ratio, broken down across the different demographic subgroups.
    6. Can the authors explain why radiologic technicians were used for the annotations rather than radiologists, given that the manuscript focuses on a clinical dataset?
    7. Also, what was the inter-rater agreement for the annotations provided by the two radiologic technicians? If available, can the authors state how it was calculated, per organ or as an overall score?
    8. In Figure 2, what do the shades of red and the significance markers (e.g., '**') represent? The figure is ambiguous and difficult for readers to interpret; it would be good if the authors added this explanation to the figure caption.
    9. Section 3 (Results) states that "The dataset includes MRI scans of seven anatomical structures", whereas only five are listed ("liver, kidney, spleen, pancreas and IVC"). Can the authors list all seven anatomical structures and the label number assigned to each?
    10. The reported results look good, but how do the results for each anatomical structure compare with the current state-of-the-art segmentation performance for that structure? It would be good to include a comparative analysis between the current state-of-the-art results for each structure and the results of the three distilled models.
    11. Table 3 reports standard deviation values for the PUNet model and for its fairness-trained counterpart, and there is a difference between them. Can the authors add the Dice scores for the extreme cases, specifically where the model performed worst? Did fair training change the results on the worst cases for each structure?
    12. Can the authors explain how the framework handles SFMs with known biases (e.g., SAT's lower IVC DSC of 0.347)? Does distillation amplify or mitigate such biases?
    13. Also, were the SFMs fine-tuned on the distillation set, or used strictly for zero-shot inference? If the latter, how does performance compare to lightly fine-tuned SFMs?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The manuscript introduces an approach to distilling knowledge from multiple SFMs into specialized medical imaging models without relying on ground-truth annotations. This ground-truth-free distillation strategy is a valuable contribution to medical image analysis.
    2. As mentioned above, there are a few points/questions that need to be addressed by the authors for a strong acceptance of the manuscript.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Thank you to the authors for the thorough and thoughtful response to the review comments. The clarifications regarding the primary aim of the work (distilling knowledge from existing SFMs rather than training new ones), the methodology for prompt generation in SAM-based models, and the dataset rationale are well received and address my previous concerns. I particularly appreciate your commitment to implementing all of the specific corrections I suggested and to including performance metrics across multiple runs. These revisions will significantly strengthen the manuscript.



Review #3

  • Please describe the contribution of the paper

    This paper proposes utilizing multiple foundation models to perform segmentation without ground-truth labels, leveraging probability distillation across all sources. It also incorporates fairness-aware objectives to mitigate demographic biases.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Leveraging large foundation models to improve medical image segmentation in low-label settings is a timely and hot topic. This paper presents a workflow that effectively integrates multiple foundation models to enhance segmentation performance, and the experimental evaluation is fairly comprehensive.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The datasets used in this paper are private, and it lacks evaluation on publicly available benchmarks that are commonly used to assess foundation models. Additionally, the claim of a no-label setting appears contradictory, as bounding boxes are required for SAM-based segmentation generation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Since SAM-based models require bounding boxes as input, this setup implicitly relies on weak annotations rather than being annotation-free. Could the authors clarify how these bounding boxes are obtained? Are they derived from ground truth masks, heuristics, or another model?
    2. The paper states the dataset is split into training and testing sets, but it’s unclear how model performance is reported. Are the results from a single run, or do they reflect average performance across multiple runs? To improve reproducibility and robustness, I suggest the authors report mean ± std over several runs.
    3. It is interesting that MedSAM underperforms compared to the original SAM on the private dataset, despite being pre-trained on a large collection of CT/MRI data that includes the organs studied in this paper. Do the authors have any insights into this surprising result? Why did the authors not evaluate the proposed model on the public benchmarks (commonly used test sets) used by MedSAM or SAT, to provide a more complete comparison?
    4. The paper proposes combining multiple foundation models. Could the authors clarify the individual contribution of each foundation model? For example, how do their combinations impact the overall performance?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant problem and presents a reasonably well-motivated approach using multiple foundation models for segmentation without ground-truth labels. While the methodological novelty is moderate and some evaluation aspects (e.g., public benchmark comparison, clarity on no-label assumptions) could be improved, the overall framework makes sense and the experiments are comprehensive.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have clearly addressed my questions. It would be better if the manuscript could clarify the impact of labeling quality from different foundation models, and make it clear that the method is ground-truth-free but not entirely annotation-free, as it still relies on at least weak annotations. Overall, the experimental design is solid and reasonably fair.




Author Feedback

We sincerely thank all reviewers for their constructive feedback and recognition of our contributions. Below, we provide our responses. We wish to clarify our primary aim: to address key issues—suboptimal accuracy, high computational cost, and fairness concerns—associated with the direct clinical application of existing segmentation foundation models (SFMs). Our core idea is to distill their collective knowledge into a specialized, efficient model. Consequently, we are NOT training new SFMs but strategically leveraging masks generated by SFMs to train models for single-organ segmentation, thereby minimizing the need for extensive manual ground-truth masks.

Prompt Usage (R1, R2, R3)

We acknowledge that some SFMs in our framework require prompts: SAM-based models use bounding boxes, SAT uses fixed text prompts, and TotalSegmentator requires no explicit prompts. For SAM-based models, the bounding-box prompts were generated by radiologists via a semi-automatic tool, in line with standard protocols (e.g., MedSAM), which is far less costly than annotating full masks. Crucially, no manual ground-truth segmentation masks are used to supervise the student model during distillation.

Dataset Description, Fairness, and Annotation Quality (R1, R2, R3)

Using a private dataset ensures that no SFM has previously been trained on it, enabling a fair zero-shot evaluation; it also reflects practical usage (applying FMs to a facility's own data). Our dataset comprises MRI and CT scans from 702 individuals, curated explicitly for evaluating generalization and fairness. Demographic distributions (sex, age, BMI) were balanced across the training and testing splits. Inclusion criteria were adults aged 20–60 without major anatomical distortions or significant imaging artifacts. Although each image was typically annotated by one radiologist due to the large sample size, inter-rater reliability (κ > 0.9) was validated on 30 randomly selected samples.
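As a rough illustration of the reliability check mentioned above, per-pixel Cohen's kappa between two raters' binary masks could be computed as sketched below; whether the paper used a per-pixel kappa or a different agreement statistic is an assumption on our part.

# Sketch of per-pixel Cohen's kappa for two binary segmentation masks.
import numpy as np

def cohens_kappa(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    a, b = mask_a.astype(bool).ravel(), mask_b.astype(bool).ravel()
    po = (a == b).mean()                          # observed agreement
    p_fg = a.mean() * b.mean()                    # chance both mark foreground
    p_bg = (1 - a.mean()) * (1 - b.mean())        # chance both mark background
    pe = p_fg + p_bg
    return (po - pe) / (1 - pe) if pe < 1.0 else 1.0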

Multi-Source Knowledge Distillation (R1, R2, R3)

Our approach distills multiple generalist SFMs into specialized models for single-organ segmentation within a specified imaging modality. Therefore, variability in modality, scanner types or organs was beyond our current scope. Our multi-source distillation improved both accuracy and fairness, demonstrating robustness even when SFMs disagreed—for instance, SAM’s strong performance on IVC compensated for SAT’s lower accuracy (DSC: 0.774 vs. 0.347). Notably, all SFMs were used strictly in zero-shot inference mode, preserving our ground-truth-free framework. Future work will explore additional anatomies and modalities (e.g., PET).
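One way an ensemble can remain robust to a weak teacher, in the spirit of the IVC example above, is simple consensus-based mask selection: a teacher's mask contributes only if it agrees sufficiently with the pixel-wise majority vote. The sketch below is an assumption-laden illustration; the 0.7 threshold and the majority-vote reference are not taken from the paper.

# Hedged sketch of consensus-based mask selection across teacher SFMs.
import numpy as np

def dice(a, b, eps=1e-6):
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def select_teacher_masks(masks, threshold=0.7):
    """masks: list of binary arrays with identical shape, one per teacher SFM."""
    majority = np.stack(masks).mean(axis=0) >= 0.5   # pixel-wise majority vote
    return [m for m in masks if dice(m, majority) >= threshold]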

Performance Evaluation and Baseline Comparison (R1, R2)

Reported results reflect performance on a hold-out test set. While the current results are based on a single training run, we observed that repeated runs led to negligible variation in the conclusions. To strengthen the manuscript, the mean and standard deviation of performance across multiple runs will be included in the revision. Comparing our best-performing model (PUNet) to nnUNet trained on ground-truth masks demonstrated competitive performance—overall DSC difference was around 0.01—highlighting our method’s effectiveness without requiring ground-truth masks. Furthermore, fair training significantly improved minimum DSC values, reducing bias across samples.
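A simple way to quantify the fairness improvement described above is to summarize per-case DSC values by demographic group and report the worst-group mean and the between-group gap; the sketch below is illustrative and does not reproduce the paper's exact fairness metrics.

# Sketch of a fairness summary over per-case DSC values grouped by attribute.
from collections import defaultdict

def fairness_summary(dsc_per_case, group_per_case):
    by_group = defaultdict(list)
    for dsc, g in zip(dsc_per_case, group_per_case):
        by_group[g].append(dsc)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    return {
        "worst_group_mean_dsc": min(means.values()),
        "group_gap": max(means.values()) - min(means.values()),
        "overall_min_dsc": min(dsc_per_case),
    }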

Modality Distribution and SAM vs. MedSAM (R1, R2, R3)

Each organ was consistently imaged using a fixed modality (CT for lung; MRI for others). Regarding generalization capability, SAM consistently outperformed MedSAM, likely due to domain mismatches with MedSAM’s pre-training data. This finding underscores the benefit of employing multiple diverse SFMs within our distillation framework to mitigate variability and enhance performance.

Specific Clarifications (R1)

We will implement all recommended corrections, including precise reference numbering, a definition of SAT, explanations of the color codes and symbols (*), and correction of editing errors.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I tend to agree with the majority of reviewers. But the comments from R3 should still be kept in mind, for example: “Although the authors state their method does not require explicit ground truth, they reveal reliance on semi-automatically generated bounding boxes by radiologists during distillation. This undermines their assertion of a ground-truth-free framework.”



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


