Abstract
Self-Supervised Learning (SSL) has shown promising results in medical image segmentation, offering advanced performance with minimal annotations. However, the absence of semantics during pre-training limits the performance of downstream tasks (e.g., organ segmentation). To address this issue, we propose a novel SSL framework via Foundation model Distillation and Anatomic Structure-aware multi-task learning (FDAS) for medical image segmentation. Specifically, we distill knowledge from the Segment Anything Model (SAM) and propose SAM-guided anatomic Structure-aware Masked Image Modeling (S2MIM), which randomly masks multiple anatomic structures in the image to enrich representation learning. For better pre-training, we introduce anatomic structure-aware multi-task learning, which integrates reconstruction and segmentation of anatomic structure-fused images to capture richer semantic information, along with fusion-based contrastive learning to preserve the semantic integrity and discriminative power of the learned representations. Experiments on two applications (cardiac MRI segmentation and fetal brain MRI segmentation) demonstrate that our method effectively improves representation learning and outperforms several state-of-the-art SSL methods. The code is available at https://github.com/HiLab-git/FDAS.
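As a rough illustration of the structure-aware masking described in the abstract, the following sketch hides whole SAM-proposed regions instead of random square patches (all names and shapes are assumptions for illustration, not taken from the released code):

```python
import numpy as np

def structure_aware_mask(image, region_masks, mask_ratio=0.5, rng=None):
    """Hide whole SAM-proposed regions instead of random square patches.

    image:        (H, W) float array.
    region_masks: list of (H, W) boolean arrays, one per candidate structure.
    mask_ratio:   fraction of regions to mask out.
    """
    rng = rng or np.random.default_rng()
    n_hide = max(1, int(len(region_masks) * mask_ratio))
    hidden = rng.choice(len(region_masks), size=n_hide, replace=False)
    masked = image.copy()
    for i in hidden:
        masked[region_masks[i]] = 0.0  # zero out the whole anatomic structure
    return masked, hidden
```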
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2967_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/HiLab-git/FDAS
Link to the Dataset(s)
N/A
BibTex
@InProceedings{QiXia_FDAS_MICCAI2025,
author = { Qi, Xiaoran and Zhang, Guoning and Wu, Jianghao and Zhang, Shaoting and Hou, Xiaorong and Wang, Guotai},
title = { { FDAS: Foundation Model Distillation and Anatomic Structure-aware Multi-task Learning for Self-Supervised Medical Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {193--202}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of this work lies in exploring the use of the foundation model SAM to assist self-supervised learning. The authors propose leveraging SAM to identify potential anatomical structures.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper is well-formulated and easy to understand.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Undesirable self-supervised learning (SSL) setting. A well-established SSL paradigm involves pre-training on large-scale unlabeled data with diverse domain knowledge (e.g., different modalities, objects), followed by fine-tuning on labeled data for downstream tasks. Furthermore, the downstream and upstream tasks are generally distinct. However, this work uses a single dataset for both pre-training and evaluation, which limits the generalizability of the approach.
2. Given that SAM is frozen, the indicated anatomical structures may become less informative over training iterations. Would this lead to a decrease in performance or limit the model's ability to capture diverse fundamental representations?
3. What is the rationale for reconstructing fused images using the original images as targets? This approach may cause the model to focus on transforming the inherent structure of a random image to match the target, rather than enhancing the model's understanding of the target itself.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The experimental setting and limited pre-training data scale are insufficient to validate the effectiveness of the proposed SSL design.
- There is a lack of satisfactory rationale or evidence to support the design of Image Fusion-driven Reconstruction.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Thank you for the authors’ response. However, my score remains unchanged due to the following reasons:
- I find it difficult to agree with the authors’ claim that “pre-training and evaluating on the same dataset is widely accepted in medical self-supervised learning (SSL).” While several works are cited in support of this point, I encourage the authors to consider the scale of data used in those studies. For instance, SurgNet [2] pre-trains on approximately 3 million unlabeled images from a single dataset, whereas this work utilizes fewer than 1,000 images for pre-training—raising concerns about generalizability and representativeness.
- Moreover, the referenced works in the rebuttal—MG [1], SurgNet [2], and Rubik [3]—were all published before 2023. I recommend the authors refer to more recent SSL studies, such as BrainMaSS (TMI 2024) and MIM (TMI 2025), which better reflect the current landscape and practices in SSL research.
Review #2
- Please describe the contribution of the paper
This paper proposes FDAS (Foundation model Distillation and Anatomic Structure-aware multi-task learning), a novel self-supervised learning (SSL) framework designed to enhance medical image segmentation. It introduces several key innovations:
- A SAM-guided anatomic structure-aware masked image modeling strategy, which leverages Segment Anything Model (SAM) masks to generate semantically meaningful masked inputs.
- Anatomic structure-aware multi-task learning, combining image fusion-driven reconstruction (IFR), SAM knowledge distillation (SKD), and fusion-based contrastive learning (FCL).
Experiments demonstrate superior performance over existing SSL methods on two tasks: cardiac MRI segmentation (M&MS dataset) and fetal brain MRI segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Incorporates zero-shot SAM masks into the SSL pre-training phase via S2MIM, improving semantic relevance during masking.
- Combines reconstruction, segmentation, and contrastive objectives, resulting in stronger representations and better generalization.
- Outperforms state-of-the-art SSL baselines (e.g., MAE, HybridMIM, VoCo) on Dice and ASSD, even rivaling fully supervised models.
- Each component of the framework is individually assessed, showing meaningful incremental improvements.
- Achieves strong segmentation accuracy using only 10% of labeled data; performance approaches “upper bound” using full supervision.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Relies heavily on SAM for pretext supervision; if SAM generates poor masks (e.g., on noisy or out-of-distribution data), performance may suffer.
- Only two datasets are evaluated—one private (fetal brain). Additional modalities (e.g., CT, ultrasound) would strengthen claims of generality.
- Multi-task objectives and SAM preprocessing may be costly; the paper doesn’t report pre-training efficiency or overhead compared to other SSL methods.
- No qualitative failure examples or analysis of when SAM-guided masks or fusion methods might degrade performance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the FDAS framework presents a meaningful advance in semantic-aware SSL for medical image segmentation by distilling foundation model knowledge and structuring a multi-task learning approach. Its performance and ablations are strong, though further generalization studies and computational profiling would strengthen its impact. I suggest weak accept.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors address most of my concerns.
Review #3
- Please describe the contribution of the paper
To address the absence of semantics during pre-training, which limits the performance of downstream tasks, the authors propose a new SSL approach called Foundation Model Distillation and Anatomic Structure-aware multi-task learning (FDAS) for medical image segmentation. The authors distil knowledge from SAM into a SAM-guided anatomic structure-aware masked image modelling scheme called S2MIM. The method also incorporates image fusion-driven reconstruction and segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
It is novel to use SAM's guidance for pre-training together with image fusion and contrastive learning. The method achieves superior performance over state-of-the-art approaches.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Some details of the method are missing, such as details of the lightweight reconstruction head and the simple segmentation head. It is not explained how these two differ, or whether the segmentation head uses a simple sigmoid or softmax or another layered network.
- The methodology can be further improved as it has multiple weight parameters (\theta); how optimisation happens during model training needs to be explained. As I understood, both heads were trained in a similar manner using the same loss function.
- The explanation of the auxiliary image is not clear. It is not detailed whether the auxiliary image is taken from the same slice number of a different volume, or whether it can be a different slice from the same volume, to recover structural semantics.
- The authors mention that the proposed method captures richer semantic information, yet apart from segmentation masks and scores, there are no empirical results, such as activation maps, to visualise the model's representation ability.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the paper has novelty in its proposed framework, yet there’s room for improvement, especially when it comes to writing and explaining the methodology.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We appreciate that the reviewers recognized the novelty and effectiveness of our method (R1 & R4), the clarity of writing (R1, R2 & R4), the comprehensiveness of the experiments (R1 & R4), and the superior performance (R1 & R4). Our detailed responses are provided below.
- Self-Supervised Learning (SSL) setting (R2) While our work pre-trains and evaluates on the same dataset, this setting is widely accepted in medical SSL due to data access and privacy constraints, and has been proven effective in prior works such as MG [1], SurgNet [2] and Rubik [3], where the model is first pre-trained with unlabeled images in a dataset, and then fine-tuned on a small annotated subset. Our goal is not to build a general-purpose foundation model, but to explore a novel self-supervised method that can leverage unannotated images for pre-training. The concept of our multi-task self-supervised pretraining has been validated in this preliminary work, which serves as a foundation for building a general model using multiple datasets in the future.
- Effectiveness and robustness of using SAM (R2) The aim of using SAM is to generate semantically meaningful subregions, so that some of these regions are randomly masked for image reconstruction. Compared with MAE, where the masked region is agnostic to image context, our masked region for reconstruction is more structure-adaptive, which helps the pre-trained network learn more contextual and semantic features. Although the parameters of SAM are frozen, we apply spatial transformations to the input image and vary the number of point prompts (in everything mode) to keep the generated regions diverse, ensuring they remain informative during iterations.
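A sketch of how this prompt-density variation could be realized with the public segment-anything API (the checkpoint path and candidate grid sizes are placeholders, not values from the paper):

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Placeholder checkpoint path; any official SAM checkpoint works here.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")

def diverse_regions(image_rgb, rng=None):
    """Run SAM in everything mode with a randomly chosen prompt grid.

    image_rgb: (H, W, 3) uint8 array. A denser grid yields more, smaller
    regions, so sampling the density keeps proposals varied across epochs.
    """
    rng = rng or np.random.default_rng()
    points_per_side = int(rng.choice([8, 16, 32]))  # hypothetical densities
    generator = SamAutomaticMaskGenerator(sam, points_per_side=points_per_side)
    masks = generator.generate(image_rgb)       # list of dicts with 'segmentation'
    return [m["segmentation"] for m in masks]   # boolean (H, W) arrays
```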
- Motivation of Image Fusion-driven Reconstruction (IFR) (R2) Note that our IFR replaces some regions of an image with content from a different image, and then encourages the network to reconstruct the original content in the replaced regions. The network needs to reconstruct these regions based on the unmasked regions of the original image, under perturbations from other images; therefore, it will not simply transform a random image to match the target. The rationale follows MAE, where the network is trained to reconstruct the masked regions of an image from the unmasked part, a strategy widely used in recent SSL methods. Note that rather than using structure-agnostic masking, our method uses structure-aware replacement for the reconstruction task, which better learns contextual features.
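A minimal sketch of this fusion-then-reconstruct objective under assumed tensor shapes (the model, names, and plain MSE loss are simplifications for illustration, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def fuse_and_reconstruct_loss(model, image, aux_image, region_mask):
    """Replace one SAM region with content from an auxiliary image, then
    ask the network to recover the original content in that region.

    image, aux_image: (B, 1, H, W) tensors; region_mask: (B, 1, H, W) in {0, 1}.
    """
    fused = image * (1 - region_mask) + aux_image * region_mask
    recon = model(fused)
    # Supervise only the replaced region against the *original* image,
    # so the target is the true content, not the auxiliary perturbation.
    return F.mse_loss(recon * region_mask, image * region_mask)
```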
- Dealing with noisy SAM output (R1) We considered the domain gap between SAM's training set and our medical dataset. To reduce noise, we kept only the top K potential regions covering 95% of the image (Eq. 1) and applied FCL (Sec. 2.4) for robust feature learning.
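A sketch of what this top-K coverage filter might look like, assuming boolean region masks as returned by SAM's everything mode (names are illustrative; the exact rule is Eq. 1 in the paper):

```python
import numpy as np

def filter_top_regions(region_masks, coverage=0.95):
    """Keep the largest SAM regions until they jointly cover `coverage`
    of the image, discarding small noisy fragments.

    region_masks: list of (H, W) boolean arrays.
    """
    order = sorted(range(len(region_masks)),
                   key=lambda i: region_masks[i].sum(), reverse=True)
    total = region_masks[0].size
    covered = np.zeros_like(region_masks[0], dtype=bool)
    kept = []
    for i in order:
        kept.append(region_masks[i])
        covered |= region_masks[i]
        if covered.sum() / total >= coverage:
            break
    return kept
```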
- Pre-training efficiency and experiments (R1) The entire pre-training process on the M&MS dataset takes about 5.6 hours on a single RTX 3090 GPU, while HybridMIM and VoCo require 17.6 and 10.7 hours, respectively. We agree on the value of other modalities and failure analysis, and will explore both in future work.
- Clarity of the method (R4) Both task-specific heads use 1×1 convolutions with distinct output channels (see the sketch after the references below). All parameters (θ) are optimized jointly using the overall loss. The auxiliary image is taken from a different slice within the same volume. Visualizations during pre-training showed that our method captures richer semantics, but were omitted due to space constraints. The method details will be clarified in the manuscript.
[1] Zhou, Z., et al. Models Genesis: Generic autodidactic models for 3D medical image analysis. MICCAI, 2019.
[2] Chen, J., et al. SurgNet: Self-supervised pretraining with semantic consistency for vessel and instrument segmentation in surgical images. IEEE TMI, 2023.
[3] Tao, X., et al. Revisiting Rubik’s Cube: Self-supervised learning with volume-wise transformation for 3D medical image segmentation. MICCAI, 2020.
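For concreteness, a minimal sketch of the two 1×1-convolution heads described in the response above (channel counts are assumptions for illustration, not values from the paper):

```python
import torch.nn as nn

feat_ch = 64   # assumed number of decoder feature channels
num_cls = 4    # assumed number of anatomic-region classes

# Lightweight reconstruction head: a single output channel (image intensity).
recon_head = nn.Conv2d(feat_ch, 1, kernel_size=1)

# Simple segmentation head: one channel per class; softmax is applied
# inside the segmentation loss rather than in the head itself.
seg_head = nn.Conv2d(feat_ch, num_cls, kernel_size=1)
```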
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work presents enough technical contributions and meets the bar of MICCAI.