Abstract

Vision Transformer (ViT) has recently gained tremendous popularity in medical image segmentation tasks due to its superior capability in capturing long-range dependencies. However, transformers require a large amount of labeled data to be effective, which hinders their applicability in the annotation-scarce semi-supervised learning scenario where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross-teach a transformer with a convolutional neural network (CNN), which achieves promising results. However, it remains challenging to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling (AdvMIM) method to fully unleash the potential of the transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with a transformer lies in the lack of a sufficient supervision signal. To this end, we construct an auxiliary masked domain from the original domain via masked image modeling and train the transformer to predict the entire segmentation mask from masked inputs, thereby increasing the supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To let the original domain further benefit from the masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the gap between the original and masked domains, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to the CNN network. Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, which outperforms existing methods significantly. Our code is publicly available at https://github.com/zlheui/AdvMIM.
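As a minimal illustration of the masked-domain construction described in the abstract (a hedged sketch, not the authors' code; the helper name, patch size, and mask ratio are assumptions), one can zero out random image patches and train the segmenter to predict the full mask from the masked input:

```python
import numpy as np

def mask_patches(image, patch_size=16, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches.

    Illustrative sketch of building one 'masked domain' sample;
    the actual patch size and ratio used in the paper may differ.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    gh, gw = h // patch_size, w // patch_size
    n_patches = gh * gw
    n_masked = int(mask_ratio * n_patches)
    idx = rng.choice(n_patches, size=n_masked, replace=False)
    masked = image.copy()
    for i in idx:
        r, c = divmod(int(i), gw)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size] = 0
    return masked

# The segmentation network is then trained to predict the FULL mask
# from `masked`, supervised by ground-truth labels or pseudo-labels.
img = np.ones((64, 64), dtype=np.float32)
m = mask_patches(img, patch_size=16, mask_ratio=0.75)
print(m.mean())  # 0.25: one quarter of the pixels remain visible
```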

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0577_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/zlheui/AdvMIM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhuLei_AdvMIM_MICCAI2025,
        author = { Zhu, Lei and Zhou, Jun and Goh, Rick Siow Mong and Liu, Yong},
        title = { { AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {54 -- 64}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes the Adversarial Masked Image Modeling (AdvMIM) method, a novel strategy combining masked image modeling and adversarial domain adaptation to effectively train Vision Transformers (ViTs) for semi-supervised medical image segmentation with limited labeled data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces a combination of Masked Image Modeling (MIM) and Adversarial Domain Adaptation. Unlike traditional masked image modeling approaches that focus on reconstructing masked patches, this method shifts the task toward predicting the complete segmentation mask from partially masked images.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper lacks sufficient detail regarding the training setup, which raises concerns about reproducibility. Specifically, it states: “Note the CNN is trained together with the transformer using the same loss functions except that pseudo-labels for CNN are assigned by transformer.” However, it is unclear whether the Swin-UNet and UNet models are trained simultaneously or in a staged manner. If they are indeed trained jointly, it is important to clarify how the pseudo-labels are initialized at the beginning of training, particularly in the early iterations when both models may be untrained or poorly calibrated.

    2. Additionally, the efficiency of the proposed method is not discussed in the paper. The framework involves multiple components, including a Vision Transformer and a CNN, two domain discriminators, and several loss branches (supervised, masked domain, and adversarial), which likely leads to increased computational overhead in terms of training time and memory usage. However, no further analysis or reporting on this part is provided.

    3. The proof of Theorem 1 is insufficient as a formal theoretical contribution. It supports the intuition behind the method but lacks the rigor, completeness, and clarity required for a fully formal proof.

    4. The method incorporates adversarial training using GAN-style discriminators to reduce the domain gap between original and masked domains. However, training stability, convergence behavior, and optimization difficulty of GANs are not discussed at all. It is well known that GAN training can be unstable and sensitive to hyperparameters. Without any convergence analysis, training curves, or empirical stability metrics, it is unclear how reliably the adversarial component contributes to performance, or whether it introduces additional instability into the learning process.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While AdvMIM is somewhat interesting and performs well empirically, the lack of methodological clarity, reproducibility challenges, absence of computational analysis, and weak theoretical grounding prevent this paper from reaching the level of a publishable contribution at this time.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thanks to the authors for their responses during the rebuttal. However, the main concerns from the initial reviews are still not properly addressed. Specifically: 1) While GAN training is well studied, this does not mean it is guaranteed to be stable, even with WGAN [1]. The authors refer to an empirical loss observation but do not address the original concern regarding the stability of GAN training, especially considering the added complexity from masked inputs, pseudo-labels, and weighted loss components in the method. 2) The rebuttal does not sufficiently explain how the adopted weighting scheme (i.e., weight = max(C(x))) effectively mitigates the risks associated with noisy or unreliable pseudo-labels, particularly in early training iterations where models are untrained or poorly calibrated (e.g., the performance of C appears to be weak). 3) The authors provide only a sketch of the proof of Theorem 1 in Section 2.3, which does not constitute a rigorous proof. The space limit is not a convincing reason for omitting a proper proof, as the space used for the sketch could have included the key steps instead. Additionally, the response contains questionable justifications, such as removing the absolute sign in Lemma 4 of [3], introducing the parameter γ without explanation (this is also relevant to the concerns raised in original question 1), and selecting a = 0.5 without theoretical or empirical support.

    [1]Mescheder, L., Geiger, A. and Nowozin, S., 2018, July. Which training methods for GANs do actually converge?. In International conference on machine learning (pp. 3481-3490). PMLR.



Review #2

  • Please describe the contribution of the paper

    The main contribution is a new data augmentation method. It combines standard image masking with an adversarial training between labelled and pseudo-labelled data.

    The main idea is that image masking can lead to non-optimal segmentation outputs when trained with pseudo-labels. Therefore, an adversarial loss ensures that the generated segmentation masks are similar when trained on labelled and pseudo-labelled data.

    The authors demonstrate that this adversarial training leads to better segmentation performance when there is only very limited labelled data available.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • strong evaluation: the method has been evaluated on 3 different 2D and 3D open access datasets together with an ablation study of the main constituents of the approach and a sensitivity analysis of main parameters.

    • strong results: the approach is beating state of the art methods by a rather large margin.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • the main contribution is the adversarial masked image modelling. However, the ablation study shows that the improvement using it is very small and the main advantage comes from cross-teaching and image masking which are both well known.

    • the approach has been compared to state of the art methods, but not using any method (as far as I see) that uses masked image modelling. Combining masked image modelling with any of the existing methods or adding specific masked image modelling methods for comparison would make it clearer how much of a delta the proposed adversarial approach adds.

    • the reproducibility is generally ok, but there is very little information about the discriminator and how it is trained, other than that it is a 5-layer network. More information would help.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I found Figure 1 to be the easiest way to understand what the approach is doing, as it clearly shows which inputs and which ground-truth masks are used at each step. The description in the text is, in my opinion, rather complicated and sometimes confusing. The terms unlabelled and pseudo-labelled are used interchangeably; it would have been clearer to call the unlabelled data pseudo-labelled consistently. Also, the abstract and large parts of the introduction focus heavily on learning the masked image domain, which is well known. The summary at the end of the introduction then shows that the contribution of the paper lies uniquely in the adversarial training step. The paper would be clearer and easier to understand if the main contributions were more prominently highlighted in the abstract and introduction.

    The proposed network is trained for 30,000 iterations on rather small datasets. The authors mention using standard data augmentation. It would be interesting to see whether there are problems of overfitting and how the method performs on validation vs. test sets.

    It would also be interesting to know more about the adversarial training stability, as this is usually a major issue in adversarial methods. If space permits, some elaboration on this matter would help.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper presents a novel method which has been well evaluated and which shows very good results compared to the state of the art on the 3 tested datasets when using only very few labelled samples.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My main questions have been adequately addressed in the rebuttal. The authors stated that they will release the source code upon acceptance, which addresses the reproducibility concerns. I think this is a well-evaluated approach that shows good results and would be interesting to the community beyond the application domain presented.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel adversarial masked image modeling (AdvMIM) approach that effectively combines three existing techniques: (1) masked image modeling with transformers, (2) cross-training a transformer using CNN-generated labels on unlabeled data, and (3) adversarial training to align the distribution of unlabeled data with that of labeled data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It successfully combines different approaches that have shown promise in leveraging unlabeled data for segmentation.
    2. The paper is very well written, with sufficient experimentation to support its claims of performance gains.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No major weaknesses

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. In equations 2 and 4, shouldn’t the weight be on the combined loss rather than just the CE component?
    2. Consider adding the cross teaching part to figure 1.
    3. It’ll be interesting to see how the domain adaptation on full image vs masked image vs both perform.
    4. Why wasn’t the ISIC dataset shown in the ablation study?
    5. For the ACDC Dice score, where you see a 10% improvement, it would be nice to see a more in-depth explanation of why that could be the case.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed approach is novel and improves results over the SOTA.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Keeping my original recommendation




Author Feedback

We thank all reviewers for their valuable feedback. We address the main concerns below:

Reproducibility[R1,R3]: We will release our code upon acceptance.

GAN/Adversarial Training [R1,R2]: GAN training is a well-studied area. Building on existing techniques, we use the modified (non-saturating) GAN loss from the original GAN paper to avoid the optimization difficulty of saturation, and we adopt gradient clipping, popularized by the Wasserstein GAN line of work and since widely used, to avoid gradient explosion and ensure training stability. We observe a stable training curve for our method, where the generator loss decreases from 1.5 and converges to 0.45 without instability. We will add these key details to the Implementation part of Sec. 3. As shown in Table 1, the adversarial component is effective. We hope our response clarifies the concern.
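A minimal numerical sketch of the two stabilization choices mentioned above (non-saturating generator loss and global-norm gradient clipping), in plain NumPy rather than the authors' implementation; function names and constants are illustrative assumptions:

```python
import numpy as np

def generator_loss(d_fake, saturating=False):
    """Generator objective for discriminator outputs d_fake in (0, 1).

    The saturating form log(1 - D(G(z))) yields near-zero values (and
    gradients) when D confidently rejects fakes; the modified
    (non-saturating) form -log D(G(z)) keeps the learning signal alive.
    """
    d_fake = np.clip(d_fake, 1e-7, 1 - 1e-7)  # numerical safety
    if saturating:
        return float(np.mean(np.log(1.0 - d_fake)))
    return float(-np.mean(np.log(d_fake)))

def clip_gradients(grads, max_norm=1.0):
    """Global-norm gradient clipping to guard against gradient explosion."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

d_fake = np.array([0.01, 0.02])           # discriminator rejects the fakes
print(generator_loss(d_fake))             # large value -> strong signal
print(generator_loss(d_fake, True))       # near zero -> vanishing signal
```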

R1.

  1. Training Setup. Following CTCT [15] and M-CnT [12], we cross-teach the transformer and CNN jointly. Unlike CTCT and M-CnT, which use the predicted pseudo-labels directly from the start, we design a weighted cross-teaching loss, as described in Sec. 2.1, where the confidence weights alleviate the issue of untrained or poorly calibrated models in the early iterations by assigning low weights to low-quality pseudo-labels. We will further clarify this in Sec. 2.1.
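The confidence weighting described above can be sketched as follows (a hedged illustration of weight = max(C(x)) per pixel; the function name and toy values are assumptions, not the paper's code):

```python
import numpy as np

def weighted_ct_ce(probs_student, probs_teacher):
    """Confidence-weighted cross-teaching cross-entropy (illustrative).

    probs_*: (N, K) per-pixel class probabilities. The teacher's hard
    pseudo-label is its argmax; the per-pixel weight is the teacher's
    max probability, so poorly calibrated early-training predictions
    contribute little. Mirrors the rebuttal's weight = max(C(x));
    details may differ from the paper.
    """
    pseudo = probs_teacher.argmax(axis=1)          # hard pseudo-labels
    weight = probs_teacher.max(axis=1)             # per-pixel confidence
    p = np.clip(probs_student[np.arange(len(pseudo)), pseudo], 1e-7, 1.0)
    return float(np.mean(weight * -np.log(p)))

# Confident teacher pixels dominate; uncertain ones are down-weighted.
teacher = np.array([[0.95, 0.05],    # confident -> weight 0.95
                    [0.55, 0.45]])   # uncertain -> weight 0.55
student = np.array([[0.9, 0.1],
                    [0.5, 0.5]])
print(weighted_ct_ce(student, teacher))
```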

  2. Efficiency. We provide efficiency results on ACDC (3%) on an RTX 3090 GPU (batch size = 16) below:

Method                      CTCT   M-CnT   AdvMIM
Train Memory (GB)           8      >8      13
Train Time (h/30000 iter)   2      >2      4
Inference Memory (GB)       2      2       2
Inference Time (s/volume)   0.4    0.4     0.4

Our method increases training memory by 5 GB and training time by 2 hours compared to the SoTA CTCT. While the M-CnT code is unavailable, its training cost is likely higher due to the additional manifold modeling. The overhead is practical on modern hardware and brings a substantial gain of 10.1 in Dice and 8.7 in HD over the current best method, M-CnT. All methods share the same inference profile as the transformer.

  3. Formal Proof. Due to the space limit and the MICCAI 2025 policy disallowing proofs in the Supplementary Material, we provide a proof sketch in Sec. 2.3 that conveys the core idea of the full proof. We elaborate here for clarification: applying Lemma 4 in [3] with a = 0.5 to distributions P and Q' and removing the absolute sign, we obtain e_P <= 1/2 e_P + 1/2 e_Q' + 1/4 d(P,Q') + 1/2 \lambda*. We then bound e_P and \lambda* by e_P' + \gamma and \lambda + \gamma respectively, where \lambda* is the optimal error on (P, Q') and \lambda is that on (P, Q). Note we omit the intermediate triangle-inequality steps used to bound e_P and \lambda*. After rearrangement, we derive the final bound. The main challenge is introducing the noise ratio into the upper bound. We acknowledge a minor typo in the theorem, where \lambda should be on P and Q, not Q'; we will correct this. We hope our response clarifies the concern.
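The inequality chain sketched in that response can be restated in one place (only what the rebuttal itself asserts; symbols as in the paper, with the triangle-inequality steps still omitted):

```latex
% Lemma 4 of [3] with a = 0.5 applied to P and Q', absolute sign removed:
e_P \le \tfrac{1}{2} e_P + \tfrac{1}{2} e_{Q'} + \tfrac{1}{4}\, d(P, Q') + \tfrac{1}{2} \lambda^{*}
% Intermediate bounds (triangle-inequality steps omitted):
e_P \le e_{P'} + \gamma, \qquad \lambda^{*} \le \lambda + \gamma
% \lambda^{*}: optimal joint error on (P, Q'); \lambda: optimal joint error on (P, Q).
```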

R2:

  1. Main contributions. Original masked image modeling (MIM) reconstructs masked patches; one novel aspect of our work is adapting MIM to predict the full segmentation mask from the visible patches. Another contribution is our theorem, where we treat all masked images as a masked domain and propose an adversarial masked-domain adaptation loss to bridge the domain gap. We also extend our technique to the CNN and perform extensive experiments to evaluate our method, which outperforms SoTA methods significantly.

  2. Other MIM methods. MIM has mostly been used as a pre-training method. To the best of our knowledge, we are the first to investigate masked image modeling for semi-supervised medical image segmentation and to demonstrate a significant performance boost.

  3. Writing. We will further proofread and improve the paper.

R3:

  1. Weight. As the weight is pixel-wise, we apply it to the CE term only; the Dice loss is calculated over all pixels for each class. We hope this clarifies.

  2. Figure. Thanks for the advice; we will consider it.

  3. ISIC dataset. We use ISIC in the comparison study but do not include it in the ablation study, as we have already ablated on the other two datasets. We hope this clarifies.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two reviewers recommended acceptance after their initial concerns were adequately addressed in the rebuttal. However, the third reviewer still raises issues related to GAN training stability, the effectiveness of the proposed weighting scheme in handling noisy labels, and the completeness of the theoretical proof. The meta-reviewer believes these concerns can be resolved in the camera-ready version and strongly encourages the authors to address them accordingly.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have not adequately addressed the concerns raised by Reviewer 1. The absence of comparisons with foundation model–based segmentation approaches (e.g., SAM variants) limits the overall credibility of the results.


