Abstract
Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M3HL) to address the aforementioned challenges, which consists of two key components: 1) M3: An enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: A hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at https://github.com/PHPJava666/M3HL.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1551_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/PHPJava666/M3HL
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LiuYaj_M3HL_MICCAI2025,
author = { Liu, Yajun and Zhang, Zenghui and Yue, Jiang and Guo, Weiwei and Li, Dongying},
title = { { M3HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15961},
month = {September},
pages = {313--322}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces M3HL, a semi-supervised learning framework that leverages both data augmentation and hierarchical consistency.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The paper is clearly written and well-structured.
2) The authors provide open-source code, which enhances the reproducibility and accessibility of their method.
3) Experiments demonstrate that the method outperforms previous approaches on standard benchmarks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) In the current description of M3, it is unclear how the method ensures that augmented samples effectively cover the target area. The process of generating pseudo-labels and the formulation of the associated loss function, particularly in scenarios where the target area may not be sampled, require further clarification and justification.
2) To demonstrate the effectiveness and uniqueness of M3, the method should be evaluated against other data augmentation methods substituted in place of M3, while keeping the rest of the framework unchanged.
If you could provide more detailed information, I would consider improving the score.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The lack of detailed information on the proposed M3.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The authors seem to have misunderstood my question. I meant to ask whether, to validate the effectiveness of the proposed Mutual Mask Mix method, comparisons should be conducted by substituting other data augmentation approaches for M3 while maintaining the same baseline model.
Review #2
- Please describe the contribution of the paper
1) Dynamic Mutual Mask Mixing (M3): A strategy that overcomes the rigid data mixing limitations of existing methods by introducing an adjustable random mask generator with tunable mask block size and ratio. 2) Hierarchical High-Low Level Feature Consistency (HL): A framework designed to enforce consistency on different feature levels, which enhances segmentation performance with pseudo-labels.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The overall framework design is clear and logical, first addressing the issue of scarce labeled data through the M3 strategy for generating pseudo-labels and then utilizing the HL framework to ensure the consistency of features across different levels. The proposed method shows significant performance improvements in complex medical image segmentation tasks, as demonstrated through comparative and ablation experiments.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) Unsubstantiated Claims in the Introduction: The introduction claims that the method is “compelling the model to develop a more comprehensive understanding of anatomical structures” and that it “filters out outlier noise in pseudo-labels through hierarchical feature calibration.” However, these claims are not sufficiently supported by experimental results in the main text. It is crucial to explain how the experimental findings align with these stated objectives.
2) Lack of Discussion on Mask Patch Size and Ratio: In the experimental section, the paper shows that the performance is best when the mask patch size is 64 and the mask ratio is 50%. However, there is no in-depth discussion on why these particular values were chosen or why they led to optimal performance. This aspect should be thoroughly discussed, explaining the underlying reasons behind the choice of these hyperparameters.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Minor Issues:
1) Concerns Regarding Learning Rate:
The training process uses a very small learning rate (10e-9) for 30k iterations with a batch size of 24. Given the small value of this learning rate, there is a concern about the model’s convergence and training effectiveness. It would be helpful to provide additional justification for why such a small learning rate was chosen and how it impacts the training dynamics and final performance.
2) Potential for Improvement in HL Framework: The HL framework currently aligns only the highest-level and lowest-level features, shown by the figure. The question arises as to whether this strategy is optimal, or whether aligning all feature layers could lead to better results. Further investigation into the impact of aligning features at multiple levels of the network would be valuable.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Theoretical contribution
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Same decision as in the last round.
Review #3
- Please describe the contribution of the paper
This paper introduces M3HL, a semi-supervised segmentation framework for medical imaging that integrates Mutual Mask Mixing and high-low feature consistency constraints. The architecture leverages two learning branches to facilitate information exchange and improve mask quality, while enforcing intra- and inter-consistency losses to promote structural stability. Experiments on three public datasets demonstrate competitive results, indicating the framework’s potential for label-efficient segmentation tasks in clinical scenarios.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) Innovative Architecture: The proposed dual-branch framework with Mutual Mask Mixing (MMM) introduces a novel and creative approach to semi-supervised medical image segmentation.
2) High-Low Feature Consistency Constraint: The introduction of intra- and inter-consistency losses to align low- and high-level feature maps enhances the structural coherence of segmentation results and improves learning stability.
3) Strong Practical Relevance: The paper addresses the important and realistic scenario of limited annotation availability, making it highly relevant to clinical applications where labeled data are scarce.
4) Evaluation on Multiple Datasets: The method is tested on three public datasets (ACDC, Prostate, and Fundus), demonstrating the authors’ efforts to ensure generalizability across anatomical and imaging variations.
5) Competitive Quantitative Results: The framework outperforms several strong baselines (e.g., UA-MT, CPS, DTC), confirming its efficacy and potential for broader adoption in semi-supervised settings.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) No Statistical Validation of Results: The absence of statistical significance testing (e.g., p-values or confidence intervals) makes it impossible to determine whether observed improvements are meaningful or within the margin of error, especially when differences are small across datasets.
2) Lack of Ablation Studies: The paper proposes several key components (MMM module, intra- and inter-consistency losses), but fails to conduct any component-wise or ablation experiments. This makes it unclear which parts of the architecture are most critical to performance.
3) Underdetailed Methodology: Several methodological details are insufficiently explained. The structure and functioning of the Mutual Mask Mix strategy, the definition and enforcement of consistency losses, and the pseudo-labeling process are either vague or entirely missing.
4) Reproducibility Concerns: No code, training scripts, or reproducibility checklist are provided. Combined with vague methodological descriptions, this severely limits reproducibility and makes independent validation by the community difficult.
5) Limited Realism in Semi-Supervised Setting: The model is only evaluated in clean, partially labeled data settings. There is no experimentation under conditions with noisy, weak, or clinically inconsistent labels, which are far more reflective of real-world clinical scenarios.
6) Dataset Scope is Too Narrow: The paper uses only three datasets with limited modality diversity (no CT or ultrasound). The anatomical and imaging variance is insufficient to establish broad generalizability across clinical applications.
7) Missing Comparisons with Recent Methods: Competitive analysis lacks strong recent baselines published in 2023–2024. Omission of diffusion-based and prompt-guided semi-supervised approaches leaves a gap in comparative positioning.
8) Lack of Qualitative Error Analysis: The paper does not analyze segmentation failures, outlier predictions, or examples where the model underperforms. Such insights are critical to understand model limitations and failure modes.
9) No Uncertainty Estimation or Calibration Metrics: In a clinical context, uncertainty quantification is essential. The paper does not report entropy maps, calibration curves, or any form of predictive confidence assessment.
10) No Exploration of Training Stability: There is no analysis of training dynamics (e.g., convergence rate, robustness to random seeds). Given the multi-branch architecture and multiple loss terms, it is unclear how stable the model is across different training runs.
11) Limited Theoretical Justification for Consistency Design: The dual consistency losses are motivated intuitively but lack rigorous theoretical or empirical justification. No analysis is provided on why both intra- and inter-consistency constraints are required or synergistic.
12) Weak Clinical Interpretation and Deployment Considerations: The paper focuses heavily on numerical benchmarks but does not translate the findings into clinical context. It lacks discussion on how such models could be integrated into real-world workflows, and what challenges or benefits that might entail.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Dear Authors, Thank you for your submission. Your work proposes an interesting dual-branch semi-supervised segmentation framework incorporating Mutual Mask Mixing and consistency losses. While the approach has potential, I would like to offer several critical questions and comments aimed at deepening the clarity and rigor of the paper:
- Can you provide a detailed schematic or pseudo-code to clarify the implementation of the Mutual Mask Mixing mechanism? The current description is too abstract to fully understand how masks are generated, exchanged, and reintegrated.
- How exactly are pseudo-labels generated and updated during training? Are they filtered based on confidence thresholds or consistency across views?
- Are intra- and inter-branch consistency losses applied symmetrically? How are they weighted during training? Is there a dynamic ramp-up schedule?
- What backbone is used in the dual-branch network? Are both branches identical? Is there any form of parameter sharing?
- What augmentations are used for labeled vs. unlabeled images? How do you ensure consistency across augmentations when computing the consistency losses?
- Why was an ablation study not performed? Can you quantify the contribution of each module (MMM, intra-consistency, inter-consistency) independently?
- Have you tested whether the observed improvements are statistically significant (e.g., p-values, CI)? This would strengthen the validity of your claims.
- How robust is the model across different random seeds? Any standard deviation or confidence interval across runs?
- Can you show examples of failure cases or provide qualitative error maps? Understanding where the model fails is as important as where it succeeds.
- Would this model generalize well to CT or ultrasound images? Have you tested it beyond the three MRI/fundus datasets?
- What is the theoretical basis for combining intra- and inter-branch consistency? Can their synergy be formally justified or visualized?
- Did you compare your approach with other popular semi-supervised strategies such as entropy minimization, contrastive regularization, or consistency training with perturbations?
- A dedicated limitations section is missing. Please explicitly discuss potential pitfalls, such as dependence on mask quality, sensitivity to pseudo-label errors, or poor generalization to out-of-distribution data.
- Is your model capable of estimating predictive uncertainty? This is important for clinical deployment and model confidence calibration.
- How would you envision integrating such a model into a clinical segmentation pipeline? What would be the added value compared to current semi-supervised tools?
These comments aim to help you strengthen the clarity, rigor, and potential impact of your work. The proposed method is promising but would benefit significantly from more comprehensive validation, interpretability, and clinical insight.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a Weak Accept (Score: 4) for this submission. The paper presents an original semi-supervised segmentation framework using a dual-branch architecture with Mutual Mask Mixing (MMM) and high-low level consistency constraints. These components are novel and conceptually interesting, particularly the attempt to jointly enforce intra- and inter-branch consistency in unlabeled data. The method demonstrates strong empirical results across three public datasets and shows clear promise for real-world applications in data-scarce environments. The inclusion of an anonymized code link is commendable and supports reproducibility. However, several important limitations prevent a higher score at this stage. The lack of ablation studies, missing statistical validation, and vague descriptions of critical modules (e.g., pseudo-labeling, loss balancing, and training dynamics) weaken the methodological clarity and experimental rigor. Moreover, the paper does not sufficiently explore model limitations, generalization across modalities, or clinical deployment feasibility. In summary, this paper introduces valuable ideas with potential impact, but it requires a strong rebuttal that includes clearer methodological descriptions, additional analysis (e.g., ablations, uncertainty, limitations), and stronger justification for design choices. If addressed, this work could merit full acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have provided a thoughtful and well-organized rebuttal that addresses the majority of the key concerns raised by all reviewers. They clarified the architecture and mechanism of the Mutual Mask Mix (M3) module, explained the design and role of intra- and inter-feature consistency losses, and provided sufficient detail about the model’s training setup, pseudo-labeling strategy, and evaluation protocols. Notably, the authors:
- Offered a step-by-step explanation of how the M3 module operates and how it differs from prior augmentation methods such as BCP.
- Justified the choice of hyperparameters (patch size, mask ratio, loss weights), and corrected a critical typo in the learning rate.
- Clarified the update mechanism for the teacher model, the backbone architectures used for different datasets, and the consistency loss implementation.
- Referenced an ablation study already present in the paper (Table 2), which quantifies the contribution of the M3 and HL components, partially addressing concerns about component-level analysis.
- Responded to questions about training stability, error visualization, and generalization across modalities, acknowledging limitations and pointing to future extensions.
- Compared the proposed method against competitive and recent baselines (e.g., BCP, OMF, ABD, DiffRect, VCLIPSeg) and demonstrated superior performance.
While certain theoretical aspects (e.g., the synergy between intra- and inter-consistency) and broader generalization (e.g., to CT/US) remain to be validated more thoroughly in future work, the authors have shown a clear understanding of the raised concerns and provided practical, relevant answers. The proposed M3HL framework remains novel, relevant, and empirically strong, with demonstrated improvements across three datasets. Given these clarifications, the paper now presents itself as a solid contribution to the MICCAI community, particularly in the area of semi-supervised learning for medical image segmentation.
Author Feedback
We appreciate the reviewers’ feedback. Following the rebuttal guidelines, we are unable to add new experiments in the responses.

*Unsubstantiated Claims (R1): Claim 1: M3 generates unique mixed images each time, disrupting fixed anatomical patterns in medical images. Compared to BCP’s patch-copy-paste fusion, this dynamic recombination enhances diversity, promoting robust feature learning. Claim 2: The low-level consistency uses L1 distance to align local edge features, reducing localized noise in pseudo-labels, while the high-level consistency uses cosine similarity to maintain semantic coherence and mitigate misaligned pseudo-labels.

*Mask Patch Size and Ratio (R1): A 64×64 patch size (1/16 of 256×256) preserves sufficient anatomical context, while a 50% mask ratio balances labeled and unlabeled data fusion. Smaller patches (32×32) caused fragmented features, and larger patches (128×128) reduced sample diversity (see Fig 4).

*Minor Issues (R1): We apologize for the typo; the correct learning rate is 10e-3, per standard SGD settings in SSMIS, and will be corrected in the final version. Aligning only the highest- and lowest-level features in the HL framework captures the key local and global features at minimal computational cost compared to aligning all layers.

*Methodological Clarity (R1&R2&R5): M3 consists of three steps (sketched in code after this feedback): (1) Mask generation: a binary mask with 64×64 patches covers 50% of the 256×256 grid, generated independently for each sample. (2) Exchange: the mask M selects regions from X_u^a (where M=1) and X_l^a (where M=0), ensuring mutually exclusive regions. (3) Reintegration: mixed images are formed via element-wise operations, preserving the channel dimension, with labels mixed in the same way. This process partitions and recombines two images, segmenting and reassembling target regions to ensure comprehensive sampling without leaving any region unsampled. M3 extends BCP by mixing labeled and unlabeled data, allowing the teacher to generate robust pseudo-labels without confidence filtering. The teacher’s parameters are updated via EMA from the student, which uses the same architecture: 3D V-Net for LA and 2D U-Net for ACDC. The consistency loss weight λ=0.5 and the equal summation of L_{high} and L_{low} into L_{HL} were set empirically, prioritizing the impact of M3 and HL consistency over detailed λ tuning. Both labeled and unlabeled data undergo standard augmentations (rotation, flipping), consistent with baselines. L_{low} and L_{high} leverage multi-view learning to align local and global features, and their combined effect in L_{HL} is evident in Tab 2. The ablation studies in Tab 2 confirm the contributions of M3 (L_{mix}) and HL consistency (L_{HL}).

*Comparative Evaluation (R2&R5): M3HL was compared with top 2023–2024 methods (BCP, OMF, ABD) in Tab 1, each of which uses a distinct augmentation strategy: BCP copies patches, OMF swaps foregrounds, and ABD transfers high-confidence regions. M3HL consistently outperforms these methods and also surpasses diffusion-based DiffRect and prompt-guided VCLIPSeg (MICCAI’24).

*Training Stability, Error Analysis & Generalization (R2): For fair comparison, a fixed seed (1447) was used, consistent with baseline methods, without additional testing across different seeds. Fig 3 highlights minor segmentation gaps (red regions) where M3HL, despite outperforming baselines, shows slight incompleteness compared to GT, indicating areas for refinement. Nonetheless, M3HL demonstrates SOTA performance on the 2D ACDC and 3D LA datasets, suggesting strong potential for generalization to other modalities such as CT and ultrasound.

*Uncertainty Estimation & Clinical Relevance (R2): Explicit uncertainty estimation is omitted, as Dice, Jaccard, HD95, and ASD together assess accuracy, overlap, boundary precision, and surface alignment. Clinically, M3HL can serve as a pre-processing tool, generating initial segmentations from limited labeled data for clinician refinement, outperforming existing SSMIS methods across both 2D and 3D images.
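For concreteness, the following is a minimal sketch of the three M3 steps described in the feedback (mask generation, exchange, reintegration), assuming PyTorch tensors and one-hot labels. The function names, shapes, and the exact complementary pairing are illustrative assumptions, not the authors’ released code; see the linked repository for the actual implementation.

```python
import torch
import torch.nn.functional as F

def random_patch_mask(img_size=256, patch=64, ratio=0.5, device="cpu"):
    # Binary mask on a (img_size // patch)^2 grid; `ratio` of the cells are 1.
    g = img_size // patch                            # grid side: 256 // 64 = 4
    n = g * g                                        # number of cells: 16
    k = int(n * ratio)                               # cells masked: 8 at a 50% ratio
    flat = torch.zeros(n, device=device)
    flat[torch.randperm(n, device=device)[:k]] = 1.0
    mask = flat.view(1, 1, g, g)
    # Upsample the coarse grid to pixel resolution with hard (nearest) edges.
    return F.interpolate(mask, scale_factor=patch, mode="nearest")

def mutual_mask_mix(x_l, y_l, x_u, y_u_pseudo):
    # x_l, x_u: images (B, C, H, W); y_l, y_u_pseudo: one-hot labels (B, K, H, W).
    m = random_patch_mask(img_size=x_l.shape[-1], device=x_l.device)
    # Exchange: M selects regions from the unlabeled image (M=1) and the
    # labeled image (M=0); the second mixture swaps the roles, so the pair
    # is spatially complementary.
    x_mix1 = x_u * m + x_l * (1.0 - m)
    x_mix2 = x_u * (1.0 - m) + x_l * m
    # Reintegration: labels / pseudo-labels are mixed with the same mask.
    y_mix1 = y_u_pseudo * m + y_l * (1.0 - m)
    y_mix2 = y_u_pseudo * (1.0 - m) + y_l * m
    return (x_mix1, y_mix1), (x_mix2, y_mix2)
```

Because the two mixtures use M and its complement, every pixel of both source images appears in exactly one of the mixed images, matching the feedback’s claim that no region is left unsampled.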
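Likewise, a hedged sketch of the HL consistency terms and EMA teacher update as described in the feedback: an L1 term on low-level features, a cosine-similarity term on high-level features, and their equal summation into L_{HL} with weight λ=0.5. Feature shapes and the EMA momentum value are assumptions, not details confirmed in the paper.

```python
import torch
import torch.nn.functional as F

def hl_consistency(low_a, low_b, high_a, high_b, lam=0.5):
    # Low-level term: L1 distance aligns local/edge features (B, C, H, W).
    l_low = F.l1_loss(low_a, low_b)
    # High-level term: 1 - cosine similarity over the channel dimension,
    # averaged spatially, keeps the global semantics coherent.
    l_high = (1.0 - F.cosine_similarity(high_a, high_b, dim=1)).mean()
    # Equal summation into L_HL, scaled by the consistency weight lambda = 0.5.
    return lam * (l_low + l_high)

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # Teacher parameters track the student via an exponential moving average;
    # both models share the same architecture (V-Net for LA, U-Net for ACDC).
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```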
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
I agree with the first and second reviewers that this paper is acceptable; however, I would like to point out that the issues raised by the last reviewer should also be fixed in the next version.