Abstract
The segmentation of non-salient objects in medical images plays a crucial role in the early detection and diagnosis of diseases. However, due to the low contrast and unbalanced distribution of non-salient objects, their feature extraction still suffers from dimensional collapse. To address the inherent feature representation challenges of non-salient objects, we propose a pre-trained Multi-Granularity Masked AutoEncoder (MG-MAE) framework with diversified feature learning capabilities. At the global level, masked image reconstruction captures holistic structural and contextual features. At the local level, patches are extracted from the globally visible patches, and the Histogram of Oriented Gradient (HOG) features of these patches are then reconstructed to enhance texture details. Building on this local perception, the framework integrates a Nuclear Norm Maximization (NNM) constraint to foster diversity in the local representations during feature encoding. In the HOG reconstruction process, the framework also adopts a Dynamic Weight Adjustment (DWA) strategy that assigns greater reconstruction weights to challenging image patches, thereby mitigating the representation bias towards salient objects. We evaluate our method on a private dataset, CCTA139, and two public datasets, BTCV and LiTS. Our method achieves DSC scores of 80.71%, 82.60%, and 71.77%, respectively, surpassing the performance of current state-of-the-art methods. The code is available at https://github.com/zhangbbin/mgmae.
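For readers who want a concrete picture of the training objective described above, the following is a minimal, illustrative PyTorch-style sketch of a multi-granularity masked-autoencoder step. It is not the authors' implementation: the callables (encoder, pixel_decoder, hog_decoder, hog_target) and the specific weighting and NNM formulations are hypothetical placeholders, and the paper's exact DWA rule (Eq. 6) is not reproduced here.

# Minimal, illustrative sketch of an MG-MAE-style pre-training loss (PyTorch).
# All callables below are hypothetical placeholders, not the authors' code.
import torch
import torch.nn.functional as F

def mg_mae_loss(encoder, pixel_decoder, hog_decoder, hog_target,
                visible_patches, masked_pixels, local_patches,
                lambda_hog=1.0, lambda_nnm=0.01):
    # Global branch: encode visible patches, reconstruct masked voxels.
    latent = encoder(visible_patches)                  # (B, N_vis, D)
    pred_pixels = pixel_decoder(latent)                # (B, N_mask, P)
    loss_global = F.mse_loss(pred_pixels, masked_pixels)

    # Local branch: regress HOG descriptors of sub-patches drawn from
    # the visible patches (hog_target precomputes the descriptors).
    pred_hog = hog_decoder(latent, local_patches)      # (B, N_loc, H)
    target_hog = hog_target(local_patches)             # (B, N_loc, H)
    per_patch_err = (pred_hog - target_hog).pow(2).mean(dim=-1)

    # Dynamic weighting: harder patches get larger (but bounded) weights.
    # Illustrative only -- the paper's Eq. 6 log term is not reproduced here.
    weights = torch.log1p(per_patch_err.detach())
    loss_hog = (weights * per_patch_err).mean()

    # NNM: encourage diverse, non-collapsed features by maximizing the
    # nuclear norm (sum of singular values), i.e. minimizing its negative.
    feats = latent.flatten(0, 1)                       # (B*N_vis, D)
    loss_nnm = -torch.linalg.svdvals(feats).sum() / feats.shape[0]

    return loss_global + lambda_hog * loss_hog + lambda_nnm * loss_nnm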
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0889_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/zhangbbin/mgmae
Link to the Dataset(s)
CCTA139: private
BTCV: https://www.synapse.org/#!Synapse:syn3193805/files/
LiTS: https://competitions.codalab.org/competitions/17094#participate
BibTex
@InProceedings{ZhaBin_NonSalient_MICCAI2025,
author = { Zhang, Bin and Ruan, Dongsheng and Qi, Ronghui and Xu, Chenchu and Zhang, Yanping and Yu, Chengjin and Xu, Lei and Wang, Rui},
title = { { Non-Salient Object Segmentation in Medical Images via Pre-trained Multi-Granularity Masked Autoencoders } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15961},
month = {September},
pages = {408--418}
}
Reviews
Review #1
- Please describe the contribution of the paper
N/A.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
N/A.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
This paper explores MAE for improving medical image segmentation, especially for non-salient objects. Specifically, HOG features are combined, and DWA is used to enhance the learning of challenging patches.
Weakness:
- First of all, it is not reasonable to use MAE to enhance the segmentation of non-salient objects, especially tiny tumors and lesions. When tumors are masked out, without annotations for guidance, MAE cannot reconstruct them; it mainly reconstructs the background and the organs that occupy larger regions. The authors did not show any visualization results for reconstructing tumors and lesions (in fact, no reconstruction results are provided at all). The motivation is flawed from the outset.
- The use of HOG features for MAE is directly borrowed from [11]. The reason/motivation for using HOG on medical images is not clear and makes little sense.
- Lack of comparisons with state-of-the-art pre-training methods. The authors only compare their method with MAE-based methods.
- In addition, how can you compare FocusMAE [25], a method for ultrasound images, on CT datasets? The differences between these two modalities are large. Moreover, how SG-MAE [26] is applied to CT datasets is not explained. I am sorry, but I highly doubt the experimental results in this paper.
- For non-salient objects in medical images, I believe most researchers are interested in tumor/lesion datasets. However, the authors only evaluate their method on liver tumors. Overall, the motivation and method are not technically sound, the experiments are far from comprehensive, and the results are questionable. I am sorry to say that I think it would be better if the authors rethought this work and prepared it for the next submission.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I am confident in my comments. The motivation and method are not technically sound.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The authors did not address my concerns in the rebuttal.
- Visualization of reconstruction results is important for MAE-based methods. The authors did not present any in the paper and skipped my question in the rebuttal.
- Quoting another work (GLMAE) to prove the effectiveness of your own method is not very convincing. GLMAE also mainly focuses on organ segmentation. In my view, GLMAE did not solve the problem of lesions/tumors and is also far from the state of the art.
- The usage of HOG features is borrowed from previous works, and the authors skipped answering my concern. No results or visualizations prove that HOG can solve the non-saliency problem, especially for lesions and tumors.
- The comparisons in the current version are far from enough. Comparing only MAE methods is not reasonable.
Review #2
- Please describe the contribution of the paper
To make full use of local features in non-salient object segmentation, this manuscript proposes a multi-granularity MAE. Based on the vanilla MAE framework, MG-MAE adds a branch that reconstructs the HOG features of masked local patches. NNM is introduced to promote diverse feature representations, and DWA is introduced to adjust the learning weights. Experiments on both private and public datasets for non-salient objects are conducted to evaluate the performance of MG-MAE. Related methods, including supervised and unsupervised methods, are included in the comparison. The results show that MG-MAE achieves the best DSC on all datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
MG-MAE adds a new branch to MAE that reconstructs the HOG features of local patches, making the encoder capture more local information and thereby enhancing feature extraction for non-salient objects.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Some details are not provided as listed in the comments below.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Some of the related works were proposed for 2D tasks. How did you re-implement them for comparison?
- In Table 2, highlighting the second best in another format would be better.
- Can you explain more about the function of the log term in Equation 6, although a reference is given?
- The encoder is shared by the global and local branches, so what are their patch sizes, respectively?
- Enforcing the encoder to capture HOG features seems to bring some benefits to non-salient objects. Do you think it will bring the same improvement to salient object segmentation?
- HOG is a handcrafted feature that contains the gradient information of the original image. Do you think other types of features could bring the same effect? What types of features do you think they would be?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The manuscript introduces a local branch that reconstructs the HOG features of patches so that the encoder captures more local details of non-salient objects, and the results show some improvement over other methods.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I think it can be accepted based on the reported experimental results.
Review #3
- Please describe the contribution of the paper
This paper proposes a pre-training framework based on a Multi-Granularity Masked AutoEncoder. The network simultaneously reconstructs original images at the global level and Histogram of Oriented Gradient features at the local level. Nuclear Norm Maximization is adopted to foster diversity in the local representations. The authors also present a Dynamic Weight Adjustment strategy to assign greater weights to challenging patches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well written and organized.
- The authors introduce several novel and effective mechanisms to refine the segmentation results, including Nuclear Norm Maximization, a Dynamic Weight Adjustment strategy, and Histogram of Oriented Gradient features.
- Comprehensive ablation studies demonstrated the effectiveness of this framework.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors should compare against more baseline methods in both supervised and self-supervised learning. For example, UNETR may not be the SOTA method in medical segmentation. Also, other pre-training methods, such as contrastive learning, should be included in the comparison.
- It would be better to compare the computational cost of MG-MAE with that of the original MAE.
- The code is expected to be public.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors propose a novel pre-training framework for medical image segmentation. Comprehensive experiments have been conducted to support the effectiveness of this network.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have answered most of the questions the reviewers raised. The paper should be accepted due to its good quality.
Author Feedback
We sincerely thank the reviewers for their valuable feedback. R1 appreciates our work for "enhancing the feature capture of non-salient objects". R2 points out key issues in the motivation and comparison studies. R3 acknowledges that our work "promote[s] some novel and effective mechanism[s]" with "comprehensive ablation studies". The main concerns involve HOG, comparisons, implementation details, and some misunderstandings of our motivation.
- Motivation [R2]: 1) "MAE is unreasonable for non-salient objects"; 2) "I believe most researchers are interested in tumor/lesion datasets".
A: 1) We clarify that our motivation lies in incorporating a local branch with NNM and DWA to improve MAE's perception of non-salient objects when tiny lesions are partially masked, thereby addressing MAE's oversight of local details caused by dimensional collapse. The experimental results in Table 1 validate MAE's effectiveness on non-salient objects. Moreover, the compared GL-MAE is also applied to non-salient objects (tiny organs), and its improved version (published in TMI, May 14, 2025) further demonstrates the efficacy of MAE-based models on small tumors/lesions. 2) Tumor/lesion datasets are undoubtedly important. Nevertheless, we chose CCTA139 (coronary artery), BTCV (abdominal organs), and LiTS (liver tumor) because their diverse characteristics better showcase our model's generalization. Additionally, coronary artery and abdominal organ studies remain central to the MICCAI community.
- HOG [R1][R2]: 1) "motivation of HOG", "role of HOG in salient objects"; 2) "other feature types".
A: 1) We have added the motivation for HOG: non-salient objects exhibit ambiguous boundaries due to indistinct visual features, so we adopt HOG for its edge-aware local shape modeling capability (validated in Fig. 3(f)). In fact, we find that HOG also improves the segmentation of salient objects (e.g., the liver). 2) We believe that MHOG, EHD, SIFT, Gabor, and other descriptors may further enhance performance. Due to the page limit, we only selected HOG for evaluation; we will add more feature types in the future journal version.
- Comparisons: 1) "lack of SOTA pre-training methods, such as CL methods" [R2, R3] and "UNETR is not a SOTA method" [R3]; 2) "computation costs" [R3]; 3) "highlighting the second best in Table 2" [R1].
A: 1) Our method is based on MAE, and most MAE-based works (e.g., GL-MAE) are generally reported to outperform CL methods on medical image datasets. Thus, we only chose MAE-based works for comparison. We thank R2 and R3 for the valuable suggestion and will compare our method with more SOTA pre-training methods in the future journal version. The UNETR architecture is inherently adopted by most 3D MAE works, so it is also selected as a compared method. 2) We have added a discussion of computational cost: our method does increase the computational cost after adding the local branch; thank you for raising this crucial issue. 3) The second-best results will be underlined, as suggested by R1.
- Implementation details [R1, R2, R3]: 1) "lack of details and available code"; 2) "related works, such as FocusMAE and SG-MAE"; 3) "the log term in Eq. 6".
A: 1) We have added more implementation details, e.g., the patch size in the global branch is 16×16×16 and the patch size in the local branch is 8×8×8. Our code has been uploaded to GitHub and will be released upon paper acceptance. 2) We have addressed the MAE-based 2D variants (LoMaR, R-MAE, SG-MAE) and FocusMAE. We implement the 2D variants by adapting their core strategies to a 3D MAE. For FocusMAE, we use a pre-trained region proposal network to identify high-information regions and an auxiliary network to generate masking probabilities. For SG-MAE, we compute attention maps by dot-product attention and mask the top-k regions with the highest aggregated importance scores. 3) The log term in Eq. 6 regulates the weight adjustment: when Si→0, it prevents over-weighting of salient objects; when Si→1, it assigns moderate weights to stabilize the focus on non-salient objects.
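To make the HOG targets of the 8×8×8 local patches more concrete, below is a small illustrative NumPy sketch of a HOG-like gradient-orientation histogram for a 3D patch. This is a simplification under stated assumptions (a single pooled, unsigned in-plane orientation histogram per patch), not the paper's exact descriptor; the function and variable names are hypothetical.

# Illustrative HOG-like descriptor for a 3D patch (NumPy); not the paper's exact formulation.
import numpy as np

def patch_orientation_histogram(patch, n_bins=9):
    # Gradients along the three axes of the patch (e.g., an 8x8x8 crop).
    gz, gy, gx = np.gradient(patch.astype(np.float32))
    magnitude = np.sqrt(gx**2 + gy**2 + gz**2)
    # Unsigned in-plane orientation in [0, pi), as in classic HOG.
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    # Magnitude-weighted orientation histogram pooled over the whole patch.
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, np.pi),
                           weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-6)   # L2-normalize

# Example: descriptor targets for all 8x8x8 sub-patches of a 32x32x32 crop.
volume = np.random.rand(32, 32, 32)
targets = [patch_orientation_histogram(volume[z:z+8, y:y+8, x:x+8])
           for z in range(0, 32, 8)
           for y in range(0, 32, 8)
           for x in range(0, 32, 8)]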
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is a borderline submission. I am inclined to recommend rejection due to the limited clinical applicability of the proposed approach. Additionally, the lack of comparisons with foundation model–based segmentation methods (e.g., SAM) undermines the credibility and completeness of the evaluation.