Abstract

Recent advances in Masked Autoregressive (MAR) models highlight their ability to preserve fine-grained details through continuous vector representations, making them highly suitable for tasks requiring precise pixel-level delineation. Motivated by these strengths, we introduce MARSeg, a novel segmentation framework tailored for medical images. Our method first pre-trains a MAR model on large-scale CT scans, capturing both global structures and local details without relying on vector quantization. We then propose a Generative Parallel Adaptive Feature Fusion (GPAF) module that effectively unifies spatial and channel-wise attention, thereby combining latent features from the pre-trained MAE encoder and decoder. This approach preserves essential boundary information while enhancing the robustness of organ and tumor segmentation. Experimental results on multiple CT datasets from the Medical Segmentation Decathlon (MSD) demonstrate that MARSeg outperforms existing state-of-the-art methods in terms of Dice Similarity Coefficient (DSC) and Intersection over Union (IoU), confirming its efficacy in handling complex anatomical and pathological variations. The code is available at https://github.com/Ewha-AI/MARSeg.
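As a rough illustration of the parallel spatial and channel attention described in the abstract, the following minimal sketch shows one way encoder and decoder latents could be fused in parallel. This is a hedged, parameter-free toy (the real GPAF module is learned); `parallel_attention_fuse`, `enc`, and `dec` are illustrative names, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def parallel_attention_fuse(enc, dec):
    """Toy sketch of parallel spatial + channel attention fusion.

    `enc` and `dec` are [C][H*W] nested lists of latent features from the
    pretrained encoder and decoder. The actual GPAF module uses learned
    attention; here the gates are simple parameter-free reductions.
    """
    N = len(enc[0])
    # Concatenate encoder and decoder features along the channel axis.
    x = enc + dec                                   # [2C][N]
    # Channel attention: one gate per channel from its spatial mean.
    ch_gate = [sigmoid(sum(row) / N) for row in x]
    # Spatial attention: one gate per position from its channel mean.
    sp_gate = [sigmoid(sum(x[c][i] for c in range(len(x))) / len(x))
               for i in range(N)]
    # Apply both attention branches in parallel and average their outputs.
    return [[0.5 * (ch_gate[c] * v + sp_gate[i] * v)
             for i, v in enumerate(row)]
            for c, row in enumerate(x)]

enc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # C=2 channels, N=3 positions
dec = [[0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
fused = parallel_attention_fuse(enc, dec)
```

The key design point mirrored here is that the spatial and channel branches operate side by side on the same concatenated features, rather than sequentially, so neither view suppresses the other before fusion.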

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5400_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Ewha-AI/MARSeg

Link to the Dataset(s)

We utilize the Medical Segmentation Decathlon (MSD) dataset for training and evaluating the MARSeg model. The MSD dataset is a comprehensive collection of medical imaging data, covering multiple organs and modalities with expertly annotated segmentation labels. http://medicaldecathlon.com/

BibTex

@InProceedings{HwaJeo_MARSeg_MICCAI2025,
        author = { Hwang, Jeonghyun and Rhee, Seungyeon and Kim, Minjeong and Viriyasaranon, Thanaporn and Choi, Jang-Hwan},
        title = { { MARSeg: Enhancing Medical Image Segmentation with MAR and Adaptive Feature Fusion } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a novel segmentation model based on a masked autoregressive framework, aimed at improving performance on small and complex structures such as lesions. A feature fusion module combines encoder and decoder features from the masked autoencoder to preserve both spatial and channel context. The model is validated on four MSD tasks (liver tumor, pancreatic tumor, colon cancer, and spleen segmentation) and outperforms common segmentation architectures on three of the four tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The use of a continuous vector tokenizer for autoregressive modeling adapts recent advances in an innovative way.
    2. The feature fusion module preserves spatial and channel information in parallel, providing richer context for the final segmentation layer.
    3. Strong experimental results demonstrate the model’s effectiveness across multiple segmentation tasks.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Key components are underexplained. Details on the diffusion refinement module, such as the number of denoising steps and its original training procedure, are missing.
    2. Figure 1 lacks information on the pretraining stage.
    3. The term “generative” in the feature fusion module’s name is not justified or explained.
    4. Experiment protocols lack clarity. It is not stated which datasets were used for pretraining, what parameters were applied, or how comparison models in Table 1 were trained and optimized.
    5. It is unclear whether the models were trained on 2D slices or full 3D volumes, particularly for architectures capable of 3D segmentation such as nnUNet with its multiple configurations.
    6. Some reported performance numbers seem inconsistent; for example, SwinUNet surpasses nnUNet by ten points on pancreatic tumor segmentation yet trails it by five points on spleen segmentation.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1. Discuss the computational cost of autoregressive models compared to traditional segmentation approaches.
    2. A comparison with benchmark MAE models such as SwinUNETR would help clarify the advantages of the MAR approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novel architecture and strong results are promising, but the manuscript requires detailed descriptions of pretraining, benchmarking protocols, model training settings, and clear explanations of key modules before it can be fully evaluated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed many of my initial concerns and have promised to include details in the final paper on the remaining concerns raised.



Review #2

  • Please describe the contribution of the paper

    The paper proposes MARSeg, a two-stage framework for medical image segmentation that leverages a Masked Autoregressive (MAR) generative model for pretraining and introduces a Generative Parallel Adaptive Feature Fusion (GPAF) module for segmentation. The MAR model, pretrained on large-scale CT data, captures both global anatomical context and fine-grained local details. In the segmentation stage, features from the pretrained encoder and decoder are fused via the GPAF module, which combines spatial and channel attention in parallel to preserve structure and enhance segmentation accuracy. Experiments on four CT datasets demonstrate that MARSeg consistently outperforms state-of-the-art methods in Dice and IoU.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-organized and clearly written. The proposed method is reasonably novel, presenting a practical and effective approach for incorporating generative pretraining into medical image segmentation workflows.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The proposed method is limited to 2D image inputs, which may restrict its applicability to fully volumetric medical imaging tasks. Additionally, the evaluation relies solely on overlap-based metrics (Dice and IoU) without accompanying statistical significance testing, making it difficult to assess the robustness of the reported improvements.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. The authors are encouraged to include additional evaluation metrics, such as distance-based measures (e.g., HD95 and ASSD). Overlap-based metrics like DSC and IoU often exhibit similar behavior and may not fully capture boundary-level differences.

    2. Statistical testing is needed to determine whether the observed performance improvements are significant. This would strengthen the validity of the comparative analysis, especially when the margin is small.

    3. The authors should clarify whether the baseline or comparative methods also incorporate pretraining. Are they implemented in 2D or 3D?

    4. The authors may discuss the potential extension of their method to 3D volumetric data. It would be valuable to assess whether the proposed approach, particularly in terms of computational efficiency, can feasibly support such an extension. This aspect is also important when comparing the method against other segmentation networks that operate in 3D.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the method is somewhat novel, the current evaluation is limited in scope. Additionally, the restriction to 2D segmentation should be discussed more thoroughly, particularly in terms of feasibility and comparison with 3D network approaches.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns regarding the clarity, evaluation metrics, and the level of innovation. They have acknowledged the importance of a 3D extension as future work.



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel segmentation framework, Masked Autoregressive Segmentation (MARSeg). This algorithm captures fine-grained details through masked autoregressive modelling whilst considering global tissue information in the segmentation head, which is based on a feature pyramid network. Feature fusion is addressed through a generative parallel adaptive module that applies spatial and channel attention, resulting in a balanced and context-adapted architecture for organ and tumour segmentation tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well structured and provides enough information for reproducibility and clear interpretation of the results. Authors leveraged the advantages of masked autoregressive encoders and generative modelling to pretrain the segmentation task, which makes their model robust against anatomical abnormalities as it considers both global and local information through generative parallel adaptive fusion. The fine-tuning of the segmentation model using feature pyramid networks is an interesting approach for keeping semantic context and information.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No major weaknesses have been observed. Nevertheless, a few comments are proposed:

    • The results and analysis section could benefit from further insights into the medical implications and what gaps the work is addressing
    • The figures and tables in-between conclusions can distract the reader from the final message
    • Reporting the efficiency of the presented framework against state-of-the-art methods would also add transparency to the paper
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The conclusions are split and show results from Table 2/3 and Figure 2 in-between. This can make it hard to follow the implications of the study for the reader.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (6) Strong Accept — must be accepted due to excellence

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work represents a novel and interesting approach for medical segmentation tasks by addressing global and local detail fusion through a generative parallel adaptive module. The combination of masked autoregressive modelling with simple segmentation exhibited superior performance to current state-of-the-art methods and can be of high interest for the MICCAI community, specifically for future research addressing tumour and organ segmentation in clinical settings.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors addressed all comments from reviewers.




Author Feedback

We sincerely thank the reviewers for their insightful feedback on our work. We appreciate the opportunity to clarify and further improve our manuscript in response to the comments.

(R1W1, W2, W4) In the MAR model, diffusion refinement is applied when computing the diffusion loss, with a small MLP-based denoiser operating under a cosine schedule and 1000 denoising steps. For pretraining, MAR_Base was trained on the MSD Pancreas and Liver datasets (excluding the test split; Sec. 2.1) using AdamW, batch size 64, and a learning rate of 1e-4. An illustration of Stage 1 will be added in the final version for greater clarity.

(R1W4, W5, W6, R2C3) All models, including MARSeg and the benchmarks, are based on 2D segmentation and were trained under the same experimental setup with sufficient epochs and appropriately tuned learning rates to ensure optimal performance. Differences in dataset characteristics and distributions may influence model performance differently, but all models were individually optimized to ensure a fair comparison. We observed that MARSeg achieved consistently strong average performance across all organs in terms of Dice and IoU.

(R1C1, R3W3) We will update the computational cost in Table 1 for clarity. nnU-Net requires 59.93G FLOPs, 46.33M parameters, and 7.5 ms inference; MT-UNet requires 40.45G FLOPs, 79.07M parameters, and 117.11 ms. MARSeg (125.63G FLOPs, 159.59M parameters) achieves much faster inference (36.80 ms) despite its higher complexity, offering better accuracy and efficient runtime compared to SOTA models.

(R1W3) We designed the GPAF module to adaptively utilize features from the generative MAR model for the segmentation task. The term “Generative” reflects the origin and nature of the features being fused.

(R2C3) Comparative methods were trained without pre-training. In MARSeg, the encoder was pre-trained as part of a generative model (the MAR model), and this encoder is frozen during the segmentation stage.
In this study, we aim to utilize the features from the generative model for the segmentation task, rather than performing transfer learning aimed at improving performance. In other words, pre-training is not an optional technique but an essential component for implementing the core idea of the framework. This approach is an attempt to present a new direction for segmentation by leveraging generative representations, and it clearly possesses structural differences from existing methods.

(R1C2) We recognize that a comparison with MAE models could clarify the advantage of our approach. We will address this in the conclusion section and expand it into a comprehensive MAE comparison in future work.

(R2W1, C4) In this study, we focused on 2D segmentation and demonstrated its feasibility. We also consider the extension to 3D to be important and will pursue it as future work. Indeed, MARSeg is readily extensible to full 3D volumetric segmentation by replacing the initial patch embedding with a 3D convolution and adding thin 3D adapters at each stage. The computational overhead can be reduced by simplifying KL-16, using thin adapters, and trimming MAE layers.

(R2W2, C1, C2) We have added HD95 and ASSD. For example, on the spleen dataset, HD95 scores are 2.40, 1.40, and 1.08, and ASSD scores are 0.62, 0.42, and 0.28 for nnUNet, MTUNet, and our model, respectively. Our model shows superior or comparable performance across standard metrics (Dice, IoU, HD95, and ASSD), demonstrating robustness and accuracy. As these metrics are widely used in medical image segmentation, as in our benchmark models, we believe the results sufficiently support our claims. A discussion on statistical considerations will be added to the final conclusion.

(R3W1) Thank you for the suggestion. We will include a discussion of the clinical significance of our work, specifically its potential for early diagnosis by effectively capturing small tumors and for reducing clinical workload, in the final version of the paper.
(R3W2, C1) As suggested, we will adjust the placement of Tables 2 and 3 and Figure 2.
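The author feedback states that the diffusion loss uses a cosine noise schedule with 1000 denoising steps. A minimal sketch of such a schedule follows; this assumes the standard cosine formulation of Nichol and Dhariwal (2021) rather than the authors' actual code, and `cosine_alpha_bar` is an illustrative name.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) for a cosine noise schedule.

    Standard cosine schedule (assumed here; the MAR implementation may
    differ in details). alpha_bar decreases from 1 at t=0 (clean input)
    to ~0 at t=T (pure noise).
    """
    f = math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos((s / (1 + s)) * math.pi / 2) ** 2
    return f / f0

T = 1000  # denoising steps, as stated in the rebuttal
alpha_bar = [cosine_alpha_bar(t, T) for t in range(T + 1)]
# Per-step noise rates, clipped at 0.999 as in the standard formulation.
betas = [min(1 - alpha_bar[t + 1] / alpha_bar[t], 0.999) for t in range(T)]
```

The clipping of the final betas avoids a degenerate last step where the signal ratio reaches zero.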




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal does a good job in responding to the reviewers’ concerns. All reviewers now recommend acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


