Abstract

While deep learning has significantly advanced medical image segmentation, most existing methods still struggle with handling complex anatomical regions. Cascaded or deep supervision-based approaches attempt to address this challenge through multi-scale feature learning but fail to establish sufficient inter-scale dependencies, as each scale relies solely on the features of the immediate predecessor. To this end, we propose the AutoRegressive Segmentation framework via next-scale mask prediction, termed AR-Seg, which progressively predicts the next-scale mask by explicitly modeling dependencies across all previous scales within a unified architecture. AR-Seg introduces three innovations: (1) a multi-scale mask autoencoder that quantizes the mask into multi-scale token maps to capture hierarchical anatomical structures, (2) a next-scale autoregressive mechanism that progressively predicts next-scale masks to enable sufficient inter-scale dependencies, and (3) a consensus-aggregation strategy that combines multiple sampled results to generate a more accurate mask, further improving segmentation robustness. Extensive experimental results on two benchmark datasets with different modalities demonstrate that AR-Seg outperforms state-of-the-art methods while explicitly visualizing the intermediate coarse-to-fine segmentation process. Source code is made available at https://github.com/takimailto/AR-Seg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1847_paper.pdf

SharedIt Link: https://rdcu.be/eHwMO

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04937-7_3

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/takimailto/AR-Seg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CheTao_Autoregressive_MICCAI2025,
        author = { Chen, Tao AND Wang, Chenhui AND Chen, Zhihao AND Shan, Hongming},
        title = { { Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {24 -- 34}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper is the first to apply the next-scale autoregressive model in medical image segmentation. It combines medical conditions, including the target segmentation class and medical embeddings extracted by MedSAM.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. It is the pioneer in using the next-scale autoregressive model for medical image segmentation.
2. The approach of using MedSAM and SVD features as conditions is interesting.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Is SVD necessary, why? The attention in AR calculates across the entire global scale.
2. Equation 4 might be incorrect. The multinomial distribution (\mathcal{M}) should have multiple probabilities. However, as described in Section 2.2, “Then, for each scale k, the next-scale segmentor (S_{\theta})…as input to predict the k-scale token map”, (S_{\theta}(r_1,\ldots,r_{k - 1},c,f)) should only output a one-scale token map, which can be considered as one probability value.
3. The last-scale token map is based on all previous scale token maps, which are adaptively aggregated according to their relationships. Is it necessary to use the consensus-aggregation strategy to generate the final token map? Generally, the earlier scales have limited information, resulting in less accurate token maps, while the last scale has more information and can generate a more accurate token map (as verified in Figure 4, the later scale has more accurate segmentation results). Sampling and averaging from “inaccurate” token maps may degrade the quality of the final token map.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper is the first to apply VAR [27] in medical image segmentation. Additionally, it makes certain modifications to adapt the model for this specific task.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The author introduces the autoregressive segmentation framework via next-scale mask prediction for medical image segmentation. AR-Seg consists of a multi-scale mask autoencoder, a next-scale autoregressive mechanism, and a consensus-aggregation strategy. The efficacy of the method is evaluated on LIDC-IDRI and BRATS2021.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The author introduces the autoregressive segmentation framework via next-scale mask prediction for medical image segmentation. AR-Seg naturally adapts the idea of next-scale prediction for segmentation tasks. The consensus aggregation strategy is also shown to effectively improve performance.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. This work seems highly influenced by the NeurIPS 2024 paper VAR [1]. What is the major difference between Sec. 2.2 and its counterpart in VAR?
2. Can the consensus-aggregation strategy in Sec. 2.3 be regarded as an “ensemble and vote” process? How much does it add to the inference time?
3. As shown in Table 1 and Table 2, the improvement seems marginal, while autoregressive models require quite a lot of resources for training. Could the authors elaborate on the training cost compared with SOTA methods?
[1] Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. In: NIPS (2024)
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper naturally adapts the idea of next-scale prediction to the segmentation task.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The research proposes AR-Seg, a framework consisting of (a) a transformer-based next-scale mask predictor, (b) a multi-scale mask autoencoder for (de-)quantizing masks, and (c) a mask consensus-aggregation strategy. The method is used for segmentation of ambiguous and hard-to-delineate structures in medical images. It obtained good results on two public benchmarks when compared against other methods.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed approach is capable of capturing the ambiguity / aleatoric uncertainty in the segmented structure. The LIDC benchmark is also typically considered a segmentation uncertainty quantification benchmark, yet this work does not mention the word “uncertainty” (ambiguity is only mentioned in the consensus-aggregation part). This is definitely a large contribution of the approach, although it is not introduced as such.
- The Multi-Scale Mask Autoencoder is interesting for quantizing the masks. An ablation experiment comparing the approach to normal up/downsampling is welcome.
- The Next-Scale Autoregressive Mechanism for mask prediction is a key contribution and enables scale conditional mask generation (core contribution of the work)
- The proposed method obtains strong segmentation performance on multiple benchmarks.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- From the formulation, it is not completely clear how the multinomial distribution (M), from which the varying segmentation maps are obtained, is modelled. Please provide more details on its implementation.
- Importantly, key works on the LIDC benchmark are missing: https://arxiv.org/pdf/2006.06015, https://arxiv.org/pdf/2303.08888, and https://arxiv.org/abs/2108.02155. The benchmark has also evolved into versions 1 and 2. Please consider discussing these methods with respect to the proposed method (as some of them still outperform AR-Seg).
- The line of research improving segmentation of ambiguous structures is very valuable. It is just a pity that, after all this time, 2D methods are still being used to solve a problem that is natively 3D. Maybe the authors can comment on extending the approach to 3D.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Please reduce the number of claimed contributions in the introduction; some of the contributions are just a reformulation of the others. Consider open-sourcing the code.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall a strong paper with an interesting method. The method obtains high performance on difficult benchmarks, outperforming some state-of-the-art approaches.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We thank the reviewers for their thorough summaries and valuable feedback. We will address these in the future version and make the source code publicly available.

1) 3D extension (MR, R1) Extending AR-Seg to 3D is feasible by replacing 2D convolutions with 3D variants in the autoencoder and adopting axial attention in the transformer. We will discuss this as future work, noting computational trade-offs.

2) LIDC-IDRI benchmark discussion (MR, R1) We acknowledge this oversight and will cite and discuss these methods, which are based on normalizing flows, categorical diffusion, and low-rank multivariate normal distributions.

3) Clarification of Equation 4 (MR, R1, R2) The multinomial distribution (\mathcal{M}) in Eq. 4 is parameterized by per-pixel probability vectors generated by the next-scale segmentor (S_{\theta}). Specifically, for each scale k, S_{\theta} outputs a probability tensor of shape h_{k} \times w_{k} \times V, where h_{k} and w_{k} denote the spatial dimensions of the k-th scale mask and V denotes the size of the codebook. Interpreting each pixel’s V-dimensional vector as a categorical distribution, we can sample token indices on a per-pixel basis. We will clarify this in the future version.
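
The per-pixel sampling described in point 3 can be illustrated with a minimal sketch. This is not the AR-Seg implementation: the function name `sample_token_map`, the nested-list layout, and the toy sizes (h_k = 2, w_k = 2, V = 3) are assumptions made for illustration, and only the Python standard library is used.

```python
import random

def sample_token_map(probs, rng):
    """Draw one codebook index per pixel from an h x w x V probability tensor.

    probs[i][j] is a length-V list, read as a categorical distribution over
    codebook indices for pixel (i, j). Hypothetical helper, not from AR-Seg.
    """
    token_map = []
    for row in probs:
        token_row = []
        for pixel_probs in row:
            # random.choices picks an index with probability proportional to its weight.
            index = rng.choices(range(len(pixel_probs)), weights=pixel_probs, k=1)[0]
            token_row.append(index)
        token_map.append(token_row)
    return token_map

# Toy probability tensor (illustrative values only): three pixels are
# deterministic, one pixel is genuinely ambiguous between the three tokens.
probs = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.2, 0.3, 0.5], [0.0, 0.0, 1.0]],
]
rng = random.Random(0)  # seeded so repeated runs give the same draw
tokens = sample_token_map(probs, rng)
print(tokens)
```

Repeated calls with different random states yield different token maps wherever a pixel's distribution is not one-hot, which is what makes multiple distinct samples possible.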

4) SVD-based adapter necessity (MR, R2) The SVD-based adapter complements self-attention by explicitly preserving low-rank structural patterns (e.g., anatomical contours) that standard MLP adapters may degrade. Our ablation studies (Table 4) demonstrate its effectiveness, showing a measurable improvement from 0.595 to 0.616 in HM-IoU.

5) Clarification of consensus-aggregation (MR, R2, R3) Contrary to the concern that aggregating across scales might degrade fine-scale detail, our consensus-aggregation uses only the last-scale token map and never the coarser, lower-resolution predictions. By drawing 16 independent samples at the highest resolution, we avoid any quality trade-off due to early-scale noise. This procedure is effectively an “ensemble and vote” over last-scale predictions: it reduces sampling variance and yields a more stable, accurate final mask.
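
A minimal sketch of such an “ensemble and vote” step over last-scale samples is given below. It assumes binary masks stored as nested lists, a per-pixel mean followed by a 0.5 threshold (ties counted as foreground), and four toy samples instead of 16; the function name `consensus_aggregate` and the threshold are illustrative assumptions, not details taken from the paper.

```python
def consensus_aggregate(masks, threshold=0.5):
    """Average N sampled binary masks per pixel, then threshold the mean.

    masks: list of N masks, each an h x w nested list of 0/1 values.
    threshold: assumed cut-off; a pixel is foreground if its mean >= threshold.
    Returns (mean_map, final_mask).
    """
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    mean_map = [[sum(m[i][j] for m in masks) / n for j in range(w)] for i in range(h)]
    final_mask = [[1 if mean_map[i][j] >= threshold else 0 for j in range(w)]
                  for i in range(h)]
    return mean_map, final_mask

# Four toy last-scale samples (illustrative values only).
samples = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 0]],
]
mean_map, final_mask = consensus_aggregate(samples)
print(mean_map)    # [[1.0, 0.25], [0.25, 0.75]]
print(final_mask)  # [[1, 0], [0, 1]]
```

The mean map doubles as a per-pixel agreement score: values near 0 or 1 indicate that the samples agree, while intermediate values mark ambiguous boundary pixels.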

6) Inference cost (MR, R3) The inference cost of AR-Seg splits into two parts: (1) autoregressive prediction iterations, 8 for AR-Seg versus 10 for diffusion-based methods [7,29]; and (2) consensus-aggregation over 16 samples, which is identical to the diffusion-based methods. Thus, the additional runtime of our consensus-aggregation is negligible compared to existing diffusion pipelines, while delivering improved performance.
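
As a rough worked comparison, assume that every iteration costs one network evaluation and that evaluations are equally expensive across methods and scales. This is a simplification not stated above, since coarser scales are likely cheaper. Under that assumption, the number of evaluations N per final mask is:

```latex
\begin{align*}
N_{\text{AR-Seg}} &= 8 \times 16 = 128 \\
N_{\text{diffusion}} &= 10 \times 16 = 160 \\
\frac{N_{\text{AR-Seg}}}{N_{\text{diffusion}}} &= \frac{128}{160} = 0.8
\end{align*}
```

The ratio counts evaluations only and says nothing about wall-clock time, which also depends on model size and the resolution processed at each step.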

7) Training cost vs. SOTA (R3) AR-Seg converges in 83,400 iterations (300 epochs), whereas diffusion-based methods [7] require 86,500 iterations to converge, so in practice AR-Seg trains with comparable or slightly lower GPU-hour requirements. The modest +1.17% Dice gain on BRATS stems from finer boundary delineation (Fig. 3), which is crucial for tumor subtypes.

8) Differentiation from VAR (R3) Although VAR employs autoregressive next-scale prediction for image generation, AR-Seg departs in three key ways: 1. AR-Seg takes the learned medical image features as conditions in addition to class labels. 2. We use discrete tokenization for masks rather than images. 3. We further introduce consensus aggregation for clinical robustness.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

Paper Summary:  The authors aim to improve segmentation of anatomically complex and ambiguous structures in medical images by explicitly modeling dependencies across spatial scales. To that end, they introduce AR-Seg, which first employs a multi-scale mask autoencoder to quantize ground-truth masks into hierarchical token maps, capturing coarse-to-fine structure. A next-scale autoregressive transformer then predicts each finer-scale mask conditioned on all previous scales and on image embeddings (enhanced via MedSAM and an SVD-based adapter). Finally, a consensus-aggregation strategy samples multiple mask predictions from the learned multinomial distribution and averages them to robustly resolve ambiguous boundaries.

Key Strengths:  Reviewers uniformly praise AR-Seg’s novel application of next-scale autoregressive modeling to segmentation and its clear demonstration of state-of-the-art performance on both the LIDC-IDRI and BRATS 2021 datasets. The multi-scale autoencoder is highlighted for its ability to capture hierarchical anatomical detail, and the consensus-aggregation strategy is seen as an elegant method to quantify and mitigate ambiguity.

Key Weaknesses:  All three reviewers identify areas requiring more detail. The parameterization and sampling process of the multinomial distribution (Eqn. 4) is only briefly described, leaving the implementation unclear. Reviewers request a deeper discussion of related LIDC-IDRI benchmarks and of how AR-Seg builds on or diverges from the recent VAR paper. The necessity of the SVD-based adapter over a standard MLP is questioned, as is the computational overhead and potential quality trade-off introduced by sampling and aggregating predictions from early, coarse scales. Finally, extending the 2D approach to inherently 3D medical data is noted as an important topic for future discussion.

Review Summary:  All three reviewers agree that AR-Seg’s core idea—progressive, scale-conditioned mask prediction—is novel, that its empirical results are strong, and that the paper is well-written and reproducible. They differ mainly on secondary points: Reviewer 2 questions the value of the SVD adapter that Reviewer 1 praises; Reviewers 2 and 3 express concern that consensus aggregation may degrade mask quality or add inference cost, whereas Reviewer 1 views it as essential; and only Reviewer 1 highlights missing LIDC-IDRI literature and the lack of a 3D extension.

Decision:  Despite these minor differences, no reviewer identifies a fundamental flaw, and all three recommend acceptance. Given the unanimous positive recommendation, the robustness of the core contributions, and the fact that the reviewers’ concerns can be addressed in the camera-ready version, the paper merits Early Accept.

Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Author(s):