Abstract
The effectiveness of Vision Transformer (ViT)-based feature encoding networks has been demonstrated in medical image analysis tasks. However, complexity that grows quadratically with the number of tokens limits their application in dense prediction. To accelerate ViT, we propose an efficient and accurate token halting and reconstruction encoder framework, termed HRViT, designed for precise medical image semantic segmentation. Our approach is motivated by the observation that background and internal tokens can be easily identified and halted in early layers, while complex and ambiguous edge regions require deeper computational processing for accurate segmentation. HRViT leverages this insight by incorporating an edge-aware token halting module, which dynamically identifies edge patches and halts non-edge tokens. The preserved edge tokens are propagated to deeper layers and further refined through edge reinforcement. After encoding, all tokens are restored to their original positions, and auxiliary supervision is introduced to strengthen the encoder’s representation power. We evaluate the segmentation performance of our method on two public medical image datasets, and the experimental results show that it achieves promising performance compared with state-of-the-art approaches. Our code is released at https://github.com/guoyh6/hrvit.
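As a rough illustration of the mechanism described above, the following is a minimal, hypothetical PyTorch sketch of edge-aware halting: a lightweight head scores each token, low-scoring (non-edge) tokens are frozen after the shallow layers, only the retained tokens pass through the deeper layers, and the results are scattered back to their original positions. This is not the authors' released implementation (see the linked repository); the class name, depths, and threshold are placeholders.

import torch
import torch.nn as nn


class EdgeAwareHaltingSketch(nn.Module):
    """Toy illustration of edge-aware token halting (not the official HRViT code)."""

    def __init__(self, dim=256, depth_shallow=4, depth_deep=8, keep_threshold=0.5):
        super().__init__()

        def make_block():
            return nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

        self.shallow = nn.ModuleList([make_block() for _ in range(depth_shallow)])
        self.deep = nn.ModuleList([make_block() for _ in range(depth_deep)])
        self.edge_head = nn.Linear(dim, 1)   # per-token edge score
        self.keep_threshold = keep_threshold

    def forward(self, tokens):
        # tokens: (B, N, C) patch embeddings
        for blk in self.shallow:
            tokens = blk(tokens)

        edge_prob = torch.sigmoid(self.edge_head(tokens)).squeeze(-1)  # (B, N)
        keep = edge_prob > self.keep_threshold                         # (B, N), bool

        out = tokens.clone()                 # halted tokens keep their shallow features
        for b in range(tokens.size(0)):      # per-sample loop for clarity, not speed
            kept = tokens[b, keep[b]].unsqueeze(0)        # (1, n_keep, C)
            if kept.size(1) == 0:
                continue
            for blk in self.deep:
                kept = blk(kept)
            out[b, keep[b]] = kept.squeeze(0)             # restore original positions
        return out, edge_prob                # edge_prob can receive auxiliary supervision

A binary cross-entropy loss on edge_prob against patch-level edge labels would then play the role of the auxiliary supervision mentioned above; whether HRViT gates tokens exactly this way is an assumption, so treat the sketch only as a reading aid.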
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3977_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/guoyh6/hrvit
Link to the Dataset(s)
N/A
BibTex
@InProceedings{GuoYuh_EdgeAware_MICCAI2025,
author = { Guo, Yuhao and Song, Bo and Fan, Heng and Cheng, Erkang},
title = { { Edge-Aware Token Halting for Efficient and Accurate Medical Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
pages = {185 -- 195}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a ViT-specific token halting mechanism to increase the computational efficiency, based on the intuition that “the tokens corresponding to the background or interior of objects can be easily recognized and halted in the early encoder layers”.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The implementation of the proposed insight through a supervised edge-aware token selection mechanism does appear to yield meaningful efficiency improvements without sacrificing accuracy.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Although the scalars in Table 1 are “improving” when using the proposed HRViT, the significance is not discussed (see Christodoulou et al., “Confidence Intervals Uncovered”, MICCAI 2024).
Related work missing: DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, Rao et al., NeurIPS 2021.
I fail to understand what the “binary edge label GT_edge” is, as the sentence is poorly written, which is unfortunate because this is a core part of the method.
Experiments-wise, on the BTCV dataset the nnUNet MedNeXT baseline achieves 0.85 (nnU-Net Revisited, Isensee et al., MICCAI 2024), which is much better than the reported results.
Assertions. The very core insight (namely that “background and internal tokens can be easily identified and halted in early layers”) is highly debatable; if true, then we can “easily” solve image segmentation.
Finally, I guess the title should read “Halting”, not “Haulting” ?!
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The weaknesses (missing related work, clarity, assumptions) outweigh the strengths (empirical improvement) for me.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
ViT architectures achieve improved computational efficiency, and the authors provided convincing responses to the raised concerns.
Review #2
- Please describe the contribution of the paper
In this paper, the authors propose HRViT, a novel and efficient Vision Transformer-based encoder framework tailored for medical image segmentation. The authors introduce an edge-aware token halting mechanism that reduces computational redundancy by stopping background and interior tokens early, while preserving and refining edge tokens that are critical for accurate segmentation. An edge reinforcement module and auxiliary loss are further incorporated to enhance representation learning. The proposed method demonstrates competitive segmentation performance with significantly improved inference speed on two public benchmarks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Based on the fact that edge tokens require deeper computation, the proposed model designs a series of strategies to recognize edge tokens and halt non-edge tokens. Unlike existing adaptive token halting across different layers, the proposed strategies are interesting and useful for medical image segmentation. The method improves performance while reducing computational cost.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
a) Motivation still needs to be clarified. Although edge tokens demand a deeper computational process, they should be identified equally with non-edge tokens in early layers. Consider providing a visualization of the process or citing relevant literature to better support the underlying motivation. b) While the performance is promising, the reduction in computational cost compared to the baseline is limited. c) The edge token retaining strategy (i.e., non-edge token halting policy) visualized in Figure 2 does not highlight the model’s advantages, especially on the BTCV dataset. Additionally, ablation results without this strategy should be included in Figure 2 to enable a more direct comparison.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
a) It is recommended to standardize the text orientation in Figure 1 for consistency—for example, “encoder aux. loss” should align with the overall layout. Additionally, key vectors should be clearly labeled to enhance interpretability. b) Section 3 should use the past tense. c) The evaluation metrics should be presented in a separate paragraph for clarity. Both metrics were applied to the BTCV and BraTS datasets.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
HRViT demonstrates some improvements, though it also has some limitations. On the positive side, HRViT introduces a non-edge token halting strategy that prioritizes important edge tokens, offering a fresh approach to addressing edge ambiguity in medical images. Additionally, it achieves moderate performance gains with a slight reduction in computational cost. However, the weaknesses are that the motivation is not clearly articulated, the reduction in computational cost is limited, and the visualizations do not sufficiently highlight the model’s advantages. Therefore, the current recommendation is borderline accept, subject to change after rebuttal.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Most of my comments have been addressed.
Review #3
- Please describe the contribution of the paper
The main contribution is a new method to accelerate vision transformers for medical image segmentation. The idea is that boundaries are more important than background and homogeneous regions. Therefore, more computation is allocated to tokens related to edges and other tokens are halted in the encoder.
Evaluation shows segmentation performance matching state of the art while requiring fewer computational resources.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper presents a strong evaluation with respect to segmentation accuracy, together with ablation studies that demonstrate the method retains overall strong segmentation performance while substantially lowering computational costs.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The idea of edges being important for medical image segmentation is certainly true for a lot of structures. However, I think it can be argued that there are many boundary areas which are fuzzy and not well defined in medical images, and that these areas are often the more critical ones for a network to learn as they are more complex. What I am missing in this work is an evaluation and discussion of how such areas are affected by the proposed token halting, e.g. transitions between pancreas and bowel or stomach and bowel, liver and spleen, especially in non-contrast-enhanced images or for patients without a lot of visceral fat. Such an evaluation would better highlight the advantages and limits of the method. Overall segmentation accuracy is an indicator that the method works in general, and the ablation study shows that edges are important; however, there is likely a trade-off. E.g. Figure 2, 3rd column, shows that the network ignores a lot of tokens and in turn misclassifies areas such as air in the bowel, likely because it regards it as background. Depending on the application, such limitations can be very critical and should be discussed in detail.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The methodology section is a bit short, and details such as mathematical definitions of the losses are missing. Source code is provided, but it would be good to have critical definitions in the paper as well for reproducibility and self-explainability.
- Table 1 is a bit confusing, with column 3 (Aorta) and Stomach having two red-coded best results but no second best. The Aorta column also has the exact same result for SCD and HRViT-S + UNETR (89.39). Is that correct?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think the paper presents a solid contribution with good results and is generally well evaluated, except that there are no discussion / results about potential major limitations for regions without strong edge boundaries. This is critical since it would significantly limit the applicability of the method if there is poor performance in such regions. If there is a rebuttal, this should be addressed in detail.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Considering the other reviewers’ comments and the authors’ rebuttal, I think the paper needs substantial changes to better motivate and demonstrate the effectiveness of edge/non-edge token identification and use, which are out of scope of the rebuttal.
“[12] demonstrates how background pixels and core regions of large objects naturally achieve higher confidence scores through early-layer segmentation heads.” It would need to be demonstrated whether this holds for medical image segmentation and smaller structures. I appreciate the authors’ willingness to provide additional analysis and comparisons, but I think it would be a substantial change to the current paper. The same applies to the comparison to DynamicViT and nnUNet. Furthermore, it also applies to the performance of detecting blurry edge tokens and the impact on overall segmentation accuracy. Table 4 provides the accuracy of detecting edge tokens, but it does not give insight into how missed edge tokens impact segmentation accuracy.
Author Feedback
We thank all reviewers (R1, R2, R3 and Meta) for their valuable suggestions and would like to make the following clarifications.
[Q1-R1] Insufficient motivation. Thanks for the valuable question. While edge and non-edge tokens receive equal treatment in early layers, we note that DToP [12] demonstrates how background pixels and core regions of large objects naturally achieve higher confidence scores through early-layer segmentation heads. To verify this observation and better illustrate our motivation, we will include a comparative analysis of segmentation accuracy between edge/non-edge tokens in early layers versus all layers, and provide quantitative evidence of the confidence score distribution across different token types in the revision.
[Q2-R1] Inference latency. Computational efficiency constitutes a key advantage of our method. In Table 2 (BTCV), our method shows a 2.35× improvement in encoder efficiency, 1.46× higher overall FPS, and requires only 52.5% of the FLOPs compared to baseline models. The efficiency gains are further amplified on BraTS (Table 3), where we observe 3.07× encoder efficiency, 1.78× overall FPS, and 49.5% FLOPs consumption.
[Q3-R2] Evaluation and discussion of the recognition accuracy of blurred edge tokens are missing. Thanks for raising this question. The edge-aware halting module achieves remarkably high recognition accuracy (see Table 4), with about 90% of edge tokens correctly identified. We will illustrate this point through visualization, showing that most of the blurred edge areas are preserved. The segmentation performance of the minority of unretained edge regions remains comparable to or exceeds the baseline, as our token reconstruction compensates for them.
[Q4-R3] About statistical significance analysis. Thanks, and we agree that statistical significance analysis in medical image evaluation is important. The performance recorded in our experiments is the average of five repeated measurements. The results show consistent superiority over the baseline, with a marginal mean DSC standard deviation across runs, demonstrating robust performance.
[Q5-R3] Related works are missing: DynamicViT and nnUNet. While DynamicViT applies a similar token halting concept, our method differs in its application to medical imaging. Additionally, although the nnUNet MedNeXT baseline achieves strong accuracy, our framework maintains comparable performance while prioritizing computational efficiency for ViT-based architectures. We will expand our comparative analysis to include these works in the revised manuscript.
[Q6-R3] Confusing definition and spelling error. In our revision, we will replace “binary edge label GT_edge” with the more precise “binary label of token GT_token”, where GT_token=1 indicates edge-region tokens and GT_token=0 denotes non-edge-region tokens. We will also correct the title’s “Haulting” to the proper “Halting”.
[Q7-R3] The core insight is unconvincing. We appreciate your insightful observation. Our empirical analysis demonstrates that tokens located farther from edge regions achieve consistently higher prediction accuracy than those near boundaries. To further substantiate this finding, we will include supporting visual evidence in our revision.
[Q8-Meta] Whether the comparative experiments are fair. We acknowledge the importance of fair comparative evaluation. For our baseline implementation, we strictly adhere to the original experimental setups from the open-source works [25,26], without any modifications. Our method maintains identical experimental conditions to ensure a valid comparison.
[Q9-R1] Insufficient visual comparison. We will include additional comparative results with other methods in the revised version.
[Additional comments-R1, R2] Additional comments regarding charts, tenses, and mathematical descriptions of metrics and losses. Thanks for these helpful suggestions. We will adjust the charts, correct the tenses, and add the mathematical descriptions if space permits.
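To make the [Q6-R3] clarification more concrete, below is a small, hypothetical PyTorch sketch of one way such binary per-token labels could be derived from a ground-truth segmentation mask: mark every patch that contains a label boundary as an edge token. The exact rule used in the paper may differ; the function name and patch size are illustrative only.

import torch
import torch.nn.functional as F


def token_edge_labels(mask, patch_size=16):
    """Illustrative derivation of binary per-token labels (1 = edge-region token).

    mask: (H, W) integer ground-truth segmentation, with H and W divisible by patch_size.
    Returns a (H // patch_size, W // patch_size) tensor of 0/1 values.
    """
    m = mask.float().unsqueeze(0).unsqueeze(0)              # (1, 1, H, W)
    # A pixel lies on a boundary if its 3x3 neighbourhood contains more than one label.
    dilated = F.max_pool2d(m, kernel_size=3, stride=1, padding=1)
    eroded = -F.max_pool2d(-m, kernel_size=3, stride=1, padding=1)
    edge = (dilated != eroded).float()                       # (1, 1, H, W) edge map
    # A token is an edge token if its patch contains at least one boundary pixel.
    per_patch = F.max_pool2d(edge, kernel_size=patch_size, stride=patch_size)
    return per_patch.squeeze(0).squeeze(0).long()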
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Authors should address the critical concerns raised by the reviewers. Especially, is there any change in the experimental setup that resulted in a change in baseline model performance (compared to the original paper)?
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Please address the concerns raised by the reviewers in the camera ready, especially by R2.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper received mixed reviews before and after the rebuttal. I tend to accept it to raise discussion in our community, as it addresses an interesting issue in ViT-based medical image segmentation.