Abstract

Precise brain tumor segmentation is critical for effective treatment planning and radiotherapy. Existing methods rely on voxel-level supervision and often struggle to accurately delineate tumor boundaries, increasing potential surgical risks. We propose an Attention-Guided Vector Quantized Variational Autoencoder (AG-VQ-VAE), a two-stage network specifically designed for boundary-focused tumor segmentation. Stage 1 comprises a VQ-VAE that learns a compact, discrete latent representation of segmentation masks. In Stage 2, a conditional network extracts contextual features from MRI scans and aligns them with the discrete mask embeddings to facilitate precise structural correspondence and improved segmentation fidelity. Additionally, we propose an attention scaling module to reinforce discriminative feature learning and a soft masking module to refine attention in uncertain tumor regions. Comprehensive evaluations on BraTS 2021 demonstrate that our AG-VQ-VAE sets a new benchmark, improving the HD95 metric by 4.83 mm (Whole Tumor), 2.14 mm (Tumor Core), and 2.39 mm (Enhancing Tumor) compared to state-of-the-art methods, while achieving a 0.23% improvement in Dice score for the whole tumor. Furthermore, our qualitative results and ablation study demonstrate that feature-level supervision significantly enhances boundary delineation compared to voxel-level approaches. The code is available at https://github.com/danishali6421/AG-VQVAE-MICCAI.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3774_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/danishali6421/AG-VQVAE-MICCAI

Link to the Dataset(s)

BraTS 2021 dataset: https://www.synapse.org/Synapse:syn25829067/wiki/610863

BibTex

@InProceedings{AliDan_AttentionGuided_MICCAI2025,
        author = { Ali, Danish and Mian, Ajmal and Akhtar, Naveed and Hassan, Ghulam Mubashar},
        title = { { Attention-Guided Vector Quantized Variational Autoencoder for Brain Tumor Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {67--77}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    An Attention-Guided Vector Quantized Variational Autoencoder (AG-VQ-VAE) is proposed: a two-stage network specifically designed for boundary-focused tumor segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A two-stage attention-guided VQ-VAE framework is proposed, which leverages feature-level supervision to ensure structural consistency and accurate tumor boundary segmentation, while avoiding reliance on low-level features from skip connections.
    2. An attention scaling mechanism and a soft masking module are introduced to dynamically adjust the contribution of each attention head and emphasize uncertain boundary regions, effectively enhancing the model’s generalization and segmentation performance.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The authors are encouraged to provide qualitative results for both the comparison with baseline methods and the ablation studies. Such visual evidence would offer clearer insights into the effectiveness of the proposed approach, particularly in accurately delineating tumour boundaries.
    2. The authors mention that three separate VQ-VAE models are trained for three different labels. It would be beneficial to discuss how this design choice impacts computational efficiency, especially in comparison with some window-based Transformer models introduced earlier in the paper.
    3. One of the key contributions of this work is the adoption of feature-level supervision instead of pixel-level supervision. However, it appears that applying a pixel-level loss in the mask generation process using VQ-VAE might not be intuitive. Are there any references or prior studies that support this design choice?
    4. Following the previous point, it would be helpful to include relevant references on VQ-VAE in the introduction, to better contextualize its use in this work. Additionally, the authors should elaborate on why low-level features from skip connections may pose problems in the context of precise tumour boundary segmentation.
    5. In Equation (3), two hyperparameters are introduced to balance the Dice loss and the commitment loss. The authors are encouraged to provide more details on how these parameters are chosen or tuned, and to discuss their impact on model performance.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weakness section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ responses effectively address the concerns raised, so I have changed my prior evaluation and rating.



Review #2

  • Please describe the contribution of the paper

    The main contribution of this paper lies in proposing a novel two-stage network called AG-VQ-VAE (Attention-Guided Vector Quantized Variational Autoencoder) to improve the accuracy of boundary delineation in brain tumor segmentation. It also introduces modules such as attention scaling and soft masking to enhance the model’s performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Two-Stage Network AG-VQ-VAE: Introduces a new approach to brain tumor segmentation by learning discrete latent representations using VQ-VAE, effectively handling diverse anatomical variations.
    2. Modules: Attention Scaling enhances feature learning efficiency, and Soft Masking focuses on uncertain tumor boundaries, leading to more accurate segmentation.
    3. Accurate Boundary Delineation and Evaluation: Achieves clinically significant accurate delineation of tumor boundaries through feature-level supervision, and demonstrates improved performance with a significant enhancement in the HD95 metric on the BraTS 2021 dataset.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. High performance on the BraTS 2021 dataset is encouraging, but it’s not enough to prove applicability in real clinical settings. Clinical data can differ from the BraTS 2021 dataset in data quality, patient characteristics, scanning protocols, and other aspects. These differences may degrade model performance, making it difficult to determine clinical utility.
    2. One of the claimed benefits of multi-head attention with the scaling module is the ability to focus on different tumor features. Visualizing which heads are being scaled up or down, as well as the final attention maps, could provide valuable insight into interpretability.
    3. The proposed pipeline involves multiple VQ-VAEs (for each sub-region), eight transformer layers, and additional attention modules. This complexity raises concerns about the computational cost and memory requirements, especially when processing large 3D MRI datasets. To better assess the real-world feasibility of the method, it would be valuable to include a runtime and memory usage analysis. Specifically, what is the average inference time per patient, and how does the memory footprint scale with the size of the 3D MRI volume? Including this information would significantly strengthen the paper’s practical contribution.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of this method is limited. Additional experiments are needed to demonstrate its superiority.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Most of my earlier concerns have been adequately addressed by the authors.



Review #3

  • Please describe the contribution of the paper

    The authors propose a method for evaluating brain tumor segmentation using multimodal MRI images (T1, T1 with gadolinium, T2, T2-FLAIR), which aims to improve the assessment of boundaries, crucial for estimating the tumor surface, particularly for planning surgical interventions. The authors’ method consists of two steps:

    1. Training a Vector Quantized Variational Autoencoder: This first step involves training the autoencoder on segmentations to define a codebook of vectors that represent characteristic features of the segmentation.

    2. Implementing an Attention-Based Architecture: In the second step, an attention-based architecture is used to extract attention features from the four MRI modalities. The images are concatenated and then passed through a CNN to extract characteristic features. These features subsequently go through eight attention heads before being upscaled to fit the vector space of the Vector Quantized Variational Autoencoder. These features are then associated with the closest features from the codebook to finally be decoded and obtain the segmentation.

    The model is trained on the BraTS 2021 dataset and compared in terms of Dice score and HD95 against six methods from the literature (3D U-Net, TransBTS, UNETR, NestedFormer, DBTrans, Causal Intervention). In terms of results, the proposed method is the most accurate in terms of boundary and whole tumor evaluation.
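The nearest-codebook association described in step 2 is the standard vector-quantization step of a VQ-VAE: each encoder feature is snapped to its closest codebook entry. A minimal NumPy sketch (an illustration, not the authors' implementation; in the paper's setting the codebook would be roughly 512 entries of dimension 32, per the rebuttal):

```python
import numpy as np

def vq_lookup(features, codebook):
    """Snap each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    Returns (indices, quantized), where quantized[i] = codebook[indices[i]].
    """
    # Pairwise squared distances via ||f||^2 - 2 f.c + ||c||^2, shape (N, K)
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

# Toy usage with two 2-D codes; real features would be flattened 3D latents.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.1, 0.0], [0.9, 1.2]])
idx, quantized = vq_lookup(features, codebook)  # idx -> [0, 1]
```

During training this lookup is non-differentiable, so VQ-VAEs typically copy gradients straight through it and add a commitment loss to keep encoder outputs close to their chosen codes.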

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-designed with a clear and concise explanation of the clinical context. The quick review of existing tumor segmentation solutions is well-appreciated and demonstrates the seriousness of the authors.

    The solution implemented by the authors takes advantage of the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, which firstly reduces the computational complexity of the solution by embedding in the latent space. This approach also allows for the extraction of key features of the segmentation through the use of the quantized implementation.

    Some novel aspects are added to the attention structure, including the integration of attention scaling and soft masking modules.

    A key strength of this paper is the thorough validation of the solution, including a comparison with 6 other methods from the literature and an ablation study.

    Even though the proposed solution does not have the best score in all aspects, the authors clearly define the strong advantage of their solution, which is the accurate boundary estimation of the tumor.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Even though the overall quality of the paper is clear, some information should be added to ensure that the study can be well assessed and reproducible:

    Major Issues:

    Ablation study: The ablation study does not include the AG-VQ-VAE without the Attention Scaling and Soft Masking modules. This is a major comparison that should be addressed. Please explain why this comparison was not included.

    Reproduction of literature solutions: Please specify in one line how you reproduced the literature solutions. Did you train these models from scratch or use pre-trained solutions available? This information is crucial for understanding the baseline comparisons.

    Quadratic cost of transformer solutions: You mention the quadratic cost as a drawback of transformer solutions. It would be beneficial to include the inference time for each solution to provide a clearer comparison of computational efficiency.

    Hyperparameters for reproducibility: For reproducibility, please clearly define all your hyperparameters and explain why you chose them. This includes training parameters, the number of attention heads, the number of CNN layers, the size of the codebooks, etc.

    Pre-trained CNN for conditional network: Please explain why you did not use a pre-trained CNN for the conditional network. This could provide insights into the design choices and their implications on the model’s performance.

    Minor Issues:

    1. Introduction: In the introduction, it would be pertinent to discuss the overall inter- and intra-rater variability of manual tumor evaluations through solutions such as RANO. This would provide context for the importance of improving segmentation methods.

    2. Results: “Note that accurate tumor boundary delineation is crucial for surgical decision-making, as minor inconsistencies in inner voxel predictions of tumor regions are more interpretable for clinicians than errors in boundary predictions [30]”; this sentence should not be in the results analysis.

    3. Conclusion: One limitation that could be addressed in the conclusion is the fact that the modalities are only concatenated instead of fully utilizing attention mechanisms through cross-attention channels, for example. This could indicate potential areas for further improvement.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I thank the authors for their work and based on the evaluation, the proposed study should be accepted:

    Novelty and innovation: The paper introduces a novel method for brain tumor segmentation by leveraging a Vector Quantized Variational Autoencoder (VQ-VAE) and an attention-based architecture. This method could also be used for explainability and therefore has great prospects.

    Clinical relevance: The method proposed by the authors is highly relevant to clinical practice, particularly in the accurate boundary estimation of tumors, which is crucial for surgical planning but also for tumor grade evaluation, which is based principally on the tumor area and thus accords great importance to the boundary.

    Thorough validation: The authors have conducted a comprehensive validation of their solution, including comparisons with six other methods from the literature and an ablation study.

    Clear presentation: The paper is well-structured and clearly presented, with detailed explanations of the methodology, results, and comparisons. The inclusion of a quick review of existing solutions further strengthens the paper by providing context and demonstrating the authors’ understanding of the field.

    Potential for improvement: While the paper has some weaknesses, such as the need for additional information on the ablation study and hyperparameters for reproducibility, these issues can be addressed easily.

    In conclusion, the paper’s novel approach, clinical relevance, thorough validation, and clear presentation outweigh the identified weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all the reviewers for their constructive feedback and recognition of our model’s novelty (R1,R2,R3), its boundary-focused segmentation (R2,R3), and the comprehensive evaluation & ablations (R2).
Qualitative visualization (R1, R3): We emphasized boundary-focused visualization to highlight HD95 gains. As shown in Fig. 2, inner discrepancies explain the slight Dice drop for TC/ET. We will include links to further visualizations (attention maps), ablation/baseline comparisons, and our model in the camera-ready version.

Computational cost, inference time, and memory scaling (R1, R2, R3): Our three networks (one per tumor region) result in a total of 441G FLOPs and 40M parameters. In comparison, the window-based transformer models DBTrans [28] and Swin-UNETR [10] require 146G/25M and 395G/62M, respectively. Our model processes a 3D MRI (240×240×155) in ~780 ms on an NVIDIA RTX 3090 (24 GB), vs 254 ms (DBTrans) and 674 ms (Swin-UNETR). While slower, our model still performs segmentation in under 1 second and gives superior HD95 scores: 5.0/4.1/3.7 mm (WT/TC/ET) vs 9.8/6.2/6.1 mm for DBTrans (Table 1). Our model consumes <8 GB of memory during full-resolution inference (per patient). The fixed latent representation enables efficient patch-based scaling for larger inputs. For instance, 128³ patches with 50% overlap result in 18 windows for a standard input; memory scales with window count. We will mention these details.

Feature- vs pixel-level supervision (R1): We use a pixel-level Dice loss only once, in Stage 1, to train VQ-VAEs on binary masks. In Stage 2, feature-level supervision uses a cross-entropy loss to align MRI features with pre-trained latent codes. This two-stage design is loosely inspired by DALL·E [Ramesh et al., 2021], where a discrete VAE is trained with a reconstruction loss, and text embeddings are then aligned to visual tokens via cross-entropy. Our conditional network maps MRI latent features to the discrete latent features of the segmentation mask.

Skip connections and VQ-VAE references (R1): We will add references (Ramesh et al., ICML 2021; Gu et al., CVPR 2022) to support our VQ-VAE design. Shallow skip features carry rich spatial details but lack semantic clarity; fusing them with deep features introduces noise. Xiao & Nie (MIUA 2024) show this degrades boundary delineation in blurry MRI regions. Our ablation (Table 2) confirms that the skip-based AG-UNet underperforms compared to our two-stage model without skip connections.

Hyperparameter justification (R1, R2): We used grid search and set the weights in Eq. (3) as αd = 0.75 and αc = 0.25. A higher αc slowed adaptation to codebook vectors for varying encoder outputs, while a lower αd weakened structural guidance. We used a codebook size of 512 (dim = 32) for a compact yet expressive and memory-efficient representation. The 4-layer CNN encoder provides a sufficient receptive field for 3D context while keeping the model lightweight. Eight attention heads capture diverse spatial patterns while maintaining alignment with latent tokens; fewer heads reduce precision, while more cause over-fragmentation and higher compute cost.

Ablation study (R2): We initially compared the 1-stage AG-UNet and the 2-stage AG-VQ-VAE, both with AS and SM enabled, and focused ablations on the better-performing 2-stage model. Nonetheless, 1-stage AG-UNet ablations without AS or SM also show clear performance drops. No AS: WT/TC/ET = 92.84/87.70/83.02 (Dice), 6.42/5.95/4.71 (HD95). No SM: WT/TC/ET = 92.92/86.59/83.57 (Dice), 6.53/6.17/4.54 (HD95).

Clinical generalization (R3): BraTS is a widely used benchmark with HGG/LGG cases from multiple scanners and institutions. Numerous recent works, e.g., [15, 26, 28], use it to validate their models. We carefully followed current conventions in our evaluation.

Reproduction of existing solutions (R2): Results of prior works are taken from their respective papers. We will mention this.

Pretrained CNN (R2): Our goal is to map MRI features to discrete latent codes, which requires precise alignment; pretrained CNNs are not tailored for structured discrete priors.
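The patch-based scaling figure quoted in the rebuttal (128³ patches with 50% overlap giving 18 windows for a 240×240×155 volume) can be reproduced with a short sketch. This is illustrative arithmetic, not the authors' inference code, and assumes a uniform stride of patch × (1 − overlap) per axis:

```python
import math

def num_windows(shape, patch=128, overlap=0.5):
    """Count sliding-window positions per axis for patch-based inference.

    Assumes stride = patch * (1 - overlap), with enough windows per axis
    to cover the full extent: ceil((L - patch) / stride) + 1.
    """
    stride = int(patch * (1 - overlap))
    counts = [max(1, math.ceil((L - patch) / stride) + 1) for L in shape]
    total = 1
    for c in counts:
        total *= c
    return counts, total

counts, total = num_windows((240, 240, 155))  # counts -> [3, 3, 2], total -> 18
```

With 3 × 3 × 2 = 18 windows, peak memory depends on how many windows are resident at once, consistent with the rebuttal's claim that memory scales with window count.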




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The reviewers acknowledged the novelty of this work, but they have concerns about the experiments. The authors are encouraged to clarify the following in the rebuttal: 1) model complexity compared with existing works; 2) the choice and impact of hyper-parameters; 3) no visualization of attention to show interpretability; 4) motivation of feature-level supervision; 5) insufficient ablation study.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes an interesting framework aimed at improving boundary delineation in MRI brain tumor segmentation. However, I feel obliged to highlight concerns in evaluation that may have been overlooked. I assume the authors used the data splits from references #15 and #28 and reported the baseline numbers in Table 1 directly from those papers, as stated in the rebuttal, which is fine. But BraTS is a well-established benchmark, a thorough evaluation on the BraTS 2021 dataset should not ignore more recent and robust models, such as nnU-Net (original and/or updated variants), MedNeXt, Mamba-based models, and SwinUNETR. Moreover, while boundary delineation is clinically important, relying only on HD95 (which is sensitive to voxel outliers) to highlight this point might not be ideal. Metrics like the normalized surface distance, which better quantify boundary accuracy and are more clinically relevant, should also be included.
