Abstract
Multimodal image segmentation has been gaining significance with the advancement of deep learning and the increasing diversity of datasets. Although researchers have been actively exploring multimodal U-Net structures, improvements in the segmentation of fine features in medical images remain limited. In this study, we propose a novel U-Net model based on hybrid local-window attention for multimodal medical-image segmentation. This study aims to effectively analyze overlapping brain-tumor lesions and extract essential information from different magnetic-resonance-imaging modalities for more precise segmentation. The proposed hybrid local-window attention mechanism comprises local-window self-attention and cross-attention, disentangled representation learning (DRL), and region-aware contrastive learning (RCL) modules. We apply local-window self-attention to achieve efficiency over global attention, and local-window cross-attention between the encoder and decoder to enhance modality interaction. The hybrid local-window attention structure extracts modality-specific features, whereas DRL preserves modality and lesion information. RCL utilizes a contrastive loss within the lesions to improve segmentation. We perform comprehensive experiments on the BraTS 2023 and BraTS 2024 datasets and confirm that the proposed model provides enhanced medical-image segmentation performance compared with U-Net-based benchmark models without pre-training.
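For orientation, the following is a minimal, self-contained sketch of the 3D local-window self-attention the abstract describes; the window size, head count, and tensor layout here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of 3D local-window self-attention, NOT the authors' code.
# Window size, head count, and tensor layout are illustrative assumptions.
import torch
import torch.nn as nn


class LocalWindowSelfAttention3D(nn.Module):
    def __init__(self, dim: int, window: int = 4, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W); D, H, W must be divisible by the window size.
        b, c, d, h, w = x.shape
        ws = self.window
        # Partition the volume into non-overlapping ws^3 windows.
        x = x.view(b, c, d // ws, ws, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)   # (B, nD, nH, nW, ws, ws, ws, C)
        x = x.reshape(-1, ws * ws * ws, c)      # (B * nWindows, ws^3, C)
        # Attention runs only among the ws^3 voxels of each window, which is
        # what keeps the cost below that of global attention.
        x, _ = self.attn(x, x, x)
        # Undo the window partitioning.
        x = x.view(b, d // ws, h // ws, w // ws, ws, ws, ws, c)
        x = x.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(b, c, d, h, w)
        return x


if __name__ == "__main__":
    feat = torch.randn(1, 32, 16, 16, 16)       # one modality's feature map
    out = LocalWindowSelfAttention3D(dim=32)(feat)
    print(out.shape)                            # torch.Size([1, 32, 16, 16, 16])
```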
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1913_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
BraTS 2023 dataset: https://www.kaggle.com/datasets/shakilrana/brats-2023-adult-glioma
BraTS 2024 dataset: https://www.synapse.org/Synapse:syn53708249/wiki/627759
BibTex
@InProceedings{KimJiw_Hybrid_MICCAI2025,
author = { Kim, Jiwon and Jin, Seyong and Noh, Yeonwoo and Moon, Hyeonjoon and Lee, Minwoo and Noh, Wonjong},
title = { { Hybrid Local-Window-Attention–Assisted U-Net Model for Multimodal Medical-Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15961},
month = {September},
pages = {227--236}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a novel U-Net-based architecture for multimodal medical image segmentation, incorporating hybrid local-window attention mechanisms, disentangled representation learning (DRL), and region-aware contrastive learning (RCL). The proposed model is designed to effectively capture modality-specific features and improve segmentation of small and complex tumor regions. Comprehensive experiments on BraTS 2023 and BraTS 2024 datasets demonstrate that the method outperforms several baseline U-Net variants in both Dice score and HD95 metrics.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) Well-structured model architecture combining self-attention, cross-attention, and contrastive learning tailored for multimodal brain tumor segmentation. (2) Strong experimental performance with clear improvements over conventional U-Net variants across multiple tumor subregions (ET, TC, WT). (3) Comprehensive ablation studies validate the individual contributions of the proposed modules (Self-Attention, DRL, RCL, Cross-Attention).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) The paper does not include any comparisons with recent transformer-based models, which are widely considered state-of-the-art in medical image segmentation. This omission weakens the experimental evaluation and makes it difficult to assess the competitiveness of the proposed model in the current research landscape. (2) Although the model integrates multiple attention and learning strategies effectively, each component (e.g., local-window attention, DRL, contrastive learning) has been explored individually in prior work. The contribution lies mainly in the architectural combination and task-specific tuning, rather than in introducing fundamentally new mechanisms. (3) While the model introduces multiple attention modules and contrastive-learning branches, the paper does not provide any information on parameter count, inference speed, or memory footprint. This is especially important given the increasing need for efficient deployment in clinical settings.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a technically solid segmentation model using attention and contrastive learning. However, it largely builds on existing components, lacks standout innovations, and does not sufficiently differentiate itself from prior U-Net-based approaches.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I believe this paper should be rejected for two main reasons. First, it lacks originality—the method primarily combines existing techniques without introducing novel mechanisms, which limits its academic contribution. Second, it does not include comparisons with recent transformer-based models such as Swin-UNETR, and most of the baseline methods referenced are from before 2021. This makes it difficult to evaluate the method’s competitiveness in the current research landscape. Nonetheless, I would understand and respect the final decision if the paper is accepted.
Review #2
- Please describe the contribution of the paper
The authors propose a new type of U-Net architecture utilizing attention modules along with region-aware contrastive learning and disentangled representation learning. They show significant improvements over neural networks with similar architectures, such as Attention U-Net and Recurrent Residual U-Net, as well as nnU-Net (no new U-Net), when evaluating on the well-established BraTS MRI dataset for tumor segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper has a good evaluation protocol, comparing several network architectures against the authors' own formulation, as well as a systematic ablation study to determine the key components of the model. The results also show good improvements over existing methodologies, e.g., a final average Dice score of 87.69 versus 85.68 for nnU-Net across all classes of the BraTS dataset.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The main limitation of the paper is the lack of references to previous work on region-aware contrastive learning (RCL), e.g., Hu, H., Cui, J., and Wang, L., “Region-aware contrastive learning for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16291-16301. Without a clear connection to previous work, the reader cannot know what the specific technical improvements are. While the attention-based network has a satisfactory derivation and motivation, the RCL formulation and derivation need more justification.
Another concern is the number of epochs chosen. Training for 100 epochs is rather short considering that the default nnU-Net configuration uses 1000 epochs. While it is possible that the new method lends itself to fast convergence, it would be better to justify training for only 100 epochs, e.g., by showing loss curves for nnU-Net and the proposed model.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The main factor in my overall score is the lack of relevant citations for the mathematical derivations. I would be willing to push for acceptance given a clear explanation of how the RCL was derived and how it differs from previous work. While the work shows promise and seems innovative, without a detailed background on the derivations of novel concepts such as RCL it is hard to determine the novelty of the method.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed most of the rebuttal concerns adequately, and their combination of segmentation techniques does lead to better results. The reviewer recommends the paper for acceptance due to the organization of the work and the promising results on the dataset evaluation.
Review #3
- Please describe the contribution of the paper
The paper proposes a novel combination of U-Net with hybrid local-window attention, disentangled representation learning (DRL), and region-aware contrastive learning (RCL), for application to multimodal segmentation. The method is tested on the BraTS 2023 and 2024 challenge datasets, using the modalities T1, T1Gd, T2 and T2-FLAIR. Clear comparison results and ablation studies show that the method performs well and benefits from all the included components.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strengths of the paper are:
- Good results obtained for both datasets and all regions, particularly for TC.
- A good, clear ablation study showing the advantages of using all components (self-attention, DRL, RCL, and cross-attention).
- A novel combination of U-Net with hybrid local-window attention, disentangled representation learning (DRL), and region-aware contrastive learning (RCL).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The major weaknesses of the paper are:
- A number of the experimental details and hyperparameter settings are provided without explanation or any indication of how crucial they were, or whether the results were particularly sensitive to changes in them.
- No indication of computational time or memory usage (besides the GPU size) was included, which would have been useful given the complexity of the model.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The results are good, and the method combines a number of components in a novel way, verified well on publicly available challenge datasets and through comparison with a number of well-known alternative methods.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The implementation details provided are appreciated and necessary to increase the chances of the work being replicated.
Author Feedback
We thank all reviewers for their constructive feedback.
[R1-W1] Transformer-based SOTA models typically require pre-training; in this work, comparisons were instead conducted with U-Net-based models without pre-training. Because medical-image data differ across institutions, devices, and patients, and pre-trained weights often fail to generalize across domains, we focus on improving segmentation performance without pre-training. Future work will compare against pre-trained transformer-based methods.
[R1-W2] While previous work has addressed each component individually, we systematically modify and integrate them for multimodal segmentation. The differences from similar legacy modules can be summarized as follows (a toy sketch of the DRL idea appears after this list):
- Local-window Attention (Previous): Mainly applied in 2D or global form, resulting in high memory usage. (Proposed): Apply 3D local-window self- and cross-attention to limit computation and extract tumor features.
- Disentangled Representation Learning (DRL) (Previous): Focused on representation separation in single-modality data, without linkage to contrastive learning. (Proposed): Separate shared and modality-specific features in multimodal settings and structurally connect DRL to RCL.
- Region-aware Contrastive Learning (RCL) (Previous): Emphasized inter-class separation at the global level. (Proposed): Define positive and negative pairs based on tumor presence in local regions to support region-level contrastive learning and boundary awareness.
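As a toy illustration of the DRL point above, here is a minimal sketch of splitting each modality's features into shared and modality-specific parts with cosine-based alignment and separation terms; the head design, loss form, and names are assumptions, not the authors' formulation.

```python
# Minimal sketch of the disentangling idea described above, NOT the authors'
# code. Encoder widths, the cosine-based losses, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangleHead(nn.Module):
    """Splits one modality's feature vector into shared and specific parts."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_shared = nn.Linear(dim, dim)
        self.to_specific = nn.Linear(dim, dim)

    def forward(self, feat: torch.Tensor):
        return self.to_shared(feat), self.to_specific(feat)


def drl_loss(shared_a, shared_b, spec_a, spec_b):
    # Pull the shared codes of two modalities together ...
    align = 1.0 - F.cosine_similarity(shared_a, shared_b, dim=-1).mean()
    # ... and push each modality's specific code away from its shared code.
    sep = (F.cosine_similarity(shared_a, spec_a, dim=-1).abs().mean()
           + F.cosine_similarity(shared_b, spec_b, dim=-1).abs().mean())
    return align + sep


if __name__ == "__main__":
    head = DisentangleHead(dim=64)
    t1, flair = torch.randn(8, 64), torch.randn(8, 64)  # pooled per-modality features
    sh_a, sp_a = head(t1)
    sh_b, sp_b = head(flair)
    print(drl_loss(sh_a, sh_b, sp_a, sp_b).item())
```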
[R1-W3, R3-W2] The model integrates local-window attention, DRL, and RCL to enhance segmentation while maintaining computational efficiency (a short profiling sketch follows this list).
- Parameter count: A single decoder shared across modalities reduces the number of parameters compared with modality-specific decoders.
- Inference speed: Local-window operations replace global attention, reducing the computation range and enabling faster inference.
- Memory footprint: Local-window attention lowers memory consumption. Although profiling is not reported, efficiency is indirectly confirmed by the consistent results in Tab. 2.
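For completeness, a generic profiling recipe along the three axes above can be written with standard PyTorch utilities; the stand-in model and input shape below are placeholders, not the paper's network.

```python
# A small profiling recipe for the three quantities the reviewers asked about,
# usable with any nn.Module; `model` and the input shape are placeholders.
import time
import torch
import torch.nn as nn

model = nn.Conv3d(4, 3, kernel_size=3, padding=1)   # stand-in for the real network
x = torch.randn(1, 4, 128, 128, 128)                # 4 MRI modalities, 128^3 crop

# Parameter count.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f} M")

# Inference speed (CPU timing shown; on GPU, call torch.cuda.synchronize()
# around the timed region for accurate numbers).
model.eval()
with torch.no_grad():
    start = time.perf_counter()
    model(x)
    print(f"forward pass: {time.perf_counter() - start:.3f} s")

# Memory footprint on GPU (requires a CUDA device).
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.cuda()(x.cuda())
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```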
[R2-W1] We thank the reviewer for pointing out Hu et al. (ICCV 2021). (Previous): Focused on single-modality segmentation and global inter-class discrimination. (Proposed): Perform region-level contrastive learning by integrating RCL with DRL. Tumor regions are labeled positive or negative, and cosine similarity is computed to improve boundary discrimination. The citation and clarification will be added to the final version if accepted.
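To make the pairing concrete, below is a minimal InfoNCE-style sketch of a region-level contrastive loss over pooled region embeddings, with positives and negatives defined by tumor presence; the pooling, pair rule, and temperature placement are assumptions rather than the paper's exact loss.

```python
# Minimal sketch of the region-level pairing described above, NOT the authors'
# exact loss. Region pooling, pair construction, and temperature are assumptions.
import torch
import torch.nn.functional as F


def region_contrastive_loss(regions: torch.Tensor,
                            has_tumor: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """regions: (N, C) embeddings, one per local window/region.
    has_tumor: (N,) bool, whether the region overlaps a tumor label."""
    z = F.normalize(regions, dim=-1)
    sim = z @ z.t() / tau                        # cosine similarities / temperature
    sim.fill_diagonal_(float("-inf"))            # never contrast a region with itself
    # Positives: pairs of regions with the same tumor status; negatives: the rest.
    pos = has_tumor[:, None] == has_tumor[None, :]
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    # Average log-likelihood of the positives for each anchor (InfoNCE-style).
    masked = log_prob.masked_fill(~pos, 0.0)
    return -(masked.sum(dim=-1) / pos.sum(dim=-1).clamp(min=1)).mean()


if __name__ == "__main__":
    feats = torch.randn(16, 64)                  # 16 pooled region embeddings
    labels = torch.rand(16) > 0.5                # tumor presence per region
    print(region_contrastive_loss(feats, labels).item())
```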
[R2-W2] The default nnU-Net uses 1000 epochs for generalization across diverse settings. Capellán-Martín et al. [1] demonstrated that 100 epochs can be effective for brain tumor segmentation. We designed the model for fast convergence and achieved stable performance within 100 epochs on both BraTS 2023 and 2024 datasets (Tab. 2).
[R3-W1] The model was trained for 100 epochs using hybrid attention [1]. A batch size of 1 and a learning rate of 1e-4 were used, following standard 3D segmentation settings [2]. For contrastive learning, a temperature of 0.07 was used [3]. Preprocessing included Z-score normalization and tumor-centered cropping (128³), as in BraTS [4]. The total loss combined cross-entropy with region-aware contrastive loss [5].
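Under those stated settings, a single training step might look like the following sketch; the stand-in network, the optimizer choice (Adam is an assumption, the rebuttal only gives the learning rate), and the contrastive loss weight are placeholders.

```python
# Hedged sketch of the training settings listed in the rebuttal; the model,
# inputs, optimizer choice, and loss-weight lambda are placeholders.
import torch
import torch.nn as nn

model = nn.Conv3d(4, 4, kernel_size=3, padding=1)       # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr from the rebuttal
ce = nn.CrossEntropyLoss()
lam = 0.1                                                # contrastive weight (assumed)

for step in range(1):  # one illustrative step; the paper: 100 epochs, batch size 1
    volume = torch.randn(1, 4, 128, 128, 128)            # z-scored, 128^3 crop
    target = torch.randint(0, 4, (1, 128, 128, 128))     # voxel-wise labels
    logits = model(volume)
    # The paper's total loss adds the region-aware contrastive term, e.g.
    # loss = ce(logits, target) + lam * region_contrastive_loss(...)
    loss = ce(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step loss: {loss.item():.4f}")
```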
[1] Capellán-Martín, Daniel, et al. “Model ensemble for brain tumor segmentation in magnetic resonance imaging,” crossMoDA 2023, Springer, 2023.
[2] Isensee, Fabian, et al. “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature Methods, 2021.
[3] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations,” ICML, PMLR, 2020.
[4] Ferreira, André, et al. “How we won BraTS 2023 adult glioma challenge? Just faking it! Enhanced synthetic data augmentation and model ensemble for brain tumour segmentation,” arXiv preprint arXiv:2402.17317, 2024.
[5] Oord, Aaron van den, et al. “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors proposed a method combining hybrid local-window attention, disentangled representation learning, and contrastive learning for brain tumor segmentation. The paper is clearly written, with large improvements over the compared methods and a detailed ablation study. Although some relevant works are missing from the comparison, the authors have demonstrated the effectiveness of the proposed method.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The paper combines several existing components but offers only incremental novelty. My primary concern, aligning with Reviewer #1, is the evaluation: the authors test their method on 3D MRI brain tumor segmentation but do not include comparisons with more recent models such as Swin-UNETR, nnUNet-v2, MedNeXt, and UMamba; all baselines are outdated. In addition, BraTS 2023 and 2024 are well-established, continually evolving benchmarks, yet the authors do not use the official external validation set or discuss results from the top-performing BraTS challenge models. The BraTS 2023 continuous validation leaderboard is readily accessible, and metrics can be found in papers in the challenge proceedings.