Abstract

The rise of Transformer architectures has advanced medical image segmentation, leading to hybrid models that combine Convolutional Neural Networks (CNNs) and Transformers. However, these models often suffer from excessive complexity and fail to effectively integrate spatial and channel features, both of which are crucial for precise segmentation. To address this, we propose LHU-Net, a Lean Hybrid U-Net for volumetric medical image segmentation. LHU-Net prioritizes spatial feature extraction before refining channel features, optimizing both efficiency and accuracy. Evaluated on four benchmark datasets (Synapse, Left Atrial, BraTS-Decathlon, and Lung-Decathlon), LHU-Net consistently outperforms existing models across diverse modalities (CT/MRI) and output configurations. It achieves state-of-the-art Dice scores while using four times fewer parameters and 20% fewer FLOPs than competing models, without the need for pre-training, additional data, or model ensembles. With an average of 11 million parameters, LHU-Net sets a new benchmark for computational efficiency and segmentation accuracy. Our implementation is available on github.com/xmindflow/LHUNet.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4333_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/xmindflow/LHUNet

Link to the Dataset(s)

Synapse dataset: https://www.synapse.org/Synapse:syn3193805/wiki/217789
BraTS and Lung Decathlon: http://medicaldecathlon.com/
LA dataset: https://www.cardiacatlas.org/atriaseg2018-challenge/atria-seg-data/

BibTex

@InProceedings{SadYou_LHUNet_MICCAI2025,
        author = { Sadegheih, Yousef and Bozorgpour, Afshin and Kumari, Pratibha and Azad, Reza and Merhof, Dorit},
        title = { { LHU-Net: a Lean Hybrid U-Net for Cost-efficient, High-performance Volumetric Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work introduces a lightweight 3D medical image segmentation model called LHU-Net. The main novelty is the hybrid attention module, which combines large kernel attention (LKA) and self-attention. The hybrid attention module is applied to both the encoder and decoder paths, leading to lower computational complexity. The model is evaluated on four datasets, achieving state-of-the-art performance.
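
    For illustration, a minimal PyTorch sketch of such a parallel LKA/self-attention block is given below. It assumes the common LKA construction (depthwise conv, dilated depthwise conv, pointwise conv); the class names LKA3D and HybridAttention3D are ours, not from the released code.

        import torch
        import torch.nn as nn

        class LKA3D(nn.Module):
            """Large Kernel Attention: depthwise, dilated depthwise, and pointwise
            convolutions produce an attention map that gates the input."""
            def __init__(self, dim):
                super().__init__()
                self.dw = nn.Conv3d(dim, dim, 5, padding=2, groups=dim)
                self.dw_dilated = nn.Conv3d(dim, dim, 7, padding=9, dilation=3, groups=dim)
                self.pw = nn.Conv3d(dim, dim, 1)

            def forward(self, x):
                return x * self.pw(self.dw_dilated(self.dw(x)))

        class HybridAttention3D(nn.Module):
            """Runs LKA (large local receptive field) and self-attention (global)
            in parallel on the same feature map and sums the two branches. Full
            self-attention over all voxels is only affordable at the
            low-resolution stages of a U-Net."""
            def __init__(self, dim, num_heads=4):  # dim must be divisible by num_heads
                super().__init__()
                self.lka = LKA3D(dim)
                self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

            def forward(self, x):                      # x: (B, C, D, H, W)
                b, c, d, h, w = x.shape
                tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C)
                sa, _ = self.mha(tokens, tokens, tokens)
                sa = sa.transpose(1, 2).reshape(b, c, d, h, w)
                return self.lka(x) + sa

        # e.g. HybridAttention3D(32)(torch.randn(1, 32, 8, 8, 8))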

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is lightweight while maintaining high segmentation performance.

    2. The design of the hybrid attention module is reasonable and easy to understand.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The novelty is limited. The overall U-shaped architecture, the LKA module, and the self-attention mechanism are all well-known techniques. This work simply combines the LKA with self-attention blocks in parallel and inserts the result into some U-Net blocks. This design is quite straightforward and contributes little to existing knowledge.

    2. The performance gain over existing methods is marginal (Tables 2 and 3), in some cases less than 0.5%.

    3. The reduction in computational complexity is purely theoretical, without a runtime assessment.

    4. Some details in the writing can be further improved. For example, the x-axis labels in Fig. 1 seem swapped. The description of the OmniFocus block is vague, and the block is not illustrated in Fig. 2. Why?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is well structured and self-contained, but its methodological contribution is limited and the results are not convincing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Sorry for the late reply. After reading the rebuttal, I am somewhat convinced that the combination of these existing methods also brings about overall novelty, and the improvement, though marginal, is acceptable. Overall, I have turned to a more positive opinion of this paper after reading the rebuttal.



Review #2

  • Please describe the contribution of the paper

    The paper presents LHU-Net, a novel 3D segmentation model that combines CNNs and vision transformers in a hybrid architecture. It applies spatial attention in early layers and channel attention in deeper layers. With this hybrid design, it achieves strong performance across four benchmark datasets with fewer parameters and lower FLOPs.
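
    As a rough illustration of that depth-dependent split (the stage counts follow the paper's description of two convolutional stages followed by three hybrid stages; the function and labels below are ours):

        def attention_for_stage(stage: int, num_stages: int = 5) -> str:
            """Pick the attention type per encoder stage: plain convolutions
            early, spatial attention in the middle, and channel attention at
            the deepest stage, where the volume is tiny but channels are many."""
            if stage < 2:
                return "cnn-only"
            if stage < num_stages - 1:
                return "spatial-attention"
            return "channel-attention"

        # [attention_for_stage(s) for s in range(5)]
        # -> ['cnn-only', 'cnn-only', 'spatial-attention',
        #     'spatial-attention', 'channel-attention']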

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The architecture design seems to be novel and effective. The integration of CNNs and vision transformers in a unified 3D segmentation framework effectively leverages both local and global feature representations. The use of spatial attention in early layers and channel attention in deeper layers allows the model to capture fine-grained spatial details while maintaining semantic richness in high-level features. It also shows a lower computational cost.

    2. The experiments demonstrate strong performance, which is thoroughly validated on four benchmark datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper primarily emphasizes the combination of architectural blocks, but lacks sufficient motivation and theoretical justification for the design choices.

    2. For the Self-Adaptive Contextual Fusion Module, it is unclear how the learnable weights (γ and δ) are optimized.

    3. The ablation study focuses mainly on combinations of attention mechanisms but does not explore other factors such as network depth, kernel sizes, or the number of hybrid layers.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please address the concerns in the weakness section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Overall, this paper demonstrates SOTA performance while using significantly fewer parameters than existing methods. The authors have addressed most of the concerns in their rebuttal. I recommend accepting this paper.



Review #3

  • Please describe the contribution of the paper

    The authors present LHU-Net, a Lean Hybrid U-Net designed for volumetric medical image segmentation. The core idea is that using tailored modules at different network depths can improve efficiency while reducing computational cost. Specifically, LHU-Net applies spatial attention in the early layers to capture local features and channel attention in the deeper layers to model global context. Experiments on four benchmark datasets (Synapse, Left Atrial, BraTS-Decathlon, Lung-Decathlon) demonstrate that the model achieves state-of-the-art Dice scores while using four times fewer parameters and 20% fewer FLOPs than competing approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Optimized hybrid attention mechanisms for improved contextual representation
    • Parameter and FLOP reduction while maintaining top Dice performance
    • LHU-Net demonstrates encouraging performance on various imaging modalities, effectively handling both single-label and multi-label segmentation tasks with high versatility.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It remains unclear how the boundary between early and deeper layers is defined
    • The comparisons with existing methodologies could be confirmed using a statistical analysis
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The methodology involved within LHU-Net is of high interest for the medical image analysis community. The submitted paper is innovative and well written. The following comments could be taken into account for further improvements.

    Main comments:
    1- The acronym ‘LKAd’ is introduced without explanation. Its full form should be provided upon first mention.
    2- Sect. 2.2 states that the self-adaptive contextual fusion module is integrated into the top hybrid blocks. However, it remains unclear how you define the boundary between early and deeper layers. At what point in the architecture is this transition considered, and how is the integration point determined?
    3- The OmniFocus attention block should be illustrated in Fig. 2. In addition, why use it at the deepest network level only?
    4- Could you motivate the use of large kernel convolutional attention followed by deformable convolution within the core hybrid blocks?
    5- The comparisons with existing methodologies could be confirmed using a statistical analysis through t-tests.

    Minor comments:
    6- The integration of the models within nnUNet could be described in more depth in Sect. 3.1.
    7- In Sect. 1, you could add a reference when you state that 3D models have been shown to outperform 2D models by capturing better context and improving segmentation accuracy.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Using tailored modules at different network depths is relevant
    • Robust method with high versatility
    • High efficiency with minimal cost
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal helps clarify the innovation and also specifies the boundary between the early and deeper layers regarding the integration of the self-adaptive contextual fusion modules.




Author Feedback

We thank you for your valuable feedback and the chance to clarify.

Novelty, Motivation, and Performance (R2, R3, M): Our novelty lies in proposing a “go-to” hybrid segmentation architecture that is lightweight, efficient, and fast at inference while outperforming state-of-the-art models across diverse datasets. This is enabled by two key modules, OmniFocus and Self-Adaptive Contextual Fusion, which integrate ViT- and CNN-based attention effectively. We separate channel and spatial attention within the ViT blocks depending on network depth, avoiding the computational overhead of applying both simultaneously at all levels, as done in prior works. Two trainable vectors (γ, δ) are optimized end-to-end via backpropagation to adaptively calibrate the fusion of attention paths per layer and dataset, as confirmed by our ablation study. Our design fuses local feature extraction via residual blocks, large-receptive-field convolution (LKA with deformable layers), spatial ViT attention (capturing global spatial dependencies), and channel ViT attention (capturing semantic dependencies across channels) for the first time in this combination. This selective, adaptive fusion reduces redundancy and parameter cost while improving segmentation accuracy.

While it is difficult to determine theoretically exactly how many of these modules to insert and where, due to complex interactions with training protocols and dataset characteristics, this must be established experimentally, as done in our ablation study and consistent with standard practice in architecture-design papers. By showing that applying both spatial and channel ViT attention at all levels is unnecessary, and that adaptive, selective fusion leads to more efficient and accurate segmentation, we present not only a practical, fast, and simple backbone but also a flexible blueprint for future medical imaging architectures.

Our method improves DSC across multiple datasets, ranging from 0.17% to 1.28% (avg. 0.75% DSC and 10.5% HD95 w.r.t. the 2nd-best SOTA). These gains align with recent studies such as UNETR++ (TMI 2024, 0.14%–0.75%, avg. 0.45%), Beyond Self-Attention (WACV 2024, 0.01%–0.63%, avg. 0.5%), and MedNeXt-M-K3 (MICCAI 2023, 0.35%–1.2%, avg. 0.66%). This performance comes with ~81% fewer parameters and ~39% fewer FLOPs across datasets w.r.t. the 2nd-best SOTA. On BraTS, inference for 73 patients takes 2.31 minutes, 33% faster than UNETR++ (3.47 min) and much faster than nnUNet (6.61 min). Training on BraTS took 20 GPU hours. Full results will be added to Tab. 1.
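
For concreteness, a minimal sketch of how such trainable fusion vectors could be implemented in PyTorch follows; the per-channel parameter shape and the class name are our assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class SelfAdaptiveFusion(nn.Module):
        """Fuses two attention paths with learnable per-channel vectors.
        gamma and delta are ordinary nn.Parameters, so backpropagation tunes
        the fusion ratio per layer (and, in effect, per dataset)."""
        def __init__(self, dim):
            super().__init__()
            self.gamma = nn.Parameter(torch.ones(1, dim, 1, 1, 1))
            self.delta = nn.Parameter(torch.ones(1, dim, 1, 1, 1))

        def forward(self, path_a, path_b):
            # path_a / path_b: (B, C, D, H, W) outputs of the two attention
            # branches, e.g. the self-attention and convolutional paths
            return self.gamma * path_a + self.delta * path_b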

OmniFocus module (R1, R2): Fig. 2 shows two attention types: Self-Attention (S) for the Self-Adaptive Fusion module and Channel Attention (C) for the OmniFocus block. We will update the figure for clarity. This module is applied only in the last layer because the channel dimension there is large enough to capture the necessary information, as supported by the ablation results.
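
A simplified sketch of channel attention of this kind: tokens attend along the channel axis, so the attention map is C x C regardless of volume size, which stays cheap at the deepest level where channels are many and voxels are few. The naming and the scaling choice below are ours, not the authors' OmniFocus implementation.

    import torch
    import torch.nn as nn

    class ChannelAttention3D(nn.Module):
        """Transposed self-attention: queries/keys/values are projected per
        voxel, but attention is computed between channels (a C x C map)."""
        def __init__(self, dim):
            super().__init__()
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                       # x: (B, C, D, H, W)
            b, c, d, h, w = x.shape
            t = x.flatten(2).transpose(1, 2)        # (B, N, C), N = D*H*W
            q, k, v = self.qkv(t).chunk(3, dim=-1)  # each (B, N, C)
            attn = (q.transpose(1, 2) @ k) * (q.shape[1] ** -0.5)  # (B, C, C)
            attn = attn.softmax(dim=-1)
            out = (attn @ v.transpose(1, 2)).transpose(1, 2)       # (B, N, C)
            return self.proj(out).transpose(1, 2).reshape(b, c, d, h, w)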

Ablation study, motivations (R1, R3): We use a kernel size of 3 to keep computational cost low. Ablations show the best setup is two CNN layers followed by three hybrid layers. A four-stage network also performed well but was not optimal. Reducing spatial size to about 4 voxels (similar to nnUNet’s practice) via five-stage downsampling in our model improves results across datasets. As spatial size shrinks, attention shifts from spatial to channel; below ~5 voxels, channel attention dominates. Deformable convolution dynamically adapts the receptive field with little overhead, boosting performance (shown in ablations).
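
As a rough illustration of refining a large-receptive-field branch with deformable sampling: torchvision ships only a 2D deformable convolution, so the sketch below is a 2D analogy with offsets predicted from the features; it is not the authors' 3D implementation.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformableRefine2D(nn.Module):
        """Predicts per-tap sampling offsets from the incoming features
        (e.g. an LKA branch output), then applies a deformable convolution
        so the receptive field adapts to image content at little cost."""
        def __init__(self, dim, k=3):
            super().__init__()
            # 2 offset channels (dy, dx) per kernel tap
            self.offset = nn.Conv2d(dim, 2 * k * k, k, padding=k // 2)
            nn.init.zeros_(self.offset.weight)  # start from a regular grid
            nn.init.zeros_(self.offset.bias)
            self.deform = DeformConv2d(dim, dim, k, padding=k // 2)

        def forward(self, x):  # x: (B, C, H, W)
            return self.deform(x, self.offset(x))

    # e.g. DeformableRefine2D(16)(torch.randn(1, 16, 32, 32)).shape == (1, 16, 32, 32)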

R1: The full form of LKAd, the nnUNet configuration details, and references supporting 3D over 2D (nnUNet, Beyond SA) will be added. Our model’s p-value for the LA dataset was < 0.05 compared to nnUNet, indicating a significant improvement.

R2: Regarding Fig. 1, the text above each point shows the third metric for DSC, FLOPs, and Params. We’ll revise the caption for clarity and are happy to clarify any aspects of the OmniFocus module that need further explanation.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    This paper has received mixed reviews, where the main criticisms concern the lack of novelty and of actual improvements. In particular, two reviewers stress that the proposed approach is a simple combination of existing architectural blocks, whose integration is not properly justified and motivated. Furthermore, reviewers (R2) also consider that the performance improvements are marginal, making the empirical validation unconvincing. Lastly, while the number of FLOPs is indeed a proxy for computational complexity, reducing the number of FLOPs does not necessarily translate into faster training/inference times (i.e., FLOPs are theoretical, and real hardware might execute some FLOPs in parallel, masking their cost). Thus, also reporting the actual training/inference times could strengthen the validation in terms of computational cost.

    Therefore, I recommend that the authors address these important issues in the rebuttal, particularly stressing the actual novelty of the proposed approach (beyond a combination of existing methods), properly motivating the use of each component, and discussing the concerns related to the empirical validation.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors addressed the concerns from the reviewers well in the rebuttal, and all the reviewers suggested acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers are positive about accepting this work after the rebuttal. Following these ratings, I think this work can be published at MICCAI 2025.


