Abstract

Precise segmentation of medical images is fundamental for extracting critical clinical information, which plays a pivotal role in enhancing the accuracy of diagnoses, formulating effective treatment plans, and improving patient outcomes. Although Convolutional Neural Networks (CNNs) and non-local attention methods have achieved notable success in medical image segmentation, they either struggle to capture long-range spatial dependencies due to their reliance on local features, or face significant computational and feature integration challenges when attempting to address this issue with global attention mechanisms. To overcome existing limitations in medical image segmentation, we propose a novel architecture, Perspective+ Unet. This framework is characterized by three major innovations: (i) It introduces a dual-pathway strategy at the encoder stage that combines the outcomes of traditional and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better comprehension of the global structure of images while retaining detail sensitivity. (ii) The framework incorporates an efficient non-local transformer block, named ENLTB, which utilizes kernel function approximation for effective long-range dependency capture with linear computational and spatial complexity. (iii) A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, meticulously refining features from various levels to harmonize global and local information. Experimental results on the ACDC and Synapse datasets demonstrate the effectiveness of our proposed Perspective+ Unet. The code is available in the supplementary material.
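The linear-complexity claim in (ii) rests on a general property of kernelized attention: once the softmax similarity is replaced by a product of feature maps, the matrix product can be reordered so that the N x N attention matrix is never formed. The sketch below illustrates that reordering in plain Python. It is not the authors' ENLTB; the feature map `elu(x) + 1` and all function names are assumptions chosen for illustration, since this page does not specify the kernel the paper uses.

```python
import math


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed choice, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def matmul(a, b):
    """Plain-Python product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def quadratic_attention(q, k, v):
    """Reference version: builds the full N x N similarity matrix."""
    pq = [feature_map(r) for r in q]
    pk = [feature_map(r) for r in k]
    sim = matmul(pq, transpose(pk))          # N x N, quadratic in N
    out = []
    for row in sim:
        z = sum(row)                         # row normaliser
        weighted = matmul([row], v)[0]
        out.append([w / z for w in weighted])
    return out


def linear_attention(q, k, v):
    """Same result, reordered as phi(Q) (phi(K)^T V); no N x N matrix is formed."""
    pq = [feature_map(r) for r in q]
    pk = [feature_map(r) for r in k]
    kv = matmul(transpose(pk), v)            # d x d_v summary, size independent of N
    k_sum = [sum(col) for col in zip(*pk)]   # length-d vector for the normaliser
    out = []
    for row in pq:
        z = sum(a * b for a, b in zip(row, k_sum))
        num = matmul([row], kv)[0]
        out.append([n / z for n in num])
    return out
```

Both functions return the same values up to floating-point rounding; only the second keeps time and memory linear in the number of tokens N.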

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3008_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3008_supp.zip

Link to the Code Repository

https://github.com/tljxyys/Perspective-Unet

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hu_Perspective_MICCAI2024,
        author = { Hu, Jintong and Chen, Siyan and Pan, Zhiyi and Zeng, Sen and Yang, Wenming},
        title = { { Perspective+ Unet: Enhancing Segmentation with Bi-Path Fusion and Efficient Non-Local Attention for Superior Receptive Fields } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work targets the canonical medical image segmentation challenge of learning long-range spatial dependencies. The authors propose Perspective+ Unet, which uses a dual-pathway strategy that maintains the local receptive field while expanding global features. The proposed method also includes an efficient non-local transformer block, ENLTB. The design is validated on the ACDC and BTCV datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The writing is good. Easy to follow. Datasets are standard.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is extensive prior discussion and work on learning local and global receptive fields, features, and kernels. The authors should highlight how this work differs and clarify the challenges and problems the solution addresses. Regarding segmentation accuracy in Table 1: on the Synapse dataset, SOTA performance on several organs, such as the liver and spleen, can reach 0.96. The reported numbers on these anatomies are lower; the authors should say more about the comparison, data splits, and experiment design to explain the gap. Is the U-Net compared in the table the canonical U-Net? Using nnUNet as the comparison baseline might be better for rigorous validation. Table 2 contains both 2D and 3D methods; how are the 2D and 3D methods compared, volumetrically or slice-wise?
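The last question matters because slice-wise and volumetric scoring can rate the same prediction differently. A minimal sketch, assuming binary masks stored as flat 0/1 lists per slice and assuming that an empty prediction/target pair scores 1.0 (a convention chosen here, not something stated in the paper); the function names are hypothetical.

```python
def dice(pred, target):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def slice_wise_dice(pred_slices, target_slices):
    """Mean of per-slice Dice scores; every slice has equal weight."""
    scores = [dice(p, t) for p, t in zip(pred_slices, target_slices)]
    return sum(scores) / len(scores)


def volumetric_dice(pred_slices, target_slices):
    """Dice over all voxels of the volume at once; large slices dominate."""
    flat_p = [x for s in pred_slices for x in s]
    flat_t = [x for s in target_slices for x in s]
    return dice(flat_p, flat_t)


# Two-slice toy volume: slice 0 is segmented perfectly, slice 1 misses entirely.
pred = [[1, 1, 1, 1], [1, 0, 0, 0]]
target = [[1, 1, 1, 1], [0, 1, 0, 0]]
print(slice_wise_dice(pred, target))   # 0.5
print(volumetric_dice(pred, target))   # 0.8
```

The gap between 0.5 and 0.8 on this toy case shows why a results table that mixes 2D and 3D methods needs to state which protocol was used.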

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provide the code in the supplementary materials.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors can add more discussion of the method comparisons; the results table is not very convincing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper revisits a relatively dated topic without solid methodology or fair comparisons. It lacks innovation and does not make clear which challenges it addresses.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors provided a good rebuttal response, but the basis of the paper did not change. I feel that many similar works have already been done and published, and this paper does not stand out among prior works, which limits its contribution to the MICCAI society.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a segmentation model architecture named Perspective+ Unet for 3D image segmentation. It proposes bi-path CNN blocks to capture local and global information, a spatial cross-scale integrator (SCSI) module to merge information from different stages, and an efficient transformer block to reduce computational complexity. Results across different datasets are presented.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Clear organization and illustrations. Readers can follow the logic easily. 2. Sufficient comparisons with transformer- and CNN-based segmentation models. 3. From the architectural perspective, the design of the bi-path CNN block is novel for 3D segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. Lack of comparisons of complexity and time costs with other models. 2. The effectiveness of SCSI and ENLTB needs further justification, as the ablation study only evaluates the case where both SCSI and BPRB are used.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Codes are provided as supplementary material.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clear flow and very easy to follow. The idea of the bi-path CNN and integrator is somewhat new, and sufficient comparison experiments are presented with great performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    To overcome the existing limitations in medical image segmentation, the authors propose a novel architecture called Perspective+ Unet. This architecture introduces a dual-pathway strategy in the encoder stage, combining the outcomes of traditional convolutions and a highly efficient non-local transformer block. Additionally, it incorporates a Spatial Cross-Scale Integrator strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The dual-pathway strategy is introduced in the encoder stage, combining the results of traditional convolutions and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better understanding of the global structure of the image while preserving detail sensitivity.

    2. The framework incorporates an efficient non-local transformer block named ENLTB, which utilizes kernel function approximation to achieve effective long-range dependency capture with linear computational and spatial complexity.

    3. A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, finely adjusting features from different levels to achieve the harmonization of global and local information.
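The receptive-field argument in point 1 can be made concrete with the standard formula for dilated kernels: a kernel with k taps and dilation d spans d(k-1)+1 positions. The sketch below is a 1-D, pure-Python illustration under stated assumptions: the two paths use stride 1 and are fused by summation, neither of which is confirmed by this page, and all names are hypothetical.

```python
def effective_kernel(k, d):
    """Number of positions spanned by a k-tap kernel with dilation d."""
    return d * (k - 1) + 1


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions given (kernel, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf


def conv1d_same(x, w, d=1):
    """Zero-padded 'same' 1-D cross-correlation with dilation d (odd kernel length)."""
    half = (len(w) // 2) * d
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            idx = i - half + j * d
            if 0 <= idx < len(x):
                acc += wj * x[idx]
        out.append(acc)
    return out


def bi_path(x, w_local, w_dilated, d):
    """Standard path plus dilated path; summation is an assumed fusion rule."""
    local = conv1d_same(x, w_local, 1)
    wide = conv1d_same(x, w_dilated, d)
    return [a + b for a, b in zip(local, wide)]


# An impulse shows which positions each path reaches.
impulse = [0, 0, 0, 1, 0, 0, 0]
print(bi_path(impulse, [1, 1, 1], [1, 1, 1], d=2))
# [0.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0.0]
```

With the same three weights, the dilated path reaches positions two steps away while the standard path covers the immediate neighbours, which is the trade-off the dual-pathway design aims to combine.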

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. Lack of theoretical analysis on Efficient Non Local Transformer Block
    2. Lack of complete experimental results for nnFormer
    3. Lack of comparison results for nnUNet

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The author provided the corresponding code in the supporting materials, and I believe the results of the paper can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. Add theoretical analysis on efficient non local transformer groups
    2. Add complete experimental results for nnFormer
    3. Add comparison results for nnUNet

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Detailed method introduction and relatively complete experiment, as well as provided reproduction code.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

(R1&R3&R4) We thank all the reviewers for their careful consideration and helpful suggestions.

(R1&R4) About nnUNet. Our method shows a performance decrease on the Synapse dataset compared to nnUNet. Conversely, on the ACDC dataset, our approach demonstrates an improvement over nnUNet, with an HD reduction of 6.1% and a DSC increase of 1.0%. These contrasting results highlight the distinct advantages of our method and nnUNet on different datasets. Furthermore, our method adopts the MISSFormer framework and follows a data augmentation paradigm similar to previous works, whereas nnUNet employs a stronger augmentation scheme; we therefore refrain from direct comparison with nnUNet in the paper to ensure a fair evaluation.

(R1) Fair comparison. Our method follows the framework of MISSFormer, and the division into training and testing sets is identical to previous work. We believe there is no unfair comparison. The supplementary materials include the source code and the dataset division list used in this study. Additionally, we commit to making all code and the pth file publicly available to ensure the reproducibility and fairness of our results if our paper is accepted.

(R1) Comparison between 2D and 3D methodologies. When comparing 2D and 3D methodologies, our analysis employs a slice-wise evaluation approach, as in previous works.

(R1) Performance of our method on the liver and spleen. These organs possess larger volumes and smoother boundaries, which generally reduces the complexity of segmentation. This, indeed, contributes to our method achieving DSC accuracies of up to 0.96 for these organs. It is common for larger organs with smoother boundaries to score higher in segmentation tasks, a trend that can be observed in other segmentation studies as well.

(R3) Latency. Our method can segment one 512x512 medical image frame in less than one second. The computational time of our approach is comparable to that of small models such as Swin-Unet, thereby satisfying the real-time inference requirement for clinical applications of medical image analysis.

(R3) Ablation Study. Due to the page limit, we can only show four of the eight ablation settings. The ablation study consistently highlights the efficacy of the BPRB and SCSI modules in enhancing performance. The full model incorporating all three modules achieves the highest performance on both evaluation metrics. The configuration with only the BPRB and SCSI modules ranks second, outperforming the use of either BPRB or SCSI alone. Additionally, configurations incorporating the BPRB module exhibit relatively higher segmentation accuracy, with an approximate 0.1% improvement in DSC, and lower boundary errors, with a reduction of around 21.3% in HD, compared to those without BPRB.

(R4) Theoretical analysis of ENLTB. The main idea of ENLTB is to employ linear approximations to reduce the computational complexity of non-local attention modules. Under nonlinear transformations, salient regions in feature space are robust to minor perturbations and retain their saliency in the approximate representation. Although linear transformations cannot fully replace the original attention mechanism, they are sufficient to distinguish critical differences between background and salient objects. This approach emphasizes the feature points with the most substantial impact on the final task while minimizing the effect on segmentation quality. Besides, mapping high-dimensional features to a low-dimensional space facilitates rapid identification of significant features in the high-dimensional space.

(R4) About nnFormer. On the Synapse dataset, nnFormer surpasses our model’s performance. However, on the ACDC dataset, our approach demonstrates superior results, with a DSC higher by approximately 0.5% and an HD smaller by approximately 6.1%.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces Perspective+ Unet for 3D medical image segmentation, featuring a dual-pathway encoder strategy and an Efficient Non-Local Transformer Block (ENLTB) to capture local and global features. The Spatial Cross-Scale Integrator (SCSI) refines feature integration, and the method is validated on the ACDC and BTCV datasets with good performance. However, the paper lacks complete experimental results for all baselines and thorough comparisons of computational complexity and runtime efficiency. Moreover, the technical contributions compared with similar works need to be discussed more thoroughly. Overall, the paper is on the borderline but I am slightly leaning towards an accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



