Abstract
The Transformer architecture and versatile CNN backbones have led to advanced progress in sequence modeling and dense prediction tasks. A critical development is the incorporation of different token mixing modules, such as those in ConvNeXt and Swin Transformer. However, findings within the MetaFormer framework suggest these token mixers have a lesser influence on representation learning than the architecture itself. Yet their impact on 3D medical images remains unclear, motivating our investigation into different token mixers (self-attention, convolution, MLP, recurrence, global filter, and Mamba) in 3D medical image segmentation architectures, and further prompting a reevaluation of the backbone architecture's role in achieving a trade-off between accuracy and efficiency. In this paper, we propose a unified segmentation architecture, MetaUNETR, featuring a novel TriCruci layer that decomposes the token mixing process along each spatial direction while preserving precise positional information on the orthogonal plane. By employing Centered Kernel Alignment (CKA) analysis of the feature learning capabilities of these token mixers, we find that the overall architecture of the model, rather than any specific token mixer, plays the more crucial role in determining the model's performance. Our method is validated across multiple benchmarks varying in size and scale, including the BTCV, AMOS, and AbdomenCT-1K datasets, achieving top segmentation performance while reducing the model's parameters by about 80% compared to the state-of-the-art method. This study provides insights for future research on the design and optimization of backbone architectures, steering towards more efficient foundational segmentation models. The source code is available at https://github.com/lyupengju/MetaUNETR.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2749_paper.pdf
SharedIt Link: https://rdcu.be/dV51z
SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72114-4_43
Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2749_supp.pdf
Link to the Code Repository
https://github.com/lyupengju/MetaUNETR
Link to the Dataset(s)
https://github.com/JunMa11/AbdomenCT-1K
https://www.synapse.org/Synapse:syn3193805/wiki/89480
https://amos22.grand-challenge.org/
BibTex
@InProceedings{Lyu_MetaUNETR_MICCAI2024,
author = { Lyu, Pengju and Zhang, Jie and Zhang, Lei and Liu, Wenjian and Wang, Cheng and Zhu, Jianjun},
title = { { MetaUNETR: Rethinking Token Mixer Encoding for Efficient Multi-Organ Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
year = {2024},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15009},
month = {October},
pages = {446--455}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes the MetaUNETR framework, built around the TriCruci layer, which helps the model learn better representations.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well motivated and includes figures that provide good backing for the results stated in the paper.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The paper's writing is difficult to follow. It is unclear how the TriCruci layer works, and since it is the foundation of the paper, it warrants a more in-depth explanation.
- The CKA figures are difficult to follow and not entirely descriptive. There is a need to delve into why the stage 4 encodings differ so distinctly, but this analysis is not present in the paper.
- The results section demonstrates the usefulness of the proposed approach (with respect to FLOPs and parameters); however, these results have not been analyzed in detail.
Minor:
- “Conceptually it synergistically amalgamates orthogonal spatial cues along the cardinal dimensions of height, width, and depth for volumetric data representation learning while preserving the precise positional information on their respective perpendicular planes, i.e., coronal (co), sagittal (se) and transverse (tr), as illustrated in Figure 1.” This sentence is too long and very difficult to break down.
- The paper cites [7], but the following paper, which defines CKA as an average HSIC score over minibatches, should also be cited: Nguyen, T., Raghu, M. and Kornblith, S. Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. ICLR 2021.
- The name MetaUNETR is confusing. There is no direct explanation for why the word “meta” is used as a prefix (it derives from the MetaFormer structure built around token mixing, but this is not explained in the paper).
- While Equations 1 and 2 are valuable, they take up too much space, are not referenced correctly in the paper, and could be moved to the supplementary material. A figure illustrating the difference between the two would be more interpretable.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
This paper shows very good results in terms of reducing parameters and FLOPs for a Transformer network while still maintaining good accuracy. However, the paper needs significant rewriting to state its contributions and results in a way that is easier for a reader to follow.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Reject — should be rejected, independent of rebuttal (2)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is difficult to follow in terms of contribution, novelty, and analysis. The results clearly show the efficacy of the method; however, they have not been analyzed in detail, either in the main paper or in the supplemental material. Qualitative results and better figures distinguishing standard attention from token mixing could really strengthen this paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
The rebuttal answered most of the questions posed, and although the initial rating was marked as independent of rebuttal, this reviewer feels that maintaining it would be an injustice to the paper.
Review #2
- Please describe the contribution of the paper
This paper analyzes the effect of token mixers on performance in Transformer-style architectures for the multi-organ segmentation task. Six mixers, such as MLP and Mamba, were applied within the same meta structure, and the experimental results showed no significant difference in performance. Based on these experimental results, it was concluded that the general structure of the MetaFormer has a significant impact on performance.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper also proposes a method to efficiently reduce the amount of computation and improve performance through the TriCruci layer. The experimental results demonstrate improved performance despite a large reduction in computation and parameters relative to the comparison methods, as intended.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
This paper concludes that the general structure of the MetaFormer has a significant impact on performance. This conclusion was already reached by the preceding MetaFormer paper, so it is difficult to regard it as a unique contribution of this work.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
This paper is very dense and its readability needs to be improved; almost every section consists of a single paragraph.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Accept — could be accepted, dependent on rebuttal (4)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This reviewer made a decision by analyzing the strengths and weaknesses of the paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Weak Accept — could be accepted, dependent on rebuttal (4)
- [Post rebuttal] Please justify your decision
After reviewing the authors' answers and the opinions of the other reviewers, I keep my rating of “Weak Accept”. I think the paper would be better if it were more concise and more readable.
Review #3
- Please describe the contribution of the paper
The paper proposes a new layer, named TriCruci, to better integrate 3D spatial information in the token mixer. The paper also conducts a systematic study of different token mixers to identify which have superior and more efficient representation learning capabilities in 3D medical image segmentation.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is very well written and organized
- Thorough experimentation and visualizations to fully support all the claims made in the paper
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
No obvious weakness
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
I don’t have any comments. Great work!
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Accept — should be accepted, independent of rebuttal (5)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The comparative study of different token mixers is a useful result that would benefit the community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Accept — should be accepted, independent of rebuttal (5)
- [Post rebuttal] Please justify your decision
After reviewing the comments from authors and other reviewers, I still believe this is a good paper. Although the readability could be improved, I didn’t find it particularly hard to follow.
Author Feedback
We thank all reviewers for affirming the novelty of our method, the usefulness of results for the community, and their constructive feedback for paper improvement. We especially appreciate Reviewer #4’s strong endorsement of the contributions and quality of our manuscript. The primary concerns raised by Reviewers #1 and #3 are addressed in four parts below.
Question 1: Unknown code and dataset availability (Reviewer #1). Response: Our code will be released in due course, as mentioned in the abstract of our manuscript, and the datasets used are publicly accessible, as cited in our manuscript.
Question 2: Contributions need to be further clarified (Reviewer #3), e.g. MetaFormer's impact (Reviewer #1). Response: We further clarify our contributions as follows:
1) We propose a novel MetaUNETR architecture for 3D multi-organ segmentation. It features lightweight TriCruci layers for parameter-efficient token mixing spanning Mamba, self-attention, convolution, MLP, recurrence, and global filter. (MetaUNETR: a UNETR-style model with diverse token mixer encoding backbones.)
2) Based on an extensive comparison of these token mixers, we validate that the capability of these models derives from the MetaFormer architecture and is less influenced by the specific token mixer. (The original MetaFormer considered only attention and MLP in the 2D natural image domain.)
3) Comparative analyses using CKA uncover significant similarities, highlight the importance of features within the upper encoder layers, and identify redundant computations in the deeper layers.
4) Based on these findings, we applied layer pruning to MetaUNETR accordingly, which achieved improved computational efficiency and accuracy on three datasets compared to prior art. (*Reductions in parameters and FLOPs are explained at the end of Section 3.2.)
(*Note: a recent study [1] (released on 26 March 2024) on Large Language Models (LLMs) also demonstrates that shallow layers play a critical role in storing knowledge and that there is a high degree of parameter redundancy in the deeper layers of the network. We believe our findings in the vision domain provide a solid foundation for further research on PEFT by efficiently leveraging parameters in the deeper layers or exploring new, structurally efficient architectures.)
Question 3: TriCruci layer warrants a little more explanation. (Reviewer #3) Response: The TriCruci layer, elucidated in Section 2.3, is further explained as follows: The TriCruci layer introduces an artificial inductive bias akin to the locality enforced by large kernel convolutions and Swin Transformer layers. It establishes a mutually cruciform receptive field across three axes (depth, height, and width), each capturing long-range dependencies. For depth mixing, the input X of size H×W×D×C is reshaped to HWC×D, and diverse paradigms (e.g., linear projections with MLP) are applied to each transverse plane token to mix information. Analogous operations are performed along the height and width axes.
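To make the depth-mixing step described above concrete, here is a minimal PyTorch sketch under the stated reshaping scheme (H×W×D×C reshaped to HWC×D, then mixed along D). The DepthMixer class, its MLP mixer, and all tensor shapes are illustrative assumptions for exposition, not the released MetaUNETR implementation; see the code repository for the actual layers.

```python
import torch
import torch.nn as nn

class DepthMixer(nn.Module):
    """Illustrative depth-axis token mixing: each (h, w, c) location's depth
    profile of length D is mixed by a shared MLP, so information flows along D
    while positions in the transverse plane stay fixed."""
    def __init__(self, depth: int, hidden: int = 64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(depth, hidden),
            nn.GELU(),
            nn.Linear(hidden, depth),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, D, C) -> move D to the last dimension for mixing
        b, h, w, d, c = x.shape
        x = x.permute(0, 1, 2, 4, 3).reshape(b, h * w * c, d)   # (B, H*W*C, D)
        x = self.mix(x)                                          # mix along depth
        return x.reshape(b, h, w, c, d).permute(0, 1, 2, 4, 3)   # back to (B, H, W, D, C)

# Analogous height- and width-mixing branches would operate along H and W; the
# three outputs could then be fused (e.g., summed) to form the cruciform
# receptive field described in the rebuttal.
x = torch.randn(2, 16, 16, 16, 32)
print(DepthMixer(depth=16)(x).shape)  # torch.Size([2, 16, 16, 16, 32])
```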
Question 4: The CKA figures need to be more descriptive (Reviewer #3). Response: The CKA results, elucidated in Section 3.2, are further explained as follows: Fig. 3 depicts the pairwise similarities between corresponding layers of the diverse token-mixer-encoded backbones (e.g., Attention vs. MLP). The x and y axes denote the four backbone stages. Similarity scores range from 0.4 to 1, with lighter colors indicating higher similarity. Along the antidiagonal of the plots, layer-wise and stage-wise consistencies across backbones are revealed for the top three stages, indicating that stage 4 contributes minimally to model performance and is therefore redundant. Similar to the findings in [1], a deeper understanding of network learning dynamics, such as why the stage 4 encodings differ and how to effectively use the deep-layer parameters, deserves dedicated exploration in our future work.
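For readers unfamiliar with the similarity measure behind these plots, the following is a minimal NumPy sketch of full-batch linear CKA between two activation matrices. The function name, the toy feature matrices, and the choice of the linear kernel (rather than the minibatch-averaged HSIC estimator mentioned by Reviewer #1) are assumptions for illustration, not the exact analysis pipeline used in the paper.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, features).
    Columns are mean-centered before computing the similarity."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # HSIC-style cross- and self-similarity terms for the linear kernel
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    self_x = np.linalg.norm(x.T @ x, "fro")
    self_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (self_x * self_y))

# Toy usage: compare stage-wise features from two hypothetical backbones.
rng = np.random.default_rng(0)
feats_attn = rng.normal(size=(128, 256))                       # flattened features, mixer A
feats_mlp = feats_attn + 0.1 * rng.normal(size=(128, 256))     # mixer B, similar features
print(round(linear_cka(feats_attn, feats_mlp), 3))             # close to 1.0 for similar features
```

A score near 1 at a given stage pair indicates that two token-mixer backbones have learned highly similar representations there, which is how the stage-wise redundancy discussed above is read off the figure.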
[1] Gromov, Andrey, et al. “The unreasonable ineffectiveness of the deeper layers.” arXiv preprint arXiv:2403.17887 (2024).
Meta-Review
Meta-review #1
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper proposes a segmentation method with a TriCruci layer, which efficiently reduces the amount of computation and improves performance. Its effectiveness was evaluated in detail in the experiments. Some mathematical descriptions in the paper are incorrect:
- Do not use Python code-like notation, including slicing representations.
- Scalar, vector, and matrix symbols should be differentiated by regular, bold, and capital letters.
While this paper has some problems, the contributions of (1) the proposal of a reasonable model for organ segmentation and (2) a detailed analysis of its performance in the experiments can be positively evaluated. Therefore, I recommend accepting this paper.
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A