Abstract

Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have been dedicated to improving the performance of the standard UNet, few have conducted in-depth analyses of the underlying patterns UNet learns in medical image segmentation. In this paper, we explore the patterns learned by a UNet and observe two important factors that potentially affect its performance: (i) irrelevant features learned due to asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between the encoder and the decoder and to reduce redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to the other blocks, and we reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architectures in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets.
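
To make the idea concrete, below is a minimal, hedged sketch of how such self-regularization could be wired into a UNet training loop. It is an illustrative reading of the abstract, not the authors' implementation: the function names, the cosine-similarity form of the consistency term, the shallow/deep channel split, and the lambda weights are all assumptions; see the code repository linked below for the actual losses.

```python
# Illustrative sketch only (assumed names and loss forms, not the paper's exact method).
import torch.nn.functional as F

def semantic_consistency_loss(block_feats, last_decoder_feat):
    """Encourage intermediate encoder/decoder features to agree with the most
    semantic feature map (the last decoder layer), after channel pooling and
    resizing so that shapes match."""
    target = last_decoder_feat.mean(dim=1, keepdim=True).detach()  # B x 1 x H x W
    loss = 0.0
    for f in block_feats:
        f = f.mean(dim=1, keepdim=True)                            # B x 1 x h x w
        f = F.interpolate(f, size=target.shape[-2:], mode="bilinear",
                          align_corners=False)
        loss = loss + (1.0 - F.cosine_similarity(
            f.flatten(1), target.flatten(1), dim=1)).mean()
    return loss / max(len(block_feats), 1)

def feature_distillation_loss(feat):
    """Penalize channel redundancy by distilling the (more diverse) shallow
    half of the channels into the deep half."""
    c = feat.shape[1] // 2
    shallow, deep = feat[:, :c], feat[:, c:2 * c]
    return F.mse_loss(deep, shallow.detach())

# Total objective, with lambda1 / lambda2 as in the paper's ablation:
# loss = seg_loss \
#        + lambda1 * semantic_consistency_loss(intermediate_feats, last_decoder_feat) \
#        + lambda2 * feature_distillation_loss(bottleneck_feat)
```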

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0712_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0712_supp.pdf

Link to the Code Repository

https://github.com/ChongQingNoSubway/SelfReg-UNet

Link to the Dataset(s)

Medical Image Segmentation on Synapse multi-organ CT: https://paperswithcode.com/sota/medical-image-segmentation-on-synapse-multi

Automated Cardiac Diagnosis Challenge (ACDC): https://www.creatis.insa-lyon.fr/Challenge/acdc/

Gland segmentation (GlaS): https://paperswithcode.com/dataset/glas

Nuclear segmentation: https://github.com/McGregorWwww/UCTransNet

BibTex

@InProceedings{Zhu_SelfRegUNet_MICCAI2024,
        author = { Zhu, Wenhui and Chen, Xiwen and Qiu, Peijie and Farazi, Mohammad and Sotiras, Aristeidis and Razi, Abolfazl and Wang, Yalin},
        title = { { SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper investigates the patterns learned by UNet for medical image segmentation and identifies two factors affecting its performance: (i) irrelevant feature learning due to asymmetric supervision, and (ii) feature redundancy in the feature map. To address these issues, the paper proposes balancing supervision between the encoder and decoder and reducing redundant information. This is achieved by leveraging feature distillation from the decoder’s last layer to provide additional supervision to other blocks. While the main advantage lies in the plug-and-play capability of the module for standard UNet models, its effectiveness on enhanced architectures remains unclear.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper analytically shows that: (i) Redundant features exist in the feature channel, with the shallow channels exhibiting more diversity than deep channels in a feature map; (ii) Asymmetric supervision between the encoder and the decoder in a UNet leads to semantic loss. The paper proposes the semantic consistency regularization (SCR) to balance the supervision between the encoder and the decoder.
    2. Interpretability of the UNet model is enhanced by analyzing the gradient-weighted class activation mapping (Grad-CAM) and performing similarity analysis in a feature map.
    3. Well-written and motivated.
    4. Ablation study.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experiments are limited to the U-Net and Swin-UNet models and have not been tested on more complex models.
    2. The improvement in the model may stem from an inefficient design of the base model, and the approach does not appear to be sufficiently general.
    3. Evaluation against SOTA methods could provide more insight into the limitations and advantages of the method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Some insight into the performance drop for the spleen organ using both the UNet and Swin-UNet strategies would be useful.
    2. How do you confirm the generalization of this strategy to other network architectures, specifically recent Mamba-based models?
    3. Is feature recalibration also clinically effective and does it provide more insight into the decision-making process?
    4. I would like to see the method’s effect on more complex structures like HiFormer or a more recent work from the same authors [1].

    [1] Azad et al., “Beyond Self-Attention: Deformable Large Kernel Attention for Medical Image Segmentation,” WACV, 2024.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation behind the regularization strategy is commendable, yet it requires deeper exploration into its effectiveness. Additionally, it’s worth noting that this approach may encounter limitations when applied to more complex architectures, proving effective primarily in simpler methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors introduce an approach to address the issues of asymmetric supervision and feature redundancy in UNet-based medical image segmentation. The method focuses on optimizing the loss functions by incorporating semantic consistency regularization and internal feature distillation. The authors validate the method on four medical image segmentation tasks in a 2D setting, outperforming some baselines with their improved plug-and-play design.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Simplicity: Instead of introducing complex structures with more parameters, the authors introduced a novel approach by optimizing loss functions specifically for UNet architecture with semantic consistency regularization and internal feature distillation techniques.
    • Extendability: The method can be applied to already existing U-Net segmentation architectures (at least in 2D settings) with minor changes.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Limited Discussion on the 3D Scenario: The authors have developed and evaluated their method exclusively within 2D settings for medical image segmentation. Considering the significant relevance of 3D imaging in clinical practice—for instance, in volumetric analysis, which is crucial for accurate diagnosis and treatment planning—this limitation might restrict the applicability of the proposed methods. Readers may find the absence of discussion around extending this approach to 3D scenarios a notable omission.
    • Overparameterization and Feature Redundancy: Although the paper aims to address feature redundancy, the proposed models, including CNNs and ViTs integrated within UNet structures, are inherently complex and potentially overparameterized. This could lead to inefficiencies in computation and model training, particularly if not managed well within the scope of practical deployment scenarios.
    • Regarding the State of the Art (SOTA): While the authors describe the methods used for comparison as state-of-the-art, this characterization may not provide a complete picture. For instance, on the Synapse dataset, there are other methods that have achieved significantly better performance than the ones discussed in this paper ( see https://paperswithcode.com/sota/medical-image-segmentation-on-synapse-multi). It would be more accurate to refer to the methods used for comparison as baseline methods to prevent any potential misunderstanding or misrepresentation of their comparative performance.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The paper focuses exclusively on 2D medical image segmentation. Considering the clinical importance of 3D imaging, it would be valuable to discuss the feasibility and potential adaptations of the proposed methods for 3D scenarios.

    • Re-evaluate the terminology used to describe the compared methods. If these are not the top-performing models on benchmark datasets, consider labeling them as baseline or representative models instead of SOTA. This would maintain the paper’s credibility and provide readers with a clear understanding of where the proposed method stands in relation to the current best practices.

    • Some minor changes may help: make sure to use consistent descriptions, e.g., “l-th layer of the m-th” vs. “the ith layer of the mth block”. Do more proofreading, e.g., on page 8: “our Unet and swinUnet”, “UCTTransNet”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents novel methods and demonstrates potential improvements in medical image segmentation, the weaknesses, as currently outlined, slightly outweigh the strengths.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    We still believe that comparing with 3D-based methods is important and needs to be included when considering these relatively simple datasets. I will not change my rating.



Review #3

  • Please describe the contribution of the paper

    The paper proposes additional loss functions for training U-Net-like architectures that can help reduce feature redundancy and semantic information loss. The paper shows that adding these losses can significantly improve results on various datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Two loss functions defined that quantify the amount of redundancy in the features and semantic information loss.

    2) Extensive experimentation using four public datasets

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) No major weaknesses. It is advisable to add ablation experiments with lambda1 and lambda2 set to 0 one at a time. That will show the effectiveness of each loss separately.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Authors should specifically mention that the code will be released in the final version.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Please add ablations with the losses used separately. More specifically, there should be tables with lambda1 = 0 and lambda2 nonzero, and vice versa.

    2) Kindly comment more on the claim that the encoders are looking at uninteresting regions while the decoders are looking at regions of interest. Why should this be so? From Figure 1b, in column 8 the decoder has high attention to the background, and in column 4 the encoder has more attention to the regions of interest. Please clarify this.

    3) Does a similar pattern of the encoder and decoder hold for different backbones of the UNet? It would help to add some ablation experiments that show the effect of the UNet backbone on the results.

    4) From the similarity heatmap shown in Fig. 1c, it seems that some features are more redundant than others. In such scenarios, would it be better to choose features through a probability distribution based on this, instead of the uniform sampling suggested in Eq. 1?

    5) Please correct minor typos in the paper. For example, in Eq. 1, I believe it should be RSC and not RCS.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper overall is well written and presents a thoughtful analysis of UNet and its learned patterns. There are a few open questions and required ablations for completion, but the merits outweigh them. However, I request the authors to address the important comments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Some of my questions were answered. While the method is applicable to all UNet-like architectures, the analysis of these is not sufficient in the paper, as also pointed out by other reviewers. There is no comment on the sampling policy either. Hence, I won’t be raising my rating and retain my original rating.




Author Feedback

Reviewer #3: Reviewer #3 raises questions about generalizability. While suggesting evaluation on Mamba models, they also question the performance drop for the spleen organ in our tests.

  • Our method is designed for most mainstream CNN/ViT-based UNet models, such as SwinUNet, which have an encoder-decoder architecture with skip connections. We validate the proposed method using the standard CNN-UNet and SwinUNet due to their popularity and success. This simple yet effective method enhances the performance of UNets with different configurations. Our method is compatible with the entire UNet family, excluding unusual and rare architectures like HiFormer, which comprises an encoder and a simple segmentation head with no decoder or skip connections. We will discuss and further explore this matter in the final manuscript.

  • Comparison with Mamba: We recognize Mamba-based UNet as an important emerging alternative to CNN/ViT-based UNets. The proposed method is a generic solution that can be easily extended to any U-shaped network, including Mamba-based UNets (we will include this in our final version). For the demonstration of the methodology, we have compared with recently established and peer-reviewed baselines [8,17,23]. In addition, as of our MICCAI submission deadline, no Mamba-based vision models had been accepted by prestigious journals or conferences.

  • Performance drop for the spleen organ: We conjecture that the primary reason is the inadequate localization of the spleen (see supplementary). This may also be attributed to the heterogeneous appearances of organs.

  • Feature recalibration: We believe the feature recalibration, which further enhances the segmentation performance of existing models across multiple datasets, can potentially benefit the monitoring of disease progression and streamline image-driven analysis.

Reviewer #4: The main questions include the extension of the method from the 2D domain to 3D, the overparameterization of DL methods (like CNN/ViT), and the inappropriate usage of the term “SOTA”.

  • Limited Discussion on the 3D Scenario: Our work aligns with HiFormer and UCTransNet, which focus purely on 2D medical image segmentation. We do not envision any obstacles in extending the proposed method to 3D models if they satisfy similar conditions (see the generalizability response to Reviewer #3). However, overemphasizing 3D would overshadow the significant contributions made in the 2D context.

  • Overparameterization and Feature Redundancy: Our proposed method builds on these baseline methods due to their success and popularity in medical image analysis. Using deep learning methods with this level of complexity is a common practice accepted by the community, and our method only introduces negligible extra compute cost; therefore, the “inefficiencies” are not caused by our method (please refer to the details of the two proposed losses).

  • “SOTA”: We will use the more accurate term “representative methods” in our final version, as suggested.

Reviewer #5: Positive feedback on our main contributions, followed by requests for further clarification on the ablation analysis and Fig. 1.

  • Please refer to Fig. 5 (c and d) in the manuscript, where we have performed an ablation study on the presence/absence of these two losses.

  • The main issue is that semantic information from skip connections is harmful. Features from the encoder (e.g., E2, E4, B), when passed through skip connections to the decoder, cause semantic damage (as detailed in our supplementary material and observed in UCTransNet).

  • As mentioned in our response to Reviewer #3, the proposed method is applicable to all standard UNet architectures.

  • The similarity matrix in Fig. 1 (c) shows the similarity between features in deep and shallow channels, not among all features. Thus, we can only conclude that shallow features are more diverse than deeper features.

  • We will correct the typos in the final version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed ratings, with two reviewers rejecting the work. R3 did not participate in submitting a final score. R3’s concerns were about generalizability and how the method works on Mamba models and some other recent works. I feel this is not a valid point, as Mamba is very new and the paper really focuses on CNN/transformer architectures. The method in the paper is introduced as a simple addition to existing architectures, but the reviewer simply asks to apply it to super complicated architectures like HiFormer. I do not think there are any reasonable arguments from R3 to reject the paper.

    R4 rejects the paper stating that it does not do any experiments on 3D images. While there are a lot of 3D medical images, one must understand that more than 50% of the data in medical imaging is still 2D (with X-rays, a 2D modality, being the most widely used). So, it is unfair to simply reject the paper for that.

    Overall, I think this paper offers a neat addition to existing methods to improve performance. The intuition and the methodology seem to make a lot of sense to me. I think it is a worthy paper to discuss at MICCAI as it has contributions to present to the broader research community. So, I recommend accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The first reviewer did not read the authors’ rebuttal in its entirety before giving a final comment, and it is clear that the authors’ reply addresses these problems. Another reviewer gave a positive score both before and after the rebuttal. The last reviewer felt that the authors did not compare with 3D models and therefore gave an insufficient reason for rejection; this is obviously asking too much of the 2D work that this article focuses on.

    In summary, it is recommended to accept this work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



