Abstract

Multi-organ segmentation is a routine clinical task, and automated organ segmentation tools can dramatically streamline radiologists' workflows. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish this task. However, training segmentation networks requires large amounts of manually annotated data, which is a major concern given the scarcity of clinical data. Working with limited data remains common in research on novel imaging modalities. To enhance the effectiveness of DL models trained with limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, e.g. generating images with different orientations and intensity distributions. In contrast, inter-image and object-level DA operations can create new images by combining content from separate individuals. However, such DA strategies are not well explored for the task of multi-organ segmentation. In this paper, we investigate four inter-image DA strategies, CutMix, CarveMix, ObjectAug and AnatoMix, on two organ segmentation datasets. The results show that CutMix, CarveMix and AnatoMix improve the average Dice score by 4.9, 2.0 and 1.9 points, respectively, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. Our experiments reveal that CutMix is a simple yet robust DA strategy that drives up segmentation performance for multi-organ segmentation, even though it produces intuitively ‘wrong’ images. We release our implementation as a DA toolkit for multi-organ segmentation on GitHub for future benchmarks.
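
For illustration, the sketch below shows the core CutMix operation adapted to segmentation volumes: a random cuboid is cut from one labeled volume and pasted into another, with image and label mixed together. This is a minimal NumPy sketch written for this page (the function name, the Beta(1, 1) ratio and the array shapes are assumptions), not the toolkit's actual implementation.

    import numpy as np

    def cutmix_segmentation(img_a, lbl_a, img_b, lbl_b, rng=None):
        """Paste a random cuboid from volume B into volume A.

        Assumes both (image, label) pairs share the same shape, e.g. (D, H, W).
        """
        rng = rng or np.random.default_rng()
        img, lbl = img_a.copy(), lbl_a.copy()
        # Draw the volume fraction to replace from a Beta distribution,
        # as in the original CutMix formulation.
        lam = rng.beta(1.0, 1.0)
        # Edge lengths scale with the cube root so the cuboid occupies
        # roughly a fraction lam of the volume.
        size = [max(1, int(s * lam ** (1.0 / 3.0))) for s in img.shape]
        start = [rng.integers(0, s - c + 1) for s, c in zip(img.shape, size)]
        region = tuple(slice(st, st + c) for st, c in zip(start, size))
        # Copy voxels and labels together so the mixed label map stays
        # consistent with the mixed image; this is the segmentation
        # analogue of mixing one-hot labels in classification CutMix.
        img[region] = img_b[region]
        lbl[region] = lbl_b[region]
        return img, lbl

Note that the pasted cuboid may cut organs in half, producing anatomically ‘wrong’ training samples, which is exactly the behavior the paper finds to be effective nonetheless.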

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0674_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0674_supp.pdf

Link to the Code Repository

https://github.com/Rebooorn/mosDAtoolkit

Link to the Dataset(s)

https://zenodo.org/records/7262581

BibTex

@InProceedings{Liu_Cut_MICCAI2024,
        author = { Liu, Chang and Fan, Fuxin and Schwarz, Annette and Maier, Andreas},
        title = { { Cut to the Mix: Simple Data Augmentation Outperforms Elaborate Ones in Limited Organ Segmentation Datasets } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    Authors compare four recent mix-based data augmentation methods, CutMix, CarveMix, ObjectAug, and AnatoMix, for multi-organ segmentation. They use nnUNet with and without traditional data augmentation added on top. They state that they will release their implementation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Good literature review and clear writing. Nice illustrations and images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors do not make an intellectual or novel contribution. The paper is a comparison and re-implementation of four recent mix-based augmentation methods. The authors reassess these methods and confirm that they essentially work well.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Take inspiration from these data augmentation methods and propose a new method. You have all the benchmarked results and the dataset. Take time to develop something novel and something that works.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No novel contribution. Reimplementation of previous works for assessment.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I do not believe that a paper whose main message is that someone else’s method outperforms other people’s methods (despite its simplicity) is a strong candidate for acceptance at a conference such as MICCAI. Countless similar papers could be written in the same vein, e.g. comparing transformers and CNNs on a given task like small lesion segmentation, or GANs and diffusion models for large lesion generation, etc.

    Furthermore, given the lack of detailed hyperparameters, and the high sensitivity of such experiments to hyperparameters and datasets, experiments can be designed such that ‘elaborate’ methods outperform simple ones.

    The main outcome of the experiments is that CutMix, a simpler augmentation method, outperforms elaborate ones. However, the experimental results are very close; in some cases CutMix is only 0.1% DSC ahead of other methods. Without a statistical significance test such as a paired Student’s t-test, the point that simple augmentations are better than elaborate ones is not proven.



Review #2

  • Please describe the contribution of the paper

    This paper investigates multiple strategies for data augmentation in the context of medical image segmentation. More precisely, several methods (CutMix, CarveMix, ObjectAug, AnatoMix) for generating new images as a mix of two training samples are compared. The main contribution of the paper is the counterintuitive finding that CutMix, the simplest approach and the one that does not try to preserve anatomical plausibility, is also the most effective one.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The topic of data augmentation for image segmentation is very relevant due to the typically low number of scans in medical datasets.
    • The finding of the paper is interesting and could spark some discussions, especially because it questions the importance of anatomical correctness of the training data. This means that more original data augmentation methods could be considered and have a positive impact.
    • The message is simple, the paper is generally clear, and the figures are useful.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The evaluation focuses on CT scans, a very standardized modality in which multi-organ segmentation is not really a challenge (see TotalSegmentator, for instance). Using a more challenging modality (MR, ultrasound, etc.) would have been more convincing on the one hand and, on the other hand, interesting for assessing how general the findings are.
    • More details or experiments on the choice of hyperparameters for the different augmentation methods would have been useful to make sure the comparison is fair.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • Source code is claimed to be released upon acceptance but not available yet.
    • One of the datasets is public; the other one is private.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Questions:

    • In CutMix, how do you select the right part of the image? Do you assume that the images roughly have the same field of view? (imagine a full-body scan vs a thoracic scan)
    • How many organs were impacted by each data augmentation strategy?
    • In general, how were the hyperparameters selected?
    • The performance of ObjectAug in Table 2 is surprisingly bad. Can you comment on that?

    Comments:

    • It would be nice to run some statistical tests on the numbers reported in the table (for instance a Wilcoxon signed-rank test; see the sketch after this list), or at least to report some confidence intervals.
    • Is there a difference in computation time between the different strategies? This could be a further argument for the simpler methods.
    • The description of CarveMix in Section 2 is not very clear. Also, it might make sense to add the citation there as well.
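
    As a concrete illustration of the suggested test, here is a minimal sketch using SciPy; the per-case Dice scores below are hypothetical placeholders, not numbers from the paper.

        from scipy.stats import wilcoxon

        # Hypothetical per-case Dice scores for two DA strategies,
        # paired by test case (replace with the actual per-case results).
        dice_cutmix   = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
        dice_baseline = [0.89, 0.88, 0.91, 0.88, 0.86, 0.90]

        # Wilcoxon signed-rank test on the paired differences; a small
        # p-value suggests the difference between strategies is systematic.
        stat, p_value = wilcoxon(dice_cutmix, dice_baseline)
        print(f"W = {stat:.3f}, p = {p_value:.4f}")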

    Minor remarks:

    • In the abstract/introduction, you could quickly mention the dataset you are using for your experiments.
    • ClassMix and ComplexMix are mentioned in the introduction but they are not described (only much later)
    • Figure 1: can you explain the meaning of the colors? Also, the shades of pink are so similar that they are hard to tell apart.
    • “CutMix is originally proposed” -> “was”
    • “two images are fused to a new classification label” is not very clear
    • Table 1: maybe for clarity/consistency, it would make sense to invert the sentences “cause broken organs”/”have artificial voxels”
    • “4 Nvidia A100 GPU” -> what is the point of this info, if you don’t report computation times/memory requirements?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper does not introduce a novel method, it is very interesting because its conclusion is surprising: a simple approach outperforms more complex ones. In particular, it challenges the idea that anatomical plausibility is very important in the field of medical imaging. That makes me want to understand why this happens and to conduct further experiments, so I think it could be of interest to the community. Finally, I believe such papers are welcome because the community should find a balance between technical innovation and simplicity.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    After considering the other reviews, I haven’t seen a reason to downgrade my recommendation so I leave it as Accept.



Review #3

  • Please describe the contribution of the paper

    The paper explores the effectiveness of data augmentation (DA) techniques for improving multi-organ segmentation in medical imaging, especially when training data is limited. It introduces and tests four inter-image and object-level DA strategies (CutMix, CarveMix, ObjectAug, and AnatoMix) and validates them on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1- Assessment of different data augmentation (DA) methods, including CutMix, CarveMix, ObjectAug, and AnatoMix, for multi-organ segmentation.

    2- Effective Use of Data: It utilizes limited datasets for training by implementing these data augmentation strategies.

    3- Evaluation and Comparison: The paper provides a comparison of how the different DA strategies and traditional DA affect segmentation performance on two datasets (AMOS and DECT). This includes specific improvements in both micro- and macro-averaged Dice similarity coefficients (DSC). The proposed combination of traditional DA and the aforementioned strategies could enable an important clinical application: improving multi-organ segmentation in cases with limited annotated datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1- The paper should discuss whether the utilized strategies could lead to overfitting, especially when augmentation parameters are not optimally chosen or when the same augmented examples are repeatedly used during training.

    2- The paper does not discuss the trade-offs between computational cost and performance gain, especially given the level of improvements shown.

    3- While DSC is a standard metric, relying solely on it may not fully capture the practical utility of the segmented outputs, such as their use in clinical diagnostics where precision in critical regions is more important.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1- Please consider exploring combinations of DA methods and their impact on performance in future work.

    2- Please consider adding other evaluation metrics besides Dice.

    3- Please discuss how you address overfitting and the potential impacts of the experiments on it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper explores various data augmentation methods, including CutMix, CarveMix, ObjectAug, and AnatoMix, for multi-organ segmentation. It demonstrates how these techniques can make the most of limited datasets for training, which is crucial in scenarios with scarce annotated data. By comparing these strategies with traditional methods and highlighting their impact on segmentation performance through detailed evaluations on the AMOS and DECT datasets, the paper substantiates improvements in both micro- and macro-averaged Dice metrics.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Based on the authors’ detailed rebuttal, I find their work to be a meaningful contribution to the field of medical image segmentation. The evaluation of data augmentation (DA) strategies, particularly the effectiveness of the simple CutMix method over more complex ones, is valuable. Their clarification on computational costs and hyperparameter choices adds necessary transparency, and their approach to addressing overfitting seems practical.




Author Feedback

First of all, we thank all the reviewers (R1, R4 and R5) for the valuable and useful comments, and we appreciate the effort of the area chair (AC) and meta-reviewer (MR) in granting us the chance for this rebuttal. We are encouraged by the positive feedback from the reviewers regarding our overall organization and figures, as well as our critical reflection on the simplicity and novelty of data augmentation (DA) strategies for medical image segmentation. We hope that this reflection will be beneficial to the community in researching DA for segmentation.

As implied by R1 and R4, our contribution lies in evaluating and discussing the design of effective DA strategies, rather than proposing yet another DA strategy. The results in our paper show that the counterintuitive but simple DA strategy, CutMix, outperforms more complex strategies specially designed for medical image segmentation, i.e. ObjectAug, CarveMix and AnatoMix. As noted by R1, our research may inspire discussion on future designs of DA strategies, since dedicated strategies aiming for anatomical correctness may fail to improve segmentation performance. We hope to draw the community's attention to the fact that the effectiveness of DA strategies does not necessarily depend on methodological complexity. Here we share the opinion of R4 that this finding can benefit the search for DA with actual value for medical segmentation systems.

In addition, we would like to take this chance to clarify some common concerns from the reviews.

First of all, R1 and R4 commented on the computation cost of each DA strategy in our experiments. We want to add that CutMix takes on average 0.3 s to generate one output image, whereas CarveMix takes 15.7 s, AnatoMix 20.9 s and ObjectAug 40.4 s on the same device. Because CutMix only combines two images by region of interest (ROI), it is much faster than the other methods, which rely on slow operations such as background in-painting or object rotation. Considering both computation cost and segmentation performance, CutMix turns out to be the best-performing DA strategy.

Secondly, R1 raised concerns about the choice of hyperparameters for each DA strategy. In our experiments, the hyperparameters of AnatoMix and ObjectAug were optimized on a subset of the training data, but those of CutMix and CarveMix were not: CarveMix has no hyperparameters to optimize, and for CutMix the default settings from the original paper already led to a large margin over the other DA strategies, so it was not tuned further.

Thirdly, R4 commented on overfitting due to the limited original datasets. We would argue that this is precisely the research question addressed in our paper: which simple but useful DA strategies can counter overfitting and address data scarcity in typical medical image segmentation? In addition to the applied DA strategies, we rely on the regularization methods implemented in the nnUNet framework, for example ensembling predictions from cross-validation models and saving the best model during validation as output.

Lastly, R5 raised concerns about the lack of novelty. We want to emphasize once more that the novelty of this paper lies in the finding that preserving anatomical correctness is not a necessary prerequisite for DA in medical image segmentation. As also noted by R1, this was not known before and could widen the range of available DA methods.

Overall, we are thankful for all the constructive suggestions, such as extending to further segmentation tasks and including more evaluation metrics, which will help improve our research. We believe, like R1, that the community should find a balance between technical innovation and simplicity.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a paper with divergent review scores. The contribution of this paper is that it provides a benchmark study for four different mix-based augmentation methods. The main critical comment is that there is no clear methodological innovation over existing data augmentation techniques. After considering comments from both sides, the rebuttal, as well as ranking against other rebuttal papers, this paper might not be of high enough priority in my batch of papers.

    Another minor comment is that I am not quite sure some mix-based augmentation methods are compared in a fair manner. For example, CarveMix was specifically designed for augmenting lesion images, as the lesions can be carved from patient images and mixed with normal images in an anatomically plausible way. In this paper, CarveMix was applied to abdominal MR images to cut an entire organ and mixed with the image, leading to anatomically implausible images with overlapping organs. Perhaps these methods could be better evaluated in a way they were designed for.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received mixed scores from the reviewers. The novelty of the paper lies in its findings rather than its methodology. Some reviewers rejected the paper solely because of the lack of novelty in the method, while MICCAI has emphasized that the “lack of a novel method” should not be the only reason to reject a paper. After carefully reviewing the paper and the rebuttal, I agree with Reviewer 1 that accepting this paper is important, as its findings are interesting. The author(s) compared four different data augmentation (DA) strategies and showed that preserving anatomical correctness is not a necessary prerequisite for DA for medical image segmentation. More importantly, it was shown that a simple method outperformed the complex techniques. In recent years, the concept of novelty has been misinterpreted as method complexity, which creates a bias toward papers introducing a complex method with non-significant performance improvements over simple methods. I believe that such papers strike a balance between technical innovation and simplicity.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I side with meta-reviewer #5: innovation is not just novel methodology but can also be an appropriately designed comparison study that enables novel and important conclusions about the appropriateness and performance of different methods. I also find that this manuscript offers some interesting conclusions: simple, even anatomically incorrect, data augmentation can help with training for specific tasks.



