Abstract

In this work, we introduce Progressive Growing of Patch Size, a resource-efficient implicit curriculum learning approach for dense prediction tasks. Our curriculum approach is defined by growing the patch size during model training, which gradually increases the task’s difficulty. We integrated our curriculum into the nnU-Net framework and evaluated the methodology on all 10 tasks of the Medical Segmentation Decathlon (MSD). With our approach, we are able to substantially reduce the runtime, computational costs, and CO2 emissions of network training compared to classical constant patch size training. In our experiments, the curriculum approach resulted in improved convergence. We are able to outperform standard nnU-Net training, which uses a constant patch size, in terms of Dice Score on 7 out of 10 MSD tasks while only spending roughly 50% of the original training runtime. To the best of our knowledge, our Progressive Growing of Patch Size is the first successful employment of a sample-length curriculum in the form of patch size in the field of computer vision. Our code is publicly available at https://github.com/compai-lab/2024-miccai-fischer.
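
A minimal sketch of the core idea follows (an illustration only, not the authors' nnU-Net integration; the schedule values and the random_patch helper are hypothetical): training proceeds in stages, and each stage samples larger patches than the previous one.

    import numpy as np

    def random_patch(volume, patch_size, rng):
        # Crop a random patch of the given size from a 3D volume (hypothetical helper).
        starts = [rng.integers(0, s - p + 1) for s, p in zip(volume.shape, patch_size)]
        return volume[tuple(slice(st, st + p) for st, p in zip(starts, patch_size))]

    # Hypothetical small-to-large schedule; later stages use larger patches.
    schedule = [(32, 32, 8), (64, 64, 16), (96, 96, 32), (128, 128, 64)]
    epochs = 16
    rng = np.random.default_rng(0)
    volume = rng.random((160, 160, 80))  # placeholder for a preprocessed CT volume

    for epoch in range(epochs):
        stage = min(epoch * len(schedule) // epochs, len(schedule) - 1)
        patch = random_patch(volume, schedule[stage], rng)
        # ... forward/backward pass on `patch` would go here ...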

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2008_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2008_supp.pdf

Link to the Code Repository

https://github.com/compai-lab/2024-miccai-fischer

Link to the Dataset(s)

http://medicaldecathlon.com/

BibTex

@InProceedings{Fis_Progressive_MICCAI2024,
        author = { Fischer, Stefan M. and Felsner, Lina and Osuala, Richard and Kiechle, Johannes and Lang, Daniel M. and Peeken, Jan C. and Schnabel, Julia A.},
        title = { { Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work aims to reduce computational costs and CO2 emissions of the nnUNet framework by progressively increasing the training patch size based on the idea of curriculum learning. The authors show that with this approach, energy consumption can be reduced by approximately 50% while accuracy stays the same.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • There is a real need for computationally efficient novel methods. Adapting currently well-proven methods such as the nnUNet is a good first step to simplify the necessary transition.
    • The authors ran their adapted nnUNet across all 10 Medical Segmentation Decathlon tasks and showed that accuracy is identical to the basic nnUNet, with small improvements for the Colon and Lung tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Claiming that the adapted nnUNet is better on 7/10 datasets is a stretch. Apart from the Colon dataset, where the difference is 4%, and the Lung dataset, with 1.1%, the difference on all other datasets is between 0 and 0.5%.
    • The selection of the different training stages and scales seems arbitrary and needs to be better motivated. The increase is not linear as stated, but alternates between different axes.
    • The paper is missing relevant baselines that reduce computational resources. For example, the authors mention the work “A NEW THREE-STAGE CURRICULUM LEARNING APPROACH TO DEEP NETWORK BASED LIVER TUMOR SEGMENTATION”, yet they do not compare against it as a baseline.
    • The novelty is very limited. Increasing difficulty by looking at different parts and scales of the image has been done before. The main novelty of this paper lies in its implementation into the nnUNet framework.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • Since the code is available and the method is well described, I do not see any issue with reproducibility.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The Discussion and Conclusion section should be split for clarity. It is also rather uncommon to end a paper with a citation.
    • Every section should start with an introductory phrase and not immediately start with a subheading.
    • Citations should be sorted by their reference number.
    • Tables should show arrows indicating which direction of each metric is beneficial.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper does approach an important problem we are facing, namely the reduction of the carbon footprint of machine learning models, the novelty is mostly limited to the implementation of said algorithm into the nnUNet framework. In addition, the evaluation is very limited and does not compare against other methods for reducing the carbon footprint.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    My initial criticism remains. The manuscript has no major flaws, but the novelty is very limited, so I maintain my rating.



Review #2

  • Please describe the contribution of the paper

    The authors propose a curriculum learning method for dense-prediction tasks that progressively increases the patch size during training. They implemented their curriculum within the nnU-Net framework and compared it to (1) standard constant patch size nnU-Net training and (2) a random patch-size curriculum learning approach on the 10 tasks of the MSD challenge. Their approach reduces the training time and carbon footprint (CO2), and even outperforms the baselines (1)-(2) on 7 out of 10 MSD tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Very relevant motivation: Reducing CO2 emissions and efficient training are very relevant topics in the training of deep learning models, and applying curriculum learning to solve this problem is an underresearched domain.

    (2) General approach: The authors’ proposed curriculum learning is both task-agnostic and model-architecture-agnostic, e.g., it can also be implemented for detection, classification, and with architectures other than nnU-Net. The authors only test it for semantic segmentation with nnU-Net, but I can see how this method could be easily implemented for other tasks and models.

    (3) Good paper structure: The paper presents its ideas in a clear and easy-to-understand way. The experimental setup is well-written, and Fig. 2 and Fig. 3 depict the main contributions of their approach on the MSD challenge.

    (4) Great potential for resource-efficient learning: The authors demonstrate that training with their method, nnU-Net requires only 35.5% of the training amount and reduces the training time from 13.5h to 5.7h. This convinces me that their approach has great potential to reduce the carbon footprint while preserving the performance when training data-hungry models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Contradictory assumption: The authors assume that training on smaller patch sizes is a simpler task than training on the full images, as the prevalence of foreground pixels decreases in larger patches. However, they also state that “a larger patch size results in an increased global context that can be inferred from that patch”. Ding and Li [1] also discuss that larger context should help the model make consistent predictions, albeit they work with dermoscopy images, and they do the exact opposite of the authors! They order patch sizes in their curriculum from large-to-small and argue that larger patches are easier since they include more context. I think that this is a missing key comparison in the authors’ work. This comparison would confirm their fundamental assumption that small-to-large patch size ordering is the driving factor behind their results. I understand that I cannot request additional experiments for the rebuttal, but I see the lack of this comparison as something fundamental that cannot be overlooked.

    (2) Missing method details: The authors do not explain how exactly they change the patch size during training. Neither an equation nor a textual explanation is given of how the patch size is increased from one training stage to the next. In Supplemental Table 1, one can see that the patch size is changed by alternately increasing each input dimension, but this needs to be described in the main manuscript.

    [1] Ding, Weizhen, and Zhen Li. “Curriculum Consistency Learning and Multi-Scale Contrastive Constraint in Semi-Supervised Medical Image Segmentation.” Bioengineering 11.1 (2023): 10.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper seems to have good reproducibility, as the authors have also released an anonymized repository with the code. They have also used a publicly available dataset for their training and evaluation (MSD) and a well-established open-source model (nnU-Net).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Following my feedback regarding the weaknesses of the paper, here are more detailed comments:

    (1) While the small-to-large patch size ordering makes sense for a curriculum, I can also see how the large-to-small ordering makes sense, as the model sees more context. I believe this paper has great potential and relevance, but to confirm that the main contribution is the driving factor, this key experiment is missing from the manuscript. We can also see improvements with the random patch size curriculum (RPSS), which means that the driving factor may also be that the model is exposed to more diverse inputs. I understand that I cannot request additional experiments from the authors during the rebuttal, but I would be willing to let them explain why they omitted this comparison.

    (2) The missing methodology details, e.g., how the patch size is increased in each stage, are vital to understanding and reproducing the method. Such details also give insight into how the authors justify their implementation and are a must-have in the methodology section. I can deduce from the supplementary material that each patch dimension is increased alternately, but this belongs in the main manuscript in the form of an equation or a text description.

    The rest of my comments are just minor suggestions to improve the manuscript and have not influenced my decision.

    Minor comments:

    • Typo on page 1: deep learning based -> deep learning-based
    • Typo on page 2: not used yet -> not been used yet
    • Typo on page 2: curriculum on image size -> curriculum based on image size
    • Missing citation on page 2: into the nnU-Net [?]
    • Fig. 1: It is not clear what the black circle in the “Training” stage stands for. It might be interpreted as a concatenation of different patch sizes, which is not the case in this paper. I would advise to update the figure and make it more clear.
    • Typo on page 3: than full images or volumes -> than on full images or volumes
    • Typo on page 4: refers to 100% scenario -> refers to the 100% scenario
    • Typo on page 4: 50% of iterations per epoch -> 50% of the iterations per epoch
    • Typo in Caption of Fig. 2: while 100% represents -> where 100% represents
    • Typo on page 6: with increased batch size -> with an increased batch size
    • Typo on page 7: adjustment of batch size -> adjustment of the batch size
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper explores the relevant topic of resource-efficient learning and is generally well-written, the fundamental assumption that the small-to-large patch size ordering is the reason for the performance boost is unconfirmed. Since there is a published method [1] using the exact opposite curriculum (large-to-small), I see the missing comparison as a flaw that cannot be overlooked. Additionally, there are missing details in the methodology section (e.g. how the patch size changes), which should be included in the main manuscript. Hence, I opt for a weak reject with an option for the authors to explain in the rebuttal why they have not included this key comparison and their methodology details.

    [1] Ding, Weizhen, and Zhen Li. “Curriculum Consistency Learning and Multi-Scale Contrastive Constraint in Semi-Supervised Medical Image Segmentation.” Bioengineering 11.1 (2023): 10.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the author’s rebuttal, I have decided to increase my rating as they have addressed both of my major concerns:

    (1) Contradictory Assumption: Initially, I was concerned that the authors did not compare their work to a related published study (Ding and Li [1]), which follows an opposite approach. Ding and Li [1] order patch sizes from large-to-small, starting with global context and moving towards more localized features, whereas the authors use a small-to-large order. However, the authors clarified in their rebuttal that Ding and Li [1] focus on performance rather than optimization complexity. Since this work aims to reduce carbon emissions and improve training efficiency, I accept their justification for not comparing to this previous work and will increase my rating accordingly.

    (2) Missing Methodology Details: The authors have also explained why they initially omitted certain methodology details and have promised to include these in the updated manuscript.

    Given these responses, I have decided to increase my score. The authors have addressed my concerns, and their manuscript will be a valuable contribution to the MICCAI community, particularly because it tackles the important issue of reducing CO2 emissions from model training.



Review #3

  • Please describe the contribution of the paper

    The authors use the progressive growth of the patch size as a curriculum learning scheme for nnU-Net training and could outperform the vanilla nnU-Net on 7 of 10 tasks of the Medical Segmentation Decathlon, while only spending about 50% of the original nnU-Net training runtime. The rationale of the presented curriculum is that small patch sizes, with less context, are easier to segment; thus, training gets progressively harder as the patch size increases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The evaluation results of the approach speak for themselves. Using the basic idea of progressively increasing the patch size while training, the authors could halve training time. The paper’s idea is presented very clearly, the results are supported by a solid evaluation, and the code is released. I would also like to highlight the effort of the authors in providing CO2 equivalents for their network training. With the ever-increasing energy demands of current AI approaches, I believe that we, as a scientific community, have the duty to reflect and to act. Providing CO2 equivalents or energy requirements is a relevant step to build awareness and is also the basis for reduction. I would love to see many more groups in our community start adding such relevant measurements in their works.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some minor weaknesses of the paper are:

    • The authors only assume that the segmentation of smaller patches is easier. They could back up their statement with the work of Tappeiner et al. (2022), “Tackling the class imbalance problem of deep learning-based head and neck organ segmentation,” who show that smaller patch sizes reduce the within-patch class imbalance and therefore, depending on the segmentation task, can increase segmentation performance. The patch size curriculum likely combines the benefits of less class-imbalanced training with a large context window.
    • Without reading the code, it is not entirely clear how the smallest possible patch size is found; it depends on the U-Net architecture. Judging from the code, the authors use the architecture given by nnU-Net, but this is not made entirely clear in the paper.
    • The results of the fully trained nnU-Net with the Constant Patch Size Training Scheme are taken from the literature and were likely generated with nnU-Net v1 (the referenced paper is from 2019); however, the codebase used by the authors is based on nnU-Net v2, so the fully trained models are not directly comparable. Regarding training emission reduction, it is perfectly fine to take literature results, but it should be mentioned that they come from a different codebase.
    • It would be interesting to get an interpretation from the authors as to why, for the 50% training, Random Patch Size Sampling outperforms the other curricula.
    • Lastly, depending on the sample size, a t-test is not necessarily the correct test to conduct. In a multi-class segmentation task, the per-class segmentation results (e.g., the Dice score of each segmented organ) are in general only normally distributed if the number of samples is large enough. T-tests are potentially relevant if sample averages (all classes combined) are considered, which, however, ignore individual class results. Non-parametric tests like the Wilcoxon signed-rank test or, even better, a one-way ANOVA on the individual class results are most of the time much better suited to compare segmentation results (see the sketch after this list).
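
    To illustrate the suggested analysis, here is a minimal sketch (placeholder scores, not data from the paper) comparing paired per-case Dice scores with both a paired t-test and a Wilcoxon signed-rank test using SciPy:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        # Placeholder paired per-case Dice scores for the two training schemes.
        dice_constant = rng.uniform(0.70, 0.95, size=30)
        dice_curriculum = np.clip(dice_constant + rng.normal(0.005, 0.02, size=30), 0.0, 1.0)

        # The paired t-test assumes the per-case differences are normally distributed.
        t_stat, t_p = stats.ttest_rel(dice_curriculum, dice_constant)
        # The Wilcoxon signed-rank test drops that normality assumption.
        w_stat, w_p = stats.wilcoxon(dice_curriculum, dice_constant)
        print(f"paired t-test p={t_p:.3f}, Wilcoxon signed-rank p={w_p:.3f}")
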
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    According to the Reviewer Guidelines, the authors propose a cost-effective (frugal technology) approach to implementing an otherwise expensive CAI solution. The presented solution is simple to adopt and very well evaluated. Overall, it is a very good and meaningful contribution.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The simple, easy-to-adapt method with strong results, including a well-done evaluation, outweighs the minor weaknesses of the work.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I would agree with R5 that the work does not ultimately explain the rationale behind the strong performance of the presented learning curriculum, which should definitely be researched in future work. I also agree with R4 that the performance gains are marginal. However, I would still advocate for accepting the work, mainly due to the significant reduction in training time (and subsequent CO2 emissions) without any performance reduction in one of the most commonly used segmentation methods within the medical domain. Surely, the authors could have evaluated their work against other CO2 reduction methods or tested it for other tasks as well. Nonetheless, within the segmentation task the authors evaluated their work and demonstrated its benefit.

    The nnU-Net has become the de facto baseline model against which new segmentation methods are tested. I argue that if the maintainers of the nnU-Net repository were to adopt the presented method - after extensive testing, including on the TotalSegmentator dataset - it could save researchers a substantial amount of training time and, importantly, a significant amount of CO2 emissions.




Author Feedback

We would like to thank reviewers R1, R4, and R5 for their positive assessment and thoughtful comments. The motivation for resource-efficient model training is strongly supported [R1,R4,R5]. We apply curriculum learning for resource efficiency, which is an “underresearched domain” [R5]. The significantly increased efficiency and segmentation performance of our method compared to vanilla nnUNet training are acknowledged [R1,R4,R5], underlining its “great potential and relevance” [R5]. The simplicity as well as the task- and model-architecture agnosticism are also appreciated [R1,R5]. We are pleased that the “results are convincing” [R5] and thank R1 for appreciating our well-done evaluation of the approach on all 10 MSD tasks. Below we address the reviewers’ main concerns.

[R1,R4,R5] Missing method details: We thank the reviewers for pointing out the missing clarity of the method description. In Subsection 2.1 we explained the schedule briefly. As the curriculum can be applied to different architectures (it is model-agnostic), we kept the formulation general, as there is no one-size-fits-all equation for possible patch sizes. In the case of nnUNet, the minimal patch size and its increments depend on the number of pooling layers. To keep the task similarity maximal, we increase the patch size in minimal steps. We will add this to the main paper for clarity.
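
A minimal sketch of one way such a schedule could be constructed (an assumption based on this description and the supplementary material, not the authors' exact code): the smallest valid patch has each axis divisible by 2**n_pool for that axis, and one axis is grown per stage in the smallest valid step, alternating between axes.

    def patch_size_schedule(final_patch, n_pool_per_axis):
        # Smallest valid increment per axis: the patch must stay divisible
        # by 2**n_pool along each axis so all pooling steps remain possible.
        step = [2 ** p for p in n_pool_per_axis]
        patch = list(step)  # start from the smallest valid patch
        schedule = [tuple(patch)]
        axis = 0
        # Assumes final_patch is at least the minimal patch size on every axis.
        while tuple(patch) != tuple(final_patch):
            if patch[axis] < final_patch[axis]:
                patch[axis] = min(patch[axis] + step[axis], final_patch[axis])
                schedule.append(tuple(patch))
            axis = (axis + 1) % len(patch)  # alternate between axes
        return schedule

    # Hypothetical example: 5/5/3 poolings and a final patch size of 128x128x64.
    print(patch_size_schedule((128, 128, 64), (5, 5, 3)))

The per-stage batch size adjustment mentioned in the reviews is omitted from this sketch.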

[R5] Contradictory assumption: We agree with R5 that, with reference to the maximization of achievable performance, larger patch sizes, which include more global context, correspond to a lower difficulty. However, this is not the case in terms of optimization complexity. A trivial example would be a patch of one foreground and one background pixel, for which the model would only need to learn a single threshold (low complexity). Following this intuition, we perform small-to-large patch size training. Another optimization complexity factor is added by the class balance: smaller patches result in better balance and can lead to improved performance, as shown by Tappeiner et al. in “Tackling the class imbalance problem of deep learning-based head and neck organ segmentation” (2022). In “Curriculum Consistency Learning and Multi-Scale Contrastive Constraint in Semi-Supervised Medical Image Segmentation” by Ding and Li (2024), the authors use patch size variation in a fundamentally different context. They perform consistency learning for unlabeled data on pairs of full images and image patches. Hence, their difficulty refers to the level of data augmentation: smaller patch sizes correspond to stronger augmentation, as achieving consistency is more difficult. Therefore, they also establish a simple-to-hard curriculum for their consistency loss optimization. Fine-tuning on patches missing global context is further from our model’s target domain (i.e., full images) and therefore not the most fitting assumption in our scenario. We would like to thank R5 for addressing this important ambiguity, and we will clarify it in the manuscript.

[R4] nnUNet as baseline: We did not consider comparing with “A New Three-Stage Curriculum Learning Approach for Deep Network Based Liver Tumor Segmentation” by Li et al. (2020), as their goal is to improve performance, not efficiency. Under the same training settings, their approach is less efficient, as it iterates over more voxels. Furthermore, their curriculum cannot be applied in a general fashion, as the bounding box of the largest foreground component is used as the patch size in their second stage, which can easily exceed GPU memory limits for larger components (e.g., organs). Additionally, it can only be applied to binary segmentation tasks. We chose nnUNet, the most widely applied patch-based segmentation approach, as our baseline; it uses constant patch size training.

We thank all reviewers for helpful comments on formal and grammatical aspects. We believe that all review comments will greatly contribute to an improved final manuscript.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Accepts



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors appropriately addressed some of the major concerns from the reviewers.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The authors appropriately addressed some of the major concerns from the reviewers.


