Abstract

The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP. Model soups average the weights of multiple fine-tuned models, aiming to improve performance on in-domain (ID) tasks and enhance robustness on out-of-distribution (OOD) datasets. However, applying model souping to the medical imaging domain is challenging and yields suboptimal performance, primarily due to differences in error surface characteristics that stem from data complexities such as heterogeneity, domain shift, class imbalance, and distributional shifts between training and testing phases. To address this, we propose a hierarchical merging approach that aggregates models locally and globally at various levels based on their hyperparameter configurations. Furthermore, to alleviate the need for training a large number of models during the hyperparameter search, we introduce a computationally efficient method that uses a cyclical learning rate scheduler to produce multiple models for aggregation in weight space. Our method demonstrates significant improvements over the model souping approach across multiple datasets (around a 6% gain on the HAM10000 and CheXpert datasets) while maintaining low computational costs for model generation and selection. Moreover, we achieve better results on OOD datasets than model soups. Code is available at https://github.com/BioMedIA-MBZUAI/FissionFusion.
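
As a rough illustration of the hierarchical souping idea described above, the sketch below averages PyTorch state dicts locally within groups that share a hyperparameter configuration, then averages the resulting local soups globally. The function names and grouping scheme are illustrative assumptions, not the authors' exact implementation.

```python
import copy
import torch

def average_state_dicts(state_dicts):
    """Uniform weight-space average of a list of model state dicts."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg

def hierarchical_soup(models_by_config):
    """Local aggregation per hyperparameter group, then global aggregation.

    models_by_config maps a hyperparameter configuration (e.g. a learning
    rate) to the list of fine-tuned state dicts trained under it.
    """
    local_soups = [average_state_dicts(group) for group in models_by_config.values()]
    return average_state_dicts(local_soups)
```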



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3154_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3154_supp.pdf

Link to the Code Repository

https://github.com/BioMedIA-MBZUAI/FissionFusion

Link to the Dataset(s)

https://www.kaggle.com/competitions/aptos2019-blindness-detection
https://challenge.isic-archive.com/data/#2018
https://stanfordmlgroup.github.io/competitions/chexpert/
https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data

BibTex

@InProceedings{San_FissionFusion_MICCAI2024,
        author = { Sanjeev, Santosh and Zhaksylyk, Nuren and Almakky, Ibrahim and Hashmi, Anees Ur Rehman and Qazi, Mohammad Areeb and Yaqub, Mohammad},
        title = { { FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a hierarchical merging technique that combines local and global aggregation of models at different levels, guided by their hyperparameter settings. Additionally, it presents a cyclical learning rate scheduler to generate multiple models for aggregation in weight space to mitigate the requirement for training numerous models during hyperparameter search. This method demonstrates improvements across various downstream tasks compared to baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation behind the paper is clear and significant. The paper is well written and easy to follow. The experimental evaluation is comprehensive, covering two natural and five medical image datasets, and demonstrates the effectiveness of the method. The analysis of the out-of-domain scenario is well incorporated and deserves further investigation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the paper offers a relatively detailed method description, the prospect of training multiple models may still seem daunting.

    It remains unclear why only the learning rate is chosen to generate the candidate models.

    In reference to [1], directly fine-tuning an ImageNet pretrained model can achieve an AUC of 88.16 on the CheXpert dataset, while the paper reports a maximum of 86.44. This raises questions as to whether employing a superior backbone model could simplify the training process and yield better performance instead of utilizing the proposed method.

    The paper asserts that the method maintains low computational costs; this claim would be strengthened by the inclusion of detailed statistics to support it.

    An explanation of why this method could be particularly advantageous for medical image tasks would enhance the paper’s credibility.

    The experimentation is limited to a single task (classification), and the paper would benefit from exploring additional tasks.

    Since the final performance appears to heavily depend on the source pretrained model, particularly the ImageNet pretrained model in this case, it would be intriguing to observe how the model behaves with a medical image pretrained model (either supervised or self-supervised).

    [1] Ma, DongAo, et al. “Benchmarking and boosting transformers for medical image classification.” MICCAI Workshop on Domain Adaptation and Representation Transfer. Cham: Springer Nature Switzerland, 2022.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Refer to weakness.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a hierarchical merging technique that integrates local and global aggregation of models at various levels, along with a cyclical learning rate scheduler to produce multiple models for aggregation, aiming to reduce the burden of training numerous models. Despite demonstrating effectiveness across different datasets, the process of training and aggregating multiple models may still be overwhelming. Moreover, several questions remain unanswered, including the rationale for exclusively tuning the learning rate, the decision to not employ a superior backbone model, and the claim of low computational costs. The impact of the work remains unjustified. As a result, based on the preliminary review, I am inclined towards a weak rejection.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Although I still think the method is overly complex, my main concerns have been addressed. Consequently, I will raise the score.



Review #2

  • Please describe the contribution of the paper

    This paper proposes FissionFusion, a novel framework that combines several fine-tuned models to improve ID and OOD performance for medical image analysis. FissionFusion introduces a fast geometric generation approach to efficiently generate models with minimal computational overhead. In addition, the author proposes hierarchical souping tailored for the medical image analysis domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written. The complexities in medical datasets are well-defined and explained. It is interesting to show the validation error differences between natural and medical datasets. The proposed fast geometric generation and hierarchical souping are novel.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The improvement between the proposed model and previous SOTAs is small. For example, on the APTOS dataset, the proposed GoG vs. SOTA Greedy Soup is 0.7172 vs. 0.7274, and on the RSNA dataset, the proposed GoU vs. SOTA Greedy Soup is 0.9545 vs. 0.9444.

    In addition, there is no mention of multiple runs or statistical significance analysis, which makes the small improvement meaningless.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors stated that they will publish the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) On page 7 and in Table 1, the authors state that CoG achieves around a 6% improvement in Recall on the HAM10000 dataset. How was this result calculated? The improvement appears to be only 2.04% (0.6818 − 0.6614).

    (2) In Table 1, why not show the same evaluation metric for each dataset? Is there a special reason for this?

    (3) In Table 1, the method comparison is confusing. Does the GoU model use GS+HS or FGG+HS? In the supplementary material, GoU is GS+HS.

    (4) The authors state that the proposed fast geometric generation approach efficiently generates models with minimal computational overhead. Could the authors provide a quantitative/qualitative efficiency and computation analysis compared with existing methods?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the topic of model soup for medical image analysis and the proposed modules are interesting, the results are not substantiated. The very slight improvement is not very meaningful.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have answered my main concerns. (1) Though their proposed method has similar performance to SOTAs on some datasets, it largely outperforms SOTAs on HAM10000 (6%). (2) I thank the authors for including the computational cost to corroborate their efficiency statement. (3) The topic of model soups for medical image analysis and the proposed modules are interesting. Therefore, I have decided to raise my score.



Review #3

  • Please describe the contribution of the paper

    The paper presents an alternative to the Model Soups method for training an ensemble of fine-tuned models. The contributions include (1) finding a diverse set of models through a cyclical learning rate strategy with heavy augmentations instead of exploring multiple combinations of learning rates, seeds and augmentations; and (2) a hierarchical merging strategy that aggregates models at local and global levels. The authors report results for CIFAR10, CIFAR100 and 5 medical classification datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes a highly efficient method for a very common situation, namely where one means to fine-tune a model in an ensemble fashion. Instead of performing an exhaustive search over hyperparameters to obtain a diverse ensemble, the authors utilize a cyclical learning rate strategy that covers multiple places in the solution space during the training trajectory. They also propose averaging models at different levels, which improves performance and robustness.
    • The visualization of loss surfaces in Figure 1 is great and helps support the statement of the authors on the differences between natural and medical image data. Figure 2 is similarly very helpful for the reader to understand the method.
    • The empirical evaluation observes two model architectures – a ResNet and a Transformer – showing the versatility of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Unexpectedly, the proposed method seems more suited to the natural image datasets than the medical ones (Table 1). This seems to go against the intuition presented in the introduction. The authors should elaborate on why this is the case. Additionally, it would be good to have a quantitative comparison of the computational cost in that table.
    • Heavy augmentation is used for testing the proposed approach, which may explain why the results are better than when augmentation is a varying factor in model generation. Additionally, no augmentations specifically designed for medical imaging datasets are used.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I have no additional reproducibility concerns.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The authors should include part of the theoretic explanation for focusing on the learning rate in the main manuscript.
    • I would suggest the authors better explain/cite the Fission process.
    • In Figure 1, I suggest making the markers slightly larger, and perhaps explaining the differences in loss surfaces directly in the caption.
    • Figure 3 is a bit difficult to grasp, and not as high in quality as the other figures.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose a highly efficient strategy for a common problem. The paper is easy to follow, and the empirical evaluation is sensible. I therefore recommend acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I maintain my initial decision on accepting this paper.




Author Feedback

We thank the reviewers (R1, R4, R6) for their valuable insights highlighting the technical novelty, clarity, and thoroughness of our work. Below, we address their comments to enhance our paper.

[R1, R4] LR selection: We thank R1 for the suggestion and will explain the reason for choosing the LR in the main paper instead of the Supp., as it is important and might have been missed by R4. The reasoning behind the LR choice for optimizing Grid Search (GS) is in Supp. Fig. 1, which shows linear mode connectivity (LMC) (θ = λ·θ_A + (1−λ)·θ_B) between two models θ_A and θ_B generated during GS. By varying λ and calculating performance at θ for each model pair, we analyze the impact of each hyperparameter (HP). Ideally, the LMC curve should be an inverted parabola, or a straight curve if the models lie in the same basin. We observe that solely changing the seed and augmentation (aug), while other HPs are fixed, yields smooth curves (models with large differences cause a drop in F1), whereas LR variations (subfigs c & f) result in erratic patterns and significant F1 drops, indicating that the models lie in different basins. This implies the LR plays a crucial role in guiding models to specific basins, while the other HPs aid in converging to global optima. [33,24] also support LR as a critical HP.
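
A minimal sketch of this LMC probe, assuming two fine-tuned PyTorch state dicts theta_a and theta_b, a model/val_loader pair, and a hypothetical evaluate_f1 helper that scores the model on the validation set:

```python
import copy
import torch

def interpolate(theta_a, theta_b, lam):
    """theta = lam * theta_A + (1 - lam) * theta_B, per parameter tensor."""
    theta = copy.deepcopy(theta_a)
    for key in theta:
        theta[key] = lam * theta_a[key].float() + (1 - lam) * theta_b[key].float()
    return theta

# Sweep lambda and score each interpolated model: a smooth curve suggests the
# endpoints share a basin; erratic F1 drops suggest they sit in different basins.
for lam in [i / 10 for i in range(11)]:
    model.load_state_dict(interpolate(theta_a, theta_b, lam))
    f1 = evaluate_f1(model, val_loader)  # hypothetical validation helper
```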

[R1] Augmentations: While we only employed heavy augmentation in our FGG, we see a similar trend of better results in most cases compared to the greedy soup of FGG models (ablation study in Supp.). For consistency across the several datasets, we did not opt for dataset-specific medical augmentations, but these will be explored in future work.

[R1, R4] Fission: Fission starts from the base models (8), cyclically varying the LR for 17 epochs to generate multiple models (5 each). We will explain this in detail in the paper. Fission helps models escape several local minima in the rough loss surface and generates more generalizable models, facilitating easier model averaging (Fusion using HS).
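
A sketch of this Fission loop under stated assumptions: PyTorch's CyclicLR as the cyclical scheduler, a two-epoch triangular cycle with snapshots at cycle minima, and an assumed train_loader. The optimizer, LR range, and snapshot cadence are illustrative, not the paper's exact settings.

```python
import copy
import torch
import torchvision

# One base model; Fission is run from each fine-tuned base model. The
# classifier head would be replaced to match the target dataset's classes.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-3,
    step_size_up=len(train_loader), mode="triangular")  # 2-epoch cycle

snapshots = []
for epoch in range(17):                  # per-model budget quoted in the rebuttal
    for x, y in train_loader:            # train_loader assumed to exist
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                 # CyclicLR steps once per batch
    if (epoch + 1) % 2 == 0 and len(snapshots) < 5:  # LR back at base_lr
        snapshots.append(copy.deepcopy(model.state_dict()))
# snapshots now holds 5 weight-space candidates for Fusion (hierarchical souping).
```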

[R1, R4, R6] Computation Cost: Training: Model soups' GS requires 2400 epochs to generate 48 models (50 epochs each), whereas our FGG needs only 536 epochs (8×50 + 8×17), over 4× fewer than GS. The time taken per epoch is the same in both settings. Inference: Unlike ensembles, we hierarchically average model weights to obtain a single model without incurring additional inference or memory costs. Since only one model is needed for inference, this is important in hospital settings where compute resources are often limited, such as portable ultrasound devices. Details will be added to the paper.

[R1] Comparison to natural datasets: We focus on addressing the challenge of rough loss surfaces, which impede traditional model soups designed for generally smoother surfaces. Greedy soup on medical datasets often fails to merge the best model with others, potentially compromising generalizability. Our method targets medical datasets while also demonstrating superior performance on natural-domain datasets.

[R6] Results: Performance variations across datasets range from minimal to significant improvements. Notably, our approach (GoG) achieves a 6% improvement on datasets like HAM10000 (ResNet50) and CheXpert (DeiT-S) compared to the SOTA greedy soup. FGG is part of our method, so we do not compare against it.

[R6] Statistical Significance: Starting from a fixed initialization, and given that GS involves training multiple models with different hyperparameter settings, mean and std are generally not computed in model-merging works [33,8,34,3], as the merged model is itself an average of weights from multiple runs.

[R6] Metrics: Table 1 uses different classification metrics for different datasets, as each is the standard metric for that dataset [arXiv:2203.01825, DOI:10.1007/978-3-031-45673-2_44].

[R6] GoU: GoU is FGG+HS. In Table 1 of the Supp., we conduct ablations over various combinations; GS+HS (GoU) applies GoU (an HS technique) to GS models.

[R4] Future work: To show the effectiveness of our approach across several datasets (7), we utilized the commonly used ImageNet-pretrained ResNet50 and DeiT-S architectures. Future work will explore larger architectures, training techniques (SSL), other tasks such as segmentation, and medically pretrained models.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces FissionFusion, a novel framework for enhancing in-domain (ID) and out-of-domain (OOD) performance in medical image analysis by combining multiple fine-tuned models. The framework features two key contributions: (1) a fast geometric generation (FGG) approach that efficiently generates diverse models with minimal computational overhead, and (2) a hierarchical merging strategy that aggregates models at local and global levels. The paper is well-written and highlights the complexities of medical datasets, offering novel techniques in FGG and hierarchical souping. However, the reported performance improvements over previous state-of-the-art methods are relatively small, and the paper lacks statistical significance analysis and detailed computational cost comparisons. Overall, the strengths outweigh the weaknesses, so I suggest accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


