Abstract

The generative self-supervised learning strategy exhibits remarkable representation learning capabilities. However, limited attention has been paid to end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK) based on masked image modeling and apply it to large-scale pre-training on medical images. First, we perform a bottom-up 3D hybrid masking strategy on the encoder to keep masking consistent. Then we utilize sparse convolution for the top CNNs and encode unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that our proposed pre-training strategy demonstrates robust transferability in supervised downstream tasks and sheds light on HySparK’s promising prospects. The code is available at https://github.com/FengheTan9/HySparK.
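
To make the abstract’s “bottom-up masking” concrete for readers of this page, below is a minimal PyTorch-style sketch of the idea: one random mask is drawn at the bottom ViT’s patch granularity and shared by both branches, with the CNN branch masked at voxel level (true sparse convolution would skip these voxels outright). All function names, shapes, and the 75% default are illustrative assumptions, not the repository’s actual API.

```python
import torch
import torch.nn.functional as F

def bottom_up_mask(batch, grid, ratio=0.75, device="cpu"):
    """Draw one random keep-mask at ViT patch granularity (True = kept)."""
    d, h, w = grid
    n = d * h * w
    n_keep = int(n * (1 - ratio))
    scores = torch.rand(batch, n, device=device)
    keep_idx = scores.topk(n_keep, dim=1).indices
    mask = torch.zeros(batch, n, dtype=torch.bool, device=device)
    mask.scatter_(1, keep_idx, True)
    return mask  # one mask, reused by both the CNN and the ViT branch

def drop_masked_tokens(tokens, mask):
    """ViT branch: keep only unmasked patch embeddings. tokens: (B, N, C)."""
    B, _, C = tokens.shape
    return tokens[mask].reshape(B, -1, C)

def mask_voxels(volume, mask, grid, patch_size):
    """CNN branch: zero out masked voxels (a dense stand-in for sparse conv)."""
    d, h, w = grid
    m = mask.reshape(-1, 1, d, h, w).float()
    m = F.interpolate(m, scale_factor=patch_size, mode="nearest")
    return volume * m
```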

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0897_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0897_supp.pdf

Link to the Code Repository

https://github.com/FengheTan9/HySparK

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tan_HySparK_MICCAI2024,
        author = { Tang, Fenghe and Xu, Ronghao and Yao, Qingsong and Fu, Xueming and Quan, Quan and Zhu, Heqin and Liu, Zaiyi and Zhou, S. Kevin},
        title = { { HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose an end-to-end hybrid generative pre-training strategy based on masked image modeling and apply it to large-scale pre-training on medical images. It utilizes a CNN-Transformer hybrid architecture for large-scale pre-training, which can capture both local and global semantic features. It is the first generative self-supervised learning method for a 3D hybrid architecture. The authors validate the effectiveness of the method on downstream segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    To the best of my knowledge, the proposed end-to-end hybrid generative pre-training strategy is the first successful generative self-supervised learning method for a 3D hybrid architecture. The method is innovative: masked image modeling is applied in a 3D hybrid CNN-Transformer architecture, and its effectiveness has been verified experimentally. Also, the method is highly extensible: it is a general approach that does not restrict the specific hybrid encoder (e.g., a particular CNN or Transformer) to be pre-trained. It is promising that the performance of the method can be further improved with future advanced CNNs and Transformers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The datasets used for the downstream tasks are all included in the pre-training dataset. Although no downstream task labels were used in pre-training, the model should preferably be evaluated on additional, completely independent datasets to better verify that it has acquired transferable representations.

    2. On the BTCV dataset, whether the model is fine-tuned with 20% or 100% of the data, the advantages of this paper’s method on many organs are not obvious, and the authors do not analyze the reason. The authors say that the method “learns strong multiscale representations”, but why does it not work well on larger organs, such as the kidney? The necessary analysis is missing.

    3. From the ablation study in Table 5, the contribution of each component of HySparK is actually very small, and it is of particular interest that HySparK performs only marginally better than the case without pre-training. Does this suggest that the pre-training is inefficient, i.e., that so much effort and data are spent on pre-training for such a limited improvement on the downstream task?

    4. In the ablation study, what masking strategy replaces bottom-up masking in the model w/o bottom-up masking? This is not clarified in the paper.

    5. It seems that Fig. 1 doesn’t quite illustrate the “bottom-up masking” technique.

    6. In Subsection 2.3, the description of the fine-tuning stage of the model is not clear enough. It is suggested to clearly point out which parts of the model architecture are trainable and which are frozen in the fine-tuning stage.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is recommended that the source code be provided, with more adequate instructions in the paper, so that readers can better reproduce the model. For example: do the other competing methods use the same pre-training setup? Which parts of the model architecture are trainable and which are frozen in the fine-tuning stage?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is recommended to perform downstream tasks on some other additional and completely different datasets to better verify that the model has acquired transferable representation capabilities.
    2. It is recommended to perform more analysis of the unsatisfactory experimental results.
    3. It is suggested that more and clearer elaboration of the model method details and experimental details should be carried out in the paper.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The perspective of the paper is prospective and the methodology is somewhat innovative, effective and extendable. However, more adequate experimental verification and more detailed and clear description of the details are needed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    HySparK is an end-to-end 3D pre-training method that aims to combine CNNs, for their ability to capture local features, with Transformers, which are able to capture global context. It uses reconstruction of a masked 3D volume as a pre-training task on a large variety of 3D datasets. On some of the datasets, the approach demonstrates a rather small improvement in downstream tasks compared to state-of-the-art pre-training methods, especially compared to SparK.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel end-to-end training of a hybrid CNN and vision Transformer architecture allows combining the advantages of both architectures: both global context and local details are captured
    • Evaluation on a large amount of CT data, which gives us a good overview of the capabilities of the network, especially with the qualitative results from the supplementary material
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No significant improvements in the downstream segmentation task, especially compared to SparK. It is questionable whether the overhead of the ViT and the additional architectural changes is justified.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • All 13 datasets are used for pre-training and BTCV and MSD are used for fine-tuning and then evaluation, meaning both datasets used in the evaluation are also part of the training set. It is important to see how the method performs on a completely unseen data. How would the model generalize if the 2 datasets used for evaluation would be excluded from the pre-training?
    • It seems like supervised training on these datasets already achieves a high score. Is self-supervised training in this scenario necessary? It would be interesting to see this approach evaluated on other modalities such as MRI or even Ultrasound, in which pre-training could possibly play a larger role.
    • Although the supplementary material presents visualizations of the segmentations after the fine-tuning, there are no visualizations of the reconstructions during pre-training. It would be interesting to have an overview of the reconstruction capabilities of the proposed method.
    • There are also no cases shown in which the proposed method fails compared to the other methods. It is relevant to understand the cases in which the proposed method does not improve segmentation results.
    • More details on how the masked patches are chosen are needed. In the paper’s example visualization, it seems that large areas are masked within a slice. Not only the amount of masking but also the contiguity of the masked patches is relevant.
    • In the results section, “significant” is used without any computation of statistical significance.
    • The paper could benefit from an improved discussion of the results, pitfalls, and other potentially relevant application areas, e.g. other imaging modalities.

    Minor comments:

    • Underlined second-best methods in Table 3 are missing
    • Small grammar errors (“linear project layer”)
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology of this end-to-end hybrid of ViT and CNN is relevant for the community and could potentially be applied to other imaging modalities in which supervised training is limited by the lack of annotations as well as inherent limitations of the modality.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK), which combines the strengths of convolutional neural networks, which excel at learning local representations, with Transformers, which excel at learning global contextual representations.

    To address the challenge of matching masks between the convolutional and Transformer architectures, the authors employ “bottom-up masking”: they initialize the masks based on the patch division in the bottom ViT layer, and then use sparse convolution in the CNN to drop masked patches from the computation. By ensuring the consistency of masking between the different layers of the neural network, they address the data distribution shift problem described in [17].
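
    To illustrate the consistency mechanism the reviewer describes, here is a rough sketch of how one bottom-level patch mask can be re-applied after every CNN stage so that masked regions never leak back in through overlapping kernels; this re-masking emulates what sparse convolution achieves natively by skipping masked voxels. The stage structure and names are assumptions for illustration, not the paper’s code.

```python
import torch.nn.functional as F

def remask(feat, patch_mask, grid):
    """Resize the bottom-level patch mask (B, N) to this feature map's
    resolution and re-apply it, keeping masking consistent across scales."""
    d, h, w = grid
    m = patch_mask.reshape(-1, 1, d, h, w).float()
    m = F.interpolate(m, size=feat.shape[2:], mode="nearest")
    return feat * m

def masked_cnn_forward(stages, volume, patch_mask, grid):
    """Run a list of conv stages, re-masking after each one."""
    feats = []
    x = remask(volume, patch_mask, grid)
    for stage in stages:
        x = stage(x)
        x = remask(x, patch_mask, grid)  # masked regions stay masked
        feats.append(x)
    return feats  # multi-scale features for the hierarchical decoder
```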

    In addition, the authors introduce skip-connections, which have been shown to be important in UNet-like architectures.
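
    For the skip-connection point, a hierarchical reconstruction decoder of the kind described might look roughly like the following; the channel counts, layer choices, and names are hypothetical, not the paper’s exact decoder.

```python
import torch
import torch.nn as nn

class SkipDecoder(nn.Module):
    """Upsample from the deepest feature, fusing each encoder scale via a skip."""

    def __init__(self, channels, out_ch=1):
        super().__init__()
        # channels: encoder channels ordered deepest -> shallowest, e.g. [384, 192, 96, 48]
        self.ups = nn.ModuleList(
            nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.fuse = nn.ModuleList(
            nn.Conv3d(2 * c, c, kernel_size=3, padding=1)
            for c in channels[1:]
        )
        self.head = nn.Conv3d(channels[-1], out_ch, kernel_size=1)  # voxel reconstruction

    def forward(self, feats):
        # feats: encoder outputs ordered deepest -> shallowest
        x = feats[0]
        for up, fuse, skip in zip(self.ups, self.fuse, feats[1:]):
            x = up(x)                                 # double the spatial resolution
            x = fuse(torch.cat([x, skip], dim=1))     # merge with the skip connection
        return self.head(x)
```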

    They conduct extensive experiments on multiple CT datasets and demonstrate improved performance.

    The authors commit to releasing the code soon.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors address an important challenge in computer vision and medical imaging, namely combining successful convolutional and transformer architectures into a single model, and pretraining it with the self-supervised masked image modeling strategy.

    In order to accomplish the above, they carefully address the challenge of matching image masking across a hybrid CNN + Transformer architecture, leveraging sparse convolutions in the CNN part of the model. This architecture can serve as a strong baseline for medical image segmentation, and the authors commit to releasing the code.

    Leveraging their novel architecture, the authors demonstrate state-of-the-art performance on two CT organ segmentation datasets.

    They conduct ablation experiments (Table 5) to determine the impact of different design decisions.

    The paper is clearly structured, and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors describe a novel hybrid (CNN+Transformer) architecture for self-supervised pretraining by masked image modeling, and evaluate on multiple test datasets. The method appears well motivated, and carefully implemented; however, the overall performance improvement is quite small. In the case of BTCV, when training with 100% of the annotated data for image segmentation, they achieve a 1.17% improvement. On MSD, they achieve 0.49% improvement. The authors should include a measure of whether these improvements are statistically significant, since they are one of the main results of the paper.

    Note: It appears that SparK is erroneously underlined in Table 2, whereas vox2vec has a better performance.

    The authors conduct an experiment comparing different masking ratios, similar to the original MAE paper, and conclude that 75% is the optimal masking ratio (Table 4). However, it appears that the results are relatively insensitive to the masking ratio, and also non-monotonic (mask 25% is better than mask 50%). It would be interesting to see how the model behaves at a higher masking ratio, since medical images are highly structured, relative to natural images.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the introduction, the authors claim that generative methods (such as masked image modeling) outperform contrastive methods on downstream tasks. I don’t think such a broad claim is accurate. Many papers have demonstrated that contrastive learning frameworks, such as DINO, can outperform masked image modeling on tasks such as image classification, which require high-level semantics to discriminate between multiple classes. Whereas, masked image modeling has been shown to outperform contrastive learning on tasks such as image segmentation, which require pixel-level predictions.

    The masking consistency problem has already been addressed in [17]. In particular, the issues of pixel-distribution shift and mask pattern vanishing. However, the authors claim that in a hybrid architecture, these issues re-emerge, due to the “inconsistent masking strategies in both worlds”. It would be helpful if the authors could include an example (eg, a figure) to demonstrate this effect.

    The authors mention that training is done on a single A800 GPU with batch size = 2. It seems advantageous to scale to multiple GPUs to increase the effective batch size during training; 8 GPUs may be sufficient (batch size = 16) to avoid the need for multi-node training. In addition, please include a description of the total time required for pre-training the HySparK model.

    Minor comments:

    • Introduction, first paragraph: “Contrastive-based methods and generative-based methods”, can simply be called “Contrastive methods and Generative methods”.
    • Introduction, second paragraph: “pivot rules” -> “pivotal roles”
    • Introduction second paragraph: “data-hungry of ViTs” -> “data-hungry nature of ViTs”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors implement a novel architecture, demonstrate improved performance over existing state-of-the-art methods, and commit to releasing their code.

    I think this paper will provide a useful baseline for medical image segmentation for the community.

    In addition, the paper was well motivated, well structured, and easy to follow.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate the reviewers’ comments and insightful suggestions, especially the interesting and meaningful proposals. The reviewers recognize the innovation, promising perspective, and community extensibility of our work. Our responses are as follows:

CHOICE OF MASKING PATCHES & RATIO: Thanks for the suggestions by reviewers R1 and R3. The way patches are masked is interesting to discuss: an appropriate patch masking strategy could further improve HySparK’s representation learning capabilities. In the paper, we apply random masking of the patches, and we will add the masking details in the camera-ready version. In addition, for the masking ratio, we use MedNeXt+ViT as the HySparK pre-trained network with three different ratios (i.e., 25%, 50%, and 75%). We believe that exploring the optimal mask ratio of HySparK is a meaningful topic, but it may depend on the hybrid network architecture, data volume, and data distribution, and searching for it requires large computing resources. We provide a coarse-grained masking ratio range in the paper, and it shows an interesting trend. We believe these proposals could open the way for the community to research various patch masking strategies and more fine-grained mask ratio searches for HySparK.
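
As a hypothetical illustration of the two masking strategies under discussion (random masking, as used in the paper, versus a contiguous block-masking alternative touching on the contiguity the reviewers raised), consider the sketch below; neither function is from the released code.

```python
import torch

def random_patch_mask(n_patches, ratio=0.75):
    """Random masking as described in the feedback: True = masked."""
    n_mask = int(n_patches * ratio)
    perm = torch.randperm(n_patches)
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[perm[:n_mask]] = True
    return mask

def block_patch_mask(grid, ratio=0.75, block=2):
    """A contiguous alternative: mask cubes of block**3 patches until the
    target ratio is reached, yielding larger connected masked regions."""
    d, h, w = grid
    mask = torch.zeros(d, h, w, dtype=torch.bool)
    target = int(d * h * w * ratio)
    while mask.sum() < target:
        z = torch.randint(0, max(d - block + 1, 1), (1,)).item()
        y = torch.randint(0, max(h - block + 1, 1), (1,)).item()
        x = torch.randint(0, max(w - block + 1, 1), (1,)).item()
        mask[z:z + block, y:y + block, x:x + block] = True
    return mask.flatten()
```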

PERFORMANCE: Thanks to the reviewers for their comments on performance. On the BTCV dataset, we use MedNeXt+ViT as our pre-training network, which achieves an average Dice of 78.58% without pre-training, only 0.63% lower than MedNeXt pre-trained by SparK. We believe it is hard to further boost performance when segmentation results are already at a high level. Compared to no pre-training, our HySparK achieves an average improvement of 2.09% and meaningful gains on some difficult-to-segment organs, such as the bladder (5.45%), the stomach (3.50%), and the adrenal glands (2.39%). On the MSD datasets, compared to SparK, HySparK improves Pancreas and Hepatic Vessel lesion segmentation by at least 1.81% and 1.60%, respectively. Moreover, we believe that downstream performance partly depends on the specific network design. HySparK serves as a foundational upstream method that can pre-train any hybrid architecture with a top CNN and a bottom ViT, and we believe it can inspire more interesting hybrid network designs and discussions of extensions to other downstream tasks in our community.

DATASET: Thanks to reviewers R2 and R3 for their suggestions on datasets. Applying our proposed Hybrid Sparse masKing method to other modalities is a very promising proposal. HySparK is a universal method that can be quickly extended to ultrasound, MRI, and other modalities, which could maximize the advantages of the hybrid architecture. In addition, we think the cross-domain setting for HySparK, i.e., pre-training on large-scale CT and performing downstream tasks on completely different datasets (e.g., MRI), still needs to be explored. Moreover, we will release the official code and weights with the camera-ready version, and we welcome everyone to follow this work and inspire more interesting community research.

OTHER CONSTRUCTIVE COMMENTS: Thanks to all reviewers for their constructive comments on the paper’s descriptions (R1, R2), grammar (R1, R3), and experimental settings and analysis (R1, R2, R3), which are very important for improving the readability and quality of the paper. We will correct, revise, and improve these points in the camera-ready version.




Meta-Review

Meta-review not available, early accepted paper.


