Abstract

Deep neural networks have significantly improved volumetric medical segmentation, but they generally require large-scale annotated data to achieve better performance, which can be expensive and prohibitive to obtain. To address this limitation, existing works typically perform transfer learning or design dedicated pretraining-finetuning stages to learn representative features. However, the mismatch between the source and target domain can make it challenging to learn optimal representations for volumetric data, while the multi-stage training demands higher compute as well as careful selection of stage-specific design choices. In contrast, we propose a universal training framework called MedContext that is architecture-agnostic and can be incorporated into any existing training framework for 3D medical segmentation. Our approach effectively learns self-supervised contextual cues jointly with the supervised voxel segmentation task without requiring large-scale annotated volumetric medical data or dedicated pretraining-finetuning stages. The proposed approach induces contextual knowledge in the network by learning to reconstruct the missing organ or parts of an organ in the output segmentation space. The effectiveness of MedContext is validated across multiple 3D medical datasets and four state-of-the-art model architectures. Our approach demonstrates consistent gains in segmentation performance across datasets and architectures, even in few-shot scenarios.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/4230_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/4230_supp.pdf

Link to the Code Repository

https://github.com/hananshafi/MedContext

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gan_MedContext_MICCAI2024,
        author = { Gani, Hanan and Naseer, Muzammal and Khan, Fahad and Khan, Salman},
        title = { { MedContext: Learning Contextual Cues for Efficient Volumetric Medical Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a universal training strategy to improve the performance of 3D medical image segmentation by jointly optimizing the supervised segmentation task and a self-supervised task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Multiple 3D medical image datasets and different network architectures were used to verify the effectiveness of MedContext.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Comparison to the state-of-the-art: though multiple transformer backbones have been used for evaluation, nnU-Net [1] was not included. nnU-Net has outperformed most transformer networks by winning many MICCAI segmentation challenges. nnU-Net vs. nnU-Net + MedContext should be considered as the major comparison in this paper.

    • The proposed method does not seem novel to me. It follows the same teacher-student framework as in DINO-v2 [2] and uses MAE [3]-style masking as the perturbation to the inputs of the student model. Self-supervised learning usually shows promising results when there are huge amounts of unlabeled data, which is not the case for this study.

    [1] Isensee, Fabian, et al. “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.” Nature Methods 18.2 (2021): 203-211.
    [2] Oquab, Maxime, et al. “DINOv2: Learning robust visual features without supervision.” arXiv preprint arXiv:2304.07193 (2023).
    [3] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I strongly recommend that the authors release their code for reproducibility, especially since the proposed method is universal and adaptable to different network architectures.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned in the weakness section, I strongly recommend that the authors validate the proposed method on nnU-Net and release the code for better implementation details and reproducibility.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Due to the lack of the comparison to an important baseline and publicly available code, I would recommend a weak rejection for this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The results of nnU-Net are still not available in the rebuttal, though it was mentioned that it was validated in the experiments. It is still not clear whether there is a performance gap between the transformers and nnU-Net in the experiments; if nnU-Net is better, why should we care about the performance improvement on the under-performing transformer baselines?



Review #2

  • Please describe the contribution of the paper

    The authors propose a training strategy where a supervised segmentation loss is used together with a segmentation-from-masked-input loss. To this end, a student-teacher setup is designed. The student network is fed with the original and masked input; the teacher is fed with the original. Consistency between the teacher’s output and the masked-input prediction is enforced through a consistency loss. Experiments with 3 different datasets are presented. The proposed training is applied to 4 different architectures and compared with pre-training, which is not used for the proposed technique.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Being able to generate high quality segmentations without pre-training is pretty good.
    • The idea is novel to the best of my knowledge. Integration of supervised and masked prediction loss is quite interesting.
    • The method is simple and intuitive. Authors do a great job in explaining it.
    • Experimental design is quite good.
    • Shown results are in favor of the proposed technique. It seems like the training strategy can reduce the importance of pre-training.
    • Ablation studies are good. They show the value of all the components.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The only weakness is regarding the results. The authors present results on publicly available datasets for the baselines that are not necessarily consistent with what was reported in the respective articles. For instance, in https://arxiv.org/pdf/2103.10504.pdf, the authors report an average Dice score of 0.891 for UNETR on the BTCV dataset, a value that is much higher than what is reported in Table 1 here. This raises a question about the implementation and training of the baselines.

    I understand that the authors would like to make sure the training with and without MedContext is comparable. However, the discrepancy with previously reported values is very high. I think this needs to be explained. For instance, if UNETR is trained until it achieves a value similar to what has been reported, does the added benefit of MedContext remain?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please discuss the discrepancy between the numbers mentioned above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel training framework designed to enhance volumetric medical segmentation by learning contextual cues without relying on extensive annotated datasets.

    The framework can be incorporated into any training pipeline, and it is validated on both transformer and CNN architectures across different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed framework is novel and effective compared to the existing pretraining-finetuning scheme. For instance, the way the supervised and self-supervised losses are jointly optimized for 3D segmentation is novel. Due to its simplicity, the method can be easily implemented in any segmentation training task. Besides, the input is reconstructed from learnable mask embeddings by optimizing the segmentation outputs.

    Paper is well-written with clear figures and equations, which makes this paper easy to follow.

    The ablation study provides a comprehensive validation of the different aspects of the proposed method. For instance, each module they propose is validated to be effective. They also validate the framework on different kinds of models, proving its universality for segmentation tasks.

    The method obtains results competitive with the state of the art.


  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no comparison with other training frameworks apart from pretraining and fine-tuning.

    An ablation study on the choice of loss function should also be explored.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should also try to jointly train the model with the supervision loss and other self-supervised losses (e.g., MAE, SimCLR).

    The choice of loss seems arbitrary; could the authors explain their motivation?

    I believe there are other works that jointly optimize a self-supervised loss and a supervised loss (e.g., “USCL: Pretraining Deep Ultrasound Image Diagnosis Model Through Video Contrastive Representation Learning”). Could the authors compare with these joint optimization methods?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written and the method is novel, especially the contribution of jointly optimizing the model in the 3D segmentation task by reconstructing the input through learnable embeddings. The comparisons provide valuable information. Figures are easily readable and provide complementary information to the manuscript.

    The training framework seems universal and can be integrated into any model, which makes it useful for Medical Image Computing tasks.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors clearly addressed the questions from the reviewers, and to the best of my knowledge the method is novel and effective. Besides, the writing of the paper is quite standard and easy to read. However, I would still suggest that the authors add an experiment combining the supervision loss with other self-supervised losses (e.g., MAE, SimCLR), outside the pretraining-finetuning paradigm. Finally, I would recommend this paper for acceptance.




Author Feedback

We thank the reviewers for their positive feedback. Kindly find the responses to the specific queries below:

Reviewer-1:

Discrepancy in UNETR results: Please note that UNETR reports results on all twelve organs of Synapse with a 24-6 data split. In our case, we use an 18-12 data split (refer to the experiments section) with 18 train samples and 12 test samples and report scores on eight challenging organs. For a fair comparison, we train all models for the same number of epochs on Synapse with the same 18-12 data split.

Reviewer-3:

Results on nnUNET: Our approach has been comprehensively validated with both transformer (UNETR, SwinUNETR, nnFormer) and CNN (PCRLv2) models. As per the Reviewer’s suggestion, we have validated our approach on nnUNET as well. We are committed to open-sourcing all our code and models for the benefit of the medical community (including nnUNET evaluation scripts). Due to strict MICCAI policy, we cannot post any links to results or code at this point.

On Novelty: Kindly note that MAE and DINO-v2, while effective, are self-supervised learning methods that necessitate extensive datasets for pre-training. Such huge and diverse datasets are often unattainable in the medical imaging field due to factors like high costs, privacy concerns, etc. Hence, our approach presents a single-stage, end-to-end training framework to boost the performance of medical imaging models in low-data regimes. In addition, while masking has been introduced in the realm of self-supervised pre-training, our proposed approach stands out as the first in the medical imaging community to use a volumetric 3D tube masking strategy in a student-teacher distillation framework at the fine-tuning stage in an end-to-end pipeline. Furthermore, our volumetric masked tokens are learnable and can effectively acquire contextual knowledge through reconstruction in the output segmentation space. We hope this explanation provides insight into the effectiveness of our proposed approach for medical imaging problems.
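For readers unfamiliar with tube masking, the following is a minimal sketch of what a volumetric tube mask with a learnable mask token might look like. This is an illustration only, not the paper's implementation: the grid sizes, mask ratio, and function names are assumptions.

```python
import numpy as np

def tube_mask(d, h, w, mask_ratio=0.4, rng=None):
    """Sample a 2D mask over the (h, w) patch grid and repeat it across
    all d depth positions, so entire depth-wise 'tubes' are masked."""
    rng = rng or np.random.default_rng(0)
    n = h * w
    masked_idx = rng.choice(n, size=int(mask_ratio * n), replace=False)
    spatial = np.zeros(n, dtype=bool)
    spatial[masked_idx] = True
    return np.broadcast_to(spatial.reshape(1, h, w), (d, h, w)).copy()

def apply_mask_token(tokens, mask, mask_token):
    """Replace patch embeddings at masked positions with a single mask
    token (a plain vector here; in training it would be learnable)."""
    out = tokens.copy()
    out[mask.reshape(-1)] = mask_token
    return out
```

In a student-teacher setup of this kind, the student would see the token sequence with mask tokens substituted in, while the teacher sees the unmasked sequence; consistency between their outputs is then enforced in the output segmentation space.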

Reviewer-4:

Comparison with other training frameworks: Please note that in addition to comparing our proposed MedContext with pretraining-finetuning paradigms (Tables 5 and 6), our method shows superior results compared to baselines (Tables 1, 2, and 3) when trained from scratch. We have further shown the effectiveness of MedContext in the few-shot scenario in Table 4. We would also like to draw the reviewer’s attention to Table 2 of the supplementary material, where we integrate MedContext with a relatively new framework, MedNeXT (MICCAI’23), and show performance gains on this framework.

Exploration of loss function: We kindly refer the reviewer to Table 2 of the supplementary material for a comparison of different loss functions. Our choice of a normalized L2 loss between the masked student output and the teacher output stems from the fact that the normalized L2 objective is scale-invariant, which is beneficial for 3D segmentation tasks, where the magnitude of logits can vary. He et al. [1] further show that an L2 objective with normalized logits as the reconstruction target improves representation quality.
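For concreteness, a minimal sketch of such a scale-invariant consistency objective is given below, assuming voxel logits of shape (batch, classes, D, H, W). The function name and normalization axis are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def normalized_l2_loss(student_logits, teacher_logits, eps=1e-8):
    """L2 distance between channel-normalized student and teacher logits.
    Normalizing along the class axis (axis=1) makes the loss invariant to
    the overall scale of either network's outputs."""
    s = student_logits / (np.linalg.norm(student_logits, axis=1, keepdims=True) + eps)
    t = teacher_logits / (np.linalg.norm(teacher_logits, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum((s - t) ** 2, axis=1)))
```

Because each voxel's class vector is normalized to unit length before the distance is taken, multiplying either network's logits by a positive constant leaves the loss essentially unchanged, which is the scale-invariance property referred to above.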

Similarities with USCL: Kindly note that USCL is a pre-training method providing pre-trained backbones for downstream medical tasks. Our proposed MedContext, on the other hand, is a single-stage, end-to-end training solution that directly utilizes downstream data for efficient volumetric segmentation, without any pre-training. Our approach combines self-supervised and supervised objectives in an end-to-end framework. MedContext is the first to use a volumetric 3D tube masking strategy in a student-teacher distillation framework during fine-tuning. Our learnable volumetric masked tokens effectively capture contextual knowledge essential for volumetric segmentation.

[1] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” CVPR 2022.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a universal training framework called MedContext with contextual knowledge distillation. This framework aims to teach the student network to reconstruct missing organs or parts of organs in the output segmentation space. The major criticism from reviewers centers on the baselines for comparison, which the authors have partially addressed. The baselines and the proposed method were evaluated on the same data splits, although these were different from those used in previous work. Yet, the authors did not provide a comparison to nnU-Net, which is the main concern of R3. Despite this, the novelty and the sufficient experimental results across three public datasets with four strong baseline methods, including nnFormer, stand out. Thus, the Area Chair recommends acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    In the field of medical image segmentation, many tasks now have a significant amount of annotated data, for instance, there are as many as 1000 annotated cases for multi-organ segmentation (such as TotalSegmentator dataset). From a practical application standpoint, one would not train models using only the 18 cases from the BTCV dataset but would utilize many more publicly available annotated datasets. We are more concerned with whether a pretrained model can continue to improve the performance of supervised learning on such data. The experiments conducted solely on the 18 BTCV cases in this paper cannot prove the effectiveness of the pretrained models on a real-world task with more annotations. Additionally, R3 mentioned the lack of experiments with nnU-Net. It is unfortunate that this article does not provide an answer as to whether the proposed method can continue to drive improvements on top of a strong segmentation model.

    Overall, the meta-reviewer believes that this work does not bring new insights to the community and there are several issues that need further verification and discussion. Therefore, it is not suitable for acceptance at this time.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I agree with Meta-reviewer 1 that this paper has sufficient novelty and contribution. I don’t think it’s that essential to include nnUNet as a baseline. Also, the method has been experimented on 3 public datasets, not just one, as noted by Meta-reviewer 4.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



