Abstract

Colorectal cancer (CRC) is a critical global concern. Despite advancements in computer-aided techniques, early-stage computer-aided segmentation still holds substantial clinical potential and warrants further exploration. Two challenges explain this: tumor-related information is hard to localize within the colonic region of the abdomen during segmentation, and cancerous tissue remains indistinguishable from surrounding tissue even with contrast enhancement. In this work, we propose a task-oriented Synthetic anatomical Semantics-aware Masked Image Modeling (SaSaMIM) method that leverages both existing and synthesized semantics for efficient utilization of unlabeled data. We first introduce a novel fine-grained synthetic mask modeling strategy that integrates coarse organ semantics and synthetic tumor semantics in a label-free manner. Integrating both semantics gives the model tumor location perception during the pretraining phase. Next, a frequency-aware decoding branch is designed to provide further supervision and representation of the Gaussian noise-based tumor semantics. Since the CT intensity of tumors follows a Gaussian distribution, representation in the frequency domain addresses the difficulty of distinguishing cancerous tissue from surrounding healthy tissue caused by their homogeneity. To demonstrate the proposed method’s performance, a non-contrast CT (NCCT) colon cancer dataset of 110 cases was assembled, aiming at early tumor diagnosis in a broader clinical setting. We validate our approach with cross-validation on these 110 cases and outperform the current state-of-the-art (SOTA) self-supervised method by 5% Dice score on average. Comprehensive experiments confirm the efficacy of the proposed method. To our knowledge, this is the first study to apply task-oriented self-supervised learning methods to NCCT to achieve end-to-end early-stage colon tumor segmentation.
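The idea of supervising a reconstruction in both the spatial and the frequency domain can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of an L1 penalty on amplitude spectra, and the `freq_weight` value are all assumptions made for illustration, and a naive 2D DFT on tiny list-of-lists patches stands in for a real FFT on CT volumes.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small list-of-lists image."""
    rows, cols = len(image), len(image[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2.0 * math.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def spatial_l1_loss(reconstruction, target):
    """Mean absolute intensity difference between two patches."""
    diffs = [abs(a - b)
             for row_r, row_t in zip(reconstruction, target)
             for a, b in zip(row_r, row_t)]
    return sum(diffs) / len(diffs)


def frequency_l1_loss(reconstruction, target):
    """Mean absolute difference between the two amplitude spectra."""
    spec_r, spec_t = dft2(reconstruction), dft2(target)
    diffs = [abs(abs(a) - abs(b))
             for row_r, row_t in zip(spec_r, spec_t)
             for a, b in zip(row_r, row_t)]
    return sum(diffs) / len(diffs)


def combined_loss(reconstruction, target, freq_weight=0.5):
    """Spatial term plus a weighted frequency term (weight is hypothetical)."""
    return (spatial_l1_loss(reconstruction, target)
            + freq_weight * frequency_l1_loss(reconstruction, target))
```

With a perfect reconstruction both terms vanish, so `combined_loss(patch, patch)` is zero; any mismatch in texture shows up in the amplitude-spectrum term even when mean intensities are similar.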

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1664_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/Da1daidaidai/SaSaMIM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Dai_SaSaMIM_MICCAI2024,
        author = { Dai, Pengyu and Ou, Yafei and Yang, Yuqiao and Liu, Dichao and Hashimoto, Masahiro and Jinzaki, Masahiro and Miyake, Mototaka and Suzuki, Kenji},
        title = { { SaSaMIM: Synthetic Anatomical Semantics-Aware Masked Image Modeling for Colon Tumor Segmentation in Non-contrast Abdominal Computed Tomography } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents SaSaMIM, a novel method for colon tumor segmentation in non-contrast abdominal CT scans. The key contributions are: 1. A task-oriented masked image modeling framework that integrates synthetic anatomical semantics for high-performance colon tumor segmentation in non-contrast CT. 2. A spatial-frequency dual-branch decoder to enhance the model’s perception of Gaussian noise-based target semantics. 3. Extensive experiments demonstrating the effectiveness and high performance of the proposed model over existing medical image segmentation methods and uniform self-supervised pretraining methods. In summary, the paper introduces a new self-supervised learning approach for colon tumor segmentation that leverages synthetic semantics and frequency-aware modeling to achieve state-of-the-art performance. The method is the first to apply task-oriented self-supervised learning on non-contrast CT for end-to-end early-stage colon tumor segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper include:

    1. Novel Formulation for Synthetic Anatomical Semantics: The paper introduces a novel fine-grain synthetic mask modeling strategy that integrates coarse organ semantics with synthetic tumor semantics in a label-free manner. This approach is particularly innovative as it does not require pre-existing labels, thus addressing the challenge of limited annotated data in medical imaging.
    2. Original Use of Data through Self-Supervision: SaSaMIM leverages self-supervised learning to make efficient use of unlabeled data. This is significant because it allows the model to learn from a larger pool of data, which is crucial in the medical imaging domain where annotated datasets can be scarce.
    3. Frequency-Aware Masked Image Modeling: The paper proposes a frequency-aware decoding branch that provides additional supervision and representation for Gaussian noise-based tumor semantics. This is an original contribution as it tackles the difficulty in distinguishing cancerous tissues from surrounding healthy tissues by representing the data in the frequency domain.
    4. Innovative Integration of Multi-Semantic Tokens: The paper describes the integration of multi-semantic tokens through local masking, which enables the generation of detailed and semantically dense reconstructed representations. This aspect contributes to the model’s ability to capture fine-grained details necessary for accurate segmentation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper presents a novel approach for colon tumor segmentation in non-contrast CT scans, there are potential weaknesses or areas for improvement that could be considered:

    1. Dependence on Synthetic Data: The approach relies heavily on synthetic tumor semantics generated from Gaussian noise. While this is a creative solution to the lack of labeled data, the reliance on synthetic data could potentially introduce biases if the synthetic data does not accurately represent the variability seen in real tumors.
    2. Potential Overfitting to Synthetic Data: There is a risk that the model may overfit to the synthetic data used for pre-training, especially if the synthetic data does not capture the full distribution of real-world data.
    3. Concern on Comparison: The comparative approach to segmentation methodologies within this paper raises some concerns, as it primarily focuses on the UNETR family of models, including UNETR, UNETR++, and Swin-UNETR. The field of medical image segmentation encompasses a broader array of methodologies, such as nnUNet and MedNeXt. It would be beneficial for the authors to expand their comparative analysis to include a more diverse set of established medical segmentation techniques.
    4. Lack of Ablation Studies: While the paper presents a comprehensive approach, it may benefit from ablation studies to understand the contribution of each component of the proposed method to the overall performance.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Refer to the weakness part.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript introduces a pre-trained segmentation model to harness free semantic information and employs tumor synthesis to generate both tumor instances and corresponding labels. It then proceeds with reconstruction in a multi-semantic space, encompassing both the pixel domain and the frequency domain, and conducts extensive experiments to substantiate the efficacy of the approach. However, there are several concerns that warrant attention. Notably, the effectiveness of the method is contingent upon the validity of the additional semantic spaces generated. Additionally, the selection of segmentation baselines in this paper appears to be overly focused, potentially limiting the scope of comparison. The paper also lacks ablation studies to validate the contribution of the proposed components to the overall performance.

    In light of these considerations, I assign a “Weak Reject” to the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The idea of this paper is interesting. My concerns were basically eliminated. The authors mention that they will add new baseline comparisons and results of ablation experiments in the revised version, so I give Weak Accept.



Review #2

  • Please describe the contribution of the paper

    The paper proposes “SaSaMIM”, a task-orientated masked image modeling approach designed to enhance colon tumor segmentation in non-contrast CT scans. This approach innovatively combines synthetic semantic-guided masking with a dual-branch decoder that integrates spatial and frequency domain analyses. The method aims to address the challenges of segmenting tumors that are difficult to distinguish from surrounding tissues in non-contrast CT images by leveraging synthetic anatomical semantics and Gaussian noise-based modeling for tumor regions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Innovative Integration of Synthetic Semantics: The method’s use of synthetic semantics to guide the learning process, particularly in a self-supervised framework, is innovative. It effectively addresses the lack of contrast in non-contrast CT images, which is a significant challenge in medical image segmentation.

    Dual-Decoding Architecture: The introduction of a frequency-aware decoding branch alongside a spatial decoder helps to capture both the texture and shape characteristics of tumors, potentially leading to more accurate segmentation results.

    Strong Empirical Results: The experimental results in Tables 1 and 2 demonstrate the efficacy of the proposed approach through comprehensive cross-validation experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Missing Experiments on the Frequency Branch: One key contribution of this paper is its dual-branch architecture. However, no experiments demonstrate how the frequency decoder helped improve the segmentation. Writing (minor): There are incorrectly referenced tables, leftover ‘XXX’ placeholders, and awkward sentences. Proofreading is strongly recommended.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Releasing the code is always preferable when it comes to reproducibility, as is releasing how the dataset is split into the 5 folds.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Add experiments demonstrating the efficacy of the frequency branch by, for instance, excluding the frequency loss. Instead of discussing the models for comparison very briefly in the Experiments section, add a Related Works section and discuss those models there. In addition, the standard setup of Swin-UNETR could be described, since that is the setup the authors adopted. Proofread the manuscript. There are some awkward sentences and some ‘XX’ placeholders (section 3.1) to be fixed. In addition, the table referenced in the first paragraph of section 3.3 is wrong. The supervised segmentation methods should refer to Table 2. Provide more detailed explanations and justifications for the choice of structural elements and operations used in synthetic semantic-guided masking. Discuss how different configurations might affect the model’s performance and robustness.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a technically sound and potentially impactful approach to addressing a significant challenge in medical imaging. The innovative use of synthetic semantics and the dual-decoding architecture are likely to advance the field of tumor segmentation in non-contrast CT images. However, to strengthen the paper, it is crucial to justify its methodological choices and to demonstrate the efficacy of the dual-branch design. Moreover, proofreading the manuscript is strongly recommended.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my concerns and provided numerical results. It would be great if they can include those results in the main paper.



Review #3

  • Please describe the contribution of the paper

    The paper presents an approach aimed at improving colon tumor segmentation in non-contrast computed tomography (NCCT) scans. Firstly, the authors introduce a task-oriented masked image modeling framework tailored for fine-grained synthetic anatomical semantic perception. This framework is specifically designed to enhance the segmentation accuracy of colon tumors. Secondly, the paper introduces a spatial-frequency dual-branch decoder, which serves to enhance the model’s perception of Gaussian noise-based target semantics. This decoder can contribute to refining the segmentation process, particularly in scenarios where noise might affect the accuracy of the segmentation results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper exhibits several strengths that elevate its significance in the field of medical imaging. Firstly, it introduces the task-oriented masked image modeling framework tailored for fine-grained synthetic anatomical semantic perception, a unique approach aimed at improving colon tumor segmentation in non-contrast computed tomography (NCCT) scans. This formulation addresses the specific challenges of segmenting colon tumors with high precision. Additionally, the design of a spatial-frequency dual-branch decoder enhances the model’s perception of Gaussian noise-based target semantics, thereby improving segmentation accuracy, particularly in scenarios where noise might affect the results. Moreover, the paper demonstrates the feasibility of the proposed approach in a clinical setting, emphasizing its potential real-world applicability. Furthermore, the paper likely includes a rigorous evaluation of the proposed method, involving comprehensive experiments conducted on relevant datasets and employing robust performance metrics. Finally, the design of Figure 2 likely contributes to the paper’s strength by effectively illustrating the architecture or components of the proposed framework, aiding readers in understanding the approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper has several notable weaknesses that should be addressed to bolster its contribution and clarity. Firstly, it fails to define common abbreviations like CT and MRI upon their initial use, potentially causing confusion for readers unfamiliar with medical imaging terminology. Secondly, the relatively small sample size of 110 cases used raises concerns about the generalizability of the proposed approach. A validation on a larger and more diverse benchmark dataset would enhance the robustness and applicability of the findings. Additionally, the use of the abbreviation “MAE” may lead to confusion among readers accustomed to its traditional meaning as “Mean Absolute Error.” Providing a distinct abbreviation or clarification would mitigate this confusion. Furthermore, the paper lacks clarity regarding the transfer of weights from Swin-UNETR, a model primarily focused on brain segmentation, to colon segmentation. Clarifying this process would improve reproducibility and understanding. Moreover, the statement regarding dataset compilation, “we compiled a dataset of early-stage colorectal cancer CT scans from XXX,” lacks specificity about the data source, compromising transparency and the ability for readers to assess the dataset’s quality. Lastly, the absence of a comparison with SegNet using a pretrained encoder such as ResNet diminishes the paper’s comprehensiveness. Including such a comparison would provide valuable insights into the relative performance of the proposed approach compared to established methods in medical image segmentation. Addressing these weaknesses would strengthen the paper’s contribution and ensure its clarity and comprehensiveness.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Firstly, providing definitions for common abbreviations like CT and MRI upon their first mention would greatly improve the accessibility of the paper, particularly for readers who may not be well-versed in medical imaging terminology. This can be achieved through the inclusion of a glossary or footnotes to clarify these abbreviations throughout the manuscript.

    Secondly, while the paper demonstrates promising results with a sample size of 110 cases, it would greatly benefit from validation on a larger and more diverse benchmark dataset. This would strengthen the credibility and generalizability of the study’s findings, providing a more robust foundation for the proposed approach. Expanding the dataset or conducting additional validation on external datasets could address this concern effectively.

    Moreover, to avoid confusion among readers familiar with the traditional meaning of “MAE” as “Mean Absolute Error,” it would be advisable to use a distinct abbreviation or provide clarification within the text to differentiate its usage in the context of the study. This clarity would prevent misunderstandings and ensure consistency in terminology throughout the paper.

    Additionally, transparency regarding the transfer of weights from Swin-UNETR to colon segmentation is essential for reproducibility and understanding. Providing detailed information on this process, including any adaptations made for the specific application, would enhance the transparency and credibility of the methodology. Consider including a dedicated section or supplementary material to elucidate this aspect of the study.

    Furthermore, clarifying the source of the compiled dataset, such as naming the institution or database from which the scans were obtained, would improve transparency and allow readers to assess the dataset’s quality and relevance. Detailed information about data collection protocols and ethical considerations would further enhance transparency and trustworthiness.

    Lastly, including a comparison with established methods like SegNet using a pretrained encoder such as ResNet would enrich the paper by providing valuable insights into the relative performance of different approaches. Conducting additional experiments or analyses to include this comparison would highlight the strengths and limitations of each method, contributing to a more comprehensive evaluation of the proposed approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation for a Weak Accept, contingent on rebuttal, is based on several factors. Firstly, the paper introduces concepts such as the task-oriented masked image modeling framework and the spatial-frequency dual-branch decoder, showing potential for significant impact in medical imaging. However, improvements are needed in defining abbreviations, clarifying technical processes, and providing specific dataset details to enhance clarity. While the paper presents promising results, the small sample size and lack of comparison with alternative methods raise concerns about robustness and generalizability. Addressing these limitations through validation on larger datasets and comparisons with established methods would strengthen the paper’s contribution. Moreover, enhancing reproducibility and transparency by clarifying technical processes and dataset sources is crucial for ensuring credibility.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their valuable suggestions. We were encouraged by the positive comments on the novelty and extensive experiments in real-world settings by all reviewers. We address their concerns below.

#R3 [Dependence on Synthetic Data & Potential Overfitting] As pointed out correctly, in supervised learning, the quality of synthetic data is crucial, as in SyntheticTumors (CVPR2023) and DiffTumor (CVPR2024). We agree that using synthetic data for supervised training may cause overfitting. However, we did not use the “synthetic images” for supervision on segmentation. Instead, we aimed to prompt the network to focus on key “tumor semantics” in an unlabeled manner and facilitate pre-training, similar to SemMAE (NIPS2022) and EfficientSAM (CVPR2024). Our downstream training was based completely on real clinical data. The highest performance on the real clinical data, achieved by our method as shown in Table 1, demonstrates that the facilitated pre-training could mitigate overfitting in the downstream tasks. This finding is consistent with arXiv:1906.12340 (NIPS2019) and arXiv:2206.04664 (CVPR2023). [Extended Comparison] We compared with various methods, including MedNeXt, but did not include it because its DSC of 57.8 was lower than ours (64.7). Instead, we focused on comparisons with a broad range of SOTA self-supervised methods.

#R3, R4, R5 [Ablation studies] As suggested, we conducted 5-fold cross-validation ablation studies to investigate the effects of Synthetic Masking (Sec 2.1) and Frequency Branch (Sec 2.2). Compared with the baseline, Baseline + Sec 2.1 achieved +5.0 DSC and -19 HD; adding Sec 2.2 further improved the results by +1.8 DSC and -16 HD.
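For readers less familiar with the DSC figures quoted in this rebuttal, the Dice similarity coefficient can be sketched as follows. This is a generic illustration rather than the authors' evaluation code: the function name, the flat 0/1 mask representation, and the convention of returning 1.0 for two empty masks are assumptions. Values such as 64.7 in the rebuttal appear to be this quantity expressed as a percentage.

```python
def dice_score(prediction, reference):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    if len(prediction) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels marked as tumor in both masks.
    intersection = sum(1 for p, r in zip(prediction, reference) if p and r)
    # Total tumor voxels across both masks.
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A prediction that covers the one true tumor voxel plus one false positive, `dice_score([1, 1, 0, 0], [1, 0, 0, 0])`, scores 2·1/(2+1), about 0.667.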

#R4 [Justification of Frequency Branch] As suggested, we conducted ablation studies to investigate this (see the results mentioned above). As seen, the effectiveness of this branch was significant. [Structural Operations] As suggested, we analyzed the initialization of the radius for the expansion and erosion based on the number of voxel points (1-5). Fewer voxel points (1, 2) resulted in sparse bowel wall semantics, while too many (4, 5) caused excessive adhesion. Hence, we configured the radius to 3. This will be mentioned in the final version. [Minor Mistakes and Other Suggestions] As suggested, we will fix all the errors you pointed out, re-organize Experiments and Related Works, add the standard setup of Swin-UNETR, and professionally proofread the final version.
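One plausible reading of the expansion and erosion step is a morphological band around a coarse colon mask: dilate the mask, erode it, and keep the difference as an approximate wall region. The sketch below illustrates that reading only. It works on a 2D list-of-lists mask with a disk-shaped structuring element, and the names `dilate`, `erode`, and `wall_band` are hypothetical; the paper's actual 3D procedure may differ.

```python
def _disk_offsets(radius):
    """Offsets (dy, dx) that fall inside a disk of the given radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def dilate(mask, radius):
    """Binary dilation: grow the foreground by the disk."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if mask[y][x]:
                for dy, dx in _disk_offsets(radius):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols:
                        out[ny][nx] = 1
    return out


def erode(mask, radius):
    """Binary erosion: keep a pixel only if the whole disk around it is foreground.

    Pixels outside the image are treated as background.
    """
    rows, cols = len(mask), len(mask[0])
    offsets = _disk_offsets(radius)
    return [[int(all(0 <= y + dy < rows and 0 <= x + dx < cols
                     and mask[y + dy][x + dx]
                     for dy, dx in offsets))
             for x in range(cols)]
            for y in range(rows)]


def wall_band(mask, radius=3):
    """Pixels reached by dilation but removed by erosion (approximate wall region)."""
    grown, shrunk = dilate(mask, radius), erode(mask, radius)
    return [[int(g and not s) for g, s in zip(grown_row, shrunk_row)]
            for grown_row, shrunk_row in zip(grown, shrunk)]
```

Under this reading, a small radius yields a thin band that may miss parts of the wall, while a large radius thickens the band until neighboring bowel loops merge, which is consistent with the sparse-versus-adhesion trade-off described above.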

#R5 [Abbreviations] As suggested, we will define all abbreviations at their first appearance in the final version. [Diverse Validation Dataset] In addition to the novel technological development, this study’s clinical motivation is to tackle the segmentation of colon tumors in non-contrast CT (NCCT), a task few have explored. Current datasets are based on contrast CT, so we aim to establish a benchmark for tumor segmentation in NCCT. Our dataset of 110 cases is comparable in size to mainstream CT validation benchmarks (e.g. BTCV with 30, MSD-liver with 131, MSD-colon with 126). This unique NCCT validation benchmark is a valuable clinical contribution, and we plan to expand it further. [Clarification of Weight Transfer] Swin-UNETR includes a Swin-Transformer feature encoder and a U-net backbone. During pretraining, an extra frequency branch was added to the backbone. For finetuning, only the Swin-Transformer weights were transferred, and the backbone was reinitialized. This clarification will be included in the final version. [Transparency of Dataset Source] We had to anonymize the dataset source as ‘XXX’ because of the MICCAI anonymization policy. This dataset was specially collected for our study from our collaborating hospital because there was no public NCCT dataset for colon tumors. We will be able to reveal it in the final version. [Extended Comparisons] As suggested, we validated on SegResNet, in addition to our comparisons with a broad range of SOTA self-supervised methods. SegResNet achieved a DSC of 57.4, lower than ours (64.7).
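The weight transfer described above (keep the pretrained encoder, reinitialize everything else, drop the pretraining-only frequency branch) can be sketched with plain dictionaries standing in for framework state dicts. The function name and the `"encoder."` key prefix are hypothetical; the real parameter names depend on the Swin-UNETR implementation in use.

```python
def transfer_encoder_weights(pretrained_state, finetune_state, encoder_prefix="encoder."):
    """Copy only encoder parameters from a pretrained state into a fresh one.

    Both arguments are name-to-parameter mappings. Entries outside the encoder
    keep their freshly initialized values, and pretraining-only entries (for
    example a frequency branch) are ignored because they do not exist in the
    finetuning model.
    """
    merged = dict(finetune_state)
    transferred = []
    for name, value in pretrained_state.items():
        if name.startswith(encoder_prefix) and name in merged:
            merged[name] = value
            transferred.append(name)
    return merged, transferred


# Toy example with made-up parameter names and values.
pretrained = {
    "encoder.layer1.weight": [1.0, 1.0],
    "decoder.up1.weight": [2.0, 2.0],
    "freq_branch.head.weight": [3.0, 3.0],
}
fresh = {
    "encoder.layer1.weight": [0.0, 0.0],
    "decoder.up1.weight": [0.5, 0.5],
}
merged, transferred = transfer_encoder_weights(pretrained, fresh)
```

Here `merged` takes the encoder entry from `pretrained`, keeps the fresh decoder entry, and contains no frequency-branch key.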




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


