Abstract
Accurate cancer segmentation in PET-CT images is crucial for oncology, yet remains challenging due to lesion diversity, data scarcity, and modality heterogeneity. Existing methods often struggle to effectively fuse cross-modal information and leverage self-supervised learning for improved representation. In this paper, we introduce C²MAOT, a Cross-modal Complementary Masked Autoencoder with Optimal Transport framework for PET-CT cancer segmentation. Our method employs a novel modality-complementary masking strategy during pre-training to explicitly encourage cross-modal learning between PET and CT encoders. Furthermore, we integrate an optimal transport loss to guide the alignment of feature distributions across modalities, facilitating robust multi-modal fusion. Experimental results on two datasets demonstrate that C²MAOT outperforms existing state-of-the-art methods, achieving significant improvements in segmentation accuracy across five cancer types. These results establish our proposed method as an effective approach for tumor segmentation in PET-CT imaging. Our code is available at https://github.com/hjj194/c2maot.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2322_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/hjj194/c2maot
Link to the Dataset(s)
N/A
BibTex
@InProceedings{HuaJia_C2MAOT_MICCAI2025,
author = { Huang, Jiaju and Chen, Shaobin and Liang, Xinglong and Yang, Xiao and Zhang, Zhuoneng and Sun, Yue and Wang, Ying and Tan, Tao},
title = { { C²MAOT: Cross-modal Complementary Masked Autoencoder with Optimal Transport for Cancer Segmentation in PET-CT Images } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15960},
month = {September},
pages = {89--99}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces a novel self-supervised learning framework designed to extract features from two modalities, aiming to address the challenge of limited labeled data. Additionally, the method incorporates an Optimal Transport (OT) loss to reduce the distribution gap between modalities, facilitating more effective cross-modal feature alignment.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper combines an Optimal Transport (OT) loss and a masking strategy to pretrain encoders for PET and CT images in a self-supervised manner. It further evaluates the effectiveness of the pretrained model by comparing its performance against several state-of-the-art methods on downstream segmentation tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The combination of Optimal Transport (OT) loss and MCM demonstrates potential for improving segmentation performance in medical imaging tasks. However, the paper’s novelty is limited, as both techniques have been explored extensively in prior work and are already known to be effective across various domains. A more detailed analysis or ablation demonstrating their complementary roles would significantly enhance the impact of the work.
Regarding Contribution 2, the claim that the model reconstructs tumors by learning tumor representations and spatial patterns seems overstated, as the learning is still limited to the training dataset. There is no clear evidence that the model generalizes to new or unseen tumor structures beyond the distribution it was trained on, especially in the context of the proposed cross-modality (CM) strategy.
Additionally, the inconsistent font sizes in the figures should be corrected to improve overall presentation quality and readability.
The paper lacks a clear description of the segmentation task illustrated in Figure 1, part C. It would be helpful if the authors could provide more detailed explanations of the input/output structure, the role of the pretraining components in this stage, and how the segmentation decoder is integrated into the overall framework.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Please refer to section 7
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper is well executed, it lacks sufficient novelty. The core techniques, Optimal Transport loss and MCM, have both been widely studied in prior work, and the current combination does not introduce a clearly original methodological contribution or theoretical insight.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes a cross-modal complementary masked autoencoder integrated with an optimal transport framework for PET-CT cancer segmentation. Experimental results across two datasets demonstrate that the proposed method outperforms current state-of-the-art approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) This paper addresses several key challenges in PET-CT segmentation: (a) the limited availability of labeled PET-CT data and the high cost of training segmentation models, (b) the difficulty of achieving accurate delineation of tumor regions, and (c) the challenge of aligning heterogeneous features between PET and CT modalities. The proposed method, C²MAOT, tackles these issues by leveraging a self-supervised pre-training framework that incorporates a spatially-aware complementary masking strategy and an optimal transport (OT) loss to facilitate effective cross-modal representation learning.
(2) The motivation and contributions of the work are clearly articulated and well justified.
(3) The experimental evaluation, including the ablation study, is well-designed and provides strong empirical support for the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
(1) The paper does not specify the data split ratios used for dividing the dataset into training, validation, and testing sets.
(2) The paper lacks reporting of training and inference times. Providing these metrics is crucial for understanding the computational efficiency and potential clinical applicability of the proposed method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper is clearly written, well organized, and presents promising results that demonstrate the effectiveness of the proposed method.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed the concerns in the reviews.
Review #3
- Please describe the contribution of the paper
In this work the authors propose a self-supervised pretraining method for multimodal imaging data. The method is based on pre-training with a masking procedure that uses complementary patches between the two modalities and an optimal-transport-based loss for fusing multimodal features. The method is implemented and tested on PET/CT data for multiple cancer types. The comparison with the state of the art and the ablation study indicate the additive power of the proposed pre-training.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) Pre-training and reducing reliance on labeled data is crucial in this domain, given the challenges of annotating images with multiple and variable tumors.
(2) The main strength is that the proposed SSL pre-training method includes all the complementary components of SSL training: a pretext task that uses masked image modeling to exploit the multimodal imaging data, a loss function that implements the fusion, and an inference procedure that better employs the pre-trained backbone for segmentation.
(3) Furthermore, the study presents extensive evaluations across different cancer types, an ablation study, and visualizations with qualitative results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper provides some implementation details, such as training hyperparameters; however, key architectural aspects of the U-Net, such as the number of features per layer, as well as details on dataset splits, cross-validation techniques, and data specifications, are missing. Including these details would enhance reproducibility and improve the overall presentation of the work. The relatively low metrics for all methods on the two datasets may indicate that some hyperparameters could be better fine-tuned.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces an innovative self-supervised pre-training method that reduces reliance on labeled data in medical image segmentation, particularly for complex tumors. Its strengths include a comprehensive SSL framework incorporating masked image modeling, a tailored fusion loss, an effective inference procedure, and extensive evaluations across various cancer types with qualitative and ablation studies. However, additional details regarding implementation would enhance reproducibility and overall clarity. In this regard, I recommend an accept.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Overall, the authors have adequately addressed my main comments in their responses. I maintain my initial recommendation for acceptance, based on the reasons outlined prior to the rebuttal.
Author Feedback
To the Area Chair and Reviewers, we thank you for your constructive feedback. Below we address the key concerns:
Q1: Dataset Splitting & Training Parameters. (AC, R2, R3) A1: For the segmentation task, we employed 5-fold cross-validation, partitioning the data at the patient level with a fixed random seed of 42. Tables 1 and 2 present the mean results across the five folds. Training used SGD (lr=1e-4, momentum=0.99, weight decay=3e-5) with polynomial learning-rate decay and an equally weighted BCE + Dice loss.
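[Editor's note] As a concrete illustration of the training objective above, here is a minimal PyTorch sketch of an equally weighted BCE + Dice loss and the stated SGD configuration. The tensor shapes, the smoothing constant, and the polynomial decay power of 0.9 are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1e-5):
    # Equally weighted BCE + Dice over binary tumor masks.
    # logits, target: (B, 1, D, H, W); target is float-valued in {0, 1}.
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3, 4))
    denom = probs.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = (2.0 * inter + smooth) / (denom + smooth)
    return bce + (1.0 - dice).mean()

# Optimizer and schedule as stated in A1; the decay power is an assumption.
# model = ...  # segmentation network
# opt = torch.optim.SGD(model.parameters(), lr=1e-4,
#                       momentum=0.99, weight_decay=3e-5)
# sched = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=1000, power=0.9)
```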
Q2: Runtime Details. (R2) A2: Pre-training (500 epochs): ~100 hours; Fine-tuning (1000 epochs): ~21 hours/fold; Inference: ~30 seconds/case.
Q3: Reproducibility (R2) A3: We will release the code upon acceptance, as stated in our abstract.
Q4: U-Net Architecture. (R3) A4: The CT and PET encoders share the same 3D U-Net architecture, each with six stages of [32, 64, 128, 256, 320, 320] channels. Each stage has two 3x3x3 conv layers with InstanceNorm and LeakyReLU; downsampling is via strided convolution. A single symmetric decoder mirrors the encoder structure with upsampling. PET/CT features from the encoders are concatenated and then refined using a 1x1x1 conv with GroupNorm.
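[Editor's note] A minimal PyTorch sketch of one such encoder follows, assuming the stage/channel layout described in A4. The names (Encoder3D, CHANNELS, conv_block) are hypothetical; only the stage count, kernel sizes, normalization, activation, and strided downsampling come from the rebuttal.

```python
import torch
import torch.nn as nn

CHANNELS = [32, 64, 128, 256, 320, 320]  # six stages, per rebuttal A4

def conv_block(c_in, c_out, stride=1):
    # Two 3x3x3 convs with InstanceNorm + LeakyReLU; the first conv
    # optionally downsamples via stride, matching the rebuttal description.
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, 3, stride=stride, padding=1),
        nn.InstanceNorm3d(c_out), nn.LeakyReLU(inplace=True),
        nn.Conv3d(c_out, c_out, 3, padding=1),
        nn.InstanceNorm3d(c_out), nn.LeakyReLU(inplace=True),
    )

class Encoder3D(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        stages, prev = [], in_ch
        for i, ch in enumerate(CHANNELS):
            stages.append(conv_block(prev, ch, stride=1 if i == 0 else 2))
            prev = ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # per-stage features, usable as U-Net skips
        return feats
```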
Q5: On Metrics. (R3) A5: Current hyperparameters showed excellent performance on primary breast cancer. Metastatic tumors are inherently challenging (small, scattered, numerous), affecting scores for all methods. We will explore broader hyper-parameter ranges in future work.
Q6: Novelty & Complementary Roles of MCM + OT. (R1) A6: Novelty: (1) Unlike prior random masking (e.g., MAE3D), our proposed MCM creates complementary masks across PET-CT (when a PET patch is masked, the corresponding CT patch remains visible), forcing cross-modal learning by requiring the networks to use the visible modality to reconstruct masked regions, thereby leveraging metabolic-anatomical complementarity. Table 1 shows our method outperforms MAE3D by 3.74% in average DSC. (2) To emphasize deeper semantic alignment, we propose a layer-wise weighted OT loss that prioritizes alignment of deeper features. Fig. 3 shows our OT reduces inter-modality distance while increasing intra-modality dispersion (2.15→1.91). Complementary Roles: MCM operates at the input level (patch-to-patch) while OT works at the feature level (distribution-to-distribution); MCM promotes cross-modal learning while OT ensures feature-distribution alignment. Table 2 confirms this synergy: the combined approach (68.33% avg. DSC) substantially outperforms MCM-only (66.04%), OT-only (64.98%), and the baseline (63.54%). Notably, the combined improvement exceeds the sum of the individual contributions, demonstrating complementary roles in addressing the multi-faceted challenges of PET-CT tumor segmentation.
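[Editor's note] To make the two mechanisms concrete, below is a hedged PyTorch sketch of (a) complementary PET/CT patch masks and (b) a layer-wise weighted entropic OT (Sinkhorn) loss between encoder features. The Sinkhorn solver, the token subsampling, and all names are illustrative assumptions; the paper's exact OT formulation is not reproduced here.

```python
import math
import torch

def complementary_masks(n_patches, mask_ratio=0.5, device="cpu"):
    # MCM idea: whenever a PET patch is masked, the corresponding CT
    # patch stays visible (and vice versa), forcing cross-modal reconstruction.
    n_masked = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches, device=device)
    pet_mask = torch.zeros(n_patches, dtype=torch.bool, device=device)
    pet_mask[perm[:n_masked]] = True   # True = masked
    return pet_mask, ~pet_mask         # CT mask is the exact complement

def sinkhorn_ot(x, y, eps=0.1, n_iters=50):
    # Entropic-regularized OT cost between token sets x:(n,d) and y:(m,d),
    # solved in the log domain for numerical stability.
    cost = torch.cdist(x, y, p=2) ** 2
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n), device=x.device)
    log_nu = torch.full((m,), -math.log(m), device=x.device)
    u = torch.zeros(n, device=x.device)
    v = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        M = (-cost + u[:, None] + v[None, :]) / eps
        u = u + eps * (log_mu - torch.logsumexp(M, dim=1))
        M = (-cost + u[:, None] + v[None, :]) / eps
        v = v + eps * (log_nu - torch.logsumexp(M, dim=0))
    pi = torch.exp((-cost + u[:, None] + v[None, :]) / eps)
    return (pi * cost).sum()

def layerwise_ot_loss(feats_ct, feats_pet, weights, n_tokens=256):
    # Weighted sum over encoder stages; larger weights on deeper stages
    # prioritize semantic alignment, as described in A6. Each feature map
    # is assumed per-sample with shape (C, D, H, W).
    loss = 0.0
    for w, f_ct, f_pet in zip(weights, feats_ct, feats_pet):
        x = f_ct.flatten(1).T          # (voxels, C) token cloud
        y = f_pet.flatten(1).T
        ix = torch.randperm(x.shape[0], device=x.device)[:n_tokens]
        iy = torch.randperm(y.shape[0], device=y.device)[:n_tokens]
        loss = loss + w * sinkhorn_ot(x[ix], y[iy])   # subsampled for tractability
    return loss
```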
Q7: Claim of tumor generalization. (R1) A7: Our intent is to demonstrate transferability within the PET-CT domain, not universal tumor generalization. As detailed in Section 3.1 and Table 2, we pre-trained on 50% of the AutoPET III data (lung cancer, lymphoma, melanoma) without exposure to breast cancer. After fine-tuning, the MCM variant raises metastatic breast cancer DSC from 53.42% to 57.13%, showing that the learned representation transfers effectively to anatomically distinct tumors within the same imaging domain. The MCM specifically enhances this capability by promoting cross-modal learning. We will tone down our claim in the manuscript to avoid overstating generalization.
Q8: Clarity of Fig. 1, Part C. (R1) A8: The inputs are paired PET-CT volumes; the output is a tumor segmentation mask. The dual-encoder architecture (E_CT and E_PET in Fig. 1c) is identical to that used in pre-training (Fig. 1a). The pre-trained weights initialize the encoders, which are fine-tuned during training. The decoder ("D" in Fig. 1c) mirrors the encoder but is trained from scratch. Features from both encoders are concatenated before being fed into the decoder, leveraging the cross-modal representations while adapting them to the tumor segmentation task.
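[Editor's note] A hypothetical sketch of the fine-tuning forward pass follows, reusing the Encoder3D and CHANNELS names from the architecture sketch above. The weight-file paths, the GroupNorm group count, the skip handling, and the decoder interface are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

enc_ct, enc_pet = Encoder3D(in_ch=1), Encoder3D(in_ch=1)
# Initialize both encoders from pre-training (file names are placeholders).
enc_ct.load_state_dict(torch.load("ct_encoder_pretrained.pt"))
enc_pet.load_state_dict(torch.load("pet_encoder_pretrained.pt"))

# 1x1x1 conv + GroupNorm fusion of the concatenated bottleneck features (A4).
fuse = nn.Sequential(
    nn.Conv3d(2 * CHANNELS[-1], CHANNELS[-1], kernel_size=1),
    nn.GroupNorm(8, CHANNELS[-1]),
)

def segment(ct, pet, decoder):
    # Each encoder processes its own modality; features are concatenated
    # before decoding, per A8. The decoder is trained from scratch.
    f_ct, f_pet = enc_ct(ct), enc_pet(pet)
    fused = fuse(torch.cat([f_ct[-1], f_pet[-1]], dim=1))
    skips = [torch.cat([c, p], dim=1) for c, p in zip(f_ct[:-1], f_pet[:-1])]
    return decoder(fused, skips)  # logits for the tumor mask
```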
Q9: Figure Fonts. (R1) A9: Figure font sizes will be made uniform.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The main concerns relate to missing details regarding the data splitting and training parameters.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The rebuttal addresses the reviewers' issues to some extent, but the answers to some of the questions raised are not convincing. The results are only marginally improved.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work presents sufficient technical contributions and meets the bar for MICCAI.