Abstract

Positron emission tomography (PET) combined with computed tomography (CT) imaging is routinely used in cancer diagnosis and prognosis by providing complementary information. Automatically segmenting tumors in PET/CT images can significantly improve examination efficiency. Traditional multi-modal segmentation solutions mainly rely on concatenation operations for modality fusion, which fail to effectively model the non-linear dependencies between PET and CT modalities. Recent studies have investigated various approaches to optimize the fusion of modality-specific features for enhancing joint representations. However, modality-specific encoders used in these methods operate independently, inadequately leveraging the synergistic relationships inherent in PET and CT modalities, for example, the complementarity between semantics and structure. To address these issues, we propose a Hierarchical Adaptive Interaction and Weighting Network termed H2ASeg to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we design a Modality-Cooperative Spatial Attention (MCSA) module that performs intra- and inter-modal interactions globally and locally. Additionally, a Target-Aware Modality Weighting (TAMW) module is developed to highlight tumor-related features within multi-modal features, thereby refining tumor segmentation. By embedding these modules across different layers, H2ASeg can hierarchically model cross-modal correlations, enabling a nuanced understanding of both semantic and structural tumor features. Extensive experiments demonstrate the superiority of H2ASeg, outperforming state-of-the-art methods on AutoPet-II and Hecktor2022 benchmarks. The code is released at https://github.com/JinPLu/H2ASeg.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0500_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0500_supp.pdf

Link to the Code Repository

https://github.com/JinPLu/H2ASeg

Link to the Dataset(s)

https://autopet-ii.grand-challenge.org/dataset/

https://hecktor.grand-challenge.org/Data/

BibTex

@InProceedings{Lu_H2ASeg_MICCAI2024,
        author = { Lu, Jinpeng and Chen, Jingyun and Cai, Linghan and Jiang, Songhan and Zhang, Yongbing},
        title = { { H2ASeg: Hierarchical Adaptive Interaction and Weighting Network for Tumor Segmentation in PET/CT Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. The introduction of H2ASeg, a novel deep learning architecture designed for precise tumor segmentation in PET/CT images by hierarchically modeling the correlations between PET and CT modalities to exploit their complementary information.

    2. The development of a Modality-Cooperative Spatial Attention (MCSA) module that facilitates both global and local interactions between modalities to enhance the transfer of valuable information across PET and CT.

    3. The proposal of a Target-Aware Modality Weighting (TAMW) module that identifies and emphasizes tumor-related features within multi-modal features, refining the tumor segmentation process.

    4. Extensive experimental validation demonstrating the superiority of H2ASeg over state-of-the-art methods on AutoPet-II and Hecktor2022 benchmarks, showcasing its effectiveness in improving segmentation accuracy and efficiency.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel Formulation of Hierarchical Interaction: The paper introduces a novel hierarchical approach to model the cross-modal correlations between PET and CT images. This is achieved through the Hierarchical Adaptive Interaction and Weighting Network (H2ASeg), which is designed to understand the nuanced relationships between semantic and structural features of tumors. The hierarchical nature of the model allows for a more refined and accurate segmentation, which is crucial in clinical settings.

    Original Use of Data through MCSA and TAMW Modules: The paper presents two original components—the Modality-Cooperative Spatial Attention (MCSA) and Target-Aware Modality Weighting (TAMW) modules. The MCSA module is unique in its ability to perform both global and local feature interactions, capturing long-range dependencies and detailed information effectively. The TAMW module, on the other hand, is designed to adaptively weight features based on their relevance to the tumor, which is a novel way of emphasizing tumor-related features for improved segmentation.

    Innovative Feature Interaction Mechanism: The Bi-Directional Spatial Attention (BDSA) mechanism within the MCSA module is an innovative approach to feature interaction. It uses a combination of self-attention and cross-attention to facilitate the exchange of information between PET and CT modalities.
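
    For intuition, a minimal sketch of what such bi-directional cross-attention between PET and CT tokens could look like in PyTorch (illustrative only; the class and variable names are hypothetical, and this is not the paper's BDSA implementation, which also includes intra-modal self-attention and window partitioning):

        import torch
        import torch.nn as nn

        class BiDirectionalCrossAttention(nn.Module):
            # Each modality queries the other, so complementary information flows both ways.
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.pet_to_ct = nn.MultiheadAttention(dim, num_heads, batch_first=True)
                self.ct_to_pet = nn.MultiheadAttention(dim, num_heads, batch_first=True)

            def forward(self, pet_tokens, ct_tokens):
                # CT queries attend to PET keys/values, and vice versa.
                ct_enh, _ = self.pet_to_ct(ct_tokens, pet_tokens, pet_tokens)
                pet_enh, _ = self.ct_to_pet(pet_tokens, ct_tokens, ct_tokens)
                return pet_tokens + pet_enh, ct_tokens + ct_enh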

    Excellent experimental results: The article compares against various SOTA methods and achieves excellent experimental results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Model Complexity: The hierarchical design of the H2ASeg model, augmented with innovative components like MCSA and TAMW, has the potential to lengthen training durations and necessitate greater computational expenditure compared to more conventional network architectures. This increased complexity might impede its practical deployment in resource-constrained settings.

    Insufficiency of Implementation Details: The absence of comprehensive implementation specifics and open-source code undermines the ability to decisively evaluate the H2ASeg model’s superiority over existing methods. This lack of detailed documentation could significantly challenge the scientific community’s efforts to replicate and validate the reported outcomes, raising concerns about the reproducibility and accessibility of this approach.

    Underrepresentation of CT and PET Distinctions: The interplay between CT and PET scans is crucial for nuanced tumor identification; however, the experimental findings presented fail to adequately highlight the complementary roles of these imaging techniques. The article’s lack of detailed analysis on this front diminishes the potential for a comprehensive understanding of how the H2ASeg model leverages these distinct imaging modalities for improved diagnostic accuracy.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • “each window of the inputs is pooled into a token by a convolutional layer with the kernel size and stride equal to the window size” What does it mean?

    • Show the time required for training and the memory size.

    • Displaying the training time, FLOPs, and the amount of GPU memory occupied can help users understand the training cost.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of network implementation details and of a discussion of the complementarity of PET and CT.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a novel deep-learning framework called H2ASeg for PET/CT lesion segmentation, which improves performance by utilizing the hierarchical interaction of multi-modal (PET+CT) information, making the method more efficient than traditional modality fusion techniques such as channel concatenation. In particular, the paper incorporates a Modality-Cooperative Spatial Attention (MCSA) module in the encoding path to implement modality interaction at global and local scales via inter- and intra-window Bi-Directional Spatial Attention (BDSA). This enhances the complementary information from PET and CT relevant for accurate tumor localization. Additionally, another component called Target-Aware Modality Weighting (TAMW) is used in the decoding path to focus on target features and obtain optimal tumor boundaries. The experiments were carried out on two publicly available PET/CT datasets, AutoPET-II and HECKTOR 2022.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (a) The paper introduces a new way of fusing the PET and CT modalities that enables efficient learning of joint representations from both PET and CT images at global and local scales. This is a novel and important contribution of the paper, resembling a radiologist’s way of reading images. (b) In particular, the inter- and intra-window bidirectional attention within MCSA enhances long-range and local representations respectively, improving segmentation performance over other state-of-the-art methods. (c) TAMW highlights the multi-modal features from tumors by learning to weigh foreground and background features differently. This helps (as the authors claim) to further improve performance. (d) Strong benchmarking and comparison to supervised-learning-based networks (CNN-based, Transformer-based, multi-modal fusion-based, etc.). (e) Ablation over the presence/absence of the MCSA and TAMW modules further elucidates the important roles they play in segmentation. Along with the foreground emphasis maps obtained from TAMW, this work, in a way, also sheds light on the potential explainability of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (a) I suspect that the use of additional supervision signals (deep supervision) at different levels of the network (and the reformulated loss function with different weighting at different levels) was also (in part) responsible for the performance improvement. Were the other networks also trained using the same loss function? This is not clear from the paper. (b) Although the work utilizes multiple segmentation-based evaluation metrics and obtains decently high scores on them, these might not always reflect the clinical applicability of these methods, as emphasized in some existing works on lesion segmentation like Liu, Z., et al. (2023) [https://doi.org/10.1117/12.2647894], Jha, A., et al. (2012) [10.1088/0031-9155/57/13/4425], Ahamed, S., et al. (2023) [https://arxiv.org/pdf/2311.09614.pdf]. (c) It is not very clear to me how the values in Table 3 are computed. Are these the same as W^{k}_{fore} and W^{k}_{back}? Moreover, the authors emphasize that their method tends to highlight CT features in shallow layers and PET features in deeper layers for the foreground but don’t explain the same for the background (which doesn’t seem to follow the same trend as the foreground). How will you explain this discrepancy?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No code has been shared, although the paper seems to have sufficient details for potential reproducibility. The model was developed on public datasets, although the specific details on the cases included in the training, validation, and test phases have not been shared (which might limit the scope for reproducibility slightly).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (a) Please discuss whether the same loss function was used for the other networks too. If not, how would you isolate the contribution to performance coming solely from the choice of loss function? (b) All the methods were run 4 times to avoid “randomness”. Were they run on the same train/valid/test split of the data each time? If not, was the test set at least fixed over the different runs? Please explain this. (c) Are the reported standard deviation (std) values for the metrics in Table 1 computed over the 4 runs (std of the mean over 4 runs) or over all the cases within the test set (std of metric values over N cases, where N is the number of cases in the test set)? Please explain. (d) The paper would benefit from a better explanation of Fig. 1; for example, explain the use of darker and lighter shades of the different colors used for the blocks in the schematics for MCSA and TAMW. (e) As pointed out in the weakness section above, how do you plan to expand this work by evaluating the clinical importance of these results (use of task-based metrics, additional evaluation from radiologists, etc.)? Additionally, discuss the future scope of this work in the Conclusion section with relevant citations.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The newly proposed method for modality fusion demonstrates novelty and robustness, closely mirroring the analytical approaches utilized by radiologists in clinical image interpretation. This work represents a notable advancement in the domains of feature fusion and multimodal joint representation learning.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present a method for tumor segmentation from PET/CT images. Instead of fusing the different modalities as separate channels or having modality-specific features, the authors fuse the different modalities using cross-modality attention mechanisms on a dual-encoder network structure. Further, the authors use the ground truth to guide the decoder of their network at all the resolution stages. The proposed method, termed H2ASeg is shown to outperform SOTA methods on the AutoPet-II and Hecktor2022 benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Interesting network structure.
    • The use of the ground truth (GT) in all levels of the decoder to guide segmentation is interesting.
    • Ablation studies show the relevance of the cross-attention module, as well as the GT-guided decoding.
    • Great performance on the two presented benchmarks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The weighting of the cost function at different levels is not well justified. Further insight would be interesting.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    • It would be great to have an open git repository with the code.
    • Experimentation is done on public benchmarks.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • The authors claim “Moreover, MCSA and TAMW are flexible enough to be embedded into existing architectures with higher efficiency to excavate intrinsic correlations between modalities.” It would be great to see this affirmed in other use cases.
    • The PET images of Fig. 3 (Hecktor database) are very interpolated and hard to interpret. Nearest-neighbor interpolation might be better.
    • The authors claim that “On AutoPET-II, due to the limited global modeling ability, CNN-based methods like UNet-3D, ResUNet-3D are easily interfered with by the high metabolic areas”. I disagree with that statement, as UNets are great at obtaining global models and long-range interactions.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a good paper with a solid experimentation section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their valuable suggestions, which inspire us to improve our work!

Reviewer4, Weakness 1: Our work aims to design a more efficient multi-modal fusion strategy than conventional models. As shown in Table 1, conventional models such as UNet and ResUNet achieve poor segmentation performance, and nnUNet, as their extension, requires far more parameters (see its GitHub repository for details) and a longer training time (due to additional pre- and post-processing, it takes about a week to complete training) to reach segmentation performance close to our model. We therefore believe that our proposed multi-modal fusion strategy has better clinical deployment prospects than conventional architectures.

Reviewer4, Weakness 2 & Comment 1: We will open-source the code after the paper is officially published. Regarding your questions about Win.Pool, here are more details: in the paper we use nn.AvgPool3d(tuple(window_size)), so each non-overlapping window is pooled into a single token.
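
For concreteness, a minimal sketch of this window pooling, assuming standard PyTorch defaults (the window size and tensor shapes below are illustrative, not the exact configuration used in the paper):

    import torch
    import torch.nn as nn

    window_size = (4, 4, 4)
    # Stride defaults to the kernel size, so the windows do not overlap.
    pool = nn.AvgPool3d(tuple(window_size))

    x = torch.randn(1, 64, 32, 32, 16)   # (B, C, D, H, W) feature map
    tokens = pool(x)                      # (1, 64, 8, 8, 4): one token per window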

Reviewer4, Weakness 3: As mentioned in the last paragraph of the introduction, the process by which a radiologist segments a tumor can be divided into three steps: comparing PET/CT, locating the tumor, and outlining the tumor. For the first step, we designed MCSA to promote information interaction between the two modalities, and its effectiveness is verified in Figure 4. For the remaining steps, we developed TAMW based on the characteristics of PET/CT and our understanding of the U-Net. The deep layers, with their larger receptive fields, may focus on localizing the tumor, which is what PET is good at, while the shallow layers, with higher image resolution, may focus on the tumor contour, which is what CT is good at. Does adaptively weighting the contributions of PET/CT at different layers improve segmentation? Table 2 answers this question, and Table 3 supports our motivation.
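
To make the idea of adaptive modality weighting concrete, here is a conceptual sketch (the module structure and all names are hypothetical illustrations of the idea, not the actual TAMW implementation):

    import torch
    import torch.nn as nn

    class ModalityWeighting(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Predict one weight per modality (PET, CT) from pooled global context.
            self.fc = nn.Sequential(nn.Linear(2 * channels, 2), nn.Softmax(dim=-1))

        def forward(self, pet_feat, ct_feat):
            # pet_feat, ct_feat: (B, C, D, H, W) features at a given decoder layer.
            ctx = torch.cat([pet_feat.mean(dim=(2, 3, 4)),
                             ct_feat.mean(dim=(2, 3, 4))], dim=1)
            w = self.fc(ctx)                          # (B, 2) modality weights
            w_pet = w[:, 0].view(-1, 1, 1, 1, 1)
            w_ct = w[:, 1].view(-1, 1, 1, 1, 1)
            return w_pet * pet_feat + w_ct * ct_feat  # adaptively fused features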

Reviewer4, Comment 1: The training time of H2ASeg ranges from one day (CPU idle) to three days (CPU busy), with no significant difference from the other models except nnUNet. When the input shape is (4, 2, 128, 128, 64), the GPU memory usage exceeds 21,000 MiB.

Reviewer5, Weakness 1 & Comment 1: Firstly, in our experiments we ensure that the loss function of all models is BCE loss + Dice loss. Secondly, in Section 3.2 we added our modules to a baseline that also uses deep supervision to obtain the results in Table 2; this controlled comparison may answer your question. Finally, we do not force all networks to adopt deep supervision but maintain the original configurations from their papers, because some models are imported from libraries and some have many outputs (A2FSeg has 16 outputs). For the models that use deep supervision, we set the same loss weights.
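
For reference, a minimal sketch of a combined BCE + Dice loss with deep-supervision weights (the weight values, resampling choice, and function names are illustrative assumptions, not the exact configuration used in the paper):

    import torch
    import torch.nn.functional as F

    def dice_loss(logits, target, eps=1e-6):
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum()
        return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

    def deep_supervision_loss(outputs, target, weights=(1.0, 0.5, 0.25, 0.125)):
        # outputs: list of logits from fine-to-coarse decoder levels; target: full-size mask.
        loss = 0.0
        for logit, w in zip(outputs, weights):
            tgt = F.interpolate(target, size=logit.shape[2:], mode="nearest")
            loss += w * (F.binary_cross_entropy_with_logits(logit, tgt) + dice_loss(logit, tgt))
        return loss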

Reviewer5, Comment 2: We first divided off the test set and kept it fixed in subsequent experiments, then repeated the experiments on the remaining data.

Reviewer5, Weakness 3 & Comment 3: We collected the TAMW weights on the fixed test set. For Table 3, as noted in our response to Reviewer4, Weakness 3, the effect of TAMW mainly concerns the attended targets, so we only analyzed the weights from the foreground emphasis. In addition, the weights from the background emphasis often have a large standard deviation, which leads to weak interpretability.

Reviewer5, Comment 4: What a great question! We try to represent the focus of the model by using darker and lighter shades. In MCSA, after the inter-window attention, with PET’s localization ability and feature interaction, both the PET and CT features can detect the regions where tumors may appear, so we darken the color of the bars in these regions. Through intra-window attention, with CT’s structural information, the model can determine whether these regions are tumors, so we set the bars in incorrect regions to a lighter color. The strategy in TAMW is the same.

Reviewer6, Comment 3: Here are the papers we refer to: ViT, nnFormer, UNETR, UNETR++, NestedFormer.




Meta-Review

Meta-review not available, early accepted paper.


