Abstract

The precise prediction of Pathological Complete Response (pCR) following Neoadjuvant Chemo-ImmunoTherapy (NCIT) in Head and Neck Squamous Cell Carcinoma (HNSCC) is crucial for optimizing therapeutic strategies and prognostic evaluation. Current methods exhibit limitations in simultaneously modeling multi-temporal treatment dynamics, multi-sequence magnetic resonance imaging (MRI) correlations, and multi-modal feature interactions. To address this challenge, we present a novel multi-modal representation and fusion framework, HARM3-Fusion, which processes multi-temporal, multi-sequence MRI data and hierarchically fuses it with whole slide image (WSI) data to enhance the accuracy of pCR prediction. Specifically, our method comprises three key modules: a multi-temporal module based on a Loss-enhanced Dual-stream Convolutional Variational Auto-Encoder (LD-VAE), designed to decouple features from pre-treatment and post-treatment MRI scans; a multi-sequence module based on self-attention for integrating MRI features from T1- and T2-weighted sequences; and a multi-modal module based on cross-attention to fuse complementary information between MRI and WSI. To evaluate the efficacy of HARM3-Fusion, we establish HNSCC-pCR, the first multi-modal dataset for HNSCC. The HNSCC-pCR dataset comprises 407 patients, with each case including pre-treatment and post-treatment T1-weighted and T2-weighted MRI scans, a WSI of the pre-treatment biopsy specimen, and the pathologically confirmed pCR outcome from surgical specimens. Based on this dataset, experimental results demonstrate that HARM3-Fusion achieves superior performance for pCR prediction compared to other single-modal and multi-modal approaches.
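As a concrete illustration of the two-level fusion described above (self-attention across MRI sequence features, then cross-attention from the fused MRI representation to WSI features), the following is a minimal PyTorch-style sketch. It is not the authors' released implementation: the class name, feature dimension, token layout, and classification head are illustrative assumptions; the actual code is in the linked repository.

import torch
import torch.nn as nn

class HierarchicalFusionSketch(nn.Module):
    """Illustrative hierarchy: self-attention over MRI sequence tokens,
    then cross-attention from fused MRI tokens to WSI patch tokens."""
    def __init__(self, dim=512, heads=8, num_classes=2):
        super().__init__()
        self.seq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, mri_tokens, wsi_tokens):
        # mri_tokens: (B, n_seq, dim), e.g. temporally decoupled T1/T2 features
        # wsi_tokens: (B, n_patch, dim), pre-treatment biopsy WSI patch embeddings
        mri_fused, _ = self.seq_attn(mri_tokens, mri_tokens, mri_tokens)
        fused, _ = self.cross_attn(mri_fused, wsi_tokens, wsi_tokens)
        return self.head(fused.mean(dim=1))  # pCR vs. non-pCR logits

# Example usage with random tensors (4 MRI tokens, 100 WSI patch tokens):
# model = HierarchicalFusionSketch()
# logits = model(torch.randn(2, 4, 512), torch.randn(2, 100, 512))  # -> (2, 2)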

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0572_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Jianye-Wang-WJY/HARM3-Fusion

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanJia_HARM3Fusion_MICCAI2025,
        author = { Wang, Jianye and Liu, Xinyue and Gong, Zhiying and Yang, Lingjie and Zhang, Hanwen and Long, Yu and Fan, Yimeng and Jiang, Yuncheng and Duan, Xiaohui and Zhao, Weibing},
        title = { { HARM3-Fusion: Hierarchical Attentional Representation Learning of Multi-Modal, Multi-Temporal, and Multi-Sequence Fusion for Pathological Complete Response Prediction of Head and Neck Squamous Cell Carcinoma } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {245--254}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a novel multi-modal representation and fusion framework, which processes multi-temporal, multi-sequence MRI data and hierarchically fuses it with whole slide image (WSI) to enhance the accuracy of pCR prediction.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The practical part of the paper is very well done; I like the scope of the experiments and the description of the implementation details. The design of the method itself is imaginative, although there are already many combinations of different methods, including transformers. On the other hand, dealing with multi-modal data is very rewarding and not difficult. The publication of the code, which the authors promise in the abstract, will help a lot. They also show the interpretability of the presented method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I understand the trend, but there is so much fusion and complexity in the method that the text does not convince me of its robustness. On the other hand, the 407-sample dataset is one of the larger ones, and the results obtained are significantly better in almost all metrics than the SOTA methods. The article therefore has much to offer its audience.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I positively evaluate the practical part of the paper. However, I lack any evidence of the ability to generalize, which I consider important in such a combined framework.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces HARM3-Fusion, a hierarchical attention-based multi-modal fusion framework for pathological complete response (pCR) prediction in head and neck squamous cell carcinoma (HNSCC). The model decouples multi-temporal MRI features using a loss-enhanced dual-stream variational autoencoder (LD-VAE), applies self-attention for multi-sequence (T1, T2) fusion, and cross-attention for MRI-WSI multi-modal fusion. A new multi-modal dataset (407 patients) is curated, and experiments demonstrate significant performance improvements over unimodal and multimodal baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    i) The paper introduces a novel integration of multi-temporal, multi-sequence, and multi-modal information for HNSCC pCR prediction.

    ii) The author proposes an innovative LD-VAE design that explicitly attempts to disentangle temporal changes in MRI data.

    iii) The hierarchical attention presented in the paper combines self-attention and cross-attention and is well motivated and technically sound.

    iv) The authors present a comprehensive ablation study that supports the importance of each module.

    v) The newly created dataset presented in the paper (407 patients) is valuable and practical for clinical AI research.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    i) While comparisons are made with reasonable baselines (3D-RPNET, M2Fusion, HMCAT), more recent or competitive multi-modal methods (e.g., newer transformer-based fusion approaches) could have been included for a more convincing evaluation. 

    ii) The study uses a single-centre dataset. There is no external validation to demonstrate generalizability across institutions, scanners, or populations. This limits clinical applicability claims.

    iii) Although HARM3-Fusion outperforms prior methods, the improvement margins (especially AUC improvements ~5-9%) are moderate. It is not clear if the added model complexity justifies the gain in practical settings.

    iv) The ablation study focuses on attention modules but could have been strengthened by including experiments isolating the contributions of T1 vs T2 sequences separately, or WSI vs MRI individually, under controlled settings.

    v) The paper states that the code will be released “soon,” but no concrete plan for the dataset release is mentioned. Open accessibility could greatly enhance the paper’s reproducibility and impact.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    i) How sensitive is the LD-VAE to hyperparameters such as the weighting of the contrastive loss term (λ)? Was this tuned on a held-out validation set?

    ii) Was class imbalance (pCR vs non-pCR) explicitly addressed during training (e.g., loss weighting, sampling)? Given the imbalance (~30% pCR rate), this could impact sensitivity-specificity trade-offs.

    iii) Can the method handle missing modalities (e.g., if a patient lacks T2 or WSI data)? Some robustness analysis would strengthen practical viability.

    The paper presents technically sound and moderately novel ideas. However, the lack of external validation, relatively modest performance improvements over strong baselines, and limited comparative evaluation are notable limitations. Nevertheless, the problem is important, the method is thoughtfully designed, and the dataset contribution is valuable. Strengthening the validation or expanding the comparative analysis would significantly enhance the impact of the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces HARM^3-Fusion, a hierarchical attention-based multi-modal framework for predicting pathological complete response (pCR) in head and neck squamous cell carcinoma (HNSCC). The key contributions of this work include:

    1. Temporal Feature Decoupling via LD-VAE: The authors design a Loss-Enhanced Dual-stream Variational Autoencoder to effectively disentangle latent feature variations between pre-treatment and post-treatment MRI scans.
    2. Hierarchical Multi-Sequence & Multi-Modal Fusion: They propose a hierarchical fusion mechanism that employs self-attention to integrate multi-sequence MRI features (T1-weighted and T2-weighted scans) and cross-attention to fuse MRI features with WSI features.
    3. New HNSCC-pCR dataset: The paper contributes the largest HNSCC pCR prediction dataset to date, consisting of over 400 patients, each with paired pre- and post-NACT MRI scans, corresponding WSI data, and pCR outcome labels. This multi-modal dataset (407 patients) provides a valuable resource to standardize evaluation in this domain and is used to validate the proposed method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Addresses a critical problem: This work tackles an important clinical challenge - predicting treatment response in HNSCC by leveraging multi-modal data. Integrating longitudinal MRI changes with histopathology information is highly relevant, given prior evidence that patients with pCR exhibit markedly different pre/post imaging appearances compared to non-responders.
    2. The proposed method demonstrates state-of-the-art results on the new HNSCC dataset. It outperforms a range of baselines, including both unimodal MRI models and other multi-modal fusion approaches.
    3. The proposed dataset is well-constructed and can facilitate further research in this community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The proposed solution, while effective, is built upon known building blocks (variational autoencoders for feature disentanglement and transformer-style attention for feature fusion). Thus, the novelty lies more in the particular combination and application to a new multi-modal prediction task, rather than in fundamentally new machine learning techniques. The contribution is somewhat incremental.
    2. The model architecture is fairly complex, comprising multiple stages. Training such a model on a dataset of only a few hundred patients could risk overfitting, especially given the class imbalance (122 pCR vs 285 non-pCR). There is no discussion of whether the performance might degrade on new unseen data or smaller datasets.
    3. There are a few minor inconsistencies in the paper. For example, the number of patients is stated as 470 in the introduction but is in fact 407. Also, in the abstract and introduction, the paper refers to “WSI after NACT”, while the dataset description says “WSI before NACT”. This contradiction needs clarification, as it affects the interpretation of the multi-modal setup (a post-NACT surgical WSI and a pre-therapy biopsy WSI are very different scenarios).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this problem formulation is valuable and the proposed dataset can facilitate further research in this field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns well. I recommend accepting this paper.




Author Feedback

We thank all reviewers for their constructive suggestions, and particularly appreciate their commendation of our model’s practical value (R1, R3), methodological design (R1, R2), and experimental design and comprehensive ablation analyses (R1, R2). We will carefully revise the manuscript in response to all comments. Our responses are as follows:

Q1: Novelty and Effectiveness (@R2, @R3): (1) Task innovation: this study is the first to propose the clinical task of predicting pCR in HNSCC following NACT. We collected the largest comprehensive multi-modal HNSCC-pCR dataset and developed HARM³-Fusion, the first predictive model for pCR in HNSCC. Early prediction of pCR can effectively guide the formulation of function-preserving surgical strategies, mitigating excessive resection of critical structures (e.g., larynx, tongue). Therefore, precise pCR prediction is essential. (2) Technical innovation: as pCR prediction is multifactorial and far more complex than cancer classification, simple module combinations cannot extract pCR-specific features from MRI and WSI. We therefore designed LD-VAE, employing cross-reconstruction and contrastive losses to temporally decouple MRI data. Inspired by clinical diagnostic practice, we fuse multi-sequence MRI and multi-modal MRI + WSI data via a hierarchical attention mechanism that more effectively captures MRI dynamics and MRI–WSI interactions, representing a notable innovation. (3) Performance improvement: HARM³-Fusion increases AUC by 5.22% and ACC by 7.93% over existing SOTA methods, representing significant improvements with important clinical implications.

Q2: Model Complexity and Robustness (@R1, @R3): Our model is lightweight. Multiple modules were designed to extract latent features from each patient’s four MRI scans and ultra-high-resolution WSI, yet in terms of parameters and computational complexity, our LD-VAE and hierarchical attention modules comprise only 1.62 M and 4.33 M parameters, respectively, fewer than the contrastive-learning-based multi-modal SOTA method M2Fusion [20]. The model requires only 13 GB of GPU memory and 40 minutes of training on an RTX 4090, while maintaining strong robustness.

Q3: Multi-Center Dataset and Generalization (@R1, @R2): Collecting and standardizing a clinical dataset is particularly challenging. Our original dataset comprises WSIs acquired from KF-PRO-020-HI and Pannoramic 250 Flash III scanners and MRIs obtained from an Ingenia 3.0 T system, covering individuals of diverse ages and sexes. In addition, we collected multi-modal data from another center (67 cases) and tested the pretrained model on this unseen dataset; its ACC and AUC for pCR prediction exceeded those of comparative models, demonstrating generalizability across institutions, scanners, and patient populations.

Q4: Experimental Details (@R2): (1) Comparative experiments: multi-modal models for pCR prediction are scarce, and the latest Transformer-style fusion method, HMCAT, underperforms HARM³-Fusion (Tab. 1). (2) Ablation study: we evaluated the MRI T1 and T2 sequences separately, with averaged results reported under “LD-VAE = √” in line 4 of Table 2. Existing WSI-centric methods (CLAM/TransMIL) underperform our multi-modal method for pCR prediction in Tab. 1, due to the absence of dynamic priors captured from pre-/post-treatment MRI. (3) Implementation details: the hyper-parameters of LD-VAE are optimized via five-fold cross-validation to prevent overfitting. The model achieves optimal performance at λ = 0.5 while remaining robust for λ ∈ [0.3, 0.7], indicating moderate sensitivity to parameter settings. We employ data augmentation to oversample minority-class samples and address class imbalance. Handling missing modalities will be explored in future work.

Q5: Data Description and Dataset Release (@R2, @R3): We confirm that our dataset comprises 407 patients and that all WSIs were collected pre-treatment. Upon publication, researchers may apply for access to the dataset by contacting us via email.
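To make the rebuttal’s description of the LD-VAE objectives more concrete (per-stream reconstruction, KL regularization, cross-reconstruction between the pre- and post-treatment streams, and a contrastive term weighted by λ), the sketch below shows one possible way such a combined loss could be assembled in PyTorch. The exact formulation, the form of the contrastive term, and all function and argument names are assumptions for illustration only, not the authors’ released code.

import torch
import torch.nn.functional as F

def kl_term(mu, logvar):
    # Standard VAE KL divergence to a unit Gaussian prior.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def ld_vae_style_loss(x_pre, x_post, recon, cross_recon, latents, lam=0.5):
    """recon / cross_recon: dicts with 'pre' and 'post' reconstructions;
    latents: dict with 'mu_pre', 'logvar_pre', 'mu_post', 'logvar_post', 'z_pre', 'z_post'."""
    # Each stream reconstructs its own time point.
    rec = F.mse_loss(recon["pre"], x_pre) + F.mse_loss(recon["post"], x_post)
    # Cross-reconstruction: each stream's decoder reconstructs the other time point.
    cross = F.mse_loss(cross_recon["pre"], x_pre) + F.mse_loss(cross_recon["post"], x_post)
    kl = kl_term(latents["mu_pre"], latents["logvar_pre"]) \
       + kl_term(latents["mu_post"], latents["logvar_post"])
    # Placeholder contrastive term: push pre/post latents of the same patient apart
    # so that treatment-induced change is encoded explicitly (one possible reading
    # of the temporal "decoupling" described in the rebuttal).
    contrastive = F.cosine_similarity(latents["z_pre"], latents["z_post"], dim=-1).mean()
    return rec + cross + kl + lam * contrastive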




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    (1) Predicting pCR using post-treatment MRI images has limited clinical significance; (2) Containing multiple inconsistencies (both text and figures); (3) Lacking description of MRI data preprocessing and key ablation experiments.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


