Abstract

The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods.
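To make the fusion idea in the abstract concrete, here is a minimal, hypothetical sketch (PyTorch assumed; this is not the authors' released code): multiscale 3D image features act as queries that attend to an embedding of the tabular record via cross-attention, and the fused features are pooled for classification. All module names, shapes, and the two-stage encoder are illustrative assumptions.

```python
# Illustrative sketch only (PyTorch assumed; not the authors' released code):
# multiscale 3D image features are fused with a tabular embedding via
# cross-attention and then classified. All names and shapes are hypothetical.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse one scale of image tokens (queries) with tabular tokens (keys/values)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, tab_tokens):
        fused, _ = self.attn(img_tokens, tab_tokens, tab_tokens)
        return self.norm(img_tokens + fused)


class ToyMultimodalClassifier(nn.Module):
    def __init__(self, dim: int = 64, tab_features: int = 8, num_classes: int = 2):
        super().__init__()
        # Stand-in 3D image encoder producing two feature scales (a real model
        # would use a deeper backbone plus lesion-focused attention modules).
        self.stage1 = nn.Conv3d(1, dim, kernel_size=3, stride=2, padding=1)
        self.stage2 = nn.Conv3d(dim, dim, kernel_size=3, stride=2, padding=1)
        # Stand-in tabular encoder (the paper uses KAN; a plain MLP is used here).
        self.tab_encoder = nn.Sequential(
            nn.Linear(tab_features, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.fuse = nn.ModuleList([CrossAttentionFusion(dim), CrossAttentionFusion(dim)])
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, volume, tabular):
        f1 = self.stage1(volume)                          # (B, C, D/2, H/2, W/2)
        f2 = self.stage2(f1)                              # (B, C, D/4, H/4, W/4)
        tab = self.tab_encoder(tabular).unsqueeze(1)      # (B, 1, C)
        pooled = []
        for feat, fuse in zip((f1, f2), self.fuse):
            tokens = feat.flatten(2).transpose(1, 2)      # (B, N_scale, C)
            pooled.append(fuse(tokens, tab).mean(dim=1))  # pool fused tokens per scale
        return self.head(torch.cat(pooled, dim=1))


model = ToyMultimodalClassifier()
logits = model(torch.randn(2, 1, 32, 64, 64), torch.randn(2, 8))
print(logits.shape)  # torch.Size([2, 2])
```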

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2709_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/yjx1234/MMCAF-Net

Link to the Dataset(s)

Lung-PET-CT-dx dataset: https://www.cancerimagingarchive.net/collection/lung-pet-ct-dx/

BibTex

@InProceedings{YuJia_Small_MICCAI2025,
        author = { Yu, Jianxun and Ge, Ruiquan and Wang, Zhipeng and Yang, Cheng and Lin, Chenyu and Fu, Xianjun and Liu, Jikui and Elazab, Ahmed and Wang, Changmiao},
        title = { { Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {592--601}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes MMCAF-Net, a deep learning framework that addresses the challenges of small lesion detection and cross-modality fusion. This paper focuses on lung disease classification using PET/CT images and tabular clinical data. The authors introduce the E3D-MSCA module to improve lesion sensitivity, the Multiscale Cross-Attention (MSCA) module for multimodal feature fusion, and a Bidirectional Scale Fusion (BSF) module to resolve dimensional inconsistencies. Experimental evaluation on the Lung-PET-CT-Dx dataset demonstrates superior performance over several state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper clearly states the motivation for developing MMCAF-Net. It addresses the challenges of small lesion detection and of aligning features between modalities with different dimensional structures.

    The paper includes comparisons with relevant baselines. Performance improvements were reported in accuracy, F1 score, and PPV, highlighting the effectiveness of the proposed framework.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors use Kolmogorov–Arnold Networks (KAN) for encoding tabular data. However, they did not describe KAN in detail, nor did they provide qualitative or quantitative evidence that KAN is the best choice for the model.

    One of the key motivations of this paper is to integrate tabular data into the image analysis. However, the authors did not describe in detail what tabular data is used in the experiments. The paper does not provide evidence showing how the tabular data contributes to the model's decisions, nor does the model offer interpretability regarding how the tabular data is involved in the decision-making process.

    The experiments are conducted on a relatively small dataset, and the authors did not perform cross-validation. The generalizability and stability of the model are therefore not sufficiently tested.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    One of the key motivations of this paper is to integrate tabular data into the image analysis. However, the paper does not provide evidence showing how the tabular data contributes to the model's decisions.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Although the authors argue that the results in Tables 1 and 2 show that utilizing tabular data improves the performance of the model, I am unable to draw that conclusion from the tables. I am not sure which model corresponds to the ablated version of the proposed model without tabular data.



Review #2

  • Please describe the contribution of the paper

    This paper proposes MMCAF-Net, a multimodal multiscale fusion network for lung disease classification, specifically targeting the challenge of small lesion detection. The framework integrates 3D medical imaging and clinical tabular data using three key components: (1) an Efficient 3D Multi-Scale Convolutional Attention (E3D-MSCA) module for lesion-focused feature extraction, (2) Kolmogorov–Arnold Networks (KAN) for encoding structured clinical data, and (3) a novel Multiscale Cross Attention (MSCA) fusion module paired with a Bidirectional Scale Fusion (BSF) block for feature alignment and integration. The model is evaluated on the Lung-PET-CT-Dx dataset and shows consistent improvements over several existing multimodal baselines across key diagnostic metrics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an important problem: improving small lesion detection in multimodal clinical settings.
    • The design is methodologically sound and integrates several complementary modules tailored for 3D imaging and cross-modal fusion.
    • Ablation studies are thorough and support the effectiveness of the proposed modules.
    • Performance gains demonstrate the practical benefit of the proposed approach.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The innovation is primarily in architectural integration; the individual components (attention blocks, cross-attention, feature pyramids) are adaptations of known designs.
    2. The evaluation is conducted on a relatively small dataset, limiting the ability to assess the model’s generalizability to broader clinical settings or external datasets.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents an incremental advance rather than a fundamentally novel approach, it effectively combines established methods to tackle a relevant and challenging clinical task. The empirical results are convincing, and the paper is well-structured with clear motivation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed all of my concerns effectively. Overall, the paper presents a strong contribution and is deserving of acceptance.



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net), which extracts lesion-specific features from 3D medical images and classifies them as either adenocarcinoma or squamous cell carcinoma. Among the challenges the paper promises to address is the inability of SOTA techniques to handle small lesions. The method is also able to handle the differences in dimensionality between medical imaging and electronic health record data, thereby using both for effective classification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strengths of the paper are the following:
    1) The proposed Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net) is combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images.
    2) To further enhance multimodal data integration, MMCAF-Net introduces a multi-scale cross-attention module, which resolves dimensional inconsistencies and enables more effective feature fusion.
    3) A detailed ablation study is included to show why the different modules in the proposed framework were necessary and how they affect the overall performance of the model.
    4) A detailed performance comparison with other SOTA techniques is provided.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper has several weaknesses, especially in the results section.
    1) The dataset description is confusing. It is first stated that 251 adenocarcinoma samples and 61 squamous cell carcinoma samples are present in the training dataset, and immediately afterwards it is stated that only 34 squamous cell carcinoma samples were used for training, thereby requiring augmentation for class imbalance. Please clarify whether you are using a subset of the dataset or the dataset as a whole.
    2) Most of the comparisons with SOTA techniques use different datasets, so I assume the authors reimplemented these models on this dataset for comparison, although that is nowhere mentioned in the paper. That is acceptable, but there are papers that use this dataset directly, so please include those in the comparison as well.
    3) For the qualitative analysis, giving only one image to compare with other methods is not sufficient. Include more images to support the claim that your method outperforms the others.
    4) Moreover, the number of test samples for both carcinoma classes is very small, so the robustness of the model cannot be judged from these few samples.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has some novel contributions, especially with regard to the proposed framework:
    1) The proposed Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net) is combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images.
    2) To further enhance multimodal data integration, MMCAF-Net introduces a multi-scale cross-attention module, which resolves dimensional inconsistencies and enables more effective feature fusion.
    3) A detailed ablation study is included to show why the different modules in the proposed framework were necessary and how they affect the overall performance of the model.
    4) A detailed performance comparison with other SOTA techniques is provided.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I believe the proposed Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net), which extracts lesion-specific features from 3D medical images and classifies them as either adenocarcinoma or squamous cell carcinoma, has impressive potential, as it is claimed to handle small lesions, one of the challenges that current SOTA techniques fail to address. Moreover, the authors have addressed most of the concerns in the rebuttal, so I am accepting this paper.




Author Feedback

We thank the reviewers for their comments on our paper's innovation (R1), datasets and experiments (R1&2&3), tabular data and model (R2), and dataset description (R3). Below, we address these concerns, followed by specific responses to each reviewer.

G1: Dataset size and comparison experiment issues (R1&2&3). We focus on a binary classification task for lung cancer subtypes using 3D images and tabular data, which requires a dataset containing both. After reviewing 40 articles from 2021 to 2024, we found this dataset to be nearly the only one that meets our needs. Despite the small dataset, we applied data augmentation such as oversampling to the training data, together with regularization and early stopping (see Section 3.1, lines 9-11). We conducted a thorough search using the keyword “Lung-PET-CT-dx” and found only one relevant article, “CT multi-task learning with a large image-text (LIT) model,” which employed 3D images and tabular data for classification. However, it focused on a four-class task, differing from our binary classification criteria, so it was excluded from the comparison experiments. Our inclusion criteria require a binary classification task using 3D images and tabular data.

R1Q1: innovation problem. Our innovations consist of three parts. First, unlike traditional feature pyramids that rely on simple upsampling, our approach directly extracts raw features from encoder layers, addressing the issues of information loss in the multiscale fusion of 3D images. We then proposed the E3D-MSCA module to further extract multi-scale features from the feature pyramid, specifically optimized for the spatial structure of 3D data. By incorporating channel and spatial attention modules, we effectively captured multi-scale dependencies of key features in 3D space. Second, we developed the MSCA module. Unlike traditional cross-attention, we integrated the multi-scale concept into the cross-modal fusion process, enabling the model to capture and fuse features across different spatial ranges at multiple scale levels. Third, we introduced the BSF module to calculate dimensional importance scores between scales for merging features of different resolutions. The effectiveness of our first innovation is shown in Table 2, comparing rows 2 and 4, while Table 3 shows the superiority of our fusion method over others.
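As a rough, hypothetical illustration of the BSF idea mentioned in this response (a sketch under assumptions, not the released implementation), features from two resolutions can be brought to a common size and merged with learned, softmax-normalized importance scores per scale before projection.

```python
# Hedged sketch (assumed simplification, not the released BSF module): merge a
# coarse and a fine 3D feature map using learnable per-scale importance scores.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBidirectionalScaleFusion(nn.Module):
    def __init__(self, channels: int, num_scales: int = 2):
        super().__init__()
        # One learnable importance logit per scale (a simplification of the
        # dimensional importance scores described in the rebuttal).
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, coarse, fine):
        # Bring the coarse feature map up to the fine resolution.
        coarse_up = F.interpolate(
            coarse, size=fine.shape[2:], mode="trilinear", align_corners=False
        )
        weights = torch.softmax(self.scale_logits, dim=0)
        merged = weights[0] * coarse_up + weights[1] * fine
        return self.proj(merged)


bsf = ToyBidirectionalScaleFusion(channels=32)
fine = torch.randn(1, 32, 16, 32, 32)
coarse = torch.randn(1, 32, 8, 16, 16)
print(bsf(coarse, fine).shape)  # torch.Size([1, 32, 16, 32, 32])
```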

R2Q1: Selection of KAN for the tabular model. KAN offers advantages in faster scaling and strong expressive capability with fewer parameters. Originally designed for function approximation and solving partial differential equations, KAN has also been adapted for time-series prediction. We applied KAN to encode our tabular data and found it superior to the MLP used in our ablation experiments; we therefore chose KAN as our tabular data encoder. Due to space constraints, the ablation results for the tabular encoder are not included in the text but can be found in Table 4 of the anonymous code link.

R2Q2: Contribution and interpretability of tabular data. The tabular data includes attributes such as gender, age, weight, TNM stage, and smoking history. As shown in row 7 of Table 1 and row 4 of Table 2, incorporating this data improved our model’s performance by 7.4% in AUROC, 2.4% in ACC, and 14.5% in F1. This ablation experiment highlights the significant contribution of tabular data. Due to space constraints, further interpretability is illustrated in Figure 4 of the anonymous code link.
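For readers unfamiliar with KAN-style tabular encoders, here is a heavily simplified, hypothetical sketch (not the paper's implementation and not the pykan library API): each input-output edge learns its own univariate function, here parameterized by a small fixed Gaussian basis with learnable coefficients, rather than applying a fixed activation on each node as an MLP does.

```python
# Simplified, illustrative KAN-style layer (an assumption for exposition only):
# each edge gets a learnable univariate function built from a Gaussian basis.
import torch
import torch.nn as nn


class ToyKANLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, num_basis: int = 8):
        super().__init__()
        # Fixed basis centers on [-2, 2]; learnable coefficients per edge.
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, num_basis))
        self.coeff = nn.Parameter(torch.randn(out_features, in_features, num_basis) * 0.1)

    def forward(self, x):                                            # x: (B, in_features)
        # Evaluate each univariate edge function and sum over the inputs.
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)    # (B, in, K)
        return torch.einsum("bik,oik->bo", basis, self.coeff)        # (B, out)


# Hypothetical usage: 8 tabular features (e.g. gender, age, weight, TNM stage, ...)
tab_encoder = nn.Sequential(ToyKANLayer(8, 32), ToyKANLayer(32, 64))
embedding = tab_encoder(torch.randn(4, 8))
print(embedding.shape)  # torch.Size([4, 64])
```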

R3Q1: Dataset description issues. The dataset includes a total of 61 samples of squamous cell carcinoma, with 34 samples in the training set and 12 and 15 samples in the validation and test sets, respectively. This is mentioned in Section 3.1, lines 5-6, and on page 6, lines 1-3 of the main text.

R3Q3: Qualitative analysis image issues. Due to space limitations, Figure 4 shows the classification results of three hard-to-differentiate cases from the test set. Each case has 12 slices, and we selected the most representative slice of the lesion for display.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal addresses the reviewers' concerns to some extent, but the answers to some of the questions raised are not convincing.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed MMCAF-Net can effectively integrate multi-dimensional and multi-modal data to solve small lesion detection. All reviewers agree that the task is challenging and the method is sound.
