Abstract

Multi-granularity features can be extracted from multi-modal medical images, and how to jointly and effectively analyze these features is a challenging and critical issue for computer-aided diagnosis (CAD). However, most existing multi-modal classification methods have not fully explored the intra- and inter-granularity feature interactions across modalities. To address this limitation, we propose a novel Indepth Integration of Multi-Granularity Features Network (IIMGF-Net) for a typical multi-modal task, i.e., dual-modal CAD. Specifically, the proposed IIMGF-Net consists of two types of key modules: Cross-Modal Intra-Granularity Fusion (CMIGF) and Multi-Granularity Collaboration (MGC). The CMIGF module enhances the attentive interactions between same-granularity features from the two modalities and derives an integrated representation at each granularity. Based on these representations, the MGC module captures inter-granularity interactions through a coarse-to-fine and fine-to-coarse collaborative learning mechanism. Extensive experiments on two dual-modal datasets validate the effectiveness of the proposed method, demonstrating its superiority in dual-modal CAD tasks through the integration of multi-granularity information.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4250_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/seuzjj/IIMGF-Net

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WuYeL_Indepth_MICCAI2025,
        author = { Wu, YeLi and Zhang, Xiaocai and Wu, Weiwen and Jiang, Haiteng and An, Chao and Zhang, Jianjia},
        title = { { Indepth Integration of Multi-granularity Features from Dual-modal for Disease Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes IIMGF-Net, a novel dual-modal classification framework designed to integrate multi-granularity features from different medical image modalities for improved disease diagnosis. Specifically, the paper introduces two key modules:

    1. Cross-Modal Intra-Granularity Fusion (CMIGF): This module enhances the fusion of features at the same granularity level between modalities using a tailored cross-modal Mamba mechanism, enabling richer intra-granularity interactions.
    2. Multi-Granularity Collaboration (MGC): This module models hierarchical dependencies between different granularity levels through a bidirectional coarse-to-fine and fine-to-coarse collaboration strategy, enabling comprehensive inter-granularity integration. By effectively modeling both intra- and inter-granularity interactions, the proposed IIMGF-Net demonstrates superior classification performance across two dual-modal medical imaging benchmarks.
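
    To make the bidirectional coarse-to-fine and fine-to-coarse collaboration described in item 2 concrete, here is a minimal sketch. It is not the authors' implementation: the three granularity levels (z1 finest to z3 coarsest), the channel widths, and the additive message passing via 1x1 projections are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalGranularityCollaboration(nn.Module):
    """Illustrative sketch (not the authors' code) of a coarse-to-fine and
    fine-to-coarse collaboration over three granularity levels z1 (finest)
    to z3 (coarsest); channel widths, 1x1 projections, and the simple
    additive message passing are assumptions made for illustration."""

    def __init__(self, channels=(96, 192, 384)):
        super().__init__()
        # 1x1 convolutions that align channel widths between adjacent levels.
        self.fine_to_coarse = nn.ModuleList(
            [nn.Conv2d(channels[i], channels[i + 1], kernel_size=1) for i in range(2)])
        self.coarse_to_fine = nn.ModuleList(
            [nn.Conv2d(channels[i + 1], channels[i], kernel_size=1) for i in range(2)])

    def forward(self, z1, z2, z3):
        # Fine-to-coarse pass: inject detail from finer maps into coarser ones.
        z2 = z2 + F.avg_pool2d(self.fine_to_coarse[0](z1), kernel_size=2)
        z3 = z3 + F.avg_pool2d(self.fine_to_coarse[1](z2), kernel_size=2)
        # Coarse-to-fine pass: inject context from coarser maps into finer ones.
        z2 = z2 + F.interpolate(self.coarse_to_fine[1](z3), size=z2.shape[-2:],
                                mode="bilinear", align_corners=False)
        z1 = z1 + F.interpolate(self.coarse_to_fine[0](z2), size=z1.shape[-2:],
                                mode="bilinear", align_corners=False)
        return z1, z2, z3


# Example with spatial sizes typical of 224x224 inputs (assumed, not from the paper).
mgc = BidirectionalGranularityCollaboration()
z1, z2, z3 = torch.randn(1, 96, 56, 56), torch.randn(1, 192, 28, 28), torch.randn(1, 384, 14, 14)
z1, z2, z3 = mgc(z1, z2, z3)
```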
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed two-stage framework, IIMGF-Net, explicitly models both intra-granularity (via the CMIGF module) and inter-granularity (via the MGC module) interactions across dual-modal inputs. This joint modeling strategy addresses a clear gap in existing multi-modal fusion methods, which often neglect the hierarchical structure of visual features.
    2. The proposed Cross-Modal Mamba (CMM) mechanism effectively adapts the Mamba sequence modeling paradigm for dual-modal medical data, enabling efficient long-range contextual interaction at each granularity level. This is an interesting and timely adaptation of a novel architecture.
    3. By focusing on tasks such as skin cancer diagnosis and lymph node metastasis prediction using dual-modality imaging, the method addresses realistic and clinically meaningful CAD scenarios where multi-scale and multi-source information is common.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of strong motivation for Mamba over alternatives: While the paper adapts the Mamba mechanism for cross-modal feature fusion, it does not sufficiently justify why Mamba is chosen over more established attention-based methods like transformers or cross-attention modules (e.g., as used in TFormer [25] or CRD-Net [14]). A comparison or discussion regarding the efficiency or modeling advantages of Mamba in this context is missing.
    2. Missing implementation and training details: Some critical implementation details are under-specified. For example, how many parameters are introduced by the CMIGF and MGC modules? How does the total model complexity compare to strong baselines like FM4Net or TFormer? This makes it difficult to assess whether the improved performance is due to better design or increased model capacity.
    3. Limited interpretability and failure analysis: While Grad-CAM-style heatmaps are shown for qualitative evaluation, there is no deeper analysis of failure cases or insight into which modalities or granularity levels contribute most to the classification outcome. This reduces the interpretability and practical insights for clinical usage.
    4. Ambiguity in the design and benefit of the CMM (Cross-Modal Mamba) module: While the CMM module is claimed to facilitate adaptive cross-modal interaction, its implementation relies heavily on Mamba-style recurrent computations (e.g., sequential hidden state updates with temporal dynamics), which are not clearly justified in the context of spatial medical image fusion. Since the input is not sequential in nature (unlike video or language), the rationale for choosing a sequential memory mechanism over more conventional fusion strategies (e.g., multi-head cross-attention or gated fusion) is unclear. There is also no ablation or comparative baseline without Mamba to assess its actual contribution.
    5. Missing modality ablation or unimodal baselines: The paper focuses on dual-modal fusion, but it does not report results on individual modalities (e.g., only dermoscopy or only clinical images for Derm7pt), nor does it analyze how much each modality contributes. Without this, the benefit of dual-modality input is not clearly isolated from the model architecture improvements. Furthermore, although the abstract claims superiority over early-, middle-, and late-fusion strategies, Table 1 does not clearly indicate which baselines represent each category. The lack of explicit categorization or analysis of fusion types makes it difficult to assess how the proposed method compares to these fusion strategies.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite a clear effort to improve multi-modal feature integration for medical image classification, I find the paper not ready for acceptance due to several critical issues. The main methodological novelty, the use of Mamba for cross-modal fusion, is insufficiently motivated and not clearly superior to existing transformer-based attention mechanisms, especially given the spatial nature of medical images. The design and contribution of the Cross-Modal Mamba (CMM) module remain ambiguous, with no ablation or comparison against standard alternatives. Additionally, the experimental setup lacks clarity in several areas. The paper does not isolate the benefit of dual-modal input by providing unimodal baselines, and it also fails to explicitly categorize or compare against early-, middle-, and late-fusion strategies as claimed. Important implementation details like model complexity and parameter counts are omitted, making it difficult to evaluate whether gains are due to architectural insight or increased capacity. Finally, the lack of interpretability analysis and insight into modality contributions limits the clinical relevance and understanding of model behavior. For these reasons, I do not recommend acceptance at this stage.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    While the proposed model shows promising results, I find the core methodological justification, particularly for using Mamba in a spatial cross-modal fusion setting, remains insufficient. The authors argue that Mamba has been adapted for spatial image analysis, but they do not clearly explain why its recurrent design is appropriate or necessary for non-sequential data, especially compared to more intuitive alternatives like cross-attention. The lack of ablation studies isolating the CMM module further weakens the claim. Although parameter comparisons and unimodal results are mentioned in the rebuttal, these details are not clearly presented or analyzed in the paper. Key aspects such as failure cases, modality contributions, and fusion strategy categorization are underdeveloped or relegated to the response due to page limits. As such, the current version does not provide sufficient evidence or clarity to support its core claims. I recommend rejection in the current form, but encourage the authors to strengthen the methodological motivation, expand on experimental analyses, and clearly present supporting results in future revisions.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a network for in-depth fusion of multi-granularity features. The Cross-Modal Intra-Granularity Fusion module addresses intra-granularity feature integration: at each granularity level, the dual-modal features are adaptively fused with an attentive Mamba mechanism to obtain a unified representation. The Multi-Granularity Collaboration module uses coarse-to-fine and fine-to-coarse mechanisms to enhance feature interactions across granularities. It supports bidirectional collaboration among multi-granularity features, establishes hierarchical dependencies between features of different granularities, and ultimately improves classification performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a novel multi-modal feature fusion paradigm that moves beyond the limitations of the traditional early/middle/late fusion framework. Its core contribution is to systematically model dual-modal feature interaction from the perspective of granularity analysis for the first time, establishing cross-modal hierarchical dependencies through a dual mechanism of intra-granularity fusion and inter-granularity feature interaction. At the methodological level, it also offers a new perspective for multi-modal learning, and its granularity-based fusion strategy could be extended to other modality combinations.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Formula (1) uses VMamba’s VSS module for multi-granularity feature extraction, but VSS itself is a residual structure that preserves the input/output scale, whereas the paper claims that its output feature map has halved length and width and doubled channels (H/2 × W/2 × 2C). This statement may mislead the reader; the authors are advised to clarify whether the VSS structure has been modified, or to amend the relevant description to avoid ambiguity. The ablation experiments were only performed on the MGC module and the CMIGF module containing CMM, but did not directly verify the contribution of CMM’s core mechanism or how its effectiveness manifests. At the end of the encoder, the coarse-grained features and the multi-granularity collaborative features are simply fused through GAP + concatenation + FCN, without consideration of possible feature overriding or interference from noisy features. The authors are recommended to explain these design considerations in the discussion and justify the current choice.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper makes a valuable exploration at the intersection of computer vision and medical image analysis. It presents clear logic and a clear research motivation: starting from the theoretical basis of multi-granularity analysis, it innovatively extends the multi-modal image classification task and systematically formulates a novel research problem, namely the feature interaction within and across granularities. The paper reaches a high level in terms of methodological innovation, experimental completeness, and writing quality. Following the order in which inputs flow through the model, two key modules are introduced, and the connection and progression between the modules are reasonable and orderly. In the experimental part, the comparative experiments are clearly designed and analyzed, with comparisons against 12 recent methods demonstrating that the algorithm is effective and forward-looking. The formulas are clear, the notation is used correctly, and there are no obvious wording or formatting problems. The overall quality of the paper is high, but if the revision further clarifies the adjustments to the VSS module, strengthens the theoretical analysis of the CMM module, and discusses the basis for the final feature fusion strategy, it will be more helpful for readers to understand the core innovations of the method. In conclusion, I believe that with certain modifications the current version of this paper can be accepted for publication at the MICCAI conference.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal effectively addressed prior concerns and clarified unresolved questions raised in the initial review.



Review #3

  • Please describe the contribution of the paper
    • The authors propose a novel IIMGF-Net, which can achieve in-depth integration of multi-granularity features from dual-modal medical images to improve the accuracy of disease diagnosis.
    • The authors explore a Cross-Modal Intra-Granularity Fusion module and the Multi-Granularity Collaboration module, which enable intra-granularity feature fusion and inter-granularity feature collaboration, respectively.
    • The authors conduct comprehensive experiments to validate the effectiveness of the proposed IIMGF-Net. Compared to the state-of-the-art methods, the proposed method achieves the highest classification performance.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • As a method for classifying dual-modal images, the authors propose two new modules that merge features obtained from two images in the feature space (Cross-Modal Intra-Granularity Fusion module and Multi-Granularity Collaboration module). Both modules incorporate elements of Mamba.
    • The comparison is performed on two datasets (Derm7pt and DECT-LNM) against 12 existing methods. It is shown that the proposed method outperforms the existing methods on all evaluation measures (ACC, AUC, SEN, SPE) on both datasets.
    • Ablation studies show that each of the two modules of the proposed method contributes to the performance improvement.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • I don’t understand why Cross-Modal Mamba (CMM) is “Cross-Modal”. Specifically, from equation (3), it appears that the information from x and y is not integrated.
    • Only two datasets have been evaluated. And one of them is a private dataset.
    • I don’t understand why there are three images per case in the DECT-LNM dataset, given that they are CT images. Additional explanation of the dataset is needed.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is necessary to check whether the Cross-Modal Mamba (CMM), which is the main module of the proposed method, is a cross-modal feature extractor.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I was not sure whether the Cross-Modal Mamba (CMM), which is the main module of the proposed method, is a cross-modal feature extractor. But the authors’ answer removed my concern.




Author Feedback

We thank all reviewers for their constructive comments, which we address below and can largely incorporate in the final version with minor revisions.

R1 Q1 (Input/output scale of VSS): Yes, the original VSS module is modified to halve the feature length/width and double the channels in order to generate multi-granularity features.
Q2 (CMM’s contribution): As explained in Section 2.3 of the paper, CMM enables adaptive cross-modal interactions and facilitates information communication between the modalities via the interactive computation of S1 and S2. CMM’s superiority was consistently verified over competitors, e.g., self- or cross-attention, which was not included due to the page limit. For example, CMM improves accuracy over self-attention by 3.3% on the DECT-LNM dataset and 1.4% on Derm7pt, and over cross-attention by 2.6% and 0.7%, respectively.
Q3 (Feature fusion with GAP+Concat+FCN): This design is adopted following the literature [1,2], considering its simplicity and efficacy. The final high-level features can be effectively compressed by GAP, fused by concatenation, and combined with learned weights for classification via the FCN.
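For concreteness, a minimal sketch of the GAP + concatenation + fully-connected head described in the answer to Q3 might look as follows; the module name, branch channel widths, and the single linear layer are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FusionClassifierHead(nn.Module):
    """Minimal sketch of the GAP + concatenation + fully-connected head the
    rebuttal describes; module name, branch channel widths and the single
    linear layer are assumptions, not the paper's exact design."""

    def __init__(self, branch_channels=(384, 384), num_classes=2):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                       # global average pooling
        self.fc = nn.Linear(sum(branch_channels), num_classes)   # weighted combination for classification

    def forward(self, feature_maps):
        # feature_maps: list of (B, C_i, H_i, W_i) tensors, e.g. the coarse-grained
        # feature and the multi-granularity collaborative feature.
        pooled = [self.gap(f).flatten(1) for f in feature_maps]  # each (B, C_i)
        return self.fc(torch.cat(pooled, dim=1))                 # class logits


head = FusionClassifierHead()
logits = head([torch.randn(2, 384, 7, 7), torch.randn(2, 384, 7, 7)])
```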

R2 Q1 (Why CMM is cross-modal): S1 in Eq. (3) is obtained by S1 = Linear(y), as introduced below Eq. (3) in the paper, so x and y are indeed integrated in Eq. (3) and CMM is cross-modal.
Q2 (Datasets): The private dataset will be publicly released, and additional evaluation will be conducted in the future.
Q3 (Why use three images?): Three adjacent CT slices containing tumor regions are used, following [3], to provide richer contextual information.
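To illustrate why deriving the scan parameters of one modality from the other (S1 = Linear(y)) makes the operation cross-modal, here is a simplified, hypothetical selective-scan sketch in which the selection parameters steering the recurrence over x are predicted from y; the real CMM with 2D selective scanning is more involved, and all names, shapes, and parameterizations below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSelectiveScan(nn.Module):
    """Simplified, hypothetical sketch of a cross-modal selective scan: the
    input-dependent selection parameters (delta, B, C) that steer the recurrence
    over modality x are predicted from modality y, mirroring the idea of
    S1 = Linear(y) in the authors' Eq. (3)."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # state matrix (log-parameterized)
        self.to_delta = nn.Linear(d_model, d_model)                # step size, predicted from the other modality
        self.to_B = nn.Linear(d_model, d_state)                    # input matrix, predicted from the other modality
        self.to_C = nn.Linear(d_model, d_state)                    # output matrix, predicted from the other modality

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x, y: (batch, length, d_model) flattened patch sequences of the two
        # modalities, assumed spatially aligned (same length).
        batch, length, d_model = x.shape
        A = -torch.exp(self.A_log)                                 # (D, N), negative for stability
        delta = F.softplus(self.to_delta(y))                       # (B, L, D)
        B_sel, C_sel = self.to_B(y), self.to_C(y)                  # (B, L, N) each
        h = x.new_zeros(batch, d_model, self.d_state)              # recurrent hidden state
        outputs = []
        for t in range(length):                                    # sequential scan over patches
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                               # discretized state transition
            dBx = delta[:, t].unsqueeze(-1) * B_sel[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
            h = dA * h + dBx                                                            # state update steered by y
            outputs.append((h * C_sel[:, t].unsqueeze(1)).sum(dim=-1))                  # read-out, (B, D)
        return torch.stack(outputs, dim=1)                         # (B, L, D)


# Example: two aligned 196-patch sequences (14x14 feature map), 96-dim features.
scan = CrossModalSelectiveScan(d_model=96)
fused = scan(torch.randn(2, 196, 96), torch.randn(2, 196, 96))
```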

R3 Q1 (Motivation for Mamba): Mamba, originally a method for sequential data, has been adapted for image analysis, e.g., by [4] with a 2D Selective Scan that enables spatial image traversal. Mamba uses fewer parameters than self- or cross-attention-based methods while effectively modeling long-range dependencies, making it popular for medical image analysis tasks [5,6], particularly in data-scarce settings. Its efficacy is demonstrated in R1.Q2 and also by its superiority over Transformer-based methods in Table 1 of the paper.
Q2 (Implementation and training details): Our model has significantly fewer parameters (43.77M) than strong baselines, e.g., FM4Net (54.48M), TFormer (78.24M) and CRD-Net (82.73M). The full implementation is on GitHub and the link will be provided in the final version.
Q3 (More interpretability and failure analysis): These were not included due to the page limit. For example, using all three granularities (Z1, Z2 & Z3) obtains 77% accuracy on Derm7pt, and excluding Z1, Z2 or Z3 leads to drops of 1.7%, 1.4% and 1.1%, respectively, indicating that the finest granularity Z1 contributes the most.
Q4 (Using Mamba and CMM): Please refer to R3.Q1 and R1.Q2.
Q5 (Unimodal baselines and categorization of early-, middle-, and late-fusion strategies): Dual-modality input consistently outperforms unimodal baselines in our study, which was not reported due to the page limit. For example, our dual-modality model achieves 77.0% accuracy on Derm7pt, while the individual modalities obtain 69.4% and 75.0%, respectively. The categorization of fusion methods is already provided in the first paragraph of Section 3.3 of the paper; we will highlight this.

The novelty and theoretical contributions of our work have been consistently acknowledged by all reviewers, and we hope this will be considered in the final decision.

[1] He X, et al. Co-Attention Fusion Network for Multimodal Skin Cancer Diagnosis. PR, 2023.
[2] Zhang Y, et al. TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis. CBM, 2023.
[3] An C, et al. Deep learning radiomics of dual-energy computed tomography for predicting lymph node metastases of pancreatic ductal adenocarcinoma. EJNMMI, 2022.
[4] Liu Y, et al. VMamba: Visual State Space Model. NeurIPS, 2024.
[5] Gong H, et al. nnMamba: 3D Biomedical Image Segmentation, Classification and Landmark Detection with State Space Model. ISBI, 2025.
[6] Nasiri-Sarvi A, et al. Vim4Path: Self-Supervised Vision Mamba for Histopathology Images. CVPRW, 2024.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Based on the review, the paper’s use of Mamba for cross-modal medical image fusion lacks clear architectural justification and empirical support, with missing experimental details like unimodal baselines and modality analysis undermining its credibility. We encourage the authors to thoroughly address these gaps by justifying Mamba’s suitability, providing ablation studies, and completing the experimental analysis for potential resubmission.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


