Abstract

The integration of multimodal data, particularly medical images and tabular data encompassing physician-assessed radiological factors, holds significant promise for enhancing clinical decision-making. However, effective fusion of these heterogeneous data modalities remains challenging due to their disparate feature spaces and the limitations of current independent encoding approaches. We introduce FM-Bridge, a novel methodology leveraging a vision-language foundation model (VLM) to address this challenge. Our approach capitalizes on the intrinsic image-text embedding space alignment within VLMs to achieve robust multimodal fusion. We propose transforming clinical expertise-rich tabular data into semantically coherent textual descriptions, subsequently utilizing the VLM’s text encoder to generate textual features explicitly aligned with image features. This method facilitates a more semantically congruent and effective fusion of medical image and tabular data, demonstrating potential for improved performance in downstream medical image analysis tasks compared to conventional methods. Code is available at https://github.com/HKU-MedAI/FM-Bridge.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2892_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/HKU-MedAI/FM-Bridge

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuaYan_Bridging_MICCAI2025,
        author = { Huang, Yanyan and Zhang, Wanli and Huang, Peixiang and Fu, Yu and Yang, Ruimeng and Yu, Lequan},
        title = { { Bridging Radiological Images and Factors with Vision-Language Model for Accurate Diagnosis of Proliferative Hepatocellular Carcinoma } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {35 -- 45}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes FM-Bridge, a multimodal approach that leverages a vision-language foundation model (VLM) to fuse image and tabular data for accurate diagnosis of proliferative hepatocellular carcinoma. Specifically, FM-Bridge transforms tabular data into textual descriptions, which are then input alongside images into the VLM to extract image and tabular features. These features are fused using a similarity-weighting strategy.

    The proposed method is evaluated on a private dataset for proliferative HCC diagnosis, demonstrating favorable performance improvements.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1: This paper explores using VLMs to align and fuse image and tabular data, which should be interesting to the community.

    S2: Compared to the prior methods evaluated in the paper, the proposed method demonstrates significant performance improvements.

    S3: The paper is well-structured and easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    W1: As FM-Bridge uses a Transformer-based tabular encoder, it is not fair to compare only with multimodal methods designed for MLP-based tabular encoders. Relevant methods using Transformer-based tabular encoders include [R1].

    W2: The authors claim to propose a new strategy that uses the VLM’s text encoder to extract tabular features and show that FM-Bridge without the image encoder outperforms prior tabular-only methods. However, they do not compare FM-Bridge with recent tabular methods that also use semantic encoding and language models [R2, R3].

    [R1] Du, Siyi, et al. “TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data.” ECCV 2024.

    [R2] Yan, Jiahuan, et al. “Making Pre-trained Language Models Great on Tabular Prediction.” ICLR 2024.

    [R3] Yang, Jingfeng, et al. “TableFormer: Robust Transformer Modeling for Table-Text Encoding.” ACL 2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I find the research question on image-tabular learning using VLMs very interesting. However, my major concern is the somewhat unfair experimental comparison. The proposed FM-Bridge uses a Transformer-based architecture with semantic encoding, yet the paper lacks comparisons with more advanced methods specifically tailored for language models or Transformer-based tabular encoding. Addressing these concerns would make this paper better.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel medical diagnosis model named FM-Bridge. The core of FM-Bridge lies in converting tabular data into textual descriptions and incorporating a VLM for semantic analysis. In terms of modality fusion, the model adopts a weighted strategy by embedding the weights from multimodal fusion into unimodal features. The authors validated their approach on a private dataset, and the results indicate that its performance is improved compared to existing models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Processing tabular data as textual information: Instead of representing tabular data as discrete values as done in traditional models, the authors convert tabular data into textual descriptions. This approach preserves richer semantic information, and when processed through a VLM, it can more accurately describe pathological images.
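    To make the textualization idea concrete, here is a minimal sketch of how physician-assessed radiological factors might be rendered as a description for a VLM text encoder. The factor names and sentence template below are illustrative assumptions, not the paper's actual prompt format.

```python
def tabular_to_text(record):
    """Render a dict of radiological factors as a textual description.

    Hypothetical sketch: the field names and the sentence template are
    illustrative, not taken from the FM-Bridge paper.
    """
    # Turn each (factor, value) pair into a readable fragment.
    parts = [f"{name.replace('_', ' ')}: {value}" for name, value in record.items()]
    # Join the fragments into one sentence the text encoder can consume.
    return "A liver lesion with " + "; ".join(parts) + "."

print(tabular_to_text({"enhancement_pattern": "rim", "tumor_margin": "non-smooth"}))
# → A liver lesion with enhancement pattern: rim; tumor margin: non-smooth.
```

    The point of such a template is that categorical codes (e.g. an enhancement pattern stored as an integer) become natural-language tokens whose semantics the pretrained text encoder already understands.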

    Utilizing a weighted-like strategy for modality fusion: The authors multiply the image and text modalities, then apply a softmax function to transform the result into weights, which are subsequently multiplied with the corresponding modality. Compared to conventional methods that simply concatenate modalities for classification, this approach accentuates the features within each unimodal source that are most critical for the final classification.
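    As a rough illustration of the weighting the reviewer describes, the following NumPy sketch multiplies the two unimodal features element-wise, softmaxes the result into weights, and reweights each modality before concatenation. The function names and feature shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the feature dimension.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_weighted_fusion(img_feat, txt_feat):
    """Sketch of the similarity-weighting fusion described in the review:
    element-wise product -> softmax into weights -> reweight each unimodal
    feature, then concatenate for the classifier."""
    weights = softmax(img_feat * txt_feat)  # (batch, dim); each row sums to 1
    return np.concatenate([weights * img_feat, weights * txt_feat], axis=-1)

img = np.random.rand(4, 8)  # batch of image features
txt = np.random.rand(4, 8)  # batch of aligned text (tabular) features
fused = similarity_weighted_fusion(img, txt)
assert fused.shape == (4, 16)
```

    Compared with plain concatenation, dimensions where the two modalities agree receive larger weights, so the classifier sees the feature channels most supported by both sources.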

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Insufficient cross-slice feature fusion: When processing 3D CT images, the authors only feed individual slices into the VLM. This approach merely extracts local lesion features within a single slice and fails to effectively integrate the potential correlations across different slices. Given that 3D CT images inherently contain rich spatial continuity information, this processing strategy exhibits clear limitations in capturing cross-slice features, which may adversely affect overall diagnostic performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a multimodal model for medical diagnosis. The core innovation of this model is the conversion of tabular data into textual descriptions, which allows the data to carry richer semantic information. Moreover, the weighted strategy employed for modality fusion offers a degree of novelty compared to traditional approaches that rely solely on concatenation. However, the method used for CT processing is overly simplistic, as it only captures the intra-slice feature relationships, neglecting the inter-slice feature correlations.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents FM-Bridge, a novel approach that leverages a vision-language foundation model (VLM) to improve the diagnosis of proliferative hepatocellular carcinoma (HCC). The method involves transforming clinical data into textual descriptions which are then processed by the text encoder of a VLM. This allows for a more semantically congruent fusion of medical image and tabular data, which in turn improves medical image analysis tasks. The method was tested on a private dataset of proliferative HCC diagnosis and showed superior performance compared to conventional methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper presents a novel approach of using a VLM to improve the diagnosis of HCC.
    • The method of transforming clinical data into textual descriptions is innovative and shows potential for improving medical image analysis tasks.
    • The paper provides a comprehensive review of related studies and models, which gives a good background and context for the study.
    • Experiments are well designed, making the paper easy to read and understand.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The method is evaluated on a private dataset related to proliferative HCC, yet no information is provided about its size, class distribution, imaging modalities, or collection protocols. Without this context, it is difficult to assess the robustness, fairness, or generalizability of the results.

    • The authors claim that their method significantly outperforms existing approaches but do not present quantitative evidence to support this. No baseline metrics, statistical tests (e.g., DeLong test for AUC comparison), or comparative results are provided to validate these claims.

    • Although a link to the code is shared, the paper does not include essential implementation details. Key components such as model architecture, training parameters, optimization strategy, and evaluation metrics are missing. This lack of transparency severely hinders reproducibility.

    • The authors assert that their work is novel due to the use of a vision-language model for integrating radiological images with tabular data. However, vision-language models have been previously applied in multimodal medical contexts. The paper should include a more thorough comparison with related work and clearly articulate what sets their approach apart.

    Minor Issues:

    • Including statistical significance testing (e.g., DeLong test for comparing AUCs) would strengthen the validity of the performance claims and provide more confidence in the reported improvements.

    • A schematic or figure showing the overall architecture and data flow would greatly enhance understanding, especially of how tabular text is integrated with image features.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Mainly for the paper extension: The authors should provide a more detailed explanation of their method and justify its use. They should also provide more information about their dataset and compare their method with other state-of-the-art methods. In addition, they should provide more detailed implementation information and evidence to support their claims.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite some limitations, the paper presents a novel and promising approach for improving the diagnosis of HCC. The method is innovative and has potential for advancing multimodal medical AI. The paper is well-written and provides a comprehensive review of related studies and models. Therefore, I recommend accepting the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We are sincerely grateful to all reviewers for your insightful comments and constructive suggestions. We particularly appreciate the acknowledgement of our paper’s clarity, quality, and its contributions regarding the novelty and effectiveness of fusing image and tabular information for improved cancer diagnosis.

We have carefully considered all feedback and would like to address the key concerns as follows:

  1. Methodology and Dataset Details (R1)

We acknowledge the need for greater transparency. In the revised manuscript, we will supplement crucial implementation details and provide more comprehensive details regarding our private dataset to facilitate a better assessment of model generalizability and the reliability of our results.

  2. Comparisons with Other Methods (R1, R2)

Novelty of VLM for Tabular Data (R1): We agree that VLMs have seen increasing application in the medical domain. However, a key limitation of conventional VLMs is their inability to directly process structured tabular data, which is prevalent and crucial in clinical decision-making. Our primary novel contribution, as elaborated in the manuscript, is a straightforward yet effective approach to bridge this gap by transforming tabular data into textual descriptions. This textualization strategy enables existing VLMs, through their powerful text encoders, to seamlessly process and integrate tabular information alongside medical images. We believe this significantly extends the applicability of VLMs in multimodal medical analysis.

Expanded Experimental Comparisons (R2): We agree that comparing FM-Bridge with multimodal methods employing Transformer-based tabular encoders, as well as with advanced tabular-only methods, would provide a more robust evaluation of our fusion mechanism. We are committed to addressing this and will work to include these additional experimental comparisons in the final version of the manuscript.

  3. Insufficient Cross-Slice Feature Fusion for 3D CT (R3)

We acknowledge that processing 3D CT scans slice-by-slice has limitations in capturing inter-slice relationships. Employing a 3D image encoder is indeed a more ideal approach. Our initial experiments utilized 2D medical image VLMs due to the limited availability of robust, pre-trained 3D medical VLMs at the time of submission. Nevertheless, the core fusion concept of FM-Bridge is designed to be adaptable and can seamlessly transition to 3D VLMs as they become more mature. In the camera-ready version, we will enhance our discussion on this aspect, outlining how FM-Bridge can be extended with 3D encoders.

We believe these revisions will significantly strengthen our paper, and we thank the reviewers again for their valuable guidance.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


