Abstract

Accurate clinical diagnosis requires comprehensive analysis of medical imaging and patient narratives. However, current computer-aided diagnosis methods focus primarily on imaging modalities while neglecting the integration of patient-reported clinical narratives, due to the scarcity of high-quality patient narratives and the limitations in multimodal information fusion. To address these issues, we propose a dual-component framework consisting of: 1) a Retrieval-Augmented Patient Narratives Generation Module (RANGM) that employs a retrieval-enhanced mechanism to guide pre-trained large language models in generating clinically plausible patient narratives; and 2) a Multimodal Information Balanced Fusion Network (MIBF-Net) incorporating our novel Information Balanced Fusion Attention (IBFA) module for effective cross-modal integration, along with a Modal Prediction-Divergent Loss (MPL) to enhance the model’s ability to diagnose samples whose single-modality prediction distributions are ambiguous. Owing to the plug-and-play design, our MIBF-Net can integrate with existing imaging-based state-of-the-art methods. Extensive experiments demonstrate significant performance improvements of 2.3%-4.6% on the HAM10000 dataset and 3.8%-6.4% on the ISIC2019 dataset. Our code is publicly available at https://anonymous.4open.science/r/MIBF-Net-2B52/

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5428_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/sysu19351118/MIBF-Net

Link to the Dataset(s)

N/A

BibTex

@InProceedings{TanZix_MIBFNet_MICCAI2025,
        author = { Tang, Zixuan and Sun, Bai and He, Shidan and Hong, Yuan and Yu, Dongdong and Liu, Zhenzhong and Li, Mengtang and Chen, Bin and Zhao, Shen},
        title = { { MIBF-Net: Multi-modal Information Balanced Fusion Network for Clinical Diagnosis via Patient Narratives and Lesion Image } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contributions of this paper include a Retrieval Augmented Patient Narratives Generation Module (RANGM) that enhances patient narrative generation using LLMs, and a Multimodal Information Balanced Fusion Network (MIBF-Net), together with a dedicated loss function, to improve clinical diagnosis accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper addresses the challenge of integrating patient narratives with images to improve clinical diagnosis. The authors propose a novel module to feed medical knowledge to LLMs to generate patient narratives, incorporate a novel Information Balanced Fusion Attention (IBFA) for more effective image-text fusion, and propose a loss function to emphasize learning from samples where predictions from text and image modalities differ. Experiments on HAM10000 and ISIC2019 datasets show substantial improvements.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The generated narratives are synthetic and there is no external validation against real patient-reported data. The retrieval and generation heavily depend on the LLM outputs.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Typos and writing suggestions: In Figure 3, image enoder -> image encoder. Figure 4 caption, “which means the effectiveness of our method” -> “which proves/verifies the effectiveness of our method”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel methodological contributions despite some limitations in validation. The authors demonstrate clear improvements on two benchmark datasets.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a multimodal framework designed to address the challenges of limited high-quality patient narratives and multimodal information fusion in skin lesion diagnosis. Specifically, the authors employ a retrieval augmented patient narratives generation (RANGM) module to generate plausible patient narratives using a pre-trained language model, a multimodal information balanced fusion network (MIBF-Net) to learn cross-modal integration, and a modal prediction-divergent loss to address ambiguous single-modal predictions.

    The proposed method is evaluated on two skin lesion datasets, demonstrating favorable improvements in performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1: This paper explores using LLMs and a retrieval based narratives generation module to address limited multimodal data, which is an important problem to the community. S2: Compared to the prior methods evaluated in the paper, the proposed method demonstrates non-trivial improvements in performance. In addition, the algorithm is plug-and-play and can be used with any image encoders. S3: The paper is well-structured and easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    W1: The details of the framework at inference are unclear. Does the model take both the image and the generated patient narratives as input at inference? The query tokens in Eq. (1) are not clearly described. How does the RANGM generate narratives without the ground-truth disease names?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I find the idea of generating plausible patient narratives using LLMs and a knowledge base very interesting. Addressing the limited multimodal data is a critical challenge in this field. However, my major concern is the unclear inference-time behavior. Clarifying how the KNN retrieval operates and what inputs are used for narrative generation would strengthen the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel clinical diagnosis generation model named MIBF-Net. The core of MIBF-Net lies in integrating a knowledge base with an LLM to generate patient narratives for skin disease patients, and utilizing a Transformer optimized for multimodal tasks to extract cross-modal features from both the patient narratives and lesion images. In addition, the model employs a KL divergence-based loss function. The authors validate the proposed method on two public skin disease datasets, achieving results that outperform the comparative models reported in the literature.
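The KL divergence-based loss mentioned above is not specified in detail on this page; the toy calculation below only illustrates the general idea of measuring disagreement between the prediction distributions of the image and text branches. The symmetric form, the example probabilities, and the function names are all assumptions, not the paper's actual Modal Prediction-Divergent Loss.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence KL(p || q) between two discrete distributions,
    with a small epsilon for numerical stability."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical per-class prediction distributions from two modality branches
p_img = np.array([0.7, 0.2, 0.1])  # image-branch softmax output
p_txt = np.array([0.2, 0.5, 0.3])  # text-branch softmax output

# A symmetric divergence: large when the branches disagree, zero when equal
divergence = 0.5 * (kl(p_img, p_txt) + kl(p_txt, p_img))
print(round(divergence, 4))
```

A loss built on such a divergence could up-weight samples where the two branches disagree, which matches the abstract's stated goal of handling samples with ambiguous single-modality predictions.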

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Introduction of a new modality: The authors generate Patient Narratives using an LLM and a knowledge database, and fuse them with lesion images to produce clinical diagnoses.

    A novel transformer-based modality fusion approach: The authors modify the keys (K) and values (V) of the unimodal transformer by concatenating the current modality’s K and V with those from another modality, enabling the current modality to incorporate information from the remaining modalities.
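The fusion mechanism the reviewer describes can be sketched roughly as follows. Everything here is an assumption for illustration: a single attention head, numpy tensors instead of learned projections, and hypothetical token counts. It is not the paper's actual IBFA implementation, only an instance of extending one modality's keys and values with another's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_a, k_a, v_a, k_b, v_b):
    """Attention for modality A whose keys/values are concatenated
    with those of modality B along the token axis, so A's queries
    can attend over both modalities."""
    k = np.concatenate([k_a, k_b], axis=0)   # (n_a + n_b, d)
    v = np.concatenate([v_a, v_b], axis=0)   # (n_a + n_b, d)
    d = q_a.shape[-1]
    scores = q_a @ k.T / np.sqrt(d)          # (n_a, n_a + n_b)
    return softmax(scores) @ v               # (n_a, d)

# Toy example: 4 image tokens attend over 4 image + 3 text tokens
rng = np.random.default_rng(0)
img_q, img_k, img_v = (rng.standard_normal((4, 8)) for _ in range(3))
txt_k, txt_v = (rng.standard_normal((3, 8)) for _ in range(2))
fused = cross_modal_attention(img_q, img_k, img_v, txt_k, txt_v)
print(fused.shape)  # (4, 8)
```

The design choice here is that each modality keeps its own queries, so the output stays aligned with that modality's tokens while still drawing information from the other modality's keys and values.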

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Uncertainty in generating Patient Narratives: The authors generate Patient Narratives by concatenating a prompt with the top-k related data from a knowledge database—a method akin to RAG. Thus, the core issue of this approach lies in the accuracy of the selected top-k knowledge. The authors employ a KNN algorithm to verify this top-k knowledge; however, given KNN’s limitations in handling high-dimensional data, an insufficient amount of effective knowledge provided to the LLM may lead to biased Patient Narratives.
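The top-k retrieval step under discussion can be sketched as a simple nearest-neighbor search over knowledge-base embeddings. The cosine-similarity metric, embedding dimension, and function name below are assumptions for illustration, not the paper's actual RANGM retrieval.

```python
import numpy as np

def top_k_knowledge(query_emb, kb_embs, k=3):
    """Return indices of the k knowledge entries most similar to the
    query embedding, ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    kb = kb_embs / np.linalg.norm(kb_embs, axis=1, keepdims=True)
    sims = kb @ q                      # cosine similarity per entry
    return np.argsort(-sims)[:k]      # indices of the k best matches

# Toy knowledge base of 100 hypothetical entry embeddings
rng = np.random.default_rng(1)
kb_embs = rng.standard_normal((100, 16))
# A query nearly identical to entry 42 should retrieve it first
query = kb_embs[42] + 0.01 * rng.standard_normal(16)
idx = top_k_knowledge(query, kb_embs, k=3)
print(idx[0])  # 42
```

The retrieved entries would then be concatenated into the LLM prompt, which is the point where the reviewer's concern applies: if the retrieval misses relevant knowledge, the generated narrative inherits that bias.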

    Limited experimental coverage: The authors tested their method solely on the ISIC2019 and HAM10000 skin disease datasets, where lesions typically appear quite prominent. Consequently, the performance of the model in handling hard samples (e.g., small lesions) has not been adequately validated.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Compared to traditional clinical diagnosis models that rely solely on lesion images, the authors propose a novel approach by combining patient narratives with lesion images to generate clinical diagnoses, and they introduce a model that achieves this functionality with excellent performance. The model emphasizes both the generation of patient narratives and the balanced fusion of patient narratives with lesion images. However, the method used to select the top-k related knowledge is overly simplistic, which may lead to biased patient narratives. Moreover, the experiments might not have evaluated the model’s performance on handling inconspicuous lesions (e.g., newly developed cerebral infarction).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

To Reviewer #2: Thank you for your positive feedback on our method, including the “novel IBFA” and “substantial improvements.” We will test our approach on additional LLMs and will further evaluate its clinical impact.

To Reviewer #3: We appreciate your comments, such as “easy to follow” and “important to the community.” Our model uses both modalities during inference. Our RANGM generates narratives from lesion images using an offline disease knowledge base.

To Reviewer #4: Thank you for your positive feedback on our method, particularly your recognition of our “novel modality fusion approach.” Regarding your concern: since our current experiments were conducted only on dermatology datasets with relatively short knowledge base contexts, we have not encountered the issue you mentioned. However, we will investigate whether this problem might emerge in scenarios with larger-scale knowledge base contexts in our future research. Your suggestion provides valuable guidance for the iterative improvement of our method, and we sincerely appreciate your insightful comments.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


