Abstract

Multimodal large language models (MLLMs) can process and integrate information from multimodal sources, such as text and images. However, the interrelationship among input modalities, the uncertainties due to individual uni-modal data, and the potential clinical applications following such an uncertainty decomposition are yet to be fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM), based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we show that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such transferability may be explained by the shared pretraining and comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, leads to direct clinical applications in which uncertainties may be estimated, and thus analysed robustly, for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally that MUPMs reduce the amount of multimodal data required for estimating the overall uncertainty and can identify redundant factors, both of which are practical and clinically useful applications of the proposed MUPMs. Code is available at https://anonymous.4open.science/r/MUPM-CBDB.
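For readers unfamiliar with the term, the standard first-order uncertainty propagation rule that the abstract alludes to can be written as below. This is a textbook sketch rather than the paper's exact formulation: the scalar inputs x_I (image) and x_T (text), the model output y = f(x_I, x_T), and the symbols u_I, u_T, u_J and \rho are notation assumed here for illustration only.

```latex
% First-order (delta-method) propagation for y = f(x_I, x_T), scalar inputs assumed
\sigma_y^2 \;\approx\;
  \left(\frac{\partial f}{\partial x_I}\right)^{2}\sigma_I^{2}
  + \left(\frac{\partial f}{\partial x_T}\right)^{2}\sigma_T^{2}
  + 2\,\frac{\partial f}{\partial x_I}\,\frac{\partial f}{\partial x_T}\,\sigma_{IT}

% Writing u_I = |\partial f/\partial x_I|\,\sigma_I and u_T = |\partial f/\partial x_T|\,\sigma_T
% for the output uncertainty under image-only and text-only variation, and
% \rho = \sigma_{IT}/(\sigma_I\sigma_T), the joint output uncertainty u_J satisfies
u_J^{2} \;\approx\; u_I^{2} + u_T^{2} + 2\,s\,\rho\,u_I\,u_T,
\qquad
s = \operatorname{sign}\!\left(\frac{\partial f}{\partial x_I}\,\frac{\partial f}{\partial x_T}\right)
```

Under this reading, a model relating the three uncertainties only needs a small number of coefficients, which is consistent with the abstract's remark about the low-dimensional nature of the MUPMs.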

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4581_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/yucheng722/MUPM

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{TanYuC_Analysis_MICCAI2025,
        author = { Tang, YuCheng and Fu, Yunguan and Yi, Weixi and Wang, Yipei and Alexander, Daniel C. and Davies, Rhodri and Hu, Yipeng},
        title = { { Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a linear Multimodal Uncertainty Propagation Model (MUPM) to estimate and analyze the interaction between image-only, text-only, and joint uncertainties in large multimodal language models (MLLMs), applied to cardiac MR imaging and health record data. The authors demonstrate MUPM’s robustness across input distributions and transferability across tasks, and they explore its utility for efficient uncertainty estimation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Proposes a practical and interpretable linear surrogate to avoid unstable derivative-based uncertainty propagation in MLLMs.

    • Demonstrates generalization of the uncertainty decomposition model across different cardiac disease prediction tasks (1-, 3-, and 5-year).

    • Uses a large-scale, real-world clinical dataset (UK Biobank).

    • Introduces clinically relevant applications: efficient uncertainty estimation, modality contribution analysis, and potential for dataset optimization.

    • Provides code and implementation details to support reproducibility.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the idea is promising, the methodological execution raises several concerns:

    • Temporal Coherence Ignored: From the “Datasets” paragraph in Section 3, it seems that cardiac MR data is treated as a set of independent static volumes, with affine augmentations and uncertainty estimation applied per time-point. Additionally, each 3D volume is resampled along the Z-axis, even though each slice is independently acquired. Some subjects may present breath-hold misalignments. If each time-frame is treated independently, there could be a large variance in uncertainty depending on whether an end-diastole (ED) or end-systole (ES) frame is selected, which could confound the interpretation of image uncertainty. The normalization is also done per 3D volume, despite each slice being acquired independently in a 2D+t scheme.

    • Linear Modeling vs. Claimed Nonlinearity: The paper points out the problem with derivative-based propagation due to model non-linearity and instability, yet replaces it with a linear model (MUPM). This simplification is pragmatic, but it implies that the authors are analyzing small local perturbations (i.e., slight affine deformations, minor text edits), which may capture subject-level physiological variability. This makes sense assuming a large enough dataset where the physiological variability comes from the different subjects, but I think it should be discussed or pointed out.

    • Modality Independence Assumption: Each modality is augmented and analyzed independently, without accounting for cross-modal consistency. This design overlooks cases where the MR image’s diagnostic contribution is disease-dependent. For example, arrhythmias or valve disorders may be invisible in imaging but described in clinical notes, while other diseases like dilated cardiomyopathy are visually prominent.

    • Per-Disease Analysis: The uncertainty contributions of text/image modalities are not analyzed per disease. This could uncover clinically important insights, e.g., identifying diseases detectable only in image, or that benefit from it.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    This paper addresses an important and timely question: how to model uncertainty in multimodal LLMs in a way that supports clinical interpretation and operational robustness. The contribution is novel and valuable, especially given the challenges of black-box models in healthcare. However, the paper would benefit significantly from deeper introspection regarding:

    • the limitations of modeling temporal imaging data as static
    • assumptions of linearity
    • disease-specific heterogeneity in modality contribution.

    These limitations don’t invalidate the contribution but should be transparently addressed or explored in future work.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite strong results and an original, well-motivated idea, this work includes several simplifications (e.g., temporal independence, linear modeling) that are not fully justified and may limit generalizability. The paper would benefit from more careful data handling, a discussion of how well the linear model approximates real-world MLLM behavior, and per-disease breakdowns of modality contributions. Nonetheless, the core idea is valuable and has the potential to spark impactful follow-up work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a linear model for multimodal uncertainty propagation in Multimodal Large Language Models (MLLMs), focusing on cardiac MR imaging and health record-based disease prediction. The method characterizes the relationship among image-only, text-only, and joint uncertainties, which are challenging to estimate directly in MLLMs due to the instability of gradient-based approaches. The authors show that MUPM is robust to changes in data distribution, transferable across tasks, and helps in reducing uncertainty estimation costs. The paper also explores practical clinical applications, including uncertainty-aware diagnosis, modality redundancy identification, and efficient downstream uncertainty estimation. Overall, the motivation is clear. The problem being solved seems interesting. The methods section is sound.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper identifies an important and under-explored question in MLLMs: how do image-only and text-only uncertainties propagate and interact in clinical multimodal scenarios.
    2. The model is simple yet interpretable. It is potentially beneficial for clinical scenarios that value model transparency.
    3. The experiments are conducted on large-scale real-world datasets (UK Biobank) involving cardiac MRI and EHR data.
    4. The model provides interpretable insights into the contributions of different modalities and their interactions to the total uncertainty. It explores multiple potential applications of the model, covering distribution shift robustness, task generalization, efficient uncertainty estimation, and redundancy detection.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited methodological novelty: The main contribution, MUPM, is a direct application of first-order uncertainty propagation theory combined with linear regression. This is a well-established technique and the paper lacks new insights into why this method is particularly suited for MLLMs.
    2. No comparison with alternative MLLMs: All experiments are conducted exclusively on M3D. No experiments were conducted with other MLLMs such as BLIP-2 or LLaVA-Med, limiting the generalizability claim. The authors need to justify why M3D was chosen and why it is representative.
    3. No baseline comparisons: The paper does not compare MUPM against existing uncertainty quantification methods (e.g., MC-Dropout, Ensembling, variance-based methods), making it difficult to assess the advantages or trade-offs of the proposed model.
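As a rough illustration of the "first-order uncertainty propagation combined with linear regression" idea named in weakness 1, the sketch below fits a three-coefficient linear model that predicts joint output variance from image-only and text-only output variances. It is not the authors' implementation: the function names, the cross term, the coefficient values, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_surrogate(var_image, var_text, var_joint):
    """Least-squares fit of
    var_joint ~ a * var_image + b * var_text + c * sqrt(var_image * var_text).
    The functional form is an assumption motivated by first-order propagation."""
    var_image = np.asarray(var_image, dtype=float)
    var_text = np.asarray(var_text, dtype=float)
    design = np.column_stack(
        [var_image, var_text, np.sqrt(var_image * var_text)]
    )
    coeffs, *_ = np.linalg.lstsq(
        design, np.asarray(var_joint, dtype=float), rcond=None
    )
    return coeffs


def predict_joint(coeffs, var_image, var_text):
    """Predict joint variance from the two uni-modal variances."""
    var_image = np.asarray(var_image, dtype=float)
    var_text = np.asarray(var_text, dtype=float)
    a, b, c = coeffs
    return a * var_image + b * var_text + c * np.sqrt(var_image * var_text)


# Synthetic stand-in data (hypothetical values, not results from the paper).
rng = np.random.default_rng(0)
n_samples = 20  # a small sample count, echoing the "few samples" claim
var_image = rng.uniform(0.01, 0.1, n_samples)
var_text = rng.uniform(0.01, 0.1, n_samples)
true_coeffs = np.array([0.3, 1.2, 0.1])  # arbitrary ground truth for the demo
var_joint = predict_joint(true_coeffs, var_image, var_text)
var_joint = var_joint + rng.normal(0.0, 1e-6, n_samples)  # small noise

coeffs = fit_linear_surrogate(var_image, var_text, var_joint)
predicted = predict_joint(coeffs, var_image, var_text)
print("fitted coefficients:", coeffs)
```

In practice the three variance arrays would come from repeated model predictions under image-only, text-only, and joint input perturbations. Once fitted, such a model could estimate the joint uncertainty from the two uni-modal measurements alone.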
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper studies an important but under-explored problem and shows practical insights using real-world data. Although the method is simple and lacks strong comparisons, the work is still valuable for clinical MLLM applications. I lean towards a weak accept.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a Multimodal Uncertainty Propagation Model (MUPM) for modeling the uncertainty of each input modality, as well as the relationship among the uncertainties of those inputs. Experiments demonstrate the proposed approach’s ability to (1) show the importance of each input modality using the corresponding uncertainty; (2) predict the effect of improving the quality of an input; and (3) remain robust across settings.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Being able to effectively predict the importance of (improving) a modality is very useful in practice; it could save considerable effort on data collection or quality improvement if the corresponding modality (or its improvement) turns out to be unimportant.
    2. Their finding that MUPM coefficients remain stable across different data distributions and tasks suggests the approach could be generalizable.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The linear approximation might be insufficient for more complex models.
    2. The work focuses exclusively on cardiac disease prediction tasks. It’s unclear whether the conclusions (especially about text being more important than images) would hold for other medical domains or general multimodal tasks.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It would be great if the authors could check the data and provide some intuition or explanation for the experimental results, for example, what makes images less important than text.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Being able to effectively predict the importance of (improving) a modality is very useful in practice, though it would be great if the authors could validate the approach on more tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

N/A




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


