Abstract

Medical vision-language pretraining models (VLPM) have achieved remarkable progress in fusing chest X-rays (CXR) with clinical texts, introducing image-text data binding approaches that enable zero-shot learning and downstream clinical tasks. However, the current landscape lacks the holistic integration of additional medical modalities, such as electrocardiograms (ECG). We present MEDBind (Medical Electronic patient recorD Bind), which learns joint embeddings across CXR, ECG, and text. Using text data as the central anchor, MEDBind features tri-modality binding, delivering competitive performance in top-K retrieval, zero-shot, and few-shot benchmarks against established VLPM, and the ability for CXR-to-ECG zero-shot classification and retrieval. This seamless integration is achieved by combining contrastive loss on modality-text pairs with our proposed contrastive loss function, Edge-Modality Contrastive Loss, fostering a cohesive embedding space for CXR, ECG, and text. Finally, we demonstrate that MEDBind can improve downstream tasks by directly integrating CXR and ECG embeddings into a large-language model for multimodal prompt tuning.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2333_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2333_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

https://physionet.org/content/mimic-cxr-jpg/2.0.0/
https://openi.nlm.nih.gov/faq
https://stanfordmlgroup.github.io/competitions/chexpert/
https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database
https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018
https://www.physionet.org/content/mimic-iv-ecg/0.1/
https://www.nature.com/articles/s41597-020-0495-6
http://2018.icbeb.org/Challenge.html
https://www.physionet.org/content/mimiciv/2.2/



BibTex

@InProceedings{Gao_MEDBind_MICCAI2024,
        author = { Gao, Yuan and Kim, Sangwook and Austin, David E and McIntosh, Chris},
        title = { { MEDBind: Unifying Language and Multimodal Medical Data Embeddings } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The proposed method binds CXRs, ECGs, and text into joint embeddings. The paper introduces a tri-modality binding loss and shows good results compared to existing fine-tuning benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • VLPM for chest X-ray datasets is a competitive domain; specifically, the results on MIMIC-CXR are impressive.
    • Figure 1 gives a perfect overview of the method, very nicely constructed figure.
    • The combination of CXR and ECG is very intuitive. In clinical settings these data types are used concurrently by doctors, so it makes sense to combine them in deep learning models as well.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • From the results it seems that there is no conclusive benefit for the X-ray datasets to use a model pre-trained with ECG data, nor for the reverse scenario. It could be that the good performance against other methods is achieved by the breadth of pre-training datasets compared to earlier methods.
    • The LLM integration setting in section 3.4 is not well described and is confusing. BioBERT is not an LLM: an LLM is a generative model, while BioBERT is not. This makes it unclear what experiments were done in this section of the paper.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The description of the method and implementation is clear. The paper seems reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Writing is ‘clunky’ in some parts of the paper, see for example 3.3 “in zero shot, “, 3.4 “free text into LLM,”.
    • Be careful of switching between present and past tense within one sentence.
    • P5 implementation details: “we truncated the text to first 100 words without compromising the content.” How is it possible to truncate text without compromising the content? And it likely is 100 tokens instead of 100 words.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Well-described, innovative setting combining ECGs, X-rays, and text.
    • Incomplete evaluation.
    • Unclear multimodal LLM setting.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper develops a contrastive learning method to support the binding of three medical modalities. Specifically, it integrates joint information from chest X-rays (CXR), electrocardiogram (ECG), and medical text through text-modality contrastive loss (TMCL) and edge-modality contrastive loss (EMCL).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this paper is the rich validation experiments. The developed self-supervised contrastive learning method is tested with multiple classification and retrieval tasks. Multiple experiments are conducted to show the efficacy and necessity of EMCL.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the developed method is well tested and described, its core components, i.e. the encoders for image and text embedding generation, leverage existing modules. For example, the Swin Transformer is used as the encoder for CXR images; a vanilla Transformer backbone is used as the ECG encoder; and BioBERT is used as the text encoder. Additionally, the proposed EMCL is similar to TMCL in format.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    An interesting tri-modal data learning framework (MEDBind) is developed in this study. Through two types of contrastive loss terms, for text-image and image-image pairs, CXR, ECG, and medical texts are integrated for embedding learning. MEDBind's efficacy is demonstrated by multiple classification and information retrieval experiments. The fundamental question that remains to be answered is whether EMCL provides a generic way to integrate multi-modal data. It would be great to clarify further why EMCL helps improve performance. Is there any other way to improve EMCL? In equation (2), where EMCL is defined, it is noticed that this loss can be re-written as $-\sum_{u=1}^m \log(\dots) - m \log(m/n)$. Given $m$ and $n$, $-m \log(m/n)$ is a constant, and it is not clear why it is necessary to have this constant term in the EMCL loss. It is also not clear why it is not necessary to assign different weights to the EMCL and TMCL loss terms.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation is based on the trade-off between the well-tested tri-modality binding framework and the marginal technical novelty.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Rebuttal responses partially address my questions



Review #3

  • Please describe the contribution of the paper

    The paper presents the first attempt to learn a unified embedding across three medical modalities: CXR, ECG and medical texts. The architecture employs dedicated encoders for each modality that output comparable, normalized embeddings. For training, a Text-Modality Contrastive Loss (which is a variant of the infoNCE loss) is used for text-modality binding, and a novel Edge-Modality Contrastive Loss (EMCL) is proposed for binding between the non-text modalities, i.e. CXR and ECG. The model achieves good performance for information retrieval, zero-shot and few-shot learning tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is the first attempt to bind more than two medical modalities.
    2. The paper proposes a novel contrastive loss function, EMCL. Whereas previous methods like ImageBind [5] bind each modality to a common modality (images), MEDBind not only binds each non-text modality to a common text modality using the TMCL loss, but also explicitly binds the non-text modalities using the EMCL loss.
    3. The paper is well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The acronym MEDBind does not fit the full form, which is “Medical Electronic Patient Record”. Authors should clarify the intention behind the “Bind” keyword when they first introduce the acronym in the abstract/introduction.
    2. In TMCL loss, authors consider identical paired texts as positive pairs. It is unclear whether this variant of the InfoNCE loss has been used before or is a contribution of the authors.
    3. It is not clear what t_j -> z_j in eq (1) denotes, and how L^(t_j->z_j) is different from L^(z_j->t_j).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel method to combine CXR, ECG and medical text. The method is well explained and the results are convincing.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Most of my concerns have been addressed. I maintain my initial rating. I would recommend the authors to clarify the section on TMCL in the paper, if accepted.




Author Feedback

We thank reviewers for the positive comments:

  1. First attempt to unify embeddings across three medical modalities: CXR, ECG, and texts
  2. Novel contrastive loss functions: TMCL in its application to text-modality binding and EMCL for non-text binding
  3. Comprehensive experiments and good performance in retrieval, zero-, and few-shot tasks
  4. Intuitive combination of CXR and ECG reflecting clinical practices

We also thank reviewers for their constructive feedback and will revise/address them in the camera-ready version, including grammatical errors (R4) and model naming (R5). Detailed responses by subheadings follow:

Value of multimodal training (R3,R4): Multimodal training is not intended to significantly outperform single-modal training on, e.g., single-modal CXR classifications; it is intended to perform similarly while enabling new cross-modal retrieval and zero-shot avenues of research that better reflect the inherent multimodal nature of medicine. Interestingly, we see some improvements in single-modal tasks in retrieval (Tab 2) and zero-shot (Fig 3), perhaps due to cross-modality regularization. However, the benefits are highlighted in cross-modal retrieval tasks (Fig 2), classification (Tab 3), and downstream LLM integration (Tab 4). Tab 3 highlights MEDBind’s cross-modality zero-shot performance, where the model predicts a CXR label using ECG and vice versa, without training. Diagnosing conditions across different modalities is a growing area of research. For example, [1] uses traditional supervised learning from CXR to predict echocardiogram-derived disease labels. Our model is the first to perform cross-modal CXR/ECG tasks with zero-shot classification.

Clarifications on EMCL (R3): While EMCL is similar in form to TMCL, its novelty lies in its approach to directly bind CXR and ECG, which has not been explored. Regarding the term $-m \log(m/n)$ in EMCL: $n$ is a constant representing the mini-batch size, whereas $m$ fluctuates, representing the number of matched CXR-ECG pairs within a given mini-batch (not all patients receive both a CXR and an ECG). Dynamic weight adjustment through $-m \log(m/n)$ normalizes the contribution of negative anchors in the denominator of the InfoNCE loss across such variable batches, ensuring better consistency in the EMCL loss. We will explore using different weights between EMCL and TMCL in future work.
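The decomposition the reviewer notes and the normalization described above are consistent with the following reconstruction of Eq. (2). This is a sketch, not the paper's exact formula: the notation ($x_u$, $y_u$ for paired CXR/ECG embeddings, $\tau$ for a temperature) and the bounds of the denominator sum are assumptions made for illustration.

```latex
% Assumed form: scaling the InfoNCE denominator by n/m over the m matched
% pairs in a batch of size n yields exactly the reviewer's decomposition.
\mathcal{L}_{\mathrm{EMCL}}
  = -\sum_{u=1}^{m} \log
      \frac{\exp(x_u \cdot y_u / \tau)}
           {\tfrac{n}{m} \sum_{v=1}^{m} \exp(x_u \cdot y_v / \tau)}
  = -\sum_{u=1}^{m} \log
      \frac{\exp(x_u \cdot y_u / \tau)}
           {\sum_{v=1}^{m} \exp(x_u \cdot y_v / \tau)}
    - m \log \frac{m}{n}
```

Under this reading, $-m \log(m/n)$ is constant only for a fixed $m$; across batches with varying numbers of matched pairs it rescales the denominator, which matches the rebuttal's point about normalizing negative anchors over variable batches.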

Clarity on LLM (R4): We follow the broader definition of LLM as large pre-trained language models on a vast corpus of text, which includes BERT (masked); see [2, 3] for usage. While not generative, BioBERT may be more suitable for classification tasks as an encoder-only LLM. We will revise Sect 3.4 to improve clarity. The role of the LLM in Sect 3.4 is to comprehend the text and integrate it with CXR and ECG embeddings. Tab 4 shows how MEDBind’s CXR and ECG tokens can act as “summaries” of imaging modalities and be incorporated with medical text into an LLM by training a classification head.
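The integration described above, modality embeddings acting as "summary" tokens prepended to text before a trainable classification head, can be sketched as follows. This is an illustration only: the dimensions, mean pooling, and softmax head are assumptions, not the paper's implementation, and the frozen encoder-only LLM is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
d = 768        # embedding width shared by text tokens and MEDBind embeddings
seq_len = 100  # number of text tokens
n_classes = 2

# Frozen, precomputed inputs: text token embeddings plus one CXR and one
# ECG embedding from the tri-modal encoder, projected to the LLM width.
text_tokens = rng.normal(size=(seq_len, d))
cxr_token = rng.normal(size=(1, d))  # "summary" token for the CXR
ecg_token = rng.normal(size=(1, d))  # "summary" token for the ECG

# Prepend the modality tokens to the text sequence, as in prompt tuning.
sequence = np.concatenate([cxr_token, ecg_token, text_tokens], axis=0)

# A trainable classification head over the mean-pooled sequence stands in
# for the head trained on top of the frozen encoder-only LLM.
W = rng.normal(size=(d, n_classes)) * 0.01
logits = sequence.mean(axis=0) @ W
probs = np.exp(logits) / np.exp(logits).sum()
```

The key point the sketch captures is that the LLM never sees raw images or waveforms: it receives fixed-width embeddings in the same token space as the text, and only the head is trained.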

Text Truncation (R4): We truncated text to the first 100 words (120 tokens) during pre-training, as 97% of CXR and ECG reports were under 100 words. This speeds up training, but longer lengths could be used in the future with larger GPUs.

TMCL Considerations (R5): The InfoNCE loss used in TMCL was first proposed in Supervised Contrastive Learning. Our primary contribution to TMCL was applying it to modality-text pairs to ensure accurate binding and address concerns where identical texts (e.g., “The ECG is normal”) could lead to incorrect binding if not managed by TMCL. In Eq 1, t_j->z_j denotes text-to-modality loss, and z_j->t_j modality-to-text loss. Following CLIP, we use both losses in training to enforce binding consistency in both directions.
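The bidirectional scheme described above (using both the t_j->z_j and z_j->t_j directions, following CLIP) can be sketched as a symmetric InfoNCE loss. This is a hedged illustration, not the authors' code: function names and the temperature value are assumptions, and it omits TMCL's handling of identical paired texts as additional positives (as in Supervised Contrastive Learning).

```python
import numpy as np

def symmetric_info_nce(z, t, temperature=0.07):
    """CLIP-style bidirectional contrastive loss (sketch).

    z: (n, d) modality embeddings (e.g., CXR or ECG); row i pairs with t[i]
    t: (n, d) text embeddings
    Returns the mean of the modality-to-text and text-to-modality losses.
    """
    logits = (z @ t.T) / temperature  # (n, n); positives on the diagonal

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over rows; positive = diagonal entry.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    loss_z_to_t = diag_cross_entropy(logits)    # L^(z_j -> t_j)
    loss_t_to_z = diag_cross_entropy(logits.T)  # L^(t_j -> z_j)
    return 0.5 * (loss_z_to_t + loss_t_to_z)
```

Averaging the two directions enforces the binding consistency mentioned in the rebuttal: each modality embedding must retrieve its paired text and vice versa.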

  1. S Bhave et al., Deep learning to detect left ventricular structural abnormalities in chest X-rays, EHJ 2024.
  2. L Jiang et al., Health system-scale language models are all-purpose prediction engines, Nature 2023.
  3. S Murray, Talking about Large Language Models, CACM 2024.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a novel contrastive learning method, MEDBind, designed to integrate embeddings from three distinct medical modalities: CXRs, ECG, and medical text. The core of the methodology leverages Text-Modality Contrastive Loss (TMCL) and Edge-Modality Contrastive Loss (EMCL), aiming to enhance the interoperability of these modalities. The concerns raised about the novelty of the encoder components and the specific benefits of the multimodal approach were addressed adequately in the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed framework supports multi-modality data well, such as ECG, X-ray, and text, and the experiments validate its effectiveness, though the technical novelty is marginal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



