Abstract

With the rapid development of multimodal large language models (MLLMs), especially their capacity for visual chat through refer-and-ground functionality, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer-and-ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dataset dedicated to the biomedical domain that integrates refer-and-ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and to generate instruction data from the associated text using ChatGPT. Additionally, we introduce a Refer-and-GrounD Multimodal Large Language Model for Biomedicine (BiRD), trained on this dataset with multi-task instruction learning. Extensive experiments corroborate the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants. The repository is at https://github.com/ShawnHuang497/BiRD
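
As a hedged illustration of the key idea above, the sketch below shows how a segmentation mask might be converted into a bounding-box string that can seed grounded QA generation; the `mask_to_bbox` helper and the `<box>` tag notation are hypothetical choices for this sketch, not taken from the paper.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Return the tight (x1, y1, x2, y2) box around a binary mask's foreground."""
    ys, xs = np.nonzero(mask)  # row (y) and column (x) indices of the object
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy mask standing in for a segmentation label; the <box> format below is
# illustrative only and not claimed to be the paper's exact notation.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:150, 60:120] = 1  # a hypothetical "lesion" region
x1, y1, x2, y2 = mask_to_bbox(mask)
print(f"<box>({x1},{y1}),({x2},{y2})</box>")  # -> <box>(60,100),(119,149)</box>
```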

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1279_paper.pdf

SharedIt Link: https://rdcu.be/dY6f7

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_38

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1279_supp.pdf

Link to the Code Repository

https://github.com/ShawnHuang497/BiRD

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hua_AReferandGround_MICCAI2024,
        author = { Huang, Xiaoshuang and Huang, Haifeng and Shen, Lingdong and Yang, Yehui and Shang, Fangxin and Liu, Junwei and Liu, Jia},
        title = { { A Refer-and-Ground Multimodal Large Language Model for Biomedicine } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {399--409}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper devises a refer-and-ground dataset for biomedical images, namely Med-GRIT-270k, which contains 270k QA pairs. It spans eight distinct medical imaging modalities, with text explanations curated by ChatGPT. The authors then propose a Refer-and-GrounD Multimodal Large Language Model for Biomedicine (BiRD) trained on this dataset, and the experiments demonstrate the efficacy of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The creation of the Med-GRIT-270k dataset is an important contribution, as it addresses the gap in biomedical refer and ground datasets and provides a new dimension of annotation.

    The paper also proposes a model for this specific task, with QwenLM-7B as the backbone. The reported performance shows the model can potentially work in this domain.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The term Refer-and-Ground is not clear to me. The data used to train current state-of-the-art LVLMs can also exhibit this refer-and-ground nature, and many papers already use bounding-box coordinates to tune their models. How does this paper differentiate itself from those methods?

    The model is just Qwen-7B fine-tuned on the curated dataset, so there is limited novelty in this part. An analysis of how to better utilize the dataset is desired but does not appear in the paper.

    The organization of the figures needs some improvement. In Figure 4, it would be better if you could provide the image used in this example.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    You may consider providing a more detailed explanation or a dedicated section that clearly defines what sets the “Refer-and-Ground” concept apart from similar existing approaches in LVLMs.

    To augment the novelty of your work, explore and discuss alternative models or innovative fine-tuning techniques that could uniquely leverage your dataset.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The most significant contribution of this work is the Med-GRIT-270k dataset, which is a substantial addition to biomedical multimodal research. However, the text within this dataset is generated using ChatGPT. The model presented primarily adapts the existing QwenLM-7B architecture, applying it to the new dataset without significant modifications. Consequently, while the dataset itself is a valuable resource, the overall contribution of the paper is limited.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose the Med-GRIT-270k dataset and the BiRD model in the field of biomedical multimodal large language models. The dataset integrates referring and grounding tasks with multimodal conversations in biomedicine and was created via a ChatGPT-based data-generation process. The BiRD model, fine-tuned on this dataset using multi-task instruction learning, offers enhanced capabilities for precise multimodal interactions in medical settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The creation of the Med-GRIT-270k dataset is a primary strength. This dataset is specifically designed for the biomedical field to integrate referring and grounding with multimodal conversations.

    • The BiRD model represents an interesting application of multi-task instruction learning in the context of biomedical MLLMs. It is specifically tuned to handle fine-grained, multi-modal interactions, which are crucial for tasks such as detailed medical analysis and patient interaction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The generation of image captions: Even with advanced training, language models like ChatGPT might not always produce medically accurate or sufficiently detailed descriptions required for clinical use. Medical image captioning requires a high level of precision due to the potential implications of misinterpretation. If the language model lacks sufficient training on specific medical data or is not continually updated with new medical information, the risk of generating incorrect or misleading captions increases.

    • Object hallucination issues: Despite the authors suggesting that this issue is attributed to the fact that the model’s visual encoder is frozen and lacks knowledge of medical imaging, the object hallucination issues mentioned in the paper might also extend to the text generation component, such as the captions generated by ChatGPT. Since ChatGPT is not specifically fine-tuned with a comprehensive, high-quality dataset of medical images and corresponding captions, it may lack the requisite domain-specific knowledge.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors claimed that they will release the source code and dataset in the future, but not upon acceptance of the submission.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Discuss more about how to control the quality of ChatGPT generated captions for medical images.

    • Address the issue of object hallucination more comprehensively by discussing its implications in a clinical context. Propose specific methodological improvements or new training strategies to minimize hallucination, such as dynamic updating of the model with new clinical data, enhanced validation techniques, or improved model interpretability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strengths of the paper slightly outweigh its weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors provided more information about the quality of text generated by ChatGPT and the hallucination issue. I will keep my positive score.



Review #3

  • Please describe the contribution of the paper

    In this paper, authors have developed the Med-GRIT-270k dataset, which is claimed to be the first large-scale dataset in biomedicine that integrates referring, grounding, and conversations. The dataset contains 270k question-and-answer pairs spanning 8 different medical imaging modalities. They have also introduced BiRD, the first Biomedical Refer-and-Ground Multimodal Large Language Model, which is fine-tuned on the Med-GRIT-270k dataset using multi-task instruction learning. The paper demonstrates the effectiveness of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The Med-GRIT-270k dataset is a novel contribution, as it addresses the lack of a dedicated refer and ground dataset for biomedical images.
    • The BiRD model is the first MLLM in biomedicine that can handle fine-grained interactions like referring and grounding, which are essential for intelligent biomedical assistants.
    • The paper provides a comprehensive evaluation of the BiRD model across various tasks and modalities, demonstrating its versatility.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper does not provide a detailed comparison with existing biomedical MLLMs, as they do not have referring and grounding capabilities. Any other related baselines could be helpful.
    • The issue of object hallucination in the BiRD model is mentioned, but not thoroughly addressed or analyzed.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors mention that they will release the Med-GRIT-270k dataset and a comprehensive codebase, which will facilitate reproducibility and further research in this area.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Should provide a more detailed analysis and discussion on the issue of object hallucination, as it is a significant challenge for MLLMs in the biomedical domain.
    • Could explore ways to mitigate object hallucination, such as fine-tuning the visual encoder or incorporating additional constraints during training.
    • Experiments are comprehensive and cover various tasks and modalities, providing a thorough evaluation of the BiRD model’s capabilities.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major strengths of the paper lie in the development of the Med-GRIT-270k dataset and the BiRD model, which address a significant gap in the field of biomedical MLLMs. The comprehensive evaluation and the promising results demonstrate the potential of the proposed approach for developing intelligent biomedical assistants. While there are some weaknesses, such as the lack of a detailed comparison with existing methods and the issue of object hallucination, the authors have acknowledged these limitations and plan to release the dataset and codebase for further research.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors explained the limitations regarding hallucinations and baseline models.




Author Feedback

We sincerely thank the reviewers for their thoughtful reviews and are pleased they appreciate the significance (R#1, R#3, R#5), novelty (R#1, R#5), and effectiveness (R#1, R#3, R#5) of our paper.

[R1-Q1] Other baselines. Thank you for the constructive suggestion. We will provide other related baselines, such as the LLaVA series and RadFM, evaluated on our dataset in the open-source repository.

[R1-Q2] Hallucination. Hallucinations are linked to both the model and the training data. Given that our primary contribution is the dataset, we included negative samples to mitigate this issue. We plan to further investigate it at the model level in future work. For more information, please refer to our response to [R5-Q2].

[R3-Q1] The meaning of referring and grounding. We apologize for any confusion caused by the definitions of these terms. We will include the following in the revised manuscript: “Referring” describes a user-input process in which, in addition to providing an image and a question, the user also specifies the exact image region they wish to inquire about, as in the ROC and RC tasks in Fig. 1 and the second turn in Fig. 2. “Grounding” denotes the model’s ability to pinpoint the specific location of the object queried by the user, as in the VG task in Fig. 1 and the first turn in Fig. 2. As reviewers R#1 and R#5 have commended, “These capabilities are essential for intelligent biomedical assistants.” However, previous biomedical large models did not possess these capabilities. As in the evolution of LVLMs in the general community, the primary challenge here is not the novelty of LVLM structures but the lack of detailed interaction datasets. Hence, we developed this dataset to address that gap in the field.
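
To make the distinction concrete, here is a minimal sketch of what a grounding turn and a referring turn could look like as training samples; the field names, wording, and `<box>` coordinate convention are assumptions for illustration, not the paper’s exact data format.

```python
# Hypothetical two-turn conversation illustrating both capabilities.
conversation = [
    {   # Grounding: the model localizes the queried object in its answer.
        "question": "Where is the tumor in this MRI slice?",
        "answer": "The tumor is located at <box>(62,108),(121,150)</box>.",
    },
    {   # Referring: the user points at a region; the model describes it.
        "question": "What is the structure at <box>(10,40),(90,130)</box>?",
        "answer": "That region corresponds to the left kidney.",
    },
]
```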

[R3-Q2] The novelty of the model. As mentioned above, our primary aim is to address the gap in fine-grained interaction data for biomedicine. Our model serves as a demonstration to the community of the successful application of this dataset, much as LLaVA-Med and PMC-VQA did for theirs. We urge more researchers to use this diverse data to endow LVLMs with additional capabilities.

[R3-Q3] We sincerely appreciate your meticulous review. We apologize for the oversight in the diagram and will correct it.

[R3-Q4] Reproducibility. Section 3 describes the data-construction pipeline, and the appendix provides the refined prompt templates in detail. This information should suffice for replication. Additionally, our data and code are being organized and will soon be open-sourced.

[R5-Q1] Quality of data. We adopted the following three strategies: a. We input structured information and specific details about image objects (category, coordinates, quantity) to GPT, unlike LLaVA-Med’s coarse-grained descriptions. b. Each time we generate data, we randomly apply templates customized by professional doctors, maximizing the use of GPT’s in-context learning and instruction-following capabilities. c. We adjust the input prompts to generate 500 data points for each of the eight modalities, which are then reviewed manually to ensure their quality. Only when the data from all modalities are deemed satisfactory do we finalize the form of the input prompts.
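
As a rough illustration of strategies (a) and (b), the following sketch assembles a ChatGPT prompt from structured object metadata using a randomly selected template; the template wording and the `build_prompt` helper are hypothetical, not the exact prompts used in our pipeline.

```python
import random

# Hypothetical doctor-written templates (strategy b); placeholders are filled
# from structured object metadata (strategy a). Both are illustrative only.
TEMPLATES = [
    "Modality: {modality}. Objects: {objects}. Write one referring QA pair "
    "that asks about the region given by each bounding box.",
    "Given a {modality} image containing {objects}, generate a grounding "
    "question whose answer states each object's box coordinates.",
]

def build_prompt(modality: str, objects: list[dict]) -> str:
    """Serialize category/box/count metadata into a randomly templated prompt."""
    obj_str = "; ".join(
        f"{o['category']} x{o['count']} at {o['bbox']}" for o in objects
    )
    return random.choice(TEMPLATES).format(modality=modality, objects=obj_str)

print(build_prompt("CT", [{"category": "liver lesion", "count": 2,
                           "bbox": [(60, 100, 119, 149), (30, 40, 55, 70)]}]))
```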

[R5-Q2] Hallucination. We deeply appreciate your constructive critique of our research. Although ChatGPT’s knowledge of medical imaging is limited, it has a substantial foundation in medical knowledge, and given the shortage of biomedical multimodal data, using multiple strategies to ensure high-quality data generation is worthwhile. We agree with you and consider that augmenting the dataset with additional clinical data and leveraging the LoRA fine-tuning technique for continual learning could provide a viable solution in future work. For more information, please refer to our response to [R1-Q2].

We sincerely thank all reviewers again for your kind attention and thoughtful reviews!




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have addressed the reviewers’ concerns. All three reviewers recommend accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


