Abstract

Multimodal large language models (MLLMs) have been explored in the Chinese medical domain for comprehending complex healthcare data. However, due to flaws in training data and architecture design, current Chinese medical MLLMs suffer from several limitations: cultural biases from English machine translations, limited comparative ability due to single-image input, and difficulty in identifying small lesions in low-resolution images. To address these problems, we first introduce a new instruction-following dataset, Chili-Joint (Chinese Interleaved Image-Text Dataset for Joint Diagnosis), collected from a hospital in mainland China, avoiding the cultural biases and errors caused by machine translation. Besides single-image input, Chili-Joint also contains multiple images obtained at various intervals during a patient's treatment, thus facilitating an evaluation of the treatment's outcomes. We further propose a novel HiA (High-resolution Instruction-aware Adapter) to incorporate high-resolution, instruction-aware visual features into LLMs, enabling current MLLMs to observe small lesions and perform comparative analysis. Extensive experiments on Chili-Joint demonstrate that HiA is a plug-and-play method that improves the performance of current MLLMs for medical analysis. The code is available at https://github.com/xmed-lab/HiA.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1207_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/xmed-lab/HiA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Din_HiA_MICCAI2024,
        author = { Ding, Xinpeng and Chu, Yongqiang and Pi, Renjie and Wang, Hualiang and Li, Xiaomeng},
        title = { { HiA: Towards Chinese Multimodal LLMs for Comparative High-Resolution Joint Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors collect a new dataset: Chinese Interleaved Image-Text Dataset for Joint Diagnosis (Chili-Joint). The authors propose a High-resolution instruction-aware Adapter (HiA) for large language models (LLM).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The effort to prepare a new Chinese image-text joint diagnosis is commendable.
    2. The authors conduct experiments on four instruction-following scenarios, which is adequate in terms of instruction-following tasks in comparison to previous work on medical LLMs.
    3. The writing and formatting are good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. As for the clinical contribution, the authors claim to collect a Chinese dataset (Chili-Joint) to avoid cultural biases and errors caused by machine translation. However, they do not analyze (or even show examples of) how current machine-translation-based methods fail on Chili-Joint or limit clinical application.
    2. The well-known public dataset MIMIC-CXR has images with a resolution over 1000x1000 and a subset of longitudinal data (with pairs of images for comparative diagnosis). As a result, MIMIC-CXR is suitable for evaluating HiA. The authors do not conduct experiments on MIMIC-CXR, making the evaluation of HiA less comprehensive.
    3. Novelty of HiA: there are adapters for high-resolution images [7] and instruction-following [a], and learning-based methods for comparative diagnosis [b, c]. The lack of comparison to these SOTA methods makes it hard to judge the novelty of HiA.
    4. The relationship between the proposed method and Chili-Joint is unclear. The proposed method could be implemented on public datasets and compared to other methods.

    [a] LLaMA-Adapter V2: Parameter-efficient visual instruction model
    [b] Hierarchical Vision Transformers for Disease Progression Detection in Chest X-Ray Images
    [c] MuSiC-ViT: A multi-task Siamese convolutional vision transformer for differentiating change from no-change in follow-up chest radiographs

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. The resolution of the input images used for the results in Tables 1 and 2 is unclear.
    2. Details of the baseline: the authors fine-tuned the backbone LLM on the Chili-Joint dataset, but how they fine-tuned it is unknown (LoRA, adapter, prompt-tuning?). HiA introduces additional parameters for tuning, so without details of the baseline, it is hard to judge the improvement from HiA.
    3. Implementation details: it is unknown whether the authors tune HiA and the baseline on the four subtasks (Des, Dis, Loc, Com) separately. If they are tuned jointly, given the discrepancy in input lengths, what is the padding strategy?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Comparison to vanilla methods to demonstrate the superiority of the method: vanilla methods can also be used for high-resolution input and instruction-following. High-resolution: 1) learnable down-sampling layers before the input to the frozen ViT; 2) a learnable high-resolution CNN in place of the frozen ViT for V_i and V_j. Instruction-following: append learnable queries Q in front of the input prompts to the LLM.
    2. Typos. For example: in the abstract, "high-resolutioninstruction-aware" is missing a space; in Instruction-Aware Extractor, "{… V_i …}" should be V_i^l?; in Table 1, "all models are fine-tuning (should be fine-tuned) on Chili-Jointin (should be Chili-Joint in) the same setting."
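The two vanilla baselines suggested above can be sketched as follows (a hypothetical illustration with assumed shapes and names; a truly learnable down-sampler would use a strided convolution rather than the fixed average pooling shown here):

```python
import numpy as np

def downsample_patches(feats, grid, target):
    """Pool a (grid*grid, d) high-res patch grid down to (target*target, d)
    by block-average pooling, so it fits a frozen ViT's expected length.
    (A learnable version would replace the mean with a strided conv.)"""
    d = feats.shape[-1]
    s = grid // target
    x = feats.reshape(target, s, target, s, d)
    return x.mean(axis=(1, 3)).reshape(target * target, d)

def prepend_learnable_queries(prompt_emb, queries):
    """Instruction-following baseline: prepend learnable query tokens Q
    to the prompt embeddings fed into the frozen LLM."""
    return np.concatenate([queries, prompt_emb], axis=0)

rng = np.random.default_rng(0)
hi_res = rng.normal(size=(32 * 32, 64))      # 32x32 high-res patch grid
pooled = downsample_patches(hi_res, grid=32, target=16)
seq = prepend_learnable_queries(rng.normal(size=(10, 64)),   # prompt tokens
                                rng.normal(size=(4, 64)))    # learnable Q
print(pooled.shape, seq.shape)  # (256, 64) (14, 64)
```

Both tricks keep the backbone frozen, which is why the reviewer asks for them as parameter-matched baselines against HiA.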
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The effort to prepare a new Chinese image-text joint diagnosis is commendable. However, the clinical significance of Chili-Joint is not clearly explained. Besides, the relationship between the proposed method and Chili-Joint is unclear. As a result, the proposed methods seem to be general yet lack comparison to the SOTA methods and on the public datasets (MIMIC). I believe this work can be improved in the later journal version, however, the submitted conference paper could be rejected, dependent on rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After rebuttal, the authors address most concerns. This paper has more merits than weaknesses.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a new method, “HiA” (High-resolution instruction-aware Adapter), designed to enhance Chinese MLLMs in medical applications. The HiA method addresses critical issues associated with low-resolution image analysis in existing models, enabling improved diagnostic capabilities. The results demonstrate a significant enhancement in performance, making it a substantial advancement for Chinese medical MLLMs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Solid Experiments and Ablation Study: The results significantly surpass those of competitors, demonstrating the robustness of the proposed method through comprehensive testing.
    • Intuitive Methodology: The method is logical and well-conceived, aligning well with the needs and constraints of the medical field.
    • Relevance to Chinese Medical Context: The approach is particularly suited to Chinese medical use cases, suggesting potential for widespread applicability in these settings.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Metrics Used for Evaluation: In Table 1, the reliance solely on BLEU and METEOR metrics may not adequately capture the performance nuances in a medical context. Inclusion of accuracy or precision-based metrics could provide a more comprehensive evaluation of the method’s effectiveness.
    • Lack of Detailed Statistical Support: In the ablation study, the claim that “without HR, the model would ignore the small lesions” needs to be substantiated with more detailed statistical analysis or examples. Merely citing BLEU and METEOR scores does not sufficiently support such a specific claim.
    • Performance Drop in Ablation Study: The performance drops noted (-1.7 and -3.2) in the ablation study should be discussed more explicitly to understand their impact on the model’s utility.
    • Typos: on page 2, it should be GPT-3.5, not GPT3.4.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    If the dataset could be open-sourced somehow, that would be a huge contribution.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Enhance Evaluation Metrics: Integrate additional metrics such as accuracy or precision to provide a more comprehensive evaluation of the model’s performance. This will help better assess the method’s efficacy in a medical context where specific diagnostic accuracy is crucial.
    • Strengthen Statistical Support: Provide more detailed statistical evidence or case studies to substantiate claims, particularly regarding the model’s ability to detect small lesions when high-resolution (HR) features are utilized. This should include not only qualitative descriptions but also quantitative analysis that clearly demonstrates the impact of HR features.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s structure is standardized and avoids major errors, but it is a little plain. Additionally, the metrics used for evaluation are not robust, which influences the overall score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper addresses the issue of cultural biases that occur when translating English medical QA datasets to Chinese, as well as the limited image resolution and single-image-input limitations of existing MedQA models. To this end, the authors propose Chili-Joint, a new Chinese medical image question-answering dataset collected from a hospital in mainland China, which consists of multiple images captured at various times during treatment and their corresponding textual descriptions. They also propose an architectural novelty, HiA (High-resolution instruction-aware Adapter), which converts multiple images into high-resolution visual features and uses an instruction-aware extractor to capture salient features and inject them back into the LLM. HiA is a plug-and-play method that is training-efficient, since only HiA is trained while all other model parameters are kept frozen.
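For intuition, the instruction-aware extraction step described above could look roughly like the following sketch (our own illustration, not the authors' released code; the single-head attention, shapes, and names are all assumptions): learnable instruction-query tokens cross-attend to high-resolution visual features and pool them into a few tokens for the frozen LLM.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instruction_aware_extract(hi_res_feats, instr_queries, W_q, W_k, W_v):
    """Cross-attention: instruction queries (N_q, d) attend over
    high-resolution patch features (N_patches, d) and return (N_q, d)
    condensed tokens to be injected into the LLM's input sequence."""
    Q = instr_queries @ W_q
    K = hi_res_feats @ W_k
    V = hi_res_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V

# toy example: 256 high-res patch features, 8 instruction queries, d = 32
rng = np.random.default_rng(0)
d = 32
feats = rng.normal(size=(256, d))
queries = rng.normal(size=(8, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
tokens = instruction_aware_extract(feats, queries, W_q, W_k, W_v)
print(tokens.shape)  # (8, 32)
```

The key property is that only the queries and projection matrices would be trained, consistent with the plug-and-play, frozen-backbone design described in the review.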

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper discusses an important issue: low-resolution image encoding in medical image question-answering LLMs. The authors propose an easy-to-use, plug-and-play high-resolution image encoder and adapter (HiA) that does not require end-to-end architecture training.
    2. The versatility of HiA is demonstrated in Table 1. All the baseline models show improvements on the Chili-Joint dataset with HiA.
    3. The extensive ablation studies in Tables 2 and 4 on the different proposed components (instruction-aware tokens, high-resolution tokens), their lengths, and the number of layers are insightful. They help in understanding the intuition behind the model's architectural design decisions.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors only report results with the evaluation metrics BLEU and METEOR, which have their own limitations. It might be better to also evaluate the results with semantic metrics such as BERTScore and other captioning metrics such as ROUGE, CIDEr, etc., to validate the trend.
    2. The paper suffers from writing and grammatical errors and a lack of space between words in many instances, which hinders reading. I have pointed out some of them in the minor-errors section; it requires a round of tidying up.
    3. The paper evaluates the existing state-of-the-art MedQA models and their proposed HiA counterparts only on the novel Chili-Joint dataset. It would be essential to see their performance on other SOTA MedQA datasets to check whether this trend of results is consistent with Chili-Joint, because the images in Chili-Joint might have smaller lesions or areas of interest than those in SOTA datasets, thereby providing a substantial increase in scores with high-resolution input.
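As a concrete illustration of one metric suggested above, ROUGE-L can be computed from the longest common subsequence (LCS) of candidate and reference; a minimal sketch (a simplified implementation with whitespace tokenization, not the official scorer):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)

print(round(rouge_l("small lesion in the left knee",
                    "a small lesion in the left knee joint"), 3))  # 0.836
```

Unlike BLEU's fixed n-grams, the LCS rewards in-order matches of any gap, which is one reason reviewers often ask for it alongside BLEU/METEOR.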
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to weakness section for areas to improve the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I find the paper’s contributions to Medical Question Answering significant. The authors tackle an important problem of low resolution encoding of images in MedQA. It would be good to show how the existing SOTA datasets suffer from this issue and to what extent.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for addressing my concerns. To show robustness of the proposed model, the authors have mentioned in their feedback (8) that they will add evaluation with Qilin-Med-VL dataset. I am convinced that this will significantly improve the strength of the overall paper.




Author Feedback

We thank the reviewers for their valuable feedback. R1, R3, and R4 appreciate our extensive experiments and ablation study. R1 and R4 find our approach logical and suitable for the medical field, while R1 and R3 commend our method’s potential, writing, and new dataset. The major concerns are the metrics, missing details, and some misunderstandings about the method, dataset, and experiments, which are clarified in the following.

—R1, R4— (1) Metrics: Thank you for the valuable suggestion. Due to the page limit, we selected only two metrics for evaluation; in the future journal version, we will add the metrics suggested by R1 and R4.

—R1— (2) Detailed statistical support: The performance drop in BLEU indicates that the model tends to overlook small lesions to some extent. For the description and disease tasks, the regions of interest are normally small, so a performance drop indirectly shows that they are not being adequately identified. We will add more metrics, such as disease accuracy, for better evaluation.

(3) More discussion of the performance drops (-1.7; -3.2) in the ablation study: For ‘-1.7’, the IA module captures more useful information, so removing IA degrades performance. For ‘-3.2’, without HR information, the model fails to detect small lesions, producing descriptions that overlook these lesions and consequently degrading performance. We will add these discussions in a later version.

—R3— (4) Clinical value: Our clinical value lies in avoiding the cultural biases and errors caused by machine translation. Prior work, such as Huatuo-26M (Li, Jianquan, et al.) and Qilin-Med-VL [19], found that translating from English introduces biases and inaccuracies, compromising robustness. Hence, it is crucial to develop a native Chinese dataset, which is the focus of our paper. This will be discussed in the revised version.

(5) Evaluation on MIMIC-CXR: Our main goal is to design a robust MLLM for Chinese Medical QA. MIMIC-CXR is unsuitable for our evaluation for two reasons. First, MIMIC-CXR is an English dataset, and translating it to Chinese introduces cultural biases and errors, which we aim to avoid (see point (4) in our rebuttal). Second, MIMIC-CXR only contains reports, limiting it to report generation tasks, whereas our method focuses on Medical QA. Although we could create instruction-following data from the reports, this is complex and not our paper’s main objective. Finally, we compare our method with more SOTA MLLMs, such as RadFM and MedFlamingo, demonstrating superior performance.

(6) Novelty of our method (comparison to [7, a, b, c]): There are some misunderstandings. [b, c] use traditional methods limited to comparative analysis, while we use MLLMs for medical VQA, covering more tasks. [a] adds learnable bias and scale factors for instruction cues, but the visual context remains instruction-irrelevant. The closest work is [7], which is designed for autonomous driving without considering the challenges in medical images, e.g., comparative analysis and instruction-related extraction. In contrast, our method addresses all the problems of [7, a, b, c]; we also conduct experiments to prove this and will add them in a later version.

(7) Relation between the method and the dataset: The dataset requires the abilities of capturing high-resolution information, instruction following, and comparative analysis, which align well with our method. We also evaluate our method on Qilin-Med-VL and achieve better performance than SOTA methods.

—R4— (8) Evaluation on other datasets: We propose HiA for Chinese high-resolution, instruction-following medical data, which cannot be fully evaluated on current MedQA datasets; thus, we introduce Chili-Joint for comprehensive evaluation. In addition, we conduct experiments on the Qilin-Med-VL dataset and achieve results consistent with those on Chili-Joint. Detailed experiments will be added.

—R1, R3, R4— (9) Typo errors: We will correct them in a later version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper presents a new method, “HiA” (High-resolution Instruction-aware Adapter), aimed at enhancing Chinese Multimodal Large Language Models (MLLMs) in medical applications. HiA tackles key issues related to low-resolution image analysis in current models, thereby improving diagnostic capabilities. The results show a good performance improvement, marking a substantial advancement for Chinese medical MLLMs.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


