Abstract

Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology and hinges on integrating multiple clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnosis as a multi-class classification problem, limiting diagnoses to a predefined, closed set of curated answers without reasoning about the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) that employs large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce an LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata and to generate clinical report summaries. Finally, we refine the LLMs’ reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses. The project is available at \url{https://danielchyeh.github.io/MDPipe/}.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0298_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0298_supp.pdf

Link to the Code Repository

https://danielchyeh.github.io/MDPipe/

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yeh_Insight_MICCAI2024,
        author = { Yeh, Chun-Hsiao and Wang, Jiayun and Graham, Andrew D. and Liu, Andrea J. and Tan, Bo and Chen, Yubei and Ma, Yi and Lin, Meng C.},
        title = { { Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. A well-defined research question that addresses the closed-set answers and missing-reasoning issues in previous methods.
    2. Development of a multi-modal diagnostic pipeline (MDPipe), including a visual translator and a Large Language Model (LLM)-based summarizer.
    3. Demonstration that MDPipe outperforms existing methods, including GPT-4, and offers clinically sound rationales for its diagnoses.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The presentation is commendable; the article is fluent, understandable, and features aesthetically pleasing and clear figures.
    • Introduction of a visual translator to address the challenges of accurately representing visual data in MLLMs.
    • Collection of clinical knowledge using real-life clinician diagnoses to refine the LLMs with nuanced, domain-specific insights.
    • Conduct of a User (Clinician) Preference Study.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Ocular surface disease encompasses more than just dry eye and meibomian gland dysfunction; it also includes corneal diseases, etc. Given that the experimental section is limited to meibography images, other modalities such as AS-OCT and slit-lamp images of ocular surface disease have not been experimented with or discussed. Consider changing the title to “Eyelid Disease.”
    2. Details on the visual translator are missing. In the final paragraph of Section 2.2, it is mentioned, “With the visual translator, we were able to precisely measure morphological features.” How was the translator trained, and how accurate are the measurements with respect to different U-Net architectures?
    3. The implementation details in the refinement phase are unclear. Given that only four NVIDIA GeForce RTX 3090 GPUs were used, reviewers may be curious about how to save memory if full-parameter fine-tuning was performed, or if Parameter Efficient Fine-Tuning (PEFT) methods like adapters or LoRA were used.
    4. The train/test split of the merged dataset for the experiment is not provided. The unknown number of test cases makes it difficult to assess the quality of the results.
    5. Blepharitis and Meibomian Gland Dysfunction (MGD) may lead to Dry Eye (DE). If a patient has MGD, it implies they also suffer from evaporative DE. The classification tasks, along with their evaluation metrics, may be flawed.
    6. A comparison with LLaVA and GPT-4-Vision is missing; these should, I believe, be the main competitors according to Figure 2(a).
    7. A comparison with non-LLM solutions, such as directly using image representation and metadata to train a Multi-Layer Perceptron (MLP) classifier, is lacking.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The data processing phase heavily relies on GPT-4, but GPT-4’s outputs can vary with different random seeds. How is this uncertainty controlled?
    • How was the U-Net trained, and which dataset was used? How does the article ensure that the training data for the vision translator did not leak into the subsequent QA-based LLM evaluation?
    • Are the experimental data publicly available? It appears that the currently publicly available CRC and DREAM datasets are not the meibography images mentioned in the paper. It would be best to change the dataset names to avoid confusion. Please also note the data acquisition method for meibography images in the article.
    • CRC: https://paperswithcode.com/dataset/crc
    • DREAM: https://paperswithcode.com/dataset/dream
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Technical details should be further clarified, including a formal description of MDPipe, the vision translator, the training objectives of the LLMs, and the prompt template for the GPT-4 Report Summarizer.
    • Experimental details, such as dataset splitting, should be provided. If possible, include an evaluation of LLaVA and GPT-4-Vision.
    • A brief section on related work is necessary, primarily focusing on introducing medical-domain LLMs like PMC-LLaMA and Med-Alpaca, and medical-domain MLLMs like Med-Flamingo [1] and Med-PaLM Multimodal [2].

    References: [1] Moor, Michael, et al. “Med-flamingo: a multimodal medical few-shot learner.” Machine Learning for Health (ML4H). PMLR, 2023. [2] Tu, Tao, et al. “Towards generalist biomedical AI.” NEJM AI 1.3 (2024): AIoa2300138.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic and research question are novel, the idea is practical and feasible, and the good presentation contributes to the paper’s high quality. However, the lack of key methodological and evaluative details makes it challenging to assess its contribution. The most concerning aspect is the unfair experimental setting (missing MLLM baselines). Additionally, the target classification categories of DE, MGD, and Blepharitis may not be suitable tasks since MGD and Blepharitis can lead to DE. If the aforementioned concerns are adequately addressed, I would be pleased to increase my rating.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for the details in the rebuttal; they have addressed my concerns about the methodological and evaluation details. However, considering the rigor of the paper, I maintain that:

    1. The title should be narrowed to eyelid diseases to avoid overstating the article’s contribution, because the article only preliminarily evaluates three diseases within a subset of ocular surface diseases.
    2. In defining the experimental categories, the correlation between MGD and DE should be clarified (e.g., what proportion of evaporative DE cases are caused by MGD), since the labels are marked as mutually exclusive.
    3. The comparison with other multimodal LLMs is an important experiment, although, limited by policy, these experiments cannot be presented at the rebuttal stage.

    For the above reasons, I can only raise the score to weak accept, but I look forward to the authors addressing the concerns about the title and the multimodal LLM baselines in the camera-ready version.



Review #2

  • Please describe the contribution of the paper

    The paper describes a visual translator/clinical summarizer capable of diagnosing dry eye based on Meibomian Gland Dysfunction present in meibography images. The resulting model is called MDPipe and consists of three components: (1) a visual translator to quantify meibography images, (2) an LLM-based summarizer that generates clinical report summaries, and (3) clinical knowledge from real-life clinician diagnoses with domain expertise used to refine the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The multi-modal nature of the model is novel, utilizing image data, text summaries, as well as real-time clinician input. The use of meibography images is also quite interesting, as this is not a commonly observed modality in ophthalmology papers. The ablation study and clinician user study also add to the paper quality.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The introduction is a bit hard to follow; I did not catch the meaning of NIKBUT until it appeared in one of the later figures, even though it first appears in the first figure, so it would be useful to define it in the early text as well. Also, the models used are pretrained large language models, so some description of their fine-tuning for this medical dataset would help to enhance quality/reproducibility, especially since no other code is provided.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The models used are pretrained large language models, so some description of their fine-tuning for this medical dataset would help to enhance quality/reproducibility, especially since no other code is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The introduction is a bit hard to follow; I did not catch the meaning of NIKBUT until it appeared in one of the later figures, even though it first appears in the first figure, so it would be useful to define it in the early text as well. Also, the models used are pretrained large language models, so some description of their fine-tuning for this medical dataset would help to enhance quality/reproducibility, especially since no other code is provided.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The introduction is a bit hard to follow; I did not catch the meaning of NIKBUT until it appeared in one of the later figures, even though it first appears in the first figure, so it would be useful to define it in the early text as well. Also, the models used are pretrained large language models, so some description of their fine-tuning for this medical dataset would help to enhance quality/reproducibility, especially since no other code is provided.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my main concerns, and it appears that fellow reviewer concerns have been mostly addressed, but given a few more changes have been asked for the camera-ready version, I am keeping my score at ‘weak accept’ in hopes that authors will make these requested changes in the final version.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a method that enhances traditional diagnostics by converting visual data into quantifiable morphology that is then analyzed in conjunction with clinical metadata. This data is further synthesized by an LLM-based summarizer to produce insightful clinical reports. The paper validates this approach with robust evaluations demonstrating that MDPipe surpasses current diagnostic standards, providing clinically sound rationales and improving the accuracy of ocular disease diagnoses.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces MDPipe, a novel multi-modal diagnostic pipeline that integrates visual data from meibography images and clinical metadata using a visual translator for enhanced diagnostic precision in ocular surface diseases. It employs GPT-4 to synthesize this data into comprehensive clinical reports, demonstrating innovative application and robust clinical feasibility. MDPipe’s effectiveness is validated through extensive comparative evaluations against existing diagnostic methods, showing superior accuracy and clinical relevance. Additionally, the model’s reasoning capabilities are refined with real-life clinician insights, further improving its diagnostic performance and making it a promising tool for medical practice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the integration of multimodal data and LLMs is commendable, the core concepts of employing image-to-data translation and machine learning for diagnostics are well-established. Previous studies have explored the quantification of meibomian gland morphology using advanced image processing techniques, which may reduce the perceived novelty of the visual translator component.
    2. The paper lacks detailed descriptions of the algorithms used in the visual translator and LLM-based summarizer, and it does not mention whether the pipeline or any components are open-sourced. Greater transparency in methodology and access to the source code would be beneficial for validation and comparison with existing models.
    3. The use of GPT-4 and specific pre-trained models might impose limitations on accessibility and adaptability in broader clinical practice. This dependency on proprietary tools could hinder adaptation in varied clinical environments where such resources are unavailable. Additionally, investigating the use of offline LLMs like LLaMA could address concerns about data leakage and enhance data security.
    4. Relevance and Comparison to Classification Models: As the task inherently involves classification, the significance of using LLMs over more traditional classification models is not clearly justified. Comparing MDPipe’s performance directly to established classification models could better highlight the advantages or necessity of employing LLMs in this context.
    5. The lack of a diverse dataset evaluation might raise concerns about the model’s performance generalization.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper should enhance its reproducibility by providing more detailed documentation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Would be good to open source the code.
    2. While the use of LLMs is innovative, a direct comparison with traditional classification models could clarify the advantages of your approach. This would help justify the use of more complex LLMs over simpler, possibly more interpretable models.
    3. Addressing how MDPipe performs across different datasets could better underscore its robustness and applicability in diverse clinical settings.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation for a conditional acceptance of this paper, pending the authors’ rebuttal, is based on several key considerations:

    The paper is well-structured and clearly written. The application of LLMs to integrate and interpret complex multimodal data in clinical diagnostics is promising. The paper does present some limitations, such as not being open-sourced and lacking a direct comparison with more traditional classification models. If the authors can provide compelling responses to these issues in their rebuttal, particularly concerning the model’s comparability and accessibility, I would be inclined to support acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Based on the information contained in this paper, the authors need to open-source the code, provide more information about, e.g., the segmentation model used, and include more comparisons. I recommend borderline weak accept.




Author Feedback

We thank the reviewers for their insightful suggestions. We would like to address the major critiques as follows:

[R1, R3, R4] LLM fine-tuning implementation: The implementation leverages the TRL (Transformer Reinforcement Learning) GitHub repo to fine-tune Llama 2. We utilized the TRL library with 4-bit QLoRA for efficient training. Fine-tuning was conducted with a batch size of 4 per device for a maximum of 10,000 steps, with a learning rate of 2e-4 on a constant scheduler and a maximum sequence length of 512. All other parameters align with those specified in the TRL GitHub repo.
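
For concreteness, a minimal sketch of this setup is shown below, assuming a Llama-2-7B checkpoint and a placeholder training file; the LoRA rank/alpha, dataset name, and text field are illustrative choices not stated in the rebuttal, and the exact SFTTrainer keyword placement differs across TRL versions (newer releases move max_seq_length and dataset_text_field into SFTConfig).

```python
# Minimal sketch of the described fine-tuning setup (TRL + 4-bit QLoRA).
# Base checkpoint, data file, text field, and LoRA rank/alpha are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"            # assumed Llama 2 checkpoint

bnb_config = BitsAndBytesConfig(                 # 4-bit QLoRA quantization
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                         lora_dropout=0.05)      # LoRA settings are assumptions

args = TrainingArguments(
    output_dir="mdpipe-llama2-sft",
    per_device_train_batch_size=4,               # batch size 4 per device
    max_steps=10_000,                            # maximum 10,000 steps
    learning_rate=2e-4,                          # constant LR of 2e-4
    lr_scheduler_type="constant",
)

# Hypothetical JSON file of clinical-report / diagnosis training examples.
dataset = load_dataset("json", data_files="clinical_reports.json")["train"]

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",                   # assumed field name
    max_seq_length=512,                          # maximum sequence length 512
)
trainer.train()
```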

[R3, R4] Visual translator implementation: The visual translator is implemented based on the Wizaron/instance-segmentation-pytorch GitHub repo. We utilized a ResNet50 backbone for our instance segmentation network, training separate models for the lower (553 images) and upper (486 images) eyelids on CRC data annotated with gland masks. Each model was trained on 256x256 resized images for 300 epochs with a batch size of 8 and a learning rate of 1.0, using the Adadelta optimizer with a weight decay of 1e-3. All other parameters align with those specified in the GitHub repo.
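
A rough sketch of this training configuration, under stated assumptions, follows; torchvision’s Mask R-CNN with a ResNet-50 backbone stands in for the repo’s instance segmentation network, and a single dummy image/target pair replaces the CRC meibography data. Only the optimizer, input size, and epoch count mirror the rebuttal.

```python
# Rough sketch of the visual translator training configuration (assumptions:
# torchvision Mask R-CNN as a stand-in model, dummy data in place of CRC images).
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(num_classes=2)     # background + Meibomian gland
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0, weight_decay=1e-3)

# Dummy training example standing in for a resized 256x256 meibography image
# annotated with one gland instance mask.
images = [torch.rand(3, 256, 256)]
targets = [{
    "boxes": torch.tensor([[10.0, 10.0, 60.0, 120.0]]),
    "labels": torch.tensor([1]),
    "masks": torch.zeros(1, 256, 256, dtype=torch.uint8),
}]

model.train()
for epoch in range(300):                         # 300 epochs per the rebuttal
    optimizer.zero_grad()
    loss_dict = model(images, targets)           # dict of losses in train mode
    loss = sum(loss_dict.values())
    loss.backward()
    optimizer.step()
```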

[R3, R4] LLM-based summarizer details: The summarizer uses the GPT-4 API via the Erol444/gpt4-openai-api GitHub repo, with a fixed seed to control output variability. We input the raw morphology data along with a structured task template (Fig. 3) to produce clinical reports.
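
For illustration, the sketch below shows how a fixed seed can be passed to a GPT-4 chat completion using the official openai Python SDK (rather than the wrapper repo named above); the prompt and morphology values are placeholders standing in for the structured task template in Fig. 3.

```python
# Illustrative sketch of a seeded GPT-4 call with the official openai SDK (v1).
# Prompt text and morphology values are placeholders, not the paper's template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

morphology = "gland density: 0.42; mean gland length: 3.1 mm; tortuosity: 1.2"

response = client.chat.completions.create(
    model="gpt-4",
    seed=1234,         # fixed seed to make sampling more reproducible
    temperature=0,     # further reduces run-to-run variation
    messages=[
        {"role": "system",
         "content": "You are a clinical report summarizer for ocular surface disease."},
        {"role": "user",
         "content": f"Summarize the following meibography morphology data:\n{morphology}"},
    ],
)
print(response.choices[0].message.content)
```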

[R3, R4] Make code, model & data public: We will release code and model weights on GitHub and Hugging Face after peer review. Datasets will be made available upon request for research purposes with appropriate Data Transfer & Use Agreements for sharing protected patient medical data.

[R3] Dataset train/test split: The train/test split is 90%/10%. The training set has 1,903 metadata-only and 1,257 image+metadata cases; the test set has 198 metadata-only and 155 image+metadata cases. There are 878 subjects in total.

[R3] MGD implies DE, classification tasks may be flawed: The presence of MGD does not imply that the patient has evaporative DE. While MGD is often an important etiological factor in evaporative DE, that is not true for all cases (Galor, 2014). MGD, DE and blepharitis are distinct conditions, albeit often with similar symptoms. In our model, we defined independent labels for these conditions based on the TFOS 2017 DEWS II Definition and Classification Report (Craig, 2017). DE is defined by loss of tear film homeostasis, ocular surface damage and symptoms. MGD is defined by ductal stenosis and quality of glandular secretion. Blepharitis is based on eyelid margin inflammation, debris, and collarettes.

[R3] Rephrasing title with “Eyelid Disease”: Slit lamp and OCT would certainly be within the scope of future work for additional ocular surface diseases; however, clinicians do not refer to MGD and blepharitis as “eyelid diseases”. The TFOS definition of “ocular surface” is that it comprises the structures of the eye and adnexa, including the cornea, conjunctiva, eyelids, eyelashes, tear film, lacrimal glands and Meibomian glands (Craig, 2017). Therefore, in alignment with the accepted literature, our method addresses a subset of ocular surface diseases.

[R4] Lack of diverse datasets, performance for different datasets: It is important to note that our datasets do come from diverse study populations. Data from the CRC and DREAM (a major clinical trial with 11 meibography sites across the US) are combined. Our distributions are mostly similar to US Census statistics for age, sex, and race. Our dataset also covers a wide range of disease severities. Work is ongoing to obtain additional data on much younger and older populations, male subjects, and those of African ethnicity.

[R1, R3, R4] Suggestions to revise and requests for additional material: We appreciate the reviewers’ insights; however, changes to the paper and the inclusion of new data, experiments, or results in this rebuttal are specifically prohibited by MICCAI guidelines.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, the reviewers reached a unanimous agreement. I also believe that the quality of the paper is good and it should be accepted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After the rebuttal, the reviewers reached a unanimous agreement. I also believe that the quality of the paper is good and it should be accepted.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


