Abstract

Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of an MLLM’s two key modules, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs’ fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on nine open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2156_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ruc-aimc-lab/FunBench

Link to the Dataset(s)

https://github.com/ruc-aimc-lab/FunBench

BibTex

@InProceedings{WeiQij_FunBench_MICCAI2025,
        author = { Wei, Qijie and Qian, Kaiheng and Li, Xirong},
        title = { { FunBench: Benchmarking Fundus Reading Skills of MLLMs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {283--293}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a fundus benchmark for MLLMs, consisting of hierarchical tasks and targeted evaluation modes. It evaluates various vision encoders and LLMs, finding that MLLMs heavily rely on their internal LLMs for fundus reading tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Benchmark: Creating a fundus benchmark is clinically significant and essential for fairly evaluating existing MLLMs. The tasks included in this benchmark are detailed and diverse.
    • Experiments: The evaluation encompasses the latest general and medical MLLMs. The three modes of MLLM evaluation are both novel and comprehensive.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Benchmark: The benchmark is not comprehensive and lacks some important tasks and commonly used datasets:
      • Tasks: This benchmark includes tasks across four levels (modality, anatomy, lesion, disease). However, image quality assessment is a crucial step in the screening process. Public datasets like DDR and DeepDRiD also provide image quality annotations. This task should be included in the benchmark.
      • Datasets: The benchmark is missing several commonly used datasets. For example, it lacks some CFP datasets (EyePACS (the largest DR dataset), Messidor, APTOS, FGADR, and MMAC (myopic maculopathy)), OCT datasets (DRAC), and UWF datasets (UWF4DR).
    2. MLLM Evaluation: The evaluated MLLMs focus solely on general and medical domains, neglecting the ophthalmology domain, which includes models such as EyeGPT and RetFound. While I understand that some of these models may not be open-sourced or may be vision-only, discussions of these related works should be incorporated.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    • The benchmark is not comprehensive.

    • Lacks discussion of important related work.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper
    1. Designed an ophthalmology benchmark with unprecedented granularity, surpassing the level of detail in all existing medical benchmarks

    2. Innovatively analyzed the distinct roles of vision encoders and LLM components, introducing a novel perspective in medical-related benchmarks

    3. Examined the specific contributions of vision encoders and LLM components within multimodal large models

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The task granularity design is highly sophisticated and demonstrates expertise in ophthalmology

    2. Analyzed the distinct roles of vision encoders and LLM components, whereas most current benchmarks primarily evaluate MLLM performance as a whole without separating these components

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lacks case studies. To evaluate the models comprehensively, more case analysis of relevant tasks is needed rather than just presenting numbers. For example, InternVL2.5, while not particularly strong on the disease-level tasks, demonstrates competitive results on L1-L3 tasks. Similarly, HuatuoGPT-V shows weaker performance on Anatomy but performs reasonably well on L3 and L4 tasks. The paper needs a more detailed analysis of these performance gaps to better understand the underlying factors.

    Paper references: https://mmmu-benchmark.github.io/ and https://uni-medical.github.io/GMAI-MMBench.github.io/

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Minor weaknesses:

    1. The study utilizes widely-applied open-source datasets that many large models have already incorporated for VQA training. It remains unclear how the authors guard against data leakage.

    2. The significance of the hierarchical structure requires clearer explanation. Although the authors mention hierarchical relationships between tasks, the classification scheme does not correlate strongly with clinical practice. Additionally, the independence of tasks within each branch is not clearly demonstrated: L4 tasks implicitly incorporate content from L1-L3.

    3. While the paper designs both low-level and high-level tasks, the discussion section lacks a systematic summary of performance across these task categories; the authors also need to explain how these different task levels ultimately influence diagnostic accuracy.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper provides a highly granular task division, and the approach of separately evaluating VE and LLM components is innovative. However, it has limitations in its case studies and contains some minor flaws, leading me to consider it a borderline acceptable paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose FunBench, a systematic, multi-level fundus image understanding benchmark. They construct a high-quality test collection covering 14 publicly available datasets and define four task levels (L1-L4) according to the original annotations. Furthermore, three evaluation modes (E-Mode I/II/III) are introduced to evaluate the vision encoder (VE), the language model (LLM), and the end-to-end capability of the overall system, respectively, providing a modular and holistic joint evaluation framework. Finally, several mainstream medical multimodal large models are systematically evaluated and compared.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A systematic and multi-level benchmark for fundus image understanding, FunBench, is proposed to modularly evaluate the vision encoder (VE) and language model (LLM) via E-Mode I and II, and to assess the overall reasoning capability through E-Mode III. This “decoupling + integration” strategy is novel in the field of medical multimodal modeling. Moreover, by leveraging 14 publicly available datasets covering CFP, OCT, UWF, and multimodal modalities, the benchmark effectively reveals the strengths and limitations of current mainstream medical MLLMs in real-world tasks, providing valuable insights for future improvements.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors propose a well-structured and conceptually complete benchmark for multilevel fundus image understanding. By introducing a modular evaluation framework based on model components, the work demonstrates significant research value and practical relevance. However, certain aspects related to task design, experimental control, and error analysis could be further improved. The following suggestions are provided:

    1. While the hierarchical structure of tasks (L1–L4) is logically defined, some semantic overlap appears to exist between certain levels—for example, between lesion-level analysis in L3 and disease diagnosis in L4. It is recommended that the authors provide representative examples to clarify the distinctions and overlaps between each task level, in order to demonstrate that the hierarchy reflects differences in information abstraction rather than arbitrary categorization.

    2. The number of subtasks is imbalanced across levels. For instance, L3 includes 39 subtasks, whereas L1 and L2 each contain only two. This imbalance may introduce bias in evaluation. The authors are encouraged to add a chart illustrating the data/task distribution across levels and to analyze its potential impact on the overall benchmarking results.

    3. Since different tasks utilize CFP, OCT, UWF, and other imaging modalities, it is important to assess the model’s generalizability across modalities and to determine whether the imaging modality itself introduces performance bias. A modality-wise performance breakdown would strengthen the evaluation.

    4. The use of linear probes is appropriate for assessing the separability of visual features extracted by the VE. However, this approach may not fully reflect the VE’s representational capacity in real-world tasks. It is suggested to include a fine-tuning experiment to complement the linear probe results and provide a more comprehensive evaluation of the VE’s potential.

    5. Although Spearman correlation analysis is provided in Table 4, the coupling between VE and LLM could be further quantified. A more rigorous decomposition study—such as evaluating multiple LLMs with the same VE, or vice versa—could help validate the extent to which the LLM drives performance in different tasks.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel and well-structured benchmark, FunBench, for evaluating MLLMs in fundus image understanding. The multi-level task design and modular evaluation strategy (VE, LLM, and holistic) are innovative and practically valuable. Despite some minor issues in task balance and analysis depth, the work is technically sound and offers clear contributions to the field.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We appreciate all reviewers (R1, R2, R3) for their valuable comments and the opportunity for clarification.

=== Response to R1 ===

  • We accept R1’s suggestion. Since image quality is disease-dependent, we will add the image quality assessment task to Level 4. All the public datasets suggested by R1 will be incorporated into FunBench-v2, to be released on Hugging Face.

  • We would like to clarify that EyeGPT is a large language model, not an MLLM. As for MLLMs in the ophthalmology domain, e.g., OphGLM [6] and DeepDR-LLM [15], none is open-source or accessible via API for evaluation. As for RetFound, this model is essentially a Vision Encoder (VE), not an MLLM. Hence, we will include RetFound in E-Mode I (Linear-probe based VE Evaluation).
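
For illustration, a minimal sketch of how E-Mode I could look for a frozen vision encoder such as RetFound or CLIP-ViT is given below, assuming image features have already been extracted offline; the feature dimension, data, and function names are placeholders, not the paper's actual pipeline.

```python
# Minimal sketch of E-Mode I (linear-probe VE evaluation), assuming image
# features have already been extracted with a frozen vision encoder such as
# RetFound or CLIP-ViT. All names and dimensions are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Train a linear classifier on frozen VE features and report test accuracy."""
    probe = LogisticRegression(max_iter=1000)  # the only trainable component
    probe.fit(train_feats, train_labels)
    return accuracy_score(test_labels, probe.predict(test_feats))

# Toy example with random features standing in for real VE outputs.
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(200, 768)), rng.normal(size=(50, 768))
train_y, test_y = rng.integers(0, 2, size=200), rng.integers(0, 2, size=50)
print(f"probe accuracy: {linear_probe(train_x, train_y, test_x, test_y):.3f}")
```

Because only the linear probe is trained, the resulting accuracy reflects the separability of the frozen VE features rather than any language-side capability.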

=== Response to R2 ===

  • Indeed, due to the natural connection between lesions and diseases, semantic overlap between L3 and L4 exists to some extent. That said, the two levels are distinct, because for a number of diseases the presence of certain lesions alone is often insufficient for diagnosis. Meanwhile, as a model might produce a correct diagnosis on the basis of incorrect inference, L3 and the levels below are important for a comprehensive evaluation of a model’s know-how. Following R2’s advice, we will provide representative examples to clarify the distinctions and overlaps between each task level in the final version.

  • We clarify that our evaluation protocol already takes the cross-level imbalance into account: the MEAN performance column in Tab. 3 is averaged over the four levels, whilst the per-level performance is obtained by averaging over the (sub-)tasks, in a hierarchical manner when subtasks exist, as in L3 (a concrete sketch of this averaging is given after this list). Such a performance calculation effectively removes any bias caused by imbalanced numbers of subtasks across different levels.

  • We will add a modality-wise performance breakdown. MLLMs show better overall performance on CFP and OCT than on UWF.

  • We will report the fine-tuned performance of CLIP-ViT, a popular choice of VE (Tab. 2).

  • We clarify that our current setup already supports the decomposition study suggested by R2, namely evaluating multiple LLMs with the same VE; see the following five models in Tab. 2: LLaVA-v1.5, Qilin-Med-VL-Chat, LLaVA-v1.6, LLaVA-Med-v1.5, and HuatuoGPT-Vision, all using the same CLIP-ViT as their VE. We will discuss the suggested study in “LLM Comparison” (Sec. 3.2).
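
For concreteness, the level-wise averaging described in the second bullet above can be illustrated with the following sketch; the task names and scores are placeholders, and nested subtasks within an L3 task would simply be averaged first in the same way.

```python
# Sketch of the hierarchical averaging described above: each level's score is
# the mean over that level's (sub-)tasks, and the overall MEAN is the unweighted
# mean over the four levels, so a level with many subtasks (e.g., the 39 L3
# subtasks) does not dominate. Task names and scores are placeholders.
from statistics import mean

scores = {
    "L1": {"task_a": 0.90, "task_b": 0.55},
    "L2": {"task_a": 0.70, "task_b": 0.65},
    "L3": {f"subtask_{i}": 0.60 for i in range(39)},
    "L4": {"task_a": 0.50, "task_b": 0.45, "task_c": 0.48},
}

level_means = {lvl: mean(tasks.values()) for lvl, tasks in scores.items()}
overall_mean = mean(level_means.values())  # each level weighted equally
print(level_means, f"MEAN = {overall_mean:.3f}")
```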

=== Response to R3 ===

  • Following R3’s advice, we will add multi-dimensional task-specific cases for a more intuitive understanding of how different MLLMs perform.

  • Indeed, as current MLLMs do not disclose details of their training data, there is a possibility of data leakage. Even in such an “optimized” scenario, the MLLMs show significant deficiencies in fundus reading skills. Meanwhile, developers of new MLLMs are advised to exclude the test set of FunBench from their training data.

  • We clarify that the four-level design mirrors the progressive complexity of fundus image interpretation. In contrast to tasks from the previous levels, the L4 tasks specifically evaluate an MLLM’s ability to integrate imaging findings with clinical knowledge for a final diagnosis.

  • Following R3’s suggestion, we will expand the last paragraph of Sec. 3 with a systematic summary of performance across both low-level and high-level tasks. To reveal how a model’s performance at the earlier levels influences its diagnostic accuracy, we will report the Spearman correlation between the L4 performance scores and those of the other three levels, which shows that L3 has the largest correlation coefficient with L4 (0.212).
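
A minimal sketch of this cross-level correlation analysis is given below; it simply correlates per-model L4 scores with per-model scores at each earlier level. The score arrays are placeholders, not the paper's actual results.

```python
# Sketch of the cross-level correlation analysis mentioned above: Spearman
# correlation between per-model L4 scores and per-model scores at each earlier
# level. All score arrays are placeholders, not the paper's actual numbers.
from scipy.stats import spearmanr

l4_scores = [0.42, 0.55, 0.38, 0.61, 0.47]  # one entry per evaluated MLLM
lower_levels = {
    "L1": [0.80, 0.85, 0.70, 0.90, 0.75],
    "L2": [0.50, 0.60, 0.45, 0.65, 0.55],
    "L3": [0.40, 0.52, 0.35, 0.58, 0.44],
}

for level, scores in lower_levels.items():
    rho, p = spearmanr(scores, l4_scores)
    print(f"{level} vs L4: Spearman rho = {rho:.3f} (p = {p:.3f})")
```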

We will incorporate all the above into the camera ready. Thank you.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    This paper introduces FunBench, a novel benchmark for evaluating the fundus reading skills of multimodal large language models (MLLMs). FunBench features a hierarchical task structure and multiple evaluation modes, offering a comprehensive framework to assess the capabilities of vision encoders and language models. By integrating 14 public datasets covering various imaging modalities (CFP, OCT, UWF) and creating four task levels (L1-L4), it enables a detailed evaluation of mainstream medical MLLMs.

    The paper makes a valuable contribution by introducing a well-structured benchmark for assessing MLLMs in fundus image understanding. Although there are minor issues in task balance and analysis depth, the research is methodologically sound and offers a clear contribution to the field. Therefore, it is recommended for acceptance.


