Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models offer little interpretability, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging, while integrated gradients highlight the critical nodes, edges, and individual features that drive the classification decisions. We collect this graph-based knowledge, which attributes the model’s predictions to physiological structures and their characteristics, and transform this reasoning into textual descriptions for VLMs. We perform instruction tuning with these textual descriptions and the corresponding images to train a student VLM. This final agent can classify the disease and explain its decision in a human-interpretable way based solely on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our agent provides more accurate diagnostic explanations and enables precise localization of pathologies in OCTA images.
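As a rough illustration of the staging step described above, the sketch below builds a graph-level classifier with PyTorch Geometric. The node features, layer sizes, and random toy graph are assumptions for illustration only, not the authors' exact graph construction or architecture.

```python
# Minimal sketch of a graph-level DR staging GNN (PyTorch Geometric).
# Node features, layer sizes, and the 3-class head are illustrative
# assumptions, not the paper's reported configuration.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv, global_mean_pool

class DRStagingGNN(torch.nn.Module):
    def __init__(self, in_dim=8, hidden=64, n_classes=3):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.head(global_mean_pool(x, batch))  # one logit vector per graph

# Toy graph: each node is a vessel segment with features such as
# diameter or tortuosity; edges encode vascular connectivity.
graph = Data(x=torch.randn(120, 8), edge_index=torch.randint(0, 120, (2, 300)))
batch = torch.zeros(120, dtype=torch.long)  # all nodes belong to one image
logits = DRStagingGNN()(graph.x, graph.edge_index, batch)
```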
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3192_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/3192_supp.zip
Link to the Code Repository
https://github.com/chenjun-li/GFT
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{LiChe_Finetuning_MICCAI2025,
author = {Li, Chenjun and Lux, Laurin and Berger, Alexander H. and Menten, Martin J. and Sabuncu, Mert R. and Paetzold, Johannes C.},
title = {{Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {204--214}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a novel framework to enhance the interpretability of Diabetic Retinopathy (DR) staging using OCTA images. The core idea involves using a GNN on biologically-informed graphs extracted from OCTA data for initial DR staging, employing Integrated Gradients (IG) to identify key explanatory features from the GNN, and then transferring this “graph-based knowledge” via instruction-tuning (using a teacher-student VLM approach) to a final VLM agent. This agent aims to provide both the DR stage classification and a natural language explanation based on the input OCTA image.
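To picture the GNN+IG step this summary refers to, here is a minimal hand-rolled Integrated Gradients sketch over the node features of a toy graph model (a Riemann-sum approximation of the IG path integral). The tiny network, zero baseline, and step count are illustrative assumptions, not the paper's configuration; libraries such as Captum implement the same computation.

```python
# Sketch: Integrated Gradients over node features, written out directly.
# The toy model, zero baseline, and step count are assumptions.
import torch
from torch_geometric.nn import SAGEConv, global_mean_pool

class TinyGNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = SAGEConv(8, 16)
        self.head = torch.nn.Linear(16, 3)
    def forward(self, x, edge_index, batch):
        return self.head(global_mean_pool(self.conv(x, edge_index).relu(), batch))

def integrated_gradients(model, x, edge_index, batch, target, steps=64):
    baseline = torch.zeros_like(x)              # "feature absent" baseline
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(xi, edge_index, batch)[0, target].backward()
        grads += xi.grad
    return (x - baseline) * grads / steps       # attribution per node feature

n = 120
x = torch.randn(n, 8)
edge_index = torch.randint(0, n, (2, 300))
batch = torch.zeros(n, dtype=torch.long)
model = TinyGNN().eval()
target = model(x, edge_index, batch).argmax().item()  # predicted DR stage
attr = integrated_gradients(model, x, edge_index, batch, target)
node_importance = attr.abs().sum(dim=1)         # rank nodes for the explanation
```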
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) Addresses critical need for explainability: The work directly tackles the significant challenge of “black-box” models in medical diagnostics, aiming to provide interpretable DR staging results.
2) Novel knowledge transfer pipeline concept: The proposed multi-stage pipeline for extracting structured, interpretable knowledge from a specialized model (GNN+IG) and distilling it into a generalist VLM via instruction tuning to generate explanations from raw images presents an interesting conceptual approach.
3) Includes human expert evaluation: The authors incorporate evaluation by human experts (two ophthalmologists) to assess the quality of the generated explanations based on clinically relevant criteria. This is a commendable step beyond relying solely on automated metrics.
4) Improved classification & generalization demonstrated: The proposed Graph-knowledge Finetuned models show improved DR staging classification accuracy compared to baseline and standard finetuned VLM approaches on both proprietary and public datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) Limited methodological novelty vs. high complexity: The framework relies heavily on the integration of existing, well-established components (specific GNN architectures like SAGE in Section 2.2, Integrated Gradients in Section 2.3, standard VLMs and fine-tuning techniques like LoRA in Section 2.5 and Section 3; see the LoRA sketch after this list). The primary novelty lies in the complex multi-stage pipeline design (Figure 1), which could be viewed as an incremental engineering contribution rather than a fundamental methodological advance. The significant complexity introduced by using multiple models (GNN, Teacher VLM, Student VLM) is not sufficiently justified against potentially simpler explainability approaches.
2) Insufficient validation of VLM explanation utility: A crucial missing comparison is the evaluation of the clinical utility of the natural language explanations generated by the final VLM agent versus the direct interpretable outputs obtainable from the GNN+IG stage itself. It is unclear whether the complex VLM generation process provides significantly more actionable or trustworthy insights for clinicians compared to directly interpreting the GNN’s reasoning.
3) Limited scale and potential bias in evaluation: The human expert evaluation, while valuable, involved only two experts assessing 48 responses (mentioned in Section 3), limiting the statistical robustness and generalizability of the findings on explanation quality. Furthermore, using a “teacher” VLM (OpenAI o1, mentioned in Section 2.4) for automated scoring of explanation quality introduces potential bias, as the evaluator model is similar to the one used for generating training data.
4) Data reliance and limited ablation: The main multi-class staging results (Table 1) depend on a proprietary dataset, limiting reproducibility. The study also lacks comprehensive ablation experiments to clearly isolate the contribution of each component within the complex pipeline (e.g., the specific impact of the graph construction, the IG method, or the two-stage tuning process).
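For concreteness, a minimal sketch of LoRA-based instruction tuning with Hugging Face peft follows. The checkpoint identifier, target modules, and ranks are placeholders, not the configuration reported in the paper.

```python
# Minimal LoRA instruction-tuning sketch with Hugging Face peft.
# "base-vlm-checkpoint" is a hypothetical model id; target modules
# and ranks are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "base-vlm-checkpoint"  # hypothetical; substitute a real checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # only adapter weights are trainable
# Supervised fine-tuning then runs on (image, question, graph-derived
# answer) examples with a standard training loop or trainer.
```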
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents an interesting approach towards explainable DR diagnosis but suffers from some weaknesses. The methodological novelty appears limited relative to the high complexity of the proposed pipeline. Crucially, the experimental validation fails to adequately demonstrate the added value of the VLM-generated explanations compared to potentially simpler interpretations derived directly from the GNN+IG module. It is essential for the authors to address the limited scale of the expert study and the potential evaluation biases during the rebuttal in order to meet the rigorous standards of MICCAI.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper transforms complex vascular patterns into structured text descriptions and fine-tunes a large model to facilitate end-to-end interpretable diabetic retinopathy diagnosis. It also proves that this method can effectively improve diagnostic accuracy and explainability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper proposes a pipeline that can create a question-answering text dataset based on a graph structure for fine-tuning a large model, providing more accurate explanations while improving diagnostic accuracy. In addition, this paper tests the model on an out-of-domain dataset, showing good generalization performance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- There are cases where the VLM explanation is inconsistent with the GNN result, and in those cases the explanation given by the VLM seems unreliable.
- How correct location regions are counted is not explained. And how can we ensure that the text describing a location is actually relevant to the corresponding area in the image?
- In Table 3, GFT-GPT-4o shows the best explanation performance, but the classification results in Table 1 and Table 2 show that GFT-Llama 3.2 11b is the best. This discrepancy slightly suggests that the explanations from VLMs are unreliable. What are the possible reasons for this?
- How stable is the method? The output of large language models is easily affected by sampling randomness and hyperparameter settings, which can lead to large differences in results.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper integrates a GNN with a VLM to provide explanations while improving diagnostic accuracy, and it demonstrates good generalization ability.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This manuscript proposes a comprehensive approach for explainable diagnosis of diabetic retinopathy (DR), using graph neural networks (GNN) and vision-language models. The approach consists of multiple steps, summarised in section 2.1. The method is tested on a proprietary dataset and a public dataset, measuring both classification performance and explanation quality.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- S1) Code is made available.
- S2) The paper introduces several interesting and inspiring ideas, which would be nice to discuss at the conference. In particular I like the general idea of transferring the knowledge from a dedicated, well-performing classification model (the GNN in this case) to the final vision-language model, which imitates the classification predictions and is able to give additional textual explanations.
- S3) The results are promising: quite high classification performance and better explanation quality than alternative, more straightforward approaches.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- W1) The results are reported without any measures of variance (e.g. confidence intervals).
- W2) Interobserver variability of the explanation quality ratings by the two observers has not been reported.
- W3) Many details of the method are not explained, or only in vague and qualitative terms like “the most important features such as vessel diameter and roundness”. It is unclear to me how exactly the question-answer pairs are generated by the teacher model. Details of model fine-tuning are missing.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- In Table 1, the bold numbers are not always the highest in the column. For example, for Recall Healthy and Recall NPDR, there are higher values in other rows.
- The “Biomarkers” model in Table 2 has quite high performance; please explain what the input biomarkers were and what classification model was used. Isn’t this approach very explainable by design as well?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Strengths S2 and S3 highlight the potential of this work. This paper is a nice demonstration of how to exploit VLMs for creating explainable diagnostic models.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their thorough evaluation and constructive feedback. We appreciate your recognition of our paper’s novelty and would like to take this opportunity to clarify a few points:

R1:
- During our experiments, we rarely observed inconsistencies between VLM and GNN predictions (fewer than 8% of the test data); we inspected these cases with our experts and identified them as mostly borderline cases. This is expected because DR is a gradually progressing disease, which makes such cases difficult for both the GNN and the VLM to resolve. As noted in Section 2.5, our retrieval process also helps mitigate this issue by providing similar cases as context.
- The ophthalmologists used a quadrant-based system to verify text-to-image correspondence. A region was marked as correct only when both experts agreed. The inter-rater weighted agreement was κ=0.83.
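For reference, a weighted Cohen's kappa of this kind can be computed as sketched below. The ratings are toy data and the quadratic weighting is an assumption, since the rebuttal specifies only a weighted kappa.

```python
# Sketch: weighted Cohen's kappa between two expert raters.
# Toy ratings; the quadratic weighting is an assumption.
from sklearn.metrics import cohen_kappa_score

rater_a = [3, 4, 2, 4, 5, 3, 4, 4]  # per-response ratings, rater 1
rater_b = [3, 4, 3, 4, 5, 3, 4, 5]  # per-response ratings, rater 2
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```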
- Classification and explanation are different optimization targets. GPT‑4o has a much stronger pre-trained language module and a larger context window, so after instruction‑tuning it produces richer and more diverse explanations. Llama‑3.2, with fewer generation steps and more tunable hyperparameters, converges better on the classification task.
- During the experiments, the models’ temperatures were set to a relatively low value of 0.2 to ensure consistent output. We have also conducted ablations that support the results and will share them in our repository.

R2:
- We will add confidence intervals from our 5-fold cross-validation: on average, balanced accuracy ±10.13%, precision ±5.67%, and recall ±4.47%.
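As an aside, one standard way such fold-level intervals can be derived is sketched below with toy numbers; the t-based formula is an assumption about the exact procedure, not a statement of it.

```python
# Sketch: 95% confidence interval for a metric over 5 CV folds.
# Fold scores are toy values; the t-based interval is an assumption.
import numpy as np
from scipy import stats

fold_balanced_acc = np.array([0.71, 0.78, 0.65, 0.83, 0.74])
mean = fold_balanced_acc.mean()
# t-based half-width with n-1 degrees of freedom
half = stats.t.ppf(0.975, df=len(fold_balanced_acc) - 1) * stats.sem(fold_balanced_acc)
print(f"balanced accuracy = {mean:.3f} +/- {half:.3f}")
```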
- We now report weighted κ = 0.83 between the two ophthalmologists on the explanations, indicating almost perfect agreement.
- Section 2.4 outlines our Q&A generation approach. The teacher model was prompted with structured tables containing node/edge importance scores from the IG method. Our public code repository includes the full template and hyperparameters.
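To make this step concrete, a hypothetical sketch of such prompt construction is shown below; the field names and template wording are invented for illustration and differ from the template released in the repository.

```python
# Hypothetical sketch: turning IG importance tables into a teacher prompt.
# Field names and template wording are invented for illustration.
def build_teacher_prompt(stage, node_rows, edge_rows):
    """node_rows/edge_rows: (id, feature, ig_score) tuples from the GNN+IG step."""
    node_table = "\n".join(f"node {i}: {feat} (IG={s:+.3f})" for i, feat, s in node_rows)
    edge_table = "\n".join(f"edge {i}: {feat} (IG={s:+.3f})" for i, feat, s in edge_rows)
    return (
        f"The GNN staged this OCTA image as {stage}.\n"
        f"Most important vessel segments:\n{node_table}\n"
        f"Most important connections:\n{edge_table}\n"
        "Write question-answer pairs that explain the staging decision "
        "in clinical language, referring to the structures above."
    )

prompt = build_teacher_prompt(
    "NPDR",
    [(12, "vessel diameter", 0.42), (7, "tortuosity", 0.31)],
    [(3, "capillary dropout near FAZ", 0.27)],
)
```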
- We will correct the Table 1 formatting. The biomarkers model uses standard vascular metrics (BVD, FAZ area, tortuosity) with an SVM classifier, which provides feature importance but lacks location-specific explanations.
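As context for this biomarker baseline, a toy sketch of a linear SVM over image-level vascular metrics follows; the data, linear kernel, and binary labels are illustrative assumptions.

```python
# Toy sketch of a biomarker baseline: a linear SVM on image-level
# vascular metrics, whose coefficients act as global feature
# importances. Data, kernel, and binary labels are assumptions.
import numpy as np
from sklearn.svm import SVC

features = ["BVD", "FAZ area", "tortuosity"]
X = np.random.rand(100, 3)            # one row of biomarkers per eye (toy)
y = np.random.randint(0, 2, 100)      # toy binary labels
clf = SVC(kernel="linear").fit(X, y)
for name, w in zip(features, clf.coef_[0]):
    print(f"{name}: weight {w:+.3f}")  # global importance, not location-specific
```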
R3:
- Our novelty lies in graph‑to‑language knowledge distillation. The pipeline translates node- and edge-level retinal information into clinically meaningful Q&A pairs, which can be used to fine-tune a VLM that never sees the graph at inference. This differs from prior work that either keeps the graph encoder at test time or uses generic image captions.
- Sociological studies (refs. 4 and 17) have demonstrated the advantages of conversation-based information exchange. For clinicians, written reports also serve as a useful means of information transfer. While we are actively working on scaling up this study, we have seen more ophthalmologists rate VLM explanations as more helpful for diagnosis than feature attribution visualizations. We hope to share more insights in follow-up research.
- Our inter-rater agreement (κ=0.83) indicates that the evaluation is reliable, yet we acknowledge the limited scale of the current expert study, which will be added to the discussion. The automated evaluation serves as a supplementary metric and has been widely used in many influential works, such as LLaVA and LLaVA-Rad.
- We have conducted ablations that show the contribution of each component: on average, graph construction and IG attribution improve classification accuracy by 8.1%, and two-stage tuning improves localization ability by 10.3%.
Last but not least, we thank the area chair and all three reviewers for the thoughtful assessments and the early‑accept recommendation. Your comments helped us clarify several points, especially around the evaluation method, stability analysis, and the presentation of results, which we will incorporate in the camera‑ready version. We look forward to presenting our work at the conference this Fall.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A