Abstract

Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards for format, terminology, and style poses a major challenge in clinical fundus diagnostic reports and makes the data harder for large language models (LLMs) to understand. To address this, we construct a bilingual standard terminology containing fundus clinical terms and descriptions commonly used in clinical diagnosis. We then establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors but is limited in the range of diseases it covers. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms other compared LLMs in the bilingual standardization task, validating its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2766_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/AB-Story/RetSTA-7B

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CaiJiu_RetSTA_MICCAI2025,
        author = { Cai, Jiushen and Zhang, Weihang and Liu, Hanruo and Wang, Ningli and Li, Huiqi},
        title = { { RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {553--563}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces RetSTA, an LLM-based method for standardizing clinical fundus image reports. It aims to solve the problem of inconsistent format, terminology, and style in clinical fundus diagnosis reports, thereby improving the quality of medical data. The authors refer to authoritative international medical standards and construct 362 standardized terms to ensure the standardization and international compatibility of the terminology. They expand their dataset by simulating noise and verify their method's effectiveness through experiments.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Development of a standardized terminology database: The authors constructed a comprehensive database encompassing clinical terms and common descriptions specific to fundus examinations, ensuring the standardization and international compatibility of the terminology.

    2. Introduction of RetSTA models: Two large language models, RetSTA-7B-Zero and RetSTA-7B, were proposed. These models achieve report-level standardization for clinical fundus diagnostic reports for the first time, addressing an unmet need in this field and setting a new precedent.

    3. Enhancement through data augmentation: By employing a data augmentation strategy that simulates the complexity and diversity of real-world clinical scenarios, the authors fine-tuned the Qwen2.5-7B model. This significantly enhanced the model’s generalization capabilities. RetSTA-7B demonstrated exceptional performance in both Chinese and English report standardization tasks, showcasing its cross-language and domain-specific adaptability.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Incomplete experimental details: Some key experimental details in the paper are not sufficiently elaborated. For instance, Table 2 reports the performance of RetSTA-7B-Zero, but this information is not reflected in Table 1. Additionally, when discussing the impact of data augmentation on model standardization in Table 3, it would be helpful to include the performance metrics of the original dataset (400 samples) for comparison.

    2. Lack of subjective evaluation by medical professionals. While the paper states that the reliability of terminologies was ensured under the guidance of senior ophthalmologists, there is no mention of subjective evaluations from these doctors regarding the model’s outputs. This omission could hinder the practical adoption of the model in real-world clinical applications.

    3. Performance on unseen or rare terminologies. The paper does not address how the model performs on unseen terminologies, particularly those that occur less frequently, such as terms related to rare or complex cases. While the focus may have been on high-frequency terms, the ability to handle rare cases is critical for ensuring the model’s robustness and generalizability.

    4. Details of the data augmentation strategy. Specific details about the implementation of the data augmentation strategy are missing. No examples are provided to illustrate how this strategy was applied, whether performed on Chinese, English, or both. Including such details would improve the reproducibility and clarity of the methodology.

    5. Details of terminology construction process. The paper does not provide sufficient information on how the standardized terminology database was constructed. It is unclear whether the process involved purely manual efforts, a combination of human expertise and machine assistance, or fully automated extraction methods. Clarifying this aspect would help assess the reliability and comprehensiveness of the terminology database.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The details, such as some key experimental details, the data augmentation strategy, and the terminology construction process, are not explained clearly.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors constructed a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. The authors also established two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B integrated standardized data generated by RetSTA-7B-Zero along with corresponding English data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Relevant problem of standardising reports;
    2. Thorough comparison with state-of-the-art LLMs;
    3. Ablation study.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Even though the checkpoints are shared, there is a need to provide more instructions on: a) how to run the models; b) provide test data as examples; c) provide evaluation scripts for reproducing the Table 1 results.
    2. Evaluation by domain experts is missing, it would be great to include human evaluation;
    3. The domain is very specific; a few experiments on reports outside ophthalmology would strengthen the work.
    4. How large is the test set?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A relevant problem of standardising reports was explored with thorough comparison against the state-of-the-art LLMs.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper addresses the standardization challenges in clinical fundus diagnostic reports by constructing a bilingual standard terminology for fundus clinical terms and commonly used descriptions in diagnosis. The researchers developed two large language models: RetSTA-7B-Zero, which was fine-tuned on an augmented dataset simulating clinical scenarios and demonstrated strong standardization capabilities despite limitations in disease coverage, and RetSTA-7B, which integrated standardized data generated by RetSTA-7B-Zero along with corresponding English data to achieve report-level standardization for the first time. Experimental results validated that RetSTA-7B outperforms other compared LLMs in bilingual standardization tasks, demonstrating superior performance and generalizability. This research provides a foundation for future work in training text-image foundation models and multimodal LLMs, with potential applications beyond ophthalmology.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work is of great clinical significance. In the clinical diagnosis of fundus diseases, the detailed description of symptoms largely depends on the interpretative experience of ophthalmologists. The prevalence of issues such as negative and vague expressions, typos, and system storage and format errors in reports exacerbates the challenge of standardizing fundus diagnostic reports. To address this, the authors leverage the advantages of large language models to propose a new method for creating standardized bilingual fundus diagnostic reports.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. In section "2.1 Standard Terminology", it is mentioned that a bilingual fundus standard clinical terminology library containing 362 standard terms was successfully constructed. It is suggested that these 362 fundus standard terms be made available in a public repository for future researchers to study and utilize.

    2. Although this article has strong clinical application value, it lacks detailed descriptions of several key technical points throughout the text, for example the simple and effective data augmentation strategy mentioned in section 2.2 and the rule-based filter referenced in section 2.3. The paper only informs readers about the functions implemented without elaborating on these technical aspects.

    3. In the comparison between non-standardized and standardized reports shown in Fig. 1, I noticed that some keywords from the original report were omitted in the standardized version. For instance, the term "inferior" in "inferior retinal nerve fiber layer defect" was removed. In the context of glaucoma, nerve fiber layer defects are not limited to the inferior region; they can occur in any quadrant of the optic disc, including the superior, inferior, nasal, or temporal regions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clinical Significance

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We are grateful to the reviewers for acknowledging our contributions and for their constructive comments. This response addresses each concern raised, providing detailed explanations of the methodology, clarifications of limitations, and plans for future improvements.

① Experimental details (R1-Q1, R2-Q3): RetSTA-7B-Zero's performance was not included in Table 1 because it was fine-tuned solely on Chinese data, whereas Table 1 reports English standardization test results. Our models rely on a self-constructed fundus standard clinical terminology and standardized fundus report datasets. Due to this domain-specific foundation, they are not directly applicable to non-ophthalmology fields, though the standardization techniques and methods can be extended to other medical fields.

② Technical implementation details (R1-Q4&Q5, R3-Q2): We will provide these details in the final version. Data augmentation: we employed a validated Chinese NLP toolkit for syntactic noise (homophone replacement, punctuation edits, clause merging), semantic perturbation (negation and modifier addition), and synonym replacement via a custom thesaurus. Rule-based filtering (Section 2.3): generated reports were split into clauses by commas; clauses were scored pre/post-standardization, and those above a threshold were retained while the rest were deleted. Terminology construction: core terminology was manually extracted; for common expressions, we used the affinity propagation (AP) clustering algorithm to cluster clauses from all reports and selected the most standard expression from each cluster; ambiguous terms were reviewed by ophthalmologists to ensure accuracy.

③ Terminology and further instructions for reproducibility (R2-Q1, R3-Q1): To ensure the reproducibility of this study and facilitate further research and application in the academic community, we will provide evaluation scripts and test data samples for the RetSTA-7B-Zero model through a public repository upon the formal publication of this paper. The bilingual fundus standard clinical terminology can be provided for research purposes with the permission of the corresponding author.

④ Suggestions for experiments (R1-Q1&Q2&Q3, R2-Q2, R3-Q3): We thank the reviewers for these valuable comments, which are important guidance for improving our study. We plan to focus on the following improvements in future work: (a) improving ablation studies on data augmentation techniques (R1-Q1); (b) constructing a test set containing rare terminologies to evaluate the model's robustness and generalization capability (R1-Q3); (c) strengthening the retention of locational descriptions of lesion areas during standardization, to balance standardization with clinical semantic integrity (R3-Q3); (d) designing an expert evaluation scheme that invites ophthalmologists to assess the terminology accuracy and report integrity of model outputs, so as to strengthen the reliability and applicability of the model in real-world clinical scenarios (R1-Q2, R2-Q2).

⑤ Size of test set (R2-Q4): We employed a test set comprising 2,000 samples for evaluation.
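
To make the rule-based filtering described in ② more concrete, below is a minimal Python sketch. The terminology sample, the SequenceMatcher-based clause scorer, and the 0.6 threshold are hypothetical stand-ins; the authors' actual scoring criterion and implementation have not been released.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample of the standard terminology (the real library has 362 terms).
STANDARD_TERMS = ["tessellated fundus", "retinal nerve fiber layer defect", "optic disc cupping"]

def clause_score(clause: str) -> float:
    """Score a clause by its best fuzzy match against the standard terminology.
    SequenceMatcher is a stand-in; the paper does not disclose the real metric."""
    return max((SequenceMatcher(None, clause.lower(), t).ratio() for t in STANDARD_TERMS),
               default=0.0)

def filter_report(standardized_report: str, threshold: float = 0.6) -> str:
    """Split a generated report into clauses on ASCII or Chinese commas and keep
    only clauses scoring above the threshold, as the feedback describes."""
    clauses = [c.strip() for c in re.split(r"[,，]", standardized_report) if c.strip()]
    return ", ".join(c for c in clauses if clause_score(c) >= threshold)
```

Similarly, the AP-clustering step for terminology construction could look like the sketch below, assuming scikit-learn's AffinityPropagation over character n-gram TF-IDF features (the paper does not state how clauses were represented). Each cluster exemplar would then be handed to ophthalmologists as a candidate standard expression.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

def propose_standard_expressions(clauses: list[str]) -> list[str]:
    """Cluster report clauses with affinity propagation (AP) and return each
    cluster's exemplar as a candidate standard expression for expert review."""
    features = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit_transform(clauses)
    ap = AffinityPropagation(random_state=0).fit(features.toarray())
    return [clauses[i] for i in ap.cluster_centers_indices_]
```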




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


