Abstract

Current fundus image analysis models are predominantly built for specific tasks and rely on individual datasets, with a learning process that is usually purely data-driven and without prior knowledge. To address this issue, we propose MM-Retinal, a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books. Moreover, enabled by MM-Retinal, we present a novel Knowledge-enhanced foundational pretraining model that incorporates Fundus Image-Text expertise, called KeepFIT. It is designed with an image similarity-guided text revision and a mixed training strategy to infuse expert knowledge. Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and shows excellent generalization ability in zero-shot and few-shot scenarios. MM-Retinal and KeepFIT are available at https://github.com/lxirich/MM-Retinal.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0375_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0375_supp.pdf

Link to the Code Repository

https://github.com/lxirich/MM-Retinal

Link to the Dataset(s)

https://drive.google.com/drive/folders/177RCtDeA6n99gWqgBS_Sw3WT6qYbzVmy

BibTex

@InProceedings{Wu_MMRetinal_MICCAI2024,
        author = { Wu, Ruiqi and Zhang, Chenran and Zhang, Jianle and Zhou, Yi and Zhou, Tao and Fu, Huazhu},
        title = { { MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper has two major contributions:

    1. It builds a multi-modal fundus dataset MM-Retinal that encompasses high-quality image-text pairs of 2169 color fundus photography (CFP), 1974 fundus fluorescein angiography (FFA) and 233 optical coherence tomography (OCT) images collected from fundus diagram books.

    2. The authors also propose a knowledge-enhanced foundational model KeepFIT that is mixed pretrained on public and their constructed MM-Retinal datasets. This vision-language pre-training framework features a novel image similarity-guided text revision method to better inject expert knowledge into image only public datasets.

    The proposed KeepFIT achieves SOTA performance on six representative downstream tasks, especially in zero-shot and few-shot scenarios on different retinal imaging datasets. The authors have performed comprehensive experiments with several benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has several strengths:

    1. A better way to curate a vision-language dataset: MM-Retinal crawls image-text pairs from diagram books, so it naturally features higher quality and richer expert knowledge. Compared with publicly available fundus image datasets that are large in scale but carry only category or report information, this new dataset is characterized by multi-modality (CFP, FFA and OCT), highly dense knowledge, diverse text and vocabulary, and comprehensive disease categories. This is also preferable to automated vision-language crawling from open-access paper repositories such as PubMed Central. I hope MM-Retinal will be released to the public, as such a publicly available dataset would greatly facilitate future research in the fundus field.

    2. Vision-language pre-training that includes the public image datasets: The proposed KeepFIT includes a novel and interesting mixed training approach that integrates expert knowledge. The method utilizes large-scale image datasets that have only category information and no textual description, and introduces the rich expert knowledge of the private MM-Retinal dataset into the simple text prompts of the public datasets through image similarity and multi-head cross-attention mechanisms. In this way, category-related knowledge can be better integrated into the hidden representation of the text, the so-called text revision (a minimal illustrative sketch of this mechanism is given after this list). The experiments further demonstrate the importance of expert knowledge and the effectiveness of the method: compared with extremely large-scale but semantically homogeneous datasets, a small amount of knowledge-dense data with limited images can achieve better results. Notably, the authors do not leave out the publicly available image-only datasets during training; by integrating all of them with their MM-Retinal dataset, the method combines the strengths of both.

    3. Transfer learning to downstream tasks: The authors test zero-shot and few-shot learning on unseen categories, and perform fine-tuning experiments on six representative unseen downstream datasets to demonstrate the generalization and transferability of the proposed method, with reasonable comparisons against baseline models.
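    For concreteness, the following is a minimal sketch of what an image similarity-guided text revision step could look like in PyTorch. This is my own illustration under stated assumptions (module layout, top-k retrieval, residual connection, tensor shapes), not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextRevision(nn.Module):
    """Hypothetical sketch: revise the category-prompt embeddings of a public
    dataset with expert text retrieved from MM-Retinal via image similarity."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pub_img_feat, pub_txt_emb, mm_img_feat, mm_txt_emb, top_k: int = 1):
        # pub_img_feat: (B, D)    image features of public-dataset samples
        # pub_txt_emb:  (B, L, D) token embeddings of their simple category prompts
        # mm_img_feat:  (N, D)    image features of MM-Retinal samples
        # mm_txt_emb:   (N, L, D) token embeddings of MM-Retinal expert text
        sim = F.normalize(pub_img_feat, dim=-1) @ F.normalize(mm_img_feat, dim=-1).T  # (B, N)
        idx = sim.topk(top_k, dim=-1).indices                  # most similar expert samples
        expert = mm_txt_emb[idx].flatten(1, 2)                 # (B, top_k * L, D)
        revised, _ = self.cross_attn(query=pub_txt_emb, key=expert, value=expert)
        return pub_txt_emb + revised                           # residual "text revision"
```

    Under this reading, the revised prompt embedding would replace the plain category prompt in the image-text contrastive objective, so that expert knowledge retrieved from MM-Retinal is injected into the text branch of the public datasets.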

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the paper demonstrates novel ideas for dataset curation and model training, there are several concerns and missing details that are not clear from the presentation. Moreover, the paper writing needs to be improved in general; the language is not fluent in places.

    Please see the section of detailed comments and constructive feedback for my questions and concerns raised. Please address those.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The dataset MM-Retinal needs to be released, as it is essential for the reproducibility of the proposed work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please address the concerns below.

    During the dataset construction process:

    1. The data source was not clearly specified; it is only briefly mentioned as coming from diagram books. It is important to know which books were selected. How did the authors search for such books? What query terms were used? Were these books downloaded, or were they queried online? It would also be good if the authors could release the code for data preparation along with the MM-Retinal dataset, so that other medical AI communities, such as dermatology and radiology, could benefit from a similar approach to data curation.

    2. It was mentioned that the constructed dataset is bilingual, but the purpose and motivation behind creating a bilingual dataset were not explained. Please clarify. Were both languages used during training?

    3. The paper did not mention how the amount of data changed at each processing step; it only included the final data statistics, which does not allow readers to intuitively understand the entire process. The four steps mentioned for data curation are too high-level, without delving into what changed during each step. The authors have focused more on describing the dataset statistics, but readers would like to know the detailed process so that such an approach to data curation could be adopted in other fields such as dermatology and radiology.

    4. The paper stated that the constructed dataset involves 96 abnormalities and diseases, which is the same number as the 96 categories declared in FLAIR. However, the paper does not explain how the categories were identified and determined when crawling the data. The related work may involve matching image-text pairs with category labels, but we do not know why these 96 categories were chosen and how they were formed. Please clarify.

    5. The paper lacked analysis of using tools like GPT-4 to automate the correction of OCR errors, instead opting for manual correction, which reduces the generalizability of the dataset construction method. Did the authors think about resorting to GPT-4-like models or LLMs? (An illustrative sketch of such an LLM-based correction step is given after this list.)

    6. Compared to foundation-model work in other medical fields, although the data is knowledge-intensive and the experimental results are good, a dataset of just over 4,300 samples is objectively not large. The authors did not explain this or address the reasons for the relatively small dataset size. We do not know whether building a larger dataset is a challenge limited by the available data sources, or whether the authors believe that knowledge-dense data of this size is sufficient for the fundus domain.

    7. From the experimental setup, the authors seem to prefer dividing CFP, FFA, and OCT into three mutually independent tasks and training three mutually independent models. However, in the dataset construction, the three modalities were merged into one complete dataset, indicating a preference for training with a unified fundus dataset. The authors did not explain this somewhat contradictory approach.
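    To make point 5 concrete, below is a minimal sketch of what an LLM-based OCR correction step could look like. It is purely illustrative of the suggestion (the paper used manual correction); the prompt wording and the model name are assumptions, and it relies on the OpenAI Python client (v1.x) with an API key configured in the environment.

```python
from openai import OpenAI  # assumes the OpenAI Python client (v1.x)

client = OpenAI()

def correct_ocr(raw_text: str, model: str = "gpt-4o") -> str:
    """Hypothetical LLM-based cleanup of OCR output from a diagram-book caption."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You correct OCR errors in ophthalmology captions. "
                        "Fix misrecognized characters and spacing, keep the medical "
                        "terminology and meaning unchanged, and return only the corrected text."},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content.strip()
```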

    For the methodology and experimental section:

    1. The authors mention the 1:1 mixed training method adopted to avoid optimization bias. However, this does not imply that the data used in the experiments is also balanced 1:1, as the best results come from a combination of 50% of FLAIR (139,174 samples) and MM (2,169 samples). The authors did not clarify the appropriate sampling strategy (see the sketch after this list for one concrete interpretation). In the experimental setting with a batch size of 24, if the label sets of MM and FLAIR within a batch do not intersect, would there be knowledge injection? Does random sampling limit performance?

    2. Based on the first point, the authors state in “4.2 Comparison on Zero-Shot and Few-Shot Tasks” that “Large datasets may introduce noise, diminishing transfer effectiveness.” However, the best performance comes from a combination of MM and 50% of the large dataset FLAIR. This combination does not remove noise from the large dataset. Knowledge injection also does not correct the labels of the large dataset.

    3. Following the first point, the authors did not explain or analyze why MM + 50% FLAIR achieves the best results rather than 100%. If 50% of the data can help MM improve performance, why can't the remaining 50% achieve the same effect? If both halves can, why can't the combined 100%? If the remaining 50% cannot, then how should the data be sampled? In fact, in Table 2(a), MM + 100% FLAIR falls below MM + 50% FLAIR in only one metric.

    4. The experiments lack a comparison of FLAIR + syn + MM. Additionally, the experiments do not show how performance changes for knowledge-dense text with different text quantities (e.g., dividing the text by category). This matters because the authors consider 233 OCT samples to be insufficient but around 2,000 to be sufficient; readers therefore do not get a clear idea of the amount of data required (for fundus), similar to point 6 above.
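    To make the sampling question in point 1 concrete, here is a minimal sketch of one plausible reading of "1:1 mixed training": each mini-batch of 24 draws half of its samples at random from FLAIR and half from MM-Retinal, with the much smaller MM-Retinal simply cycled. This is my assumption for illustration, not necessarily the authors' sampler.

```python
from torch.utils.data import DataLoader, Dataset

def mixed_batches(flair_ds: Dataset, mm_ds: Dataset, batch_size: int = 24):
    """Hypothetical 1:1 mixed sampling: half of each mini-batch comes from the
    large public dataset (FLAIR), half from MM-Retinal, regardless of their
    very different sizes. Whether the label sets of the two halves intersect
    within a batch is left to chance."""
    half = batch_size // 2
    flair_loader = DataLoader(flair_ds, batch_size=half, shuffle=True, drop_last=True)
    mm_loader = DataLoader(mm_ds, batch_size=half, shuffle=True, drop_last=True)
    mm_iter = iter(mm_loader)
    for flair_batch in flair_loader:
        try:
            mm_batch = next(mm_iter)
        except StopIteration:          # MM-Retinal is exhausted much earlier, so cycle it
            mm_iter = iter(mm_loader)
            mm_batch = next(mm_iter)
        yield flair_batch, mm_batch
```

    Whether one epoch should then cover 50% or 100% of FLAIR under such a scheme is exactly the question raised in points 1-3.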

    For paper writing:

    1. There are typos in formulas (3) and (4) that need to be corrected; alternatively, in Fig. 2, the symbols for loss_m and loss_p should be interchanged.
    2. The specific experimental results mentioned in the abstract and Section 4.2 do not match Table 1. Please double-check the experimental results.
    3. Images are low resolution. Could the authors please upload vector source files for better readability?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors proposed a new dataset, MM-Retinal, which was collected and curated from book sources available online. The authors also proposed a smart way of integrating publicly available image datasets with their newly built MM-Retinal dataset to train a vision-language model.

    However, concerns remain about the clarity of the details of both the data curation and the model training. Could the authors please address these questions?

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents and evaluates a novel foundation model (FM) for color fundus photography (CFP) and fundus fluorescein angiography (FFA) image classification. It jointly takes advantage of 1) large public datasets of fundus images coupled with simple text (in fact labels) and 2) a much smaller new dataset (MM-Retinal), where images are coupled with rich text from ophthalmology textbooks. Although the paper has flaws, the idea is interesting.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A novel multi-modal dataset (MM-Retinal) is built: it contains image-text paired data of CFP or FFA, collected from fundus diagram books with comprehensive ocular knowledge via accurate image-text descriptions from ophthalmologists.
    • A novel FM (KeepFIT) is trained. It jointly takes advantage of 1) large public datasets of fundus images coupled with simple text (in fact labels) and 2) the smaller MM-Retinal dataset, where images are coupled with rich text.
    • Performance is shown to outperform that of the recent FLAIR [17] baseline on various downstream tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • References to the textbook(s) should be provided: the publication year is unknown, the country of origin is unknown, etc. The number of textbooks is also important. All we know is that there are texts in English and in Chinese.
    • The proposed algorithm reads as a simple combination of FLAIR [17] and TipAdapter [25]. Although I believe there are differences with TipAdapter, those differences should be highlighted and motivated.
    • Table 1: it is not clear to me why FLAIR trained with FLAIR data (Avg score: 0.610) differs so much from KeepFIT trained with FLAIR data (Avg score: 0.795). This should be stressed and discussed as, in fact, this is the main factor explaining the performance gain in the paper. In comparison, adding MM-Retinal and the proposed "Image Similarity-Guided Text Revision" only marginally improves performance further (to Avg score: 0.856).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    MM-Retinal and KeepFIT will both be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The beginning of the paper (up until Section 3) is misleading. It seems that pre-training relies on the small MM-Retinal dataset (4,116 image-text pairs: 2,169 for CFP / 1,947 for FFA), while in practice it is trained on large public datasets; the MM-Retinal dataset is essentially there to provide rich text descriptions for a smaller subset of samples.
    • Similarly, the authors mention the suitability of their approach for OCT in several places, while the OCT modality is not investigated due to limited samples, which is also misleading.
    • Novelty compared to FLAIR [17] and TipAdapter [25] should be highlighted.
    • Image quality in a textbook is necessarily lower than that of a real-life medical image (e.g., worse resolution, different illumination properties). A discussion of how this (potentially) impacts performance should be included.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed methodology is marginally novel, but the idea is interesting, the topic is trendy and the dataset will potentially be useful to the community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a dataset of paired image-text samples, MM-Retinal, and a novel vision-language (VL) framework, KeepFIT.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    1. A multi-modal dataset called MM-Retinal with over 4.3K high-quality image-text pairs in CFP, FFA and OCT modalities.

    2. A novel training pipeline for VL model training that mixes public data with the newly proposed data.

    3. Extensive experiments that demonstrate the effectiveness of the released multi-modal dataset and the proposed training strategy.

    4. Comprehensive comparisons on multiple downstream tasks/datasets/backbones with several existing works, plus a good ablation study and analysis that thoroughly investigate what makes the proposed method perform well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall a good paper with significant contributions, but some obvious points should be further addressed:

    1. I strongly suggest refining the figures into vector graphics to make the text and some details clear.

    2. The way the MM-Retinal dataset is obtained is interesting but lacks some confidence. Although the manuscript clearly describes the procedure for processing the public resources, quantitative analysis alone is not convincing evidence that the dataset is as high-quality as claimed. Can the authors further address this, since the dataset is a key contribution of this work?

    3. Similarly, the accuracy of the modality clustering is not convincing to me. Also, since some texts are translated between EN and ZH, are any extra constraints or term mappings added to the translation, given that some clinical terms are non-intuitive?

    4. From Table 1, it seems that MM-Retinal contributes more to performance than Syn, which is a dataset with millions of samples. Is this due to the high-quality image-text paired samples of MM-Retinal, while Syn only generates samples in a CLIP manner?

    5. I suggest providing an anonymized link to some pipelines of the proposed work to improve the understanding and reproducibility of the work.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See comments on Weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments are comprehensive with quantitative analysis, but the contributions on the dataset and technical parts are limited, without more convincing details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers and meta-reviewers for their consistent support of our work and for their insightful suggestions. Based on the comments, we have refined our paper to enhance the clarity of our presentation. For instance:

  1. We have provided the data sources and download links for the diagram books we utilized on GitHub.
  2. The details of dataset construction, motivation for creating a bilingual dataset, and the method of defining the 96 categories of MM-Retinal have been emphasized in the main paper and supplementary materials.
  3. We have improved our paper writing, removed misleading statements, and refined the figures into vector graphs.
  4. The MM-Retinal dataset, codes for dataset construction, and KeepFIT model have been made public on GitHub.

Overall, we have carefully addressed the comments and will continue to polish our paper for the camera ready version.




Meta-Review

Meta-review not available, early accepted paper.


