Abstract

Model generalisability, i.e. performance on multiple unseen datasets, can be improved by training on large volumes of annotated data, from which models can learn diverse representations. However, annotated medical data is limited due to the scarcity of expertise. In this work, we present an efficient data sampling pipeline to select DIVerse and bAlanced images (DataDIVA) from image pools to maximise model generalisability in retinal imaging. Specifically, we first extract image feature embeddings using an off-the-shelf foundation model and generate embedding clusters. We then evenly sample images from these diverse clusters and train a model. We run the trained model on the whole unlabelled image pool and sample the remaining images from those classified as rare categories. This pipeline aims to sample retinal images with diverse representations and to mitigate imbalanced class distributions. We show that DataDIVA consistently improves model performance in both internal and external evaluation on six public datasets, with the clinically meaningful tasks of referable diabetic retinopathy and glaucoma detection. The code is available at https://doi.org/10.5281/zenodo.12674694.
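
A minimal sketch of the two-stage sampling idea described in the abstract, assuming image embeddings have already been extracted with an off-the-shelf foundation model (e.g. RETFound). The synthetic data and helper names below are illustrative placeholders, not the authors' released implementation.

```python
# Hedged sketch of the DataDIVA sampling pipeline; placeholders stand in for the
# foundation-model features, the initial task model, and its predictions.
import numpy as np
from sklearn.cluster import KMeans


def diverse_sample(embeddings, n_clusters, n_samples, rng):
    """Stage 1: spread the sampling budget evenly across embedding clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    per_cluster = n_samples // n_clusters
    picked = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        picked.append(rng.choice(members, size=min(per_cluster, members.size), replace=False))
    return np.concatenate(picked)


def balanced_sample(pred_labels, rare_classes, n_samples, exclude, rng):
    """Stage 2: sample remaining images from those the initial model classifies as rare."""
    candidates = np.flatnonzero(np.isin(pred_labels, list(rare_classes)))
    candidates = np.setdiff1d(candidates, exclude)  # skip already-sampled images
    return rng.choice(candidates, size=min(n_samples, candidates.size), replace=False)


# Toy usage: random embeddings stand in for foundation-model features, and random
# predictions stand in for an initial model trained on the first half of the budget.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 1024))      # one row per unlabelled image
first_half = diverse_sample(embeddings, n_clusters=10, n_samples=300, rng=rng)
pred_labels = rng.integers(0, 2, size=5000)     # initial-model class predictions
second_half = balanced_sample(pred_labels, rare_classes={1}, n_samples=300,
                              exclude=first_half, rng=rng)
selected = np.concatenate([first_half, second_half])  # final cohort sent for labelling
```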

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2889_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2889_supp.pdf

Link to the Code Repository

https://zenodo.org/records/12674694

Link to the Dataset(s)

All datasets used are publicly available and properly cited in the paper.

BibTex

@InProceedings{Zho_Enhancing_MICCAI2024,
        author = { Zhou, Tianfeng and Zhou, Yukun},
        title = { { Enhancing Model Generalisability through Sampling Diverse and Balanced Retinal Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces DataDIVA, a data sampling pipeline designed to enhance model generalisability in retinal imaging by selecting diverse and balanced retinal images via clustering techniques. It leverages an open-source foundation model to extract image features and guide the sampling process. The approach addresses the scarcity of annotated medical data and aims to improve real-world clinical applications. Further concerns are detailed in the sections below.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Enhanced Generalisability: The paper conducted extensive and rich experiments, along with persuasive visualizations, demonstrating that the pipeline effectively improves model performance across multiple datasets.
    2. Efficient Sampling: It provides an efficient and novel method for selecting informative data via clustering techniques from large unlabelled and unbalanced datasets.
    3. Open-Source Contribution: The code and materials are shared, contributing to the medical AI community.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The schematic diagram of the method can be drawn in more detail.
    2. Some mathematical symbols in the formulas also need to have their meanings clarified, for example, M and N.
    3. The approach is innovative, but the method is somewhat simplistic. I recommend including results using all data as a control in the experiments to demonstrate the model’s usability.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please check the weakness part.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please address my concern in the weakness part.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents DataDIVA, a novel data sampling pipeline for medical datasets in real-world clinical scenarios, targeting improved generalizability and increased reliability and robustness on unseen datasets.

    Specifically, foundation models are used, very intelligently, to guide the data sampling process. The authors demonstrate their idea on datasets for the retinal imaging tasks of diabetic retinopathy and glaucoma detection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has intelligently devised a sampling strategy using the foundation models in medical domain. Compared to the existing sampling methodologies, this is a novel and smart way to use the current foundation models as they are supposed to have learnt a generalized feature space. The authors present their work in the domain of retinal fundus image classification.

    Experimental settings are well defined in this paper, and the authors have conducted essential ablation studies. The authors are able to demonstrate that their proposed approach surpasses current sampling strategies for the two tasks of diabetic retinopathy and glaucoma detection. The authors also show visualization results of t-SNE maps of the trained model on the external dataset, which clearly show that the model trained using their sampling strategy has a better ability to distinguish between the categories.

    The paper is well written and technically sound and easy to follow and read. I liked reading the story and the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major:

    My major concern about the proposed technique is that it is limited to the classification labels available in Din, which is sampled from Du, and thus the external validation sets Dout need to cover the same task to test the efficacy of this method. This is also evident from the experiments performed by the authors, where they resort to EyePACS for diabetic retinopathy grading with APTOS and IDRiD as the corresponding Dout datasets; the glaucoma detection task follows a similar setting. The above experiments suffice for the proposed ‘diverse’ sampling strategy; however, have the authors thought about testing the proposed ‘balanced’ sampling strategy on long-tailed classification datasets (many classes with an unbalanced distribution) to justify it better?

    Please also see further questions below.

    1. One important step for selecting the diverse samples is to remove the outliers before performing the K-means clustering and further processes. The authors have not described how they removed these potential outliers. As selection of outliers will eventually affect the sampling strategy, it is essential to understand this process.
    2. Why only five quantile subgroups for the second stage of sampling? Usually, such quantile methods have certain criteria for selecting this number, and the choice will also influence the sampling diversity, especially depending on what the dataset Du is. Thus, more clarity is needed here.
    3. How was the data split of EyePACS decided? Were 35,126 training images and the 53,576 testing images selected randomly? The same question for the AIROGS dataset.
    4. The selection set size for Ds is said to be M, where M ≪ N (the original size). How is this M determined? I don’t see a study that evaluates the performance based on the chosen size of M.
    5. One observation from the experimental results in Table 1 is that the performance on Din [EyePACS] is significantly lower than on the external sets Dout [APTOS and IDRiD] for diabetic retinopathy. For glaucoma detection, though, the internal set performance is higher than that on one of the external sets. Why might this be the case? Could the authors shed some light on this?

    Minor suggestions:

    1. Abstract: ‘using off-the-shelf foundation models’ instead of ‘foundation models off-the-shelf’
    2. Fig.1: Readability could be improved. For (b), please mention foundation model to get the feature representations. What are the concentric circles during diverse sampling in one of the clusters? It’s not clear.
    3. The authors mention there are three sampling steps; however, there appear to be only two: first, the selection of M/K samples for inter-cluster diversity, and then, for each cluster, dividing it into five quantile groups based on distance to select the samples Ddiverse.
    4. Please extend the conclusion to include a bit more details of the approach.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have mostly described the hyperparameters used in the experiment settings, however, some need more clear explanation and visibility. Please see the comments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weakness section for detailed and constructive comments. Please address those.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper suggests an interesting idea and approach by cleverly utilizing the foundation model feature space for a sampling strategy. This method is especially useful when transferring to external datasets for robustness evaluation. The authors have also demonstrated their proposed approach on two different datasets and tasks in retinal imaging. Their approach is able to surpass the sampling strategies currently in popular use.

    However, certain questions remain about the validation of the approach, especially when it involves a long-tailed distribution with many classes rather than only a couple of categories. Concerns also remain about the selection and explanation of certain hyperparameters, as well as the explanation of the results.

    I am happy to bump up my score if my concerns about the explanations and hyperparameters are adequately addressed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors proposed a data sampling pipeline to select labelled samples for training and thus enhance model generalizability on retinal images. Specifically, a foundation model (RETFound) was used to extract image features, k-means was applied to cluster the images, and two selection strategies, considering diverse data representation and balanced distribution, were proposed for sample selection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduced a data sampling strategy that effectively boosts the model’s generalizability in retinal images.
    2. The proposed method is straightforward and substantiated by efficient experiments.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The description of Section 2.3 is not clear: what is the meaning of ‘However, we can focus on the predictions that correspond to rare categories and sample data points from them’? It might be better to explain how the rare categories are selected for balancing the training dataset.
    2. In the experiments, a comparison between the model trained with all the labelled data and the model trained with the proposed sampling strategy should be conducted to evaluate the effectiveness of the sampling strategy in improving model generalizability.
    3. For the external dataset validation, it is not clear whether the model is trained on the external datasets; if not, how should the following sentence be explained: ‘The REFUGE and ORIGA, two benchmarks commonly used for glaucoma detection, are used for external evaluation. We sample 600 images (about 1.7% of the EyePACS train set) for referable DR detection and 1200 images (about 1.7% of the AIROGS train set) for glaucoma detection. The sampled images are split into 80%:20% as training and validation sets during model training.’
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    please see the weakness

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the proposed method and the robustness of the experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We greatly appreciate the thorough feedback provided by the reviewers and their acknowledgement of DataDIVA’s efficacy in enhancing model generalisability. We have responded to the questions and further enhanced the paper’s clarity in light of helpful suggestions.

  1. We aim to achieve both diverse and balanced sampling with DataDIVA. We conduct diverse sampling to get the first half of Ds and train an initial model for the disease detection task. We then deploy the initial model on the remaining unlabelled data Du and sample the second half of Ds from those classified as rare categories. The combined Ds constitutes the final sample cohort for data labelling and model training.

We have verified the efficacy of this approach through internal and external tests, with two network backbones, on different clinically meaningful tasks. Nevertheless, we acknowledge that more insights could be gained by 1) investigating the performance of DataDIVA on long-tailed classification with a complex mixture of multiple diseases; 2) studying the impact of the sample size M (we currently use a small M, since such a labelling workload is manageable for clinical experts in real-world scenarios); and 3) comparing the DataDIVA model with one trained on the full data. We value the reviewers’ suggestions regarding these experiments and intend to incorporate them in a future extension, given the substantial additional benchmark datasets, performance comparisons, and space for results and discussion they would require.

2) We will add more details to Figure 1, including the involvement of foundation models and clarification on various components. All suggestions relevant to the presentation will be incorporated into the camera-ready submission.

3) Within each cluster, we calculate the distance between each data point and the cluster centroid. To remove points that lie extremely far from the centroid (defined as outliers in this paper), we set a 95% threshold and remove the data points whose distances fall beyond the 95th percentile of the distance distribution. This clarification has been included in the paper.
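
As a rough illustration, a per-cluster filtering step of this kind could look like the sketch below. The implementation details are assumptions, not the authors' released code; `labels` and `centroids` would come from a fitted k-means model (e.g. `kmeans.labels_` and `kmeans.cluster_centers_`).

```python
# Sketch of the per-cluster outlier removal described in the rebuttal: within each
# cluster, points whose distance to the centroid exceeds the 95th percentile are
# dropped before further sampling. Details here are assumed, not the authors' code.
import numpy as np


def remove_outliers(embeddings, labels, centroids, quantile=0.95):
    """Return indices of points kept after per-cluster distance filtering."""
    keep = []
    for c in range(centroids.shape[0]):
        members = np.flatnonzero(labels == c)
        dists = np.linalg.norm(embeddings[members] - centroids[c], axis=1)
        cutoff = np.quantile(dists, quantile)   # 95th-percentile distance threshold
        keep.append(members[dists <= cutoff])
    return np.concatenate(keep)
```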

4) The number of quantile subgroups per cluster is indeed expected to impact model performance. We allocated the available space to investigating the influence of diverse and balanced sampling (Table 1), cluster numbers (Supplementary Figure 1), the models used for feature extraction (Supplementary Figure 2), and various network backbones (Tables 2 and 3). A further study of hyperparameters will be included in a future extension.

5) We adhere to the default data split for EyePACS on Kaggle. AIROGS, which has no default split, was randomly split 70%:30%, with 70% serving as the unlabelled pool and 30% reserved for internal evaluation. Only the images sampled from EyePACS (600 images, 1.7% of the EyePACS train set) and AIROGS (1200 images, 1.7% of the AIROGS train set) were used for model training. All external datasets (APTOS-2019, IDRiD, REFUGE, and ORIGA) remain isolated for external testing. We have revised the description to avoid any confusion.
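
A toy sketch of this partitioning, with placeholder image IDs and counts; only the 70%:30% ratio and the isolation of the external datasets are taken from the rebuttal.

```python
# Toy sketch of the AIROGS partitioning; the image list and its size are
# placeholders, not the real dataset contents.
import numpy as np

rng = np.random.default_rng(42)
image_ids = np.array([f"airogs_{i:06d}.png" for i in range(100_000)])  # placeholder IDs
perm = rng.permutation(image_ids.size)
cut = int(0.7 * image_ids.size)
unlabelled_pool = image_ids[perm[:cut]]   # 70%: pool that DataDIVA samples from
internal_test = image_ids[perm[cut:]]     # 30%: held out for internal evaluation
# ~1,200 images (about 1.7% of the AIROGS train set) are then sampled from the
# pool for training; APTOS-2019, IDRiD, REFUGE, and ORIGA stay external-only.
```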

6) Although certain datasets address the same task, e.g. EyePACS, APTOS, and IDRiD for diabetic retinopathy, model performance on each may vary due to dataset characteristics, including data size, heterogeneity of imaging devices and image quality, and degree of label noise. These discrepancies mean the difficulty of the datasets varies, so internal test performance does not necessarily exceed that of external tests. This can be observed in Figure 2 of RETFound [31], where the model trained on APTOS-2019 performs similarly on the external test on IDRiD to the model trained and tested internally on IDRiD.




Meta-Review

Meta-review not available, early accepted paper.


