Abstract

We explore the efficacy of a region-based method for image tokenization, aimed at enhancing the resolution of images fed to a Transformer. This method involves segmenting the image into regions using SLIC superpixels. Spatial features derived from a pretrained model are aggregated segment-wise and input into a streamlined Vision Transformer (ViT). Our model introduces two novel contributions: the matching of segments to semantic prototypes and the graph-based clustering of tokens to merge similar adjacent segments. This approach leads to a model that not only competes effectively in classifying diabetic retinopathy but also produces high-resolution attribution maps, thereby enhancing the interpretability of its predictions.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3756_paper.pdf

SharedIt Link: https://rdcu.be/dV163

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_4

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3756_supp.pdf

Link to the Code Repository

https://github.com/ClementPla/RetinalViT/tree/prototype_superpixels

Link to the Dataset(s)

https://www.kaggle.com/c/diabetic-retinopathy-detection/data

https://www.kaggle.com/c/aptos2019-blindness-detection/data

https://github.com/nkicsl/DDR-dataset

https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid

BibTex

@InProceedings{Pla_ARegionBased_MICCAI2024,
        author = { Playout, Clément and Legault, Zacharie and Duval, Renaud and Boucher, Marie Carole and Cheriet, Farida},
        title = { { A Region-Based Approach to Diabetic Retinopathy Classification with Superpixel Tokenization } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {36 -- 45}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a graph-based clustering method with SLIC superpixels for DR grading, which enhances model interpretability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Interpretability in deep learning stands as a crucial subject, and the authors have made strides by integrating superpixel and graph-based clustering techniques. With pertinent enhancements to their methodology, they have provided a relatively satisfactory approach, introducing a novel means of comprehending the model’s workings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The prototype initialization method hinges on labeled data, potentially limiting its applicability in scenarios lacking segmentation labels. 2) Table 1 may present an unfair comparison due to variations in resolution among the compared items.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper mentions the issue of GPU resource consumption. It would be helpful to report the time and resources required for training more quantitatively and to compare them with those of other methods.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a relatively complete article. The methods are helpful in improving the interpretability of the model, the method is innovative, and the experiments are relatively adequate.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    There are two major contributions of the proposed work:

    1. the matching of segments to semantic prototypes
    2. the graph-based clustering of tokens to merge similar adjacent segments
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The Transformer has the merit of scalability, but a conventional ViT divides the image into square regions for tokenization, which conflicts with the irregular shape of anatomical structures. This work uses superpixels to tackle this difficulty and achieves good performance.
    2. This work borrows the idea of community detection in social networks to cluster tokens, merging similar adjacent segments for pooling. Each segment is represented as a node, adjacent nodes are connected by edges, and a similarity measure is attached to each edge as its weight. Segments are merged by clustering nodes. The idea is novel and natural.
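The graph construction described in the second point can be illustrated with a small sketch. It is not the authors' community-detection algorithm: as a simplifying assumption, it swaps in a plain similarity threshold followed by union-find over the adjacency edges. The function names, the cosine similarity measure and the `threshold` value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def merge_adjacent_segments(tokens, edges, threshold=0.9):
    """Merge segments joined by an edge whose similarity reaches the threshold.

    tokens: one feature vector per segment (node).
    edges: pairs (i, j) of spatially adjacent segments.
    Returns one cluster label per segment, numbered by first appearance.
    NOTE: thresholding + union-find is a stand-in for community detection.
    """
    parent = list(range(len(tokens)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        # The edge weight is the similarity between the two segment features.
        if cosine(tokens[i], tokens[j]) >= threshold:
            parent[find(i)] = find(j)

    remap = {}
    return [remap.setdefault(find(i), len(remap)) for i in range(len(tokens))]

# Toy chain of four segments: 0-1 and 2-3 are similar, 1-2 is not.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_edges = [(0, 1), (1, 2), (2, 3)]
labels = merge_adjacent_segments(toy_tokens, toy_edges)
```

Only adjacent segments can be merged here, because similarity is evaluated on the edges of the adjacency graph rather than on all pairs of tokens.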
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed work depends heavily on the existing SLIC algorithm and does not explain the details of the superpixel tokenization process.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The algorithmic details are explained thoroughly. The work could be reproduced by a graduate student.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed work depends heavily on the existing SLIC algorithm and skips some explanations. It would be helpful to give more details on the tokenization process, such as whether each superpixel is treated as a token and how the order of the tokens is determined from the adjacency graph. It will be

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work proposes using superpixels for tokenization in a Transformer framework, which is natural and effective. It also proposes using a community detection algorithm from social network analysis to merge segments for pooling, which is creative and works well. The experimental results are convincing and show improvements over existing ViT methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper describes a segmentation-driven approach for feature extraction in retinal images for the classification of diabetic retinopathy. A CNN extracts features from the retinal image, which are then tokenized based on SLIC superpixels (recombined based on affinity). A Transformer model outputs the DR grading.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The manuscript is very well written and easy to follow and the methods are thoroughly described
    • A strong experimental validation was performed
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Authors should provide a description of the used performance metric
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper proposes an interesting approach for automated detection/grading of diabetic retinopathy. The manuscript is very well written and the research goals and methods used are clearly described.

    The authors should briefly describe the performance metric used (Cohen’s quadratic kappa), or at least provide a suitable reference so that prospective readers can easily understand the reported results.

    In section 3.2, “Prototype Initialization”, the authors could provide a brief description of the biomarkers from the MAPLES-DR dataset (not only in a figure caption). Although a reference is provided, including it in the manuscript makes it easier to follow.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very well written and easy to follow. The reported validation seems sufficient (three independent validation datasets) and the obtained results seem promising, even with a somewhat short training period.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their time and their thoughtful comments. All of their remarks are insightful, and our future work will address the concerns that were raised. We apologize for the scarcity of details, which is due to the page limit.

Prototype initialization (R1+R4)

R4 noted that the biomarkers used for prototype initialization are only specified in the caption for Figure 2. We will try to move them into the body text for the final version. R1 noted that our initialization strategy needs labeled data. This is indeed a limitation towards the interpretability of our method. However, we do not need a large amount of labeled data to obtain high-quality prototypes (MAPLES-DR, for example, is only 198 images), making this labeling relatively easy.

Comparison to lower resolution methods (R1)

Our ViT baselines from [22] and [14] do indeed operate at a lower resolution than the CNN baseline and our proposed method. Our goal with this work is to propose a new method that allows Transformer-like models to operate effectively at high resolutions through superpixel tokenization and clustering, whereas existing methods are constrained by the quadratic memory cost of the attention mechanism. As such, we don’t think that this comparison is unfair, because our method is what makes operating at those higher resolutions possible.
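To make the quadratic cost concrete, the following back-of-the-envelope sketch counts tokens and attention-matrix entries for a patch-based ViT. The image sizes (224 and 1024) and the patch size (16) are illustrative assumptions, not values taken from the paper or its baselines, and the function name is hypothetical.

```python
def vit_attention_cost(image_size: int, patch_size: int) -> tuple:
    """Return (number of tokens, entries in one attention matrix).

    Assumes a square image split into non-overlapping square patches,
    one attention head, and no class token.
    """
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens

# Assumed, illustrative resolutions only.
low_res = vit_attention_cost(224, 16)    # (196, 38416)
high_res = vit_attention_cost(1024, 16)  # (4096, 16777216)
growth = high_res[1] / low_res[1]        # roughly 437x more entries
```

Under these assumptions, a resolution increase of about 4.6x per side multiplies the size of each attention matrix by several hundred, which is the constraint a smaller set of region-based tokens is meant to relax.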

Resource utilization (R1)

R1 mentioned that we do not quantify the time and GPU memory necessary to train each method. The original code, published alongside the paper, relies on a naive (but much simpler to implement) interpolation of the whole feature maps to the original resolution. In practice, there is a lot of room for improvement: for example, we can use a custom CUDA operator which interpolates only within the area of each superpixel and directly reduces the corresponding pixels in parallel. Since the original submission, we have worked on such an implementation, which can be found on a distinct branch of the same code repository. Inference has been considerably accelerated, but the current state of this implementation does not allow proper benchmarking (in particular, the custom backpropagation of this operation remains to be implemented).

Superpixel tokenization (R3)

R3 noted that our work depends heavily on the SLIC algorithm. Our model is a functional proof of concept which we believe encourages the exploration of new tokenization approaches. SLIC has the triple advantage of being versatile, fast and easy to implement. It is a natural candidate for validating the concept of superpixel tokens, but any improvement in the segmentation itself should lead to an even better and more interpretable ViT.

R3 also noted a lack of clarity about the tokenization process. Each superpixel segment is indeed considered to be one token to be fed to the Transformer. We average the vectors from the feature maps at the corresponding pixel locations. Our approach does not use positional embedding.
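The averaging step described above can be sketched as follows. This is an illustration rather than the released code: it assumes the feature map has already been resized to the resolution of the superpixel label map, and the function name and toy arrays are hypothetical. In a real pipeline the label map would come from a SLIC implementation.

```python
import numpy as np

def superpixel_tokens(features: np.ndarray, segments: np.ndarray) -> np.ndarray:
    """Average an (H, W, C) feature map over each superpixel.

    segments is an (H, W) integer label map. Returns an (N, C) array with
    one token per label present, ordered by increasing label id.
    """
    h, w, c = features.shape
    flat_feats = features.reshape(h * w, c)
    # Remap arbitrary label ids to consecutive indices 0..N-1.
    labels, inverse = np.unique(segments.reshape(-1), return_inverse=True)
    inverse = inverse.reshape(-1)
    # Accumulate per-segment feature sums, then divide by the pixel counts.
    sums = np.zeros((labels.size, c), dtype=np.float64)
    np.add.at(sums, inverse, flat_feats)
    counts = np.bincount(inverse, minlength=labels.size)
    return sums / counts[:, None]

# Toy example: a 2x3 image with two superpixels and two feature channels.
segments = np.array([[0, 0, 1],
                     [0, 1, 1]])
features = np.arange(12, dtype=np.float64).reshape(2, 3, 2)
tokens = superpixel_tokens(features, segments)  # shape (2, 2)
```

Each row of `tokens` is one input token for the Transformer. The number of tokens follows the number of superpixels rather than a fixed patch grid.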

We should also emphasize that the order in which the tokens are considered in the adjacency graph is not important, as each of our operations (attention, pooling, etc.) is permutation invariant.
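This property can be checked numerically on a toy example. The sketch below uses a single-head self-attention layer with random weights and no positional embedding. It is an assumption-laden stand-in for the actual model, and all names are hypothetical. Strictly speaking, the attention output is permutation equivariant (it is reordered along with the tokens), and the pooled output is permutation invariant.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                       # 5 tokens, 4 features
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
perm = rng.permutation(5)                         # arbitrary reordering

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)

# Reordering the input tokens only reorders the output tokens ...
equivariant = np.allclose(out[perm], out_perm)
# ... so any order-independent pooling (here, the mean) is unchanged.
invariant = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
```

Adding a positional embedding indexed by token order would break this property, which is consistent with the statement that the approach does not use one.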

Evaluation with Cohen’s kappa (R4)

R4 noted the lack of a reference to Cohen’s kappa as a performance metric. This metric has been the de facto standard in DR grading for several years in competitions [5, 8] and in the literature [3, 12, 14, 17, 20].
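For readers unfamiliar with the metric, the quadratic weighted form of Cohen's kappa compares the observed confusion matrix with the one expected by chance from the marginals, penalizing each disagreement by the squared distance between grades. The sketch below is a minimal illustration with a hypothetical function name, not the evaluation code used for the paper.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for integer labels 0..num_classes-1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(observed, (y_true, y_pred), 1)
    # Confusion matrix expected by chance, built from the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic penalty: 0 on the diagonal, 1 for the largest possible error.
    grades = np.arange(num_classes)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (num_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)  # 1.0
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)         # 0.0
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. Because of the quadratic weights, confusing distant grades costs more than confusing neighboring ones.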




Meta-Review

Meta-review not available, early accepted paper.


