Abstract

Interpretability is a key requirement for the use of machine learning models in high-stakes applications, including medical diagnosis. Explaining black-box models mostly relies on post-hoc methods that do not faithfully reflect the model’s behavior. As a remedy, prototype-based networks have been proposed, but their interpretability is limited as they have been shown to provide coarse, unreliable, and imprecise explanations. In this work, we introduce Proto-BagNets, an interpretable-by-design prototype-based model that combines the advantages of bag-of-local feature models and prototype learning to provide meaningful, coherent, and relevant prototypical parts needed for accurate and interpretable image classification tasks. We evaluated the Proto-BagNet for drusen detection on publicly available retinal OCT data. The Proto-BagNet performed comparably to the state-of-the-art interpretable and non-interpretable models while providing faithful, accurate, and clinically meaningful local and global explanations.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0480_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0480_supp.pdf

Link to the Code Repository

https://github.com/kdjoumessi/Proto-BagNets

Link to the Dataset(s)

https://data.mendeley.com/datasets/rscbjbr9sj/3

https://www.kaggle.com/datasets/paultimothymooney/kermany2018

BibTex

@InProceedings{Djo_This_MICCAI2024,
        author = { Djoumessi, Kerol and Bah, Bubacarr and Kühlewein, Laura and Berens, Philipp and Koch, Lisa},
        title = { { This actually looks like that: Proto-BagNets for local and global interpretability-by-design } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces Proto-BagNets, a prototype-based model designed for identifying drusen lesions in OCT images, extending the concepts introduced in the earlier ProtoPNets paper. While the method doesn’t exhibit superior classification performance compared to baseline approaches, the authors assert that its utilization of localized learned prototypes enhances the interpretability of the model. This aspect holds significant clinical relevance as conventional post-hoc explanations often fall short in providing comprehensive interpretability, whereas prototype-based learning endeavors to address this gap. However, there are some limitations to this proposed study which the authors need to address to be accepted as a paper in this conference.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper displays strong motivation, a well-structured presentation, and tackles a clinically significant issue regarding the interpretability of AI models in medical image diagnosis.

    2. Building upon a foundation of existing research on prototype-based learning, the authors introduce ProtoBagNets as an advancement over the widely recognized ProtoPNets model. This introduction not only acknowledges the existing body of literature but also demonstrates how localized explanations can enhance trustworthiness and improve interpretability.

    3. The authors have augmented the credibility of their work by engaging an ophthalmologist to validate their method. This collaboration underscores the practical applicability and reliability of the proposed approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The distribution of drusen cases in the test set deviates from typical clinical scenarios, with nearly equal proportions of healthy and drusen images. Adjusting the test set to mirror real-world data distributions would enhance the practical relevance of the evaluation.

    2. It is recommended that the authors provide precision and recall scores alongside other performance metrics to offer a comprehensive understanding of the classification efficacy, particularly after modifying the test set to reflect realistic data distributions.

    3. The observed lower performance of Proto-BagNet compared to baseline methods, as indicated in Table 1, warrants further investigation. The absence of error bars in the results table raises concerns about the reliability of the reported classification performance, necessitating additional scrutiny.

    4. The evaluation of the method’s robustness across varying receptive field sizes is crucial, especially considering its potential application to images with diverse scales, such as gigapixel histology images. Furthermore, comparing the prototypes generated by ProtoPNet with those of Proto-BagNet could provide valuable insights into the effectiveness of the proposed approach and its comparative advantages.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is recommended that the authors consider modifying the test set so that the ratio of drusen to healthy cases reflects the typical scenario encountered in clinical practice.

    2. In addition to reporting standard performance metrics, such as accuracy, it would be beneficial for the authors to include precision and recall scores derived from the confusion matrix. Furthermore, incorporating error bars in the experimental results would enhance the robustness and reliability of the reported findings.

    3. Exploring the impact of varying receptive field sizes on classification performance while preserving interpretability through prototypical analysis could provide valuable insights. Therefore, it is suggested that the authors investigate this aspect to ensure the method’s applicability across different image scales.

    4. While emphasizing the importance of interpretability, it is essential for the authors to also address and discuss the classification performance of their method. Striking a balance between interpretability and classification accuracy is crucial for the practical utility of the proposed approach.

    5. To gain a comprehensive understanding of the algorithm’s behavior, it would be beneficial for the authors to discuss any instances of misleading prototypes or failure modes encountered during experimentation. This insight could inform further refinement and improvement of the proposed method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper demonstrates innovation in addressing the interpretability challenges of AI models in medical image diagnosis, it falls short in crucial aspects such as maintaining a balanced test set distribution, reporting comprehensive performance metrics, and addressing concerns regarding classification performance and algorithm robustness. These limitations detract from the overall strength of the contribution and suggest that further refinement is needed before the paper can be considered for acceptance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the weak points presented in the review. Further, they have incorporated additional performance metrics upon review.



Review #2

  • Please describe the contribution of the paper
    • Introduced a novel approach termed Proto-BagNets to mitigate the issue of receptive fields and faithfulness in traditional ProtoPNet models.
    • Validated the performance, semantic understanding, and faithfulness of Proto-BagNets using retinal OCT (Optical Coherence Tomography) data.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The technical details are comprehensively provided.
    • The proposed method is concise and straightforward to implement.
    • The performance is competitive when compared to ResNet.
    • The method’s restriction on the prototype’s receptive field is both evident and significant.
    • The paper thoroughly validates multiple aspects, including accuracy, semantics, and faithfulness.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of Clarity: The motivation is not clearly stated. Merely referring to previous literature does not make the motivation clear to readers.
    2. English Writing and Presentation: The introduction’s final sentence is overly long and difficult to read. Given Weakness #1, the introduction should be rewritten for clarity. In Figure 2, the use of a yellow box to highlight the receptive field is commendable, but it is unclear why the green box representing the prototype in the second row is identical to the first. Based on my understanding, should the yellow and green boxes in Proto-BagNet overlap?
    3. Outdated Baselines in Section 3.2: The paper should include more recent benchmarks such as TesNet and ProtoPool, and from a receptive field perspective, Deformable ProtoPNet should also be considered. There are also works that construct ProtoPNet using Vision Transformer (ViT) architectures [1].
    4. Insufficient Analysis in Section 3.4: Further exploration is needed to answer the following questions:
      • The rationale behind the selection strategy for the 40-sample subset is unclear. Are they samples that were successfully predicted? What would be the implications if they were misclassified?
      • What constitutes the prototype for the “healthy” class?
      • How does precision change from k=1 to k=5?
    5. Insufficient Evaluation in Section 3.5: While the section analyzes the fidelity of Proto-BagNets, the fidelity data for ProtoPNet is missing, making it impossible to assess contributions in this area.

    Reference: [1] Xue, Mengqi, et al. “Protopformer: Concentrating on prototypical parts in vision transformers for interpretable image recognition.” arXiv preprint arXiv:2208.10431 (2022).

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Will the data split and the clinicians’ annotations be made public?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Reorganization: Reorganize the introduction to allow readers to intuitively understand the limitations of current prototypes in medical imaging, particularly the large receptive field and fidelity issues addressed by Proto-BagNets. Address how these issues impact interpretability and whether their effects are more significant in OCT images.
    2. Detailed comparison with Recent Method: The latest benchmarks, including Deformable ProtoPNet and ViT-based ProtoPNet, also address the issue of large receptive fields. The authors are encouraged to include necessary discussions and, if possible, comparative experiments in subsequent revisions.
    3. Language and Figure Polishing: The current presentation may be confusing for those not familiar with prototype research, such as in Figure 2.
    4. Title Modification: The paper currently only validates one disease within OCT data. It is suggested that the title includes “for the diagnosis of drusen” to specify the application.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have presented a new method that addresses some of the limitations of the current ProtoPNet, with comprehensive experimental considerations and rich technical details, which contribute to acceptance.

    However, the scope of experiments is somewhat lacking, and there is significant room for improvement in writing and presentation, which may hinder understanding for readers not specialized in interpretability. These reasons contribute to rejection.

    Given the difficulty in validating interpretability and the potential for improvement during the rebuttal period, I believe the reasons for acceptance slightly outweigh those for rejection. However, if the writing and presentation do not see sufficient enhancement, I may not be able to maintain this score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you for your response and your commitment to improvements in subsequent versions. I will maintain my score of weak accept.



Review #3

  • Please describe the contribution of the paper

    The paper develops a novel model called Proto-BagNets in an effort to combine the benefits of local feature models and prototype learning models in a unified architecture. The paper evaluates the proposed architecture on the task of drusen detection and shows that it can perform on par with previous models while yielding local and global explanations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is interesting and can successfully achieve the benefits of both local feature models as well as prototype learning models.
    2. The experimental results show that the prototypes learned by the model are aligned with the symptoms that ophthalmologists consider when making decisions about cases.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method has numerous hyperparameters, which makes it challenging to tune in practice. Further, no details or guidelines about how to set them are provided.
       1.1. How should the receptive field size r = 33 be set? It assumes that the prototype’s patch size is almost 33. The training images are resized to 496 when training the model. How should one change r when using a different image size?
       1.2. How do the results change if the parameter ‘K’ is set to higher or lower values?
       1.3. What values of the \lambda coefficients were used in the training objective?
    2. The experiments in Sec. 3.5 are really interesting, but I believe that the paper should present performance metrics of the model when it is applied on the masked inputs. For instance, how much is AUC when applied on the masked inputs? This can better support the argument that the model learns to focus on the disease-related symptoms.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    My suggestions are that:

    1. The paper provides detailed instructions and suggestions about how to set the hyperparameters of the proposed method.
    2. New results showing the performance of the method on the masked inputs be added to the paper.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe that the paper introduces an interesting method to combine the benefits of the local feature models and prototype models. The experiments show the alignment of the learned prototypes and the symptoms used in clinical practice.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ responses in the rebuttal. The rebuttal addresses most of my concerns about the paper:

    1. The description of how the patch size r=33 is chosen in the framework should be added to the paper. It is an interesting way to do so.
    2. All the hyperparameters should be documented in the paper.

    I think the paper can be improved by adding the model’s performance after masking the informative features of the input.

    In summary, the paper’s strengths outweigh its weaknesses, and I recommend acceptance.




Author Feedback

We thank the reviewers for their constructive and comprehensive feedback. The reviewers appreciated a clear and justified motivation, the clinical relevance, the thorough validation and the well-written paper. Below, we address the main comments:

Data distribution [R1]. While the distribution of the test set seems to deviate from a typical clinical scenario, the prevalence of drusen can vary strongly depending e.g. on age group (e.g. ~30% for 20-24 years; to ~49% for 45-49 years; ref1). Another study [ref2] reported even higher drusen frequency. While we agree that matching the distribution of the clinical application setting is important, in this case we therefore used the official test split provided with the dataset [13], which has ~250 images per class. We also reported the classification results on the validation set, which was less balanced (73% healthy vs 27% drusen).

Additional performance measures [R1, R4]. Thank you for asking. We had already computed additional measures but had not reported them in the text. Precision and recall on the test set were 0.996 and 0.940 for ProtoBagNet, and 0.999 and 0.996 for ProtoPNet. We added these values to Table 1. AUC values (0.9918 vs 0.9916) for the performance on masked inputs are in line with our verbal description (Sec. 3.5), and we added them to the text.
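For readers less familiar with the two measures requested by R1, the following is a minimal sketch of how precision and recall follow from confusion-matrix counts for the positive (drusen) class. The function name and the counts are hypothetical and chosen only for illustration; they are not the paper's confusion matrix.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for the positive class.

    tp: positives correctly predicted as positive
    fp: negatives wrongly predicted as positive
    fn: positives wrongly predicted as negative
    """
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.3f}, recall={r:.3f}")
```

A high precision paired with a lower recall, as in the numbers reported above for ProtoBagNet, means the model rarely flags a healthy scan as drusen but misses some drusen scans.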

Performance and interpretability tradeoff [R1, R3]. A good tradeoff between performance and interpretability is crucial for clinical usefulness. We showed that Proto-BagNets perform slightly worse than black-box ResNets and ProtoPNets, which both are less interpretable than our ProtoBagNet. The lower performance comes from loss terms that enhance interpretability but compete with performance. Determining the ideal tradeoff will depend on the specific clinical setting. We rephrased a few sentences to make this clearer. If adding CIs is within the rebuttal guidelines, we are happy to do so. They are generally tight.

Misleading prototypes [R1]. One of the main challenges we encountered initially was prototypes that were redundant and less clinically relevant. We mitigated these issues with a dissimilarity loss term to address redundancy and a sparsity loss term to use only the most relevant concepts, two key contributions of our paper. Now we find very few “misleading prototypes”, only some “unexpected prototypes” which do not contain well-known concepts of drusen (Suppl Fig. 3).

Effect of receptive field size [R1, R4]. The appropriate receptive field sizes may vary depending on the clinical task and image resolution. In our case, drusen are small (< 63µm [ref1,ref2]) and fit into a patch of 33x33. For other tasks, the receptive field can be changed to inject clinical knowledge and adjust for resolution. Due to rebuttal rules, we defer a detailed analysis to future work.
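As a rough illustration of how a receptive field size such as 33 arises from a stack of convolutions, the sketch below applies the standard recurrence (each layer grows the field by (kernel - 1) times the current spacing between neighboring outputs). The layer list is a hypothetical configuration that happens to yield 33; it is not the architecture used in the paper.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of conv/pool layers.

    `layers` is a sequence of (kernel_size, stride) pairs ordered from
    input to output.
    """
    rf = 1    # one output unit initially sees a single input pixel
    jump = 1  # spacing, in input pixels, between neighboring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack of 3x3 convolutions, for illustration only.
hypothetical_stack = [(3, 1), (3, 2), (3, 2), (3, 2), (3, 1)]
print(receptive_field(hypothetical_stack))
```

Because 1x1 convolutions contribute (1 - 1) * jump = 0, replacing 3x3 kernels with 1x1 kernels is one way to cap the receptive field, so the patch size can be adapted to the lesion size and image resolution of a different task.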

Hyperparameters [R3, R4]. We reduced the complexity of our experimental setup by using hyperparameter settings suggested in [2, 4, 13] where possible. The hyperparameters related to our contributions were chosen based on a grid search on the validation set as described. However, we noticed that we did not report the final choices, lam_L1_x = 0.04 and lam_diss = 0.005. Due to rebuttal rules, we defer a detailed analysis to an extended version.

Writing and presentation [R3]. While the clarity, motivation, and presentation of our work were generally appreciated (R1, R4), we followed the suggestions of R3 to improve the paper’s presentation.

Reproducibility and sample selection [R3]. The data split has already been provided in the anonymous repository. The annotations will be made available. The 40 annotated images were randomly selected from the drusen class of the test set. We noticed that 3 of them were misclassified by our model (although most of the topK relevant regions from each prototype highlighted concepts of drusen, while a few of them highlighted unknown concepts).

[ref1] https://shorturl.at/qtMTV

[ref2] https://shorturl.at/bdwFJ




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Three reviewers agree that the paper has novel contribution and merits. The authors introduce ProtoBagNets building upon a foundation of prototype-based learning. It would be interesting to discuss the paper in MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have successfully addressed most of the reviewers’ comments.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



