Abstract

In recent years, digital pathology has made remarkable progress, driven by the ability of advanced deep-learning algorithms to model complex tissue patterns. However, the robustness of these models is often severely compromised in the presence of data shifts (e.g., different stains, organs, or centers). Continual learning (CL) techniques address this by aiming to reduce the forgetting of past data when learning new data under distribution shift. In particular, rehearsal-based CL techniques, which store some past data in a buffer and replay it together with new data, have proven effective in medical image analysis tasks. However, because these approaches store past data, they raise privacy concerns, which prompted the development of our novel Generative Latent Replay-based CL (GLRCL) approach. GLRCL captures the previous distribution with Gaussian Mixture Models instead of storing past samples; these models are then used to generate features and perform latent replay with new data. We systematically evaluate the proposed framework under different shift conditions in histopathology data, including stain and organ shift. Our approach significantly outperforms popular buffer-free CL approaches and performs similarly to rehearsal-based CL approaches that require large buffers, which cause serious privacy violations.
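
The GLRCL idea summarised above (model a past distribution with a GMM, then sample features from it for latent replay) can be sketched in a few lines of NumPy. This is a hedged, minimal sketch and not the authors' implementation: it assumes one diagonal-covariance GMM per class, fitted with plain EM to features that would come from a frozen encoder (simulated here with random vectors). The names `fit_diag_gmm` and `sample_diag_gmm` are hypothetical.

```python
import numpy as np


def fit_diag_gmm(x, k, n_iter=50, seed=0, eps=1e-6):
    """Fit a k-component diagonal-covariance GMM to features x of shape (n, d) with EM."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    # Farthest-point initialisation of the component means.
    idx = [int(rng.integers(n))]
    for _ in range(1, k):
        dist = np.min(((x[:, None, :] - x[idx][None, :, :]) ** 2).sum(axis=2), axis=1)
        idx.append(int(np.argmax(dist)))
    means = x[idx].copy()
    variances = np.tile(x.var(axis=0) + eps, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities from log pi_k + log N(x | mu_k, diag(var_k)).
        diff = x[:, None, :] - means[None, :, :]  # (n, k, d)
        log_p = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1)) + np.log(weights)
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: update mixture weights, means and diagonal variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x ** 2) / nk[:, None] - means ** 2, eps)
    return weights, means, variances


def sample_diag_gmm(weights, means, variances, n, rng):
    """Draw n synthetic feature vectors from a stored diagonal GMM."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise


rng = np.random.default_rng(42)
d = 16
# Toy stand-in for frozen-encoder features of one class in a past domain (two modes).
feats = np.vstack([rng.normal(-2.0, 1.0, (300, d)), rng.normal(2.0, 0.5, (200, d))])
gmm = fit_diag_gmm(feats, k=2)
# After the episode only `gmm` (weights, means, variances) is kept; `feats` can be discarded.
replayed = sample_diag_gmm(*gmm, n=2000, rng=rng)
print(replayed.shape)  # (2000, 16)
```

Only the weights, means, and variances survive between episodes, which is what makes this style of replay buffer-free; how faithful the replayed features are depends on how well a few Gaussian components capture each class distribution.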

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2182_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2182_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{Kum_Continual_MICCAI2024,
        author = { Kumari, Pratibha and Reisenbüchler, Daniel and Luttner, Lucas and Schaadt, Nadine S. and Feuerhake, Friedrich and Merhof, Dorit},
        title = { { Continual Domain Incremental Learning for Privacy-aware Digital Pathology } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a GMM-based feature generation method for domain incremental learning. Experimental evaluations on multiple pathology image datasets and settings support the efficacy of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is buffer-free and therefore avoids the privacy issue. Also, the method is simple and easy to implement.
    2. Multiple domain incremental settings were included in experiments.
    3. The paper is well written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed GMM-based method is not novel: generating samples in either data space or feature space from GMMs has been proposed before. Please refer to (Yang et al., Dynamic Support Network for Few-shot Class Incremental Learning, PAMI’2022) and (Pfulb et al., Overcoming Catastrophic Forgetting with Gaussian Mixture Replay, arXiv’2021). In particular, the GMM-based method in this submission can be considered a simplified version of the method from Yang’s study.
    2. Multiple closely related studies are not mentioned. Besides the above two studies, please also refer to (Yang et al., Continual Learning with Bayesian Model based on a Fixed Pre-trained Feature Extractor, MICCAI’2021).
    3. The baselines in the experiments are largely out of date. More recently proposed methods (e.g., those mentioned above and well-known buffer-based methods like iCaRL and DER) should be included in the experiments.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weaknesses listed above. In addition, two ablation studies should be added. One replaces the GMMs by simply storing a number of feature vectors whose memory consumption is similar to the size of the GMMs per class. The other removes the classifier head and simply uses the GMMs as a generative model for classification. Please also discuss the limitations of the method, e.g., what happens if the feature distributions of different classes or domains overlap? Also, please clarify whether the training-test split is at slide level for the first and third settings (first and third rows in Table 1).
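
The second suggested ablation (dropping the classifier head and classifying directly with the per-class GMMs) amounts to a generative Bayes classifier. A minimal sketch under stated assumptions: toy 8-d features replace real encoder features, each class-conditional density is fitted with a single diagonal component for brevity (the scoring function itself accepts any number of components), and `diag_gmm_loglik`, `fit_one_component`, and `gmm_classify` are hypothetical names, not code from the paper.

```python
import numpy as np


def diag_gmm_loglik(x, weights, means, variances):
    """Per-sample log p(x) under a diagonal GMM; x is (n, d), returns (n,)."""
    diff = x[:, None, :] - means[None, :, :]  # (n, k, d)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    return np.logaddexp.reduce(log_comp + np.log(weights), axis=1)


def fit_one_component(x, eps=1e-6):
    """Simplification: a single-component 'GMM' (mean and diagonal variance)."""
    return np.array([1.0]), x.mean(axis=0, keepdims=True), x.var(axis=0, keepdims=True) + eps


def gmm_classify(x, class_gmms, log_priors):
    """Generative classifier: argmax_c [log p(x | c) + log p(c)], no classifier head."""
    scores = np.stack([diag_gmm_loglik(x, *g) for g in class_gmms], axis=1)  # (n, C)
    return np.argmax(scores + log_priors, axis=1)


rng = np.random.default_rng(0)
d, n = 8, 200
centres = [np.zeros(d), np.full(d, 3.0), np.full(d, -3.0)]  # toy class-conditional means
train = [c + rng.standard_normal((n, d)) for c in centres]
test_x = np.vstack([c + rng.standard_normal((n, d)) for c in centres])
test_y = np.repeat(np.arange(3), n)

class_gmms = [fit_one_component(f) for f in train]
pred = gmm_classify(test_x, class_gmms, np.log(np.full(3, 1 / 3)))
print(f"accuracy: {np.mean(pred == test_y):.3f}")
```

When class-conditional feature distributions overlap, as the reviewer asks about, the per-class log-likelihoods become close and accuracy drops; moving the toy centres closer together shows the effect.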

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Due to the limited novelty in methodology, multiple missing related work, and lack of comparisons with SOTA baselines in experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Some of my concerns have been addressed. However, the lack of empirical comparisons with recently proposed SOTA methods and the tile-based train-test split leave my original decision unchanged.



Review #2

  • Please describe the contribution of the paper

    The paper addresses the problem of forgetting in continual learning (CL) in the presence of various data distribution shifts. The main application is domain incremental learning for digital histopathology cell classification in the presence of various stains, organs, and data acquisition centers. The proposed method trains domain-specific data generators in a latent space and uses them to replay past samples when updating the model with new domain data. The domain-specific generators are modeled using Gaussian Mixture Models (GMMs) and trained on a feature space computed with a frozen neural network image encoder. Experimental results show the advantages of the proposed approach over other buffer-free approaches, with results comparable to methods that replay actual past samples (buffer-based) and to reference methods that ultimately see all the available data.
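
The replay step described in this review (domain-specific generators in a frozen feature space, sampled when updating on a new domain) might look like the toy below. The assumptions are hypothetical and not taken from the paper: 4-d synthetic features stand in for frozen-encoder outputs, each stored generator is a one-component diagonal Gaussian keyed by (domain, class), and a linear softmax head plays the role of the learnable layers after the latent replay layer. `build_replay_set` and `train_softmax_head` are illustrative names.

```python
import numpy as np


def sample_diag_gmm(weights, means, variances, n, rng):
    """Draw n feature vectors from a stored diagonal GMM."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + np.sqrt(variances[comp]) * rng.standard_normal((n, means.shape[1]))


def build_replay_set(x_new, y_new, generators, n_per_class, rng):
    """Mix real current-domain features with features sampled from stored generators.

    generators maps (domain, class_id) -> (weights, means, variances) from past episodes.
    """
    xs, ys = [x_new], [y_new]
    for (_, cls), params in generators.items():
        xs.append(sample_diag_gmm(*params, n=n_per_class, rng=rng))
        ys.append(np.full(n_per_class, cls))
    return np.concatenate(xs), np.concatenate(ys)


def train_softmax_head(x, y, n_classes, lr=0.1, epochs=300):
    """Full-batch gradient descent on a linear softmax head (the learnable layer)."""
    w, b = np.zeros((x.shape[1], n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = x @ w + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(x)
        w -= lr * x.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


rng = np.random.default_rng(0)


def make_domain(signal_dims, n=300, d=4):
    """Toy domain: the class signal lives on different feature dimensions per domain."""
    y = rng.integers(0, 2, n)
    x = rng.standard_normal((n, d))
    x[:, signal_dims] += np.where(y == 0, 3.0, -3.0)[:, None]
    return x, y


x_a, y_a = make_domain([0, 1])  # past domain A
x_b, y_b = make_domain([2, 3])  # current domain B
# After episode A, keep only per-class Gaussian parameters (1-component GMMs for brevity).
generators = {("A", c): (np.array([1.0]), x_a[y_a == c].mean(axis=0, keepdims=True),
                         x_a[y_a == c].var(axis=0, keepdims=True)) for c in (0, 1)}
del x_a, y_a  # raw domain-A features are not retained

x_mix, y_mix = build_replay_set(x_b, y_b, generators, n_per_class=150, rng=rng)
w, b = train_softmax_head(x_mix, y_mix, n_classes=2)
x_test, y_test = make_domain([0, 1])  # fresh domain-A data, used only for evaluation
acc_old = np.mean(np.argmax(x_test @ w + b, axis=1) == y_test)
print(f"domain-A accuracy after training on B + replay: {acc_old:.2f}")
```

Training the same head on domain B alone, without the sampled domain-A features, typically leaves it near chance on domain A in this toy, which is the forgetting that latent replay is meant to prevent.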

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    a. Use of domain-specific GMMs for generative data replay, with the benefit of not requiring storage of past domains' data
    b. Generative replay in a relatively low-dimensional space (latent space) for decreased training complexity
    c. Convincing results on the effectiveness of the proposed method under several domain shifts: stain shift, organ shift, and mixed

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    a. Using generative data replay for CL was proposed in principle in the following article, which slightly limits the novelty of the proposed work: Shin, Hanul, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. “Continual learning with deep generative replay.” Advances in Neural Information Processing Systems 30 (2017).
    b. It is not clear from the article how much data augmentation (intensity/color-based, style transfer) is performed during training for a domain, and how such augmentation could mitigate some of the performance gaps in general.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    More support for the effectiveness of the proposed method could be provided by addressing mitigation through other typical types of imaging data augmentation (see weakness b above).

    Can the authors comment on whether completely freezing the image encoder limits performance, since this assumes that the latent space is sufficiently rich to express the class differences in histopathology?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is a relatively simple and effective way to mitigate forgetting in continual learning when past data is not available for model updates (e.g., in federated learning or privacy-aware applications). The main novelty is the use of domain-specific GMMs in latent space for sample replay.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    • The paper proposes a replay-based continual learning approach that uses generative modelling instead of storing prior training data. This enables it to overcome the privacy concerns associated with retaining data. Instead of modelling the actual input data, the model generates intermediate latent features.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses a relevant problem with a new approach
    • The results presented in the paper are convincing and support its claims: the method is comparable to buffer-based approaches, outperforms approaches without a buffer, and outperforms buffer-based approaches with a small buffer.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is not entirely clear how this approach would deal with low-level differences across domains, i.e., if the layer modelled by the GMM is lower in the network and therefore higher-dimensional. This should be explained in the paper.
    • The choice of layer is not explained; a comparison of different layers used for this approach would be helpful.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Is the GMM sufficient to capture the variability in the data, or can you clarify (also experimentally) where the limits of this approach are?
    • In some continual learning / domain shift scenarios, the domains actually differ in the low levels of the network, e.g., when image characteristics are concerned. Can you give a sense of what happens if you move to a lower layer to extract the intermediate latent representation, i.e., if a bigger part of the network becomes domain-specific (I suspect that then even a higher-level part of the network might be shared)? Is this a dimensionality problem for this approach?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant problem, the methodology is well explained and novel, and the empirical evaluation is thorough.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

R1+R3+R4) Choice of feature encoder: As mentioned in Sec. 2a, a wide range of encoders can be used. Because the ResNet backbone is the most commonly used in the literature, we also use it here for consistency. The layer prior to the classification layer is used for various downstream tasks because of its rich discriminative abilities. As mentioned in the last paragraph of Sec. 2a, fixing the encoder prevents the stored generators from becoming invalid over time. We agree that accuracy could be further improved with an encoder offering more discriminative features, and in future work we will explore more feature-extractor backbones.

R3) Novelty: To the best of our knowledge, generative replay has not yet been explored in digital pathology. Our solution is privacy-aware, performs better than buffer-free methods, and is comparable to buffer-based methods. Additionally, unlike buffer-based methods, it does not require large storage. We mention these novelties in Sec. 1 but probably need to highlight them more, and we will revise Sec. 1 to avoid any misinterpretation regarding novelty.

R3+R4) Referred works: As suggested, the referred works will be added. Our work differs significantly from the referenced studies in several aspects, including (a) the target research domain, (b) the incremental-learning scenario, and (c) the CL strategy. The referred works mainly focus on learning new classes from a single domain in subsequent episodes on natural images, whereas we focus on learning shifts arising from domain changes (organ, stain, etc.) in pathology. In this setting, a complex architecture-based CL strategy, as used for new-class arrival (Yang et al. 2022, which focuses on few-shot learning), is not necessary. The literature indicates that replay strategies outperform regularisation- and architecture-based strategies. However, generating actual images for replay (Pfulb et al. 2021) would cause critical privacy violations. Additionally, generating artificial whole slide image patches with a GMM (Pfulb et al. 2021) or a GAN (Shin et al. 2017) is more challenging than generating digit-like images. GAN-based approaches would raise further challenges for pathology: e.g., updating a single GAN generator may cause photocopy problems for old domains, domain-specific GANs would add model complexity, and training a GAN with limited data may lead to unstable pathology image generation. In contrast, our generator is lightweight, works even with limited training data, and offers privacy-aware replay, which is a key factor when applying CL in pathology. Further, in contrast to Yang et al. 2021, our framework continuously accumulates knowledge from novel domains in the learnable layers after the latent replay layer, which supports its long-term usage. Additionally, unlike Yang et al. 2021, our work does not require pre-training the feature extractor on multiple domains before the CL sessions.

R3) More baselines: We cover a wide variety of baselines commonly used in domain-incremental work. The buffer-based baselines considered (2017-2020) are from years similar to iCaRL (2017). We also obtained results for further buffer-based methods (GDumb, MIR, DER, iCaRL), which performed similarly to A/GEM, ER, and LR (Sec. 4, para. 1), but we only show the most common ones as representative upper bounds. Adding more buffer-based methods would not provide additional insight and would shift the focus of our paper away from privacy-aware, buffer-free approaches.

R3) Train-test split: The number of stain-shift slides is not sufficient for a slide-level split. However, after consulting with our medical collaborators, we found that our tile datasets include sufficient tissue variability across splits. Also, a tile-level split is a common strategy in digital pathology.

R4) Augmentation: Following the definition of CL in the literature, intensity- or color-based style transfer is not performed (such augmentation cannot cover all possible variability arising from different shifts, e.g., organs, centers, and stains).
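
The rebuttal's argument that a fixed encoder keeps stored generators valid can be illustrated with a toy example. The assumptions are ours, not the paper's: a random linear map stands in for the frozen encoder, a single diagonal Gaussian stands in for a stored GMM, and an encoder update is simulated by adding noise to the weights. The stored Gaussian explains features from the unchanged encoder better than features from the drifted one, so replaying from it would no longer match what an updated encoder produces. `fit_diag_gaussian` and `mean_loglik` are hypothetical names.

```python
import numpy as np


def fit_diag_gaussian(f, eps=1e-6):
    """Stored 'generator': mean and diagonal variance of the features."""
    return f.mean(axis=0), f.var(axis=0) + eps


def mean_loglik(f, mean, var):
    """Average per-sample log N(f | mean, diag(var))."""
    return float(np.mean(-0.5 * np.sum((f - mean) ** 2 / var + np.log(2 * np.pi * var), axis=1)))


rng = np.random.default_rng(0)
w_frozen = rng.standard_normal((32, 8))              # toy linear "encoder": 32-d input -> 8-d feature
w_drifted = w_frozen + rng.standard_normal((32, 8))  # the same encoder after simulated fine-tuning

old_inputs = rng.standard_normal((2000, 32))         # stand-in for old-domain tiles
mean, var = fit_diag_gaussian(old_inputs @ w_frozen)  # generator stored after the old episode

held_out = rng.standard_normal((2000, 32))           # held-out old-domain inputs, used only to measure mismatch
ll_frozen = mean_loglik(held_out @ w_frozen, mean, var)
ll_drifted = mean_loglik(held_out @ w_drifted, mean, var)
print(f"frozen encoder: {ll_frozen:.2f}  drifted encoder: {ll_drifted:.2f}")
```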




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although one reviewer maintains the ‘weak reject’ after the rebuttal, the reason is missing experiments, which cannot be added during the rebuttal. Hence, I suggest accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


