Abstract

Despite notable advancements, the integration of deep learning (DL) techniques into impactful clinical applications, particularly in the realm of digital histopathology, has been hindered by challenges associated with achieving robust generalization across diverse imaging domains and characteristics. Traditional mitigation strategies in this field, such as data augmentation and stain color normalization, have proven insufficient in addressing this limitation, necessitating the exploration of alternative methodologies. To this end, we propose a novel generative method for domain generalization in histopathology images. Our method employs a generative, self-supervised Vision Transformer to dynamically extract characteristics of image patches and seamlessly infuse them into the original images, thereby creating novel, synthetic images with diverse attributes. By enriching the dataset with such synthesized images, we aim to enhance its holistic nature, facilitating improved generalization of DL models to unseen domains. Extensive experiments conducted on two distinct histopathology datasets demonstrate the effectiveness of our proposed approach, outperforming the state of the art substantially on the Camelyon17-WILDS challenge dataset (+2%) and on a second epithelium-stroma dataset (+26%). Furthermore, we emphasize our method’s ability to readily scale with increasingly available unlabeled data samples and more complex, higher-parameter architectures. Source code is available at github.com/sdoerrich97/vits-are-generative-models.
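The enrichment idea can be sketched in a few lines of Python. This is a hypothetical illustration only: `encoder` and `synthesize` stand in for the self-supervised ViT and the image synthesizer described above and are not the actual API of the released code.

```python
import random

def enrich(dataset, encoder, synthesize, n_views=1):
    """Augment each labeled image with synthetic variants that keep its
    anatomy but borrow the characteristics of other images.

    `encoder` and `synthesize` are hypothetical stand-ins for the
    self-supervised ViT and the image synthesizer described above.
    """
    synthetic = []
    for x_s, y_s in dataset:
        for _ in range(n_views):
            x_m, _ = random.choice(dataset)  # characteristic donor
            z_anat, _ = encoder(x_s)         # anatomy features of x_s
            _, z_char = encoder(x_m)         # characteristic features of x_m
            # The synthetic image keeps the label of x_s, since only
            # appearance (not anatomy) is changed.
            synthetic.append((synthesize(z_anat, z_char), y_s))
    return list(dataset) + synthetic
```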

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0740_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/sdoerrich97/vits-are-generative-models

Link to the Dataset(s)

https://worksheets.codalab.org/rest/bundles/0xe45e15f39fb54e9d9e919556af67aabe/contents/blob/
https://worksheets.codalab.org/rest/bundles/0xa78be8a88a00487a92006936514967d2/contents/blob/
https://drive.google.com/file/d/1YeFcs2yeJmxCFI3puQKUZuac13La1BpW/view

BibTex

@InProceedings{Doe_Selfsupervised_MICCAI2024,
        author = { Doerrich, Sebastian and Di Salvo, Francesco and Ledig, Christian},
        title = { { Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a generative method for domain generalization in digital histopathology. The approach dynamically integrates characteristics from tiny image patches into original images to create synthetic variants, thereby enriching the dataset and improving the generalization capabilities of deep learning models. The method is evaluated on two distinct histopathology datasets and demonstrates improvements.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The work introduces a novel way of generating synthetic images for better domain generalization in histopathology. Overall, I would rate this a great idea: not too complex, and it works.

    • The writing is easy to read, and the paper is generally well structured.

    • The figures support the work well and could only improve a little to be perfect, but I would say this is a matter of taste.

    • Decent evaluation: the basic experimental setup seems sound.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The experiments demonstrate basic effectiveness but do not show “why” the method works or that it is safe to deploy.
      – The method carries an inherent risk of hallucinating content, which might introduce undesired biases; i.e., how good are the FID and SSIM scores for the synthetic images? Reporting only PSNR for the reconstruction is also limiting.
      – The downstream-task results emphasize this problem, as no standard deviations or significance tests are given.
      → In summary, the claims made in the paper sound questionable and overstated.

    • Aggressive writing: the paper harshly discredits other works and tends to overstate its own contribution.
      – Discrediting without proof: “However, these methods often require access to target samples during training or struggle with adapting to new domains and unseen stain colors.” → Such statements require citations.
      – Overstating: “Although our method’s patch-wise image reconstruction may produce slight grid artifacts” → I would argue the images become unusable for human classification.

    • It is not shown that the scalability potential is valuable, as larger ViTs are not evaluated on the downstream task. Also, the ViT is not a key component enabling scalability, as the same holds for CNNs [Smith et al.]. Why pick it over the alternatives? Could a CNN-based approach also have worked with this kind of training scheme? → Simply no ablation of the architecture choices supports the claims made about the ViT.

    • Figure 4 shows characteristics from different institutions. Why are hospitals 4 and 5 included? I hope these were not used as “unlabeled samples” in the training of the downstream model. This is unclear from the text and figure; if they were, it would mean a data-leakage problem, invalidating the results. For my rating, I assumed they were not part of the training data and that a rigorous evaluation was conducted.

    • Why are the lambdas introduced when they are all set to 1 and no experiments necessitate their introduction? This seems to be unnecessary clutter.

    • The figures could have been more efficient; e.g., Figures 2/3/4 all depict much the same thing and could easily be reduced to two figures or even a single one. That said, I generally like them, as they support the reader in understanding the work.

    • Figure 1 is too cluttered for my personal taste, but it helped me better understand the work.

    Smith, Samuel L., et al. “ConvNets match vision transformers at scale.” arXiv preprint arXiv:2310.16764 (2023).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Generally, most hyperparameters are given, so one could attempt to reproduce the work without the source code, but I would guess it will be much easier with the promised code release. Therefore, I rate the reproducibility as high.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the concerns about the metrics, the lack of confidence intervals, and the general style of writing, I recommend rejection, as too much of the paper would need to change for it to be acceptable after the rebuttal. However, I generally think the authors are up to something great, and addressing these points could significantly strengthen the paper’s contribution to the field.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I stand by my previous vote. I think this paper requires better evaluations and experiments into why the method works. The method resembles malicious adversarial-attack training too closely, which is not addressed and leaves the risk of hallucinating content.



Review #2

  • Please describe the contribution of the paper

    The authors describe and evaluate a methodology to generate synthetic histopathology images using a generative self-supervised Vision transformer model. These synthetic images are shown to improve the generalization of another DL model to unseen domains in the context of histopathology.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. New methodology to generate synthetic images with unseen combinations of anatomy and image characteristics.
    2. Experimental study of the generated images
    3. Improved results after training a DenseNet model with additional synthetic data.

    Domain adaptation is an important topic and the paper is relevant.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors should address the following in order to strengthen the claims of the paper:

    1. Generation: The proposed method is somewhat similar to ContriMix, as it separates the anatomy features and characteristic features with the help of loss functions. The novelty seems to lie in allowing more diversity; however, the following questions should be answered:

    a. The approach requires extracting features from another sample $x_m$ in the same batch. If the training dataset does not contain some characteristic features that are present in the test dataset, it is not clear how the proposed approach will help solve the problem. For example, in the epithelium-stroma dataset, how could performance improve on the IHC dataset? Were images from the IHC dataset used for augmenting the training data?

    b. If possible, please compare your method with the techniques used in the paper “Contextual Vision Transformers for Robust Representation Learning”, which achieves the state of the art on the Camelyon dataset.

    c. The generated synthetic data does not look natural; gridlines are present. Is there a way to resolve this?

    d. Did you perform any experiments to decide on the size of the embeddings for the anatomy and characteristic features? In particular, why should their sizes be equal?

    2. The experimental section should be improved to help reproducibility.

    a. What is the number of synthetic images that were added to the training dataset in each of the two benchmarks considered?

    b. Can the authors provide more details to ensure that the comparison with other techniques is fair? Did they apply the data-augmentation approach to the other methods in both benchmarks? Was the same number of images augmented?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Not much information is provided regarding the code. The experimental section and comparison are weak. The experiments have not been repeated, and the variance is not reported. Details on the amount of augmentation performed in the two benchmarks are not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See the weaknesses listed above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has limited novelty and some important details are missing in both the methodology and experimental section.

    The authors are encouraged to answer the detailed questions so that the reviewer can make a more informed decision.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I reiterate that the authors have addressed a very important and challenging problem. The performance of their approach under domain shift is impressive and shows the potential of the technique.

    The authors’ responses to the questions are straightforward and honest. They have understood the limitations of their approach and provided justifications and potential solutions.

    Most of the response focused on clarifying the causes, consequences, solutions, and impact of the grid-artifact issue. I accept their viewpoint and recommend acceptance.

    That said, not all aspects of the approach are clear, and there are areas of improvement in the methodology, choice of hyperparameters, size of embeddings, etc.

    I am raising my rating from “Weak Accept” to “Accept”, as I consider the approach novel and likely to benefit the community.



Review #3

  • Please describe the contribution of the paper

    The authors propose to use a transformer encoder to map images to a latent space where anatomy and intensity-distribution features are clearly separated. New synthetic data can then be generated by mixing and matching the anatomy and intensity features across the available data. The generated data clearly improves performance over the existing baseline on unseen datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel method with convincing results
    • Good literature review
    • Extensive experiments with comparison to SOTA on two datasets, plus ablation studies
    • Very well written and easy to follow
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weakness, but I would like the authors to discuss:

    1. the grid artefact that they obtain on all images,
    2. the intensity of the generated samples, which only reproduces the average colour instead of the actual staining patterns.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Clearly explained and easy to follow. The method is easily reproducible, even without the code. The experiments are also reproducible, since they were conducted on public datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major

    1. The grid artefact should be discussed. Do the authors have an explanation for it? It looks like either an interpolation or a frequency problem. If not, maybe it could be solved by introducing some learnable layers in the image synthesiser?

    2. The anatomy of the samples is indeed preserved, but it seems that for the “characteristics”, the model learns the predominant colour of a domain (green, for example) instead of the actual staining pattern (green/pink). Some might even think that the authors could obtain better results by converting their images to grey-scale and applying a random background colour (no grid artefacts). This issue should be discussed.

    3. I think the lambdas should either be set by experimenting on the validation set or be studied in a sensitivity analysis. Neither is currently done.

    4. Could the authors report the training time for all the methods?

    5. I think the authors should make a connection with disentanglement methods (including in the literature review), which are strongly related to this paper. This could be another selling point that would appeal to an even larger part of the community.

    Minor:

    • I think it should be made clear that the $z_s^a$ in equation (2) depend on $m$. Also, shouldn’t it be $z_{m,q}^c$ in the second term on the line below equation (3)? Finally, shouldn’t $P_x[1:L/2]$ appear in superscript at the end of the expression for $Z_{s,q}^c$?
    • What exactly do the authors mean by “characteristics”? Do they mean the intensity distribution? I think this term should be defined more precisely.
    • The colours in Fig. 1 are a bit misleading, because the blue/red makes me think of positive/negative pairs in contrastive-learning figures, which is not at all the case here.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clear and well-motivated method. The disentanglement of anatomy and intensity features is elegantly achieved. Clearly a good fit for the next MICCAI!

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    I stand by my rating, especially since the authors have addressed my comments about adding a discussion on the grid artefacts.

    I would also like to emphasise that I respectfully disagree with Reviewer 4’s comments about the clinical interest of the images. This has never been set as a goal for this paper, which simply tries to increase the generalisability of downstream methods by extending the distribution of the training data. However, I agree with R4 that standard deviations and statistical tests are missing.

    Overall, I think this work opens new avenues in data generation, and although some aspects of the paper are improvable, it will foster interesting discussions at MICCAI.




Author Feedback

We sincerely thank all reviewers for their thorough evaluations and valuable feedback. We are encouraged by the positive assessment of our work’s organized structure, clarity, novelty (R3, R4, R5), relevance to the field (R3), and experimental evaluations (R4, R5). We appreciate R4 recognizing the supporting role of our figures and R5 commending our literature review. We have carefully considered all comments and suggestions and hope to satisfactorily address the major concerns below. Additionally, we hope that the release of our source code will help mitigate any reproducibility concerns. We deeply value R4’s positive feedback, recognizing our work as “up to something great” and noting its potential to “contribute to the field.” This affirmation strengthens our belief that MICCAI is an ideal venue to present and engage with the community on this line of research.

Grid artifacts in synthetic images: causes and solutions: We acknowledge the presence of grid artifacts in our synthetic images (R3, R4, R5). These artifacts arise from our image synthesizer (IS), which generates images by combining patch-wise anatomy features from one image with characteristic features from another via dot-product multiplication of their respective feature matrices. We chose the dot-product approach, inspired by the attention mechanism in Transformers, due to its speed and space efficiency compared to a learnable multi-layer perceptron (MLP). However, we agree with R5 that modifying the IS could mitigate the grid artifacts. Specifically, replacing it with a learnable MLP could help preserve the overall characteristics better and reduce the artifacts. This change would also remove the size constraint on the feature embeddings, correctly noted by R3, which is currently needed to facilitate the dot-product-wise image synthesis. We will add a brief comment on this matter to the manuscript.
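As a rough illustration of dot-product patch synthesis, consider the PyTorch sketch below. The attention-style softmax weighting, the tensor shapes, and the folding of tokens back into patches are simplifying assumptions for illustration (exploiting that a ViT-Base token dimension of 768 equals 16·16·3 patch pixels), not our exact implementation.

```python
import torch
import torch.nn.functional as F

def synthesize(z_anat, z_char, patch=16, ch=3):
    """Attention-style, dot-product patch synthesis (illustrative sketch).

    z_anat: (L, D) anatomy tokens of the content image x_s
    z_char: (L, D) characteristic tokens of the donor image x_m
    With D = patch * patch * ch (768 for 16x16 RGB patches), each output
    token folds directly back into an image patch.
    """
    d = z_anat.shape[-1]
    # Dot-product affinities between anatomy (query) and characteristic
    # (key) tokens, as in Transformer attention.
    attn = F.softmax(z_anat @ z_char.T / d ** 0.5, dim=-1)  # (L, L)
    tokens = attn @ z_char                                  # (L, D)
    # Fold the L output tokens back into a grid of image patches.
    side = int(tokens.shape[0] ** 0.5)
    img = tokens.reshape(side, side, patch, patch, ch)
    img = img.permute(4, 0, 2, 1, 3).reshape(ch, side * patch, side * patch)
    return img

# Example: a 14x14 grid of 16x16 RGB patches, i.e., a 224x224 image.
x_syn = synthesize(torch.randn(196, 768), torch.randn(196, 768))
print(x_syn.shape)  # torch.Size([3, 224, 224])
```

Each output patch here is a weighted mixture of characteristic tokens, with weights determined by the anatomy tokens; this per-patch independence is also what makes grid seams visible at patch borders.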

Consequences of grid artifacts for image quality and assessment: We appreciate R4 highlighting the theoretical concern regarding potential hallucination of content within the generated images. Although this concern is justified, we mitigate it through our proposed losses: the feature-consistency losses ensure the disentanglement of the two feature classes, while the self-reconstruction loss ensures the encoding of meaningful information for image reconstruction and generation. Thus, our approach preserves the original anatomy during image synthesis, as shown in Figure 4 for images from the training set (rows 1 and 2), the validation set (row 3), and the unseen test set (row 4). These results further emphasize our encoder’s promising generalization capabilities, as seen in the consistent image quality across these (unseen) samples during training (R4).
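A minimal sketch of how these terms might compose is given below; the function signature, the MSE choice, and the explicit lambda weights are illustrative assumptions (all lambdas are set to 1 in the paper), not our exact formulation.

```python
import torch.nn.functional as F

def total_loss(x_s, x_rec, z_anat_s, z_anat_syn, z_char_m, z_char_syn,
               lam_rec=1.0, lam_feat=1.0):
    """Illustrative composition of the losses described above.

    x_rec:      reconstruction of x_s from its own features
    z_anat_s:   anatomy features of x_s; z_anat_syn: anatomy features
                re-encoded from the synthetic image (should match z_anat_s)
    z_char_m:   characteristic features of the donor x_m; z_char_syn:
                characteristics re-encoded from the synthetic image
    """
    # Self-reconstruction: the features must retain enough information
    # to rebuild the original image.
    l_rec = F.mse_loss(x_rec, x_s)
    # Feature consistency: re-encoding the synthetic image should recover
    # the anatomy of x_s and the characteristics of x_m, enforcing
    # disentanglement of the two feature classes.
    l_feat = F.mse_loss(z_anat_syn, z_anat_s) + F.mse_loss(z_char_syn, z_char_m)
    return lam_rec * l_rec + lam_feat * l_feat
```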

Impact of grid artifacts on method deployment: Furthermore, while the grid artifacts impact the visual assessment of the generated images, as observed by R4, they do not diminish our method’s impact. Our method is only used for augmenting the training set, thereby enriching its diversity, and does not affect the test set. Thus, the latter is not subject to hallucinations, and downstream performance will not be affected, as shown by our state-of-the-art benchmark performance. Nonetheless, we acknowledge that deployment to clinical practice is a complex process that requires substantial effort and care, including security and regulation, which are far beyond the scope of a conference paper.

Performance consistency even under extreme domain shifts: We thank R3 for raising an important point on how our method can improve downstream performance even for extreme domain shifts without requiring information on the test set. While we don’t solve this issue entirely, we provide significant progress with initial evidence (cf. Sec 3.2) showing substantially increased classification robustness. This improvement stems from our method’s superior diversity, allowing for more invariant representations compared to reference works.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Most reviewers gave positive ratings to this work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers appreciated the novel design of generating synthetic histopathology images, which leads to improved model generalization. The authors should consider adding a discussion of the grid artifacts and clarifying the unclear descriptions raised by the reviewers to further improve the paper’s quality.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



