Abstract

Histopathology can help clinicians make accurate diagnoses, determine disease prognosis, and plan appropriate treatment strategies. As deep learning techniques have proven successful in the medical domain, the primary challenges become limited data availability and concerns about data sharing and privacy. Federated learning addresses these challenges by training models locally and updating parameters on a server. However, issues such as domain shift and bias persist and impact overall performance. Dataset distillation presents an alternative approach to overcoming these challenges. It involves creating a small synthetic dataset that encapsulates essential information, which can be shared without constraints. At present, this paradigm is not practicable, as current distillation approaches only generate non-human-readable representations and exhibit insufficient performance for downstream learning tasks. We train a latent diffusion model and construct a new distilled synthetic dataset with a small number of human-readable synthetic images. Selection of maximally informative synthetic images is done via graph community analysis of the representation space. We compare downstream classification models trained on our synthetic distillation data to models trained on real data and reach performance suitable for practical application.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0484_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0484_supp.pdf

Link to the Code Repository

https://github.com/ZheLi2020/InfoDist

Link to the Dataset(s)

https://medmnist.com/

BibTex

@InProceedings{Li_Image_MICCAI2024,
        author = { Li, Zhe and Kainz, Bernhard},
        title = { { Image Distillation for Safe Data Sharing in Histopathology } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The current paper describes a new method to generate synthetic and realistic histopathology images that are highly representative of the real datasets. For this, the authors train a latent diffusion model and use a distillation method on the embeddings of the new synthetic images, based on graph analysis, to choose the maximally informative ones.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Image generation is an incredibly important field in biomedical imaging due to the urgent need to train data-driven approaches and the lack of available data for it. Additionally, in the field of medical image analysis there are two major and specific limitations: data sharing and privacy (as highlighted by the authors) and the lack of effective domain normalisation methods.

    • Beyond the main motivation and impact of the current work, the authors use novel techniques, such as diffusion models and unsupervised analysis of latent spaces, to extract common patterns embedded in the dataset and identify relevant images.

    • They adapt an unsupervised clustering method, called community detection, to prove their hypothesis that essential information can be encoded and identified by relating the embeddings of different images in a learnt latent space.

    • The entire data distillation is benchmarked in an image classification scenario, showing the capacity to achieve results comparable to those obtained with the real dataset.

    • The proposed method, besides yielding highly representative synthetic datasets, serves as a way to optimally reduce large training datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the approach is novel, I would recommend that the authors rework Figure 1 and the method description to make the workflow clear. The entire pipeline follows a complex methodology, and as it stands, it is not straightforward to understand. See detailed recommendations below.

    There are some concepts in the text that could invite discussion and may be worth clarifying:

    • Maximally informative images: this is subjective, depending on the type of information. Is it for a specific data class or pattern? What if the datasets are not annotated, or the augmentation is needed for a subsequent cell segmentation? I would make this point clear in the text.
    • Diffusion models learn to encapsulate essential information from images, and the main claim is that, from this, it is not possible to retrieve personal information or data protected by privacy. Is this completely true?

    • While it is not required, I would strongly encourage the authors to make their code available.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Infomap algorithm is referenced without a citation or explanation. Likewise, it is meant to be used in the representation space, but it is not clear how this space is obtained.
    • High modular centrality is mentioned in the text but it is never defined nor explained how the authors measure the modular centrality and choose which one is considered high. Indeed, it is mentioned in the pseudo-code but not defined.
    • Page 3: The paragraph that starts with “Xg can be further embedded…” is unclear. What does it mean for the size of Xg to be large, and when does this occur?
    • Figure 1: Section a takes too much space for what it contains. Instead, the authors could use this space to extend the explanation of how the graphs are built from the embeddings. Also, how are bn and bp obtained, and what is the role of the synthetic data in their definition? Where do the embeddings come from? How do the 100 images of each class relate to the graph, and how exactly are the images chosen?
    • The community identification algorithm assumes that the graph represents the stopping points of a random walker, but it is not clear how this graph is built from the embeddings or how the random walk route relates to the image embeddings.
    • “Resolution” term is used to refer to the size of the images in pixels. Please, correct this in the text and Figure 2. Resolution refers to the original pixel size of the image and the level of detail that the diffusion model is able to provide.
    • In Figure 2, besides the results, it would be interesting to see the images resulting in False positives and False negatives in the classification w.r.t. the ground truth.
    • From the text it is not clear that the classes that will be used for the image classification benchmark are actually needed to run the distillation method. I think this should be made clear both in the abstract and the main text.
    • Table 1: why are some metrics missing? Also, there are two sections called “distilled”, but the first one refers to distilled real data. I would specify this clearly.

    • Please consider improving the following sentence in the abstract: “As deep learning techniques prove successful in the medical domain, the primary challenges become limited data availability…”

    • Page 2: Nevertheless, the challenges of data scarcity and data sharing ARE…
    • Page 2: “We would expect that approaches for dataset distillation can achieve…” Do you mean “approaches using dataset distillation”?
    • Page 3: “by UMAP which also constructS…”
    • Page 7: “This can be applied when either a set of real patient…” -> “This can be applied when a set of real patient…”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the current proposal is creative and proposes some innovations and good ideas. Yet, as it is now, it is quite complicated to fully understand the methodology or replicate the pipeline.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a dataset distillation method to create a small but representative synthetic dataset using diffusion models and graph community analysis. The method is evaluated on PathMNIST and shows superior performance to state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a novel way of producing synthetic data. The methodology is clearly described. Experiments are well designed. Results demonstrated the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some minor weaknesses: In the third paragraph of the Introduction, the authors point out the concern about data privacy when “models are trained on real data”. But the model (UViT) in the proposed method is also trained on real data to produce synthetic data. Even though the classification model does not touch the real data, it touches synthetic data originating from real data. I did not find code associated with the work. In addition, there are many hyperparameters, such as the percentage ρ of positive samples in contrastive learning, the threshold η to remove links with low weights, and the ratio of L_b to L_ce, which is hard-coded as 1:1.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    A sensitivity analysis of the hyperparameters mentioned in the weaknesses is missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Why is the data selection necessary and why not directly use the full set of synthetic data? Please justify the selection of the most informative subset.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of using graph community analysis for dataset distillation is novel.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors have highlighted an important and practical issue of histopathology data unavailability for machine learning-based analysis. As a solution, a federated learning setup has been proposed, which is a logical choice but comes with its own challenge of privacy issues. To deal with both issues simultaneously, they have proposed a data distillation approach that can generate a synthetic dataset with minimal loss of information from the original dataset. They have presented a combination of a latent diffusion model paired with graph community analysis for image selection, achieving human-readable synthetic images that maintain utility for downstream classification tasks. This proposed method offers a promising alternative to federated learning, particularly in overcoming domain shifts and biases while ensuring data privacy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The data distillation method has mainly been used for model compression and augmentation tasks, but it has not yet been employed as a privacy measure for a federated learning setup for histopathological images, as proposed by the authors. Similarly, diffusion models for generative histopathology have also been studied recently, but not with the aim of ensuring privacy while maintaining the quality of synthetic images, as proposed by the authors. Additionally, the authors are the first to propose graph community analysis to select the most informative images with the best possible representation of the synthetic dataset, ultimately enabling downstream tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed approach is computationally intensive, particularly during training of the latent diffusion model and assessment of community detection, which could limit its practical application. The lack of analysis on computational resources required is a significant oversight, especially considering its intended use in a federated setting across multiple establishments. Also, it’s essential to consider potential challenges related to the scalability and generalizability of the proposed method. As the application of federated learning in healthcare settings continues to expand, ensuring scalability across diverse datasets and institutions is crucial. Therefore, addressing scalability concerns, such as the performance of the proposed approach with larger and more heterogeneous datasets, would enhance the paper’s contribution. Additionally, providing insights into the interpretability and explainability of the synthetic images generated by the proposed approach could enhance its utility and acceptance in real-world clinical settings.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Consider providing detailed computational resource analyses to facilitate implementation across various establishments in a federated setting. Additionally, the authors should comment on the generalizability of the method for other types of medical imaging. Exploring techniques to reduce computational complexity without compromising performance could also enhance the feasibility of the proposed approach. Moreover, investigating methods for optimizing the computational workload during both training and inference phases would be beneficial for real-world deployment, particularly in resource-constrained environments commonly found in healthcare settings.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper introduces a novel and potentially impactful approach to addressing data unavailability and privacy concerns in histopathology, it lacks computational resource analyses and an assessment of generalizability. Addressing these concerns would significantly strengthen the paper’s contribution.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their constructive feedback and positive assessment! We will publish our code with the camera-ready version.

Maximally informative images (R1). We generate graphs for each class separately, then select the same number of images from each graph. This process is explained in the text and in Algorithm 1. We train a class-conditional diffusion model and generate synthetic images with labels. We do not generate pixel-wise labels for segmentation.

Privacy protection (R1, R3). The diffusion model is trained on real data and learns the data distribution. It can be trained locally once, without being shared publicly. We can then generate synthetic images; these differ from any real image, even though they contain some realistic features. Therefore, the synthetic images contain no personal information.

Infomap algorithm (R1). The Infomap algorithm is detailed in [3]. It is a community detection algorithm capable of identifying communities in a graph. The representation space refers to the space into which images are projected as features, obtained either from the output of the penultimate layer of a classifier or through UMAP. In the graph, nodes represent image features, and edge weights correspond to the Euclidean distance between nodes. After constructing the graph, the Infomap algorithm is executed to generate communities (or clusters). Subsequently, we uniformly select images with high modular centrality from each community.
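The graph-based selection described above can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' implementation: connected components of a distance-thresholded graph stand in for Infomap communities, closeness to the community centroid is a crude proxy for modular centrality, and the threshold `eta`, the toy features, and all function names are assumptions of this sketch.

```python
import numpy as np

def build_graph(features, eta=1.0):
    """Build a weighted adjacency matrix from image features.

    Edges carry Euclidean distances; links longer than the threshold
    `eta` are removed (mirroring the paper's pruning of weak links).
    Note: zero-distance duplicates are treated as unlinked here.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    adj = np.where(dist <= eta, dist, 0.0)
    np.fill_diagonal(adj, 0.0)
    return adj

def connected_components(adj):
    """Label connected components -- a simplistic stand-in for Infomap
    communities, used only to keep this sketch dependency-free."""
    n = len(adj)
    labels = -np.ones(n, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] >= 0:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:          # iterative depth-first search
            u = stack.pop()
            for v in np.nonzero(adj[u] > 0)[0]:
                if labels[v] < 0:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def select_per_community(features, labels, k=1):
    """Pick the k most 'central' images per community; closeness to the
    community centroid is a crude proxy for modular centrality."""
    selected = []
    for c in np.unique(labels):
        idx = np.nonzero(labels == c)[0]
        centroid = features[idx].mean(0)
        d = np.linalg.norm(features[idx] - centroid, axis=1)
        selected.extend(idx[np.argsort(d)[:k]].tolist())
    return selected
```

In the paper's setting this would be run once per class on synthetic-image embeddings, with Infomap (and its modular-centrality ranking) replacing the simplistic components and centroid proxy used here.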

Modular centrality (R1). The scalar score of modular centrality combines two scores to quantify both the intra-community and inter-community influence.
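As a concrete illustration of such a combined score, one could weight a node's intra-community strength against its inter-community strength. The equal weighting `alpha=0.5` and the use of weighted degree are assumptions of this sketch, not necessarily the exact definition the paper cites:

```python
import numpy as np

def modular_centrality(adj, labels, alpha=0.5):
    """Illustrative scalar modular-centrality score.

    Combines a node's intra-community strength (influence inside its
    own community) with its inter-community strength (links to other
    communities), assuming `adj` is a weighted adjacency matrix and
    `labels` holds each node's community id.
    """
    same = labels[:, None] == labels[None, :]
    intra = (adj * same).sum(axis=1)   # strength within own community
    inter = (adj * ~same).sum(axis=1)  # strength to other communities
    return alpha * intra + (1 - alpha) * inter
```

Images whose nodes score highest under such a measure would then be the ones kept from each community.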

The size of X_g (R1, R3). Our goal is to capture the input data distribution as comprehensively as possible. To achieve this, we generate a large number of synthetic images and then select a small subset of representative images.

Results table (R1). Some results are missing because the corresponding papers do not provide the results for this metric. We will differentiate between the two “distillation” methods in Table 1.

Hard-coded hyperparameters (R3). We conducted an ablation study on these hyperparameters and selected the best combination to report the results. We opted not to include the table in the paper because the evidence was redundant.

Computation cost (R4). We will expand on the computation cost in the final paper.




Meta-Review

Meta-review not available, early accepted paper.


