Abstract

Sharing medical datasets among healthcare organizations is essential for advancing AI-assisted disease diagnostics and enhancing patient care. Employing techniques like data de-identification and data synthesis in medical data sharing, however, comes with inherent drawbacks that may lead to privacy leakage. Therefore, there is a pressing need for mechanisms that can effectively conceal sensitive information, ensuring a secure environment for data sharing. Dataset Condensation (DC) emerges as a solution, creating a reduced-scale synthetic dataset from a larger original dataset while maintaining comparable training outcomes. This approach offers advantages in terms of privacy and communication efficiency in the context of medical data sharing. Despite these benefits, traditional condensation methods encounter challenges, particularly with high-resolution medical datasets. To address these challenges, we present MedSynth, a novel dataset condensation scheme designed to efficiently condense the knowledge within extensive medical datasets into a generative model. This facilitates the sharing of the generative model across hospitals without the need to disclose raw data. By combining an attention-based generator with a vision transformer (ViT), MedSynth creates a generative model capable of producing a concise set of representative synthetic medical images, encapsulating the features of the original dataset. This generative model can then be shared with hospitals to optimize various downstream model training tasks. Extensive experimental results across medical datasets demonstrate that MedSynth outperforms state-of-the-art methods. Moreover, MedSynth successfully defends against state-of-the-art Membership Inference Attacks (MIA), highlighting its significant potential in preserving the privacy of medical data.
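As a concrete illustration of the sharing workflow the abstract describes, the minimal PyTorch sketch below shows how a receiving hospital could use a shared generator: sample a condensed synthetic set locally and train on it, so no raw data leaves the source site. The `Generator` architecture, class count, and image size here are illustrative assumptions, not the authors' model.

```python
# Minimal sketch of the data-sharing workflow (illustrative, not the paper's code).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Stand-in conditional generator: (noise, label) -> image."""
    def __init__(self, z_dim=128, n_classes=2, img_ch=3):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)
        self.net = nn.Sequential(nn.Linear(z_dim, img_ch * 64 * 64), nn.Tanh())
        self.img_ch = img_ch

    def forward(self, z, y):
        h = self.net(z * self.embed(y))   # label-conditioned synthesis
        return h.view(-1, self.img_ch, 64, 64)

# A receiving hospital samples a small synthetic set from the shared generator
# and trains its downstream model on it -- no raw patient data moves.
gen = Generator()                          # weights would arrive pre-trained
gen.eval()
with torch.no_grad():
    z = torch.randn(32, 128)
    y = torch.randint(0, 2, (32,))
    synthetic_images = gen(z, y)           # condensed surrogate training set
print(synthetic_images.shape)              # torch.Size([32, 3, 64, 64])
```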

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2872_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Kan_MedSynth_MICCAI2024,
        author = { Kanagavelu, Renuga and Walia, Madhav and Wang, Yuan and Fu, Huazhu and Wei, Qingsong and Liu, Yong and Goh, Rick Siow Mong},
        title = { { MedSynth: Leveraging Generative Model for Healthcare Data Sharing } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces MedSynth, a dataset distillation (condensation) method that uses a generative model to facilitate secure and efficient medical data sharing. It combines an attention-based generator with a Vision Transformer to learn representations of extensive medical datasets, which can then be shared without exposing sensitive raw data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The application of a generative-based approach to distill medical datasets represents a new methodology in the field of medical dataset distillation.

    2. The utilization of residual attention blocks in the generative adversarial network, combined with a fine-tuned Vision Transformer for logit matching, proves effective in extracting and condensing knowledge from medical datasets (see the sketch after this list).

    3. The authors conduct an analysis using Membership Inference Attacks to assess the resilience of MedSynth, evaluating its ability to protect against potential privacy breaches.
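    To make the second strength concrete, the following is a minimal sketch of a residual attention block of the kind referenced above: a learned spatial mask modulates the trunk features while a skip connection preserves the identity path. The layer choices are an illustrative reconstruction, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Trunk branch: ordinary feature transform.
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        # Mask branch: a per-pixel attention map in [0, 1].
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        t = self.trunk(x)
        m = self.mask(x)
        # Residual attention: (1 + mask) * trunk keeps the identity path alive.
        return torch.relu(x + (1 + m) * t)

x = torch.randn(4, 64, 32, 32)
print(ResidualAttentionBlock(64)(x).shape)  # torch.Size([4, 64, 32, 32])
```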

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. A lot of important works related to medical dataset distillation are missed, although these works do not use generative-based distillation. The authors should introduce these methods instead of ignoring them. For example: 1) Soft-Label Anonymous Gastric X-ray Image Distillation. 2) Compressed Gastric Image Generation Based on Soft-Label Dataset Distillation for Medical Data Sharing. 3) Dataset Distillation for Medical Dataset Sharing. 4) Communication-Efficient Federated Skin Lesion Classification with Generalizable Dataset Distillation.

    2. The authors should also compare the proposed method with SOTA dataset distillation methods based on gradient matching or feature matching, not only generative-based methods.

    3. The ablation study of the two newly introduced modules is missing.

    4. I noticed the proposed method is run on 8 Nvidia V100 GPUs. What is the training time, and is the method feasible for the limited computing resources available in hospitals?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Introduce all methods related to the medical dataset distillation field.

    2. Compare the proposed method with different types of dataset distillation methods.

    3. Conduct the ablation study for the two newly introduced modules.

    4. Discuss the training time of the proposed method and the computing resources available in hospitals.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is based on the paper’s insufficient literature review and lack of comprehensive experimental validation. It overlooks significant existing research in medical dataset distillation, particularly non-generative approaches, and fails to provide detailed comparative and ablation studies.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have successfully addressed my concerns, and I am now inclined to accept this paper.



Review #2

  • Please describe the contribution of the paper

    The paper introduces MedSynth, an innovative dataset condensation approach for medical data sharing. By condensing large-scale medical datasets into generative models, MedSynth enables secure sharing among hospitals without compromising the privacy of original data. The approach combines an attention-based generator with a Vision Transformer (ViT) for feature matching, enhancing the ability to extract fine-grained information from medical images. Extensive experiments on medical datasets demonstrate that MedSynth outperforms state-of-the-art methods. Additionally, the paper evaluates the resilience of MedSynth’s generated models against membership inference attacks, highlighting its effectiveness in preserving medical data privacy. The paper provides a comprehensive discussion on datasets, implementation details, comparison with the latest techniques, generalization ability comparisons, and membership inference attack analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting topic
    • Clear description of the method
    • Well-structured and easy-to-follow
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of detailed explanation
    • More experiments are needed
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I have the following questions.

    • The robustness of the method: The authors have conducted experiments on fixed Generator and ViT architectures, training them with specified hyperparameters. My concern is: Is it easy to train them successfully, and in what scenarios might they fail?
    • Typically, the community employs the most representative Membership Inference Attack (MIA) methods [1][2] to evaluate a model’s privacy level. However, Section 4.5 does not specify which method the authors adopted and how they conducted the experiment. I suggest the authors elaborate on this point.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I appreciate this paper's idea of using synthesized images in place of original images, which can protect privacy.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After carefully reading the rebuttal and the other reviews, my final decision is weak accept.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a private data sharing framework based on distilling the knowledge of a private dataset in a generative model (conditional GAN) to allow sharing between centers. The proposed generative model is trained in two consecutive stages: In the first stage, a conditional GAN setup is adopted with attention layers in the generator producing an attention map that is concatenated with the generated image before passing the result to the discriminator. In the second stage, the images generated by the stage-1-trained GAN are passed to a LoRA-finetuned ViT to compare their features (ViT latent representations) with the ones from images from the original real dataset. The authors evaluate their method on downstream classification tasks on the public ISIC and Alzheimer datasets, corroborated for different model architectures (DenseNet, ConvNet, ResNet18) apart from ViT, and show how a membership inference attack on their model is less accurate than random membership guessing.
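    As a rough illustration of the stage-2 step summarized above, the sketch below matches representations of generated and real images through a fixed extractor standing in for the LoRA-finetuned ViT. The MSE matching loss and the stand-in extractor are assumptions for illustration; the paper itself reports logit matching with ViT-Base/16.

```python
import torch
import torch.nn as nn

def matching_loss(extractor: nn.Module, real: torch.Tensor, fake: torch.Tensor):
    """Distance between representations of real and generated batches."""
    with torch.no_grad():
        target = extractor(real)          # representations of the original data
    return nn.functional.mse_loss(extractor(fake), target)

# Stand-in for the LoRA-finetuned ViT; correct shapes only, not a transformer.
vit = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
real = torch.randn(8, 3, 64, 64)
fake = torch.randn(8, 3, 64, 64, requires_grad=True)  # would come from the GAN
loss = matching_loss(vit, real, fake)
loss.backward()                           # gradients flow back to the generator side
```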

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes and applies several innovative concepts such as two-stage GAN training, attention-mask concatenation with generated images, and LoRA-based fine-tuning of ViT for feature extraction (see the LoRA sketch after this list).

    • Interesting and clinically relevant synthetic data utility evaluation on downstream tasks for two different domains and modalities (dermatology and brain MRI) on public datasets.

    • A membership inference experiment is conducted to assess the claim that privacy-preserving data sharing via generator sharing across centers is possible.

    • Comparisons with related methods from the literature were implemented and reported for the two downstream classification tasks.
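    Since LoRA-based fine-tuning is highlighted above, here is a minimal, self-contained sketch of a LoRA adapter on a linear layer (the standard low-rank update W + (alpha/r)·BA). How exactly the paper wires this into ViT-Base/16 is not specified here, so treat this as a generic illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        # Low-rank factors: only A and B are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))       # e.g. a ViT attention projection
print(layer(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```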

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Missing ablation of the impact of the different components of the proposed method. The paper also proposes a two-stage training approach even though both feature matching and conditional GAN training could have been combined into a single stage, e.g. as done in other GAN architectures such as Pix2PixHD (Wang et al., 2018). It would have been important to show whether the proposed two-stage approach leads to better downstream performance and image quality compared to single-stage counterparts.

    • Quantitative image quality evaluation (e.g. FID) has not been reported

    • For some aspects of the experiments, the descriptions are kept quite short, leaving some information missing or vague.

    • No standard deviations (nor p-values) are reported. Results should ideally have been shown for multiple seeds.

    • The MIA evaluation is a nice idea, but the results would need a more detailed description, especially as the accuracy of the MIA against the generator is lower than random guessing even though the attacker was trained in the process. The training would need to be described in more detail, as would the images used to assess membership (e.g. which images were used as non-members?). It would have been very important to show why the generator, which learns the training data distribution, does not leak training data information, and whether and why the method in this work defends better against MIA than other methods from the literature (e.g. cGAN, DiM).

    Minor weaknesses:

    • Claim 1 asserts “secure sharing across hospitals without disclosing raw data and ensuring privacy protection”. This claim seems too strong given the experimental results, as only membership inference attacks were tested. Further tests (e.g. reconstruction attacks, property inference, model extraction attacks) would be needed to empirically claim privacy protection of the method for the specific dataset.
    • Claim 3 (Introduction) asserts better performance than the state of the art based on classification performance. However, if the goal is encapsulating data in a generative model for inter-centre sharing, this claim should be verified by showing that the method offers the best tradeoff between privacy preservation, data sampling quality, and downstream task performance (the latter ideally including a domain-shift scenario to simulate different centers).
    • The value of the weight a in Equation 1 does not seem to be provided. Further information on this value would be necessary to assess the importance of the attention map concatenation (ideally alongside ablations).
    • In 4.3, the authors state that “for the other methods, a ResNet 18 architecture is used, whereas our work utilizes ViT-base16 with LoRA”. It is not entirely clear whether this statement refers to the classification model or the GAN. If the former, the comparison would not allow conclusions regarding the synthetic data (as ViT might be a stronger classifier).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper uses public datasets, allowing a similar setup to be recreated. However, as the paper does not state that it will share the source code, reproducibility of the exact results of this work will be limited.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The sentence “This generative model can be shared securely, allowing hospitals to enhance their resources without compromising privacy” seems too strong a claim. To date, the only method providing a proven privacy guarantee that may justify such a claim is differential privacy. In fact, a comparison with models trained under differential privacy is recommended for further iterations of this work.

    • The authors refer to an inherent inductive bias of vision transformers towards global features at several places in the text. However, I was under the impression that transformers do not have an inductive bias towards global features; rather, it is convnets that have an inductive bias towards local features. In theory, transformers should have no bias favoring global over local semantics (or vice versa), as the spatial distance between tokens (i.e. input patches) is learnable and arbitrary.

    • Consider adding a reference for each method in Table 1 (e.g. it was not entirely clear what DM stands for).

    • The reported AUCs seem very high for Alzheimer Dataset, which makes it important to share code and concrete dataset splits to allow for verification.

    • For ablations such as in Figure 2, it would be better to show them for all of the datasets. Here, ISIC seems to be missing, which leaves open the question of whether the results of this figure would generalize.

    • Ideally, the evaluation should show how well the synthetic vs original dataset (both sourced from center A) perform on a dataset with common domain-shifts in center B.

    • It was not entirely clear how the method could be compared to DCGAN, as vanilla DCGAN does not, in principle, produce images conditioned on a class.

    • Typo in title: it should read either “Leveraging Generative Models” or “Leveraging a Generative Model”.

    • In equation 5, it seems L_g should be capitalized to L_G

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides an innovative architectural setup, addresses a relevant research problem, and validates on two clinically relevant classification tasks and datasets. There are some limitations and weaknesses outlined in the sections above, such as the missing quantitative synthetic-data evaluation and the missing ablations (important both because the proposed two-step framework could have been trained in a single step and because the setup contains several components with unknown individual impact on the results). In sum, it was difficult to decide between Weak Accept and Weak Reject; the promising comparison with state-of-the-art methods was crucial for leaning towards acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors provided several clarifications with additional insights. However, given the remaining limitations described in the comments (e.g., tests with other privacy attacks, missing ablations, quantitative and qualitative synthetic-data evaluation), the initial recommendation is maintained. As stated, the authors are recommended to revise all claims in this paper to restrict them to the empirically validated findings. For instance, “secure sharing across hospitals” and “ensure privacy protection” are too strong as claims, since the synthetic data (generator) can still leak private patient data after sharing: one might, for example, be able to confidently determine whether a real patient image was used during generator training if some synthetic images closely resemble that image (or particular features in it). Apart from the comparison with a single-stage training approach, it would also have been important for the authors to include the results they mention for the models they tested, i.e., “GANs (cGAN, WGAN, StyleGAN) with different settings (latent space dimension, loss, training parameters and regularization)”, and, for the fine-tuning module, the “ConvNet, ResNet, DenseNet” results. The provided clarification of the MIA attack method should be included in the final version.




Author Feedback

We thank the reviewers for their useful comments and address their concerns below.

Robustness of the Method [R1] – We clarify that we pre-train our GAN. Pre-training provides a stable initialization, reducing sensitivity to hyperparameters and ensuring reliable training across many architectures and datasets. To mitigate mode collapse and lower the likelihood of GAN failure, we use the Wasserstein loss with gradient penalty, as indicated in Equation (2) of the paper, which promotes stable training.
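For reference, the gradient-penalty term mentioned here is commonly implemented as in WGAN-GP (Gulrajani et al., 2017); the sketch below shows that standard form, not the paper's exact Equation (2), with the penalty coefficient left to the caller.

```python
# Standard WGAN-GP gradient penalty -- a reference implementation of the loss
# family the rebuttal cites, not the paper's code.
import torch

def gradient_penalty(critic, real, fake):
    """Penalize deviation of ||grad critic(x_hat)|| from 1 on interpolates."""
    eps = torch.rand(real.size(0), 1, 1, 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True)
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Toy usage: critic loss = fake_score.mean() - real_score.mean() + lambda_gp * gp
# (lambda_gp is commonly 10).
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
gp = gradient_penalty(critic, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
```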

Membership Inference Attack (MIA) [R1 and R4] – We generate the white-box and black-box attacks based on the threat model proposed in [1]. For both, the attacker training set consists of a random 10% of the original dataset (ISIC/Alzheimer), with synthetic samples as non-members. In the white-box attack, knowing the target GAN architecture, the attacker feeds the training set to the target GAN's discriminator, extracts and sorts the prediction probabilities, and uses the highest probabilities to predict training-set members. In the black-box attack, without knowing the target GAN architecture, the attacker first trains a local GAN on samples from the target GAN and then carries out the steps of a white-box attack. The attack's accuracy is the percentage of training-set images correctly identified. Attack accuracy below random guessing indicates increased model security against MIA. Our method defends against MIA more effectively because the generative model captures information from a condensed dataset, reducing the attack surface for inference attacks and preventing leakage of training data.
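The white-box attack described above can be sketched as follows, following the LOGAN-style recipe of scoring candidates with the target discriminator and flagging the top scores as members. The discriminator stand-in, candidate handling, and member count are illustrative assumptions.

```python
import torch

def white_box_mia(discriminator, candidates, n_members):
    """Indices of the n_members candidates the discriminator rates most 'real'."""
    with torch.no_grad():
        scores = discriminator(candidates).squeeze(-1)
    return torch.topk(scores, k=n_members).indices

disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
candidates = torch.randn(100, 3, 64, 64)   # mix of members and non-members
predicted_members = white_box_mia(disc, candidates, n_members=50)
# Attack accuracy = fraction of predictions that are true members; accuracy at
# or below random guessing suggests the generator leaks little membership signal.
```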

Missing References [R3] – We appreciate the suggested references related to medical dataset distillation and will cite and discuss them.

Comparison to Non-Generative Methods [R3] – We clarify that Table 1 already shows our method achieving a significant performance improvement over a non-generative feature matching method, DM (Distribution Matching), on the ISIC 2019 and Alzheimer datasets. Although we did not present the results, our study shows that our method also performs significantly better than the gradient matching method. As suggested, we will compare our method with other non-generative methods.

Training Time [R3] – The system we used has eight Nvidia V100 GPUs installed, but we used only four. Training takes about 30 hours on 4 V100 GPUs for ISIC 2019 (about 20k training images).

Ablation study for the two newly introduced modules [R3, R4] – For the first module (generator pre-training), we evaluated several GANs (cGAN, WGAN, StyleGAN) with different settings (latent-space dimension, loss, training parameters, and regularization) and chose the attention-based GAN, as we observed that it outperforms the other GANs. For the second module (fine-tuning the generator), we also evaluated network architectures other than ViT (ConvNet, ResNet, DenseNet) for feature extraction, as well as different matching strategies (feature, gradient, and logit matching), and presented ViT with logit matching as it obtained the best performance on medical datasets.

Two-stage Approach [R4] – A single-stage training approach with random initialization of the GAN may lead to the vanishing gradient problem and higher training time due to hyper-parameter tuning. To overcome these issues, we used a two-stage approach. Pre-training the GAN on a related medical dataset provides a good initialization point and improves downstream task performance.

Generalization Ability [R4] – As suggested, we will evaluate the generalization ability on the ISIC dataset.

ViT [R4] – Thank you for pointing this out. We clarify that ViT's self-attention mechanism enables the model to capture global interactions among all components of the input image by considering relationships between all positions at once.

[1] Hayes et al. LOGAN: Membership Inference Attacks Against Generative Models. Privacy Enhancing Technologies Symposium (PETS), 2019.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The concept of the paper is innovative and the topic relevant and interesting to the medical imaging community. The rebuttal clarified most of the reviewers’ points. For the camera ready, I would recommend adjusting the claims of the paper regarding the contributions and clearly state what is being done without overgeneralizing. Consider succinctly extending the explanations of the experimental setup to ensure reproducibility and avoid vagueness.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Authors fail to explain why their work is needed. They mention in the Abstract that existing methods “encounter challenges”, which is then repeated in the Introduction as “encounter difficulties as they directly extract information from the original dataset into pixel space, and the feature distribution of condensed samples frequently lacks diversity”, without providing any references. References 11-15 do not mention lack of diversity as a limitation of their methods.



