Abstract

Conditional video diffusion models (CDMs) have shown promising results for video synthesis, potentially enabling the generation of realistic echocardiograms to address the problem of data scarcity. However, current CDMs require a paired dataset of segmentation maps and echocardiograms. We present a new method called Free-Echo for generating realistic echocardiograms from a single end-diastolic segmentation map without additional training data. Our method is based on the 3D-Unet with Temporal Attention Layers model and is conditioned on the segmentation map using a training-free conditioning method based on SDEdit. We evaluate our model on two public echocardiogram datasets, CAMUS and EchoNet-Dynamic. We show that our model can generate plausible echocardiograms that are spatially aligned with the input segmentation map, achieving performance comparable to training-based CDMs. Our work opens up new possibilities for generating echocardiograms from a single segmentation map, which can be used for data augmentation, domain adaptation, and other applications in medical imaging. Our code is available at https://github.com/gungui98/echo-free

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1171_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1171_supp.zip

Link to the Code Repository

https://github.com/gungui98/echo-free

Link to the Dataset(s)

https://echonet.github.io/dynamic/

https://www.creatis.insa-lyon.fr/Challenge/camus/

BibTex

@InProceedings{Ngu_TrainingFree_MICCAI2024,
        author = { Nguyen, Van Phi and Luong Ha, Tri Nhan and Pham, Huy Hieu and Tran, Quoc Long},
        title = { { Training-Free Condition Video Diffusion Models for single frame Spatial-Semantic Echocardiogram Synthesis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes a conditional diffusion model-based method for echocardiogram synthesis. The authors improve on the previous work SDEdit by starting the denoising process from a pseudo-video, which facilitates inference. They evaluate the effectiveness of the proposed method on two public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Echocardiogram synthesis is an attractive task, since it can significantly improve downstream tasks through data augmentation. This work develops a training-free method using cutting-edge diffusion models, which also saves the expensive cost of collecting paired data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation for introducing the optimal transport problem is not clear. Please clarify this further.
    2. As shown in Table 1, the proposed method is sensitive to the diffusion step. Is there any experiment that tests the sensitivity of SDEdit to the number of denoising steps, and whether reducing the number of steps can improve performance?
    3. Although the authors argue that the proposed method can be used for data augmentation and other applications, I think such generated videos may not be helpful for downstream tasks because the synthesized results are far from the ground truth.
    4. What does the pink part of Figure 1 indicate? It doesn’t seem important.
    5. More experiments on other modalities and tasks are highly desired to validate the effectiveness of the proposed method.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Please indicate whether the source code will be publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see weaknesses. The authors should indicate the difference between the proposed method and SDEdit.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is largely based on SDEdit. Although the authors state that they introduce an optimal transport problem, its motivation is not clear.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Considering the responses to all reviewers, I tend to rate this paper as weak reject. Significant improvement in novelty and motivation is desired. Clarification of the difference from SDEdit is required.



Review #2

  • Please describe the contribution of the paper

    The authors extend recent work on diffusion models to the domain of echo video synthesis. The approach involves a 3D-Unet denoiser and bespoke pseudo video generation for an initial state that accounts for both an initial mask of the first frame and a pixel distribution to mimic.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -The authors claim a “training-free” approach that generates quantitatively similar results that appear more believable than prior work in echo synthesis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - Re pseudo video generation, Eq3: it seems like the video you’re trying to generate, x_0^K, is an input to the initial V^K.
    - Is the competing CLS-free [1] or [7]? In either case, the quantitative results favor CLS-free, while Fig2 shows the proposed method’s results are far more believable.
    - Specifically noted issues with alternative approaches, like content leakage or paired datasets, are not directly addressed in discussion.
    - No downstream effects, like segmentation output or ejection fraction estimation, are presented.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    - “Based on SDEdit, our method start the reverse denoising process” - should cite SDEdit here, first use. Also, “starts” maybe?
    - “(DDPM) to generated synthetic echoechocardiograms” - to generate
    - “semantic label mapping of diastolic in” - diastolic what/frames maybe?
    - “DMs have been proposed to generate realistic echocardiograms [15, 14, 14, 17]” - 14 cited twice
    - “Different from cAMUS” - capital C
    - In one of the training cases, what if, rather than m_0 being fed into the model, you substituted any other mask in the dataset? Would any of the quantitative evaluations look all that different in the output, given that you’ve matched x_0’s distribution?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the method over the cited SDEdit is limited, the degree to which the method is “training-free” is doubtful, no downstream effects are covered, and the writing is unclear.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Reviewer appreciates the clarifying statements of the author in rebuttal. Review remains weak reject.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel training-free condition video diffusion model (CDM) specifically designed for echocardiogram synthesis. The core contribution of this model, termed “Free-Echo,” is its ability to generate realistic echocardiograms from a single end-diastolic segmentation map without requiring any additional training data. This is accomplished using a 3D-Unet architecture with Temporal Attention Layers and a conditioning method based on SDEdit. The model is evaluated on two public echocardiogram datasets, demonstrating its capability to produce spatially aligned echocardiograms comparable to those generated by traditional training-based CDMs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces a novel training-free conditioning method for video diffusion models in the domain of medical imaging, specifically for echocardiogram synthesis. This is significant as it eliminates the need for paired datasets of segmentation maps and echocardiograms, which are often hard to obtain in medical fields due to privacy concerns and data scarcity.

    • The method’s ability to generate plausible echocardiograms from a single end-diastolic segmentation map is particularly innovative. This simplifies the data requirements and enhances the applicability of the model in real-world clinical settings where complete video data may not always be available.

    • The model was rigorously tested on two public datasets, CAMUS and EchoNet-Dynamic, with metrics such as Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), Fréchet Inception Distance (FID), and Fréchet Video Distance (FVD). This comprehensive evaluation demonstrates the model’s capability to generate high-quality synthetic echocardiograms that are comparable to existing training-based models.

    • By facilitating the generation of realistic echocardiograms without additional data training requirements, this model can significantly impact data augmentation, domain adaptation, and other applications in medical imaging. This opens new avenues for research and development in echocardiography, particularly in environments with limited data availability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The performance of the model is highly sensitive to the diffusion step parameters. As noted in the results section, varying the diffusion step during the reverse process significantly impacts the quality of the generated echocardiograms, suggesting a potential limitation in the robustness and generalizability of the model.

    • The echocardiograms produced by the model are of low resolution and short duration. This might limit the clinical applicability of the synthetic echocardiograms, as higher resolution and longer duration might be necessary for accurate medical diagnosis and analysis.

    • While the paper compares its model to existing methods, it lacks a detailed benchmarking against the latest state-of-the-art models in echocardiogram synthesis. For instance, other diffusion model approaches such as those by Reynaud et al. (2023) and Stojanovski et al. (2023) could provide a more rigorous competitive analysis.

    • The paper does not include clinical validation of the synthetic echocardiograms by professional medical practitioners. Such validation is crucial to assess the practical utility and accuracy of the generated images in clinical scenarios.

    • The model is primarily tested on end-diastolic frames. Its adaptability to other cardiac phases or different cardiac conditions remains untested, which could be crucial for its broader application in cardiac imaging and diagnosis.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper does not mention if the code or the implementation details of the model are publicly available. Providing access to the source code would greatly aid in reproducing the results and facilitate further research and verification of the claims made in the paper.

    While the paper provides a general overview of the model architecture and the diffusion process, it lacks detailed specifications of some model parameters and configurations used during the experiments. Full disclosure of these parameters would help in accurately replicating the study.

    The paper mentions the use of hyperparameters like noise levels and the diffusion steps but does not provide a comprehensive list of all hyperparameters and their values. Detailed hyperparameter settings are crucial for reproducibility and to understand the model’s sensitivity to these parameters.

    The paper mentions the division of datasets into training, validation, and test sets but does not detail the criteria for this splitting or whether any form of stratification was used. This information is important to ensure that the model is evaluated in a consistent and fair manner.

    More details on how each performance metric (SSIM, PSNR, FID, FVD) is calculated could be provided. For instance, specifics on the computational tools or libraries used and any preprocessing steps involved would help in replicating the evaluation accurately.
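As an illustration of the level of detail being requested, the following is a minimal sketch of one of the listed metrics, PSNR, written with NumPy. It assumes intensities scaled to [0, 1] and computes a single value over the whole video array; the function name `psnr` and the `data_range` parameter are chosen here for illustration, and the paper's actual evaluation code and preprocessing may differ.

```python
import numpy as np

def psnr(reference, generated, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    Assumes both inputs share the same intensity scale, with `data_range`
    being the maximum possible intensity span (1.0 for [0, 1] images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0.0:
        # Identical inputs: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Whether PSNR is averaged per frame or computed once per video changes the reported number, which is one reason such details matter for replication.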

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While the paper introduces an innovative approach to generating echocardiograms using a training-free model, it would benefit from a more detailed description of the model architecture, particularly the 3D-Unet with Temporal Attention Layers. Clarifying how these components specifically contribute to the model’s performance could provide deeper insights into the method’s effectiveness.

    The sensitivity of the model to the diffusion steps is a critical aspect noted in the results. It would be beneficial to include a more detailed discussion on the selection of diffusion steps and their impact on the quality and clinical relevance of the synthesized echocardiograms. Additionally, explaining why certain steps were more effective could guide future improvements in the model.

    More information on how the datasets were split (randomly, stratified, etc.), and the rationale behind the chosen method would help in understanding the experimental design. This detail is crucial for ensuring the robustness and fairness of the model evaluation.

    Providing a complete list of hyperparameters used, including those for the Adam optimizer and any other relevant settings, would enhance the reproducibility of your results. It’s also recommended to discuss the rationale behind choosing these particular settings.

    The paper could benefit from a broader comparison with current state-of-the-art methods in echocardiogram synthesis. Including a detailed comparison, possibly in a tabular format, showing the model’s performance against others using the same datasets and metrics, would provide a clearer picture of where your model stands in relation to existing technologies.

    Integrating feedback from medical professionals regarding the usability and realism of the generated echocardiograms could significantly strengthen the paper. This could involve qualitative assessments or even quantitative measures of diagnostic accuracy when using the synthetic images.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a novel training-free video diffusion model for synthesizing echocardiograms, which is significant as it eliminates the need for paired datasets, addressing data scarcity in medical imaging. The method’s ability to generate high-quality echocardiograms from minimal data is convincingly demonstrated on two public datasets. The potential impact on medical research and clinical applications, combined with solid methodological execution, supports the recommendation to accept this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concerns.




Author Feedback

We would like to thank the reviewers for their valuable feedback and constructive comments. We appreciate the time and effort they have put into reviewing our work. We are happy to hear that all reviewers agree on the important contribution of our work in the field of echocardiogram synthesis and the potential impact of our method on medical imaging research. We will release the code upon acceptance of the paper to address the reproducibility concerns of the reviewers (#3, #4, and #5). Regarding the major concerns raised by the reviewers, we would like to address them as follows:

  • Reviewer #3 has concerns about the requirement of paired data: The model in our method is trained as an unconditional diffusion model. In equation (3), it is true that we aim to create a pseudo-image x_0^K; the initial volume V^K is obtained by duplicating x_0^K to form a pseudo-video. The pseudo-image x_0^K is created from a random frame from the training set and therefore, unlike the CLS-free method, requires no paired data.

  • Reviewer #3 has concerns about the comparison with the training-based method: The method we compare with is CLS-free from Ho et al. [7]. Figure 2 shows that the proposed method’s results have a quality comparable to training-based methods while still being training-free. However, this is not the case for the large-scale quantitative evaluation. In fact, we also include samples (such as CAMUS (2)) in the supplementary material showing that the proposed method’s results are not as well aligned as those of the CLS-free method.

  • About the downstream task benefit raised by Reviewers #3 and #5: We did not include downstream effects such as segmentation output or ejection fraction estimation because our method focuses on controlling the video spatially; there is no guarantee that the generated frames will have the same ejection fraction, and the motion is synthesized by the diffusion model. Furthermore, controlling the ejection fraction via counterfactual generation, as in Reynaud et al. [15], requires ground-truth ejection fraction, which violates the training-free principle that we aim to achieve. Our vision is for this work to be an initial step towards a more general training-free video synthesis method that can not only generate realistic echocardiograms but also output corresponding labels, such as segmentation or clinical measurements.

  • About the concern of Reviewer #5 on the motivation for optimal transport: We understand the reviewer’s concern about the novelty of our work and the use of optimal transport. The choice of optimal transport was motivated by recent advances in diffusion models. In general, training-free video synthesis is a special case of a broader problem, unpaired domain-to-domain translation. In our case, one domain is the segmentation map, and the other domain is the echocardiogram video. One promising approach is the Schrödinger Bridge, or entropy-regularized Optimal Transport [1], which aims to align the two probability paths from pure Gaussian noise to the data distributions of the two domains. However, the optimal transport problem in high-dimensional space is computationally intensive and requires two diffusion models to generate the source and target domain distributions [2]. Our approach can be seen as a relaxation of the general Schrödinger Bridge problem, where we only consider one translation, from the segmentation map to the echocardiogram video. Optimal transport allows the pseudo-image to have an intensity histogram similar to that of a sample from the training set and to resemble the overall anatomical structure of the heart chambers; the diffusion model then fills in the motion and fine-grained details in the remaining denoising steps.
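The pipeline described in these responses (histogram-matched pseudo-image, duplication into a pseudo-video, then SDEdit-style noising before denoising) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes the optimal transport step reduces to 1D intensity matching by rank (the exact solution for 1D transport with a convex cost), a linear DDPM beta schedule, and hypothetical function names (`match_histogram_1d_ot`, `make_pseudo_video`, `noise_to_step`). The paper's equation (3) may differ in detail.

```python
import numpy as np

def match_histogram_1d_ot(source, reference):
    """Give `source` the intensity distribution of `reference`.

    In 1D the optimal transport map is the monotone rearrangement:
    the i-th smallest source pixel receives the i-th reference quantile.
    """
    src_flat = np.asarray(source, dtype=np.float64).ravel()
    ref_flat = np.asarray(reference, dtype=np.float64).ravel()
    # Rank source pixels; a stable sort breaks ties by pixel position.
    order = np.argsort(src_flat, kind="stable")
    # Evaluate reference quantiles at as many points as there are source pixels.
    levels = (np.arange(src_flat.size) + 0.5) / src_flat.size
    ref_quantiles = np.quantile(ref_flat, levels)
    matched = np.empty_like(src_flat)
    matched[order] = ref_quantiles
    return matched.reshape(np.shape(source))

def make_pseudo_video(pseudo_image, num_frames):
    """Duplicate one pseudo-image along a new time axis: (T, H, W)."""
    return np.repeat(pseudo_image[None, ...], num_frames, axis=0)

def noise_to_step(video, k, num_steps=1000, rng=None):
    """Sample x_k ~ q(x_k | x_0) under an assumed linear beta schedule."""
    rng = np.random.default_rng() if rng is None else rng
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed schedule
    alpha_bar = np.cumprod(1.0 - betas)[k - 1]
    eps = rng.standard_normal(video.shape)
    return np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * eps
```

In this reading, a pretrained unconditional video denoiser would then run the reverse process from step k down to 0, starting from the output of `noise_to_step`. The choice of k trades off faithfulness to the mask against realism, which is consistent with the sensitivity to the diffusion step noted by the reviewers.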

[1] Léonard, C. “From the Schrödinger problem to the Monge-Kantorovich problem.” Journal of Functional Analysis, 2012.

[2] Su, X., Song, J., Meng, C., and Ermon, S. “Dual diffusion implicit bridges for image-to-image translation.” ICLR, 2023.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors did a really good job with the rebuttal, and I appreciate their efforts to address the reviewers’ concerns. However, I am sorry to say that the paper is just not quite ready for acceptance.

    Novelty Concerns: The method’s novelty is limited and heavily based on existing work (SDEdit) with minor modifications. The claim of being “training-free” is questionable and not convincingly demonstrated.

    Lack of Comprehensive Evaluation: The absence of downstream effects such as segmentation output or ejection fraction estimation makes it difficult to assess the practical utility of the generated echocardiograms. There is no clinical validation to support the claims about the quality and utility of the synthetic echocardiograms.

    Sensitivity and Robustness: The method is highly sensitive to diffusion step parameters, raising concerns about its robustness and applicability in different settings.

    Reproducibility Issues: The paper lacks sufficient details for reproducibility, including publicly available code and detailed model parameters.

    Inadequate Comparison with State-of-the-Art: The paper does not provide a detailed benchmarking against the latest state-of-the-art models, which is essential to establish the proposed method’s effectiveness.

    While the paper addresses an interesting problem and proposes a potentially useful method, it does not sufficiently meet the criteria for acceptance due to the concerns outlined above. Therefore, I would recommend rejecting the paper in its current form.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I appreciated the novelty of using the optimal transport for pseudo video generation to condition the model.

    But due to the lack of evaluation on downstream tasks and potential clinical usefulness at the current stage, I recommended a Reject.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews, and the criticism relates to novelty and insufficient benchmarking. This meta-reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the synthesis of ultrasound data is as complex and challenging as it is clinically impactful. The authors provided a strong rebuttal and, although their work is in its early stages, it indicates an important direction of research. The authors should highlight limitations in their discussion.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



