Abstract

Echocardiography (ECHO) video is widely used for cardiac examination. In clinical practice, this procedure relies heavily on operator experience, which requires years of training and may benefit from the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, this is challenging because acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlight is three-fold. First, HeartBeat serves as a unified framework that perceives multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies that separately provide fine- and coarse-grained controls in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple the learning of visual concepts and temporal dynamics using a two-stage training scheme to simplify model training. In addition, HeartBeat can easily generalize to mask-guided cardiac MRI synthesis in a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1453_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zho_HeartBeat_MICCAI2024,
        author = { Zhou, Xinrui and Huang, Yuhao and Xue, Wufeng and Dou, Haoran and Cheng, Jun and Zhou, Han and Ni, Dong},
        title = { { HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce HeartBeat, a unified framework that perceives versatile conditions simultaneously to guide controllable video generation. They factorize the multimodal conditions into local and global parts. In addition, the authors propose a two-stage training scheme that decouples the learning of visual concepts and temporal dynamics to ease model training.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work is the first exploration of highly customized cardiac ultrasound video synthesis guided by multimodal conditions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It seems that this work is mainly based on the general video generation framework VideoComposer [25], but the authors do not state how it differs from VideoComposer. In my opinion, the authors have made two main improvements over VideoComposer: (1) categorizing the conditions into local and global ones and treating them separately, although there is a lack of ablation experiments to prove the effectiveness of this design; and (2) decoupling spatial and temporal learning and proposing a two-stage training scheme instead of directly generating videos (to reduce the difficulty of model training). The paper also lacks a description of HeartBeat-3D, such as what its framework is.
    2. The comparisons in Table 1 are not convincing. The authors only compare with MoonShot [29] and VideoComposer [25]. Considering that results on the CAMUS dataset are not reported in those papers or on GitHub, are the results in Table 1 from the authors’ own replication? If so, this needs to be made clear in the paper. The authors need to explain why comparisons with [16,21,24] and others are lacking.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Figure 1 is a bit unclear; consider using a vector graphic.
    2. Are the results of MoonShot [29] and VideoComposer [25] in Table 1 from the authors’ own replication? If so, this needs to be made clear in the paper. The authors need to explain why comparisons with [16,21,24] and others are lacking.
    3. I recommend a more detailed description of HeartBeat-3D.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The current experimental results of this work do not yet support its claimed contribution, and I would need further explanation from the authors.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a unified framework for synthesizing ECHO videos from multimodal conditions, including sketches, optical flow, and text descriptions. The demo shown in the repository demonstrates superior performance of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper exhibits several strengths:

    1. The proposed framework showcases outstanding performance in controllable and high-fidelity ECHO video synthesis.
    2. The proposed method serves as a unified framework capable of perceiving multimodal conditions simultaneously, allowing for precise and flexible control over the synthesized output.
    3. The two-stage training scheme proposed effectively decouples the learning of visual concepts and temporal dynamics.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The description of the condition insertion procedure lacks clarity. It’s unclear whether the network is fine-tuned for each unique combination of multimodal conditions every time. Additionally, it’s unclear if it’s possible to flexibly drop several conditions during inference. If such flexibility exists, the method for achieving this should be clearly explained in the proposed framework.
    2. The potential consequences of contradictory conditions are not addressed. For instance, if sketches and segmentation masks provide conflicting information, it could be interesting to investigate how the model handles such scenarios.
    3. All conditions used in this study, e.g., the optical flow, the segmentation masks, etc., were derived from real images. If all six conditions were derived from the same image, it’s unclear how the synthesized video differs from the original video. Providing a case study or example would help ensure that the model is not simply replicating the real dataset.
    4. While the concept of 3D CINE synthesis is intriguing, there is a lack of downstream analysis to demonstrate the utility of the method. Exploring potential applications of the proposed synthesis pipeline would provide valuable insights into its practical relevance.
    5. The transition from 2D to 3D U-Net architecture is not well-explained, making it difficult to understand how the network was expanded. Simply referencing other works without clear connections to the proposed framework is unhelpful. How did the proposed method differ from reference 28? A more detailed explanation of the architecture expansion and its relevance to the proposed work is needed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see my comments in the previous section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would recommend acceptance based on the performance and novelty of the proposed framework. However, it’s worth noting that the efficiency of the synthesized ECHO has not been investigated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposed a two-stage framework for controllable and high-fidelity ECHO video synthesis that enables perceiving multimodal conditions. These multimodal conditions are divided into local and global signals to provide fine-grained and coarse-grained controls with designed injection strategies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The first to propose a controllable ECHO video synthesis model guided by multimodal conditions.
    2. The two-stage training scheme (T2I & T2V) for visual concept learning and temporal dynamics modeling guarantees high-fidelity and temporally coherent synthesis.
    3. Six multimodal conditions are involved and factorized into local and global ones, and corresponding injection methods are designed to ensure the composability of different conditions.
    4. The experiments are comprehensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The effectiveness of the fusion operation for local controls, i.e., element-wise addition, lacks validation.
    2. Some details of the pipeline are not clear. There appears to be a missing arrow in the finetuning stage of Fig. 1 that should indicate the direction of the concatenated z_t and local controls.
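
For readers unfamiliar with the operation questioned in the two points above, the following is a minimal, hypothetical sketch of fusing local controls by element-wise addition and then concatenating the result with the noisy latent z_t along the channel axis. It is not the authors' implementation: the function names, the tiny shapes, the choice of which conditions count as local, and the convention that a dropped condition simply contributes nothing are all assumptions made for illustration, and the encoders that would map each condition to a feature map are omitted.

```python
# Illustrative sketch only. All names, shapes, and the handling of
# dropped conditions are assumptions, not the paper's implementation.
from typing import Dict, List, Optional

# A feature map stored as nested lists: [channels][height][width].
FeatureMap = List[List[List[float]]]


def zeros_like(fm: FeatureMap) -> FeatureMap:
    """Return an all-zero feature map with the same shape as fm."""
    return [[[0.0 for _ in row] for row in channel] for channel in fm]


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise addition of two feature maps of identical shape."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_local_conditions(
    conditions: Dict[str, Optional[FeatureMap]], template: FeatureMap
) -> FeatureMap:
    """Sum every provided local condition; None marks a dropped condition."""
    fused = zeros_like(template)
    for feature_map in conditions.values():
        if feature_map is None:
            continue  # assumed behaviour: a dropped condition adds nothing
        fused = add_maps(fused, feature_map)
    return fused


def concat_channels(z_t: FeatureMap, fused: FeatureMap) -> FeatureMap:
    """Concatenate the latent and the fused controls along the channel axis."""
    return z_t + fused  # list concatenation over the outermost (channel) axis


# Toy example: a 2-channel latent and two 1-channel local controls (2x2 each).
z_t = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
sketch = [[[1.0, 0.0], [0.0, 1.0]]]
mask = [[[0.0, 2.0], [2.0, 0.0]]]

fused = fuse_local_conditions(
    {"sketch": sketch, "mask": mask, "flow": None}, template=sketch
)
unet_input = concat_channels(z_t, fused)
```

Because addition is order-independent and a missing condition leaves the sum unchanged, a fusion of this kind is composable by construction. It also gives the network no explicit signal for telling the individual conditions apart, which is one reason an ablation of the fusion operation would be informative.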
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the strengths and weaknesses above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I think this is a good paper and should be accepted.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We cordially thank the reviewers for their constructive comments. We will address them in the camera-ready/journal version of the paper.




Meta-Review

Meta-review not available, early accepted paper.


