Abstract

Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.
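For intuition, here is a minimal sketch of the augmentation strategy described above: real videos are pooled with videos synthesized from labeled still images, and each synthetic clip inherits the class label of its source image. The lddm_synthesize stand-in, class names, and data layout below are hypothetical placeholders, not the released API.

    # Illustrative sketch only; lddm_synthesize is a placeholder for the actual
    # image-to-video model released at https://github.com/MedAITech/U_I2V.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class VideoSample:
        frames: List   # list of frames (e.g., numpy arrays); kept generic here
        label: str     # e.g., "benign" or "malignant"
        source: str    # "real" or "synthetic"


    def lddm_synthesize(image, num_frames: int = 16):
        """Hypothetical placeholder for the image-to-video generator.
        Here we simply repeat the image so the sketch runs end to end."""
        return [image] * num_frames


    def build_augmented_set(real_videos: List[VideoSample],
                            labeled_images: List[tuple]) -> List[VideoSample]:
        """Combine real videos with synthetic videos generated from labeled images.
        Each synthetic clip inherits the class label of its source image."""
        synthetic = [
            VideoSample(frames=lddm_synthesize(img), label=lbl, source="synthetic")
            for img, lbl in labeled_images
        ]
        return real_videos + synthetic


    if __name__ == "__main__":
        real = [VideoSample(frames=["f0", "f1"], label="benign", source="real")]
        images = [("still_image_0", "malignant"), ("still_image_1", "benign")]
        train_set = build_augmented_set(real, images)
        num_syn = sum(s.source == "synthetic" for s in train_set)
        print(f"{len(train_set)} training clips ({num_syn} synthetic)")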

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3070_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/MedAITech/U_I2V

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Che_Ultrasound_MICCAI2024,
        author = { Chen, Tingxiu and Shi, Yilei and Zheng, Zixuan and Yan, Bingcong and Hu, Jingliang and Zhu, Xiao Xiang and Mou, Lichao},
        title = { { Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes latent dynamic diffusion models for ultrasound image-to-video synthesis. The synthesized videos can be used to augment the training dataset and improve classification performance. The authors conduct experiments on public ultrasound video datasets. Though the proposed method could be useful, the novelty and technical contribution are incremental and limited.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The writing is good and easy to understand.

    2. Experiments are conducted on public datasets, with comparison against the baseline model cINN.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The idea of synthetic augmentation, i.e., using synthetic data to augment the training set, has been well explored in previous methods, especially in the medical image processing domain (e.g., Chen et al.). [1] Chen, Richard J., et al. “Synthetic data in machine learning for medicine and healthcare.” Nature Biomedical Engineering 5.6 (2021): 493-497.

    2. The two-stage design of the latent dynamic diffusion model is not new and is very similar to the latent video diffusion model (He et al.). The task of image-to-video generation with diffusion models has also been explored extensively in the natural-scene setting (e.g., Ni et al.). These make the contribution incremental and less interesting. [2] He, Yingqing, et al. “Latent video diffusion models for high-fidelity long video generation.” arXiv preprint arXiv:2211.13221 (2022). [3] Ni, Haomiao, et al. “Conditional image-to-video generation with latent flow diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

    3. The authors do not provide videos in the supplementary materials, which makes it hard to judge the quality of the generated videos. It is difficult to evaluate temporal consistency from the provided sampled frames alone. Also, some details are missing from the paper, e.g., how are the frames in Fig. 2 sampled? What is the length of the generated videos?

    4. It is unclear how the class label of a generated video is obtained when training the video classification model. The authors mention “operating directly on individual ultrasound images, without requiring additional inputs”. Do the authors determine the class of a generated video solely from the provided first image?

    5. The authors should also compare their method with other diffusion-model-based methods, such as LFDM (Ni et al.) and LVDM (He et al.).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness part.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the strength and weakness part.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed some of my concerns. I am happy to raise my score.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a method to create realistic ultrasound video clips from a single image. The method is adapted from latent diffusion models, which are well known for image generation. The proposed model additionally models the video’s dynamics by including the time dimension in the latent embedding. The utility of the method is demonstrated by including synthetic videos in the training sets of downstream classifiers for breast lesion classification, and significant performance gains are seen when including synthetic videos generated from single images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall, this is a strong paper. It is very well written and clear. The methods are appropriate and novel, and the experimental validation clearly demonstrates the utility of the generated videos as synthetic training data. The paper is well motivated since the recording of still ultrasound frames is far more common than the recording of videos.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness is the lack of details for reproducibility. See specific comments on this below.

    Literature review: outside of ultrasound videos, there are many approaches to the problem of image-to-video synthesis in the wider computer vision literature. The authors do not mention any such approaches in their related work or explain how their method compares to these approaches. They only discuss other work with ultrasound specifically.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No source code is mentioned.

    There are a large number of missing details, which collectively would make reproducing the experiments impossible. I do understand the challenging space requirements of a MICCAI paper, but many of these details would take very little space, and others could be put into the supplementary information. For example:

    • What is the specific architecture of the encoder? 3D ResNet-50, 3D ResNet-101, etc? A custom configuration?
    • What is the size and shape of the embedding z?
    • How many frames does the 3D ResNet process at a time? During training of the encoder, how are videos of an arbitrary length chunked up to pass into the encoder? Are all generated videos of the same length? If so, what is this length?
    • The architecture of D is completely unclear.
    • How exactly is the diffusion process conditioned on the first frame x_0? There are various guiding mechanisms used for diffusion, so I don’t believe there is one obvious way to do this.
    • Are any image preprocessing steps applied to the images for any of the steps?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I have only a few additional comments:

    • What is s_0 in Sec 2.2?
    • It occurs to me that, depending on the exact mechanism, conditioning on the initial frame x_0 brings no guarantee whatsoever that the initial frame of the resulting clip is actually anything close to x_0. Is this correct? Is this observed in practice? Would there be an advantage to more explicitly enforcing that the first frame is closely reconstructed?

    Minor comments/grammar:

    • Minor point but I had to re-read this sentence multiple times to understand what the authors were trying to say. The use of the word “respectively” is incorrect (there are no parallel lists) and therefore confusing: “We create 3 data splits by randomly selecting 70%, 50%, and 30% of the videos for training, respectively”. This process would probably be better described as a k-fold validation (where folds have different sizes in this case).
    • “1e-5” is programming shorthand in some languages and is not appropriate for papers. Use “1 x 10^-5”.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall the paper is strong. However the lack of reproducibility is troubling. I would like to accept this paper, but this would require the authors to address the lack of reproducibility in the response.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a diffusion model to generate ultrasound videos from static images. Training with the generated videos yields superior performance compared with using only the available real videos.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The diffusion model used to generate videos from static images yields a performance improvement. The paper provides visualizations based on the static images, and the generated videos are shown to serve successfully as training data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the Results section, model performance is compared between “real only”, “synthetic only”, and “real+synthetic”. Are the training datasets the same size?

    The position of Table 2 could be adjusted so that it appears alongside the Results section.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Releasing the code for video generation would be helpful for reproducing the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    For the evaluation, if the training datasets are the same size, an extra explanation could be added to clarify the fairness of the training data across the compared methods. If not, the compared methods should be trained with the same amount of training data.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper employs a latent diffusion model for video generation and shows performance improvements. The results comparison could be made clearer to demonstrate the method’s effectiveness.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The explanation of the fairness of the evaluation is satisfactory.




Author Feedback

We sincerely appreciate the reviewers for providing constructive comments.

Code (R3&R4&R5) We promise to make our code publicly available.

Reviewer #3 Q1 Lack of details.
1. The encoder architecture is 3D ResNet-18.
2. The size of z is (T/r, H/r, W/r, C/r), where r is a downsampling factor. This is significantly smaller than the original video size, thereby reducing computational requirements.
3. During training, videos of arbitrary duration are segmented into clips of 48 frames. We then uniformly sample 16 frames from each clip and feed them into the 3D ResNet. All generated videos have a duration of 2 seconds at 8 fps.
4. The architecture of D comprises residual blocks containing AdaIN layers.
5. For the diffusion process, we adopt a commonly used framework that generates images conditioned on text prompts.
6. Yes, we preprocess all frames to have the same dimensions.

We will clarify these implementation details in the final version of the paper.
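
To make point 3 above concrete, the sketch below illustrates one way the 48-frame clip segmentation and uniform 16-frame sampling could be implemented before feeding the 3D ResNet-18 encoder. It assumes PyTorch tensors in (C, T, H, W) layout; the function names and shapes are illustrative, not the released implementation.

    import torch

    CLIP_LEN = 48      # frames per training clip (from the rebuttal)
    NUM_SAMPLED = 16   # frames uniformly sampled from each clip


    def segment_into_clips(video: torch.Tensor) -> list:
        """Split a (C, T, H, W) video into non-overlapping 48-frame clips."""
        c, t, h, w = video.shape
        return [video[:, s:s + CLIP_LEN] for s in range(0, t - CLIP_LEN + 1, CLIP_LEN)]


    def uniform_sample(clip: torch.Tensor, n: int = NUM_SAMPLED) -> torch.Tensor:
        """Uniformly sample n frames along the temporal axis of a (C, T, H, W) clip."""
        t = clip.shape[1]
        idx = torch.linspace(0, t - 1, n).round().long()
        return clip[:, idx]


    if __name__ == "__main__":
        video = torch.randn(1, 96, 224, 224)        # grayscale ultrasound video, 96 frames
        clips = segment_into_clips(video)           # two 48-frame clips
        batch = torch.stack([uniform_sample(c) for c in clips])
        print(batch.shape)                          # torch.Size([2, 1, 16, 224, 224])
        # The 3D ResNet-18 encoder would then map each sampled clip to a latent z
        # of reduced size (T/r, H/r, W/r, C/r), where r is the downsampling factor.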

Q2 Did not mention approaches in computer vision in their related work. We compared our model with cINN [18] and found that other relevant methods in CV typically require additional conditioning information, such as text or motion direction, which is impractical in our medical scenario. Nevertheless, we will provide a review of related CV work in the final version of the paper.

Q3 What is s_0 in Sec 2.2? Sorry for the typo; it should be z_0. We will correct this.

Q4 Does conditioning on the initial frame x_0 guarantee that the first generated frame resembles x_0? Thanks for the question. In our experiments, the generated initial frame closely resembled x_0 in content, so we did not include an explicit reconstruction constraint for the first frame.
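
For illustration of the conditioning question, one common mechanism for conditioning a latent diffusion denoiser on an initial frame is to concatenate the encoded first frame with the noisy video latent along the channel axis. The toy module below sketches only this generic idea; the paper does not confirm that this is its exact scheme, and all names and shapes here are hypothetical.

    import torch
    import torch.nn as nn


    class ConditionedDenoiser(nn.Module):
        """Toy denoiser conditioned on a first-frame latent by channel-wise
        concatenation. Illustrative only; not the paper's confirmed mechanism."""

        def __init__(self, latent_ch: int = 4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(2 * latent_ch, 64, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv3d(64, latent_ch, kernel_size=3, padding=1),
            )

        def forward(self, noisy_latent: torch.Tensor, cond_latent: torch.Tensor) -> torch.Tensor:
            # Broadcast the single-frame condition across time, then concatenate on channels.
            cond = cond_latent.expand(-1, -1, noisy_latent.shape[2], -1, -1)
            return self.net(torch.cat([noisy_latent, cond], dim=1))


    if __name__ == "__main__":
        model = ConditionedDenoiser()
        z_t = torch.randn(1, 4, 16, 28, 28)    # noisy video latent (B, C, T, H, W)
        z_cond = torch.randn(1, 4, 1, 28, 28)  # latent of the conditioning first frame
        print(model(z_t, z_cond).shape)        # torch.Size([1, 4, 16, 28, 28])

Because such conditioning is only learned, nothing forces the decoded first frame to match x_0 exactly, which is why an explicit first-frame reconstruction term could in principle be added, as the reviewer notes.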

Q5 Minor comments/grammar Many thanks for pointing these issues out. We will ensure that all the points raised are addressed and corrected appropriately in the final version.

Reviewer #4 Q1 Synthetic augmentation is not new in medical image processing. While data augmentation using synthetic samples has been studied extensively in medical image analysis, generating videos from medical images and utilizing them for downstream diagnostic tasks remains under-explored.

Q2 Latent dynamic diffusion model is not new. While some computer vision works exist, they typically require additional conditioning, such as text or motion cues, besides the initial frame, which is impractical in our medical scenario. Hence, we devise a new framework for this unconditional video generation setting. Notably, generating ultrasound videos from static images is an under-explored yet useful task in the medical field.

Q3 Did not provide videos in supplementary materials. We promise to make our code publicly available and provide generated videos on the code repository page.

Q4 Some details are missing. Sorry for not being clear on these. The videos in Fig. 2 have a duration of 2 seconds at 8 fps, but due to the space limit, only 6 frames are displayed.

Q5 Determine the class of generated video just by the first provided image? Yes.

Q6 Comparison with other diffusion-model-based methods. We attempted to compare against [9] but found it computationally expensive and yielding poor results, so we did not report its performance. Additionally, most diffusion models in computer vision require additional prompts, which is not applicable to our unconditional video generation scenario based solely on the initial frame.

Reviewer #5 Q1 Are the training datasets the same size? The size of the “real+synthetic” training set is the sum of the “synthetic only” and “real only” training set sizes. The quantity of “synthetic only” data depends on the size of the BUSI dataset. Our goal is to improve classification performance by augmenting the real data with synthetic samples, which inevitably increases the overall training set size. It is worth noting that generating these synthetic videos does not incur any additional cost.

Q2 The position of Table 2 can be adjusted. Thanks for the suggestion. We will adjust accordingly.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces a novel approach to generating synthetic ultrasound videos from static images using latent dynamic diffusion models. Despite initial concerns about reproducibility and the lack of detailed methodological descriptions, the authors addressed these adequately in their rebuttal, committing to providing comprehensive details and public access to the source code upon acceptance. All reviewers agree to accept the paper after the rebuttal. Authors should include all details mentioned and promised in the rebuttal in the camera-ready version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal is adequate and the reviewer has increased the score. Currently, the reviewers have reached a consensus of accepting this submission. I recommend to accept this submission, however, please include the answers from rebuttal into the camera-ready version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



