Abstract

In the field of intelligent healthcare, access to medical data is severely constrained by privacy concerns, high costs, and limited patient cases, which significantly hinders the development of diagnostic models for reliable clinical assistance. Although previous efforts have synthesized medical images via generative models, they are limited to static imagery that fails to capture the dynamic motions of clinical practice, such as the contractile patterns of organ walls, leading to fragile diagnostic predictions. To tackle this issue, we propose a holistic paradigm, VidMotion, to boost medical image analysis with generative medical videos, representing the first exploration in this field. VidMotion consists of a Motion-guided Unbiased Enhancement (MUE) module, which augments static images into dynamic videos at the data level, and a Motion-aware Collaborative Learning (MCL) module, which learns jointly from images and generated videos at the model level. Specifically, MUE first transforms medical images into generative videos enriched with diverse clinical motions, guided by an image-to-video generative foundation model. Then, to avoid the potential clinical bias caused by imbalanced generative videos, we design an unbiased sampling strategy informed by the statistical class-distribution prior to extract high-quality video frames. In MCL, we jointly learn image and video representations through video-to-image distillation and image-to-image consistency, fully capturing the intrinsic motion semantics for motion-informed diagnosis. We validate our method on extensive semi-supervised learning benchmarks and show that VidMotion is highly effective and efficient, significantly outperforming state-of-the-art approaches. The code will be released to benefit the community.
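
For readers wanting a concrete picture of the two-stage paradigm, the following is a minimal sketch of one training step, assuming a frozen image-to-video diffusion model and a standard image classifier. The helper names (i2v_model.generate, sample_frames) are illustrative placeholders rather than the authors' released API, and the semi-supervised branch is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def vidmotion_training_step(images, labels, i2v_model, encoder, classifier,
                                sample_frames):
        # MUE: turn each static image into a short generated video with a frozen
        # image-to-video foundation model, then keep a class-balanced subset of
        # frames via unbiased sampling (sample_frames is a placeholder helper).
        with torch.no_grad():
            videos = i2v_model.generate(images)                # (B, T, C, H, W)
        frames, frame_labels = sample_frames(videos, labels)   # class-prior-aware
        # MCL: learn jointly from the original images and the sampled video frames;
        # the frames inject motion semantics while sharing the image-level labels.
        img_logits = classifier(encoder(images))
        frame_logits = classifier(encoder(frames))
        return F.cross_entropy(img_logits, labels) + F.cross_entropy(frame_logits, frame_labels)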

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0931_paper.pdf

SharedIt Link: https://rdcu.be/dV1Vs

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72384-1_19

Supplementary Material: N/A

Link to the Code Repository

https://github.com/CUHK-AIM-Group/VidMotion

Link to the Dataset(s)

https://www.nature.com/articles/s41597-021-00920-z
https://challenge.isic-archive.com/data/

BibTex

@InProceedings{Li_From_MICCAI2024,
        author = { Li, Wuyang and Liu, Xinyu and Yang, Qiushi and Yuan, Yixuan},
        title = { { From Static to Dynamic Diagnostics: Boosting Medical Image Analysis via Motion-Informed Generative Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {195--205}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper seeks to improve diagnostic accuracy by extending static images to dynamic videos. To this end, the authors propose a Motion-guided Unbiased Enhancement (MUE) that employs an image-to-video (I2V) diffusion model (i.e., SVD) with an unbiased sampling strategy, which can alleviate the bias issue in medical data. In addition, to further enhance the learning of motion, they design a Motion-aware Collaborative Learning (MCL) module. Experiments show the effectiveness of these two modules.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper seeks to enhance the performance of diagnostics aided by generated videos, which is novel in the medical area.
    2. The experiments show that compared with baselines, the proposed method can achieve higher performance on several datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In the MUE module, it is unclear which part contributes more to the final performance gain: the auxiliary information from the generated videos or the unbiased sampling. In some cases, unbiased sampling at the image level alone can achieve better performance.

    2. The overall pipeline seems more like a general deep learning framework, which can be used for medical data or natural images.

    (a) For the video generation part, if only a single image is fed to the SVD, how can one control which kind of motion will be generated? It seems uncontrollable, and an auxiliary control signal appears necessary.

    (b) As a low-quality video is meaningless, how is the quality of the generated videos ensured?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to Weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My main concern is the quality of the generated video and whether the information gain from the generated videos is the intrinsic reason for the performance improvement.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work proposes using temporal information to improve self-supervised tasks in medical imaging. Specifically, the MUE module is proposed for image-to-sequence extension and unbiased sampling. Additionally, the MCL module jointly performs SSL on images and synthesised sequences. The MCL training is constrained by two additional losses, one for distilling video information into static images and one for fostering consistent image representations across different kinds of augmentations. The approach is evaluated for the task of classification on two different datasets with different percentages of labels provided. Overall, the proposed method is able to outperform several baselines on various metrics on these datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The manuscript proposes a novel approach for incorporating temporal information into SSL and a way to synthesise such information using diffusion models. This innovative approach to the problem improves learning with limited data, which can be crucial for the community.

    • The proposed approach significantly outperforms the included baselines on two different datasets, in different settings of available % of labels, and on various metrics.

    • Ablation studies for various parts and hyperparameters of the proposed pipeline are presented, highlighting the performance gains of each sub-part.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The consistency, formatting and clarity of the manuscript have some minor flaws, which I outline in the detailed comments.

    • The related work lacks coverage of approaches specifically utilising diffusion models for imbalanced RGB data, such as [Allmendinger et al., 2023] and [Frisch et al., 2023], whereas other work on style transfer or GAN-based models might not be as relevant to the scope of this work.

    • Page 6: “As video generation does not change the semantic-level role of the given image, we directly assign a consistent label to the generated video frames.” That is quite a strong assumption and is not generally true. Consider, for example, classifying pathologies that move out of the scene over time.

    • The idea behind the sampling modification for obtaining x tilde could be clearer. Since this sampling is only used on the synthetic videos, it will not influence the sampling of the input images x^l/x^u to start with (these will already be biased). There is also no ablation for this sampling mechanism, leaving the actual impact unclear.

    • Using the approach for Kvasir-Capsule is clearly motivated. However, I am unsure about the motivation for ISIC 2018. Does this dataset contain video sequences? What are video sequences in the dermatology environment? It remains unclear whether SVD was trained explicitly on capsule endoscopy and dermatology sequences or whether it infers the motion from its pre-training on natural image datasets. Maybe other sequence datasets, e.g., CholecSeg8k, would better suit the scope of the paper. This should be re-evaluated, considering the conclusions drawn in Section 3.3.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The authors claim to release the code
    • Experiments are conducted on public data
    • Manuscript contains most necessary details to reproduce the experiments
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor suggestions for improving the manuscript’s formatting/consistency:

    • Figure 2 is hard to read if not zoomed in 200% or more.

    • Section 2, Page 3: The methodology description mentions “leverages the frozen SVD model…”, which is not explained beforehand. I suggest introducing the model and describing its training data. It is unclear whether the SVD was fine-tuned or fully trained, etc.

    • Page 4: “…, to synthetic videos from referenced images”

    • Page 4: “…, we use the labeled and unlabeled data as the diffusion condition, …” is misleading and unclear how and to what extent it ensures “semantic and spatial consistency”.

    • The use of x^{l/u} in eq. (1) makes it look like the re-sampled frames were used here, which I think was not the case since the SVD was applied to x^l and x^u individually. Further, the use of small / capital letters for the variables (image samples) is not consistent. Also, the definition of v_i contains itself. I suggest carefully reformulating and reworking the mathematical definitions in section 2.

    • Video-to-Image Distillation: “ MLP projection layer on the image embedding to scale up the dimension for more representative space” can be supported with mathematical definitions.

    • Page 7: “…, we randomly use 5% ratio of data for the video generation” is unclear.

    • Some improvable phrasings in 3.2 (page 7)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a novel approach to an important problem with a potentially significant impact on the MICCAI community. All my concerns are relatively minor and can mostly be addressed before publication. Overall, the strengths outweigh the weaknesses.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed all reviewers’ points accordingly. They provided clarification regarding my concerns about the work’s motivation and about the SVD and unbiased sampling parts. They promise to update references and improve the manuscript’s organisation and writing quality before publication.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel framework for medical image analysis by leveraging motion-informed generative videos, a departure from traditional static imagery approaches. This innovative method captures the dynamic nature and motion semantics inherent in certain medical data scenarios, offering a more comprehensive understanding of the underlying dynamics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides sufficient technical novelty in the proposed method. Using static images to generate data that also captures motion semantics is exciting. The paper is well written and the results are convincing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While I don’t perceive a major weakness in the paper, I believe the author could enhance the study by demonstrating the use of image-to-image consistency loss. By conducting an ablation study and comparing performance with and without this loss term in Motion-aware Collaborative Learning, the study could elucidate the extent to which consistency loss contributes to performance improvements. This would provide valuable insights into the efficacy of incorporating consistency loss in the proposed framework.

    In terms of the data generation process, the paper mentions the creation of 25 frames for the videos. The rationale behind this specific number of frames remains unclear. This raises the question: is there a domain-specific perspective guiding this choice? Furthermore, what temporal resolution do the videos achieve as a result?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The author claims that the code will be made publicly available, but it remains unclear whether this will occur after the acceptance of the paper or not. Clarity on the timing of code availability would benefit potential users and enhance the reproducibility of the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The author can expand the ablation study by including the effects of different loss terms used during training.

    Furthermore, providing more details about the computational efficiency of the framework would be beneficial.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper develops a novel framework for generating data using motion-informed generative videos. This framework could even be extended to other communities, such as remote sensing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate the unanimous recognition of our work’s novelty and its potentially significant impact on the community.

R1: Which part in MUE is more critical? In MUE, video generation contributes more than unbiased sampling, as justified by our ablation study. Compared with the full model (73.5% MAP, 83.7% AUC), replacing videos with augmented images gives large drops (70.2% MAP, 80.7% AUC), while removing unbiased sampling mainly influences MAP (71.6% MAP and 83.2% AUC). This indicates that video semantics are vital for trustworthy diagnosis.

R1: Motion controllability and quality of generative videos. a) Quality: Our empirical validation shows that generated videos greatly enhance diagnostic accuracy, demonstrating their quality for precision medicine. We also assess the generation quality with the Inception Score (IS). The real data achieves IS = 2.09, while the generated videos give IS = 2.01. The close IS values justify the high quality of the generated videos. b) Controllability: The motion intensity is controlled by the hyperparameter gamma, while the motion types are determined by how the video foundation model semantically interprets the given image. Note that our focus is to explore how generative videos improve diagnosis. Our future extension will add text guidance to further improve controllability.
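
For reference, the Inception Score quoted above is commonly computed from the softmax outputs of a pretrained classifier over the generated frames. A minimal sketch, assuming the class probabilities have already been computed (this is the standard IS formula, not code from the paper):

    import numpy as np

    def inception_score(probs, eps=1e-12):
        # probs: (N, C) softmax outputs of a pretrained classifier (e.g. Inception-v3)
        # over N generated frames; returns exp(E_x[ KL(p(y|x) || p(y)) ]).
        probs = np.asarray(probs, dtype=float)
        p_y = probs.mean(axis=0, keepdims=True)                       # marginal p(y)
        kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
        return float(np.exp(kl.mean()))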

R1: Whether the information gain from generated videos inherently improves performance. Yes. Compared with the baseline (68.1% MAP, 80.4% AUC), introducing videos (w/o unbiased sampling) gives large gains (71.6% MAP and 83.2% AUC), confirming that the video-based information gain is the crucial factor.

R2: The efficacy of the consistency loss. We have validated the losses in Table 2 with ablative settings of the image-to-image (I2I) consistency and video-to-image (V2I) distillation losses. The consistency loss greatly improves MAR, making the model more robust by reducing missed cases.
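
As a rough illustration of how such losses are often instantiated (the paper's exact formulations may differ), one could write the V2I distillation as a soft-teacher KL term and the I2I consistency as agreement between two augmented views; all shapes and names below are assumptions for the sketch:

    import torch
    import torch.nn.functional as F

    def mcl_losses(img_logits, img_logits_aug, frame_logits, temperature=2.0):
        # Assumed shapes: img_logits and img_logits_aug are (B, C) predictions for
        # two augmented views of the same images; frame_logits is (B, T, C) for the
        # T generated video frames of each image.
        # V2I distillation: the frame-averaged prediction acts as a soft teacher
        # for the static-image prediction.
        teacher = F.softmax(frame_logits.mean(dim=1) / temperature, dim=-1)
        student = F.log_softmax(img_logits / temperature, dim=-1)
        v2i_distill = F.kl_div(student, teacher, reduction="batchmean")
        # I2I consistency: predictions of the two augmented views should agree.
        i2i_consistency = F.mse_loss(F.softmax(img_logits, dim=-1),
                                     F.softmax(img_logits_aug, dim=-1))
        return v2i_distill, i2i_consistency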

R2: Why generate 25-frame videos. Generating more frames requires significantly more GPU resources, which is also challenging in natural imaging. It is common practice to generate 25 frames, i.e., the 1 given image plus 24 generated frames, corresponding to 4 seconds of 6-FPS video. We will explore the generation of more frames in the future.

R2: Computational efficiency. At test time, we only use the image classification model, achieving real-time inference at 129 FPS on an NVIDIA 4090 GPU. This performance level is sufficient for clinical practice.

R3: More proper references. We will update the references as suggested for better focus.

R3: The claim that video generation does not change the semantics of the given images is too strong. We will relax it to state that generated videos maintain semantic consistency with the reference images to a certain degree, since the disease area may move out of the frame in some cases.

R3: Clarify unbiased sampling. Unbiased sampling enhances rare classes by collecting more frame images for them, promoting a balanced distribution for diagnostic fairness. Compared with the full model (73.5% MAP), removing unbiased sampling causes a drop (71.6% MAP), justifying its vital role.
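
A minimal sketch of an inverse-frequency rule of this kind, assuming per-class training counts are known; the exact formula used in the paper may differ:

    import numpy as np

    def frames_per_class(class_counts, max_frames_per_video=25, min_frames=1):
        # Inverse-frequency rule: rarer classes keep more generated frames per
        # video, so the frame-level training distribution is closer to uniform.
        counts = np.asarray(class_counts, dtype=float)
        inv_freq = counts.max() / counts
        budget = np.clip(np.round(inv_freq), min_frames, max_frames_per_video)
        return budget.astype(int)

    # e.g. class_counts = [900, 300, 50]  ->  [1, 3, 18] frames kept per video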

R3: Clarify SVD usage and why dermatology videos are generated. The SVD we use is pre-trained on large-scale online videos with strong generalization capacity spanning natural and medical domains. For dermatology images, the generated videos simulate plausible camera movements, e.g., translation and zoom, which are crucial for the performance gains. Since CholecSeg8k is a segmentation dataset, we will explore this fine-grained setting in the future.

R3: Minors. We will improve the manuscript accordingly: make Fig. 2 more readable, correct confusing symbols/typos, and clarify the following details. The SVD we use is pre-trained on large-scale online data. In SVD, images act as the condition guiding generation, so the videos retain similar semantic content. MUE generates videos and then samples frames for a balanced distribution. We use 5% of the images to generate videos as a hold-out experiment, which can be further improved with larger ratios.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces an innovative approach, the Motion-guided Unbiased Enhancement (MUE) and Motion-aware Collaborative Learning (MCL) modules, which collectively enhance diagnostic processes by transforming static medical images into dynamic videos. Despite some concerns about the control and quality of generated motion from Reviewer #1, the rebuttal satisfactorily addresses these and reinforces the benefits of the proposed methods.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


