Abstract

We present MMSummary, the first automated multimodal summary generation system for medical imaging video, with a particular focus on fetal ultrasound analysis. Imitating the examination process performed by a human sonographer, MMSummary is designed as a three-stage pipeline, progressing from keyframe detection to keyframe captioning and finally anatomy segmentation and measurement. In the keyframe detection stage, an innovative automated workflow progressively selects a concise set of keyframes, preserving sufficient video information without redundancy. In the keyframe captioning stage, we adapt a large language model to generate meaningful captions for the fetal ultrasound keyframes. If a keyframe is captioned as fetal biometry, the segmentation and measurement stage estimates the biometric parameters by segmenting the region of interest according to the textual prior. The MMSummary system provides comprehensive summaries for fetal ultrasound examinations and, based on the reported experiments, is estimated to reduce scanning time by approximately 31.5%, suggesting the potential to enhance clinical workflow efficiency.
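
To make the three-stage flow concrete, the following minimal Python sketch outlines the pipeline described above. Every function here is a hypothetical stub for illustration, not the authors' implementation.

    def detect_keyframes(frames):
        """Stage 1 stub: return a concise, non-redundant subset of frames."""
        return frames[::50]  # placeholder: naive uniform subsampling

    def generate_caption(frame):
        """Stage 2 stub: caption a keyframe with an adapted language model."""
        return "fetal biometry: head circumference plane"  # placeholder

    def segment_and_measure(frame, caption):
        """Stage 3 stub: segment the ROI using the caption as a textual
        prior and estimate the biometric parameter from the mask."""
        return {"parameter": "HC", "value_mm": 0.0}  # placeholder

    def mmsummary(frames):
        summary = []
        for frame in detect_keyframes(frames):
            caption = generate_caption(frame)
            entry = {"frame": frame, "caption": caption}
            # Only keyframes captioned as fetal biometry proceed to stage 3.
            if "biometry" in caption:
                entry["measurement"] = segment_and_measure(frame, caption)
            summary.append(entry)
        return summary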

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0399_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0399_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Guo_MMSummary_MICCAI2024,
        author = { Guo, Xiaoqing and Men, Qianhui and Noble, J. Alison},
        title = { { MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce a novel framework for automated multimodal summary generation using fetal ultrasound videos. The framework encompasses three key phases: keyframe detection, keyframe caption generation, and final anatomical segmentation and measurement. By leveraging this framework, a comprehensive summary of the fetal ultrasound examination can be generated, reducing the time required for physicians to review the entire video and enhancing the overall clinical workflow.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Introducing a comprehensive framework for automated analysis of fetal ultrasound videos, aimed at improving clinical workflows and enhancing efficiency.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Insufficient analysis has been conducted regarding the interrelationship between the three phases.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. There is insufficient discussion regarding recent articles on keyframe detection.
    2. How is the accuracy of keyframe caption generation ensured for large-scale models? Has there been any fine-tuning and evaluation conducted on relevant datasets?
    3. Would errors in the caption information have an impact on the segmentation accuracy? How can this issue be mitigated?
    4. Stage 1 lacks a comparison with any existing methods for keyframe detection. There is no comparison to assess the impact of discarding uninformative frames on the results of the subsequent two stages. Additionally, there is a lack of comparative analysis regarding the quality of frames before and after redundancy removal.
    5. Stage 3 lacks a comparison with any text-guided segmentation methods. It is important to highlight the advantages of text guidance in this stage.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall framework of this paper seems to incorporate three separate components: keyframe detection, caption generation, and segmentation. However, it appears that these individual phases may lack substantial innovation. A matter of concern is the limited extent of experimentation conducted. The interplay between the three phases has not been thoroughly analyzed, giving rise to concerns regarding the potential negative implications that each phase may have on subsequent stages.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a multimodal summary generation system for medical imaging video, with a particular focus on fetal ultrasound, comprising a keyframe detection stage, a keyframe captioning stage, and a segmentation and measurement stage.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a novel automated multimodal summary generation system specifically for ultrasound analysis, and the experimental results are solid.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method is a little archaic in places, relying on components such as the Video Transformer and GPT-2. The paper also does not provide sufficient information for reproducibility.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    For stage 1, a memory-based keyframe detection approach might be more powerful.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes a novel automated multimodal summary generation system specifically for ultrasound analysis, and the experimental results are solid.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    This paper proposes a multimodal summary generation system for medical imaging video, with a particular focus on fetal ultrasound, comprising a keyframe detection stage, a keyframe captioning stage, and a segmentation and measurement stage.



Review #3

  • Please describe the contribution of the paper

    The paper describes the first automated multimodal summary generation system. It is a three-stage system that first performs keyframe detection, then keyframe captioning, and finally segmentation and measurement. The experimental results show that the system is able to reduce scanning time by about 31.5%.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article is well structured and the model section is straightforward. The method combines GPT-2 and CLIP to build an automatic summary generation framework, which is currently relatively new and unique. From an application point of view, the framework's processing flow is realistic, and its functionality meets the corresponding needs of diagnosis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are few experiments, and in particular few comparisons with other methods. The experimental section is also insufficient to demonstrate the practicality of the framework; for example, the model complexity, the number of parameters, and the deployment performance are not reported.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have provided a related video introducing the article. This is a great help in understanding this article.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The experimental section is too limited to make the framework's performance convincing. Specifically, the framework ultimately provides segmentation and measurement functionality, so comparing against only two other methods in the Stage 3 experiments is a shortcoming of this paper. Although comprehensive frameworks such as this one are relatively rare, comparisons can still be made against methods that measure an individual parameter. Even though methods designed to measure only one parameter may perform better, comparing against them would give a better picture of the model's performance. In addition, the abstract claims that the framework can improve the efficiency of the diagnostic process; from an application point of view, the experimental section therefore needs to be supplemented with the model's parameter count and complexity, and ideally with deployment experiments, which would better illustrate the framework's application prospects.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The framework proposed in the article, a multimodal summary generation system, is novel and useful. From a methodological point of view, the popular GPT series and CLIP models are used to realize the multimodal segmentation task through textual prompts, which is novel and worthy of study by other scholars.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for their invaluable comments and for acknowledging that our method is novel. The code will be published for reproducibility.

Q1/R1: Interrelationship between the three stages. A1: The three stages (keyframe detection, keyframe captioning, and segmentation and measurement) are trained and evaluated separately because: 1) Each stage has unique objectives, making joint training inefficient and potentially detrimental to performance. 2) Only a few frames in an ultrasound video are keyframes with caption or biometry labels, so misidentified non-keyframes cannot be properly evaluated in subsequent stages due to absent labels. Processing the stages sequentially in real deployment might lead to error accumulation, so we aim to ensure high accuracy in each stage individually. In the future, we plan to predict uncertainty for keyframes and captions, providing extra information for sonographers to correct the multimodal summary and reduce the impact of errors.

Q2/R1: Insufficient discussion on stage 1. A2: Recent papers related to keyframe detection are discussed in the 'Video summarization' part of the Related Work. Early non-deep learning (DL) methods focused on keyframe selection but lacked computational efficiency. Recent DL methods prioritize video clips but are unsuitable for MMSummary because: 1) Clips may include too many frames, complicating documentation. 2) Neighboring frames often appear redundant in ultrasound video. 3) They cannot handle the extreme class imbalance in keyframe detection. 4) Video clips cannot be directly used for biometry.

Q3/R1: Ensuring caption accuracy. A3: Our model comprises the frozen image encoder of BiomedCLIP, a frozen GPT-2, and a mapping network that bridges visual and textual features. The mapping network takes visual features as input and generates prefix embeddings, which are fed into GPT-2. By fine-tuning only the mapping network, GPT-2 can produce accurate captions for keyframes.
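
As a rough illustration of this design, the following PyTorch sketch shows a ClipCap-style mapping network that turns a visual feature into prefix embeddings for a frozen GPT-2. The dimensions, MLP depth, and prefix length are assumptions for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    from transformers import GPT2LMHeadModel

    class MappingNetwork(nn.Module):
        """Trainable bridge from visual features to GPT-2 prefix embeddings."""
        def __init__(self, visual_dim=512, prefix_len=10, gpt_dim=768):
            super().__init__()
            self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
            self.mlp = nn.Sequential(
                nn.Linear(visual_dim, prefix_len * gpt_dim),
                nn.Tanh(),
                nn.Linear(prefix_len * gpt_dim, prefix_len * gpt_dim),
            )

        def forward(self, visual_feat):  # visual_feat: (B, visual_dim)
            prefix = self.mlp(visual_feat)
            return prefix.view(-1, self.prefix_len, self.gpt_dim)

    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    for p in gpt2.parameters():      # GPT-2 stays frozen;
        p.requires_grad = False      # only the mapping network is fine-tuned.

    mapper = MappingNetwork()
    visual_feat = torch.randn(1, 512)    # stand-in for a BiomedCLIP image feature
    prefix_embeds = mapper(visual_feat)  # (1, 10, 768)
    out = gpt2(inputs_embeds=prefix_embeds)  # logits from which a caption is decoded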

Q4/R1: Frame quality before/after redundancy removal. A4: There are many redundant frames in p', so we design Diverse Keyframe Detection to detect representative keyframes p while removing redundancy. As described in the 'Keyframe Detection' section and Supp. Fig. 2, properly chosen thresholds tau and tau' ensure that the keyframes capture essential information while minimizing redundancy and information loss. A too-large tau leads to frame redundancy, while a too-small tau risks information loss. A too-small tau' includes non-informative frames, and a too-large tau' discards useful frames.
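
To illustrate how two thresholds can trade off informativeness against redundancy, here is a minimal, assumption-laden Python sketch; the scoring functions are placeholders, not the paper's Diverse Keyframe Detection algorithm, and tau and tau' play the roles described above.

    import numpy as np

    def informativeness(frame):
        """Placeholder informativeness score (e.g. a classifier confidence)."""
        return float(np.mean(frame))  # dummy stand-in

    def similarity(a, b):
        """Placeholder frame similarity (e.g. cosine similarity of features)."""
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def select_keyframes(frames, tau=0.9, tau_prime=0.5):
        # Keep only informative candidate frames p' (controlled by tau').
        candidates = [f for f in frames if informativeness(f) >= tau_prime]
        # Greedily keep a candidate only if it is not too similar to any
        # already-selected keyframe (controlled by tau), removing redundancy.
        keyframes = []
        for f in candidates:
            if all(similarity(f, k) < tau for k in keyframes):
                keyframes.append(f)
        return keyframes

In this formulation a larger tau admits more mutually similar frames (redundancy) and a larger tau' discards more frames (risking information loss), matching the behaviour described in A4.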

Q5/R1: Comparison with text-guided methods. A5: Our goal is multimodal summary generation, addressing all related challenges, with biometry as a crucial part. Stage 2 generates captions that aid segmentation, mimicking the sonographer's process, so we design a text-guidance method in stage 3. Table 1 shows that text guidance outperforms random prompts. Since our focus is not solely on text-guided segmentation, we did not compare with existing text-guided methods; however, such methods would also be applicable in stage 3.
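
For illustration only, one simple way a caption can guide segmentation is to let its text embedding gate the visual feature channels before the mask decoder, as in the hypothetical PyTorch sketch below; the paper's actual stage-3 architecture may differ.

    import torch
    import torch.nn as nn

    class TextGuidedSegHead(nn.Module):
        """Toy text-conditioned segmentation head (illustrative only)."""
        def __init__(self, img_dim=256, txt_dim=512):
            super().__init__()
            self.txt_proj = nn.Linear(txt_dim, img_dim)
            self.decoder = nn.Conv2d(img_dim, 1, kernel_size=1)  # ROI mask logits

        def forward(self, img_feats, txt_feat):
            # img_feats: (B, C, H, W) visual features; txt_feat: (B, txt_dim)
            # caption embedding from a text encoder (e.g. BiomedCLIP's).
            gate = torch.sigmoid(self.txt_proj(txt_feat))[:, :, None, None]
            fused = img_feats * gate      # text gates visual channels
            return self.decoder(fused)    # (B, 1, H, W)

    head = TextGuidedSegHead()
    mask_logits = head(torch.randn(1, 256, 64, 64), torch.randn(1, 512))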

Q6/R3: Model complexity. A6: MMSummary performs inference after the entire examination. Despite its complexity, the model processes a video in seconds across all three stages, ensuring efficient inference.

Q7/R3: Simple stage 3 experiments. A7: Our ultimate goal is not only segmentation and measurement but also keyframe detection and captioning, in order to generate a multimodal summary. For stage 3, we have compared our biometry performance with two existing methods. Adding more experiments is not feasible under the rebuttal policy. Given that our target is broader than segmentation and measurement, the current comparisons should sufficiently demonstrate the framework's overall performance. (See also Q5.)

Q8/R4: Video Transformer & GPT-2 are a little archaic. A8: The Video Transformer and GPT-2 remain effective and are still used in recent papers, e.g. ViECap (ICCV23), SMALLCAP and AutoAD II (CVPR23). They were chosen for their efficiency, which is especially valuable when computational resources are limited and datasets are small, as in the medical field.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Accept. The authors provided satisfactory answers in the rebuttal. The method is of potential translational value.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The provided rebuttal has addressed the questions. It is also with sufficient novelty to be included for a MICCAI poster presentation.



