Abstract

Detecting standard frame clips in fetal ultrasound videos is crucial for accurate clinical assessment and diagnosis. It enables healthcare professionals to evaluate fetal development, identify abnormalities, and monitor overall health with clarity and standardization. To augment sonographer workflow and to detect standard frame clips, we introduce the task of Visual Query-based Video Clip Localization in medical video understanding. It aims to retrieve a video clip from a given ultrasound sweep that contains frames similar to a given exemplar frame of the required standard anatomical view. To solve the task, we propose STAN-LOC, which consists of three main components: (a) a Query-Aware Spatio-Temporal Fusion Transformer that fuses the information available in the visual query with the input video, resulting in visual query-aware video features that we model temporally to capture the spatio-temporal relationships between them; (b) a Multi-Anchor, View-Aware Contrastive loss to reduce the influence of inherent noise in manual annotations, especially at event boundaries and in videos featuring highly similar objects; and (c) a query selection algorithm at inference that selects the best visual query for a given video, reducing the model's sensitivity to the quality of visual queries. We apply STAN-LOC to the task of detecting standard-frame clips in fetal ultrasound heart sweeps given four-chamber view queries. Additionally, we assess the performance of our best model on PULSE [2] data for retrieving the standard transventricular plane (TVP) in fetal head videos. STAN-LOC surpasses the state-of-the-art method by 22% in mtIoU. The code will be available upon acceptance at xxx.github.com.
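For illustration only, the following is a minimal sketch of the query-aware fusion idea described above: per-frame video features attend to the visual-query features via cross-attention, and the result is then modelled temporally with self-attention before a small head scores clip boundaries. It uses standard PyTorch modules; the module name `QueryAwareFusion`, the single-block depth, and the layer sizes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QueryAwareFusion(nn.Module):
    """Hypothetical sketch: fuse a visual-query embedding into per-frame video
    features via cross-attention, then model time with self-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Cross-attention: video frames attend to the visual-query tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over the (now query-aware) frame sequence.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Head predicting per-frame start/end scores for the target clip.
        self.boundary_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) frame features; query_feats: (B, Q, dim) query tokens.
        fused, _ = self.cross_attn(video_feats, query_feats, query_feats)
        fused = self.norm1(video_feats + fused)          # query-aware frame features
        temporal, _ = self.temporal_attn(fused, fused, fused)
        temporal = self.norm2(fused + temporal)          # spatio-temporal features
        return self.boundary_head(temporal)              # (B, T, 2) start/end logits

# Example usage with random tensors standing in for encoder outputs.
model = QueryAwareFusion()
logits = model(torch.randn(1, 120, 256), torch.randn(1, 1, 256))
```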



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0870_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0870_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Mis_STANLOC_MICCAI2024,
        author = { Mishra, Divyanshu and Saha, Pramit and Zhao, He and Patey, Olga and Papageorghiou, Aris T. and Noble, J. Alison},
        title = { { STAN-LOC: Visual Query-based Video Clip Localization for Fetal Ultrasound Sweep Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a visual query-based video clip localization method to retrieve video clips containing the required standard anatomical views from ultrasound videos. The effectiveness of the proposed method is evaluated on a private and a public fetal ultrasound dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Aiming at the demand for video clip retrieval in clinical settings, this paper introduces the task of Visual Query-based Video Clip Localization and proposes a query-aware spatio-temporal transformer model for this task.
    (2) This paper proposes a Multi-Anchor, View-Aware Contrastive Loss and a Temporal Uncertainty Robust Localization Loss to address uncertain video clip boundaries and noisy labels.
    (3) This paper proposes a Visual Query selection module, used during inference, to select the best visual query for a given video.
    (4) This study validates its method using limited data and noisy labels on two different fetal ultrasound datasets. On the public dataset, the method surpasses the state-of-the-art method by 22% in mIoU.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) This paper has some formatting errors and unclear expressions; for example, "Specifically. we……" in Section 2.1, and what the "x" in "Ts(x)" refers to in the description of the Temporal Uncertainty Robust Localization Loss in Section 2.2 is not stated.
    (2) In Section 2.1, the paper points out that the problem with previous work is that "This reduces the relevance of visual queries and results in features with less information about the visual query [12]", yet the proposed Query-Guided Spatial Transformer module is meant to improve the correlation between video features and visual query features so as to "ensure that the video features are contextualised by the information contained within the visual query features". Is the logic of this argument unclear?
    (3) What do "negative query", "positive frames" and "negative frames" represent in the negative View-Aware Contrastive Loss proposed in Section 2.2? Note that a positive query and a negative query are fed into the network at the same time. If the "positive frames" have the same class as the positive query, is the purpose of this loss to close the distance between the features of all frames whose class differs from that of the positive query and the features of the negative query? Does it affect network performance if they have different classes? If the class of the "positive frames" is the same as that of the negative query, how does this loss differ from the Positive View-Aware Contrastive Loss?
    (4) The proposed method lacks significant novelty, as the network structure proposed in Section 2.1 and the Multi-Anchor, View-Aware Contrastive Loss proposed in Section 2.2 appear to be a straightforward application of cross-attention, self-attention and contrastive loss.
    (5) In the inference phase described in Section 2.3, all frames of the video to be retrieved are first classified, the frame most similar to the desired category is selected, and then an appropriate visual query is selected from the VQ Bank based on that frame. Why not simply use the classified frame from that video as the visual query?
    (6) It seems unreasonable that the visual query input to the network during training is a real image, whereas during inference it is a feature map obtained by averaging the features of multiple images extracted by an extra network.
    (7) The results on the private dataset in Table 1 do not state whether the visual queries come from the in-distribution or out-of-distribution database.
    (8) The role of using in-distribution and out-of-distribution visual queries on the private dataset is not explicitly stated, and the related results are not explained in detail.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) The authors should carefully check the writing before submission.
    (2) The authors could provide more details about the methods to justify them.
    (3) The authors could further refine the experimental design and explain it fully.
    (4) The authors should streamline the writing for clearer and simpler descriptions.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Limited novelty and confusing writing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a framework for Visual Query-based Video Clip Localization, called STAN-LOC, which can help sonographers reduce their workload. Specifically, the framework selects the best frame for an ultrasound video as a visual query, feeds it into an encoder along with the video to extract features, fuses the visual query information into the video features with the proposed Query-Aware Spatio-Temporal Fusion Transformer, models them with a Spatio-Temporal Transformer to generate spatio-temporal features, and finally outputs the predicted start and stop frames of the video clip through an MLP. It is worth mentioning that the authors propose a Multi-Anchor View-Aware Contrastive Loss to enable the model to better categorize the frames in the video.
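    As a rough, non-authoritative illustration of the inference flow summarized above (classifier-based query selection from a VQ bank followed by clip localization), here is a minimal sketch; the helpers `classifier_features`, `vq_bank`, and `stan_loc`, and the distance computations are hypothetical placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def localize_clip(video, vq_bank, classifier_features, stan_loc):
    """Hypothetical inference sketch.
    video:   (T, C, H, W) ultrasound sweep frames
    vq_bank: (N, C, H, W) candidate visual-query images of the target view
    """
    # 1. Embed every frame and every candidate query with a frozen classifier backbone.
    frame_emb = F.normalize(classifier_features(video), dim=-1)    # (T, D)
    query_emb = F.normalize(classifier_features(vq_bank), dim=-1)  # (N, D)

    # 2. Pick the frame most similar to the target view (here: closest to the mean
    #    query embedding), then pick the visual query closest to that frame.
    anchor = frame_emb @ query_emb.mean(dim=0)                     # (T,)
    best_frame = frame_emb[anchor.argmax()]                        # (D,)
    best_query = vq_bank[(query_emb @ best_frame).argmax()]        # (C, H, W)

    # 3. Run the localization model with the selected visual query and decode
    #    a single clip from per-frame start/end scores.
    start_end_logits = stan_loc(video.unsqueeze(0), best_query.unsqueeze(0))  # (1, T, 2)
    start = start_end_logits[0, :, 0].argmax().item()
    end = start_end_logits[0, :, 1].argmax().item()
    return start, end
```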

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The task addressed in this paper is relatively new in the medical image/video field. The proposed method is shown to be effective in the experiments. The designed loss function is impressive, and the authors' experiments demonstrate its ability to guide model learning. The Method section is described in enough detail to be readily understood.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The article is silent on the applicability of the proposed model/framework, so the link between the framework and practical problems is poorly established. The graphical description of the proposed framework/model is vague. The classifier is a key part of the framework, yet the authors only mention that it is a pre-trained model and do not specify which model is used, its pre-training dataset, or its performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Supplementary material is attached to the article to further describe the dataset the authors use.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the Abstract: citations are not allowed.
    In the Introduction: there is a lack of background and significance for this work, and specifically a lack of description of the practical problem to be solved.
    In Fig. 2: the colors in the legend are hard to distinguish, and the specifics of the Query Selection are missing from the Method. In addition, information about the Negative VQ should be presented in the STAN-LOC overall architecture.
    In the Method: in "Video Clip Retrieval Task Formulation", "Task Formulation" is inappropriate and could be replaced with a phrase such as "task description"; "Temporal Fusion Transformer" does not match the description in Fig. 1; "resulting video features" could be more specific. In "Positive View-Aware Contrastive Loss (L_PVAC), which aims to bring the visual query features and the ground-truth clip features together while pushing away frames belonging to other classes", "other classes" is not appropriate, because there appear to be only two classes (positive and negative) in your task.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the application prospects of the proposed framework and the corresponding practical needs are not well elucidated, the framework focuses on how to use the attention mechanism effectively and on the design of new loss functions, and their effectiveness has been demonstrated experimentally. If the article elaborates on the background and significance of the research, it is very likely that it can be accepted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces the task of Visual Query-based Video Clip Localization in medical video understanding and proposes STAN-LOC to retrieve a video clip that contains frames similar to a given exemplar frame. STAN-LOC consists of three main components: (a) a Query-Aware Spatio-Temporal Fusion Transformer that fuses the information available in the visual query with the input video; (b) a Multi-Anchor View-Aware Contrastive loss to reduce the influence of inherent noise in manual annotations; and (c) a query selection algorithm to select the best visual query for a given video. Results show improvements over other SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using a Visual Query-based method like STAN-LOC to locate the video clip of interest holds promising applications.
    • Experiments are well designed and conducted not only on private data but also on public data.
    • The supplementary material clearly and intuitively shows the effectiveness of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors did not conduct an evaluation of the model's efficiency and resource consumption, which is crucial in video analysis.
    • Fig. 2 is confusing, please carefully revise.
      • The input of the training stage is at the top-right of the figure; this arrangement does not adhere to typical reading conventions.
      • What is the purpose of the MLP in the Query Selection block? It is not mentioned in the article.
      • The colors of ‘Negative Visual Query’ in the figure and in the legend are different.
      • Legends of blocks and variables are mixed together which makes understanding difficult.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Revise Fig.2 (refer to Q6)
    • What does the classifier classify, or what’s the specific meaning of confidence?
    • The training dataset size is relatively small. How many trainable parameters does the model have? Is overfitting likely to occur?
    • How did you adjust the hyper-parameters of the loss function?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The article has a certain level of innovation, well-designed experiments and good organization.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their feedback.
1. Background and significance of the research (R1): When a pregnant woman undergoes an ultrasound scan, the sonographer checks for fetal anomalies by scanning through each fetal anatomy to find standard frames that contain all anatomical landmarks in the correct orientation and position. This process can take up to an hour. To streamline it, we introduce STAN-LOC: the sonographer can input an image of the desired view, and the model returns the corresponding video clip for analysis. This allows for faster scanning, enabling sonographers to focus on anomaly detection and consult more patients. Furthermore, STAN-LOC can segment the input video into view-specific clips, facilitating the use of view-specific anomaly detection models. For instance, in our heart sweep project, STAN-LOC can divide input videos into clips of various heart views, which are fed into a view-specific anomaly detection model to assess fetal heart health.
2. Classifier details (R1, R3): The classifier is a ConvNeXt-small CNN that was pretrained (and subsequently frozen) to classify ultrasound views for each dataset. The MLP is part of the classifier and follows the classifier's feature extractor. It is illustrated separately to emphasize the extracted features used for distance-based Query Selection.
3. "x" in "Ts(x)" (R4): x is the binary ground truth for the video, indicating whether a frame belongs to the ground-truth clip (1) or not (0).
4. Intuition behind the Query-Guided Spatial Transformer (R4): Existing methods concatenate the (text) query directly to the video features and perform self-attention, diminishing the query's influence because it is only a single token in the sequence. To counter this, we introduce the Query-Guided Spatial Transformer, where cross-attention ensures that each frame is enriched with the visual query information.
5. Averaged image during inference (R4): In both the training and inference phases, a real image (visual query) is fed to the model. During training the visual query is randomly sampled from the VQ database, whereas during inference it is selected using the VQ selection module.
6. Results in Table 1 (R4): The results presented in Table 1 pertain to the in-distribution visual query database.
7. Role of the ID and OOD VQ databases (R4): Our model aims to solve a real-world problem and assist sonographers in ultrasound scanning. In clinical settings, it is highly likely that the sonographer will enter a visual query captured using a different protocol, leading to a domain gap.
8. Novelty/application of cross-attention, self-attention and contrastive loss (R4): We respectfully disagree. These operations and learning schemes have been widely utilised in the published literature. The contribution of our work (and of others using these concepts) lies in their design and modification suited to the target application. We exploit (a) cross-attention to effectively fuse information between the video and the visual query, enriching the video features, and (b) self-attention to further model the temporal relationships in the resulting visual query-aware features. Additionally, our MVAC loss introduces the concept of a dual anchor: along with the positive visual query, a negative visual query is introduced during training to exploit the negative-query relationship and push the positive/negative samples further apart in feature space (see the sketch after this list).
9. Negative View-Aware Contrastive Loss (R4): In this loss, a negative visual query is an image from a different class than the visual query. Positive frames are those outside the ground-truth clip, while negative frames are part of the ground-truth clip. (a) Does the loss aim to close the distance between the features of frames whose class differs from the positive query and the features of the negative query? Yes. (b) Effect on performance when the class is different: no difference was observed.
10. Figure 2 (R1, R3): We will improve Figure 2 according to the reviewers' comments.
11. Writing (R1, R3, R4): We will incorporate the suggested changes to enhance the paper's clarity.
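As an illustration of the dual-anchor idea mentioned in point 8, the sketch below shows one possible multi-anchor contrastive loss with a positive and a negative visual query as anchors; the function name, temperature, cosine-similarity choice, and cross-entropy formulation are assumptions, not the paper's exact MVAC loss.

```python
import torch
import torch.nn.functional as F

def multi_anchor_contrastive_loss(frame_feats, pos_query, neg_query, in_clip, tau=0.1):
    """Hypothetical sketch of a dual-anchor, view-aware contrastive loss.
    frame_feats: (T, D) per-frame features
    pos_query:   (D,) feature of the positive visual query (target view)
    neg_query:   (D,) feature of a negative visual query (different view)
    in_clip:     (T,) boolean mask, True for frames inside the ground-truth clip
    Assumes the video contains both in-clip and out-of-clip frames.
    """
    frames = F.normalize(frame_feats, dim=-1)
    pos_q = F.normalize(pos_query, dim=-1)
    neg_q = F.normalize(neg_query, dim=-1)

    sim_pos = frames @ pos_q / tau   # similarity of every frame to the positive anchor
    sim_neg = frames @ neg_q / tau   # similarity of every frame to the negative anchor

    # Two-way logits per frame: column 0 = positive anchor, column 1 = negative anchor.
    logits = torch.stack([sim_pos, sim_neg], dim=-1)   # (T, 2)
    targets = (~in_clip).long()                        # 0 for clip frames, 1 otherwise

    # Positive anchor: pull ground-truth-clip frames towards the positive query.
    loss_pos = F.cross_entropy(logits[in_clip], targets[in_clip])
    # Negative anchor: pull the remaining frames towards the negative query instead.
    loss_neg = F.cross_entropy(logits[~in_clip], targets[~in_clip])
    return loss_pos + loss_neg
```

Under this sketch, frames inside the ground-truth clip are drawn towards the positive query and pushed away from the negative query, while frames outside the clip are drawn towards the negative query, mirroring the relationship described in points 8 and 9.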




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The issues highlighted by the reviewers must be carefully addressed in the revised version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The issues highlighted by the reviewers must be carefully addressed in the revised version.


