Abstract
Given an untrimmed medical instructional video and a textual question, the task of medical video answer localization is to locate the precise temporal span that visually answers the question. Existing methods primarily rely on supervised learning to tackle this problem, which requires massive annotated training data and shows limited flexibility in generalizing across different datasets, especially in the medical domain. With the remarkable advancements of large language models (LLMs) and their multimodal variants (MLLMs), we explore a Socratic approach that composes LLMs and MLLMs to achieve zero-shot video answer localization. Our method effectively takes advantage of the rich subtitles and visual descriptions in instructional videos to prompt LLMs. We also develop a subtitle refinement and early fusion strategy for better performance. Experiments on MedVidQA and COIN-Med show that our method significantly outperforms existing state-of-the-art (SOTA) zero-shot multimodal models, by 41.0% and 20.3% in mIoU, respectively. It even surpasses SOTA supervised methods, underscoring the strength of our approach.
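To make the pipeline sketched in the abstract concrete, below is a minimal, hypothetical illustration of its final step: refined, timestamped subtitles and visual captions are fused early into one symbolic prompt, and an LLM is asked for the answer span. The prompt wording, data format, and model choice here are assumptions for illustration, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def localize_answer(question, subtitles, captions, model="gpt-4o"):
    """subtitles / captions: lists of (start_sec, end_sec, text) tuples."""
    # Early fusion: interleave both modalities on a shared timeline.
    events = [("SUB", s, e, t) for s, e, t in subtitles] + \
             [("VIS", s, e, t) for s, e, t in captions]
    events.sort(key=lambda ev: ev[1])
    lines = [f"[{kind} {s:.0f}-{e:.0f}s] {text}" for kind, s, e, text in events]
    prompt = (
        "Below are timestamped subtitles (SUB) and visual descriptions (VIS) "
        "of an instructional video.\n" + "\n".join(lines) +
        f"\n\nQuestion: {question}\n"
        "Answer with the start and end seconds of the video segment that "
        "visually answers the question, formatted as 'start-end'."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```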
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5153_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{XiaJun_Unleashing_MICCAI2025,
author = { Xiao, Junbin and Li, Qingyun and Yang, Yusen and Qiu, Liang and Yao, Angela},
title = { { Unleashing the Power of LLMs for Medical Video Answer Localization } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a zero-shot approach for video-based QA. For this, they rely on automatically generated subtitles and captions, which are then processed by an LLM to determine the relevant frames of the original video.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- novel subtitle refinement strategy
- evaluation with benchmark dataset
- concise ablation study that clearly demonstrates the effectiveness of each component
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Confusing explanations/textual flow difficult to follow
- I feel that many details needed for reproducibility are lacking, e.g. how were the subtitles generated from the video? What parameter values were used for the different components, and what was optimized via the validation dataset?
- how significant are the differences in performance? How repeatable are the results, i.e. what is the fluctuation over multiple runs?
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
As mentioned above, I feel that some major information relevant for reproducibility is missing from the paper. Furthermore, I found the flow/organization of the paper lacking, e.g. the detailed description of how EMScore is used (mainly what its inputs exactly are) appears in a different section than the one in which it is introduced, and caption and subtitle are sometimes used as synonyms.
- Looking at Table 2, it seems a large improvement was switching from GPT-4o-mini to GPT-4o, why wasn’t this explored further, e.g. using the baseline but with GPT-4o?
- How robust are the presented results? I.e. if I run each experiment multiple times, is there a significant change in performance? Do the differences hold up?
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed most of my concerns, prompting me to vote for acceptance.
Review #2
- Please describe the contribution of the paper
This paper uses large language models (LLMs) for zero-shot answer localization in medical instructional videos. It introduces a subtitle alignment strategy and an early fusion approach to effectively integrate subtitles and visual information, significantly improving temporal localization accuracy. Extensive experiments on two benchmark datasets demonstrate that the proposed method substantially outperforms existing approaches, highlighting its scalability and practicality.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper presents a novel zero-shot approach for medical video answer localization, using LLMs and MLLMs in a Socratic framework. Its subtitle refinement and early fusion strategies effectively tackle subtitle-visual misalignment. Furthermore, the paper's focus on medical video answer localization addresses a critical need in healthcare for efficient knowledge retrieval from instructional videos. The zero-shot nature of the approach reduces dependency on costly expert annotations, making it highly relevant for clinical applications such as medical education and decision support.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There are some grammar and symbol errors in this paper; for example, “Scoratic approach” should be corrected to “Socratic approach” on page 2, and “broomsteak” should be “broomstick” in Fig. 1. The experimental evaluation is insufficient. The omission of comparisons with recent high-performing zero-shot methods (such as Video-LLaVA) reduces the comprehensiveness of the evaluation and may lead to an underestimation of the current state-of-the-art performance.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
There are some grammar errors in this paper; for example, “Scoratic approach” should be corrected to “Socratic approach” on page 2, and “broomsteak” should be “broomstick” in Fig. 1. All in all, this paper falls short of MICCAI’s standards due to limited novelty, insufficient methodological clarity, weak baseline comparisons, inadequate result analysis, grammar errors, limited clinical impact discussion, and poor literature contextualization. These deficiencies collectively justify rejection, as the paper does not demonstrate the level of innovation, rigor, or polish expected for MICCAI acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Although this paper has some grammar and punctuation issues, overall, the authors propose a well-designed framework specifically tailored for the challenging task of medical video answer localization. The approach is technically robust, and the paper includes extensive experimental validation.
Review #3
- Please describe the contribution of the paper
This paper presents a zero-shot framework for medical video answer localization using LLMs and MLLMs. It combines refined subtitles and visual captions via a Socratic prompting approach to locate answer-relevant video segments without supervised training. The authors introduce subtitle refinement and an early fusion strategy to align and integrate multimodal information. Experiments on MedVidQA and COIN-Med show strong performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- It is one of the first works to leverage LLMs and MLLMs for zero-shot temporal answer localization in medical instructional videos.
- The proposed approach combines subtitle-based querying and vision-language captioning to bridge the gap between video and question domains. The early fusion of refined subtitles and visual captions into symbolic prompts is new and technically sound.
- The proposed method outperforms both zero-shot and supervised baselines on the MedVidQA and COIN-Med datasets, achieving state-of-the-art performance.
- The authors conduct thorough ablations to validate the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The subtitle refinement is based on standard techniques, such as local-maximum EMScore and Dynamic Time Warping (DTW), as is the early fusion of multimodal inputs. These components, while effective, are adaptations of existing methods rather than novel algorithmic contributions.
Additionally, the use of MLLMs for visual captioning raises concerns about potential hallucinations in the generated descriptions. It is unclear how such inaccuracies might impact overall model performance.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors present a well-executed framework tailored to the challenging task of medical video answer localization. Overall, the proposed approach appears technically sound and extensive experiments are provided.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
Thanks for the valuable feedback. We address the key concerns and clarify our main contributions below.

Reviewer #1

Q1 - Clarity and Textual Flow: We apologize for the confusion and will improve the presentation. Subtitles refer to audio transcripts with timestamps attached, while captions refer to descriptions of visual appearance generated from the videos. We use both subtitles and captions for better performance. EMScore is a reference-free metric that computes the semantic similarity between two inputs: a subtitle sentence (text) and a video segment, both represented by CLIP embeddings. In our method, we use it to refine subtitle-video alignment by sliding a temporal window and scoring candidate segments. EMScore combines coarse-grained (sentence-clip) and fine-grained (word-frame) embedding matching, and the segment with the highest score is selected as the refined timestamp (a minimal code sketch of this procedure is given at the end of this feedback).

Q2 - Reproducibility Details: For subtitles, we use established tools following a standardized pipeline: the Whisper-large model for MedVidQA and the YouTube Transcript API for COIN-Med (Section 3.2). As for parameter values: the subtitle refinement window size is N = 5 seconds for MedVidQA (higher subtitle density); the CLIP similarity threshold for caption filtering is τ = 0.25 for both datasets; and the clip length for visual captioning is ~5 seconds for both datasets. We will include these details in the revision and open-source our code for better reproducibility.

Q3 - Robustness of Results: We agree, and wish to reaffirm that all reported results in the paper (mIoU and IoU metrics) are averaged over 3 independent runs. We analyzed the relative performance gains across different runs and observed that the improvements are consistently significant, with small fluctuations of around 0.55%.

Q4 - Effect of LLM Choice (GPT-4o vs. GPT-4o-mini): We acknowledge that a larger and more powerful LLM is key to better performance, but this does not compromise our major contribution of composing different LLMs and MLLMs for zero-shot medical video QA. In fact, our method with GPT-4o-mini already surpasses all baselines in the zero-shot setting by a large margin. We have validated the effectiveness of our core components with both GPT-4o-mini and GPT-4o, and the main performance gains come from our fusion and refinement strategies. Finally, we wish to clarify that the baseline TFVTG uses GPT-4 Turbo, an LLM generally stronger than GPT-4o-mini.

Reviewer #3

Q1 - Technical Novelty: Our contribution relative to EMScore and DTW is the finding that refining subtitles improves LLM reasoning about query locations. Additionally, we design a dynamic selection of the refinement strategy to cope with different datasets. Both are firsts in general-domain video sentence localization as well as in the medical domain. EMScore and DTW are only used to compute cross-modal similarity for better refinement.

Q2 - Hallucination in Visual Captions: Thanks for the question. We find hallucination to be limited, especially given the structured nature of medical instructional videos. Moreover, we mitigate hallucinations by filtering visual descriptions based on their CLIP similarity to the query (see the illustrative sketch below) and by structuring prompts to separate modalities, allowing the LLM to assess each modality's trustworthiness individually rather than conflating them.
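As a concrete illustration of the caption-filtering step just described, here is a minimal, hypothetical sketch. It assumes the threshold τ = 0.25 reported in this rebuttal and a public CLIP checkpoint; the rebuttal does not specify the actual implementation (including whether similarity is computed text-to-text or text-to-frame), so all names below are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, not confirmed by the paper
model = CLIPModel.from_pretrained(MODEL_ID)
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)

def filter_captions(question, captions, tau=0.25):
    """Keep captions whose CLIP text embedding has cosine similarity to the
    question above tau (0.25, the value reported in the rebuttal)."""
    with torch.no_grad():
        toks = tokenizer([question] + captions, padding=True,
                         truncation=True, return_tensors="pt")
        emb = model.get_text_features(**toks)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
    sims = (emb[1:] @ emb[0]).tolist()  # caption-to-question similarities
    return [c for c, s in zip(captions, sims) if s > tau]
```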
Reviewer #2

Q1 - Grammar and Symbol Issues: We sincerely apologize for the typographical and grammatical errors. We will carefully polish the writing for the camera-ready version.

Q2 - Insufficient Comparisons with Recent Models: We conducted experiments with Video-LLaVA and other advanced Video-LLMs, but found that their performance is poor (mIoU of roughly 9%-12% on MedVidQA), likely because they are designed for video captioning and question answering rather than answer localization. Another reason could be their limited visual input length (32-128 frames). We also tested Video-LLaVA as a caption generator (mIoU ~18%), but found that Qwen2-VL produced more accurate and domain-relevant captions.
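Finally, for concreteness, below is a minimal sketch of the sliding-window subtitle refinement described in the response to Reviewer #1 (Q1). The scoring function only mimics EMScore's combination of coarse- and fine-grained matching, the search covers the whole video rather than a neighborhood of the original timestamp, and L2-normalized CLIP embeddings are assumed to be precomputed; this is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def emscore_like(sent_emb, word_embs, frame_embs):
    """Combine coarse (sentence-to-clip) and fine (word-to-frame) matching,
    mimicking EMScore; embeddings are assumed L2-normalized CLIP features."""
    coarse = float(sent_emb @ frame_embs.mean(axis=0))          # sentence vs. pooled clip
    fine = float((word_embs @ frame_embs.T).max(axis=1).mean()) # best frame per word
    return 0.5 * (coarse + fine)

def refine_timestamp(sub, frame_embs, fps=1, window=5):
    """Slide a `window`-second window (N = 5 s per the rebuttal) over the
    video and return the best-scoring (start, end) in seconds for `sub`,
    a dict with keys 'start', 'end', 'sent_emb', 'word_embs'."""
    n = frame_embs.shape[0]
    best_score, best_span = -np.inf, (sub["start"], sub["end"])
    for s in range(0, max(1, n - window * fps + 1)):
        seg = frame_embs[s : s + window * fps]
        score = emscore_like(sub["sent_emb"], sub["word_embs"], seg)
        if score > best_score:
            best_score, best_span = score, (s / fps, (s + window * fps) / fps)
    return best_span
```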
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A