Abstract

Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pre-trained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts’ prior knowledge and automatic annotations for downstream vision-language tasks.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3273_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Yininnnnnng/DualPrompt-MedCap

Link to the Dataset(s)

RAD dataset: https://huggingface.co/datasets/flaviagiammarino/vqa-rad

SLAKE dataset: https://huggingface.co/datasets/BoKelvin/SLAKE

BibTex

@InProceedings{ZhaYin_DualPromptMedCap_MICCAI2025,
        author = { Zhao, Yining and Prasad, Mukesh and Braytee, Ali},
        title = { { DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {207 -- 217}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce DualPrompt-MedCap, a framework targeted at improving vision-language model (VLM) performance on medical image datasets. The proposed approach is a multi-step, multi-component pipeline for medical image captioning, leveraging pretrained vision-language models, a semi-supervised modality classification approach, and pretrained biomedical language embeddings. The authors also propose a novel evaluation framework in the absence of relevant ground truth.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work is well motivated: the authors point out significant challenges faced by current VLMs adapted for medical image captioning and present a method that addresses these challenges. The results demonstrate strong improvement over other methods, supporting the use of the proposed framework. The dual-prompt enhancement is also a novel and interesting contribution to the problem of medical image captioning, raising new research questions by circumventing the need to retrain a VLM for medical image captioning and instead augmenting it with extra information.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In several places, the manuscript lacks specificity, especially when it comes to describing the method and evaluation metrics. These need to be addressed:

    • What is the relationship between the BLIP3 visual encoder and the ResNet trained in a semi-supervised manner? Why not train the BLIP3 encoder without the ResNet? What features go into the ResNet? Since the text specifies that the ResNet is initialized with ImageNet weights, it expects an input conforming to the ImageNet image shape (2 spatial dimensions, 1 channel dimension); how does this line up with the BLIP output?
    • The multiscale feature extraction seems dubious. Why not increase the kernel sizes in tandem with the step size? Intuitively, that seems a better way to extract large- and small-scale information: through large and small kernels.
    • What are the modality-specific class weights for the semi-supervised portion? How are they decided?
    • How is the confidence threshold for pseudo-labels (0.95, which seems high) chosen? Do the authors target a specific rejection rate or some other metric?
    • Are the anatomy and texture attention parameters motivated by the literature? How are these decided on?
    • Why can BiomedCLIP not be used for this application, instead of just letting it rate caption-to-image similarity? Why not train a decoder on top of the BiomedCLIP encoder?
    • In Section 2.5, aside from the relevance score metric (Eq. 6), it is not sufficiently explained how any of these metrics are derived or evaluated. As currently presented, it is hard to assign any objective meaning to them.
    • Section 3.2: no explanation is given for how hyperparameters were chosen.
    • It would be helpful to give a quick overview of the datasets used for evaluation (e.g. SLAKE, RAD).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As it stands, the manuscript lacks enough detail to be accepted, especially regarding the metrics in Section 2.5 and the hyperparameter choices. As such, reproducing the work is not possible and the results are called into question (i.e., is there overfitting?). However, this is easily addressable and, provided that is done, the results and the novelty of the framework are enough to justify acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed many of the concerns from the review. Between the rebuttal and the other reviews, I’m content deferring to the majority opinion and voting accept.



Review #2

  • Please describe the contribution of the paper

    This paper introduces three significant methods to address the challenges in medical image caption generation and assessment. For caption generation, the authors extend the BLIP-3 model by incorporating two novel modules. First, a modality prompt module classifies the image modality (CT, MRI, or X-ray) using a semi-supervised learning approach to generate modality-specific prompts. Second, a clinical focus prompt module employs PubMedBERT to calculate cosine similarity, retrieve predefined prompts based on specific aspects of the medical image, and classify the query type to construct a clinically focused prompt. These prompts enhance BLIP-3’s ability to generate detailed and clinically relevant reports based on the input image and query. For caption quality assessment, the paper proposes a composed scoring framework that evaluates the generated captions based on BiomedCLIP’s image-text similarity, the inclusion of medical terminology, clinical correctness, and report structure. These contributions enable the generation of accurate, context-aware clinical reports while providing a robust framework for assessing caption quality.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed generation methods demonstrate promising results by leveraging prompt-based techniques to enhance vision-language models without requiring extensive fine-tuning of the models themselves. This approach significantly reduces the computational burden and allows for broader scalability. The framework has strong potential to be applied to other large vision-language models, including commercial systems like GPT-4o with image and curated prompt input. Another advantage is the multi-aspect evaluation framework, which is valuable for advancing medical report captioning. Current methods for evaluating medical text are still evolving, and this framework represents a step forward by incorporating multiple dimensions, such as BiomedCLIP’s image-text similarity and clinical correctness. Additionally, the use of medical terminology aligned with the UMLS standard strengthens the validity and reliability of the assessment process, making it more applicable to real-world clinical scenarios.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There are two primary weaknesses in the paper: missing key details about the proposed prompting strategies and issues with the experimental design.

    For the prompting strategies, the paper does not provide sufficient details about how the “predefined clinical concept embeddings” are constructed or how they relate to the prompt construction process beyond query type classification. This lack of clarity makes it difficult to evaluate the robustness of the approach. In the assessment sections, the metrics for “clinical correctness” and “report structure” are not well explained. For example, it is unclear how the quality scores (e.g., Figure 3) are generated, whether through manual annotation or automated methods (e.g., GPT). Furthermore, the paper does not specify how these metrics are calculated. Do all samples use the same accuracy and structure metrics (e.g., three for each), or do they vary? Is the evaluation based on exact word matching, fuzzy matching, or some other automatic method? Additionally, the qualitative examples do not provide visibility into how the prompts are integrated with the generation query.

    For the experimental design, the paper does not clarify whether the comparative baselines (e.g., BLIP-2, Tag2Text) were evaluated in zero-shot settings or whether they had prior exposure to the dataset. If the baselines are tested in zero-shot settings while the proposed method underwent additional training or prompting, the comparison is not very fair. In the classification task, the number of labeled and unlabeled examples used in the semi-supervised learning process is not specified. This information is critical for understanding the semi-supervised capabilities of the proposed method. Additionally, it is unclear whether BLIP-3 performs zero-shot image modality classification or whether FixMatch is implemented on the same ResNet50.

    Other issues include the lack of ablation studies to isolate the contributions of individual components (e.g., modality-aware prompts and question-guided prompts). There is also no comparison with commercial models, and no comparison with open-source large vision-language models under the same prompting/no-prompting conditions. Furthermore, the qualitative examples show that the raw input consists of a medical image and a very brief query (e.g., asking for the location of an abnormality). In such cases, it is reasonable for language models to generate brief answers unless explicitly instructed to provide detailed responses. Without fine-tuning or explicitly forcing the baselines to generate longer, more detailed responses, the comparison remains unfair, as the proposed method relies on complex prompting strategies.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    To improve the experiments, the authors should clarify whether the baseline models (e.g., BLIP-2, Tag2Text) were evaluated in zero-shot settings or trained on the same dataset to ensure fair comparisons. Including ablation studies to isolate the contributions of the modality-aware and question-guided prompts would help demonstrate the significance of each component. The semi-supervised classification task requires more detailed reporting, such as the number of labeled and unlabeled examples, and clarification on whether BLIP-3 performs zero-shot classification. Additionally, comparative experiments with other state-of-the-art models, such as commercial systems like GPT-4 with image inputs or open-source vision-language models, would strengthen the evaluation. Finally, fairness in prompting comparisons should be addressed by ensuring that baseline models are also prompted or fine-tuned to generate detailed responses, aligning them with the proposed method’s setup.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a novel prompting method that enables detailed medical image captioning without requiring extensive or sophisticated training processes. Additionally, it proposes a promising multi-aspect evaluation framework for assessing the quality of generated captions, which is a valuable contribution to the medical image captioning field. While there are several areas that require further clarification—such as the construction of predefined clinical concept embeddings and the evaluation metrics for clinical correctness and report structure—and the experimental design could be improved, the paper still offers meaningful insights and demonstrates potential for scalability. These contributions justify a weak accept, as the work provides a solid foundation for future research in this domain.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The key details about the proposed prompting strategies and the experimental design have been clarified.



Review #3

  • Please describe the contribution of the paper

    The paper proposes “DualPrompt-MedCap”, a dual-prompt framework for medical image captioning. The main idea is to enhance vision-language models with modality-aware and question-guided prompts, improving modality recognition and generating clinically relevant, question-aligned captions without relying on ground-truth annotations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Introduces a novel combination of modality-aware and question-guided prompts, enhancing context relevance and clinical specificity in captions. 2) Shows good improvement in terms of modality recognition. 3) Compares with other models such as BLIP-2/3. 4) Can have a very high clinical impact, as the generated descriptions align with diagnostic intent.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Minor comments (no major concerns): 1) The authors mention: “For MRI images, we apply stronger geometric transformations (i.e., ±15° rotation, 15% translation) to address their unique characteristics and mitigate misclassification with CT scans.” This line could use clarification. Is this augmentation strategy a primary factor contributing to the improvement in modality recognition? If so, it would be helpful to provide supporting evidence or an ablation. 2) The use of γ₁ = γ₂ = 1 may be a typographical error or could benefit from a brief justification if intentional. I believe it should be 0.5.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and the work could have a good impact.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have put considerable effort into addressing the reviewers’ comments. The questions from all reviewers have been satisfactorily answered.




Author Feedback

We consolidated common comments below. The acronyms mean R1: reviewer 1, Q7: comment 7.

Architecture & Integration: (R2-Q7, R1-Q7) Regarding ResNet-BLIP3 integration: We implemented a parallel pathway approach. ResNet50 with our Medical Modality Attention layers processes standard RGB inputs (224×224) for modality classification, while BLIP3 handles image understanding. These systems merge only at prompt construction, preserving BLIP3’s capabilities while providing explicit modality awareness.
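
A minimal Python/PyTorch sketch of this parallel pathway as we read it from the rebuttal (the Medical Modality Attention layers are omitted because their design is not given here, and the class ordering is an assumption):

    import torch
    from torchvision import models, transforms

    # Modality branch: ResNet50 initialised with ImageNet weights and a 3-way head (MRI / CT / X-ray).
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, 3)
    backbone.eval()

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    MODALITIES = ["MRI", "CT", "X-ray"]  # assumed label order

    def predict_modality(pil_image):
        x = preprocess(pil_image.convert("RGB")).unsqueeze(0)  # standard 3x224x224 RGB input
        with torch.no_grad():
            logits = backbone(x)
        return MODALITIES[int(logits.argmax(dim=1))]

    # BLIP-3 runs in parallel on the same image; the two branches only meet when the
    # predicted modality string is inserted into the prompt (see the template sketch below).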

Clinical Concept Embeddings: (R1-Q7) As in Section 2.2, we build embeddings by: (1) defining dictionaries for medical categories (anatomy:lung; pathology:tumor; location:left; comparison:larger); (2) encoding with PubMedBERT; (3) computing question-concept similarity to determine clinical focus, generating appropriate prompt text (e.g., “examine anatomical structures”).
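
A hedged sketch of this retrieval step (the PubMedBERT checkpoint name, the mean pooling, and the toy dictionaries/prompt phrasings below are our assumptions, not the authors’ exact configuration):

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    encoder = AutoModel.from_pretrained(MODEL).eval()

    CONCEPTS = {  # toy excerpts of the category dictionaries mentioned in the rebuttal
        "anatomy": ["lung", "heart", "liver"],
        "pathology": ["tumor", "lesion", "fracture"],
        "location": ["left", "right", "upper"],
        "comparison": ["larger", "smaller", "increased"],
    }
    PROMPTS = {  # hypothetical prompt text per clinical focus
        "anatomy": "examine anatomical structures",
        "pathology": "describe any pathological findings",
        "location": "specify the location of the finding",
        "comparison": "compare with surrounding or prior appearance",
    }

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        return F.normalize((hidden * mask).sum(1) / mask.sum(1), dim=-1)  # mean-pooled, unit-norm

    def clinical_focus(question):
        q = embed([question])
        best = max(CONCEPTS, key=lambda c: (embed(CONCEPTS[c]) @ q.T).max().item())
        return PROMPTS[best]

    print(clinical_focus("Where is the tumor located?"))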

Prompt Integration: (R1-Q7) As in Section 2.3: Our dual prompt combines into a template that inserts modality results and clinical focus into BLIP3’s instruction context. This feeds directly into BLIP3’s tokenizer as prefix text, without requiring model changes.
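
A minimal illustration of such a template (the exact wording is not given in the paper or rebuttal, so the string below is hypothetical; only the structure, modality plus clinical focus plus question as prefix text, follows the rebuttal):

    def build_dual_prompt(modality, clinical_focus, question):
        # Hypothetical wording; structure per the rebuttal: modality result + clinical focus + question.
        return (
            f"This is a {modality} image. Focus on the following clinical aspect: {clinical_focus}. "
            f"Question: {question} Provide a detailed, clinically structured description."
        )

    prefix = build_dual_prompt("MRI", "examine anatomical structures",
                               "Where is the abnormality located?")
    # `prefix` is fed to BLIP-3's tokenizer as instruction/prefix text; the LVLM weights are untouched.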

Evaluation Framework: (R1-Q7, R2-Q7) Regarding automatic evaluation: As outlined in Section 2.5, all metrics are calculated programmatically. “Clinical correctness” uses string matching with equal weights for fairness: findings (25%), anatomical-location (25%), measurements (25%), and comparisons (25%). “Report structure” evaluates basic structure, completeness, and logical flow (1/3 each) to avoid bias. We utilize ScispaCy’s UMLS entity linker to map text to medical concepts, calculating entity density (entities/words) and diversity (unique/total entities). Final score equally weights (γ₁=γ₂=0.5, not 1 as R3 noted) relevance (image-text and question-text similarities) and quality (medical quality, clinical accuracy, structure).
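
A sketch of how these programmatic scores could combine under the weights stated above (the keyword lists and boolean structure checks are placeholders; entity density and diversity from ScispaCy’s UMLS linker would feed the medical-quality term):

    def clinical_correctness(caption, checks):
        # `checks` maps each aspect to keywords expected in the caption (simple string matching,
        # per the rebuttal); the four aspects are weighted equally at 25% each.
        weights = {"findings": 0.25, "anatomical_location": 0.25, "measurements": 0.25, "comparisons": 0.25}
        hit = lambda words: float(any(w.lower() in caption.lower() for w in words))
        return sum(weights[k] * hit(words) for k, words in checks.items())

    def report_structure(has_basic_structure, is_complete, has_logical_flow):
        # Three structural criteria, weighted 1/3 each.
        return (has_basic_structure + is_complete + has_logical_flow) / 3.0

    def final_score(relevance, quality, gamma1=0.5, gamma2=0.5):
        # relevance: image-text and question-text similarities (e.g., BiomedCLIP);
        # quality: medical quality (UMLS entity density/diversity), clinical accuracy, structure.
        return gamma1 * relevance + gamma2 * quality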

Experimental Fairness: (R1-Q7, R1-Q10) Regarding experimental settings: All models were evaluated under identical zero-shot conditions with consistent parameters (beam size=5, top_p=0.9) and appropriate modality prompts. No models were fine-tuned on target datasets, ensuring a fair comparison. While space limitations prevented a dedicated ablation section, we demonstrate each component’s contribution throughout the paper: Table 1 validates our modality recognition module’s effectiveness compared to BLIP3, especially for MRIs. Table 2 showcases the question-guided component’s impact through improved question similarity and clinical accuracy metrics. Together, these results demonstrate each component’s value. We focused on open research models to ensure reproducibility and detailed analysis not possible with closed API systems.

Semi-supervised Learning: (R1-Q7) Regarding modality classification: As described in Section 2.1, our approach uses two systems. BLIP-3 performs zero-shot classification by analyzing generated captions. Separately, our classifier uses FixMatch with 202 labeled samples (9%: 75 MRI, 49 CT, 78 X-ray) and 2,042 unlabeled samples (91%), addressing limited modality labels in medical contexts.
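
For readers unfamiliar with FixMatch, a generic sketch of its pseudo-labelling step with the paper’s 0.95 confidence threshold (the model and augmentation policy are placeholders; only the threshold and label counts come from the paper/rebuttal):

    import torch
    import torch.nn.functional as F

    TAU = 0.95  # pseudo-label confidence threshold reported by the authors

    def fixmatch_unlabeled_loss(model, weak_batch, strong_batch):
        # Pseudo-label weakly augmented images, keep only predictions above TAU,
        # and train the model to reproduce them on the strongly augmented views.
        with torch.no_grad():
            probs = F.softmax(model(weak_batch), dim=1)
            conf, pseudo = probs.max(dim=1)
            mask = (conf >= TAU).float()
        loss = F.cross_entropy(model(strong_batch), pseudo, reduction="none")
        return (loss * mask).mean()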

Design Choices: (R2-Q7) Regarding hyperparameters: We apologize for not detailing parameter selection due to space limits. Learning rates (1e-5 backbone; 1e-4 attention) balance feature preservation and adaptation. Modality class weights [1.5, 1.0, 1.0] address MRI-CT confusion. For the questioned 0.95 threshold: tests with values from 0.8 to 0.99 showed that 0.95 best balanced reducing incorrect pseudo-labels against retaining sufficient unlabeled data.
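
Expressed as a hedged PyTorch sketch (class-weight order assumed to be [MRI, CT, X-ray]; the new classification head stands in for the attention layers, whose design is not detailed here):

    import torch
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = torch.nn.Linear(model.fc.in_features, 3)

    # Modality-specific class weights; 1.5 up-weights MRI to counter MRI-CT confusion.
    criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.5, 1.0, 1.0]))

    # Two learning rates: 1e-5 for the pretrained backbone, 1e-4 for the newly added parameters.
    head_params = list(model.fc.parameters())
    backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
    optimizer = torch.optim.Adam([
        {"params": backbone_params, "lr": 1e-5},
        {"params": head_params, "lr": 1e-4},
    ])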

Regarding BiomedCLIP (R2-Q7): We use it for evaluation as it excels at embedding but lacks generative capabilities without substantial training.

Regarding feature extraction: (R2-Q7) Dilated convolutions were chosen over larger kernels for efficiency while still capturing diverse receptive fields. MRI-specific augmentation (±15° rotation, 15% translation) improved recognition of MRIs, which are often confused with CT scans (R3-Q7).
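
A sketch of both ideas (the dilation rates below are illustrative, not taken from the paper; the augmentation values are those reported above):

    import torch
    from torchvision import transforms

    class MultiScaleBlock(torch.nn.Module):
        # Parallel 3x3 convolutions with increasing dilation instead of progressively larger
        # kernels: different receptive fields at roughly the same parameter cost.
        def __init__(self, channels, dilations=(1, 2, 4)):
            super().__init__()
            self.branches = torch.nn.ModuleList(
                torch.nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
            )

        def forward(self, x):
            return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)

    # MRI-specific augmentation from the rebuttal: up to ±15° rotation and 15% translation.
    mri_augment = transforms.RandomAffine(degrees=15, translate=(0.15, 0.15))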

Code will be released upon acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After rebuttal, all reviewers suggest to accept this paper, I also select ‘accept’.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


