Abstract
Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced subtexts through large language models, (2) a fine-grained alignment mechanism that connects subtexts with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different subtexts based on a clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from the internet, MAKE significantly outperforms state-of-the-art VLP models on seven datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code is available at https://github.com/SiyuanYan1/MAKE.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0976_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/SiyuanYan1/MAKE
Link to the Dataset(s)
Derm1M dataset: https://github.com/SiyuanYan1/Derm1M
BibTex
@InProceedings{YanSiy_MAKE_MICCAI2025,
author = { Yan, Siyuan and Li, Xieji and Hu, Ming and Jiang, Yiwen and Yu, Zhen and Ge, Zongyuan},
title = { { MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {370--380}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework tailored for zero-shot dermatological assessment. Addressing challenges such as text length limitations and the lack of structured clinical descriptions, MAKE leverages large language models to decompose dermatology reports into multiple knowledge-rich sub-texts, aligns them with relevant image regions, and adaptively weights them based on diagnostic importance. Trained on over 400,000 image-text pairs, the model achieves state-of-the-art performance across multiple dermatology benchmarks, including disease classification, concept annotation, and cross-modal retrieval.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The motivation behind the method is highly reasonable. Firstly, dermatological diseases lack structured clinical image-text pairs. Therefore, using LLMs for structured organization is essential.
- The post-processing of knowledge-enhanced features seems very necessary. My understanding is that it is akin to feature clustering, which makes the feature distribution less uniform, and this benefits feature representation.
- The paper includes numerous quantitative experiments, and the performance advantages are substantial, making the results very convincing.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Some important details are missing in the paper, such as what the prompt for the LLM is, and the specific definition of knowledge. Including these details, along with more comprehensive information, could significantly enhance the paper’s persuasiveness.
- The paper uses predefined knowledge, but the long-tail problem in dermatology is quite severe. If the model is left to learn the knowledge weights on its own, could this lead to significant bias? Perhaps more explanation and experiments could help clarify this point.
- Some visual results could potentially enhance the experimental section of the paper, such as examples of LLM usage or T-SNE plots, etc. Of course, I also understand that this might be due to space limitations.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the paper’s approach and reasoning are very clear, and its performance is convincing. However, some details in the paper left me with doubts. If the authors could clarify these details, I would be able to further improve my score.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces MAKE, a novel vision-language pretraining (VLP) framework specifically designed for dermatological assessment. The authors address limitations of conventional VLP models like CLIP by decomposing complex clinical descriptions into multiple knowledge-enhanced sub-texts using large language models. MAKE employs three key innovations: multi-aspect knowledge-image contrastive learning, fine-grained alignment between sub-captions and image patches, and a diagnosis-guided weighting scheme. Experiments on eight datasets demonstrate significant performance improvements over SOTA VLP models for zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The framework effectively addresses the text length constraint of conventional VLP models by decomposing complex dermatological descriptions into multiple knowledge-enhanced sub-texts that capture different aspects of clinical knowledge.
- The approach offers a more nuanced way to handle specialized medical knowledge by separately modeling disease terminology, clinical concepts, and raw descriptions.
- The diagnosis-guided weighting scheme adaptively prioritizes different aspects of knowledge based on their diagnostic relevance, better reflecting how dermatologists assign varying importance to clinical attributes.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the paper claims to be the first vision-language pretraining framework for dermatology, it doesn’t adequately distinguish itself from existing multimodal large language models (MLLMs) in dermatology such as those in [1,2]. The paper fails to clearly articulate the essential differences between vision-language models (VLMs) and MLLMs, raising questions about the novelty of their “first VLM framework” claim.
- The paper lacks justification for why pretraining on the large DermMM dataset is necessary instead of directly fine-tuning CLIP on each downstream dataset (e.g., F17K). A comparative analysis demonstrating the advantages of their pretraining approach over direct fine-tuning on target datasets would strengthen their methodology’s rationale and provide clearer evidence of its benefits.
- Despite requiring pretraining on over 400,000 image-text pairs, the paper doesn’t adequately compare MAKE against simpler unsupervised domain adaptation (UDA) approaches [3] that can achieve competitive results with significantly less data. For instance, UDA methods appear to outperform MAKE on F17K while using only thousands of images rather than hundreds of thousands. The paper should have addressed this efficiency gap and explained why their more data-intensive approach is justified.
- While the paper mentions using large language models to extract specialized knowledge aspects from original text, it lacks details about the specific prompting strategies or LLM configurations used for this critical knowledge extraction process.
[1] Yan, et al., “A Multimodal Vision Foundation Model for Clinical Dermatology” (not peer-reviewed but related).
[2] Zhou, et al., “Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4,” Nature Communications, 2024.
[3] Wang, et al., “Achieving reliable and fair skin lesion diagnosis via unsupervised domain adaptation,” CVPR Workshops, 2024.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
MAKE’s idea raises zero‑shot accuracy and retrieval across eight datasets, outscoring strong CLIP/SigLIP baselines without task‑specific tuning. The method is elegant and empirically convincing, so the paper is worth accepting, but only narrowly: it doesn’t distinguish itself from recent multimodal LLMs, gives no comparison to simpler CLIP fine‑tuning or UDA baselines, and omits key LLM‑prompt details. If these gaps are addressed in revision, the contribution will be clearer and more impactful.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
- The paper proposed the MAKE framework, which introduces a multi-dimensional clinical knowledge decomposition mechanism into dermatological vision-language pre-training for the first time.
- The paper proposes three key technologies: a multi-aspect knowledge-image contrastive learning strategy, a fine-grained alignment mechanism, and a diagnosis-guided weighting scheme. These innovations significantly enhance the learning of fine-grained knowledge in dermatological vision-text integration.
- Pretrained on 403,563 dermatological image-text pairs, MAKE achieves state-of-the-art performance across eight benchmarks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed methodology introduces a novel approach that effectively addresses the text input limitations inherent in conventional dermatological Vision-Language Pretraining (VLP) frameworks. By leveraging large language models to parse comprehensive clinical narratives into structured multi-aspect knowledge-enhanced sub-texts (e.g., disease attributes, morphological features), the method overcomes the 77-token text length constraint inherent in conventional VLP frameworks while preserving critical diagnostic information.
- The proposed method achieves significant improvements across all eight benchmarks, demonstrating a 5.09% higher average classification accuracy than the best-performing baseline, a 1.11% improvement in AUROC for concept annotation, a 6.57% enhancement in image-to-text retrieval performance, and a 6.27% increase in text-to-image retrieval accuracy compared to the strongest baseline.
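The 77-token constraint mentioned above is concrete enough to illustrate. The snippet below is a toy sketch, not the authors' pipeline: whitespace splitting stands in for CLIP's BPE tokenizer, and the aspect names and example subtexts are invented for illustration. It shows why one long clinical narrative overflows the encoder limit while LLM-produced aspect subtexts each fit.

```python
# Context length of a standard CLIP-style text encoder.
MAX_TOKENS = 77

def fits(text: str, limit: int = MAX_TOKENS) -> bool:
    # Whitespace split as a crude stand-in for CLIP's BPE tokenizer;
    # real token counts are usually higher than word counts.
    return len(text.split()) <= limit

# A long clinical narrative (180 words here) exceeds the limit outright.
narrative = " ".join(["lesion description word"] * 60)

# Hypothetical aspect-tagged subtexts, as an LLM decomposition might produce.
subtexts = {
    "disease": "basal cell carcinoma, a common keratinocyte cancer",
    "morphology": "pearly nodule with rolled border and telangiectasia",
    "raw": "pink shiny papule on sun-exposed skin of the nose",
}

assert not fits(narrative)
assert all(fits(t) for t in subtexts.values())
```

Each subtext can then be encoded independently and aligned with the image, which is the mechanism the strengths above describe.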
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper appears to be free from significant methodological limitations.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This work warrants acceptance due to its three-fold contributions: (1) The novel MAKE framework effectively addresses the text length limitation and unstructured data challenges in dermatological vision-language pre-training by leveraging large language models to perform multi-aspect clinical knowledge decomposition. (2) Comprehensive evaluations across eight benchmarks demonstrate significant advancements: 5.09% higher classification accuracy, 1.11% AUROC improvement in concept annotation, and 6.57%/6.27% gains in cross-modal retrieval over state-of-the-art baselines. (3) The methodology shows strong clinical relevance through zero-shot adaptability to diverse data sources (PubMed, social media, institutional datasets) and diagnostic interpretability features. Rigorous ablation studies and baseline retraining protocols ensure validity, while code availability promises reproducibility. Overall, this work extends a new perspective for knowledge-enhanced medical VLPs through its innovative architecture and clinically validated performance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank all three reviewers for their constructive feedback and positive scores (R1:5, R2:4, R3:4). We are encouraged by the recognition of our key contributions: (1) high motivation and technical novelty; (2) substantial performance gains; and (3) strong clinical relevance for zero‑shot dermatology assessment. Below, we address each concern in turn:
Q1: Lack of details of prompt and knowledge examples (R2 & R3). We will include in the final manuscript a concise table showing two representative structured prompts for Qwen2‑72B (with in‑context examples from forum and PubMed sources) and the corresponding outputs for different knowledge aspects such as “lesion morphology” and “hierarchical diagnosis.” Due to MICCAI page limits, the full set of prompts and examples will be released in our open‑source repository alongside the code to ensure reproducibility.
Q2: Long‑tail problem (R2). Our multi‑aspect knowledge extraction method groups hundreds of disease labels into hierarchies, guiding the model to learn shared patterns across related conditions. On the top‑10 least frequent classes, MAKE improves accuracy on F17K from 0.266 to 0.424 (+15.8 points) and on SD‑128 from 0.143 to 0.227 (+8.4 points) compared to CLIP pretrained on the same DermMM†. These results clearly demonstrate MAKE’s superior generalization to underrepresented skin conditions.
Q3: Distinction from multimodal LLMs (R3). Vision-language pretraining (VLP) models and multimodal large language models (MLLMs) represent entirely different technical paradigms. Our MAKE is the first CLIP-like contrastive VLP framework for dermatology, focusing on representation learning. In contrast, work [2] cited by R3 is a generative MLLM with a fundamentally different architecture and objective. Our claim of being the “first VLP framework” therefore remains valid and distinct from the MLLM approaches. Comparing these different technologies is outside the scope of our current work.
Q4: Justification for pretraining vs. direct CLIP fine-tuning (R3). Pretraining with our MAKE produces dermatology-specific embeddings that generalize across datasets and tasks, whereas fine-tuning CLIP on a single dataset risks overfitting. Empirically, in zero-shot tests on F17K rare classes, MAKE achieves 0.466 versus 0.447 for CLIP fine-tuned on F17K (+1.9 points). After identical fine-tuning, MAKE initialization yields 0.495 compared to CLIP’s 0.447 accuracy (+4.8 points). Furthermore, a single MAKE model excels at disease classification, concept annotation, and multimodal retrieval without per-task adjustments, underscoring the broad utility of large-scale dermatology pretraining.
Q5: Comparative analysis with unsupervised domain adaptation (UDA) (R3). UDA addresses downstream domain adaptation during fine-tuning, while our MAKE focuses on vision-language pretraining. These are fundamentally different techniques that solve complementary problems and are not directly comparable. They can, however, be combined: using MAKE as the backbone for UDA methods would likely outperform the backbone used in [3], since MAKE provides stronger dermatology-specific representations. A direct comparison nonetheless falls outside the scope of our current study on pretraining frameworks.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A