Abstract

The transparency of deep learning models is essential for clinical diagnostics. Concept Bottleneck Models provide a clear decision-making process for diagnosis by transforming the latent space of black-box models into human-understandable concepts. However, concept-based methods still face challenges in concept capture. They often rely on features encoded solely from the final layer, neglecting shallow and multiscale features, and lack effective guidance during concept encoding, hindering fine-grained concept extraction. To address these issues, we introduce Concept Prompting and Aggregating (CoPA), a novel framework designed to capture multilayer concepts under prompt guidance. The framework utilizes a Concept-aware Embedding Generator (CEG) to extract concept representations from each layer of the visual encoder. Simultaneously, these representations serve as prompts for Concept Prompt Tuning (CPT), steering the model towards amplifying critical concept-related visual cues. Visual representations from each layer are aggregated to align with textual concept representations. With the proposed method, valuable concept-wise information in the images is captured and utilized effectively, improving the performance of both concept and disease prediction. Extensive experimental results demonstrate that CoPA outperforms state-of-the-art methods on three public datasets. Code is available at https://github.com/yihengd/CoPA.
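For intuition, below is a minimal PyTorch-style sketch of the two ideas described above: per-layer concept extraction via cross-attention from learnable concept queries (as in CEG) and feeding those embeddings back as prompts to the next encoder layer (as in CPT). All module names, shapes, and the final mean aggregation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptEmbeddingGenerator(nn.Module):
    # Cross-attention from learnable concept queries (anchors) to the patch tokens of one layer.
    def __init__(self, num_concepts, dim):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_concepts, dim))   # learnable concept anchors
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens):                                   # (B, N_p, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        concept_emb, _ = self.attn(q, patch_tokens, patch_tokens)      # (B, num_concepts, dim)
        return concept_emb


def encode_with_concept_prompts(blocks, cegs, patch_tokens):
    # Run the ViT blocks; after each block, extract concept embeddings (CEG) and
    # prepend them as prompts to the next block's input (CPT). Outputs at the
    # prompt positions are discarded.
    per_layer_concepts, prompts = [], None
    num_patches = patch_tokens.size(1)
    for block, ceg in zip(blocks, cegs):
        x = patch_tokens if prompts is None else torch.cat([prompts, patch_tokens], dim=1)
        x = block(x)
        patch_tokens = x[:, -num_patches:, :]        # keep patch-token outputs only
        concept_emb = ceg(patch_tokens)
        per_layer_concepts.append(concept_emb)
        prompts = concept_emb                         # concept embeddings become next-layer prompts
    visual_concepts = torch.stack(per_layer_concepts).mean(dim=0)      # placeholder aggregation
    return F.normalize(visual_concepts, dim=-1)

In the actual framework, the simple mean at the end would be replaced by the paper's gated aggregation over layers, and the normalized outputs would be matched against textual concept embeddings.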

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1919_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/yihengd/CoPA

Link to the Dataset(s)

Derm7pt dataset: https://derm.cs.sfu.ca/Welcome.html
PH2 dataset: https://www.fc.up.pt/addi/ph2%20database.html
SkinCon dataset: https://skincon-dataset.github.io/

BibTex

@InProceedings{DonYih_CoPA_MICCAI2025,
        author = { Dong, Yiheng and Lin, Yi and Yang, Xin},
        title = { { CoPA: Hierarchical Concept Prompting and Aggregating Network for Explainable Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {67 -- 77}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes CoPA, a hierarchical concept prompting and aggregation framework designed to enhance the interpretability of medical image diagnosis through improved concept bottleneck modeling. The key technical contributions are twofold: (1) the Concept-aware Embedding Generator (CEG), which extracts concept-specific visual features from multiple layers of the vision transformer, and (2) Concept Prompt Tuning (CPT), which uses these embeddings as prompts in subsequent layers to focus attention on semantically relevant features. These components are integrated with a gated fusion mechanism to align concept-level representations with textual embeddings, enabling improved transparency in diagnosis. The framework is evaluated on three datasets—PH2, Derm7pt, and SkinCon—with competitive results and interpretability analysis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is a thoughtful and well-motivated extension of concept bottleneck models for clinical diagnosis.

    The experimental evaluation is thorough, including ablation studies that validate the contribution of each module, comparisons with strong baselines (e.g., MICA, Explicd), and interpretability analyses via intervention, faithfulness, and plausibility.

    Results show consistent improvements in diagnosis across datasets. The interpretability workflow, including concept heatmaps and gated fusion visualizations, offers useful insights into the model’s decision-making process.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The proposed Concept Prompt Tuning (CPT) is primarily adapted from the Visual Prompt Tuning (VPT) framework. While the idea of integrating concept-specific prompts is conceptually interesting, the actual performance gains in concept prediction are marginal—for example, only 0.4–0.6% improvement on PH2 and Derm7pt as shown in Table 3—raising questions about its practical utility. A more compelling justification would require comparison against learned prompt tokens without concept alignment to isolate the true benefit of CPT.

    The text-concept alignment is handled via a frozen BiomedCLIP model, which may not be optimized for capturing subtle variations in clinical descriptors. Given that many clinical concepts are fine-grained and overlapping (e.g., typical vs. atypical networks), it is unclear whether CLIP can adequately differentiate these concepts. No quantitative evaluation of the concept embedding space (e.g., similarity scores or t-SNE plots) is provided to verify alignment quality.

    Clarity regarding the structure of the concept set is lacking. It is not clearly explained how many concepts are associated with each image, how the candidate sets (e.g., “typical,” “atypical”) are constructed and annotated, and whether a single image can possess multiple concept labels across different hierarchies. These details are important for reproducibility and for assessing the complexity of the learning task.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a thoughtful and well-motivated approach to enhancing concept-based interpretability in medical image diagnosis through hierarchical prompting and multilayer feature aggregation. The proposed CoPA framework is technically sound, and its design—particularly the integration of Concept-aware Embedding Generator (CEG) and Concept Prompt Tuning (CPT)—is well aligned with recent advances in vision-language modeling. While the performance improvements in concept prediction are modest, the gains in disease classification and the thorough interpretability analysis support the practical relevance of the method. The experiments are comprehensive and well-executed, covering multiple datasets, strong baselines, ablation studies, and qualitative insights. Some aspects, such as the effectiveness of concept alignment and clarity on the structure of the concept sets, could benefit from further elaboration. However, these do not substantially detract from the overall contribution.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This manuscript proposes an explainable image-based architecture for interpretable disease diagnosis. The architecture is built around image concept recognition and produces an explainable diagnosis through aggregation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors introduce the concept-aware embedding generator, a mechanism for aggregating multi-layer concept-related features. Secondly, the concept prompt tuning strategy is another key strength, as it allows the model to retain the benefits of pre-trained features while minimizing performance degradation during fine-tuning. The authors substantiate their framework’s effectiveness through evaluations on three datasets, including the concept-rich SkinCon dataset. Lastly, the paper summarizes the model’s interpretability, an essential step toward clinical adoption.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The novel components, such as the concept-aware embedding generator, still rely on contemporary machine learning techniques such as cross-attention, concept alignment through contrastive learning, and aggregation.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The underscore in equation (3) is not necessary, as it is explained earlier that the concept embeddings are used to generate Z_l. In Figure 3, diagrams (a) and (b) are unspecified, and it is not clear what the acronyms stand for. Moreover, it is not clear what the right-hand diagram is portraying.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is a well-written paper with minor writing and figure errors. The novelty of the framework is somewhat limited, but the authors convincingly combine contemporary methods to rival the state of the art.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a “Concept Prompting and Aggregating Network” whose encoder contains a “Multilayer Visual Concept Encoder” (MVCE). Within the MVCE, they repeatedly concatenate the image embeddings with the outputs (concept embeddings) of the “Concept-aware Embedding Generator”. They call this concatenation “Concept Prompt Tuning” and claim that it helps the model focus on target visual concepts and reduces knowledge forgetting. Several experiments compare the method with SOTA approaches, and it outperforms them on quantitative results (for both disease prediction and concept prediction). An ablation study shows that using all three techniques (MLA, CPT, and FVB) is best. Finally, through the interpretability analysis, the authors show that the model indeed takes the concepts into account in its final output scores.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This method allows detailed, multi-level feature extraction and discrimination for concepts, whereas SOTA methods work only with high-level (last-layer) features. Comprehensive experiments show promising results.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No comparison of speed and number of parameters is provided.

    The process to obtain the input for gated aggregation (denoted D in the paper) is not clear

    The mechanism of Concept Prompt Tuning is not novel (the algorithm is the same as in Visual Prompt Tuning [https://doi.org/10.1007/978-3-031-19827-4_41], and the improvement is not clear).

    The concept of a learnable query in CEG is novel, but its contribution to the system is not shown in the ablation (perhaps it is covered by CPT?).

    Fig. 1: the text is too small to read.

    What are T, d_k, and k in Eqs. (1) and (2)? What are k_p and N_p in the definition of P_l?

    The improvement over MICA is not significant.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The provided link to the code does not work; it leads to an error page.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are novel elements in the paper, but the explanation is unclear. I will accept if the authors provide more details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely appreciate the constructive suggestions from the reviewers. Below we address their concerns.

Q1: Questions about the comparison against methods without concept alignment, and about whether CLIP can distinguish fine-grained and overlapping visual concepts (R1).
A1: We appreciate and agree with these points, and we will explore them in future work.

Q2: About the dataset settings (R1).
A2: The labels used in our study are based on official annotations. Due to space constraints, we did not elaborate on the labeling details in the main text. The label structures of the three datasets are generally consistent. For example, in PH2, each image is annotated with one disease label and five concept labels assessed by experts, and each concept has 2-3 possible categories (e.g., Pigment Network: Atypical/Typical; Dots/Globules: Absent/Atypical/Typical; Streaks: Absent/Present; Regression Areas: Absent/Present; Blue-Whitish Veil: Absent/Present).

Q3: The underscore in equation (3) (R2).
A3: The form of Equation (3) follows [1], where the underscore is necessary. It indicates that the concept-aware visual embeddings Z_l at layer l are computed using Eqs. (1) and (2) rather than Eq. (3), and that the output embeddings of phi_l between a_l and P_l are meaningless and do not participate in subsequent computations.

Q4: About Figures 1 and 3 (R2, R3).
A4: We thank the reviewers for pointing out the incompleteness of these figures, which will be improved in the final version.

Q5: About speed and parameters (R3).
A5: We fully acknowledge the necessity of an experiment on speed and parameters, as such analysis provides strong evidence of the cost-effectiveness of our proposed components. In our view, these comparisons should be presented within the ablation studies and the comparisons with prior work. We will incorporate this analysis in our future research.

Q6: Input for gated aggregation (R3).
A6: The input to the gated aggregation module is derived from the fusion of textual embeddings, weighted by the normalized alignment scores. We have revised the corresponding statements in the paper to ensure accuracy and clarity.

Q7: Concern about the novelty of Concept Prompt Tuning (R3).
A7: The design of Concept Prompt Tuning (CPT) is inspired by Visual Prompt Tuning (VPT), which we cite appropriately in our paper. The core innovation of our method lies in binding visual prompts with explicit semantic concepts rather than using random initialization. Each prompt embedding in CPT is associated with a well-defined and human-interpretable visual concept (e.g., “pigment network”). In contrast to VPT’s static prompts, CPT introduces learnable concept anchors (Sec. 2.2) to query multi-scale visual features, thereby generating task-specific prompts that align with human-defined medical concepts.

Q8: The meanings of the symbols in the equations (R3).
A8: Equation (1) represents the cross-attention mechanism. T denotes matrix transpose and d_k denotes the dimension of the key vectors. The variable k_p is the index of each patch embedding, and N_p is the total number of patch embeddings; the subscript ‘p’ signifies patch embedding. We will include explicit definitions of T, d_k, k_p, and N_p in the final version of the paper to ensure clarity and avoid confusion. Thank you for the helpful suggestion.

Q9: Code availability (R3).
A9: The code will be released once the paper is published.

[1] Jia, Menglin, et al. “Visual prompt tuning.” ECCV 2022
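To make A6 concrete, here is a minimal sketch (our reading of the description above, not the authors' released code) of fusing textual concept embeddings with normalized image-text alignment scores to form the input of the gated aggregation module; the function name, tensor shapes, and softmax normalization are assumptions.

import torch
import torch.nn.functional as F


def fuse_for_gated_aggregation(visual_concepts, text_embeds):
    # visual_concepts: (B, C, D) concept-wise visual embeddings from the encoder.
    # text_embeds:     (C, K, D) textual embeddings of the K candidate states of each concept
    #                  (e.g., Pigment Network: Atypical/Typical), from a frozen text encoder.
    v = F.normalize(visual_concepts, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    scores = torch.einsum("bcd,ckd->bck", v, t)       # image-text alignment scores
    weights = scores.softmax(dim=-1)                  # normalized per concept
    fused = torch.einsum("bck,ckd->bcd", weights, t)  # alignment-weighted fusion of textual embeddings
    return fused                                      # input to the gated aggregation module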




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


