Abstract

Concept bottleneck models (CBMs), which predict human-interpretable concepts (e.g., nucleus shapes in cell images) before predicting the final output (e.g., cell type), provide insights into the decision-making processes of the model. However, training CBMs solely in a data-driven manner can introduce undesirable biases, which may compromise prediction performance, especially when the trained models are evaluated on out-of-domain images (e.g., those acquired using different devices). To address this challenge, we propose integrating clinical knowledge to refine CBMs, better aligning them with clinicians’ decision-making processes. Specifically, we guide the model to prioritize the concepts that clinicians also prioritize. We validate our approach on two datasets of medical images: white blood cell and skin images. Empirical validation demonstrates that incorporating medical guidance enhances the model’s classification performance on unseen datasets with varying preparation methods, thereby increasing its real-world applicability.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1786_paper.pdf

SharedIt Link: https://rdcu.be/dY6iA

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72083-3_23

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1786_supp.pdf

Link to the Code Repository

https://github.com/PangWinnie0219/align_concept_cbm

Link to the Dataset(s)

https://data.mendeley.com/datasets/snkd93bnjr/1
https://raabindata.com/free-data/
https://www.nature.com/articles/s41598-023-29331-3
https://github.com/mattgroh/fitzpatrick17k
https://ddi-dataset.github.io/index.html#paper
https://rose1.ntu.edu.sg/dataset/WBCAtt/
https://skincon-dataset.github.io/index.html#dataset

BibTex

@InProceedings{Pan_Integrating_MICCAI2024,
        author = { Pang, Winnie and Ke, Xueyi and Tsutsui, Satoshi and Wen, Bihan},
        title = { { Integrating Clinical Knowledge into Concept Bottleneck Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {243--253}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper “Concept Alignment: Integrating Clinical Knowledge into Concept Bottleneck Models” presents a method to enhance Concept Bottleneck Models (CBMs) by incorporating clinical insights, improving their interpretability and accuracy in medical image analysis. It validates this approach with two medical image datasets, demonstrating improved performance, especially on out-of-domain data. The work is notable for its novel integration of fine-grained clinical knowledge, aligning model predictions more closely with clinician decision-making processes. This advancement enhances the practical applicability of CBMs in clinical settings, promoting greater model acceptance and trust.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The integration of clinical knowledge in the CBM setting is novel.
    2. Code is provided, so reproducibility is high.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Inadequate literature review: the authors should include these modern CBMs to make their literature review holistic. Related work should cover CBMs and their variants (in medical images).

    Medical imaging:

    1. Distilling BlackBox to Interpretable Models for Efficient Transfer Learning. MICCAI 2023
    2. Concept Bottleneck with Visual Concept Filtering for Explainable Medical Image Classification. MICCAI Workshops 2023

    Posthoc-concept bottleneck:

    1. Post-hoc Concept Bottleneck Models. ICLR 2023
    2. Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat. ICML 2023

    Using vision language:

    1. Label-Free Concept Bottleneck Models. ICLR 2023

    More CBMs:

    1. Probabilistic Concept Bottleneck Models. ICML 2023
    2. Concept Embedding Models. NeurIPS 2022

    Weaknesses:

    1. The alignment loss is similar to Shapley values. What is the difference, and why is there no experiment comparing them? This enumeration can be intractable for problems with a high number of concepts. For example, in chest X-rays, 50+ anatomical and observational concepts exist. How does this method generalize to those settings?

    2. What if the concept set is incomplete? How can the authors incorporate that clinical knowledge into the model?

    3. CBMs are trained end to end, so during the initial phase of training the probability change ΔY will not be reliable, as the model is still learning to map the concepts to the label. Hence the alignment loss L_align may confuse the model while it is still learning. The authors could either train in two stages or use post-hoc CBMs. If the authors use a post-hoc CBM, it can also enforce the clinical knowledge well, since the importance of each concept can be obtained directly via Shapley values. So one experiment is needed to evaluate the method in the post-hoc setting.

    4. For the out-of-domain problem, a comparison is needed with this paper (Distilling BlackBox to Interpretable Models for Efficient Transfer Learning). As far as I know, the concepts are domain-invariant, and the authors do not include spurious concepts like race, gender, etc. So the in-domain CBM should be able to predict the class label from the clinical concepts, especially if it uses a sparse mixture model like the paper I referred to. If the authors’ method still performs better even after using that, it will make the paper very strong.

    5. Inadequate baselines. There is a plethora of CBMs, listed in the literature review above, that perform much better than the vanilla CBM, so the authors should compare against a SOTA CBM, not the vanilla one. I can agree that the vision-language setting may not be applicable for skin and WBC images, as no domain-specific VLM exists for these domains, but the authors should compare against at least one post-hoc method. Also, in the CBM setting, people are interested in concept association, i.e., how the concepts combine to reach the final prediction. The authors did not perform any such experiment.

    6. No qualitative results for the in-domain and out-of-domain experiments. See any paper on CBMs for examples of such analyses.

    7. As the authors are building a generalized model, I would

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weakness

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the claims of the paper would benefit from further experiments, especially on a large-scale dataset like chest X-rays with a plethora of concepts.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a method to guide Concept Bottleneck Models (CBMs) to align better with clinicians’ perspectives during training. This domain knowledge is injected into CBMs using a perturbation-based method where prediction probabilities are constrained to drop for “important” concepts and vice versa. The method is evaluated on two different problems: one classifying WBCs (white blood cells) in pathology images, and the other identifying skin abnormalities from images. The results show improved performance on both problems, especially on OOD datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The general idea of guiding the training of CBMs towards known biological priors is a very useful one. Even though CBMs ensure that the model uses specified concepts to make predictions, it doesn’t provide a way to add constraints on how these contributions align with clinicians.
    2. The paper provides a simple perturbation-based approach to ensure there is a larger drop in probability if an “important” concept is dropped from the CBM and a smaller drop if the opposite is the case (a code sketch of this mechanism follows this list). This formulation provides an implicit way to supply priors.
    3. Even though the ID results are comparable to the baseline CBM, the OOD results show a significant improvement in performance across model types and datasets. This is a strong result.
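    To make the mechanism in point 2 concrete, here is a minimal sketch of such a perturbation-based alignment term. This is an illustration under assumptions, not the authors’ exact formulation: the names (`f`, `concepts`, `importance`) and the squared-error penalty against a clinician importance target are hypothetical choices.

    ```python
    # Hedged sketch of a perturbation-based alignment loss for a CBM.
    # Assumptions: f maps concept activations to class logits; clinician
    # importance is encoded per class as a (C, K) tensor, e.g. high = 1.0,
    # moderate = 0.5, low = 0.0. Not the authors' released code.
    import torch
    import torch.nn.functional as F

    def alignment_loss(f, concepts, labels, importance):
        """
        f:          class predictor, concept activations (B, K) -> logits (B, C)
        concepts:   (B, K) predicted concept activations
        labels:     (B,) ground-truth class indices
        importance: (C, K) clinician importance of each concept per class
        """
        p_base = F.softmax(f(concepts), dim=1)                    # (B, C)
        p_true = p_base.gather(1, labels[:, None]).squeeze(1)     # (B,)

        drops = []
        for k in range(concepts.shape[1]):                        # one pass per concept
            perturbed = concepts.clone()
            perturbed[:, k] = 0.0                                 # "remove" concept k
            p_k = F.softmax(f(perturbed), dim=1)
            drops.append(p_true - p_k.gather(1, labels[:, None]).squeeze(1))
        drops = torch.stack(drops, dim=1)                         # (B, K)

        # Large drop expected for important concepts, small drop otherwise.
        target = importance[labels]                               # (B, K)
        return ((drops - target) ** 2).mean()
    ```

    In words: removing an “important” concept should cause a large drop in the true-class probability, and removing an unimportant one should not; the loss penalizes deviations from that pattern.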
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. One limitation of this method is the additional requirement of obtaining concept-importance data for each class. More importantly, many problems might not fit into this framework, where we can exhaustively define all concepts and how they linearly relate to the classes.
    2. The paper could do a better job of motivating the perturbation-based method over alternatives, such as constraining the classifier weights to conform to the known concept-class importance, or something even simpler like enforcing these clinician priors in the concept-probability space. The classifier learns a non-linear function mapping each concept to each class, and the relative values (high, moderate, low) for this mapping are already known from the clinician data. What makes the perturbation approach better than directly intervening on the concept probabilities or on the concept-class mapping in the classifier weights?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Would love to hear the authors’ opinion on the two points from the weaknesses section, especially why the perturbation approach makes more sense than something more direct.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an interesting way to enforce prior clinician knowledge into CBMs. The formulation is simple, well-described, and shows good improvements on OOD samples across two datasets.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I am satisfied with the rebuttal and would like to maintain my rating of a weak accept. The paper proposes an interesting and useful way to incorporate physician feedback into CBMs. This use case might not be as generally applicable because it requires complex labels, but whenever possible, such methods can help improve the model’s alignment with clinicians.



Review #3

  • Please describe the contribution of the paper

    The paper demonstrates an interpretable method for incorporating clinical prior knowledge and enhancing concept bottleneck models. The method guides the model to prioritize the concepts that clinicians also prioritize.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Enhancing the interpretability of the model in the medical field is an interesting and important topic. Building on CBMs, the authors go one step further, aligning predefined concepts with clinicians’ priorities and incorporating more fine-grained knowledge.
    2. The method is well validated on two datasets and shows promising results.
    3. The adaptability and robustness of the model are improved.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Regarding interpretability, there is a lack of visual evidence to demonstrate whether the importance of concepts learned is consistent with clinicians.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The experimental settings are also detailed, which should ensure the reproducibility of the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    a. It would be better to introduce the evaluation metrics.
    b. More discussion is needed on the performance degradation when integrating clinical knowledge on in-domain datasets.
    c. Add one more figure to illustrate that the importance-score distribution learned by the model is well aligned with clinicians.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    a. Integrating clinical knowledge and enhancing its interpretability is an interesting topic. b. The paper is well organized and easy to understand.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The study of integrating clinical knowledge into the model is an interesting topic. The figure I was concerned about is easy to add to the final draft. So, I stick to my original judgment.




Author Feedback

We thank the reviewers for their constructive feedback and suggestions. We are pleased that they found our work very useful (R1), with a strong result (R1), on an interesting and important topic (R3), and considered the integration of clinical knowledge in a CBM setting novel (R4).

Why perturbation-based method (PBM)? (R1): We use PBM for the following reasons: 1. PBM is a widely adopted approach for explaining model predictions (arXiv:1311.2901, 1602.04938 with 15000+ citations). 2. It offers a flexible framework for handling complex, non-linear mappings between concepts and classes, and is applicable to any model architecture. Alternatives like weight constraints or priors in probability space often only work well with linear mappings, relying on predefined thresholds that are challenging to justify and require extensive tuning (DOI: 10.1016/j.patrec.2021.06.030). 3. The importance scores obtained from PBM not only guide the model to align with clinical priorities but also offer insights into each concept’s contribution to the final prediction, thereby enhancing interpretability. Alternative methods lack this capability.

Inadequate literature review (R4): We value the feedback and agree that citing modern CBM variants helps readers understand the broader context of our work. We wrote the related work primarily focusing on integrating clinical knowledge into models, which we consider the main novelty of our approach, rather than improving CBMs for better performance. In this regard, we note that the papers listed by R4 do not introduce clinical prior knowledge into the models. To the best of our knowledge, there is no existing work that incorporates clinical knowledge into CBMs.

Relevance to Shapley value & generalization to numerous concepts (R4): The Shapley value calculates the average marginal contribution of features across all possible combinations, which is computationally expensive. Our method can be viewed as a simplified version of this, measuring concept importance through individual concept removal and observing the resulting changes in probabilities. Our approach greatly reduces computational complexity, while remaining effective for our objective of maximizing or minimizing concept importance based on expert rankings via the alignment loss. In this way, our model can generalize well, even in settings with numerous concepts (>50), at a significantly lower computational cost than Shapley values. Our promising results on datasets with 11 (WBC) and 22 (skin) concepts demonstrate significant improvements in OOD performance, indicating the potential for our model to efficiently handle larger concept sets.
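To illustrate the computational gap described above, the following hedged sketch contrasts exact Shapley enumeration with the single-removal approximation. Here `value(S)` stands for any set function, e.g. the true-class probability when only the concepts in `S` are kept active; it is a placeholder assumption, not a function from the paper.

```python
# Sketch: exact Shapley values vs. the single-removal (leave-one-out)
# approximation. value(S) is a user-supplied set function (an assumption).
from itertools import combinations
from math import factorial

def shapley(value, K):
    """Exact Shapley value of each concept: needs O(2^K) calls to value()."""
    phi = [0.0] * K
    for k in range(K):
        others = [c for c in range(K) if c != k]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Standard Shapley weight |S|! (K - |S| - 1)! / K!
                w = factorial(len(S)) * factorial(K - len(S) - 1) / factorial(K)
                phi[k] += w * (value(set(S) | {k}) - value(set(S)))
    return phi

def single_removal(value, K):
    """Leave-one-out importance: needs only K + 1 calls to value()."""
    full = set(range(K))
    base = value(full)
    return [base - value(full - {k}) for k in range(K)]
```

For K concepts, `shapley` requires on the order of 2^K evaluations of `value`, while `single_removal` requires only K + 1, which is the scaling argument the rebuttal makes for settings with 50+ concepts.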

Comparison with SOTA CBMs (R4): While we would like to simply perform additional experiments with some SOTA CBMs, this year’s rebuttal rules prohibit it. However, we note that our alignment loss can be easily plugged into any CBM model, and we believe our method can benefit other CBMs, including Post-hoc CBM (PCBM). Our loss focuses on the class predictor (c->y), where we guide it to make predictions based on concepts prioritized by experts. PCBM differs from the vanilla CBM by utilizing concept activation vectors or multimodal learning to learn concept representations without annotations (x->c), yet it still employs a class predictor for final predictions based on these concepts (c->y), which can benefit from our method.
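As a hedged illustration of this “plug-in” claim: the sketch below shows how such a term could be added to a generic CBM objective. The names, the loss composition, and the weight `lambda_align` are assumptions, and `alignment_loss` refers to the sketch given under Review #2, not the authors’ released code.

```python
# Hedged sketch: adding an alignment term to a generic CBM training objective.
# x_to_c and c_to_y are the concept and class predictors; c_true is a float
# (B, K) tensor of concept annotations. All names are illustrative assumptions.
import torch.nn.functional as F

def cbm_training_loss(x_to_c, c_to_y, x, c_true, y_true,
                      importance, lambda_align=1.0):
    c_pred = x_to_c(x)                      # concept predictor (x -> c)
    logits = c_to_y(c_pred)                 # class predictor   (c -> y)
    loss = F.cross_entropy(logits, y_true)  # task loss
    loss = loss + F.binary_cross_entropy_with_logits(c_pred, c_true)
    loss = loss + lambda_align * alignment_loss(c_to_y, c_pred,
                                                y_true, importance)
    return loss
```

The same composition would apply to a post-hoc CBM, since only the class predictor (c->y) is involved in the alignment term.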

Comparison with Distilling BlackBox [Ghosh et al, MICCAI23] paper (R4): While we agree it is interesting to compare, we are afraid that the comparison may be unfair as the method needs fine-tuning on OOD datasets and still requires a small portion of class labels from OODs. Our approach improves the performance on OOD without further training or fine-tuning.

Additional figures (R3, R4): We will include a heatmap plot of importance scores with clinicians’ rankings and examples showing how our method corrects incorrect predictions. This is easy to do, and we do have some space left in the supplementary.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers agreed that the integration of clinical knowledge is valuable. However, one reviewer also pointed out the inadequate literature review and insufficient validation. Given these concerns, the current manuscript is not ready for presentation at MICCAI.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There are mixed review comments (4->4, 4->4, 2->no post rebuttal). The paper proposes an interesting and useful way to incorporate physician feedback into CBMs. The paper is missing important papers, and one reviewer has concerns with the baseline selection. Nevertheless, the method looks novel and clinically meaningful, and I think it is worth discussing in the MICCAI community.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewers and meta-reviewers agreed that this paper presents an interesting idea/method to incorporate clinical knowledge into deep learning, which is an important aspect of enabling trust in models. The concern about the insufficient literature review is valid, but I think it can be somewhat addressed in the final version; given the limited space, it is not realistic to include a prolonged literature review in a MICCAI paper. So I would suggest acceptance of this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



