Abstract

Deep Learning (DL) has emerged as a powerful tool in neuroimaging research. DL models predicting brain pathologies, psychological behaviors, and cognitive traits from neuroimaging data have the potential to discover the neurobiological basis of these phenotypes. However, these models can be biased by information related to age, sex, or spurious imaging artifacts encoded in the neuroimaging data. In this study, we introduce a lightweight and easy-to-use framework called ‘DeepRepViz’ designed to detect such potential confounders in DL model predictions and enhance the transparency of predictive DL models. DeepRepViz comprises two components - an online visualization tool (available at https://deep-rep-viz.vercel.app/) and a metric called the ‘Con-score’. The tool enables researchers to visualize the final latent representation of their DL model and qualitatively inspect it for biases. The Con-score, or the ‘concept encoding’ score, quantifies the extent to which potential confounders like sex or age are encoded in the final latent representation and influence the model predictions. We illustrate the rationale of the Con-score formulation using a simulation experiment. Next, we demonstrate the utility of the DeepRepViz framework by applying it to three typical neuroimaging-based prediction tasks (n=12000). These include (a) distinguishing chronic alcohol users from controls, (b) classifying sex, and (c) predicting the speed of completing a cognitive task known as ‘trail making’. In the DL model predicting chronic alcohol users, DeepRepViz uncovers a strong influence of sex on the predictions (Con-score=0.35). In the model predicting cognitive task performance, DeepRepViz reveals that age plays a major role (Con-score=0.3). Thus, the DeepRepViz framework enables neuroimaging researchers to systematically examine their models and identify potential biases, thereby improving the transparency of predictive DL models in neuroimaging studies.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2088_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/ritterlab/DeepRepViz

https://deep-rep-viz.vercel.app

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Ran_DeepRepViz_MICCAI2024,
        author = { Rane, Roshan Prakash and Kim, JiHoon and Umesha, Arjun and Stark, Didem and Schulz, Marc-André and Ritter, Kerstin},
        title = { { DeepRepViz: Identifying potential confounders in deep learning model predictions } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes the Con-score as a metric to qualitatively identify confounded predictions. The authors demonstrate that it can identify and quantify potential sources of confounding on the task of predicting chronic alcohol users from MRI.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Quantifying confounding bias is a critical step for safely employing machine learning models in the clinic.
    • I appreciate that a web-based interface has been provided.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The discussion of related work is inadequate.
    • A precise definition of which types of confounding bias can be detected is missing.
    • The experiments are not insightful.
    • No quantitative evaluation of the proposed approach.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No details about the network architecture or training setup have been provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the era of deep learning, confounding bias has a major role. I appreciate the authors raising awareness and trying to help practitioners by proposing a score to quantify to which extent the predictions of a model are biased. Unfortunately, the paper has major flaws that prevent me from recommending acceptance at MICCAI.

    First, the paper does not adequately discuss related metrics for quantifying bias. There is a rich literature on bias-quantification metrics in fairness in machine learning (e.g. https://ojs.aaai.org/index.php/AAAI/article/view/11564; https://arxiv.org/abs/2207.11385). It is unclear what the proposed Con-score provides that existing quantitative scores are lacking.

    Second, the assumptions the proposed Con-score relies on are not discussed. The paper seems to indicate that only the basic scenario applies, i.e. where the confounder is a direct parent of the MRI and the outcome. However, confounding can arise due to more complex relationships (see e.g. https://doi.org/10.1080/26939169.2023.2276446). Let’s consider M-bias: would the Con-score be able to detect it? Related to that: what if a variable is not a confounder but a mediator? Would the Con-score wrongly attribute confounding bias to it? In addition, a critical assumption for the Con-score to work is that all confounders are known and have been measured. For practical purposes, this is a very strong assumption. I believe a practitioner wants to know what the overall impact of confounding (observed or unobserved) on the prediction is. Is 1% of my prediction due to confounding, or 90%? Maybe the authors could find inspiration from the literature on shortcut learning that does not require knowing all confounders:

    • https://proceedings.neurips.cc/paper/2020/hash/eddc3427c5d77843c2253f1e799fe933-Abstract.html
    • https://proceedings.neurips.cc/paper/2021/hash/d360a502598a4b64b936683b44a5523a-Abstract.html
    • https://proceedings.neurips.cc/paper/2021/hash/0987b8b338d6c90bbedd8631bc499221-Abstract.html
    • https://proceedings.neurips.cc/paper_files/paper/2021/hash/64ff7983a47d331b13a81156e2f4d29d-Abstract.html
    • https://proceedings.mlr.press/v139/liu21f.html

    Third, the experiments provide little insight. The experiments should help to understand in which scenarios the Con-score works well and where it fails. Unfortunately, due to the lack of a quantitative evaluation, its strengths and weaknesses remain elusive. The experiments on synthetic data are too simple: they only consider a single variable to be checked and a 2D feature space. With respect to the experiments on real data, the paper misses a great opportunity to illustrate how the Con-score could be used to assess and compare bias-mitigation techniques (e.g. ref. 23), which could be of high impact.

    Minor issues:

    • The paper correctly mentions that what counts as confounding always depends on the research question. I would encourage the authors to use the same language in the introduction: instead of saying “age is a confounder”, it would be better to say that age is confounding the relationship between MRI and the diagnosis of Alzheimer’s disease.
    • What is the operator ⋅ in equation (1)?
    • How can the cosine distance be computed when the number of parameters differs between the last DL layer and the linear layer?
    • When making use of the UK Biobank data, please make sure you acknowledge their efforts by citing them.
    • “using a state-of-the-art DL architecture, 3D ResNet-50” (p. 4): please add a citation.
    • It would be interesting to understand whether the Con-score is also applicable to other machine learning models such as gradient boosting or random forests.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • No discussion of related work on quantitative scores.
    • No discussion on assumptions embedded into the framework.
    • No quantitative evaluation.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Reject — must be rejected due to major flaws (1)

  • [Post rebuttal] Please justify your decision

    I appreciate the authors’ dedication to improving their manuscript. Unfortunately, I have to judge the manuscript as submitted and cannot consider potential future updates, which would need to be substantial for me to recommend acceptance. The lack of a comparison to existing metrics to quantify bias and the fact that the experiments are too simple to generate valuable insights regarding the pros and cons of Con-score are severe issues.



Review #2

  • Please describe the contribution of the paper

    The article introduces a tool named DeepRepViz, designed to assess the impact of various confounders on a model. The framework consists of two parts: the first part visualizes the data’s latent representation, while the second part introduces a metric called the “Con-score.” This score combines the coefficient of multiple determination of a linear model predicting confounder k from the latent representation with the cosine similarity between the weights of that linear model and the weights of the final deep learning layer predicting the target variable. The Con-score is used to show the specific impact of a confounder on model predictions.
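
    For readers parsing this verbal description, a minimal sketch of what such a score could look like is given below. It is purely illustrative, based on the reviewer’s summary rather than the paper’s Equation (1): the variable names, the toy data, and the exact combination R² × |cos θ| are assumptions, not the authors’ implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def con_score(H, c, w_task):
    """Toy Con-score: R^2 of a linear probe predicting confounder c from the
    penultimate representation H, weighted by the absolute cosine similarity
    between the probe weights and the task-head weights w_task."""
    probe = LinearRegression().fit(H, c)          # linear model predicting the confounder
    r2 = probe.score(H, c)                        # coefficient of multiple determination
    w_conf = probe.coef_.ravel()
    cos = w_conf @ w_task / (np.linalg.norm(w_conf) * np.linalg.norm(w_task))
    return r2 * abs(cos)

# Toy data: 500 samples with a 64-dimensional penultimate representation.
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 64))
c = H @ rng.normal(size=64) + 0.1 * rng.normal(size=500)  # confounder partly encoded in H
w_task = rng.normal(size=64)                              # weights of the final (task) layer
print(con_score(H, c, w_task))
```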

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed tool is well-developed and can be accessed directly through a web interface.
    2. It potentially demonstrates the extent to which various confounders affect model decisions. After identifying confounders, other methods could be used to reduce their impact on model predictions.
    3. It helps people better understand how models make decisions. A high Con-score may indicate that the model’s predictions are influenced by the confounding factor c, or it might suggest that c is a “crucial feature” upon which the model’s predictions rely.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although qualitative experiments on synthetic datasets seem to show that the Con-score can reflect the impact of confounders on model predictions, there is no rigorous theoretical proof that the Con-score can accurately measure the impact of confounders on model predictions. Real-world scenarios are significantly more complex than the two-dimensional synthetic datasets presented in the article.
    2. While the title and main text emphasize Identifying Confounders, the proposed method does not guarantee the identification of confounders; a high Con-score might also indicate a crucial feature.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It would be beneficial to provide more rigorous and detailed evidence to substantiate the efficacy of the Con-score in measuring the impact of confounders on model predictions.
    2. The paper should clearly emphasize that a high Con-score might indicate that c acts as a potential confounder affecting model predictions, or it might reveal that c is a crucial feature for model predictions. The title and abstract could also highlight the focus on identifying “potential” confounders, underscoring the preliminary nature of the identification process.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The theoretical framework of the paper is not rigorous, and a high Con-score might also indicate a crucial feature, not just a confounder. Additionally, the experiments conducted are relatively simplistic, and the synthetic datasets used are very simple compared to real-world scenarios.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I think this paper has merits and a useful contribution to the community. However, the “weak” accept is because there are some overclaims on whether the visualized variables are actual confounders or only “could be” confounders.



Review #3

  • Please describe the contribution of the paper

    The authors introduce a quantitative metric (Con-score) that can identify variables in a study that risk confounding deep learning models, and a web-based visualization tool for inspecting learned representations in the penultimate layer of models to qualitatively examine confounders. The metric is based on the assumption that if the model is using a confounder pathway for prediction, the confounder will be linearly predictable from the feature space, and that this linear predictor will correspond with the prediction from the final layer of the deep learning model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Quality – The paper is very well-written, enjoyable to read, easy to follow, and with nice figures.
    • Contribution – Valuable contribution of an easy-to-use open source tool and a simple, intuitive, generalizable metric.
    • Experiments – The experiments nicely show the intuition behind the metric with the synthetic data, and show an applicable deep learning scenario with a diverse selection of prediction tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • None to note.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    • ResNet training hyperparameters and UK Biobank data splits are not shared. Although I don’t think this is necessarily crucial, since the authors only use this as a case example and they seemingly provide the features for this model on the online tool.
    • While relatively simple, it would still be nice for the authors to share an example notebook that implements the Con-score.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • Is there a typo in equation (1) where cos(theta) is supposed to be an absolute value? (If not, could the authors explain the single vertical bar?)
    • Check for typos in the discussion – “control control” is written a couple of times.
    • Future work/extensions could benefit from a fairness analysis to complement the Con-score. This would verify that a particular confounder is tangibly harmful/“unfair” when used in predictive modeling tasks (e.g., since the Con-score indicates that sex is confounding alcohol-user classification, is there a high FNR among female alcohol users? See the sketch after this list.)
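
    As a concrete illustration of the subgroup check suggested in the last point, the sketch below computes a per-group false-negative rate on toy data. The labels, the sex coding, and the deliberately biased toy classifier are hypothetical and not taken from the paper.

```python
import numpy as np

def fnr(y_true, y_pred):
    """False-negative rate: fraction of true positives predicted as negative."""
    pos = y_true == 1
    return float(np.mean(y_pred[pos] == 0)) if pos.any() else float("nan")

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)         # ground-truth alcohol-use labels (toy)
sex = rng.integers(0, 2, 500)            # hypothetical coding: 0 = male, 1 = female
y_pred = np.where(sex == 1, 0, y_true)   # toy classifier that misses all female cases

for code, name in [(0, "male"), (1, "female")]:
    mask = sex == code
    print(f"FNR ({name}): {fnr(y_true[mask], y_pred[mask]):.2f}")
```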

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A very well-written paper that provides a useful and valuable contribution to multiple sub-domains within MICCAI (fairness/generalizability, interpretability, population studies).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    To me, Con-score/DeepRepViz was clearly “marketed” as a data exploration tool, and it appears their revisions will clarify this even more as well as soften the language around definitively labeling confounders. I like the relative simplicity and potential utility of the authors’ work and will keep my score at a 5.




Author Feedback

In this study, we share a lightweight framework called DeepRepViz for the neuroimaging research community. The goal is to provide practitioners analyzing large (n>10000) neuroimaging datasets with a tool to debug, understand, and diagnose their predictive deep learning (DL) models. We offer the tool as a PyTorch wrapper that can be easily integrated with any ongoing PyTorch DL project to visualize the model’s latent (penultimate) representation and draw qualitative insights, or to compute Con-scores (an illustrative sketch of such penultimate-representation logging is given after the list below). Large neuroimaging datasets such as UK Biobank, ENIGMA, and NAKO contain a battery of psychometric, sociodemographic, and data-acquisition information about their subjects. With DeepRepViz, we offer a way to use all of this available information to systematically probe and visualize the DL model’s latent space. Therefore, for the research field using predictive DL models with large neuroimaging datasets, we provide a simple and easy-to-use add-on tool to train DL models transparently (please refer to Fig 1, where we demonstrate it with a clinically relevant application).

We greatly appreciate the reviewers for their constructive and valuable feedback. We would like to acknowledge the key criticisms individually below and provide justifications where possible:

  • Reviewer 3 points out that the discussion of related metrics is missing. Thank you for identifying this. In the revised manuscript we would like to add a paragraph before the methods, contrasting the Con-score with other existing bias-detection metrics from the fairness and causal literature. We are developing DeepRepViz with the goal of making it a generic tool containing a collection of complementary metrics from the fairness literature (https://aif360.res.ibm.com/) such as the Statistical Parity Difference and FNR. Indeed, DeepRepViz already contains the Silhouette Coefficient to detect clusters for categorical variables and distance correlation for continuous variables (please click on ‘other metrics’ at https://deep-rep-viz.vercel.app/). In the revised manuscript, we shall mention this under future work as kindly suggested by Reviewer 5.
  • Reviewer 3 also points out that a discussion regarding which types of confounding bias can be detected by our method is missing. As also mentioned by Reviewer 6, a high Con-score for a variable ‘m’ indicates that ‘m’ is either a crucial feature or a confounder in the analysis. It does not differentiate between a mediator and a confounder. Thus, a high Con-score is a necessary condition for ‘m’ to be a confounder but not a sufficient condition. Nonetheless, it indicates the presence of a third variable that can be used by the practitioner, along with knowledge of the structural causal graph of their application, to either control for ‘m’ or interpret it as a crucial feature. Nevertheless, we acknowledge that this clarification is missing in our text. We will revise the text, including the abstract and title, and clarify that our framework can be used to identify only “potential” confounders, as mentioned by Reviewer 6.
  • Additionally, we would like to acknowledge the limitations of the experiments performed on the synthetic data as well as the MRI data. The goal of the synthetic experiment was to only demonstrate the rationale behind the two components of the Con-score. In our future work, we shall validate DeepRepViz on more complex synthetic data with diverse confounding scenarios.
  • To improve the reproducibility, we have added a link to the exact model architecture. We will also make the code repository containing the training process available along with a tutorial notebook as requested by Reviewer 5.
  • We would also like to acknowledge that all the other minor points, such as the missing citations and the typographical error in equation (1), have been considered and will be addressed in the revision. We believe the planned revisions will improve the paper significantly. We thank all three reviewers again for their time.
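
As a concrete illustration of the PyTorch wrapper idea mentioned at the start of this feedback, the sketch below shows the generic mechanism (a standard forward hook) one might use to log a model’s penultimate-layer representation for visualization or Con-score probing. It is a hypothetical example with assumed layer sizes, not the actual DeepRepViz API.

```python
import torch

# Hypothetical sketch only: the real DeepRepViz wrapper API may differ.
activations = []

def save_penultimate(module, inputs, output):
    # Store the penultimate-layer activations produced during the forward pass.
    activations.append(output.detach().cpu())

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32), torch.nn.ReLU(),   # penultimate layer
    torch.nn.Linear(32, 1),                     # final (task) layer
)
handle = model[3].register_forward_hook(save_penultimate)

with torch.no_grad():
    _ = model(torch.randn(16, 128))

H = torch.cat(activations)   # (16, 32) latent representation for plotting or probing
handle.remove()
```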




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Looking across all reviews and rebuttal details, I’d like to champion the paper. I understand that the method needs a more thorough evaluation and more in-depth analysis on the relationship to existing bias metrics, but I do think this paper proposes an interesting idea, reflects original thinking, and has great potential in actually being used in routine analysis. I personally would love to have a tool/metric like this to quantify confounding effects in trained models.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers identified a couple of issues with the paper. The most severe one is the theoretical foundation of the Con-Score. In causality, the theoretical underpinning and detailed listing of assumptions is crucial. In the rebuttal, the authors acknowledged that the con-score could either indicate a confounder or a “crucial feature”. R3 further criticized that a high con-score can also result from other types of biases, mediators, and M-bias. So, while the name con-score may indicate a score of confounding, it rather seems to reflect a vague concept of association. Together with insufficient discussion of related work and limited experiments, I recommend reject.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes an interesting original idea. However, the paper has several flaws. First, the experimental setting is quite weak; there is no quantitative evaluation or comparison with other approaches. Importantly, the proposed approach does not necessarily identify confounders as advertised, but instead features that are associated with the predicted label. Lastly, while the idea is interesting, no insight is provided regarding assumptions, informing the user about where this might fail and how.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #4

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper provides an exploratory tool to aid the discovery of shortcut learning in predictive algorithms. This is an important problem, and I would like to champion the paper in spite of the weaknesses brought up by reviewer 3.

    Shortcut learning is a risk for any predictive model, which can be extremely harmful for generalization. When demographic variables act as shortcut, this can help create demographic biases in models, with harm to healthcare equity. At the same time, shortcut learning can be incredibly difficult to foresee and detect before the performance drop is actually observed. Hence, an easily accessible exploratory tool could provide incredible value to the MICCAI community – as could the discussions following the paper, which would no doubt also lead to an increased awareness of these problems.

    That being said, Reviewer 3 brings up important points regarding the statements made in the paper, and I completely agree with the reviewer that the assumptions made need to be clarified, and the potential weaknesses need to be discussed, in the final paper. This is crucial to make sure the model is not mis-interpreted.

    But having made those assumptions clear, this paper would definitely be valuable to MICCAI, especially helping boost the health equity angle, as shortcut learning is one of the important mechanisms that can lead to algorithmic bias.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



