Abstract
Attributing model predictions to a set of variables is a crucial part of methods in medicine and the sciences. However, in multimodal settings, ablating the contribution of a particular part of an image is often challenging. We present the STRAP framework (separable tissue representations for attributable risk prediction) using a novel masked autoencoder (MAE) that enables learning representations of a varying number of image patch tokens, enhancing memory efficiency and flexibility. We apply this framework to a fracture risk prediction task using clinical features and high-resolution peripheral quantitative computed tomography (HR-pQCT) images, to investigate the contribution of bone vs. muscle and fat tissues. Unlike previous work, we are able to selectively include specific tissues in risk prediction, and attribute their contribution to the risk using ablation and state-of-the-art interpretability methods. For the first time, we demonstrate that including soft tissue from HR-pQCT increases prediction performance both in terms of C-index and overall AUC. Source code is openly published online: https://github.com/waahlstrand/strap.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2899_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/waahlstrand/strap
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WåhVic_Separable_MICCAI2025,
author = { Wåhlstrand, Victor and Alvén, Jennifer and Johansson, Lisa and Axelsson, Kristian and Lorentzon, Mattias and Häggström, Ida},
title = { { Separable tissue representations for attributable risk prediction } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {573--583}
}
Reviews
Review #1
- Please describe the contribution of the paper
The study presents a vision transformer modification in which only image patches covering predefined ROIs in the image are included in model training. The model accepts both clinical variables and an image as input. The authors use the masked autoencoder approach to derive the relative attribution of each input variable to the final predictions. The model is evaluated on a fracture risk prediction task. It is trained with pQCT images and fracture-related information from 3028 women, where bone, fat and muscle ROIs are segmented from pQCT before feeding the image and ROIs to the network. The dataset was first split into a training set and a final test set, and the training set was then split into train, validation and test sets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The relative significance of input parameters would be very helpful in the clinic: in the best case, it could help a clinician judge which parameters are the most significant for the fracture risk of a specific patient and where to focus the intervention.
I also appreciate work on predictive models with deep learning, since that field is less addressed than, e.g., lesion detection approaches.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors report that the dataset comprises over 3000 female patients, but they do not report the age group or other main osteoporosis-related parameters such as average BMI and height. Most importantly, they do not report the number of fractures in the dataset or how long the fractures were followed. Often, if fractures are followed for 5 or 10 years, the proportion of patients with fractures may be around 2-5%, depending on whether only hip fractures or all osteoporotic fractures are counted. Therefore, the number of cases in the set may be quite low, and the statistical power behind the reported results may be only moderate.
It is quite difficult to interpret the results in Fig 3. The authors show a high relative attribution of fat and muscle, measured from radius and tibia, to fracture risk. The attribution is much higher than, for example, that of bone. The result is surprising, and it is difficult to interpret the clinical mechanism behind such a finding. According to other literature, bone plays a larger role in fracture risk than the soft tissues. The authors do not discuss their findings in much detail, nor whether they are realistic given known human physiology and anatomy. Therefore, it remains somewhat unclear how reliable the results of the attribution analysis in its current form are.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Introduction: “…but the impact of other major tissues, such as the distribution of muscle and fat tissues, is poorly explored, and being able to explicitly investigate the contribution of these tissues to prediction is of great interest.”
The estimation of the amount of soft tissue, e.g. from DXA, has been examined, e.g. 10.1016/j.jocd.2013.01.007. The protective role of the soft tissue for fractures has been examined through BMI and weight, and the protective role, or padding, of the soft tissue has also been evaluated in relation to the impact force on the femur, e.g., 10.1002/jor.1100130621, and in relation to the fracture strength, e.g., 10.1359/jbmr.070309. Therefore, I would not say the impact of other major tissues is poorly explored.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
With the information currently given, and in particular without the number of fractures in the dataset, it is difficult to evaluate how realistic the reported results are.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
In survival analysis it is very important to know the number of negative and positive cases and the patient demographics in order to evaluate the performance and results of the study. Unfortunately, this information was not available to the reviewer, since it was given neither in the original version nor in the rebuttal. This made it difficult to evaluate how well the results extracted by the method reflect reality.
Review #2
- Please describe the contribution of the paper
This paper creates a method which separates tissue representations for the evaluation of model attributions in risk prediction. The main contributions of this paper include three parts. The first is an adapted vision transformer which accepts regions of interest, which increases memory efficiency. Another is a multimodal framework with interpretable separation of tissues, to understand the effect of different biological features in a prediction from an MAE. The third claim is that they improve risk prediction over a baseline method for osteoporosis risk prediction using only soft tissue representations.
The model is evaluated on osteoporosis risk prediction as the clinical task. Interpretability is illustrated and compared using Grad-CAM and IG. Aggregation of tissue attributions is used to compare methods and the relationship between tissue representation and clinical features. An ablation study of different tissues allows for an understanding of the effect on the output predicted risk factors and in turn the information the model captures. They also report some model efficiency metrics. Overall, the paper shows improved performance when compared to prior work and other explored methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strength of this paper is the model’s unique ability to separate the tissues. I believe the authors have a unique idea for approaching multimodal fusion. They treat the different tissues within a model as variables, which most papers do not. This information is then separated and can be used in a downstream evaluation to better understand how the inputs from those tissues affect the prediction or decision. This is critical in clinical decision making and interpretability. Overall, the impact of this work would be large, as interpretability and understanding of decisions are important to clinicians. I think there is some clarity in the presentation of these methods on this aspect, but the overall idea of information separation is unique and compelling. The ablation study also helps to illustrate and support the effect of each tissue.
The authors leverage the existing MAE model which is a great start for their work to reduce computational load. The comparison of different architectures helps to explore how various aspects of the model compare to one another and compare to previous work.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
In the intro, the second sentence is way too long; please split it up for clarity. The overall motivation for osteoporosis makes sense, but what is the clinical baseline? You say there are these methods and known variables, but then in the last sentence you say the impact of other major tissues is poorly explored. Do you mean poorly explored by automated methods, or in the general understanding of the disease? Don’t we know BMI has an effect, as you previously stated? So what do you mean here specifically? Please elaborate or clarify.
The authors say ‘‘survival analysis estimating fracture risk plus other variables.’’ Don’t you combine variables to make decisions? Or do clinicians? What is the current clinical standard for this application in the decision-making process? Your third claim is only using soft tissue representations; why is this needed? It’s unclear why you need a soft-tissue-only prediction model.
Some of the details of prior work are unclear. Ex. the Jaiswal sentence says ‘‘predictive power’’, but of what? Also fracture classification? Also, the last sentence says that methods are naturally interpretable but are limited. Finite element features are sort of learnt if you mean that ML values are learnt in optimization. Make clear what FE features they are extracting for this prediction, and that they are just matrix calculations? Overall, the last bit of the intro section has some interesting information on the different methods mentioned. However, a lot needs to be elaborated. Ex. why do integrated gradients allow for joint attributions over Grad-CAM/occlusion? If it’s a joint network, couldn’t Grad-CAM still learn the output if the tabular features are within the network? also vit scales
Preliminary risk prediction (sec 2.1): check the brackets in sentence 1. It’s also unclear whether this is background. What method are you using? You just state the conventional vs. new methods for risk prediction. This may also make more sense at the end of the methods section, after you have introduced how you get the input variables for the loss equation. How do you optimize for the number of patches? Are they currently all even between tissue types? If you have time, a visual would be good for your STRAP variants, or an expansion of your current figure. You mention in self-supervised with probing that both reconstruction loss and clinical risk factor will influence by clinical risk factors in this method. It should be made clearer why this is different from your loss function: when you combine these outputs in the Cox loss, won’t you be doing the same? Fig 1 is confusing: after reading about the different variations, I don’t think STRAP uses L recon as feedback; wasn’t this the difference for MAE-STRAP-P? Please clarify in the image or in the text.
You say ‘older woman’; how old are they? What is the age range? Of the 90% used for training and validation, what was the split within? Why does the patch probability change between groups? MAEs usually use more like 75%, so why do you use 0.8-0.85? Why were epochs changed between methods? Fig 2 needs a better description. It is unclear why you are showing a, b and c. Why not show integrated gradients for all? You say that it’s a baseline, but why not show both then? You say it shows the variability, but it’s not clear to me what that is. I think you should reference the figure after this is described. Once I kept reading it made a bit more sense, but I had to look back and forth many times. There is some explanation in the text, but it is not clear to someone who does not know what the ‘‘undesired attribution on leg arm support’’ is. Maybe add arrows to the figure to highlight what you are talking about.
You are comparing the Grad-CAM results to the IG results between methods to make a point, which is a bit unfair. You should show both for all methods if you want to compare. I am also unclear at this point whether you are trying to say that MAE-STRAP is better as it is more continuous, or that STRAP is better. Please clarify this result. Fig 3 b): why are you only showing MAE-STRAP-P? You say ‘‘at representation level we aggregate the tissue representations’’, but from the patches, or from the final output predictions? It should be clarified at which stage you take the representation. Is it a sum of representations from multiple output samples?
There is an ‘‘80%” reduction in complexity? When comparing what to what? And define complexity. Overall there are limited reported values other than that and patch reduction, which are not clearly compared. You can’t really claim from this that there is a reduction in computational cost when you have not clearly outlined the comparison or done it on the multiple methods you have mentioned.
A conclusion is that the best model includes all tissues. This is logical as you are including more information. There is no explanation of if the difference in adding bone+/-muscle+/-BMI is added at the same amount of information. Ex. Explain how your network is different by doing these things the same. either you have added for example 12 pixels of bone or 6 bone 6 bmi OR did you increase the amount of input data when increasing? Adding more information/adapting the network would create more chances to learn. This needs clarification.
Overall, the contributions stated by the authors do not feel clearly met and require clarification. Your main claim is that you have a novel transformer. This is supported; however, I would make it clearer how the ROIs are added in comparison to the MAE in general. How are the ROIs selected differently than in an MAE? It seems you just take an even distribution (randomly) from each tissue. Is that the novelty over MAE, which just samples randomly and does not consider any difference in pixel? If so, it would be critical to compare against just randomly selecting in each iteration. In theory this would better reflect the amount of each tissue in the image. Please clarify this point. The multimodal framework is stated to be interpretable. I think this needs clear evidence or explanation. There is no real support other than that you tried some visualization/interpretability. How is this verified? Your claims state “improved risk prediction from baseline using only soft tissue”, however your methods state that you found the best results when comparing all tissues. Please clarify.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
In general, I think the ideas and motivation are clearly laid out. I think overall the paper makes a few general claims which are not supported. The authors need to fully evaluate these aspects, or clarify in the work where I noted some confusion or contradicting statements, if they wish to support their original claims. If not, they can change the wording of their claims to better fit the scope of their findings. I think the promises are currently not met by the authors and require additional evaluation or explanations to support their claims.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed my concerns fully. They were able to also better highlight the importance and contributions of their work. I am more convinced of their ideas.
Review #3
- Please describe the contribution of the paper
The manuscript “Separable tissue representations for attributable risk prediction” describes an AI method to predict fracture risk from HR-pQCT using not only bone but also fat and muscle tissue. The method performs slightly better than alternative approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- incorporation of bone, fat and muscles in fracture prediction.
- good dataset
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors claim in the abstract to be the first to find that soft tissue from HR-pQCT (muscles & fat) contributes to fracture prediction; however, it is well known that muscle mass per se contributes to fracture risk.
- Harrell’s and Uno’s concordance are not very common measures, please explain.
- Please provide more details of the dataset: which scanner do you use, which mAs, kVp, which pixel spacing, slice increment, population age, SD, etc.
- Why did you resize images, and which resolution was finally used?
- You claim that risk prediction is often done with DeepSurv; please provide at least 3 examples or remove the sentence.
- There are some typos throughout the paper, please revise.
- What do you mean by “For images, including a limited region of interest in an image is more challenging”?
- Please remove the second part of the sentence “Such methods are naturally interpretable, but limited, since the features are hand-crafted and not learnt”. Do you mean limited in number or in quality? Some “hand-crafted” (maybe better “mind-crafted”) features are indeed superior to AI features.
- “Efforts to include specific parts of an image typically rely on cropping, which is not suitable where features are embedded in each other”. Here cropping itself is not the problem. One simply has to crop the “region of interest”, meaning the entire region that might be important. You won’t crop randomly.
- Fig. 2 too small.
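For readers unfamiliar with the concordance measures raised above: Harrell's C-index is the fraction of comparable patient pairs in which the patient with the earlier observed event also has the higher predicted risk. The following is a minimal sketch with made-up toy data, not the paper's evaluation code; the function name `harrell_c_index` is illustrative. Uno's variant additionally reweights pairs by the inverse probability of censoring, which is omitted here.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's concordance index for right-censored survival data."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:  # subject j was still event-free when i had the event
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event has higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): follow-up time, event flag (1 = fracture), predicted risk.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.5, 0.1]
c_index = harrell_c_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk.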
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I like the idea to combine fat, muscle and bone for fracture prediction.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my concerns.
Author Feedback
We sincerely thank the reviewers for their invaluable insights and thoughtful suggestions, which have significantly contributed to improving the manuscript. Below, we address the main concerns.
Contributions & interpretability [R2,R3]: Our method’s main contribution is attributability by ablation, enabling the inclusion and exclusion of spatial tissue features, which is not efficiently possible with compared methods. The manuscript now emphasizes our contributions: (i) the ability to flexibly isolate and model risk contributions from different tissues in images, (ii) a novel architecture enabling input of a variable number of patch tokens from segmented regions and (iii) attribution in multimodal survival analysis. This work is, to our knowledge, the first to enable ROI-based image attribution in survival prediction using MAEs and ViTs. We also highlight the potential of the STRAP framework beyond bone imaging to underline its general usefulness for tissue contributions.
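The idea of attributability by ablation can be sketched as follows: score the risk with all tissue representations present, then again with one tissue left out, and read the difference as that tissue's contribution. This is only an illustration under stated assumptions; `toy_risk` is a hypothetical fixed linear score standing in for a trained risk head, and the weights and inputs are made up rather than taken from the STRAP code.

```python
from typing import Callable, Dict, List

Representation = List[float]
RiskFn = Callable[[Dict[str, Representation], List[float]], float]


def toy_risk(tissues: Dict[str, Representation], clinical: List[float]) -> float:
    """Hypothetical stand-in for a trained risk head (not the paper's model)."""
    weights = {"bone": 0.5, "muscle": 0.3, "fat": 0.2}  # made-up weights
    score = sum(weights[name] * sum(rep) for name, rep in tissues.items())
    return score + 0.1 * sum(clinical)


def ablation_attribution(
    risk_fn: RiskFn, tissues: Dict[str, Representation], clinical: List[float]
) -> Dict[str, float]:
    """Risk change when each tissue is removed from the input in turn."""
    full = risk_fn(tissues, clinical)
    attribution = {}
    for name in tissues:
        reduced = {k: v for k, v in tissues.items() if k != name}
        attribution[name] = full - risk_fn(reduced, clinical)
    return attribution


# Made-up representations and one clinical feature.
tissues = {"bone": [1.0, 1.0], "muscle": [2.0], "fat": [1.0]}
clinical = [1.0]
attribution = ablation_attribution(toy_risk, tissues, clinical)
```

Dropping a tissue entirely only works if the model accepts a variable set of inputs, which is the property the rebuttal attributes to the variable-length token encoder.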
Visual attribution methods like Grad-CAM are notoriously vague, sensitive to model backbone and intended for qualitative inspection, and may focus on extraneous elements or background. We will clarify this in the text and caption of fig. 2. Due to the variable number of spatial tokens from our encoder, STRAP variants are not compatible with Grad-CAM, although IG is preferable in joint tabular-spatial attribution since Grad-CAM is only intended for spatial features. While Grad-CAM is commonly used in the community, we will change fig. 2 to use IG for uniformity.
ROI & patches [R2]: Now clarified in the text, patches across all tissues are selected randomly. As pointed out, weighted patch sampling is an interesting future direction. The prior knowledge of tissue type allows us to encode each set of patches separately, yielding 3 different representations from the encoder. Since the number of tissue patches is significantly less than that of the full image, the shorter input sequence leads to a drastic decrease in the encoder transformer memory use.
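The patch selection described above can be illustrated with a small sketch. It assumes, hypothetically, that a patch belongs to a tissue if it contains at least one pixel with that tissue's label; the label values, function names and keep fraction are illustrative and not taken from the STRAP code. The last lines compare the size of the attention matrix, which in standard self-attention grows with the square of the number of tokens, for the full image and for the three separately encoded tissue sequences.

```python
import random
from typing import Dict, List


def tissue_patch_indices(labels: List[List[int]], patch: int) -> Dict[int, List[int]]:
    """Map each tissue label (> 0) to the flat indices of patches that contain it."""
    rows, cols = len(labels) // patch, len(labels[0]) // patch
    groups: Dict[int, List[int]] = {}
    for pr in range(rows):
        for pc in range(cols):
            present = {
                labels[r][c]
                for r in range(pr * patch, (pr + 1) * patch)
                for c in range(pc * patch, (pc + 1) * patch)
            }
            for tissue in sorted(present - {0}):  # label 0 is background
                groups.setdefault(tissue, []).append(pr * cols + pc)
    return groups


def sample_tokens(
    groups: Dict[int, List[int]], keep: float, rng: random.Random
) -> Dict[int, List[int]]:
    """Randomly keep a fraction of each tissue's patches (at least one per tissue)."""
    return {
        t: sorted(rng.sample(idx, max(1, round(keep * len(idx)))))
        for t, idx in groups.items()
    }


# Toy 8x8 segmentation: 1 = bone, 2 = muscle, 3 = fat (hypothetical labels).
labels = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 2, 2, 0, 0],
    [0, 0, 1, 1, 2, 2, 0, 0],
    [0, 0, 3, 3, 2, 2, 0, 0],
    [0, 0, 3, 3, 2, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
groups = tissue_patch_indices(labels, patch=2)
kept = sample_tokens(groups, keep=0.5, rng=random.Random(0))

n_full = (len(labels) // 2) * (len(labels[0]) // 2)   # tokens for the whole image
n_tissue = sum(len(idx) for idx in groups.values())   # tokens covering any tissue
attn_full = n_full ** 2                               # entries in one full-image attention matrix
attn_tissue = sum(len(idx) ** 2 for idx in groups.values())  # one matrix per tissue
```

In this toy case only 4 of 16 patches contain tissue, so the separately encoded tissue sequences need far fewer attention entries than the full image; the actual savings in the paper depend on the real segmentations and architecture.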
Baseline comparisons [R2,R3]: We report modest gain over baselines. We stress that our main contribution is not absolute performance gain but enabling interpretable, tissue-specific modeling in a deep survival analysis - something ConvDeepSurv and similar methods cannot. That said, STRAP does perform comparably or slightly better across metrics while providing meaningful attribution and greater flexibility. We have revised the text to highlight this. We have also added references showing the wide use of DeepSurv.
Cohort & scanner details [R1,R2,R3]: We agree that some additional data on the cohort is needed, and has now been added. We refer to the (anonymized) paper on the cohort and acquisition for in-depth description, however.
Soft tissues & features [R2, R3]: In line with the reviewers’ helpful comments, we have now made it explicitly clear that extracted features (besides BMI) from other modalities have been used to model fracture risk. We have added the important references and stated that, as far as we are aware, no other works have modelled risk contribution from direct ablation of soft tissues, nor as auxiliary information from HR-pQCT. Moreover, we have emphasized the important point that hand-crafted features, like stiffness and load extracted from FEM, have clinical motivation, but are limited in that similar features may not be known or computable for soft tissues. In this sense, learnt features may be more expressive and convenient.
Clinical translation [R3]: Concerning the clinical relevance of the attributability, we agree that this is an important direction to pursue, but we believe it is out of scope here and suited for a paper on clinical translation.
Readability & clarity: We thank all reviewers for their comments on readability and clarity, which we have made sure to address. Thanks to R2’s extensive reading of the paper we have also made several miscellaneous improvements.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Although R3 has recommended a Reject, I tend to agree with R1 and R2 that this work has merit. Perhaps some of the objections raised by R3 can be addressed in the final version where anonymity is not required.