Abstract

Obesity is a chronic disease that increases the risk of multi-organ damage as well as cardiovascular disease, diabetes, and certain cancers. It is strongly related to Visceral Adipose Tissue (VAT), which is the fat stored around the internal organs. New approaches to assessing VAT in large populations are essential to understand how obesity contributes to chronic disease progression. Various direct and indirect measures have been developed to quantify VAT. However, many of these techniques either fail to distinguish between various types of body fats (e.g., subcutaneous versus visceral) or involve high radiation imaging or are costly (e.g., Computed Tomography). Annually, millions of individuals globally undergo hip or spine Dual-energy X-ray Absorptiometry (DXA) scans to screen for osteoporosis as well as lateral spine (LS) scans to detect vertebral fractures. In this paper, we develop a multi-modal attention-based framework for VAT estimation from LS DXA scans and patient demographic information. We compare our results on two LS DXA datasets with baseline methods and also perform clinical analysis to demonstrate its effectiveness. The proposed approach has the potential to enable cost-effective, non-invasive, and efficient quantification of VAT in people undergoing bone density assessment with LS scans. To the best of our knowledge, this is the first paper to predict VAT from DXA LS scans.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2744_paper.pdf

SharedIt Link: https://rdcu.be/eG4Dh

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05182-0_25

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/arroobamaqsood/A-Multi-Modal-Attention-based-Framework-for-Visceral-Adipose-Tissue-Estimation.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{MaqAro_From_MICCAI2025,
        author = { Maqsood, Arooba AND Saleem, Afsah AND Sim, Marc AND Suter, David AND Radavelli-Bagatini, Simone AND Hodgson, Jonathan M. AND Prince, Richard L. AND Zhu, Kun AND Leslie, William D. AND Schousboe, John T. AND Lewis, Joshua R. AND Gilani, Syed Zulqarnain},
        title = { { From Pixels to Prognosis: A Multi-Modal Attention-based Framework for Visceral Adipose Tissue Estimation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {249 -- 259}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper provides a novel deep-learning model for predicting visceral adipose tissue (VAT) content using lateral spine (LS) Dual-energy X-ray Absorptiometry (DXA) scans. Since LS DXA scans are two-dimensional (2D) images and cannot fully reflect anatomical structures, the authors propose to combine the patient demographic data and scan image to feed into the deep learning model for improving prediction accuracy. Two datasets are used to evaluate the proposed deep learning model. Furthermore, the authors investigate the correlation between the contents of VAT and some clinical markers related to metabolic syndrome.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The combination of patient demographic data with image data. Multi-modal models are among the most promising research directions for advancing clinical applications.
2. The investigation of the correlation between the deep learning model’s predictions and clinically used metabolic indicators. The authors conduct experiments relating the AI predictions to biochemical indices associated with metabolic syndrome, and thereby demonstrate the clinical utility of the proposed method.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The concerns are mainly about the modality used for VAT quantification and about the experiments. DXA is commonly used to measure bone density. Although it can be applied to quantify VAT, the estimate is obtained indirectly from a two-dimensional image and cannot provide a fully accurate picture of the distribution or amount of the body's components. CT/MRI-based methods are the gold standard for body composition analysis, especially for measuring different types of adipose tissue, because all body compartments can be visualized in three-dimensional images. Furthermore, several studies comparing VAT quantification methods have found that DXA-based VAT analysis is, to some extent, inaccurate [1-3]. This is also emphasized in the manuscript. Thus, in my opinion, more accurate methods should be considered for quantifying VAT, and AI-based methods should be built on more accurate data.

In addition, the proposed deep learning model does not show significant novelty. The adopted network components, including MobileNetV2, the Tabular encoder, and the Attention Feature Fusion Model, are widely used. As for the philosophy of the method, the authors use the patient’s age, height, and weight to compensate for the information missing from the 2D DXA scan. Since DXA scans inherently fail to capture the volume and distribution of VAT, it is doubtful that simple demographic data such as age, height, and weight can make up the difference. Although the experiments show an improvement over the baseline method, the reported mean absolute percentage error (MAPE) values (25.88% and 21.96% in Table 1) are still high and indicate limited prediction accuracy. Without further evaluation or data, it is hard to judge the model’s performance. Furthermore, the Spearman’s correlations in Table 4 between the VAT measures and both Triglycerides and High-density Lipoprotein Cholesterol are weak. However, because DXA is not the most accurate tool for VAT measurement, the related conclusions deserve further validation.

Finally, the authors cite three references (16, 23, and 24 in the manuscript) to introduce the scanners (Hologic-4500A and Hologic Horizon) used to acquire the datasets. However, these references are not closely related to the scanners; the authors should cite more relevant ones. In addition, the authors name each dataset after the scanner used, which may confuse readers unfamiliar with this field. The datasets should be named more formally and more informatively.
1. Taylor, Jenna L., et al. “Accuracy of dual-energy x-ray absorptiometry for assessing longitudinal change in visceral adipose tissue in patients with coronary artery disease.” International Journal of Obesity 45.8 (2021): 1740-1750.
2. Maskarinec, Gertraud, et al. “Subcutaneous and visceral fat assessment by DXA and MRI in older adults and children.” Obesity 30.4 (2022): 920-930.
3. Bredella, Miriam A., et al. “Comparison of DXA and CT in the assessment of body composition in premenopausal women with obesity and anorexia nervosa.” Obesity 18.11 (2010): 2227-2233.
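For readers less familiar with the two error metrics discussed in this review, the following is a minimal sketch of how MAPE and RMSE are conventionally computed. The function names and the toy VAT values are illustrative assumptions; they are not taken from the paper or its repository.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no ground-truth value is zero (division by the true value).
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical VAT values, for illustration only.
vat_true = [400.0, 500.0, 800.0, 1000.0]
vat_pred = [500.0, 400.0, 600.0, 1200.0]

print(f"MAPE: {mape(vat_true, vat_pred):.2f}%")  # MAPE: 22.50%
print(f"RMSE: {rmse(vat_true, vat_pred):.2f}")   # RMSE: 158.11
```

Because MAPE is relative to the true value, the same absolute error weighs more heavily for individuals with low VAT, which is worth keeping in mind when reading percentage errors of this size.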
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Multi-modality models deserve in-depth exploration in the medical image processing field. The authors’ philosophy of combining the image and patient demographic data is novel and reasonable. However, for the purpose of quantifying VAT, DXA itself is not a proper/accurate facility.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper introduces a novel framework for predicting visceral adipose tissue (VAT) from lateral spine (LS) dual-energy X-ray absorptiometry (DXA) scans and basic demographic information. The authors use an attention-based layer to combine the modalities into a single prediction of VAT. Results indicate that this framework improves both mean absolute percentage error and root mean squared error. The VAT prediction of the proposed model is also shown to be well-correlated with clinical markers of metabolic syndrome.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The overall results in the paper support the strength of this method. No single component of the method is particularly novel methodologically, but the overall results are empirically good. The ablation study is a strength of the paper, supporting architecture choices that would otherwise seem overly complicated or arbitrary. It could be highlighted more, or earlier, in the manuscript. The correlation between the derived VAT and metabolic markers is also a strength, further supporting the efficacy of this method.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The methods section is underdeveloped and lacks critical details. Areas that need improvement:

It is stated that the weights of the network are pre-trained on ImageNet, a 3-channel dataset, but at least some of the DXA scans have only one channel. How is this handled? Are DXA pixel values scaled? If so, how? ImageNet weights are usually trained on standardized features; is this the case for the DXA scans as well? Are demographic variables scaled? If not, this is a major methodological flaw: different value ranges (e.g., height vs. weight) and large gradients stemming from wide possible ranges can negatively affect network learning.

The AFFM section is unclear. The authors describe concatenating the demographic and DXA representations but then compute modality-specific attention weights. This seems contradictory. If the vectors are fused before the attention weighting, this may also introduce issues with magnitude unless some sort of magnitude scaling (e.g., L2 normalization) is done.

For the Hologic Horizon machine scan dimensions, it would be good to mention the range of possible values in each dimension. It is also unclear how many channels these scans have. Why are scans resized instead of padded? This is unusual, since it distorts feature dimensions. What is the probability of augmentations being applied? What are the parameters of the rotation augmentation? Can scans be rotated a full 180 degrees? This needs to be justified.

What is the loss function? MAE or MSE? In the 10-fold CV, what was the validation set used for? Determining plateaus for the learning rate schedule? Or was some other kind of hyperparameter tuning done? Were the MobileNet weights trained end-to-end or frozen during training? Given the exceedingly small sample size, training the entire network end-to-end may be doing more harm than good. The phrase “with a learning rate reduction on the plateau” is unclear. What is “the plateau”? What if multiple plateaus are encountered during training? Was this determined dynamically (i.e., on the validation loss) or at a predetermined step?

In general, there seem to be a lot of arbitrary choices made for training. Why 25 epochs? Why this specific learning rate schedule? Why this optimizer and learning rate combination? Was this tuned? If it was tuned based on CV performance, this constitutes leakage and invalidates the results. If it was not tuned at all, the authors are likely presenting an ill-fitted model.

Baseline model: Using RMSE as a loss function is somewhat unusual; why not MSE or MAE? Does this model have the added linear layers after the MobileNet output of the full model, or is regression done directly on the full MobileNet embedding vector? If so, this is somewhat of an unfair comparison.

The authors claim that “Our model significantly improves” performance. Is this based on some sort of statistical testing? Many of the confidence intervals in testing overlap substantially. The failure of the predicted VAT marker to reach significant correlation with CHOL and LDL should be discussed. As it is, glossing over this in the manuscript weakens the otherwise strong results in this section. For Pred D_train, how were these predictions gathered? Are they a combination of the test set predictions from the 10 cross-validation runs?

There are some issues with presentation that hamper readability of the work: Equation 2 breaks up the text unnecessarily and can be skipped or moved into an appendix. The attention fusion module would benefit from a diagram to support the textual explanation.
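One way to read the fusion questions raised in this review is sketched here: each modality embedding is L2-normalized, a scalar score per modality is passed through a softmax to give modality-specific attention weights, and the weighted embeddings are then concatenated. This is an illustrative assumption about how such a module could work, not the authors' AFFM; the scoring vectors, embedding sizes, and function names are all hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length so modalities are comparable."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(image_emb, tabular_emb, w_image, w_tabular):
    """Weight each modality by its attention score, then concatenate.

    w_image and w_tabular stand in for learned scoring vectors (hypothetical).
    """
    img = l2_normalize(image_emb)
    tab = l2_normalize(tabular_emb)
    score_img = sum(w * x for w, x in zip(w_image, img))
    score_tab = sum(w * x for w, x in zip(w_tabular, tab))
    a_img, a_tab = softmax([score_img, score_tab])
    fused = [a_img * x for x in img] + [a_tab * x for x in tab]
    return fused, (a_img, a_tab)


# Toy embeddings with very different magnitudes, for illustration only.
image_emb = [3.0, 4.0, 0.0, 0.0]
tabular_emb = [0.5, 0.5]
fused, weights = attention_fuse(
    image_emb, tabular_emb, w_image=[0.1] * 4, w_tabular=[0.2] * 2
)
print(len(fused), weights)
```

With the normalization step in place, rescaling either input embedding leaves the fused vector unchanged, which is the magnitude concern the review raises about fusing before weighting.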
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

From a cursory search of the literature, while this is the first work attempting to predict VAT from LS DXA, there are works exploring lumbar AP spine DXA for VAT prediction. These scans fill slightly different clinical roles so this does not take away from the present work’s novelty, but could provide an interesting performance comparison. An example would be https://doi.org/10.1016/S0899-9007(01)00673-6
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The current presentation of the method is too weak to accept. Most of the concerns stem from the authors being unclear about choices made over the course of model development. These concerns are addressable, assuming this is just a presentation issue, and if they are, this manuscript could be a fit for acceptance.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have addressed many of the concerns raised in my review. Assuming the presentation will be improved in the final version, I think the paper is just novel enough to justify acceptance.

Review #3

Please describe the contribution of the paper

The main contribution of the study is developing a multi-modal attention-based framework for visceral fat tissue estimation from lateral DXA scans and patient demographic information via feature fusion.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The major strength of the paper is the proposed multimodal architecture, which captures not only the image but also demographic information for visceral fat prediction. This could potentially improve the estimation accuracy. The careful design of the architecture, including the attention mechanism for tabular data and the attention-based fusion module, adds to the strength of the proposed method in exceeding conventional approaches. The experiments included validation on multiple datasets, from two different modalities and sufficiently large populations. Furthermore, ablation studies on the CNN architecture and fusion method were provided. The study further compared body composition metrics for metabolic syndrome diagnosis, which enhances the clinical significance of the method.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The main weakness of the paper is its dependence on lateral DXA measurements, which may not be accessible, especially in low-income regions. Also, no comparisons were made with other imaging modalities, such as CT and MR (e.g., DIXON sequence), which are widely used for several diagnostic purposes and are publicly available in large population studies (e.g., UK Biobank).
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The fusion module itself is not an especially novel concept; however, applying it in the context of visceral fat prediction is novel. It was interesting to see a new application for lateral DXA that integrates demographic data for improved VAT prediction. This led me to recommend acceptance.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

Two primary concerns were raised as potential reasons for rejection: R1 said “DXA is not a proper/accurate facility”, & R3 critiques the presentation of methodological choices. Since R3 indicates this can be addressed, we respond to it first.

R3 (Paper Presentation): Yes, we acknowledge that more details would have been helpful. We provide the following as examples, while reiterating that all points raised will be addressed in the camera-ready version. Single-channel DXA scans were replicated to match 3 channels. All pixel values & demographic variables were normalized. In the AFFM module, modality-specific attention weights are indeed computed prior to fusion; we will fix the wording to avoid confusion. Resizing was chosen to avoid artificial borders that may hinder learning. Data augmentations were applied with probability 0.3, using rotations limited to ±10 degrees, as larger angles reduced performance. The primary loss function is RMSE; MAPE was also monitored. Training settings, including the number of epochs, were selected via grid search within a 10-fold cross-validation strategy that re-initialized models for each fold to prevent leakage. In each fold, the validation set was used to monitor the loss for dynamic learning rate scheduling. We fine-tuned the network end-to-end using a low learning rate to preserve pretrained features. The baseline doesn’t regress directly; instead, it uses a 3×3 conv, dropout, & two FC layers on MobileNet embeddings. D_train refers to the full training dataset, & predictions were obtained by applying the trained model directly, not via test folds from cross-validation. A separate holdout set (D_test) was used to verify whether ground truth–prediction trends persist on unseen data. Paired t-tests showed significant improvement over baseline (p < 0.001), despite overlapping CIs.

Before addressing R1’s concerns, we thank R3 for highlighting the ref. on VAT estimation using DXA scans. This supports the relevance of our method & connects directly to the broader issues raised by R1.

R1 (‘DXA proper/accurate’): DXA-derived VAT is valuable for population-level studies, as it is not feasible to subject an entire population to CT/MRI due to high cost & radiation exposure. DXA offers a practical, low-radiation & scalable solution for estimating VAT across large cohorts. Flagged individuals can then be referred for advanced imaging when necessary. We acknowledge that lateral spine DXA is currently not a gold standard for VAT estimation, but it is one of the most widely used imaging modalities for VAT estimation in research & increasingly in clinical settings. Our study presents an initial exploration of an accessible alternative that demonstrates encouraging results, particularly given the known limitations of DXA.

Other Comments: R1: Although the MAPE values appear relatively high, a 2–3% improvement over the baseline is clinically meaningful. Considering the inherent constraints of LS DXA as a proxy for full-body imaging, our results are the first of their kind & quite promising. This shows that further research in this direction is worthwhile. R2 (Comparison with other modalities): We used LS DXA scans as input & VAT measures from whole-body DXA scans as labels, as these were the only cases with matching LSIs & full-body data. Unfortunately, matching CT/MRI data were not available. For the submission, we avoided descriptive dataset names to prevent any risk of identity disclosure.

R1&R3 (Significance of results): Similar findings have been reported in other cohorts, where VAT was inversely correlated with HDL [1] & showed low or no association with total CHOL & LDL [2]. [2] have also reported comparable correlations of VAT with TRIG (r=0.196, p<0.001) & HDL (r=-0.252, p<0.001). These weak associations, along with some estimation error, likely explain our results.

[1] Moreira, V. C., et al. Visceral Adipose Tissue …, Journal of Aging Research, 2022

[2] Ma, X., et al., Association of Conventional…, Diabetes, Metabolic Syndrome & Obesity, 2025
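The preprocessing choices stated in the rebuttal (replicating single-channel scans to three channels, normalizing demographic variables, and applying a rotation of at most ±10 degrees with probability 0.3) can be sketched as follows. Only those three settings come from the rebuttal; the z-score form of normalization, the function names, and the toy data are assumptions, and the authors' repository may implement these steps differently.

```python
import random
import statistics


def replicate_channels(gray_image, n_channels=3):
    """Copy a single-channel image (H x W nested lists) into C x H x W."""
    return [[row[:] for row in gray_image] for _ in range(n_channels)]


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) std.

    Assumption: z-scoring is used; in practice the statistics would be
    estimated on the training fold only and reused for validation/test.
    """
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sample_rotation(rng, prob=0.3, max_deg=10.0):
    """Return a rotation angle in degrees; 0.0 means no augmentation."""
    if rng.random() < prob:
        return rng.uniform(-max_deg, max_deg)
    return 0.0


# Hypothetical inputs: a 2 x 2 "scan" and (age, height cm, weight kg) rows.
scan = [[0.1, 0.2], [0.3, 0.4]]
demographics = [[70.0, 160.0, 65.0], [75.0, 155.0, 72.0], [80.0, 165.0, 80.0]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(len(replicate_channels(scan)))
print(zscore_columns(demographics)[0])
print([round(sample_rotation(rng), 2) for _ in range(5)])
```

Sampling the angle separately from applying it keeps the sketch independent of any particular image library; the returned angle would be passed to whatever rotation routine the training pipeline uses.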

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

From Pixels to Prognosis: A Multi-Modal Attention-based Framework for Visceral Adipose Tissue Estimation

Author(s):