Abstract

Masked Autoencoders (MAEs) have emerged as a dominant strategy for self-supervised representation learning in natural images, where models are pre-trained to reconstruct masked patches using a pixel-wise mean squared error (MSE) between the original and reconstructed RGB values as the loss. We observe that MSE encourages blurred image reconstruction, yet it still works for natural images because it preserves dominant edges. However, in medical imaging, where texture cues are more important for classifying a visual abnormality, the strategy fails. Taking inspiration from the Gray-Level Co-occurrence Matrix (GLCM) feature used in radiomics studies, we propose a novel MAE-based pre-training framework, GLCM-MAE, with a reconstruction loss based on matching GLCM matrices. GLCM captures intensity and spatial relationships in an image; hence, the proposed loss helps preserve morphological features. Further, we propose a novel formulation that converts the matching of GLCM matrices into a differentiable loss function. We demonstrate that unsupervised pre-training on medical images with the proposed GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms the current state-of-the-art across four tasks: gallbladder cancer detection from ultrasound images by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from X-rays by 0.5%, and COVID detection from CT by 0.6%. Source code and pre-trained models are available at: https://github.com/ChetanMadan/GLCM-MAE
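Below is a minimal, hedged sketch (not the authors' released implementation) of how a GLCM-matching objective can be made differentiable, following the idea described in the abstract and rebuttal: build a soft, KDE-style joint histogram between an image and a copy shifted one pixel to the right, then compare the resulting co-occurrence matrices of the target and the reconstruction. The bin count, kernel bandwidth, and L1 comparison are illustrative assumptions.

import torch
import torch.nn.functional as F

def soft_glcm(img: torch.Tensor, n_bins: int = 16, bandwidth: float = 0.05) -> torch.Tensor:
    # Differentiable GLCM-like joint histogram for grayscale images.
    # img: (B, 1, H, W), intensities assumed to lie in [0, 1].
    # Returns (B, n_bins, n_bins) soft co-occurrence matrices for a horizontal offset of 1.
    B = img.shape[0]
    left = img[..., :, :-1].reshape(B, -1)    # each pixel
    right = img[..., :, 1:].reshape(B, -1)    # its right-hand neighbour
    centers = torch.linspace(0.0, 1.0, n_bins, device=img.device)
    # Gaussian (KDE-style) soft assignment of intensities to gray-level bins
    w_left = torch.exp(-0.5 * ((left.unsqueeze(-1) - centers) / bandwidth) ** 2)
    w_right = torch.exp(-0.5 * ((right.unsqueeze(-1) - centers) / bandwidth) ** 2)
    w_left = w_left / (w_left.sum(dim=-1, keepdim=True) + 1e-8)
    w_right = w_right / (w_right.sum(dim=-1, keepdim=True) + 1e-8)
    # Sum of outer products over pixel pairs -> soft co-occurrence counts
    glcm = torch.einsum("bpi,bpj->bij", w_left, w_right)
    return glcm / (glcm.sum(dim=(1, 2), keepdim=True) + 1e-8)

def glcm_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L1 distance between the soft GLCMs of the reconstruction and the target.
    return F.l1_loss(soft_glcm(recon), soft_glcm(target))

Because every operation above is differentiable, gradients flow from the co-occurrence comparison back to the reconstructed pixels, which is what allows such a loss to be used inside MAE pre-training.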

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4284_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ChetanMadan/GLCM-MAE

Link to the Dataset(s)

Gallbladder cancer ultrasound dataset: https://gbc-iitd.github.io/data/gbcu
Breast cancer ultrasound dataset: https://scholar.cu.edu.eg/?q=afahmy/pages/dataset
Pneumonia detection chest X-ray dataset: https://data.mendeley.com/datasets/rscbjbr9sj/2
COVID-19 detection lungs CT dataset: https://www.medrxiv.org/content/10.1101/2020.04.24.20078584v3

BibTex

@InProceedings{MadChe_Focus_MICCAI2025,
        author = { Madan, Chetan and Satia, Aarjav and Basu, Soumen and Gupta, Pankaj and Dutta, Usha and Arora, Chetan},
        title = { { Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a domain-specific modification to the masked autoencoder framework by incorporating a differentiable approximation of the Gray-Level Co-occurrence Matrix (GLCM) into the reconstruction loss. This aims to encourage the preservation of texture features, which are often important in medical image classification tasks. The authors demonstrate modest improvements over baseline methods across several classification datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an important aspect of medical image analysis by attempting to preserve texture information during masked autoencoder pre-training, which is often overlooked in standard pixel-wise loss formulations.

    • The incorporation of a differentiable approximation of GLCM into the loss function is a domain-aware adaptation that is well-motivated for medical imaging tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The method makes only a modest technical contribution from a computer vision perspective. It reuses a standard MAE architecture and focuses solely on modifying the loss function, with no significant architectural or algorithmic innovation.

    • The performance improvements reported are relatively small in some tasks, and the paper lacks a thorough statistical analysis or robustness evaluation to support the generalizability of the method.

    • The scope is limited to classification tasks, with no discussion of extension to segmentation or other clinically relevant tasks, and no clinical validation or user study to support claims of practical impact.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the integration of GLCM into MAE loss formulation is well-motivated and empirically effective for medical imaging, the methodological novelty from a computer vision standpoint is relatively limited. The architectural backbone remains standard, and the differentiable histogram approach has prior art. However, the domain-specific adaptation is clever and yields meaningful gains in clinically relevant tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal is carefully written and addresses several concerns appropriately. While the authors clarify many implementation and validation aspects, the core technical novelty and clinical translation potential remain modest. My overall evaluation remains unchanged.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a model that focuses on subtle gray-level variations within each patch and better preserves morphological features. The authors investigate the reasons behind the relative ineffectiveness of MAE-based pre-training on medical imaging tasks, observing that pixel-wise MSE-based reconstruction over-smooths subtle, but semantically important, texture features in the image representation and degrades performance on medical image classification.

    The authors propose a novel MAE technique tailored for medical imaging, incorporating a GLCM-guided loss. GLCM is non-differentiable; however, the authors use a clever work-around based on a differentiable joint histogram of the image and a copy shifted one pixel to the right, yielding a GLCM-like representation. This design ensures that the learned representations preserve critical morphological information, which is essential for downstream tasks such as classification and segmentation in medical imaging.

    The authors demonstrate the efficacy of GLCM-MAE with the proposed GLCM loss, outperforming existing SOTA on four medical image analysis tasks: gallbladder cancer detection from ultrasound by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from X-rays by 0.5%, and COVID detection from CT by 0.6%. The authors release the source code and pre-trained models for the community.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novel Texture-Aware Loss Formulation
    • Domain-Specific Innovation
    • Improved Representation Learning
    • Strong Evaluation and Ablation Studies
    • Potential for Broader Impact

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    • Limited Discussion on Differentiability and Integration of GLCM
    • Scalability to High-Dimensional or 3D Data
    • Limited Clinical Evaluation or Feasibility Study
    • Ablation Depth Could Be Expanded
    • Potential Overfitting to Texture Bias

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper entitled “Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification” is technically sound and acceptable.

    However, I have a few concerns, listed below:

    1. What motivated the integration of GLCM into the MAE framework, and how does it improve upon traditional pixel-wise MSE reconstruction in medical images?

    2. How exactly is the GLCM-guided loss computed and incorporated into the MAE training pipeline? Is it applied to the reconstruction output, embeddings, or another level?

    3. Is the GLCM loss used in combination with MSE loss or as a replacement? How do you balance the two if combined?

    4. How does the choice of GLCM parameters (e.g., window size, angles, distance) affect the model’s ability to learn meaningful textures?

    5. What kind of improvements did GLCM-MAE show on downstream classification tasks compared to baseline MAE or contrastive methods like SimCLR or BYOL?

    6. Does the GLCM-MAE generalize well across different types of medical images (e.g., MRI, CT, histopathology), or is it optimized for a specific modality?

    7. How do you handle the non-differentiability of GLCM? Do you use any approximation or workaround to enable gradient flow during training?

    8. Can the features learned through GLCM-MAE be effectively transferred to non-texture-heavy tasks, or are they overly specialized?

    9. Have you performed ablation studies to isolate the contribution of the GLCM loss? What was the performance impact when it was removed?

    10. What are the limitations or challenges of incorporating texture-based losses like GLCM in large-scale self-supervised pretraining?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a novel and well-motivated approach by integrating a GLCM-guided loss into MAE for texture-aware self-supervised learning in medical imaging. The method addresses key limitations of pixel-wise losses like MSE by capturing subtle texture variations crucial for clinical tasks. Strong empirical results and thorough ablations support the effectiveness and generalizability of the approach. While there may be some computational considerations, the contribution is significant and opens new directions for texture-based self-supervision.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers and the AC for their detailed feedback and for acknowledging the novelty and relevance of our work.

R1, AC: Moderate improvement? Statistical analysis, robustness? While the accuracy gains (0.5-3.1%) appear modest, they are made over a strong MAE baseline with 92-94% accuracy. The p-values (GLCM-MAE vs. MAE: GBCU=0.036; BUSI=0.005; PX=0.034; COVID=0.048) will be included in the final version. We validate robustness via 5-fold cross-validation (Tab.3, Fig.4) and Haralick analysis (Fig.3), with consistent gains across 4 tasks and 3 modalities (US, X-ray, CT), demonstrating strong generalization. The consistency of the proposed loss with medically established radiomics features enhances the trustworthiness of the proposed technique and also constitutes a significant technical contribution of our work.

R1: Novelty? Existing MAEs indiscriminately rely on a pixel-wise MSE loss inspired by natural images, overlooking the need to preserve fine textures and spatial patterns crucial for clinical diagnosis. We introduce the first MAE framework with a differentiable, texture-aware loss inspired by radiomics. Using KDE-based joint histograms, we approximate GLCM in a gradient-compatible form, enabling texture-preserving training tailored for medical imaging, an unexplored and impactful direction. We believe our work will inspire future work that incorporates other radiomics features into AI model training, improving robustness across a broad range of clinical applications and medical imaging problems.

R1, R3: Clinical study? Our work, as described in the current manuscript, details the methodological contribution around the loss function and validates it on 4 real-world datasets across 3 medical imaging modalities, demonstrating strong generalization and applicability. A follow-up paper in a medical journal detailing a prospective clinical study is part of the future work.

R1: Segmentation task? The proposed GLCM loss is foundational work for pre-training AI backbones for medical imaging, and can be extended to other tasks, including image segmentation.

[R3 Questions] Discussion on Differentiability? Sec.2.1 contains a detailed discussion. We will further improve the exposition in the camera-ready version.

3D Applicability? We thank the reviewer for the suggestion. We believe the loss should be extensible to 3D problems as well, but it might require some changes to the current implementation. We will consider it in future work.

Texture-bias Overfitting? We use a combination of GLCM, MAE and SSIM losses with relative weights determined through cross-validation. Fig.3a shows that the proposed loss better reconstructs the Haralick features on the test set, allaying concerns of overfitting in our experiments.

Ablation Depth? R3 also cites “thorough ablations” as a reason for acceptance. We note that Fig.4 contains architecture ablations, Tab.3 (left) covers the GLCM kernel bandwidth, and Tab.3 (right) isolates the loss components.

Motivation? MSE over-smooths the reconstructed image (Fig.3b), missing fine texture details important for medical diagnosis. The proposed loss captures gray-level co-occurrences and better preserves radiomics features.

Integration Level? GLCM loss is applied at the output reconstruction level (Eq.3).

Loss Composition? Early training uses MSE alone. Later, we combine GLCM and SSIM losses, reducing the MSE weight to alpha = 0.1 for stable, texture-aware learning.
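A minimal sketch of this schedule: the alpha = 0.1 weight is from the rebuttal above, while the switch epoch and the glcm_fn / ssim_fn callables (e.g. the soft-GLCM loss sketched after the abstract and 1 - SSIM) are illustrative assumptions, not the authors' exact values.

import torch

def combined_loss(recon, target, glcm_fn, ssim_fn, epoch, warmup_epochs=40, alpha=0.1):
    # Early training: plain pixel-wise MSE only.
    mse = torch.mean((recon - target) ** 2)
    if epoch < warmup_epochs:
        return mse
    # Later training: texture (GLCM) and structural (SSIM) terms with a down-weighted MSE.
    return alpha * mse + glcm_fn(recon, target) + ssim_fn(recon, target)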

Downstream Comparison, Generalization? GLCM-MAE outperforms MAE, OmniMAE, and other SOTA methods across 4 real-world datasets and 3 modalities (US, X-ray, CT), confirming strong generalizability (Tab.1, 2).

Non-differentiability? KDE-based joint histograms [3] with shifted intensity maps yield a fully differentiable GLCM approximation (Sec.2.1, Fig.1).
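A tiny check, assuming the soft_glcm / glcm_loss sketch given after the abstract, that gradients indeed flow through such an approximation:

x = torch.rand(2, 1, 32, 32, requires_grad=True)
loss = glcm_loss(x, torch.rand(2, 1, 32, 32))
loss.backward()
print(x.grad is not None)  # True: the GLCM approximation is differentiable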

Non-textural Tasks, Limitations? As the proposed loss captures texture, tasks that rely primarily on shape features may not benefit.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The paper introduces a differentiable GLCM-MAE approach to preserve texture in images during self-supervised pre-training, which is then applied to multiple classification tasks. The reviewers have expressed concerns regarding only moderate improvements in some of the classification tasks; why this is the case should be better explained. Please address the reviewers’ concerns thoroughly.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    As pointed out by R1, concerns still remain on the technical novelty and clinical translation potential. The generalizability of the method to other feature banks is also unclear.


