Abstract

In self-supervised pre-training, learning consistent and hierarchical representations that capture relationships among anatomical semantics holds promise for enhancing the performance and interpretability of downstream tasks. However, the representations learned by existing methods are vulnerable to scale variations, which manifests as inconsistency at some scales and misjudgments of hierarchy. We therefore propose a scale-robust anatomical representation learning framework with self-supervision, which combines contrastive learning with our newly proposed pretext tasks: location-scale prediction (LSP) and decomposition prediction (DP). Our method addresses this vulnerability from three aspects: 1) It uses multi-scale patches as inputs to embrace diverse anatomical semantics in pre-training. 2) LSP promotes consistency across scales by enhancing the model’s sensitivity to scale and resolving representation conflicts caused by multi-scale inputs. 3) DP eliminates hierarchy misjudgments by producing hierarchical representations for anatomies and their constituent parts that better balance similarity and discriminability. Evaluations across six chest X-ray datasets demonstrate that the representations learned by our method are consistent and hierarchical across scales and transfer well to various downstream tasks. The code is publicly available at https://github.com/SurongChu/SRHRS.
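To make the LSP pretext task concrete: the abstract (and Review #3) describe its loss as a distance between the predicted and ground-truth (position, scale) of a patch, L_LSP = dist((p_hat, s_hat), (p, s)). The sketch below is illustrative only; the choice of mean squared error as the distance, the normalized-coordinate convention, and the function name are assumptions, not the paper's actual implementation.

```python
def lsp_loss(pred_ps, true_ps):
    """Location-scale prediction (LSP) loss sketch.

    pred_ps / true_ps are (x, y, scale) tuples in normalized [0, 1]
    coordinates. The paper only specifies L_LSP = dist((p_hat, s_hat),
    (p, s)); mean squared error is assumed here for illustration.
    """
    assert len(pred_ps) == len(true_ps)
    return sum((a - b) ** 2 for a, b in zip(pred_ps, true_ps)) / len(pred_ps)

# A patch cropped at normalized position (0.25, 0.40) with scale 0.50,
# predicted by the model as (0.30, 0.35) with scale 0.45:
loss = lsp_loss([0.30, 0.35, 0.45], [0.25, 0.40, 0.50])
```

In the actual framework this regression target would be produced by a prediction head on top of the encoder; here the head is omitted and only the loss shape is shown.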

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0179_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/SurongChu/SRHRS

Link to the Dataset(s)

NIH Chest X-ray14: https://www.kaggle.com/datasets/nih-chest-xrays/data/data
NIH-Mon (Montgomery County CXR Set & Shenzhen Hospital CXR Set): https://lhncbc.nlm.nih.gov/LHC-downloads/dataset.html
COVQU (COVID-19): https://www.kaggle.com/datasets/mustafaalgun/covid19-chest-xray-dataset
GZCP (pediatric pneumonia): https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
SIIM-ACR: https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation

BibTex

@InProceedings{ChuSur_Anatomybased_MICCAI2025,
        author = { Chu, Surong and Qiang, Yan and Ji, Guohua and Ren, Xueting and Zhang, Lijing and Jia, Baoping and Wei, Yangyang and Zhao, Juanjuan and Li, Shuo},
        title = { { Anatomy-based Self-supervised Pre-training for Scale-robust Hierarchical Representations in Chest X-rays } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {64 -- 74}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a novel method for self-supervised learning based on scale-robust hierarchical representations. This is achieved as follows: in a student-teacher architecture with contrastive learning, two novel optimization terms are defined and included: 1) location scale prediction in the form of position + scale parameters and 2) a decomposition prediction for a more consistent hierarchy of representations. The performance is measured on various public datasets for chest radiography.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well structured and easy to follow.
    • The experimental setup is very clear, an ablation study is performed, including an analysis of the learned representations.
    • The 2 proposed optimization terms are novel, especially the decomposition prediction - both being well motivated.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • A weak point of the study is the comparison to reference solutions, which seem to be reimplemented by the authors. This creates a certain gap to the literature and the highest-performing baselines. For example, on the TDC (i.e., Chest X-ray14 dataset) the authors report an average AUC of 82.42, at least 5% higher than all reference methods. However, to my knowledge the highest-performing system achieves an AUC of 0.842 (https://link.springer.com/article/10.1007/s10278-023-00801-4). Even older publications from 2019 report average AUC levels of 0.8, significantly higher than the baselines reported in this submission. It is extremely important to align the validation w.r.t. published literature and position the results and measured performance in that light.
    • Consequently, the literature review should also be adjusted.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite the intriguing methodological contributions of the paper, the above-mentioned gap to the existing literature on chest radiography analysis is concerning. Given that the datasets are public, a clear comparison to the accuracy numbers in prior publications on the same dataset split is possible; but missing from the paper. To properly include this to the paper would require a significant revision of Section 3 on Experiments and Results which would not be possible as part of a rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes SRHRS, a self-supervised learning framework explicitly designed for anatomical semantics in CXRs. SRHRS introduces two novel pretext tasks: 1) LSP, which mitigates representation conflicts from scale variation by predicting the global position and scale of anatomical patches; 2) DP, which enforces parent-child hierarchy by decomposing feature and image spaces and aligning corresponding components. The paper also provides a new evaluation metric and task (“Finding Parent”) to quantitatively assess hierarchical representation structure.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Learning objectives: The proposed LSP and DP tasks are both intuitive and well-motivated, grounded in anatomical properties of CXRs.
    2. Analysis: The paper thoroughly evaluates both the consistency and hierarchy of learned features, including qualitative (t-SNE) and quantitative (“Finding Parent”) assessments. The ablation studies are also compelling in showing how each component contributes to performance.
    3. Empirical performance: SRHRS outperforms recent hierarchical and consistency-based pretraining methods (e.g., Adam v2, TransVW) across six diverse tasks (classification and segmentation).
    4. Clarity: The methodology is clearly presented.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited-data settings: One of the core advantages of pretraining is to improve model performance in limited-data regimes. However, the paper only evaluates a single “low-data” scenario at 30% of the training data, without specifying the corresponding number of samples per dataset. This is a critical omission, as 30% of a large dataset may still vastly exceed the size of another dataset’s full training set. Including evaluations at lower percentages (e.g., 1%, 5%, 10%) and reporting absolute sample counts would strengthen the framework’s demonstrated utility in data-scarce settings.
    2. Scale robustness: A key motivation for SRHRS is its robustness to anatomical scale variation. Although Table 3 includes an ablation on multi-scale patch sampling, it does not compare scale robustness directly against prior baselines. Incorporating such comparisons would provide stronger empirical evidence supporting SRHRS’s claimed advantage over existing methods.
    3. Limited discussion of failure cases: While performance is strong overall, there’s little discussion of where the method fails—e.g., cases of overlapping anatomies or high inter-subject anatomical variability.
    4. [Minor] Figure clarity: Use bold and underline for the best and second-best performances to make comparing other methods/ablations easier.
    5. [Minor] Theoretical framing of hierarchy remains informal: While the proposed “Finding Parent” task empirically evaluates hierarchical structure, the paper doesn’t rigorously define what constitutes a hierarchical representation in Euclidean space as opposed to distance-based clustering.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave this paper a Weak Accept based on its strong methodological contribution, clear writing, and compelling empirical results, particularly in demonstrating the benefits of its novel pretext tasks (LSP and DP) for learning hierarchical and scale-consistent anatomical representations. The proposed “Finding Parent” evaluation adds originality and rigor in assessing representation hierarchy, and SRHRS achieves state-of-the-art results across multiple downstream tasks. However, the paper would benefit from a broader evaluation under more limited-data settings, a direct comparison to baselines for scale robustness, and deeper analysis of failure cases. These limitations prevent a full accept, but the paper still represents a valuable contribution to the field of self-supervised learning for medical imaging.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Although there are some remaining questions, the strength of the contribution still stands.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a self-supervised pre-training method to mitigate the inconsistencies and hierarchical misjudgments that arise from multi-scale chest X-ray inputs in anatomical representation learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is intuitive and clear. The authors consider the location-scale of input patches for consistent representation, and similarities in both feature and image space for hierarchical representation.
    2. Experiments demonstrate that the proposed method is robust with respect to scale and hierarchy.
    3. The “finding parents” experiment can be used to quantitatively evaluate the hierarchy of representations from pretrained models.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In Method Section 2.3, the teacher encoder is used as the pretrained network for downstream tasks. Is there a reason why the student encoder was not used? Clarifying this would make the method clearer.
    2. In the Experiments and Results section, downstream tasks are performed for classification and segmentation. Pretrained networks have also been applied to downstream tasks such as object detection and image-to-image translation (https://doi.org/10.1007/s10278-024-01032-x); these could be good experiments to verify the robustness and universality of the network pretrained with the novel pretext tasks in future work.
    3. In Method Section 2.1, the notation for the LSP loss is incorrect. You need to change L_CL to L_LSP in L_CL = dist((p_hat, s_hat), (p, s)).
    4. In Experiments and Results Section 3.3 (Hierarchy of representations), the second paragraph starts with “As shown in Fig. 4(a),”. This should be changed to “As shown in Fig. 4(b),”.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although a question remains about the design of the architecture, the proposed method and the comparative analysis addressing both inconsistency and hierarchical misjudgment are clear and strong.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I adhere to accept because the authors have addressed my questions well.




Author Feedback

We appreciate the positive feedback from reviewers (R) regarding the “novelty” (R1, R2, R3), “compelling experiments” (R1, R3), and “clear descriptions” (R1, R2, R3). Special thanks to R1 and R3 for their immediate acceptance of the paper. We address the reviewers’ key concerns about the identified weaknesses (W).

Q1: No SOTA comparison on Chest X-ray14 (X14) (R2W1)
A1: -We evaluated our SRHRS framework against SOTA self-supervised anatomical representation learning methods, including TransVW (2021), GVSL (2023), and Adam v2 (2024) (Table 1). To ensure a fair comparison, all baselines were pretrained and fine-tuned using identical configurations: a ResNet50 backbone and standard fine-tuning protocols without task-specific adaptations for long-tail distributions or label noise.
-This comparative study received positive feedback from reviewers, such as “compelling empirical results” (R1) and “clear and strong comparative analysis” (R3).
-While the paper cited by R2 (T) reports a higher AUC on X14, it addresses a distinct challenge (long-tail distribution) rather than representation learning. Direct comparison with our work (O) is therefore inappropriate due to fundamental experimental differences, as detailed below: Backbone (T: CoAtNet-0-rw [CNN+Transformer hybrid, ImageNet-pretrained] | O: ResNet50 [X14 64K-pretrained]); Classification head (T: 14 binary classifiers | O: 1 multi-label classifier); Training data (T: X14 78K images | O: X14 20K images); Loss (T: classification + long-tail mitigation loss | O: classification loss only).

Q2: Table 3 lacks direct baseline comparisons for scale robustness (R1W2)
A2: -Actually, Table 3 is not a comparative analysis of scale robustness but an ablation study showing how varied inputs and end-to-end training affect the learned representations. While the top row of Table 3 shares similarities with [22] in input/training methods, it still uses the SRHRS framework, as do the other rows.
-Direct comparisons to prior baselines on scale robustness are shown in Fig. 3 and Fig. 4. Our SRHRS framework shows two key advantages for scale robustness: 1) improved anatomical differentiation across input scales; 2) reduced mismatches in the “finding parent” experiment.

Q3: Why transfer the teacher encoder to downstream tasks? (R3W1)
A3: Pre-trained models with teacher-student architectures typically transfer the teacher network to downstream tasks (e.g., BYOL and its variants), as its stable parameter updates and implicit ensemble effect yield more robust features. We follow this convention.
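The BYOL-style convention referenced above can be sketched as follows: the teacher's parameters are an exponential moving average (EMA) of the student's, so they evolve slowly and stably, which is why the teacher encoder is usually the one transferred. The flat-list parameter representation, function name, and momentum value below are illustrative assumptions, not the paper's implementation.

```python
def ema_update(teacher, student, momentum=0.996):
    """One EMA step: the teacher drifts only slightly toward the student.

    `teacher` and `student` are flat lists of parameter values here for
    simplicity; in practice this runs over every tensor of both networks.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

# With momentum 0.9, the teacher keeps 90% of its old value per step:
new_teacher = ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9)
```

Because the gradient only flows through the student while the teacher averages over many student snapshots, the teacher acts as an implicit ensemble, matching the "stable parameter updates" rationale in the rebuttal.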

Q4: Additional experiments (R1W1W3, R3W2)
A4: -We agree with the reviewers that testing SRHRS under more extreme limited-data settings (1%/5%/10%) (R1W1) and on broader tasks (e.g., object detection, image-to-image translation) (R3W2) would be valuable, as would analyzing failure cases (R1W3). However, rebuttal guidelines prohibit adding new results here or in the main text. These insightful suggestions will therefore be addressed in future work.
-Crucially, the experiments presented in our work are indispensable, as they directly validate SRHRS’s effectiveness by addressing 3 pivotal questions: 1) Can SRHRS learn hierarchical representations? Figs. 3-4 show this capability. 2) Do these representations benefit downstream tasks? Table 1 confirms positive impacts. 3) What drives SRHRS’s performance? Tables 2-3 provide insights.

Q5: Informal theoretical framing of “finding parent” (R1W6-minor)
A5: We thank the reviewers for the positive assessment of the “finding parent” experiment (e.g., “originality and rigor in assessing representation hierarchy” (R1), “can be used to evaluate hierarchy of representations from pretrained models quantitatively” (R2)) and for directly noting its theoretical limitation. We will address it in future work by formalizing a theoretical framework to rigorously model semantic hierarchies in Euclidean space.

Q6/A6: All typographical errors (R1W5-minor, R3W3, R3W4) will be corrected in the final version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I enjoy reading this paper and its idea, it is quite novel and sound. And the performances improvements are also solid. Thus, a clear acceptance is recommended.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


