Abstract

Accurate interpretation of multi-view radiographs is crucial for diagnosing fractures, muscular injuries, and other anomalies. While significant advances have been made in AI-based analysis of single images, current methods often struggle to establish robust correspondences between different X-ray views, an essential capability for precise clinical evaluations. In this work, we present a novel self-supervised pipeline that eliminates the need for manual annotation by automatically generating a many-to-many correspondence matrix between synthetic X-ray views. This is achieved using digitally reconstructed radiographs (DRR), which are automatically derived from unannotated CT volumes. Our approach incorporates a transformer-based training phase to accurately predict correspondences across two or more X-ray views. Furthermore, we demonstrate that learning correspondences among synthetic X-ray views can be leveraged as a pretraining strategy to enhance automatic multi-view fracture detection on real data. Extensive evaluations on both synthetic and real X-ray datasets show that incorporating correspondences improves performance in multi-view fracture classification.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3061_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MohamadDabboussi/Multiview-Xray-Matching

Link to the Dataset(s)

https://stanfordmlgroup.github.io/competitions/mura/

BibTex

@InProceedings{DabMoh_SelfSupervised_MICCAI2025,
        author = { Dabboussi, Mohamad and Huard, Malo and Gousseau, Yann and Gori, Pietro},
        title = { { Self-Supervised Multiview Xray Matching } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        pages = {581 -- 591}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. The paper proposes a self-supervised method for many-to-many patch matching in multi-view X-rays.
    2. The paper proposes an algorithm to generate many-to-many correspondence matrices along with multi-view digitally reconstructed radiographs (DRRs) from CT.
    3. The paper proposes to utilize correspondence learning as pretraining and to integrate the correspondence matrix into the attention mechanism to improve fracture detection from multi-view X-ray images.
    4. The proposed method was evaluated on a correspondence prediction task and a downstream multi-view X-ray classification task, using four datasets including synthetic and clinical data.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. To achieve self-supervised training of multi-view X-ray matching on synthetic X-rays (DRRs), the paper proposes a novel algorithm to generate a correspondence matrix for multi-view DRRs.
    2. The effectiveness of the proposed correspondence learning was validated through a correspondence prediction task and a multi-view fracture prediction task.
    3. An ablation study was performed to validate the improvements brought by correspondence pretraining and correspondence matrix integration for multi-view fracture prediction.
    4. Experiments were conducted with both synthetic and real X-ray image datasets.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Although the paper designed several experiments with different datasets, it was still missing sufficient explanation of method details. For example, Figure 1 was not referred to anywhere for an explanation of the training pipelines. When using correspondence pretraining for fracture detection, it is not clear which modules from pretraining are used to initialize the target training.
    2. Section 3.4 described that the correspondence matrix was integrated into the attention mechanism by addition to the attention map; however, it is not clear where the query Q and key K were obtained from. This is confusing considering that both attention and the correspondence matrix are directional (i.e., A-to-B attention is different from B-to-A). Furthermore, early fusion would double the number of tokens in self-attention, which causes a shape mismatch between the attention map and the correspondence matrix.
    3. In Table 3, late fusion seemed better than early fusion when using no pretraining and no attention guidance, so why does the main text state that “early fusion achieving higher accuracy than late fusion” in section “Multi-View X-ray classification” on page 8?
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the experimental results showed promising improvements on this interesting task, the serious weakness of missing methodology details is not negligible.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Most of my concerns were addressed. I can recommend a minor acceptance, but I really hope the authors can improve the paper's clarity and figure quality.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a self-supervised pre-training scheme to learn a many-to-many correspondence matrix between X-ray views. This approach is designed to pre-train the model and subsequently improve its performance on the task of fracture detection from multiple X-ray views.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method introduces, in a novel way, the task of Correspondence Prediction to make the model more aware of the relationships between multiple views, rather than simply using the multiple X-ray views as input. Furthermore, the integration of the learned correspondence matrix across different attention levels in the multi-view transformer model proves both effective and innovative. This mechanism for incorporating correspondence information through a transformer-based architecture, contributes to the development of multi-view learning.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The introduction is vague in terms of medical context, lacking references that properly frame the clinical relevance of the problem.
    • More architectural details are needed for the multi-view classification network. Although both models (the correspondence prediction network and the multi-view classification network) are transformer-based or incorporate attention mechanisms, the specific architecture of each is conflated throughout the text. For the multi-view classification model, while the functioning of Attention Guidance is explained, the backbone architecture into which it is integrated is not clearly described. How many attention layers are used? Is the same α (alpha) value applied across all layers, or is a different α used for each layer?
    • It would be helpful to reference the corresponding figures more explicitly, for instance, referring to the figures directly in Section 3.4 when describing the classification model, and in Section 3.2 when describing Correspondence Prediction. This would improve clarity and help readers differentiate between the two networks.
    • When integrating the correspondence matrix C with the attention scores (Section 3.4), A + αC, are their dimensions compatible? If they are not, how was this integration performed?
    • In the Experimental Setup, although the task is presented as a classification problem, the class labels are not clearly defined. For example, Table 1 reports metrics such as Precision, Recall, and Average Precision, presumably for fracture classification (two classes), but it is not explicitly stated whether this refers to binary classification (fracture vs. no fracture) or a multi-class problem. This ambiguity increases in Table 3, where a Kappa Score is reported, a metric typically associated with multi-class classification tasks. The classes should be clearly stated to improve clarity and allow for a better understanding of the evaluation results.
    • Discuss why both proposed mechanisms, Correspondence Pretraining and Attention Guidance, are incorporated only into the early fusion strategy and not into late fusion. Given that late fusion outperformed early fusion prior to the inclusion of these modules, it would be expected to see an evaluation of their impact when applied to the late fusion setup as well. This raises questions about the completeness of the experimental analysis. A discussion of this design choice is necessary, and it would strengthen the paper to include comparative results or to justify why such a configuration was not explored.
    • It would be helpful to include comparisons with other state-of-the-art methods for multi-view X-ray classification. While it is clear that the authors propose a novel self-supervised approach, situating it within the broader context of existing multi-view classification techniques would strengthen the evaluation and highlight its relative advantages.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    There are implementation details (particularly regarding the classification network) missing or insufficiently explained, which would hinder a clear and faithful reimplementation of the proposed strategy.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed self-supervised strategy based on pre-training a correspondence matrix and integrating it into the classification (attention) network is conceptually interesting. However, the experimental section requires clearer explanations in several areas, and the method would benefit from a more thorough comparison with existing approaches in the context of multi-view X-ray classification.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel approach for establishing robust correspondences across multiple X-ray views—an essential capability for accurate clinical evaluation. The proposed self-supervised pipeline removes the need for manual annotations by automatically constructing a many-to-many correspondence matrix between synthetic X-ray views, leveraging digitally reconstructed radiographs (DRRs) generated from unlabeled CT volumes. The methodology includes DRR-based correspondence generation, self-supervised pre-training, and the integration of correspondence cues into transformer-based architectures. Experimental results demonstrate strong performance on both synthetic and real-world datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article is well-organized, presenting a clearly defined methodology. Extensive experiments were conducted across multiple datasets. The results demonstrate that pretraining on correspondence notably improves classification accuracy, particularly when the correspondence information is incorporated into transformer-based architectures.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper would benefit from providing additional details about the transformer architecture used for correspondence prediction, such as the number of transformer blocks and any architectural modifications. Furthermore, including more qualitative assessments or visual results would strengthen the empirical evaluation and offer deeper insights into the model’s performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The article is well-organized, presenting a clearly defined methodology. Extensive experiments were conducted across multiple datasets. The results demonstrate that pretraining on correspondence notably improves classification accuracy. Additional details about the transformer architecture used for correspondence prediction—such as the number of standard transformer blocks—should be provided. Including more qualitative results or visual assessments would further enhance the evaluation. The image quality also needs improvement, and Figure 3 is not referenced anywhere in the main text, which should be addressed.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The article is well-organized, presenting a clearly defined methodology. Extensive experiments were conducted across multiple datasets. The results demonstrate that pretraining on correspondence notably improves classification accuracy.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors promise to publish the code and qualitative results upon acceptance.




Author Feedback

We thank the reviewers for their thoughtful comments and address each concern below:

  • Reproducibility, method and implementation details (R1, R2, R3) Regarding reproducibility, we apologize to the reviewers: we indeed forgot to mention that, upon acceptance, the code will be released, including data generation, training scripts, and architecture configurations. To further improve clarity, we will also update Section 4.1 to provide more implementation specifics. Our classification model uses a ViT-S (6 layers, 4 heads, hidden size 384). All model modules were pre-trained on synthetic DRR correspondence data and subsequently fine-tuned for the binary fracture classification task. Each transformer layer uses a distinct learnable α for attention guidance. Importantly, our attention-bias module is modular and can be integrated into other attention-based architectures, making it broadly applicable to classification, segmentation, and detection tasks in multi-view settings. We chose classification as a proof-of-concept due to the availability of a well-established public dataset.
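    The configuration stated in this response (ViT-S with 6 layers, 4 heads, hidden size 384, and one learnable α per layer) can be summarized in a short sketch. The class and field names below are hypothetical, not from the authors' code, and the initial α value is an assumption the rebuttal does not specify:

    ```python
    from dataclasses import dataclass

    @dataclass
    class GuidedViTConfig:
        """Hypothetical config mirroring the rebuttal's stated setup."""
        num_layers: int = 6      # ViT-S depth
        num_heads: int = 4
        hidden_size: int = 384
        alpha_init: float = 1.0  # initial value per layer (assumed)

        def init_alphas(self):
            # One distinct learnable attention-guidance weight per layer.
            return [self.alpha_init for _ in range(self.num_layers)]
    ```

    In a real implementation each entry would be a trainable scalar parameter rather than a plain float, so every attention layer can learn how strongly to trust the correspondence bias.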

  • Why attention guidance is not applied in late fusion (R2, R3) Attention guidance requires token-level interactions across views, which are supported by early fusion. In contrast, late fusion processes each view independently for most of the network, rendering patch-level guidance inapplicable. We will clarify this point in the revised version.

  • Correspondence matrix integration and dimension compatibility (R2, R3) In early fusion, we concatenate the N patch tokens from both views A and B into a 2N×D sequence, producing a 2N×2N attention map split into four N×N blocks: AA, BB (self-attention), AB, and BA (cross-attention). We add the learned correspondence matrix to AB and its transpose to BA as a soft bias before the softmax in each attention layer, guiding cross‑view focus. We will add these details in the revised version of the manuscript. They have also been fully implemented in the code.
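    The block-wise biasing described above can be sketched as follows. This is a minimal single-head NumPy illustration with no learned Q/K/V projections; the function and variable names are illustrative, not taken from the authors' released code:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def guided_attention(tokens_a, tokens_b, corr, alpha=1.0):
        """Early-fusion attention with a cross-view correspondence bias.

        tokens_a, tokens_b: (N, D) patch tokens from views A and B.
        corr: (N, N) correspondence matrix (A-to-B); its transpose
              biases the B-to-A block.
        """
        n, d = tokens_a.shape
        x = np.concatenate([tokens_a, tokens_b], axis=0)  # (2N, D)
        scores = (x @ x.T) / np.sqrt(d)                   # (2N, 2N) map
        # The 2N x 2N map splits into four N x N blocks: AA, AB, BA, BB.
        scores[:n, n:] += alpha * corr    # AB block biased by C
        scores[n:, :n] += alpha * corr.T  # BA block biased by C^T
        return softmax(scores, axis=-1) @ x
    ```

    Because the bias is added before the softmax, it acts as a soft prior steering cross-view attention toward corresponding patches without hard-masking the self-attention (AA, BB) blocks.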

  • Early vs. late fusion performance clarity (R2, R3) Thank you for this remark. We will correct our statement to clarify that “with correspondence pretraining, early fusion surpasses late fusion” (Page 8).

  • Figure references (R2, R3) We will improve the quality of all figures and revise the manuscript to explicitly reference Fig. 1 and 3.

  • Clinical relevance not clear in the introduction (R3) Thank you for pointing this out. We will strengthen the introduction by citing clinical studies that show the importance of multi-view diagnosis and how multi-view X-rays improve diagnostic accuracy, for example: Cheung et al. (J Hand Surg Br, 2006) and Brandser et al. (AJR Am J Roentgenol, 2000).

  • Binary or multi-class classification (R3) All fracture classification tasks in our work are binary (fracture vs. no fracture). We will clarify this in Section 4.1 and note that Cohen’s Kappa was used to evaluate binary label agreement.

  • More comparisons (R3) Our main goal was to demonstrate the effectiveness of correspondence pretraining and attention guidance using an attention-based model, not to benchmark against every multiview method. Since our modules are plug‑and‑play, they can enhance any attention‑based architecture. In future work, we plan to integrate and benchmark our approach on models such as Cross‑View Transformers (van Tulder et al., MICCAI 2021) and Cross‑View Deformable Transformers (Li et al., MICCAI 2023) to further validate the generality and utility of our method. We will add this point in the Conclusion of the revised manuscript.

  • More qualitative and visualization results (R1) We agree that additional visualizations would be valuable. We have already generated several informative figures but could not include them due to space limits. These will be available in the public repository and may appear in an extended journal version.

We thank the reviewers and believe these changes will strengthen our manuscript.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


