Abstract
Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation could be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and the domain gap between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as the multi-view relationships in mammography. Unlike radiologists, who analyze both views together to process ipsilateral correspondence, current methods either treat the views as independent images or fail to properly model multi-view correspondence, losing critical geometric context and yielding suboptimal predictions. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings. The code is available at https://github.com/XYPB/GLAM.
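For readers skimming the abstract, the joint objective can be illustrated with a minimal PyTorch sketch. This is an illustrative assumption, not the authors' released implementation (see the linked repository for that): the tensor names, the geometry-guided patch correspondence match_idx, and the loss weighting are hypothetical placeholders.

# Minimal sketch of a joint global + local contrastive objective, in the
# spirit of the abstract. Hypothetical placeholders: img_emb / txt_emb are
# global embeddings; cc_patches / mlo_patches are patch features from the two
# views; match_idx encodes an assumed geometry-guided CC-to-MLO correspondence.
# This is NOT the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (N, N) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def glam_style_loss(img_emb, txt_emb, cc_patches, mlo_patches, match_idx,
                    w_local: float = 1.0) -> torch.Tensor:
    # Global visual-language alignment (CLIP-style).
    global_loss = info_nce(img_emb, txt_emb)
    # Local visual-visual alignment: pull each CC patch toward its
    # geometry-matched MLO patch; other patches act as negatives.
    matched_mlo = mlo_patches[match_idx]           # reorder MLO patches to pair with CC
    local_loss = info_nce(cc_patches, matched_mlo)
    return global_loss + w_local * local_loss

In GLAM itself, the local branch also involves visual-language alignment and the patch correspondence is derived from the mammographic projection geometry; the sketch conveys only the joint global-plus-local structure of the objective.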
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2415_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/XYPB/GLAM
Link to the Dataset(s)
N/A
BibTex
@InProceedings{DuYue_GeometryGuided_MICCAI2025,
  author    = {Du, Yuexi and Chen, Lihui and Dvornek, Nicha C.},
  title     = {{Geometry-Guided Local Alignment for Multi-View Visual Language Pre-Training in Mammography}},
  booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  volume    = {LNCS 15965},
  month     = {September},
  pages     = {305--315}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces GLAM, a novel framework that leverages geometric guidance to learn both global and local alignments for multi-view mammography. Pretrained on the extensive EMBED dataset, GLAM consistently outperforms baseline methods across multiple datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
It makes sense to leverage geometric guidance to learn both global and local alignments for multi-view mammography. The writing of the article is quite standard, and the experiments are thorough and well-detailed.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Please clarify the dataset splitting criteria. While EMBED uses a 70%/10%/20% split for training/validation/test, the RSNA-Mammo split uses 15% as the test set without a validation set.
- The paper mentions using tabular annotated data—could the authors provide examples or a detailed explanation of what this data includes?
- The paper compares to [9], which also focuses on multi-view alignment in mammography. It would be helpful to clarify the core differences between the two approaches. Additionally, a comparison with the method from Wang et al. (2023) is recommended. Wang, P., Wells, W.M., Berkowitz, S., Horng, S., Golland, P.: Using multiple instance learning to build multimodal representations. In: International Conference on Information Processing in Medical Imaging. pp. 457–470. Springer (2023)
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See above.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ answers largely resolved my concerns.
Review #2
- Please describe the contribution of the paper
The paper presents a self-supervised, multi-view mammography visual language framework that explores the alignment between patches of different mammographic views (CC, MLO). The paper performs a good experimental analysis with some (limited) baselines and public datasets, and a good ablation study to show the relative improvement of the proposed method. The main drawbacks are the comparison to a limited number of works in the state of the art, the limited number of datasets, and the lack of a thorough discussion of the method's potential clinical usefulness and applicability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Well written and structured paper
- The proposal of the paper is interesting and includes the methodological novelty of self-attention between patches of different mammographic views, in addition to a global CLIP model.
- Results show a benefit of the proposed method over other models, and ablation studies indicate the contribution of each component of the architecture.
- Datasets used are known datasets publicly available.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper does not include a complete review of the state of the art in the introduction; for instance (and this is not an exhaustive list), [r1, r2].
- Although some of these works are based on lesion detection or supervised paradigms, results using similar datasets should be discussed and commented on.
- Results on BI-RADS classification and on RSNA malignancy are not particularly high. This should be commented on and framed within existing works and potential clinical applicability. For instance, it would be interesting to compare against baselines such as [r3], or refs 6 or 31 of the paper, which are also multi-view approaches, or against existing AUC and bACC results in the literature on these datasets.
- Other datasets more extensively used in the literature (CSAW, DDSM, etc.) could also be reported, which would help comparison with the state of the art.
[r1] Manigrasso et al. Mammography classification with multi-view deep learning techniques: investigating graph and transformer-based architectures. Medical Image Analysis 99:103320, 2025. https://doi.org/10.1016/j.media.2024.103320
[r2] Chen et al. BRAIxDet: Learning to detect malignant breast lesion with incomplete annotations. Medical Image Analysis 96:103192, 2024. https://doi.org/10.1016/j.media.2024.103192
[r3] Shen et al. An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization. Medical Image Analysis, 2020.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The overall recommendation is based on the existing weaknesses, although the paper has a limited but interesting methodological contribution and is well structured and written, the lack of comparison and discussion with existing state of the art limits the interest and potential applicability of the work.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Taking into account reviewers evaluation and their rebuttal, I think there are no major drawbacks in the paper, and could be interesting for the conference. Main issue may be the incremental results compared to state of the art but it is well argued by the authors and, nevertheless, also an interesting methodologically.
Review #3
- Please describe the contribution of the paper
- GLAM: global and local alignment for multi-view mammography with geometry guidance
- A self-supervised, cross-view local patch alignment method that respects the CC and MLO projection relationship
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Great Figure 1; it sets up the work nicely.
- Using datasets from different populations is great.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Tables 1 and 2: the AUC metrics improve by, as stated, 2.3% AUC on average. The metrics in Table 2 are already quite high, but in Table 1 they are still below 70% AUC. It would be great to have more discussion of why the model still struggles, as the other models did; what is it about the data or the modeling that leaves room for growth on this dataset? (This could be a limitation.)
- There is no discussion of limitations or future generalizability to real notes. The methods cite the approach used to synthetically generate structured mammography reports; how might this generalize to real reports? Is this also a limitation?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I think this manuscript is well written and supported. It has great application not only in breast imaging but in any multi-modal imaging application. It also makes a contribution to the alignment of non-rigid bodies across views. With some further discussion of limitations and future work, this is great work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors addressed all my comments.
Author Feedback
We thank all reviewers for their constructive reviews and their appreciation of our novel methodology, strong experiments on diverse datasets, and paper quality. We clarify the main concerns below.

[Motivation] Our method (GLAM) aims to inject geometry-guided multi-view awareness into vision-language pretraining (VLP), not to develop a classification/detection method. Thus, we use a simple linear classifier for downstream tasks to assess embedding quality. The local contrastive branch provides fine-grained supervision, not multi-view feature fusion. GLAM complements existing multi-view classification and detection works, and its backbone can be fused with them for stronger multi-view awareness. Our baselines are VLP-focused, as ours is the first approach to supervise VLP with multi-view geometry.

[R1W1 More Literature] Thank you; the noted works fit under the global fusion designs discussed in our Introduction (p. 2). Specifically: [r1] uses global multi-view cross-attention without geometry guidance. As the ablation (Row 4, Table 5) shows, global attention leads to suboptimal performance. Also, [r1] follows the feature-addition scheme in the classification head, compromising local details. [r2] focuses on detection tasks and also uses a global feature fusion design in its local co-occurrence module, limiting local awareness.

[R1W2 Results on Similar Datasets] As rebuttal rules do not allow new experiments, we report results from prior work on similar datasets, which performed comparably or worse. [A] achieves 74.86% AUC on a binary BI-RADS task with VinDr, while GLAM achieves 74.82% AUC on the full 5-class task. [B] fine-tunes CLIP models and reports 66.72% AUC on EMBED (different split), while we achieve 67.34% AUC. Our density prediction also exceeds 90% AUC. [11] uses global multi-view alignment on VinDr and reports 88% AUC for density prediction, while ours reaches 93% AUC.

[R1W3, R2W1 Performance] We will discuss the current performance limitation in the revision. The moderate performance mainly stems from imbalanced real-world datasets (<2% positive in RSNA, <3% cancer in EMBED). [r3] used a filtered, balanced dataset (~1:1), which naturally yields higher metrics. We used a simple linear classifier for downstream tasks to focus on the quality of the pretrained embedding space. Our method outperforms other VLP methods, and we plan to improve absolute performance and test on more datasets.

[R1W4 More Datasets] Future work will test more datasets. We note that we evaluated on 3 diverse large-scale datasets: EMBED (the largest U.S. public dataset, 257k images), VinDr (20k images), and RSNA-Mammo (54k images), each with a different distribution.

[R2W2 Synthetic Reports] Synthetic reports follow radiologist guidelines [9] to closely mimic real reports. While some gap may remain, this performs better than tabular data [9], and no public dataset has real reports.

[R3W1 Split Criteria] As RSNA only provides training data, we split it into training and test sets. All models are trained for a fixed "8k steps" (p. 6) and tested directly, so there is no data leakage.

[R3W2 Tabular Data] We use the template from [9], which includes examples and code. The data covers imaging info (machine, purpose, view, side, procedure), patient data (age, race, ethnicity), and findings (density, BI-RADS, description of mass/calcification).

[R3W3 Difference from [9]] Like [5, 11], [9] "only conduct[s] global alignment, neglecting fine-grained multi-view local alignment" (p. 2).

[R3W3 Comparison to Wang et al.] Wang et al. focus on local vision-to-language alignment for single-view chest X-rays, not multi-view. While we cannot provide new results, [9] has shown Wang et al.'s method is weaker than the chosen MaMA baseline.

References: [A] Nguyen et al. "Towards robust natural-looking mammography lesion synthesis on ipsilateral dual-views breast cancer analysis." ICCV 2023. [B] de Moura et al. "Unlocking the Potential of Vision-Language Models for Mammography Analysis." ISBI 2024.
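For context on the evaluation protocol the rebuttal refers to ("a simple linear classifier for downstream tasks to assess embedding quality"), a minimal sketch of linear probing follows; the encoder, loaders, and dimensions are hypothetical stand-ins, not the paper's code.

# Sketch of linear-probe evaluation: freeze the pretrained encoder and train
# only a linear classifier on top, so downstream metrics reflect the quality
# of the embedding space rather than classifier capacity. `encoder`,
# `train_loader`, and the dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, train_loader, num_classes: int,
                 emb_dim: int, epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    encoder.eval()                                 # freeze the pretrained backbone
    for p in encoder.parameters():
        p.requires_grad_(False)
    head = nn.Linear(emb_dim, num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                # class weights could offset the
                                                   # heavy label imbalance noted above
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)            # frozen embeddings
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head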
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All reviewers acknowledged the contribution of the paper, while also raising specific concerns, including limited comparison to prior work, lack of discussion on clinical applicability, and missing details on datasets and limitations. Please address these points carefully in your rebuttal.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers have reached a consensus to accept the paper.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers have acknowledged the potential and merit of the work, and based on this, I vote for acceptance.