Abstract

Breast cancer is one of the leading causes of mortality among women worldwide. Early detection and risk assessment play a crucial role in improving survival rates. Therefore, annual or biennial mammograms are often recommended for screening in high-risk groups. Mammograms are typically interpreted by expert radiologists based on the Breast Imaging Reporting and Data System (BI-RADS), which provides a uniform way to describe findings and categorizes them to indicate the level of concern for breast cancer. Recently, machine learning (ML) and computational approaches have been developed to automate and improve the interpretation of mammograms. However, both BI-RADS and the ML-based methods focus on the analysis of data from the present and sometimes the most recent prior visit. While it has been shown that temporal changes in image features of longitudinal scans are valuable for quantifying breast cancer risk, no prior work has systematically studied this. In this paper, we extend a state-of-the-art ML model to ingest an arbitrary number of longitudinal mammograms and predict future breast cancer risk. On a large-scale dataset, we demonstrate that our model, LoMaR, achieves state-of-the-art performance when presented with only the present mammogram. Furthermore, we use LoMaR to characterize the predictive value of prior visits. Our results show that longer histories (e.g., up to four prior annual mammograms) can significantly boost the accuracy of predicting future breast cancer risk, particularly beyond the short term. Our code and model weights are available at https://github.com/batuhankmkaraman/LoMaR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3369_paper.pdf

SharedIt Link: https://rdcu.be/dV18K

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_41

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3369_supp.pdf

Link to the Code Repository

https://github.com/batuhankmkaraman/LoMaR

Link to the Dataset(s)

https://snd.se/en/catalogue/dataset/2021-204-1

BibTeX

@InProceedings{Kar_Longitudinal_MICCAI2024,
        author = { Karaman, Batuhan K. and Dodelzon, Katerina and Akar, Gozde B. and Sabuncu, Mert R.},
        title = { { Longitudinal Mammogram Risk Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {437--446}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces LoMaR, a breast cancer risk assessment model that uses multiple prior images to predict the risk of breast cancer. The method is trained and evaluated on a large public dataset and shown to outperform Mirai, a state-of-the-art risk assessment model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper explores the use of multiple priors and shows its benefit. The paper is generally clear and well-written and experiments are generally well performed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think the paper is generally well written and the experiments are set up well; however, it can be improved in the following ways:

    • It would be good to provide a clearer description of the data, so the model can be compared to state-of-the-art work. The authors mention the dataset that is used, but it would be good to provide more detail on the exact data characteristics of the splits used for validation.
    • I think the experiment section can be improved by following the experimental setup used in the two most similar papers (Yala et al. [20] and Lee et al. [12]), which make use of confidence bounds obtained using bootstrapping, and of DeLong tests and other statistical tests to compare different models. It is unclear why the authors elected to use bootstrapping instead.
    • I think the experimental section can be improved further by performing more analysis/interpretation of the results. The localization plot is interesting, but is this really relevant for risk prediction? We are often interested in detecting inherent risk, not early signs of cancer, which can be detected with a CAD solution (see e.g. https://arxiv.org/pdf/2007.05791.pdf), for which localization might not be relevant. It is difficult to imagine that 5-year risk is improved because of better localization, as there may not be any visible lesion 5 years before the disease is detected. Additionally, the use of Grad-CAM is somewhat controversial (see e.g., https://arxiv.org/abs/1810.03292). If the authors elect to proceed with that analysis, I would recommend motivating the merit of better localization from a clinical application standpoint (how would a radiologist/radiographer benefit from better localization?). An alternative would be to identify subgroups (e.g., specific age or density categories) for which a longer history can be beneficial, or perhaps to discuss different types of density progression.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please find below some more suggestions to improve the paper

    Title

    • The title captures the meaning of the paper, but I think this can be made more descriptive. In my opinion it should include the main point of the paper which is to include multiple prior images.

    Abstract

    • “While it is clear that temporal changes in image features of the longitudinal scans should carry value for quantifying breast cancer risk, no prior work has conducted a systematic study of this.”

    It would be good to tone this down as there is some prior work on this topic already.

    Introduction

    • “Breast cancer is the most prevalent cancer worldwide”

    I would also tone this down a bit by rephrasing it to ‘one of the most’, and cite a clinical paper instead. The cited paper talks about incidence, not prevalence (incidence = number of new cases, prevalence = existing cases).

    • “In response, recent advances have seen the emergence of machine learning (ML) based algorithms that analyze mammographic data for breast cancer risk prediction.”

    I would make this a bit more clear by mentioning that newer models use image data specifically. Note that TC and Gail also use machine learning.

    Experiments

    • In the first paragraph of 2.3, the authors mention “we employed a training strategy with multiple techniques to mitigate overfitting and improve model performance.” and then describe a way in which training data is composed. This can be made more clear I think. Is this one of the techniques? What are the other techniques?

    • In section 2.4 it is stated that “To counter potential biases, we employ an inference strategy designed to mitigate these discrepancies.” It would be good to clearly specify which biases are mitigated this way.

    • In section 3.2 the authors write that “we note that LoMaR shows a progressive improvement in long-term prediction performance as the history duration increases” This is an interesting observation, however there could be a bias in the results as I can imagine we can only collect 5 year history from older patients. The authors do mention a bias mitigation strategy but it is not clear to me how this would prevent it. Could you kindly elaborate on this?

    Lastly, I think the conclusion could be improved a bit by discussing the clinical merit or any shortcomings of the current approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper is interesting and explores a novel use of data. However, it can be improved using some of the suggestions provided in this review.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose a new breast cancer prediction model (LoMaR) for screening programs. The model includes mammograms performed in previous years as part of its input. It is based on Transformers and improves on a previous model called Mirai, considered state of the art, which only uses the current mammogram for prediction. The comparison with previous results is made in terms of ROCAUC across different scenarios. Three DL-based procedures are compared: one using the current image, Mirai, and LoMaR, in different situations: only the current image, and the current image together with images from different numbers of previous years.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is a robust method that includes information from previous years and allows predictions up to 5 years ahead, with ROCAUC results higher than 0.92 for prediction at the first year. The model has been developed using the Karolinska dataset, which includes 19328 mammograms (1413 diagnosed with cancer) from 7353 patients.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is that the results achieved are slightly better than those obtained with Mirai. Given that the information used is much higher (up to 4 years of previous cases), it does not seem that the developed model can take full advantage of the power used to obtain it, both in images (up to 4 years as mentioned above) and in model complexity (it uses Transformers). It is striking that in terms of Grad-CAM maps, the results achieved are quite similar to the radiologist’s annotations. However, this may be due to the presentation of selected cases. Perhaps it would have been better to present cases where it works well and where it does not.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The use of previous cases when diagnosing mammograms in breast screening programs is a common practice among radiologists. Therefore, the construction of mathematical models that take this fact into account seems a priori to be a good idea. The proposed method does so, but the results achieved in terms of improvement in ROCAUC are not as relevant as they might seem a priori. Therefore, this increase in complexity in the proposed algorithm is not reflected in the results achieved, and this raises the suspicion that the proposed increase in complexity is not being fully exploited. Perhaps some kind of experiment is missing to bring to light what is happening and why this increase in complexity does not translate into a clear improvement in the results. On the other hand, the section “Improved localization of cancer with past mammograms”, at the end of the article, which seems to visually demonstrate the improvement obtained, is too short and it is difficult to extract conclusions from it, given that only results from some of the selected cases are presented.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting article, where a new method for breast cancer prediction is presented, using images from up to 4 years before the current one. The method has been trained with a sufficient number of cases, but the results obtained slightly improve those obtained previously that represent the state of the art (Mirai).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes a longitudinal risk model for screening mammography. A multi-stage architecture is proposed, comprising transformer modules to encode the complete patient history as well as a survival prediction module. Extensive experiments were performed to demonstrate the performance of the proposed model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper is well organized and well written, making it easy to follow
    • The longitudinal risk modeling of screening mammography is a clinically relevant and challenging task
    • This work seems to propose a new state of the art for risk modeling using a novel model architecture.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Adding directions for future improvement would make this work more complete.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In 2.3 Training: providing a schematic for how the 10-year dataset was constructed would make it easier to understand.

    In “Improved localization of cancer with past mammograms”, could the authors comment more on the Grad-CAM visualization? Based on the Grad-CAM without history data, it seems that the model is picking up breast density as the main predictor, and achieved > 0.8 AUC.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well written paper on a clinically relevant task of longitudinal mammography risk modeling. Both the methodology and experimental aspects are well detailed in the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their constructive feedback. Upon publication, we will share a GitHub repository for our model to facilitate further research. We address the reviewers’ points below, organized by section and noting the specific reviewer (R#) who raised each.

R1 suggests detailing subject characteristics and conducting subgroup analyses (e.g., by age, density categories) during inference. We value this suggestion and plan to collect and incorporate more detailed demographic and risk factor data in future studies to enable these analyses.

R1 highlights the potential impact of variations in the characteristics of patients with more extensive historical data on our results. Thanks for pointing this out. We will address this in our revised discussion and look into it in future studies.

R1 suggests improving the experiments section by adopting the statistical methods used in similar works. Our evaluation method is similar to bootstrapping, particularly in how we construct each pseudo test set by randomly choosing one present point from every subject. This approach, combined with 10-fold cross-validation, is designed to enhance the robustness and reliability of our results. We believe this provides a statistically rigorous framework for evaluating model performance. We acknowledge that incorporating statistical tests, such as those used in the referenced works, could be further informative. We will include the result of a DeLong test between the fifth-year ROCAUC scores of LoMaR (4*) and LoMaR (no history) in the revised version of the paper.
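
For readers unfamiliar with this style of evaluation, the following is a minimal sketch of how pseudo test sets of this kind could be built and scored. It is not the authors' code: the function names (`roc_auc`, `pseudo_test_set`, `auc_over_pseudo_sets`), the toy data, the number of pseudo sets, and the percentile interval are all assumptions made for illustration, and the 10-fold cross-validation mentioned above is left out.

```python
import random
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pseudo_test_set(visits_by_subject, rng):
    """Draw one (label, score) 'present' visit at random from every subject."""
    return [rng.choice(visits) for visits in visits_by_subject.values()]


def auc_over_pseudo_sets(visits_by_subject, n_sets=200, seed=0):
    """Mean AUC and a rough 95% percentile interval over repeated pseudo test sets."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_sets):
        sample = pseudo_test_set(visits_by_subject, rng)
        labels = [y for y, _ in sample]
        scores = [s for _, s in sample]
        if 0 < sum(labels) < len(labels):  # skip draws where one class is missing
            aucs.append(roc_auc(labels, scores))
    aucs.sort()
    lo = aucs[int(0.025 * (len(aucs) - 1))]
    hi = aucs[int(0.975 * (len(aucs) - 1))]
    return mean(aucs), (lo, hi)


# Hypothetical toy data: subject id -> list of (cancer label, predicted risk) per visit.
toy = {
    "s1": [(0, 0.10), (0, 0.20)],
    "s2": [(0, 0.30), (1, 0.35)],
    "s3": [(1, 0.80), (1, 0.90)],
    "s4": [(0, 0.40)],
}
mean_auc, (lo, hi) = auc_over_pseudo_sets(toy)
print(f"mean AUC = {mean_auc:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Sampling one visit per subject keeps the visits within a pseudo test set independent across subjects, which is the property the feedback above likens to bootstrapping. A paired test such as DeLong would additionally need both models' scores on the same visits.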

R3 notes that the ROCAUC results suggest the model may not be fully utilizing the longitudinal input. Although LoMaR shows up to a 10% ROCAUC improvement over Mirai in the fifth follow-up year for tests excluding screen-detected cancers, as detailed in the supplementary material, we acknowledge the potential for further model enhancement. Our initial focus was on assessing the impact of longitudinal mammogram history on prediction accuracy, which included a grid search to optimize hyperparameters. Moving forward, we will explore alternative architectures, expand data sources across multiple sites, and incorporate more longitudinal data types to enhance LoMaR’s performance. Furthermore, we plan to collaborate closely with clinical experts to conduct a more detailed analysis of why LoMaR demonstrates more substantial performance improvements over longer time horizons.

To address R1’s concerns regarding the relevance and use cases of localizations: We agree that 5-year risk might be dominated by non-localizable features. However, we still believe localization can be informative for two reasons: 1) It can enhance the interpretability of LoMaR’s predictions, which can be important for validation and real-world deployment; and 2) It can assist radiologists by highlighting areas of interest, helping to focus their attention, and potentially reducing the likelihood of misreading scans. We will update our manuscript to include these points. We also agree that GradCAM can have shortcomings. We will add these points in our revised discussion.

Regarding the cases in Figure 2, highlighted by R3: We have shown a representative set of test subjects with breast cancer where historical mammograms corrected LoMaR’s predictions. In future work, we intend to perform a comprehensive analysis of localization results across TP/TN/FP/FN classes, with input from clinical experts.

Regarding the feedback from R3 and R4 about the lack of detail concerning the observed trends from the GradCAMs: We acknowledge this oversight and will extend our discussion in the revision.

R1 and R4 recommend that our conclusion discuss clinical merits and shortcomings. Thank you for pointing these out. We will include them in the revised discussion.

Finally, we will implement the minor changes and clarifications suggested by R1.




Meta-Review

Meta-review not available, early accepted paper.


