Abstract

Background parenchymal enhancement (BPE) classification for contrast-enhanced mammography (CEM) is highly affected by inter-reader variability. Traditional approaches aggregate expert annotations into a single consensus label to minimize individual subjectivity. By contrast, we propose a two-stage deep learning framework that explicitly models inter-reader variability through self-trained, reader-specific embeddings. In the first stage, the model learns discriminative image features while associating each reader with a dedicated embedding that captures their annotation signature, enabling personalized BPE classification. In the second stage, these embeddings can be calibrated using a small set of CEM cases selected through active learning and annotated by either a new reader or a consensus standard. This calibration process allows the model to adapt to new annotation styles with minimal supervision and without extensive retraining. This work leverages a multi-site CEM dataset of 7,734 images, non-exhaustively annotated by several readers. Calibrating reader-specific embeddings on a set of 40 cases achieves an average accuracy of 73.5%, outperforming a baseline based on reader consensus. This approach enhances robustness and generalization in clinical environments characterized by heterogeneous labeling patterns.
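
The sketch below illustrates, for intuition only, the kind of reader-conditioned classifier the abstract describes: a shared CNN backbone produces image features that are concatenated with a learned per-reader embedding before BPE prediction. This is not the authors' code; the ResNet-18 backbone, embedding size, and the assumption of four BPE categories are illustrative choices.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ReaderConditionedBPE(nn.Module):
        """Illustrative stage-1 model: shared image features + per-reader embedding."""
        def __init__(self, n_readers: int, n_classes: int = 4, emb_dim: int = 16):
            super().__init__()
            backbone = models.resnet18(weights=None)            # stand-in CNN feature extractor
            feat_dim = backbone.fc.in_features
            backbone.fc = nn.Identity()                         # keep the pooled features only
            self.backbone = backbone
            self.reader_emb = nn.Embedding(n_readers, emb_dim)  # one learned row per reader
            self.head = nn.Linear(feat_dim + emb_dim, n_classes)

        def forward(self, image: torch.Tensor, reader_id: torch.Tensor) -> torch.Tensor:
            feats = self.backbone(image)                        # (B, feat_dim) image features
            cond = self.reader_emb(reader_id)                   # (B, emb_dim) reader signature
            return self.head(torch.cat([feats, cond], dim=1))   # reader-specific BPE logits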

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3980_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{RipElo_Deep_MICCAI2025,
        author = { Ripaud, Elodie and Jailin, Clément and Milioni de Carvalho, Pablo and Vancamberg, Laurence and Bloch, Isabelle},
        title = { { Deep Learning Framework for Managing Inter-Reader Variability in Background Parenchymal Enhancement Classification for Contrast-Enhanced Mammography } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {132 -- 142}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study aims to develop a novel deep learning-based framework for the accurate assessment of Background Parenchymal Enhancement (BPE) in CEM. The goal is to reduce inter-reader variability and improve the consistency of radiological evaluations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A key strength of this work lies in its incorporation of multi-reader annotations during model training and supervision. By leveraging diverse expert perspectives, the proposed method seeks to improve robustness and generalizability in BPE assessment, which is a thoughtful and innovative direction.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The abstract does not clearly state the clinical significance of the proposed study, which limits the understanding of its real-world impact.
    2. The introduction fails to highlight the novelty of the work in the context of existing literature.
    3. The manuscript suffers from several writing issues, with many sentences being difficult to interpret, which affects overall readability and clarity.
    4. The methodological innovation appears limited and lacks substantial improvement over prior approaches.
    5. The process of obtaining the reader-specific embeddings is not sufficiently explained, making it difficult to assess the validity and relevance of this component.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1. The manuscript needs significant improvement in articulating and emphasizing its core innovations.
    2. The methodological novelty is relatively weak and should be further enhanced.
    3. From a writing perspective, the paper contains too many long and complex sentences, which hinder readability. It is recommended to simplify the language by using shorter, clearer sentences. Leveraging effective writing-assist tools could help improve the manuscript’s overall clarity and readability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The primary concerns with this manuscript lie in the overall writing quality and the lack of methodological innovation. Both aspects need significant improvement to meet the standards of a top-tier conference like MICCAI.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a two-stage deep learning framework designed to address inter-reader variability in BPE assessment. The first stage learns discriminative image features and reader-specific embeddings that reflect individual labeling styles. The second stage calibrates these embeddings using a limited number of annotations to enable consistent, standardized, or site-specific assessments.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is generally well-structured, with a clear and coherent introduction that effectively sets the context. The schema of the proposed framework is well-designed and contributes positively to the reader’s understanding of the methodology.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although CEM is a high-resolution modality that provides fine-grained details, the study applies severe downscaling to 570×479 pixels, which may result in a significant loss of critical information, especially for small lesions such as micro-calcifications. Instead of relying on a CNN pre-trained on the ImageNet dataset, I recommend that the authors consider pre-training on 2D mammograms to reduce the representation gap between the source and target domains and to ease the constraints on input size and computation time. Additionally, presenting all evaluation metrics in a table would greatly improve clarity and make it easier to compare results across different methods.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A few comments regarding the methodological choices: while the approach is promising, certain design decisions would benefit from further justification (see weaknesses for more detail)

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of the paper is a lightweight technique to tailor a Background Parenchymal Enhancement (BPE) classification model to individual readers. A CNN classifier is augmented with a reader-specific embedding trained on multiple readers. In the calibration stage, the model is adapted to a new reader by freezing the classifier and adapting only the reader-specific embedding on a small dataset. Representative-based and uncertainty-based sampling are combined to select relevant samples for the calibration. Experiments show that the proposed technique can achieve an average balanced accuracy of 73.5% without extensive retraining.
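
    As a concrete illustration of the calibration stage summarized above, the sketch below freezes a trained reader-conditioned model and fits only a fresh embedding vector for the new reader on a small, actively selected calibration set. It assumes, for illustration, a model exposing backbone, reader_emb, and head submodules; the optimizer settings and epoch count are placeholders rather than the authors' actual configuration.

      import torch
      import torch.nn as nn

      def calibrate_new_reader(model, calib_loader, epochs: int = 20, lr: float = 1e-2):
          """Illustrative stage-2 calibration: freeze the image branch and the
          classification head, then optimize only a new reader embedding."""
          model.eval()                                   # keep BatchNorm statistics fixed
          for p in model.parameters():
              p.requires_grad = False                    # freeze everything already trained
          new_emb = nn.Parameter(torch.randn(model.reader_emb.embedding_dim))
          opt = torch.optim.SGD([new_emb], lr=lr)
          loss_fn = nn.CrossEntropyLoss()
          for _ in range(epochs):
              for image, label in calib_loader:          # e.g. ~40 actively selected cases
                  feats = model.backbone(image)
                  cond = new_emb.unsqueeze(0).expand(feats.size(0), -1)
                  logits = model.head(torch.cat([feats, cond], dim=1))
                  loss = loss_fn(logits, label)
                  opt.zero_grad()
                  loss.backward()
                  opt.step()
          return new_emb                                 # calibrated reader-specific vector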

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces an “adapter” that can be used to tailor a classifier to individual readers’ scoring. Many classifications in radiology have an element of subjectivity. While establishing a more objective reference standard is certainly desirable, the possibility of dynamically adjusting a classifier to reflect the preferences of individual radiologists (as opposed to training it on a consensus or majority voting) could prove an interesting direction. I can see this opening new directions not only in terms of adjusting the classification to individual readers, but also in conjunction with, e.g., ensemble learning.

    • The proposed methodology is clear and easy to apply, with clear guidelines for selecting the calibration dataset.

    • The dataset is diverse, including CEM images from different imaging systems annotated by eight readers.

    • Ablation studies are conducted on the embedding size and on the calibration dataset size.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The manuscript sometimes makes conflicting claims regarding the role of inter-rater variability. The introduction calls for a more objective and reproducible method to decrease variability, yet the proposed method aims precisely at making inter-rater variability reproducible by producing classifiers tailored to individual readers. On page 7, the manuscript suggests that creating a reference standard based on a consensus of different readers can mask significant disagreements and important nuances, framing inter-rater variability almost as an advantage, which seems to be in contrast with the introduction. I would suggest that the authors provide a clearer narrative as to how the proposed method could be used to tackle the problem of high inter-rater variability, since the goal of the calibration is precisely to reproduce such high inter-rater variability.

    • The evaluation is performed on only two readers. Specifically, the baseline is trained on N=7 readers and the calibration is performed on the eighth reader. While the methodology is sound, extending to multiple readers would clearly strengthen the results.

    • The clarity of the manuscript could be improved, although modifications are likely minor.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • In mammographic density classification, variability in individual assessment may also be due to different national legislations, guidelines and/or training [1]. I wonder if similar tendencies could emerge for BPE as well.
    • Please provide the hyper-parameters and optimizers used for all training runs.
    • In Eq. 1, it is not clear how the mean of the model scores is established (prior to calibration?).
    • In Eq. 1, what are the clusters c_i and how are they computed? How is the number of clusters determined? Do the clusters have clinical significance?
    • In the baseline, instead of training the embedding, the final classification layers are trained on the specific reader, keeping the feature extractor frozen. Why not use the multi-reader model as a baseline?
    • Are there any differences between images from different vendors?
    • At page 6, the authors compare against a single CEM BPE model in literature (ref. [23] in the paper). This comparison should be further clarified: since the datasets and experimental settings are different, it is unlikely that a direct comparison is particularly meaningful.
    • The experiments were performed on only two readers. While expanding the reader pool would certainly be ideal, additional information about the relative behaviour of R1/R2 could be beneficial. Judging from Fig. 2, R1 and R2 have similar distributions compared to R3 or R4. It would be interesting to understand the effectiveness of the calibration with respect to how “far” the reader is from the “average” or “consensus” reader.

    [1] Alomaim, Wijdan, et al. “Variability of breast density classification between US and UK radiologists.” Journal of Medical Imaging and Radiation Sciences 50.1 (2019): 53-61.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While most of the literature focuses on machine learning as a way to reduce inter-rater variability by computing a more standardized assessment, the proposed method clearly departs from this by exploring ways to adapt the prediction to individual readers. To the best of my knowledge, this approach is novel and could spark interesting discussions. The methodology is overall sound, and the experiments, while limited with respect to the number of readers, include sensitivity studies on the most critical parameters.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Taking into account the comments from fellow reviewers and the authors’ rebuttal, I confirm my initial assessment.




Author Feedback

We thank the reviewers for their insightful comments. Below, we concisely clarify the major points raised:

Novelty regarding inter-reader variability (R2) - The key novelty of our approach lies in its use of self-trained reader-specific embeddings to explicitly model individual annotation styles. In contrast to the existing literature, which typically enforces a single ground truth via an aggregation method to reduce inter-reader variability, our method models this variability by conditional learning. To our knowledge, this original approach improves state-of-the-art results in BPE classification.

In addition, by decoupling the feature extractor from the reader-specific embedding, we aim to ensure that the learned image representations remain robust to annotation inconsistencies. Finally, our calibration stage efficiently adapts the trained model to new readers or clinical consensus standards using only a few annotated cases. This represents a crucial advantage in clinical practice where large-scale multi-reader annotations are prohibitively costly or impractical.

The proposed method offers an original way of managing unavoidable variability, substantially improving the classification score, while requiring minimal model modification. We believe this approach will strongly interest the MICCAI community, where managing annotation inconsistencies is frequently encountered.

Clinical significance (R2) - We agree the clinical relevance was understated. We will revise the abstract and introduction to clearly emphasize our method’s practical clinical advantage: enabling flexible, scalable, and cost-effective deployment in diverse clinical environments by minimizing the need for exhaustive annotations and standardizing BPE classification across sites or readers.

Manuscript clarity - In line with R2’s comment, we commit to simplifying complex sentences and improving narrative flow. Additionally, we agree with R3’s suggestion to summarize key results in tables, enhancing readability. As pointed out by R1, the manuscript may present ill-defined claims regarding the role of inter-reader variability. A clearer narrative will be provided by explicitly stating that the proposed method aims to produce classifiers tailored to individual readers from heterogeneous multi-reader annotations.

Embedding training (R2) - We clarify that the embeddings are implemented as a trainable weight matrix, where each reader is associated with a row. They are initialized from a standard normal distribution and optimized jointly with the network parameters via backpropagation, using an SGD optimizer. This joint learning ensures that the embeddings evolve to reflect meaningful inter-reader variability. To assess their impact, we conducted an ablation study by varying the embedding dimension.
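
A minimal PyTorch sketch of this description is given below for concreteness; the sizes, the placeholder classification layer, and the optimizer hyper-parameters are illustrative assumptions, not the actual implementation.

    import torch
    import torch.nn as nn

    n_readers, emb_dim = 7, 16                               # illustrative sizes
    reader_emb = nn.Embedding(n_readers, emb_dim)            # one trainable row per reader
    nn.init.normal_(reader_emb.weight, mean=0.0, std=1.0)    # standard normal initialization

    classifier = nn.Linear(512 + emb_dim, 4)                 # placeholder for the rest of the network
    optimizer = torch.optim.SGD(                             # joint optimization via backpropagation
        list(classifier.parameters()) + list(reader_emb.parameters()),
        lr=1e-3, momentum=0.9,
    )

    # During training each sample is paired with its reader's index, so only the
    # corresponding embedding rows receive gradients at each step.
    reader_ids = torch.tensor([0, 3, 5])
    rows = reader_emb(reader_ids)                            # shape (3, emb_dim)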

Model design (R3) - R3 raised concerns regarding the choice of image resolution and the pretraining strategy. We agree that image resolution can significantly impact model performance. A study investigating its effect on BPE classification [23] has shown that the selected image size offers a good balance between computational efficiency and classification accuracy. The BPE task primarily assesses large-scale enhancement patterns rather than fine-grained details (e.g., micro-calcifications). Thus, this resolution optimally preserves relevant information for BPE classification. We agree with R3’s comment that domain-specific pretraining may further boost performance. It constitutes an interesting future research direction, yet no publicly available BPE-specific datasets currently exist.

Model evaluation (R1) - We agree that extending the evaluation to more readers would strengthen the study and will be considered in future work.

We sincerely thank all reviewers again for their helpful suggestions. We believe our framework provides an innovative and practical approach for adapting a BPE classifier to any clinical reference using multi-reader annotations.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Reviewer comments are diverse. The authors need to carefully address the concerns, especially from reviewers #2 and #3.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal addresses several issues. However, taking the reviewers’ concerns into account, the overall writing quality and the limited methodological innovation remain unaddressed. I recommend rejecting this paper; the authors need to improve their work for a future submission.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After reading the rebuttal and the reviewers’ comments, I think this is an interesting and novel idea. As pointed out by reviewer #1 from the start of the review process, existing methods tend to deal with variability by standardising annotations; this paper takes the opposite approach. For that reason alone, together with the promising results and the novel technique, I believe the paper deserves acceptance. Furthermore, the authors have agreed to revise the manuscript, if accepted, to improve clarity (which the paper lacks at times).

    My only minor concern is that I am still unsure of the clinical application, even if I understand that the goal is to provide annotations that mimic specific experts who might not have time to annotate themselves. In the case where a hospital has multiple experts (N > 4) and the method can replicate all of them, what would be the way to proceed in cases with major discrepancies? I know this is probably out of the scope here, but I think it is an important question to explore in future work.


