Abstract

Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify the data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding the overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler that dynamically calibrates synthetic-to-real data ratios by training a strategic multi-agent system to compensate for the scarcity of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed dataset and a public imbalanced breast US dataset demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast-LT-GenAug.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5051_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CheShi_Subtyping_MICCAI2025,
        author = { Chen, Shijing and Zhou, Xinrui and Wang, Yuhao and Huang, Yuhao and Chang, Ao and Ni, Dong and Huang, Ruobing},
        title = { { Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel framework for addressing the long-tailed subtype classification of breast lesions in ultrasound imaging using generative augmentation. The framework integrates a class-conditional latent diffusion model (LDM) guided by a sketch-based structural supervision module (SynSketch), and introduces a reinforcement learning-based class-adaptive sampler (RL-CAS) that dynamically adjusts the ratio of real and synthetic images per class during training. The method is evaluated on a large-scale in-house long-tailed dataset (Breast-LT-8) and a public dataset (BreastMNIST), showing improved performance on tail classes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel generative supervision with structural priors: The introduction of SynSketch, a sketch-guided structural perception branch for diffusion-based synthesis, brings structural awareness into the generation process. It allows the generator to produce images with more realistic anatomical features, which is especially useful in clinical contexts where fine-grained structure is important.

    Class-adaptive sampling via reinforcement learning: The RL-CAS module is an interesting attempt to dynamically calibrate the number of synthetic images per class. This goes beyond fixed oversampling or reweighting strategies, and aims to better balance head and tail class performance in long-tailed settings.

    Clinical relevance and real-world applicability: The use of biopsy-verified breast lesion subtype data and a clinically meaningful task (histological subtype classification) highlights the practical relevance of the approach.

    Comprehensive evaluation across long-tailed and binary datasets: Experiments are conducted on both a custom long-tailed dataset and a standard benchmark, providing a multi-angle assessment of the method’s performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of RL-CAS behavior analysis and sampling visualization: RL-CAS is designed to dynamically explore optimal real-synthetic data mixing strategies across classes. However, the paper does not show how the sampling strategy evolves during training, nor does it provide visualizations or statistics of class-wise sampling adjustments or real-vs-synthetic ratios. The absence of learning curves or policy progression significantly weakens the interpretability of this key module.

    Unclear motivation in the abstract: Although the paper focuses on addressing long-tailed breast lesion subtype classification with generative augmentation, several prior works already attempt similar goals using class-conditional or structure-aware generation. The abstract fails to clearly identify what gaps exist in current methods and what limitations the proposed approach aims to overcome, resulting in a weak problem definition and motivation.

    Incomplete reinforcement learning formulation in RL-CAS: While RL-CAS is described as a reinforcement learning-driven sampler, the paper lacks essential RL components such as the reward function definition, policy architecture, or training algorithm. Without these details, it remains unclear whether RL is genuinely applied or merely used as a conceptual framing, raising concerns about the module’s transparency and reproducibility.

    Lack of sketch quality validation in structural supervision: The SynSketch module relies on externally extracted sketch maps (via [20]) as structure supervision signals. However, the paper does not evaluate the quality or accuracy of these sketches, nor does it assess whether they reliably capture anatomical structures across lesion types. The potential impact of inaccurate sketches on training is not discussed, leaving the effectiveness of the structural supervision unclear.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a clinically relevant approach for addressing long-tailed classification in breast lesion subtyping using structure-aware generative augmentation and dynamic class-adaptive sampling. The proposed SynSketch and RL-CAS components are grounded in real-world data.

    However, the reinforcement learning component lacks critical implementation details, the structural supervision lacks validation, and the overall motivation—particularly in the abstract—is not well grounded in existing literature gaps. These issues affect the interpretability and clarity of the proposed contributions. Despite these limitations, the overall framework is interesting and potentially impactful, meriting a weak accept.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a dual-phase framework for long-tailed recognition of breast lesion subtypes in ultrasound images. It combines a class-steerable generative synthesizer based on latent diffusion models with a reinforcement learning-driven class-adaptive sampler (RL-CAS) to dynamically balance real and synthetic data. A sketch-grounded perception branch is introduced to inject structural priors into the synthesis process. Experiments on in-house and public datasets demonstrate competitive performance, particularly in recognizing rare classes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    See main contribution.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Overall, I appreciate the design philosophy and integration strategy presented in this paper. Although each individual component is not technically novel in isolation, the way they are combined is intuitive and interesting.  That said, I still have some concerns: 

    1. The evaluation lacks convincing diversity in datasets. The only dataset with a long-tailed distribution is in-house (and I assume it will not be made publicly available?), while the public dataset contains only two classes. I wonder whether other relevant datasets could be considered—for example, other ultrasound datasets. Additionally, how does the proposed method generalize to other long-tailed tasks or modalities [1]? Could it be further improved by integrating techniques from [1]? 
    2. Although the sketch-grounded perception branch yields gains, the robustness of the PDN-based edge extraction under varying imaging conditions (e.g., noisy handheld ultrasound, cross-device generalization) remains unclear, especially since PDN was not trained on breast ultrasound data. Did the authors evaluate the quality of the generated sketches and images or analyze model sensitivity under domain shifts? 
    3. Since all agents share a common global reward, how does the framework handle potential stagnation or reward collapse in early training stages, especially when the overall classifier performance is low (e.g., as seen in Table 1, the Baseline performance for Medium and Few classes is very poor)? Have the authors considered agent-specific reward shaping or curriculum learning to address this? 
    4. The “All” metric may not be meaningful in a long-tailed setting, as it is dominated by the performance of head classes. I recommend reporting the average accuracy across the three shot-based groups (Many, Medium, Few) for a more balanced view. 
    5. I would appreciate more analysis of how the reinforcement learning and multi-agent interactions evolve over time. For example, how does the average sampling ratio per class change across training epochs, and how does that relate to performance trends for each group? 
    6. The proposed framework involves additional components such as diffusion models, PDN, reinforcement learning, and multiple classifier training episodes, all of which contribute to increased training cost. While I understand that these are only used during training, it is unclear how this compares fairly with the baselines. I encourage the authors to provide quantitative measurements (e.g., training time, GPU usage) and discuss the computational cost trade-offs.

    [1] MONICA: Benchmarking on Long-tailed Medical Image Classification
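    The balanced-metric point in item 4 can be made concrete: an "All" accuracy is dominated by head-class counts, whereas the group-averaged accuracy the reviewer suggests weights Many, Medium, and Few equally. A toy illustration (class counts and per-group accuracies are invented for this sketch, not taken from the paper):

    ```python
    # Toy illustration: overall accuracy vs. group-averaged accuracy
    # on a long-tailed test set (all numbers invented).
    counts = {"Many": 900, "Medium": 80, "Few": 20}      # test images per group
    accs   = {"Many": 0.95, "Medium": 0.60, "Few": 0.30} # per-group accuracy

    # "All" accuracy: weighted by group sizes, so the 900 head images dominate.
    overall = sum(counts[g] * accs[g] for g in counts) / sum(counts.values())
    # Group-averaged accuracy: each of the three groups counts equally.
    group_avg = sum(accs.values()) / len(accs)

    print(f"overall   = {overall:.3f}")
    print(f"group avg = {group_avg:.3f}")
    ```

    With these invented numbers the overall score stays above 0.9 even though the Few group sits at 0.30, while the group average drops to roughly 0.62, which is the imbalance the reviewer's recommendation exposes.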

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See comments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel dual-phase framework for long-tailed classification in breast lesion subtyping via ultrasound images. The method addresses the significant challenge of class imbalance through high-fidelity generative augmentation, leveraging a reinforcement learning-driven class adaptive sampler (RL-CAS) to dynamically adjust the ratio of synthetic to real data during training. Additionally, a sketch-guided perception branch is incorporated to preserve anatomical details, ensuring class-discriminative features in the generated data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and clearly articulates the problem and methodology.
    2. The proposed method is well-conceived and addresses the long-tailed classification problem in a meaningful way, combining generative augmentation with reinforcement learning for adaptive data sampling.
    3. The experiments show solid and convincing results, demonstrating that the proposed approach significantly outperforms existing methods.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper assumes that data generated by the generative model can effectively address sample distribution imbalance. However, if the generative model itself is trained on imbalanced data, it may inherit the same bias, which could limit the effectiveness of this approach.
    2. The paper does not provide a detailed comparison of training time and computational complexity, particularly considering the additional use of the generative model and reinforcement learning (RL) for data sampling, which could potentially increase the overall cost.
    3. While the ablation study shows that RL-CAS outperforms standard re-sampling methods, the paper does not visualize or provide an analysis of how the RL-CAS ultimately converges in terms of the sampling strategy. A visualization of the final sampling behavior could provide more insights into its effectiveness and decision-making process.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please provide further insights into the method and experiments, taking into account the weaknesses mentioned above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

  1. Incomplete RL-CAS formulation, visualization gaps, and potential reward-collapse concerns (R1, 2, 3) We appreciate the reviewer’s concerns. All essential RL components are defined in our manuscript: Section 2.2 describes the complete framework (environment, state space, agents, action space), Equation (1) presents the REINFORCE algorithm, and page 6 defines the reward function. Our system prevents reward stagnation through the cubic reward function that amplifies small performance differences, the baseline term that stabilizes training, and our multi-episode approach enabling parallel exploration. Due to the page limit, we provide visualizations showing the sampling strategy evolution in the GitHub repository.
  2. Sketch quality and domain shift (R1, 2) While this concern is valid, we note that the edge extraction approach is designed to capture gradient information through pixel difference convolution, making it inherently adaptable across imaging domains. Although PDN was not specifically trained on breast ultrasound data, its architecture extracts edge features from local pixel differences rather than domain-specific patterns. To demonstrate this robustness empirically, we showcase edge extraction results on both our in-house dataset and the public BUSI dataset in our GitHub repository. More generative samples are also provided for visual verification.
  3. Unclear motivation in the abstract (R1) Thank you for pointing this out, we have modified the abstract accordingly following this comment.
  4. Lack of diversity in datasets (R2) Currently, no public breast ultrasound datasets with long-tailed distributions are available; therefore, we validated our method on both an in-house long-tailed dataset and the public two-class BUSI dataset with class imbalance. As these datasets are entirely independent, this partly demonstrates the generalizability of our approach. We agree that further evaluation on broader modalities and benchmarking frameworks such as MONICA [1] would be valuable, and will consider this in future work.
  5. Possible bias in the generative model (R3) Our generative model is guided by class labels during both training and generation, which helps minimize the bias introduced by long-tailed data distributions. Additionally, the sketch-grounded perception branch leverages anatomical priors to preserve distinctive class features and further reduce bias. Combined with our RL-based sampler, these strategies ensure effective augmentation and improved performance for minority classes.
  6. Training time and GPU usage (R2, 3) We acknowledge our approach introduces additional computational overhead during training, while maintaining comparable inference time to baselines. This increased cost is a deliberate trade-off that yields significant performance gains on long-tailed classification tasks, with a 30% accuracy improvement for few-shot classes. We argue that this trade-off is justified for applications where rare class performance is critical. Future work will explore optimization techniques to reduce computational requirements while preserving performance benefits.
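  The REINFORCE-with-baseline scheme described in item 1 (per-class agents, a shared global reward with cubic shaping, and a moving-average baseline) can be sketched roughly as follows. This is a minimal illustration of the general technique only, not the authors' implementation: the action set, the `evaluate` placeholder, and all hyperparameters are hypothetical.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  NUM_CLASSES, NUM_ACTIONS = 8, 5           # e.g. mixing ratios {0, .25, .5, .75, 1}
  logits = np.zeros((NUM_CLASSES, NUM_ACTIONS))  # one softmax policy per class
  baseline, lr, beta = 0.0, 0.5, 0.9

  def softmax(x):
      z = np.exp(x - x.max(axis=-1, keepdims=True))
      return z / z.sum(axis=-1, keepdims=True)

  def evaluate(actions):
      # Placeholder: in the real method this would train a classifier episode
      # with the sampled per-class ratios and return validation accuracy.
      return float(np.mean(actions)) / (NUM_ACTIONS - 1)

  for episode in range(20):
      probs = softmax(logits)
      # Each per-class agent samples a synthetic-to-real mixing action.
      actions = np.array([rng.choice(NUM_ACTIONS, p=probs[c])
                          for c in range(NUM_CLASSES)])
      acc = evaluate(actions)
      # Cubic shaping keeps the sign but amplifies differences from the baseline.
      reward = (acc - baseline) ** 3
      baseline = beta * baseline + (1 - beta) * acc   # moving-average baseline
      # Shared global reward: the same scalar drives every agent's
      # REINFORCE update, grad(log softmax) = onehot(action) - probs.
      for c in range(NUM_CLASSES):
          grad = -probs[c]
          grad[actions[c]] += 1.0
          logits[c] += lr * reward * grad
  ```

  In this framing, "multi-episode parallel exploration" would correspond to running several such sampled episodes before each policy update and averaging their gradients.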
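  The pixel-difference intuition appealed to in item 2 can be illustrated with a toy central pixel-difference convolution: because each 3x3 window is reduced to differences against its center pixel before convolving, any constant intensity region yields a zero response, which is why the response depends on local gradients rather than absolute brightness. This is a simplified NumPy sketch, not the PDN architecture; the kernel and images are invented for illustration.

  ```python
  import numpy as np

  def central_pixel_difference_conv(img, kernel):
      """Convolve (neighbor - center) differences in each 3x3 window."""
      h, w = img.shape
      out = np.zeros((h - 2, w - 2))
      for i in range(h - 2):
          for j in range(w - 2):
              patch = img[i:i + 3, j:j + 3]
              diff = patch - patch[1, 1]     # subtract the center pixel
              out[i, j] = np.sum(diff * kernel)
      return out

  kernel = np.ones((3, 3))
  flat = np.full((5, 5), 7.0)                # constant region, any brightness
  edge = np.zeros((5, 5)); edge[:, 3:] = 1.0 # vertical step edge

  print(central_pixel_difference_conv(flat, kernel))  # all zeros
  print(central_pixel_difference_conv(edge, kernel))  # nonzero near the edge
  ```

  Adding a constant offset to the edge image leaves the output unchanged, a small analogue of the invariance to global intensity shifts claimed for the gradient-based extractor.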




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


