Abstract

The segmentation of mass lesions in digital breast tomosynthesis (DBT) images is important for the early screening of breast cancer. However, high-density breast tissue often conceals mass lesions, which makes manual annotation difficult and time-consuming. As a result, annotated data for model training are scarce. Diffusion models are commonly used for data augmentation, but existing methods face two challenges. First, because the lesions are highly concealed, it is difficult for the model to learn the features of the lesion area. This leads to low generation quality in the lesion areas, which in turn limits the quality of the generated images. Second, existing methods can only generate images and cannot generate the corresponding annotations, which restricts the usability of the generated images in supervised training. In this work, we propose a paired image generation method. The method requires no external conditions and generates paired images by training an extra diffusion guider for the conditional diffusion model. In our experiments, we generated paired DBT slices and mass lesion masks and incorporated them into the supervised training of the mass lesion segmentation task. The experimental results show that our method improves generation quality without external conditions and helps alleviate the shortage of annotated data, thereby enhancing the performance of downstream tasks.
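
To make the mutual-guidance idea concrete, here is a minimal, hypothetical sketch of the paired sampling loop: `eps_mask` (the diffusion guider, 2 channels in, 1 out) and `eps_image` (the mask-to-image denoiser) are stand-in callables, and all names and interfaces are illustrative assumptions rather than the authors' implementation.

```python
import torch

def paired_sample(eps_mask, eps_image, betas, shape, device="cpu"):
    """DDPM-style ancestral sampling of an (image, mask) pair, where each
    chain's update is guided by the other chain's current state."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)  # image chain, e.g. [1, 1, H, W]
    y = torch.randn(shape, device=device)  # mask chain
    for t in reversed(range(betas.shape[0])):
        z_y = torch.randn_like(y) if t > 0 else torch.zeros_like(y)
        z_x = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        # Mask step: the guider sees the noisy mask stacked with the image.
        eps_y = eps_mask(torch.cat([y, x], dim=1), t)
        y = (y - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps_y) \
            / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * z_y
        # Image step: the image model is guided by the freshly updated mask.
        eps_x = eps_image(torch.cat([x, y], dim=1), t)
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps_x) \
            / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * z_x
    return x, y
```

The point of the sketch is the sequential update: the mask step uses the current image state, and the image step immediately uses the freshly updated mask, so neither chain needs an external condition.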

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4386_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaHao_Paired_MICCAI2025,
        author = { Zhang, Haoxuan and Cui, Wenju and Cao, Yuzhu and Tan, Tao and Liu, Jie and Peng, Yunsong and Zheng, Jian},
        title = { { Paired Image Generation with Diffusion-Guided Diffusion Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a generative network that produces digital breast tomosynthesis (DBT) images containing masses. The authors show that their method, paired image generation (PIG), can generate DBT images with masses without any conditional inputs. Simply training a diffusion model on two-channel images (the true image and the lesion mask) allows the diffusion model to generate images with masses. The authors compared their model against multiple SOTA methods in both conditioned and unconditioned generation and showed that their model was the best performer. Finally, they tested a lesion segmentation model augmented with their generated images as well as with SOTA-generated images and showed the best improvements in the segmentation task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s strength is that it demonstrates a straightforward method of training a diffusion model to generate DBT images with masses, which can then be used to train downstream segmentation models. Their work shows that adding generated images during training improves downstream segmentation performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There are several weaknesses in this work. First, the novelty is limited, as the main body of the contribution is providing the diffusion model with two-channel images. Second, the authors assert their model is not a conditioned model; however, isn’t the addition of the second mask channel a condition in itself? Third, there are no standard deviations or tests of significance in the results, making it difficult to determine whether the results are significant. Fourth, evaluating image quality with only FID seems very limited, as there are other metrics of image quality that could easily be computed in addition to FID, such as KID and LPIPS. Another concern about model robustness is that in their results the addition of more data (+PIG_3072) actually reduced downstream performance compared to less data (+PIG_2048), which leads me to believe that the model is not as robust or stable as claimed. This might be due to the segmentation model’s saturation, but it is hard to tell without any tests of significance. Lastly, the authors assert that obtaining dense-breast lesion masks is a problem but do not state what percentage of their dataset consisted of dense breasts for training and segmentation.
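
    For reference, KID and LPIPS (alongside FID) can be computed with standard tooling; below is a minimal sketch using the torchmetrics package, with random tensors standing in for real and generated slices (single-channel DBT slices would need to be replicated to three channels first):

    ```python
    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance
    from torchmetrics.image.kid import KernelInceptionDistance
    from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

    # Dummy uint8 image batches of shape [N, 3, H, W] standing in for data.
    real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
    fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

    fid = FrechetInceptionDistance(feature=2048)
    kid = KernelInceptionDistance(subset_size=32)  # subset_size <= N
    for metric in (fid, kid):
        metric.update(real, real=True)
        metric.update(fake, real=False)
    print("FID:", fid.compute())
    kid_mean, kid_std = kid.compute()
    print("KID:", kid_mean, "+/-", kid_std)

    # LPIPS is a paired metric: it compares each generated image with a
    # reference image, using float inputs scaled to [-1, 1].
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")
    print("LPIPS:", lpips(real.float() / 127.5 - 1.0,
                          fake.float() / 127.5 - 1.0))
    ```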

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It would be nice to see in the rebuttal some more metrics of image quality in the evaluation, like KID and LPIPS. Additionally, it would be good to see the performance of the segmentation model when trained only on generated images, along with tests of significance; without them, the improvements could be a matter of random seeds or noise. Separating segmentation performance by breast density would immensely help the paper, as the authors mention that dense-breast lesion segmentation is the most difficult task. Finally, clarifying why the authors assert that adding a mask channel is not conditioning would significantly help the paper.

    What sort of processing was performed on the dataset? As far as I’m aware, DBTs have multiple slices (angles) and choosing the center slice might defocus lesions.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper needs more work: there isn’t strong novelty in the model architecture, the definition of conditional training is questionable, and the model’s performance, while seemingly the best, is not supported by any tests of significance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I think the major reason for the initial rejection was the confusion around conditioning and guidance. While the two concepts are very similar, it is true that they are different. Their current results do show that, compared to similar guidance-based and conditional models, theirs performs better on average. While I think there is room for improvement, as mentioned by the other reviewers as well, this paper deserves to be accepted.



Review #2

  • Please describe the contribution of the paper

    The article proposes the Paired Image Generation (PIG) framework, which uses mutually guided diffusion models to simultaneously generate Digital Breast Tomosynthesis (DBT) images and their corresponding mass segmentation masks. Unlike conditional diffusion models with external guidance, this method models a joint denoising process across two networks, with the image and mask generation processes guiding one another.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of generating paired DBT images and mass segmentations could be useful for generating synthetic datasets.

    The paper is well-written including the formal derivations and practical implementations.

    Experimental results demonstrate that segmentation accuracy improves through synthetic data augmentation suggesting real clinical utility.

    The paper compares its method against a variety of baselines (DDPM, DDIM, LDM, SegGuidedDif, ControlNet), demonstrating superior performance in FID and segmentation Dice metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The method is not compared against a naive baseline that simply treats the image and mask as two output channels of a single diffusion model, which is conceptually straightforward and might yield similar performance. For example, unconditional generation of color images jointly generates the R, G, and B channels, which mutually guide each other during the diffusion process. The proposed method is a little more complicated because, on each iteration, the mask is updated and then the image is updated sequentially by two networks. I would have liked to see an ablation study testing whether this sequential approach is necessary compared to simply processing the image and mask jointly as channelized images; a sketch of that baseline follows below.
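
    A minimal sketch of the naive baseline described above, under standard DDPM assumptions (the `model` name and the epsilon-prediction objective are illustrative, not the paper's code):

    ```python
    import torch
    import torch.nn.functional as F

    def joint_train_step(model, image, mask, betas):
        """One training step for the naive baseline: a single denoiser over
        a 2-channel stack (channel 0 = DBT slice, channel 1 = lesion mask),
        diffused and denoised jointly."""
        pair = torch.cat([image, mask], dim=1)  # [B, 2, H, W]
        t = torch.randint(0, betas.shape[0], (pair.shape[0],),
                          device=pair.device)
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(pair)
        noisy = alpha_bar.sqrt() * pair + (1 - alpha_bar).sqrt() * noise
        return F.mse_loss(model(noisy, t), noise)  # eps-prediction loss
    ```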

    The concept of jointly generating images and segmentation maps is not entirely novel; similar approaches can be found with a quick search. In particular, the JSSR method below closely resembles their approach and was applied to CT images.

    Liu, Fengze, et al. “JSSR: A joint synthesis, segmentation, and registration system for 3D multi-modal image alignment of large-scale pathological CT scans.” European Conference on Computer Vision. Cham: Springer International Publishing, 2020.

    Sushko, Vadim, et al. “One-shot synthesis of images and segmentation masks.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.

    Qi, Lu, et al. “UniGS: Unified representation for image generation and segmentation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Minor issue, but in your theory, I think \sigma_t is never defined and may be used incorrectly in the reverse process. I thought the variance of an update step should be \beta_t. I may be wrong here but just double check.
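
    For context, in the standard DDPM formulation (Ho et al., 2020) the reverse update is conventionally written with \sigma_t, and \sigma_t^2 = \beta_t is one of the two usual choices, so the usage may well be consistent even though \sigma_t should still be defined explicitly. A reference form of the update:

    ```latex
    % Standard DDPM reverse (ancestral sampling) step:
    x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
              \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}
              \, \epsilon_\theta(x_t, t) \right) + \sigma_t z,
    \qquad z \sim \mathcal{N}(0, I),
    % with the two common variance choices:
    \sigma_t^2 = \beta_t
    \quad \text{or} \quad
    \sigma_t^2 = \tilde{\beta}_t
               = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t .
    ```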

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty is not very high, but I am very impressed with the practical task-based evaluation showing that synthetic data augmentation improves the segmentation accuracy, demonstrating practical clinical utility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel Paired Image Generation (PIG) method for generating both medical images and their corresponding lesion mask annotations using two unconditional diffusion models that guide each other. The focus is on Digital Breast Tomosynthesis (DBT), where annotated data is scarce and lesion regions are difficult to learn due to concealment in dense tissue. In this method, two separate diffusion models (one for images and the other for masks) guide each other, eliminating the need for external conditions; a learnable guider thus replaces fixed conditioning, enabling dynamic mask-image pairing. The authors demonstrate results on 8,723 DBT slices and, upon augmenting the original dataset with the generated images, improve on baseline conditional methods for lesion segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The concept of mutually guided diffusion models for generation is novel in this space and has been explained well.
    2. A particularly interesting aspect was that the authors demonstrated the effectiveness of their method by augmenting generated images for downstream tasks and improving results.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper introduces a dual-diffusion setup where two models guide each other. However, a natural question is whether such mutual conditioning could introduce instability. The authors do not elaborate on any specific training strategies used to ensure convergence and avoid degenerate behavior.
    2. The authors describe their framework as ‘‘unconditional’’ and compare it only against DDPM/DDIM for image quality (FID). However, PIG uses mutual guidance between two diffusion processes, so it is a kind of model-guided conditioning even though no explicit class/text conditions are used. As such, the comparison with fully unconditional models does not seem entirely fair.
    3. While Figure 3 provides illustrative examples, all shown lesions appear in the upper or mid regions of the DBT slices. This may be due to space constraints that prevented the authors from including more examples; however, it raises the question of whether the model generalizes well to lesions in other spatial locations or whether there is a mask position bias.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a novel and promising dual-diffusion framework for generating paired medical images and annotations without external conditions, several concerns remain. First, the mutual guidance mechanism between the two diffusion models introduces the risk of training instability, yet the paper does not discuss any techniques used to ensure convergence or mitigate this risk. Second, the method is described as “unconditional,” but in practice it involves internal guidance/conditioning between models, making the exclusive comparison to unconditional baselines for image quality less convincing. Third, the visual examples provided (Fig. 3) appear to focus only on lesions in the upper or mid regions of DBT slices. While this may be due to space constraints, it raises concerns about potential spatial bias and generalizability across lesion locations. Finally, reproducibility is limited due to the lack of detailed implementation specifics. Despite these limitations, the core idea is strong, and the performance improvements are promising.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The definition of conditional training is still questionable to me even after the rebuttal, and I believe that some parts of the results are compromised, as this was not clearly addressed in the rebuttal.




Author Feedback

Thanks for your valuable comments. We sincerely appreciate your time and effort in reviewing our work. Your insights have been instrumental in refining our proposed PIG. In this rebuttal, we wish to address your concerns.

For W1 (Weakness 1) from R3 (Reviewer 3), PIG may be misunderstood as a “naive baseline” that directly sets the diffusion model’s input/output channels to 2. Based on Section 2.2, we will further clarify the differences between PIG and this baseline in the following response to W1 from R1 (as R1 suggested comparing PIG with this baseline).

For W1 from R1/R3, we further compare PIG with this naive baseline. For the baseline, assuming both the image and the mask are single-channel, a diffusion model with 2 input/output channels must be trained from scratch. By contrast, PIG proves that this process can be achieved using two models (x and y) that generate guiding signals for each other (Proposition 1). We can directly use a mask-to-image model pretrained on large-scale unannotated datasets (such as MAISI [2], the generative foundation model) as model y, and only need to train an additional diffusion guider with 2 input channels and 1 output channel as model x. This reduces training costs, enhances applicability, and enables the generation of higher-quality images. We deeply regret that, due to conference guidelines, we cannot present new experimental results. We will provide further textual explanations in the final version of the paper.
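
As a sketch of the guider described here (a denoiser with 2 input channels and 1 output channel), the following hypothetical training step illustrates the idea; noising the guiding image channel to the same timestep is our assumption, and the paper's Algorithm 1 specifies the actual recipe.

```python
import torch
import torch.nn.functional as F

def guider_train_step(guider, image, mask, betas):
    """One training step for the diffusion guider (model x): 2-channel
    input (noisy mask + noisy image), 1-channel epsilon prediction,
    trained independently of the frozen mask-to-image model y."""
    t = torch.randint(0, betas.shape[0], (mask.shape[0],),
                      device=mask.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(mask)
    noisy_mask = alpha_bar.sqrt() * mask + (1 - alpha_bar).sqrt() * eps
    # Assumption: the guiding image channel is noised to the same timestep,
    # matching what the guider would see during paired sampling.
    noisy_image = (alpha_bar.sqrt() * image
                   + (1 - alpha_bar).sqrt() * torch.randn_like(image))
    eps_hat = guider(torch.cat([noisy_mask, noisy_image], dim=1), t)
    return F.mse_loss(eps_hat, eps)
```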

For W2 from R1, we will illustrate the advantages of PIG compared with other methods that jointly generate images and masks. Unlike approaches such as JSSR, which combine generative models with segmentation models, PIG combines two generative models that can mutually generate guiding signals, thus improving the quality of the generated results.

For W1 from R2, the two models guide each other only during the generation process. They are trained separately (as described in Algorithm 1) and do not share parameters. No instability was observed in the training process.

For W2 from R2/R3, we will explain (1) why PIG is unconditional and (2) why we compare PIG with other unconditional models. (1) During the generation process, conditional generation methods like ControlNet [18] require masks as inputs. However, PIG does not need masks and can generate both images and masks. Therefore, PIG is unconditional. (2) Comparisons with other unconditional models reveal an advantage of PIG: when generating images, we can utilize PIG to generate guiding signals simultaneously, which ensures that the process requires no additional conditions and enhances the quality of generated images.

For W3 from R2, the generated results include diverse lesion locations, which is one of the advantages of PIG, although we did not display them. This can also be seen from the analysis of the results shown in Table 2.

For W3-W5 from R3, while we cannot provide new experiments, we can demonstrate the high quality and stability of PIG from two perspectives based on the existing results in Table 2. (1) We employed five-fold cross-validation, which guarantees that every patient is evaluated. PIG outperforms the other methods on all metrics in every fold. (2) All the generation methods exhibit similar trends: performance peaks after generating ~2048 images, and further increasing the proportion of generated images leads to fluctuations (probably due to performance saturation). For PIG, performance dips to its lowest point after generating ~3072 images; this lowest value, reported in Table 2, still outperforms the highest values achieved by the other methods. We will provide further textual explanations in the final version of the paper.

For W6 and the comments from R3, ~70% of the samples in the dataset are dense breasts. In future work, we will conduct experiments specifically on dense breast tissue. During data processing, we extracted all slices containing lesions, rather than only the central slices.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors sufficiently addressed the major comments raised by the reviewers.


