Abstract

Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, in a downstream detection task, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% when the model was tested internally and by 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking five otorhinolaryngologists with varying levels of expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4594_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ChiaraBaldini/endoLDMC

Link to the Dataset(s)

N/A

BibTex

@InProceedings{BalChi_Clinicallyguided_MICCAI2025,
        author = {Baldini, Chiara and Kushibar, Kaisar and Osuala, Richard and Balocco, Simone and Diaz, Oliver and Lekadir, Karim and Mattos, Leonardo S.},
        title = {{Clinically-guided Data Synthesis for Laryngeal Lesion Detection}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors use established tools with a laryngeal endoscopy dataset containing laryngeal lesions to enrich a clinical detection environment. The use of an LDM together with ControlNet makes it possible to create synthetic lesion data for training a laryngeal lesion detection algorithm, supposedly helping CADx/e.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors use diffusion models for generating laryngeal data - to my knowledge, for the first time in published, peer-reviewed studies.

    One other major strength is the availability of a perceptual study in which five otolaryngologists rate the realism of the created images. That said, how realistic the images are may not matter much if the purpose (better identification of laryngeal lesions) does not necessarily depend on realism, since the synthetic data itself is not the key output (in contrast to using the method for training laryngologists).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Data source

    Unfortunately, the authors did not make use of the data that is already available. The amount of data used for training and evaluation is relatively tiny (~1k images at most), while others have shown in laryngeal endoscopy that multiple thousands of images are needed to obtain reliable results. The Laryngoscope8 dataset would have been very useful here (https://github.com/greenyin/Laryngoscope8), or even the BAGLS dataset (Gómez et al., Sci Data), which also contains laryngeal lesions (59k frames in total). Moreover, given the different conditions (nasal/oral endoscope, white light, NBI, ~10 different pathologies), the data might be imbalanced (I did not find any details about the data composition), such that individual cases may be represented by only tens of images, which is far too low in comparison to other studies, such as Wellenstein et al., Head & Neck 2023 and Baldini et al., Computer Methods and Programs in Biomedicine 2025.

    Data quality

    I am not really convinced by the analysis of the FID scores: the synthetic ones are lower (maybe as expected), but I am not sure what this lower value means. Is it good, decent, bad, or merely descriptive? The meaning of this analysis is not fully clear to me.

    Experiments

    This study analyzed the improvement in classification/detection obtained by incorporating synthetic images. This part is rather weak: since only little data was present, the mere availability of additional data could explain the performance gain. A control experiment (i.e., adding +10% real data) is crucial to conduct. In addition, one would want to know the performance on purely synthetic data, which has not been shown. The relation between Fig. 3a and Table 1, which in my opinion show the same data, is unclear: the reported values are similar but different, and why the numbers differ is unclear to me. Also, the observation that +10% seems to be a sweet spot indicates that more data helps up to a point, after which performance degrades because the variety of the synthetic data is limited.

    Human rater study

    This study has an interesting approach, which seems to be valid (no criticism here). However, I can barely comprehend the busy Figure 3, especially as it is shared between the +synthetic-data performance (internal/external) and the human rater study. Panel (b) looks to me like a confusion matrix, where a couple of synthetic images have been classified as “real” and “unreal”. However, an analysis of what makes these images “real” or “unreal” is missing. The AUC analysis is inconclusive: the authors report an AUC of 0.5, which amounts to random guessing. I cannot concur with the sentence “Indeed, we can empirically affirm the realism of the synthetic data as clinicians struggled to recognize synthetic samples, often classifying them as real cases.”. To me, it seems that every second image was classified as real and the other half as unreal, and it might be that the LDM simply reproduces “real images”, as the authors have not shown that the synthetic samples are disjoint from the real images in the training population.

    I was also wondering whether these images need to be realistic at all. The sole purpose of these images is to improve the lesion detection pipeline, not to train laryngologists. Therefore, one could argue that realism is not the main requirement: one can imagine images with, e.g., dead pixels or otherwise “broken” anatomy that might still help in detecting laryngeal lesions.

    Methodological contribution

    This study integrates an existing LDM with an existing ControlNet, tailoring the prompts and images to laryngeal data. The results are debatable (see above), but acceptable and rather expected from the literature. I see only incremental knowledge gain for the MICCAI community. I rated the study as an application paper, but even there I see major flaws.

    Broader scope - generative AI in laryngology

    The authors state that “To the best of our knowledge, no studies have explored the synthesis of images for laryngeal lesion diagnosis”. This is only partially true. A study in J Voice last year (Darvish and Kist, J Voice 2024) showed that variational autoencoders can be used not only to model realistic laryngeal endoscopy frames, but also to modulate the glottal opening by adding or subtracting the glottal opening vector. The authors used the BAGLS dataset, which also includes laryngeal pathologies such as cancer or nodules.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The use of LDM+ControlNet to create synthetic images for enlarging datasets is in general nice. However, given the number of flaws in the experiments, setup, and claims detailed in the “weaknesses” section, I do not think the rebuttal phase is sufficient to address the shortcomings. Furthermore, I am missing deeper insights into the problem, which makes the study only partially interesting for MICCAI, so I recommend rejecting it.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    In the rebuttal phase, the authors stated that they are aware of the datasets and consciously did not use them, which I do not understand. The lack of annotations is not necessarily a valid reason to me, as annotations can be generated. The authors must have some medical background and clinical collaborators, since without this intellectual input one can hardly (i) verify whether the technology is working properly or artifacts exist, and (ii) perform the perception study.

    In addition, I do not feel that my concerns were adequately addressed. I do acknowledge that the FID score has been reconsidered and that the authors gave reasons for not using the two suggested datasets, but I am still not convinced that a revised manuscript would resolve the initial flags. For instance, some comments (the authors are not the first to use generative AI in laryngoscopy) were simply ignored.

    On a broader scope, the authors provide essentially one new insight: adding (more) synthetic data helps in detecting laryngeal pathologies. That adding more data helps has been shown before by multiple groups; whether the gain stems from simply “more” data or specifically “synthetic” data, and whether that data has to be realistic, has not been quantified, which to my understanding would be the crucial contribution.



Review #2

  • Please describe the contribution of the paper

    The paper proposes to use lesion inpainting (via ControlNet, a recent diffusion architecture facilitating this) to generate lesion-present images in laryngology. Experiments are conducted using a detection model to test whether adding synthetic data to real data improves performance, as well as a user study assessing the realism of the images. Performance is evaluated on both an internal and an external test set, and the performance gains are quite impressive.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper demonstrates how modern-day vision diffusion models can be used to improve lesion localization in laryngeal endoscopic images using synthetic images.

    The manuscript is well written and easy to follow.

    Evaluation uses both an internal test set (same clinical site) and an external test set (different clinical site), and the demonstrated performance gains are quite impressive.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Some technical implementation details are not clear; see below.

    • Is ControlNet fine-tuned for the task? It isn’t clear why it would generalize to medical images as an out-of-the-box, non-medical model.

    • Are captions generated using an LLM or from clinical notes?
    • The description of the uncertainty estimation process is also not clear.
    • When cross-validation is applied, does the real training set change, or is it only synthetic data selection that changes?

    Authors should also provide a brief overview of related work on lesion in-painting, clarifying whether this is the first work to inpaint lesions using a generative model into laryngologic images, or there exist previous approaches doing so.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Above weaknesses are minor and can be clearly addressed by the authors. I believe the contribution of the approach is significant and the paper overall would be a great addition to the conference.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The other reviews raised several issues, among them clinical utility and evaluation limitations; however, I believe the manuscript offers a novel idea that, in its current form, is sufficient to demonstrate the proof of concept.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a latent diffusion model paired with a ControlNet adapter for generating laryngeal endoscopic images. While it primarily builds on existing methodologies, the application of synthetic data to enhance CADx model performance for laryngeal lesion diagnosis is valuable. The authors evaluate their synthetic images using the FID ratio, a robust method that addresses FID’s limitations, coupled with a human observer study. Additionally, they use uncertainty estimation to select synthetic data for training in the downstream task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors use two forms of conditioning information: an image caption containing imaging modality and lesion type, and a bounding box-based mask to provide information about the location of the lesions. This can provide flexibility in image generation. While the authors use bounding boxes, the work could easily extend to segmentation masks, either manually annotated or generated through SAM.
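
    As a toy illustration of this bounding-box conditioning (the exact encoding used in the paper is not specified here, so the convention below is an assumption), boxes can be rasterized into a binary conditioning mask:

    # Hypothetical helper: rasterize bounding boxes into a binary mask used as
    # the spatial conditioning image; names and conventions are illustrative.
    import numpy as np

    def bboxes_to_mask(bboxes, h, w):
        """bboxes: list of (x1, y1, x2, y2) in pixels -> uint8 mask of shape (h, w)."""
        mask = np.zeros((h, w), dtype=np.uint8)
        for x1, y1, x2, y2 in bboxes:
            mask[y1:y2, x1:x2] = 255  # filled rectangle marks the lesion region
        return mask

    mask = bboxes_to_mask([(40, 60, 120, 140)], h=256, w=256)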

    The use of the FID ratio is a strong choice, as it effectively evaluates the quality and variability of generated images in relation to real images, which FID alone cannot capture, especially for medical data.

    The use of uncertainty estimation to identify and select the most challenging synthetic samples is a good strategy.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Given that the final model requires both text and spatial conditioning, it’s unclear why the authors did not start from an existing text-conditioned model such as Stable Diffusion. This would have provided built-in support for text via cross-attention and a natural integration point for ControlNet. The choice to begin with an unconditional model and later introduce conditioning complicates the pipeline unless suitably justified.

    The core idea of the work is to use a ControlNet adapter to provide both text and mask conditioning. The paper mentions using a text encoder, but it’s unclear how the text embeddings are incorporated into the model. Since ControlNet natively handles only spatial inputs, additional clarification is needed on how textual information influences generation through it.

    The training setup deviates from standard practice by fine-tuning the base LDM unconditionally, then jointly fine-tuning it again with ControlNet and conditioning inputs. Typically, ControlNet is added to a pretrained, already conditioned model, with only the ControlNet module being trained. Clarifying the motivation for this two-stage fine-tuning approach, and whether freezing the base model was considered, would help readers better understand and reproduce the method.

    I appreciate the use of the FID_ratio metric as proposed in Medigan. However, the manuscript’s justification for why FID “does not fully align” with the goal of generating clinically diverse data appears to mischaracterize FID. As per my understanding, FID measures distributional similarity, not similarity between individual images, and is not inherently at odds with diversity. The original motivation for FID_ratio is to contextualize FID_rs (real-synthetic) relative to the natural variability in real data (FID_rr), which this paper does not mention.
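
    For reference, FID and the FID_ratio can be sketched from pre-extracted Inception features as below. The feature arrays are random placeholders, and the ratio convention (FID_rr / FID_rs, so values near 1 mean the synthetic set is about as far from the real set as two real halves are from each other) is an assumption, not taken from the paper.

    import numpy as np
    from scipy import linalg

    def fid(feats_a, feats_b):
        """Frechet distance between Gaussian fits of two feature sets of shape (N, D)."""
        mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical noise
            covmean = covmean.real
        return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2.0 * covmean))

    rng = np.random.default_rng(0)
    real_a = rng.normal(size=(200, 64))  # placeholder features: real half 1
    real_b = rng.normal(size=(200, 64))  # placeholder features: real half 2
    synth = rng.normal(size=(200, 64))   # placeholder features: synthetic set
    fid_rr = fid(real_a, real_b)                      # natural real-vs-real variability
    fid_rs = fid(np.vstack([real_a, real_b]), synth)  # real-vs-synthetic distance
    fid_ratio = fid_rr / fid_rs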

    The inclusion of a clinical perceptual evaluation using Likert-scale ratings is a valuable step. That said, I’m concerned about how AUC scores were derived. Mapping ordinal ratings to probabilities feels somewhat arbitrary and may not reflect true subjective confidence. This could make the AUCs less meaningful. Is there a reference that supports this conversion?
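
    For what it is worth, one rank-based alternative avoids any conversion entirely: since ROC AUC depends only on the ordering of scores, the raw ordinal ratings can be used directly. A minimal sketch, with made-up labels and ratings for illustration:

    from sklearn.metrics import roc_auc_score

    y_true = [1, 1, 1, 0, 0, 0]   # 1 = image is real, 0 = image is synthetic
    ratings = [5, 4, 3, 4, 2, 1]  # Likert confidence that the image is real
    auc = roc_auc_score(y_true, ratings)  # values near 0.5 => raters cannot tell them apart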

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Two minor observations: First, the uncertainty estimation strategy based on variance across cross-validated models is reasonable, but since it relies on model confidence scores, its effectiveness may be impacted by poor calibration, potentially leading to suboptimal sample selection. Second, the discussion focuses on the peak at 10% synthetic data but overlooks the non-monotonic performance trend beyond that, particularly the dip at 20% and rise at 40% on the external test set. A brief analysis of this pattern could strengthen the discussion.
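
    For concreteness, the variance-based selection referred to in the first observation can be sketched as follows, assuming each cross-validated detector yields one confidence score per synthetic image (shapes and scores below are placeholders, not the authors' implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.random((3, 1000))       # (n_models, n_synthetic_images), placeholder scores
    uncertainty = scores.var(axis=0)     # per-image variance across the cross-validated models
    k = int(0.10 * scores.shape[1])      # keep the 10% most uncertain samples
    selected = np.argsort(uncertainty)[-k:]  # indices of the most "challenging" images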

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel and clinically relevant application with strong empirical results, but some methodological aspects—particularly around conditioning, training setup, and evaluation—are unclear or under-justified.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors have addressed the major concerns.




Author Feedback

We thank all reviewers for their constructive feedback and for recognizing the clinical relevance and novelty of our approach for generating laryngeal endoscopic images, which shall be used to enhance CADe/x performance. As stated, our paper features “strong and impressive results” [R2, R3] and includes an “interesting perceptual realism study” [R1].

R1-7.1: We are aware of the suggested datasets. However, even large datasets face class imbalances reflecting real-world incidences (~50% healthy cases in Laryngoscope8). BAGLS is tailored for a different clinical purpose, focusing on high-speed endoscopy to examine vocal cord mobility. Furthermore, while these datasets could enhance future work after proper annotation, they are not applicable to our study as they lack lesion bounding-box annotations.

R1-7.2/R3-7.4: We opted for the FID_ratio because it contextualizes the FID between real and synthetic distributions (FID_rs) relative to real data variability (FID_rr), a choice acknowledged by R3 as highly effective. As expected, FID_rs is higher due to inherent differences but remains close to FID_rr (high FID_ratio) thanks to preserved variability. As suggested by R3, we will reformulate the FID definition, emphasizing that it measures distributional similarity.

R1-7.3/R2-7.3,4: Although we considered conducting experiments similar to those suggested by R1 before submission, we consciously excluded them, as the aim of data synthesis in this work is to enrich existing datasets while reducing reliance on real data acquisition and annotation, which require considerable time and effort from clinicians. Fig. 3a and Table 1 present different results: Fig. 3a illustrates AP evolution as synthetic data is added (using the “fold 1” data from Table 1). We will include this clarification in Fig. 3a’s caption. In the “Downstream task” section, we will clarify that: 1) cross-validation was conducted with 3 different sets of synthetic data and the same real set; 2) the Uncertainty Estimation (UE) strategy selects “challenging samples” based on detection prediction uncertainty. Specifically, the generated data was evaluated across 3 models, and the 10% of images with the highest uncertainty, i.e., the highest variance in confidence scores, were selected.

R1-7.4/R3-7.5: As explained in the manuscript, the human-observer study involves 20 images, 10 real and 10 synthetic, indicated in Fig. 3b by the left (REAL) and right (SYNTHETIC) quadrants. We will specify that the AUC scores were derived as in [A,B], as this strategy gives a good indication of the perceptual realism of synthetic data and of the confidence levels of human experts. The clinical realism of synthetic images, especially concerning lesion localization and appearance, is essential for enabling them to complement real data, thus contributing to the improvement of CADe/x models.

R2-7.1,2/R3-7.1,2,3: Starting from the pre-trained autoencoder and text encoder from Stable Diffusion 2.1, as indicated by [14], we fine-tuned the LDM to generate realistic laryngeal images. We observed that this was essential for capturing the diversity and complexity demanded by this context. To control the lesions’ synthesis, we integrated and fine-tuned ControlNet while freezing the LDM’s weights. Captions from clinical notes were input to the denoising U-Net of the LDM and to ControlNet via a cross-attention mechanism. We will update the “Implementation details” section according to the reviewers’ comments.
R1-7.6/R2-7.5: As stated, our study is the first to propose a method for synthesizing ready-to-use image-annotation pairs to train CADe/x models for detecting and diagnosing laryngeal lesions in standard endoscopy. We will incorporate the valuable minor observations from R3-10.1,2 in the discussion.

[A] Alyafi et al., 2020, DCGANs for Realistic Breast Mass Augmentation in X-ray Mammography
[B] Garrucho et al., 2023, High-resolution synthesis of high-density breast mammograms: Application to improved fairness in deep learning based mass detection
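
For readers trying to reproduce the two-stage setup described in R2-7.1,2/R3-7.1,2,3, a minimal sketch following the standard diffusers ControlNet training pattern is given below. The model IDs, checkpoint path, and data handling are placeholders and assumptions, not the authors' released code (see the repository linked above for that).

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, ControlNetModel, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

base = "stabilityai/stable-diffusion-2-1"  # pre-trained components, per the rebuttal
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
# Stage 1 (assumed already done): the UNet fine-tuned on laryngeal images; path is hypothetical.
unet = UNet2DConditionModel.from_pretrained("path/to/laryngeal-finetuned-unet")
# Stage 2: a ControlNet initialized from the fine-tuned UNet; only it is trained.
controlnet = ControlNetModel.from_unet(unet)

for module in (vae, text_encoder, unet):  # freeze the LDM, as stated in the rebuttal
    module.requires_grad_(False)
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)

def train_step(images, bbox_masks, captions):
    """images: (B,3,H,W) in [-1,1]; bbox_masks: (B,3,H,W) conditioning images; captions: list[str]."""
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    tokens = tokenizer(captions, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]  # captions enter both nets via cross-attention
    down_res, mid_res = controlnet(noisy, t, encoder_hidden_states=text_emb,
                                   controlnet_cond=bbox_masks, return_dict=False)
    pred = unet(noisy, t, encoder_hidden_states=text_emb,
                down_block_additional_residuals=down_res,
                mid_block_additional_residual=mid_res).sample
    # SD 2.1 checkpoints may use v-prediction; pick the matching regression target.
    target = (noise if scheduler.config.prediction_type == "epsilon"
              else scheduler.get_velocity(latents, noise, t))
    loss = F.mse_loss(pred, target)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()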




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper addresses a clinically relevant application in laryngeal lesion detection by augmenting the training dataset with synthetic images. R1/R3 highlight good results but find the method unclear or falling short; R2 acknowledges the limitations but considers the paper sufficient to demonstrate the proof of concept. IMHO, laryngeal lesion detection is quite a niche area and data is quite scarce; yet to generate new images, the proposed method requires base image + text + bbox, and getting the image on its own is already quite a challenge. This is a borderline paper for me, slightly leaning towards accept, but I think it is more suited for a workshop.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a borderline paper. Its primary strength is being the first to demonstrate that latent diffusion models, with both text- and mask-based ControlNet conditioning, can generate realistic laryngeal lesion images that measurably boost CADx performance, even when experts cannot reliably distinguish real from synthetic frames. However, key methodological details remain under-explained (e.g., the ControlNet fine-tuning strategy, caption generation, uncertainty estimation), the dataset is small and lacks broader public benchmarks, and the FID-ratio analysis and perceptual-study results need clearer interpretation. The authors need to clarify these methodological points, provide fuller justification of their evaluation metrics, and discuss the limits of their small proprietary dataset (ideally with at least one publicly available external cohort).


