Abstract
In medical imaging, the evaluation of segmentation methods remains confined to a limited set of metrics (e.g. Dice coefficient and Hausdorff distance) and annotated datasets of restricted size and diversity. Moreover, segmentation is often a preliminary step for extracting relevant biomarkers, accentuating the need to redirect evaluation efforts towards this objective. To address this, we propose an original methodology to evaluate segmentation methods, based on the generation of realistic synthetic images with explicitly controlled biomarker values. Image synthesis is based on Stable Diffusion, conditioned on either a 1D vector (clinical attributes or latent representation) or a 2D feature map (latent representation). We demonstrate the relevance of this approach in the context of myocardial lesions observed in cardiac late Gadolinium enhancement MR images, controlling the image synthesis with segmentation masks or infarct-related attributes, including size and transmurality. We evaluate it on two datasets of 3557 and 932 pairs of 2D images and segmentation masks, the second dataset being used for testing only. Our conditioning not only leads to very realistic synthetic images but also brings varying levels of task complexity, which is essential to better assess the readiness of segmentation methods.
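For readers unfamiliar with the conditioning mechanisms mentioned in the abstract, the following minimal PyTorch sketch illustrates the two routes in principle: a 1D clinical-attribute vector folded into the timestep embedding, and a 2D latent map concatenated channel-wise with the noisy latent. This is an illustrative assumption, not the authors' architecture; all names (ConditionedDenoiser, attr_dim, cond_map) are hypothetical.

```python
# Hedged sketch of the two conditioning routes described in the abstract.
# Names and dimensions are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a VAE latent z_t."""
    def __init__(self, latent_ch=4, cond_map_ch=4, attr_dim=3, emb_dim=128):
        super().__init__()
        self.cond_map_ch = cond_map_ch
        self.time_mlp = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.attr_mlp = nn.Sequential(nn.Linear(attr_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.in_conv = nn.Conv2d(latent_ch + cond_map_ch, 64, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, 64)   # injects 1D conditioning as a per-channel bias
        self.out_conv = nn.Conv2d(64, latent_ch, 3, padding=1)

    def forward(self, z_t, t, attrs=None, cond_map=None):
        # (b) 2D route: concatenate an encoded segmentation map with the noisy latent.
        if cond_map is None:
            cond_map = z_t.new_zeros(z_t.size(0), self.cond_map_ch, z_t.size(2), z_t.size(3))
        h = F.silu(self.in_conv(torch.cat([z_t, cond_map], dim=1)))
        # (a) 1D route: fold clinical attributes into the timestep embedding.
        emb = self.time_mlp(t.float().view(-1, 1))
        if attrs is not None:
            emb = emb + self.attr_mlp(attrs)
        h = h + self.emb_proj(emb)[:, :, None, None]
        return self.out_conv(h)  # predicted noise

# Usage with random stand-in tensors (real use: VAE-encoded LGE MRI latents).
model = ConditionedDenoiser()
z_t = torch.randn(2, 4, 32, 32)
t = torch.randint(0, 1000, (2,))
attrs = torch.tensor([[0.4, 0.7, 0.5],   # e.g. normalized infarct size,
                      [0.1, 0.2, 0.9]])  # transmurality and slice position
eps_hat = model(z_t, t, attrs=attrs)
print(eps_hat.shape)  # torch.Size([2, 4, 32, 32])
```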
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1968_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/creatis-myriad/cLDM_project
Link to the Dataset(s)
https://www.creatis.insa-lyon.fr/Challenge/myosaiq/
BibTex
@InProceedings{DelRom_Controllable_MICCAI2025,
author = { Deleat-Besson, Romain and Goujat, Celia and Bernard, Olivier and Croisille, Pierre and Viallon, Magalie and Duchateau, Nicolas},
title = { { Controllable latent diffusion model to evaluate the performance of cardiac segmentation methods } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15961},
month = {September},
pages = {100--109}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a novel evaluation methodology for segmentation models by leveraging latent diffusion models (LDMs) to synthesize medical images with controlled biomarker attributes. By conditioning image generation on either 1D vectors (e.g., infarct size, transmurality) or 2D segmentation maps, the authors aim to generate realistic cardiac MRI images for benchmarking segmentation model performance. The methodology is tested on myocardial infarct segmentation in LGE-MRI, and evaluated via downstream clinical attribute recovery and Fréchet Inception Distance (FID).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Tackles an important and underexplored problem: evaluating segmentation models beyond conventional metrics and datasets.
- Proposes a forward-thinking use of diffusion models to synthesize clinically meaningful data with explicit control over attributes.
- Demonstrates technical understanding of conditioning strategies, with comparisons across multiple control methods (attributes, latent vectors, ControlNet, autoencoders).
- Visual examples show realistic and diverse anatomical variation in generated data.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The core goal of evaluating segmentation models under varying levels of difficulty or attribute-based stress is not fully realized. Only a single segmentation model (nnU-Net) is tested, and no comparative or failure-case analysis is presented.
- The methodology’s clinical relevance is underdeveloped. While attribute control is promising, the paper lacks evidence that this framework offers practical insights beyond traditional evaluation pipelines.
- The claim that this supports clinical transferability is speculative and unsupported by real-world model behavior under varying conditions.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces a creative and potentially impactful idea — using controlled latent diffusion models to create synthetic evaluation datasets for segmentation. However, in its current form, the work reads as an early proof-of-concept rather than a robust evaluation framework. The authors test only one model and stop short of demonstrating whether the generated data truly challenge, differentiate, or reveal weaknesses in segmentation methods. Without stronger experimental validation, and given the methodological caveats around conditioning and circularity, I cannot recommend acceptance at this stage. The idea is promising but premature.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper presents a novel evaluation framework tailored to cardiac MRI segmentation methods, leveraging the generation of controlled synthetic data using conditioned latent diffusion models. The approach enables the synthesis of realistic images by conditioning on clinically relevant attributes such as infarct size, transmurality, and slice position, allowing the creation of diverse and physiologically meaningful datasets. Several conditioning strategies are explored, including both 1D clinical vectors and 2D segmentation-derived maps, which offer flexibility in generating varied and task-relevant scenarios. This methodology shifts the evaluation focus from traditional overlap-based metrics to the assessment of clinically meaningful biomarkers, offering a more application-driven understanding of segmentation performance in cardiac imaging.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors introduce a new way to evaluate medical image segmentation by focusing on clinical relevance rather than just geometric accuracy. Instead of relying on traditional metrics like Dice or Hausdorff distance, it uses synthetic cardiac MRI images with controlled attributes like infarct size and transmurality to test whether segmentations preserve key clinical details. This approach ties model performance directly to practical outcomes, offering a more meaningful way to assess segmentation quality.
- It explores how to effectively condition latent diffusion models (LDMs) for medical image synthesis, a relatively new area. It compares several strategies, including 1D conditioning with clinical attributes and 2D conditioning using segmentation maps. The study finds that 2D conditioning, especially through AutoEncoder concatenation, best preserves spatial and anatomical accuracy, achieving strong alignment with clinical metrics like infarct size (r² = 0.90). This provides clear guidance for generating realistic and controlled cardiac MR images, offering valuable tools for both evaluation and data augmentation.
- The manuscript applies its framework to a real clinical challenge: assessing myocardial infarcts in late gadolinium enhancement (LGE) MRI. By conditioning image synthesis on cardiac-specific attributes like transmurality and slice position, the approach stays directly relevant to clinical practice. Validation with cardiologists and testing on a large dataset of over 3,000 slices reinforce its real-world impact. Combining traditional metrics with biomarker-based analysis shows strong control over clinically important features, making the evaluation both technically sound and clinically meaningful.
- The authors also provide extensive implementation details (e.g., model architectures, hyperparameters) and commit to releasing their code.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Limited Benchmarking Against Prior Synthesis Methods
- The authors position latent diffusion models (LDMs) as a novel tool for generating synthetic medical images with controlled attributes. However, they do not compare their approach to established generative methods like GANs and DDPM.
Narrow Evaluation of Segmentation Models
- The framework is validated exclusively on nnU-Net, a single segmentation model. This limits insights into whether the evaluation paradigm generalizes to other architectures. For example, if a model like TransUNet or SwinUNet achieved high Dice scores but poor biomarker fidelity, it would powerfully illustrate the limitations of traditional metrics. The absence of multi-model testing weakens the argument for moving beyond Dice/HD.
Generalizability
- The authors claim their model avoids memorization but do not provide evidence (e.g., t-SNE plots of latent spaces). Additionally, the framework is validated only on cardiac LGE MRI. Prior multi-modal synthesis studies (e.g., on brain/liver MRI) highlight the importance of diverse datasets for robustness, but this is not explored.
- The title “Controllable latent diffusion model to evaluate the performance of segmentation methods” implies a generalizable framework, but the work is exclusively validated on cardiac MRI. Domain-specific limitations (e.g., reliance on cardiac biomarkers like transmurality) are not acknowledged, risking misinterpretation. A more specific title would better align with the scope.
- Section 2.1, titled ‘Latent Diffusion Model’, appears to mostly summarize existing background knowledge rather than presenting details of the proposed method. While it is important to contextualize the approach, this content might be more appropriately placed in the Introduction or a separate Background section to keep the Methods section focused on implementation details and novel contributions.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The paper reports clinical regression scores (r²) and traditional metrics like Dice and FID separately, but doesn’t show how they relate. A combined view (like a table or a scatter figure comparing Dice with r²) would make it clearer when high segmentation accuracy doesn’t actually reflect clinical relevance.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see the weaknesses listed above.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors propose a novel framework based on controllable latent diffusion models conditioned on clinically meaningful attributes. This enables the generation of realistic synthetic data with specific lesion characteristics, which can then be used to assess how well segmentation methods preserve those attributes. The core contribution is compelling and has the potential to shift how segmentation performance is evaluated, especially in clinical contexts. The rebuttal clarified several points satisfactorily. The clarification regarding the scope of the work was acceptable given the paper’s revised title and focus.

However, the response to the concern regarding the use of only a single segmentation model (nnU-Net) remains insufficient. Since the central aim is to propose an evaluation framework, validating it on just one architecture is limiting. A core strength of such a framework should lie in its ability to differentiate segmentation performance across architectures that might perform similarly under traditional metrics but diverge in preserving clinically significant attributes. The inclusion of additional dominant models would have strengthened the argument, particularly if the new evaluation revealed discrepancies missed by Dice or HD.

Additionally, while the authors touched on the relation between traditional metrics and clinical attributes, they missed an opportunity to conduct a more direct quantitative correlation analysis. Although I mistakenly referenced r² in the original review (as it appears in Table 1), my actual suggestion was to compare sample-level clinical attribute scores (e.g., transmurality, infarct size) with segmentation metrics like Dice. Such an analysis would provide a clearer picture of how well traditional scores align with clinically meaningful outcomes and reinforce the need for the proposed approach.

Despite these shortcomings, the paper presents an innovative idea with clear potential for impact. It is a meaningful step toward aligning technical performance metrics with clinical value, and the methodology, while not fully validated, is original and well-motivated. I lean toward a borderline acceptance, based on the novelty and potential of the proposed evaluation strategy.
Review #3
- Please describe the contribution of the paper
The paper introduces a novel framework that uses controllable latent diffusion models to generate realistic synthetic cardiac MRI images conditioned on clinical attributes or segmentation masks. This enables the evaluation of segmentation methods based not only on traditional overlap metrics but also on their ability to preserve clinically meaningful biomarkers (e.g., infarct size, transmurality), offering a more clinically relevant assessment of segmentation performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper appears to present a novel approach to evaluating segmentation methods by leveraging controllable latent diffusion models to generate synthetic cardiac MRI images with precisely defined clinical attributes. A key strength lies in the originality of using image synthesis not just for data augmentation but as a means to create clinically meaningful and diverse test sets, enabling evaluation of segmentation performance through the recovery of clinically relevant biomarkers. The use of real-world challenge data (MYOSAIQ) and expert visual assessment further supports the clinical feasibility of the synthetic images. I think there is potential to generalize this methodology to other anatomical structures and imaging modalities, and this could overcome current limitations with data collection and sharing.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The evaluation is limited to a single segmentation model (nnU-Net), without comparison to other segmentation architectures such as transformer-based models, which might restrict the generalizability of the findings. The realism of the synthetic images is constrained by limited texture diversity, which the authors acknowledge, and which may limit their utility for evaluating segmentation under varied real-world conditions. The current work is confined to 2D cardiac MRI slices and does not address the challenges of scaling to 3D or time-resolved (4D) data.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents a timely contribution that addresses key limitations in medical image segmentation evaluation by introducing a controllable latent diffusion model framework for generating synthetic cardiac MRI images. Although it took a bit of time to get my head around the methodology, I think the approach is both novel and impactful and shifts the focus from traditional overlap metrics to clinically meaningful biomarker preservation. The ability to generate realistic, diverse, and attribute-controlled synthetic images offers a powerful tool for testing segmentation models under a range of clinically relevant scenarios, which is particularly valuable given the challenges of data access, annotation variability, and limited dataset diversity in medical imaging.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I think this paper should be accepted.
Author Feedback
Thank you for the valuable feedback and positive assessment. Our work introduces an original evaluation framework for segmentation models (validated in detail on cardiac LGE MRI from the public dataset MYOSAIQ), based on controllable latent diffusion models and conditioned on clinical attributes or segmentation maps. Its novelty and relevance were highlighted by all three reviewers.
Major concerns: [R1,R2,R3] Comparisons with other segmentation and [R2] image synthesis models:
For segmentation, we focused on nnU-Net, which is SOTA on many segmentation tasks and very relevant to demonstrate the value of our pipeline. For image synthesis, we focused on latent diffusion, which is also SOTA. Due to the page limit, we prioritized extensive comparisons on conditioning (including SOTA methods such as ControlNet) to first prove that our contribution on clinical attributes is valid. These comparisons were highlighted by reviewers R2 and R3. The updated Discussion will mention that future work will cover comparisons with other segmentation and image synthesis models, and on other tasks.
[R3] The generated data truly challenge […] segmentation methods:
We showed that we can effectively control the generation based on clinical attributes and obtain synthetic images whose segmentations lead to Dice and HD in line with the literature (Tab. 2). Moreover, our work extends beyond conventional evaluation metrics by incorporating cardiac-specific attributes, enhancing interpretability from a clinical perspective (R1, R2). This also allows enriching existing databases with synthetic samples that exhibit under-represented characteristics. Finally, the difference between conditioned and predicted attributes reveals limitations of the traditional evaluation with Dice (see comment below), which is a major demonstration of how meaningful our approach is (also highlighted by R3).
[R2] Relation between r² and Dice:
r² is a global score computed over all samples, which cannot be compared to individual Dice values. The reviewer probably meant the difference between conditioned and predicted attributes, for each sample. We have verified that this difference is (partially) related to the Dice through a non-linear decreasing trend. However, for very small infarcts, both Dice and the attribute difference can be low. Conversely, some synthetic images show high Dice despite poor agreement with target attributes, hence the need for clinically meaningful attributes in the evaluation. We won’t add this experiment, as this would not fit MICCAI rebuttal guidelines, but we will add a few sentences supporting the value of r² to complement Dice and HD.
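For clarity, the per-sample analysis the authors allude to (relating the gap between conditioned and predicted attributes to the Dice score) could look roughly like the hedged sketch below. The arrays (dice_scores, attrs_conditioned, attrs_predicted) are hypothetical stand-ins with random values, not results from the paper.

```python
# Hedged sketch of the per-sample comparison discussed above:
# relate the |conditioned - predicted| attribute gap to the Dice score.
# Array names and contents are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200
dice_scores = rng.uniform(0.3, 0.95, n)               # per-sample Dice of the segmentation
attrs_conditioned = rng.uniform(0.0, 1.0, (n, 2))     # e.g. target infarct size, transmurality
attrs_predicted = attrs_conditioned + rng.normal(0, 0.1, (n, 2))  # attributes recovered from predictions

attr_gap = np.abs(attrs_conditioned - attrs_predicted).mean(axis=1)

# A (possibly non-linear) decreasing trend would appear as a negative
# rank correlation between Dice and the attribute gap.
rho, p = spearmanr(dice_scores, attr_gap)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Samples with high Dice but a large attribute gap are the cases where
# overlap metrics alone would overstate clinical fidelity.
flagged = np.where((dice_scores > 0.85) & (attr_gap > 0.2))[0]
print("high-Dice / high-gap samples:", flagged)
```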
[R3] Evidence of practical insights beyond traditional evaluation:
Controlling image generation with clinically relevant attributes is essential to create meaningful and realistic synthetic datasets that challenge clinical experts (R1, and see experts’ scores p7). Our framework emphasizes the use of synthetic data not merely for augmenting dataset size, but as a tool for downstream tasks (like segmentation) (R1,R2), by enabling a more nuanced understanding of the limitations of traditional evaluation pipelines.
[R3] Caveats on conditioning and circularity:
We respectfully disagree on conditioning, as multiple conditioning strategies were proposed. Also, there is no circularity, since only images are augmented. Additional variability can be introduced by applying transformations to segmentation maps as done in Fig3.
Minor concerns: [R2] Claim on memorization:
We had computed the MSE between each synthetic image and all training slices, identified the closest one, and compared them qualitatively to support the claim of no memorization. We will update the paper with this explanation.
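A minimal sketch of this memorization check, assuming images are stored as equally sized, normalized arrays (function and variable names are illustrative, not the authors' code):

```python
# Hedged sketch of the memorization check described in the rebuttal:
# for each synthetic image, find the training slice with the lowest MSE,
# then inspect the pair qualitatively. Names are illustrative assumptions.
import numpy as np

def nearest_training_slice(synthetic, training):
    """synthetic: (H, W); training: (N, H, W). Returns index and MSE of the closest slice."""
    mse = ((training - synthetic[None]) ** 2).mean(axis=(1, 2))
    idx = int(np.argmin(mse))
    return idx, float(mse[idx])

# Example with random stand-in data (real use: normalized LGE slices).
rng = np.random.default_rng(0)
training_slices = rng.random((500, 128, 128))
synthetic_image = rng.random((128, 128))
idx, mse = nearest_training_slice(synthetic_image, training_slices)
print(f"closest training slice: #{idx}, MSE = {mse:.4f}")
# The synthetic image and training_slices[idx] would then be compared visually.
```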
[R2,R3] Text editing:
We will make the title more specific: “(…) cardiac segmentation methods”, and retitle Sec. 2.1 as “Background knowledge.” Additionally, we will soften the speculative claim by removing “for clinical transfer” from the Abstract.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper proposes a novel evaluation framework for medical image segmentation based on a controllable latent diffusion model conditioned on clinically meaningful attributes. The core idea—quantifying model performance through phenotype-driven data synthesis—offers a new perspective on how segmentation accuracy aligns with clinical relevance. Reviewers 1 and 2 supported acceptance, citing the method’s originality and its potential to influence the evaluation of segmentation pipelines. Reviewer 3 remained cautious, pointing to the limited validation scope and lack of evidence for practical clinical benefit. The rebuttal addressed many of these concerns thoughtfully, clarifying the value of attribute conditioning and acknowledging current limitations such as testing with a single segmentation model (nnU-Net). While broader validation across architectures and tasks would strengthen the contribution, the current study presents a promising direction for more meaningful, clinically grounded assessment of segmentation performance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A