Abstract
Language-promptable X-ray image segmentation would enable greater flexibility for human-in-the-loop workflows in diagnostic and interventional precision medicine. Prior efforts have contributed task-specific models capable of solving problems within a narrow scope, but expanding to broader use requires additional data, annotations, and training time. Recently, language-aligned foundation models (LFMs) – machine learning models trained on large amounts of highly variable image and text data, thus enabling broad applicability – have emerged as promising tools for automated image analysis. Existing foundation models for medical image analysis focus on scenarios and modalities where large, richly annotated datasets are available. However, the X-ray imaging modality features highly variable image appearance and applications, from diagnostic chest X-rays to interventional fluoroscopy, with varying availability of data. To pave the way toward an LFM for comprehensive and language-aligned analysis of arbitrary medical X-ray images, we introduce FluoroSAM, a language-promptable variant of the Segment-Anything Model, trained from scratch on 3M synthetic X-ray images from a wide variety of human anatomies, imaging geometries, and viewing angles. The images include pseudo-ground-truth masks for 128 organ types and 464 tools with associated text descriptions. FluoroSAM is capable of segmenting myriad anatomical structures and tools based on natural language prompts, thanks to the novel incorporation of vector quantization (VQ) of text embeddings in the training process. We demonstrate FluoroSAM's performance quantitatively on real X-ray images and showcase several applications in which FluoroSAM is a key enabler for rich human-machine interaction in the X-ray image acquisition and analysis context. Information on data, weights, and code is available at https://github.com/arcadelab/fluorosam.
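As an editorial aside for readers, below is a minimal sketch of the VQ mechanism named in the abstract, assuming a VQ-VAE-style codebook with a straight-through estimator applied to embeddings from a frozen text encoder; the codebook size, embedding dimension, and loss weighting are illustrative placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbeddingVQ(nn.Module):
    """Illustrative vector-quantization bottleneck for text embeddings.

    Assumption: a learned codebook snaps each (e.g., CLIP) text embedding
    to its nearest code so that semantically similar prompts share a token.
    Sizes below are placeholders, not the paper's configuration.
    """

    def __init__(self, num_codes: int = 512, dim: int = 512, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) text embeddings from a frozen text encoder.
        d = torch.cdist(z, self.codebook.weight)  # (batch, num_codes)
        idx = d.argmin(dim=1)                     # nearest code per prompt
        z_q = self.codebook(idx)                  # quantized embeddings
        # Standard VQ-VAE losses: pull codes toward the encoder outputs,
        # and commit the encoder outputs to their assigned codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator so gradients flow past the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```

Quantizing prompts in this way means that paraphrases such as "left femur" and "the left thigh bone" can collapse onto the same code, which plausibly underlies the model's tolerance to free-form phrasing.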
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5042_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/arcadelab/fluorosam
Link to the Dataset(s)
N/A
BibTex
@InProceedings{KilBen_FluoroSAM_MICCAI2025,
author = { Killeen, Benjamin D. and Wang, Liam J. and Iñígo, Blanca and Zhang, Han and Armand, Mehran and Taylor, Russell H. and Osgood, Greg and Unberath, Mathias},
title = { { FluoroSAM: A Language-promptable Foundation Model for Flexible X-ray Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This work has two main contributions: the FluoroSeg dataset, with 3M synthetic X-ray images covering both diagnostic and interventional exams, and FluoroSAM, a SAM variant that adds a Vector Quantization module after the text encoder to ensure that similar semantics are projected to the same token. The dataset is generated from 1,621 real CT scans by simulating projections from different angles to obtain synthetic images. The authors also use organ masks from pretrained segmentation models to generate X-rays from different positions; each mask is then combined with a text description to form the image-text-mask pairs in the dataset. The authors have put substantial effort into the proposed dataset and model training, but I still have several concerns regarding the implementation, necessity, and evaluation.
- Lack of comparison with other interactive segmentation models using text input, such as MedCLIP-SAM, MedCLIP-SAMv2, etc.
- Provide more detail about how the model is trained with the various prompt types, including text, point, and mask. The paper only describes the case where text is used as the prompt.
- The dataset is constructed by simulating synthetic X-rays from just over 1k real CT scans. Even though the generated dataset contains 3M images, they all derive from this limited number of CT scans, so I assume the variety of the X-rays is limited (in terms of disease, age, etc.).
- The performance on CXR is not as good as that of other methods. The authors' explanation is that the synthetic images have systematic differences from the real images. But the cadaver data also comes from real images, so why does the model perform well on those?
- I assume the generated data also includes the same type of images as CXR, right? Why, then, is the performance on real CXR not good? In a real clinical setting, if we want to improve annotation efficiency, it is very important to have good accuracy when only a few points are given, such as one or two. The 2-point results are therefore far more important than the 8-point results. If a model trained on 3M synthetic images performs much worse than other methods in the 2-point case, should we doubt the benefit of training on synthetic data, regardless of whether the scale is 3M or not?
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
as above
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
as above
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
as above
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I appreciate the authors' efforts to address my concerns. However, although the authors state that they can extend the dataset to increase its variety, I still doubt whether training a model on 3M or even more generated images, at high computational cost, is worthwhile if it yields little improvement, or even worse performance, when users provide only a few prompt inputs. They did not convince me on this point, so my suggestion is reject.
Review #2
- Please describe the contribution of the paper
This paper presents a language-promptable segmentation framework for interventional X-ray images, contributing both a novel dataset and a tailored model architecture. The two main contributions can be summarized as follows:
- FluoroSeg Dataset – A large-scale, synthetically generated dataset of interventional X-ray images spanning a range of anatomical regions and surgical tools. The dataset includes paired image and text annotations, enabling the development of language-driven segmentation approaches.
- FluoroSAM Model – A custom-designed segmentation model trained on the FluoroSeg dataset and evaluated on both synthetic and ex-vivo datasets. The model incorporates Vector Quantization (VQ) to enhance promptable segmentation performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Clear and Structured Presentation: The paper is well written and logically structured, with clear articulation of both technical and clinical contributions. This significantly enhances readability and accessibility for a broad audience, including those from both AI and clinical backgrounds.
- Timely and Relevant Contributions: The dual contributions of FluoroSeg and FluoroSAM directly address a timely need in the medical imaging and interventional AI space. The integration of language-promptable tools in surgical imaging aligns well with current trends in human-in-the-loop systems and foundational vision-language models.
- Preliminary Sim-to-Real Validation: The use of ex-vivo data for testing FluoroSAM is commendable. Although preliminary, it offers useful insights into the potential of synthetic training to generalize toward real clinical settings.
- Architectural Innovations: The paper demonstrates that incorporating vector quantization improves performance on promptable segmentation tasks. The model design reflects thoughtful integration of techniques inspired by recent advances in multimodal learning.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Overuse of Qualitative or Vague Language: The paper occasionally relies on informal or non-technical expressions such as “there is enormous opportunity” or “still features significant ambiguity.” While enthusiasm is welcome, the tone should remain objective and evidence-driven in scientific writing. A revision is recommended to eliminate imprecise or promotional language.
- Limited Generalizability in Real-World Testing: The ex-vivo evaluation is limited to a single specimen and a high-end fluoroscopic machine. As such, the claims regarding generalizability and clinical readiness are not yet substantiated. A more nuanced discussion is needed to properly frame the limitations and the scope of generalization.
- Performance Metrics Do Not Meet Clinical Thresholds: Reported Dice scores (0.56–0.70 for synthetic, ~0.6 for real data) fall short of the accuracy typically required for clinical decision-making in intraoperative use cases, surgical navigation, and robotic guidance systems, which are the primary intended areas of the presented work. The paper would benefit from a more critical assessment of how these early results compare with established clinical benchmarks, and what specific steps could close this performance gap.
- Overextended Conclusions and Future Claims: The manuscript concludes with ambitious projections, including autonomous C-arm positioning, robotics, chain-of-thought-based analysis, and telehealth integration. While these are exciting future possibilities, they are not directly supported by the current scope or results of the study. A more pragmatic framing of the current contributions, while acknowledging long-term potential, would strengthen the scientific rigor of the paper.
- Insufficient Description of Image Synthesis Methodology: The image generation pipeline used to create the FluoroSeg dataset is a fundamental component of this work, yet it remains underexplained. Key details such as anatomical models, simulation parameters, tool placements, and physics-based rendering (if any) are missing. This omission limits reproducibility and weakens the transparency of the dataset's construction.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The authors have provided the source code but not the FluoroSeg data.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper makes valuable and timely contributions to the emerging field of language-driven medical image analysis. However, in its current form, it overstates its clinical readiness and omits critical implementation details. With a more conservative interpretation of results and clearer methodological transparency, the paper would be a candidate for publication.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have made acceptable changes to the paper in accordance with the reviewers' requests. As stated in the original review, this paper bridges a real gap and has come in “on time” given the current climate in medical imaging and the expansion of language-promptable networks. The presented dataset is also valuable for the community.
Review #3
- Please describe the contribution of the paper
The paper introduces FluoroSAM, a language-promptable variant of the popular Segment-Anything Model (SAM) tailored to the X-ray imaging domain. To support this (and similar models), the authors also introduce the FluoroSeg dataset, which contains 3 million synthetic X-ray images with mask and text-pair annotations for 128 organ classes and 464 tool classes. The authors also used GPT-4o to extensively augment the X-ray-paired texts, along with a frozen CLIP encoder topped with a vector quantization layer.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The introduction of the FluoroSeg dataset represents a substantial effort to address the scarcity of annotated X-ray images. The simulation framework, which leverages CT scans and segmentation data to generate synthetic X-ray images for various anatomies and viewing angles, is highly innovative and scalable (a toy sketch of the projection idea appears after this list).
- The paper provides thorough experimental validation, including both quantitative results (improved IoU and Dice scores compared to baselines) and qualitative examples. The model is also tested on the CXR and cadaver datasets. While we noticed the metrics dropping with 2-point/box prompts, on all the other metrics the proposed FluoroSAM performs best.
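To make the simulation idea above concrete, here is a toy sketch of how a synthetic X-ray and its pseudo-GT mask could be derived from a CT volume, assuming parallel rays and a simple Beer-Lambert intensity model; the actual FluoroSeg pipeline presumably uses cone-beam geometry and physics-based rendering, so the function names and parameters here are hypothetical.

```python
import numpy as np

def parallel_beam_drr(ct_hu: np.ndarray, axis: int = 1,
                      voxel_mm: float = 1.0, mu_water: float = 0.02) -> np.ndarray:
    """Toy digitally reconstructed radiograph (DRR) from a CT volume.

    Assumption: parallel rays along one voxel axis and a Beer-Lambert
    intensity model. A real pipeline would model cone-beam geometry,
    spectra, and scatter; this only conveys the projection idea.
    """
    # Convert Hounsfield units to linear attenuation coefficients (1/mm).
    mu = mu_water * (1.0 + ct_hu / 1000.0)
    mu = np.clip(mu, 0.0, None)  # air (HU = -1000) attenuates nothing
    # Line integral of attenuation along the ray direction.
    path = mu.sum(axis=axis) * voxel_mm
    # Beer-Lambert law: detected intensity decays exponentially.
    return np.exp(-path)

def project_mask(mask: np.ndarray, axis: int = 1) -> np.ndarray:
    """Project a 3D organ/tool segmentation to a 2D pseudo-GT mask:
    a detector pixel is foreground if any voxel on its ray is."""
    return mask.any(axis=axis)
```

Varying the viewing angle then amounts to resampling (rotating) the volume before projection, which is how a single CT scan can yield many geometrically distinct image-mask pairs.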
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the VQ bottleneck is an innovative addition, the paper does not extensively discuss potential trade-offs or limitations of using vector quantization. The authors mention prompt generalizability in the manuscript, but additional details, including its setup and complexity, would be a good addition.
- The detailed training regimen and high-resolution image processing (e.g., use of a Swin Transformer backbone pre-trained on ImageNet-22k, GPU resource utilization) suggest significant computational demands. A more extensive discussion on inference speed and clinical deployability would strengthen the paper’s practical relevance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, this manuscript is a promising and innovative contribution to the fields of medical image segmentation and human-machine interaction. The integration of language prompts, combined with a large-scale synthetic dataset and detailed engineering of the model’s architecture, is commendable. Despite some concerns regarding reliance on synthetic data and computational overhead, the work provides significant insights with strong quantitative results and clear potential for clinical impact.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have responded clearly to my comments.
Author Feedback
We thank the reviewers for their thoughtful comments. They highlighted the substantial contribution of FluoroSeg (R1, R3), the extensive validation on CXR and cadaver data (R1, R2), and the superior performance of FluoroSAM in many cases (R1, R2). We believe FluoroSAM is a significant step in bringing the benefits of language-aligned foundation models to interventional X-ray, where they have traditionally lagged, in addition to diagnostic imaging. In fact, FluoroSAM has already been used successfully in new CAI systems, including for natural language control of a robotic C-arm [1].
We look forward to incorporating the reviewers’ feedback as described below:
R3.1: We agree: comparison with other text-prompted models would further highlight FluoroSAM’s generalizability to varied views, on which other models have not been trained. We have observed that MedCLIP-SAM(2) fails to produce meaningful masks for X-ray images beyond the narrow views present in CXR datasets.
R3.2: We will provide full training details.
R3.3: Reviewers highlight the scale and variability of FluoroSeg, but the number of CT scans could still be expanded. We emphasize that changes in image appearance due to viewpoint variation are a significant source of data heterogeneity in interventional X-ray imaging that has not been adequately considered in prior work. We agree that anatomical variation is an important consideration and are working on expanding the anatomical models used in simulation, which we will clarify in the discussion.
R3.4, R3.5: The lower performance on CXR images stems from 1) the broad scope of FluoroSAM and 2) performance degradation due to sim-to-real transfer from FluoroSeg. Regarding 1), while the dataset is large, we estimate that only XXXX images represent AP-like chest X-rays, which is below the size of standard chest X-ray datasets. Regarding 2), no additional public X-ray datasets were introduced to help “beautify” the numbers. Yet FluoroSAM achieves performance comparable to MedSAM (which was trained on real CXR images). A more subtle yet important consideration arises from how our ground truth was obtained. GT for the cadaveric images was obtained from projections of 3D segmentations of a registered CT scan. In contrast, GT for the real CXR images was annotated manually by radiologists on the 2D imaging plane, which may introduce subjective differences. The ability to refine segmentations using point prompts is a key feature of FluoroSAM that addresses this issue, which we will clarify.
R3.6: We will clarify the discrepancy in performance between cadaver and real CXR images with 2 point prompts for the reasons described above.
R2.1: We will clarify by explicitly referring to the cited literature for each claim.
R2.2: While we agree that extensive testing is needed for clinical use, we believe the experiments included in this CAI paper, including quantitative results on synthetic and real interventional X-rays and diagnostic CXRs, are appropriate to substantiate our claims.
R2.3: It is not clear what the clinical performance thresholds are, since we do not currently specify a particular use case. While it is easy to argue that higher is always better, for applications like language control of C-arms as in [1] the current performance may be sufficient, while high-precision applications will require future work.
R2.4: Although we do not evaluate downstream applications, we believe a discussion of the potential future work in this area is appropriate. We will clarify that the examples in Fig 5 are based on real model predictions.
R1: We will add a discussion of the trade-offs in VQ, which are reasonable since the number of anatomical structures and tools is limited, and the VQ module ensures that severely ambiguous prompts are avoided.
R1: We will detail the inference speed and memory footprint, which are comparable to other SAM variants.
Overall, we believe these comments have greatly improved the work. Thank you.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While R1 and R2 are satisfied with the author responses, I am still a bit skeptical regarding the clinical applicability of a promptable segmentation model and its usability in a clinical workflow. Both R2 and R3 note that the performance fell short of expectations and that clinical testing still needs to be conducted. This is a borderline paper for me, but I lean slightly toward accept for the novel idea and for a proof of concept that merits further exploration.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While the proposed FluoroSAM represents a novel language-promptable segmentation model trained on a large-scale synthetic X-ray dataset, key limitations remain. Reviewer #1 noted the lack of clarity regarding the realism of synthetic data and how well it transfers to real-world settings. Reviewer #2 praised the potential of the method but pointed out the insufficient ablation studies isolating the contribution of vector quantization (VQ). Reviewer #3 questioned the fairness and reproducibility of comparisons with existing models such as MedSAM, particularly in real data evaluations. Although the proposed approach is innovative and timely, the current submission lacks sufficient validation, ablation, and reproducibility evidence to justify acceptance at MICCAI. Further clarification and more robust experiments would be needed to strengthen the contribution.