Abstract

Whole-body PET/CT imaging provides detailed metabolic and anatomical information, which is critical for accurate cancer staging, treatment evaluation, and radiotherapy planning. Automated lesion captioning for whole-body PET/CT is essential for reducing radiologists’ workload and assisting personalized treatment decisions. Unlike previous works that focus on captioning body-part images, we propose a novel automated lesion captioning framework for whole-body PET/CT images, which usually have large volume and high anatomical variability. Our framework first leverages CLIP for lesion localization, upon which we introduce two location-guided strategies: Confidence-Guided Location Prompts (CGLP), which select top-1 or top-3 anatomical location prompts based on confidence scores to guide captioning, and Dynamic Window Setting (DWS), which applies appropriate intensity windowing to enhance visual representation of the localized regions. To our knowledge, our work is the first to achieve whole-body PET/CT lesion captioning. Experimental results on a large dataset comprising 1867 subjects from Siemens, GE, and United Imaging show that our method not only yields higher BLEU scores compared to state-of-the-art methods, but also produces consistent improvements across multiple scanner makers. This advancement has the potential to streamline radiology reporting and enhance clinical decision-making using whole-body PET/CT images.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0248_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YuMin_LocationGuided_MICCAI2025,
        author = { Yu, Mingyang and Gao, Yaozong and Shu, Yiran and Chen, Yanbo and Liu, Jingyu and Jiang, Caiwen and Sun, Kaicong and Cui, Zhiming and Zhang, Weifang and Zhan, Yiqiang and Zhou, Xiang Sean and Zhong, Shaonan and Wang, Xinlu and Zhao, Meixin and Shen, Dinggang},
        title = { { Location-Guided Automated Lesion Captioning in Whole-body PET/CT Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {349--358}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors describe a method to automatically localize and describe lesions in a whole-body PET/CT scan. The authors combine a text encoder with a position encoder to capture the most likely locations of abnormalities. Next, according to the selected location, a ‘Dynamic Window Setting’ is used for the CT image, similar to how radiologists change the intensity window when inspecting different body parts. A Confidence-Guided Location Prompts module is used, which essentially applies a confidence threshold to select regions for captioning and converts the selected regions into text describing which abnormalities are found.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main contribution of this work is the automatic adaptation of the CT window setting. The window setting is adjusted based on the location of the detected abnormalities. The Confidence-Guided Location Prompts are also a new method, in which a threshold is applied to the output of the CLIP module. In general, automated selection of abnormalities in a PET/CT image is of interest, and it would ease the clinical workflow and reduce the workload of a radiologist.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The two main contributions (Dynamic Window Setting and Confidence-Guided Location Prompts) are not well described in the paper and, if I understand them correctly, they are somewhat new but very straightforward and simple. When describing the Confidence-Guided Location Prompts, the authors mention that they apply a ‘thresholding strategy’. However, the details of this strategy are not explained, which makes it impossible to judge the novelty and complexity of this module and hampers the reproducibility of the method. In the Dynamic Window Setting, the intensity window is adjusted such that it fits the corresponding lesion. In my opinion, this is a very simple step, but it is also not explicitly described how this is done. Do the authors have a simple look-up table where the intensity windows of different body regions are stored? Regarding training and testing data, it is not clear on which abnormalities the model was trained. The authors mention only the number of images and that ‘abnormalities were segmented using nn-UNet’. But it is not clear whether all abnormalities were found by this approach, nor which kinds of abnormalities were found. I also assume that the PSMA scan was 68Ga-PSMA? There are tumor types where malignant regions occur in more than five image regions: how does the model handle these cases? Or are these cases ignored in the current study?

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I miss details about the novel modules of the method. Both modules (Dynamic Window Setting and Confidence-Guided Location Prompts) are not explained in detail. It is not clear how the thresholds of the Confidence-Guided Location Prompts were chosen, and it is not clear how the Dynamic Window Setting is applied. If the Dynamic Window Setting is, e.g., only a look-up table, the novelty is very limited. Moreover, details about the datasets used are missing. Which kinds of abnormalities were included? Were there prostate cancer patients or lung cancer patients? Where does the text for model training come from?

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have clarified my questions. I still think that the novelty is marginal but it is an interesting work.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a two-stage framework for generating captions for whole-body PET/CT images. The first stage applies CLIP to find the alignment between input 3D patches and 130 anatomical location prompts. The tokenized and word-embedded forms of the most confident anatomical locations are then concatenated with the imaging features, and a Dynamic Window Setting adjusts the intensity window based on the lesion region identified in the first stage. The windowed image, along with the location prompts, is then fed into a text decoder (XTransformer) to generate a caption for the image.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a novel approach for generating captions for whole-body PET/CT images, rather than for specific body organs, showing the efficiency of the framework. Furthermore, they use a large dataset acquired on different scanners, improving the generalizability of the model. Also, qualitative visualizations are provided to compare the text generated by this framework with that of other methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. It’s not clear whether the split of the data into training, validation, and test sets is based on unique patient IDs. If that’s not the case, there may be potential data leakage in the experiments.
    2. Furthermore, for a more robust evaluation, it would be better to perform cross-validation to report the mean and standard deviation of the results, and to use an external dataset for testing the model.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed framework seems promising, but approving the results requires understanding whether the data split was based on unique patient IDs.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors stated that the data split was based on unique patient IDs and that cross-validation results will be added to the paper. My only remaining concern, which was not addressed, is that the provided results may not generalize to external datasets. The area chairs can decide on this.



Review #3

  • Please describe the contribution of the paper

    The automated captioning method for whole-body PET/CT broadens the scope beyond prior body-part-specific approaches. Confidence-Guided Location Prompts (CGLP) and Dynamic Window Setting (DWS) are both techniques addressing real challenges in whole-body imaging.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Prompting and window adjustment are logical extensions of existing clinical heuristics; the work extends the scope from body-part-specific imaging to whole-body PET/CT; the method is evaluated on a reasonably large dataset.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The proposed CGLP and DWS modules are conceptually sound, but the novelty is methodologically marginal; real clinical evaluation/feedback is missing; there is no discussion of failure cases; analysis of runtime, computational cost, and model size is missing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Logical extensions of existing applications; marginal novelty in the proposed methodology.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Most of the issues are addressed by the authors. Therefore, the paper could be accepted now.




Author Feedback

Q1: The reviewer raises concerns regarding the Dynamic Window Setting module, questioning its novelty and clarity, and whether the method simply uses a lookup table to assign intensity windows to different body regions in CT. (R1) A1: Using different intensity windows for different anatomical regions in CT (covering 130 locations) is clinical prior knowledge. The key challenge in this task is that the lesion location is initially unknown, making it necessary to first localize the lesion. Therefore, the novelty of the Dynamic Window Setting module lies in leveraging the CLIP model for lesion localization, which allows the system to automatically determine the lesion’s approximate position and dynamically adjust the intensity window accordingly, without requiring manual adjustment by radiologists. For example, a lesion may appear very close to the heart but still requires lung window settings for proper observation. Our method enables automatic window selection using CLIP, which is the key contribution of our approach.
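The location-conditioned windowing described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the mapping from location to (level, width), the function names, and the specific Hounsfield-unit values are assumptions (the values are approximate, commonly used clinical windows), and the real system covers 130 locations rather than the four shown.

```python
# Hypothetical sketch of a Dynamic Window Setting step, assuming the
# lesion location has already been predicted by the CLIP stage.
import numpy as np

# Clinical prior: approximate (window level, window width) per region,
# in Hounsfield units. Illustrative values only, not from the paper.
WINDOWS = {
    "lung":        (-600, 1500),
    "mediastinum": (40, 400),
    "liver":       (60, 160),
    "bone":        (400, 1800),
}

def apply_window(ct: np.ndarray, location: str) -> np.ndarray:
    """Rescale a CT volume to [0, 1] using the intensity window
    associated with the predicted anatomical location."""
    level, width = WINDOWS[location]
    lo, hi = level - width / 2, level + width / 2
    return np.clip((ct - lo) / (hi - lo), 0.0, 1.0)
```

The point made in the rebuttal is that the selection key (`location`) comes from the CLIP localization stage rather than from a radiologist's manual choice, e.g. a lesion adjacent to the heart can still receive the lung window when CLIP localizes it to the lung.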

Q2: The reviewer questions the Confidence-Guided Location Prompts (CGLP) module: the thresholding strategy is not clearly explained, including how the thresholds were chosen. This makes it hard to assess the module’s novelty and affects reproducibility. (R1) A2: In our method, the threshold in the Confidence-Guided Location Prompts (CGLP) module is designed to ensure that the location prompt embeddings provided to the captioning module are reliable. When the Top-1 confidence score is above 0.95, the prediction achieves over 90% accuracy on our validation set, so we consider it sufficiently reliable to guide the captioning process. However, when the confidence score is below 0.95, the correct location prompt is more likely to appear within the Top-3 predictions. Therefore, in such cases, we convert the Top-3 predictions into embeddings, which provides more reliable prompts to the model.

Q3: The reviewer raises concerns about how the model handles cases where multiple lesion types appear in the same region, whether such cases are supported by the model or ignored in the study, and whether all abnormalities are detected. (R1) A3: Our model processes each lesion type separately, even if multiple types exist in the same region. For each captioning step, the input image contains only one lesion type mask and its corresponding text description. This is attributed to our dataset post-processing pipeline, which includes separating connected lesions and manual review by radiologists to filter out false positive lesions and to ensure image-text correspondence.

Q4: Dataset details: radiotracer, disease types, text source, dataset split (R1, R2) A4: Our study uses 18F-FDG and 18F-PSMA as radiotracers. The dataset includes patients with lymphoma, nasopharyngeal carcinoma, lung cancer, prostate cancer, and liver cancer. As noted in Section 3.1, the data is split into training, validation, and test sets based on unique patient IDs to prevent data leakage. The text for model training is derived from radiology reports. Specifically, we first applied an LLM-based approach to convert lesion information into structured reports, which were then reviewed and refined by radiologists to ensure accuracy and alignment with the corresponding images. We will revise Section 3.1 to clearly include these details.

Q5: Evaluation and computational cost (R2, R3) A5: Our model has 400.16M parameters, and the average inference time is 0.512 seconds per lesion, which requires relatively low computational resources and is practical for deployment in hospitals. In the ablation studies (Section 3.4), particularly the evaluations of localization and CT finding accuracy, we believe the results closely reflect real clinical scenarios. Furthermore, our proposed approach has already been integrated into radiology-assisted image interpretation software and deployed across several medical centers. We will add cross-validation and failure case analysis to the paper.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Needs clarification on novelty/method contribution and experimental setup.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


