Abstract

Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets contain anywhere from dozens to a couple of thousand tumor masks, but hospitals hold hundreds of thousands of tumor CTs with radiology reports. Leveraging these reports to improve segmentation is therefore essential for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset of 6,718 CT-report pairs (from the UCSF Hospital) and merged it with public CT-mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation: the F1 score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50) and when many are available (e.g., 1.7K). Project: https://github.com/MrGiovanni/R-Super

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0049_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MrGiovanni/R-Super

Link to the Dataset(s)

https://github.com/mrgiovanni/radgpt

BibTex

@InProceedings{BasPed_Learning_MICCAI2025,
        author = { Bassi, Pedro R. A. S. and Li, Wenxuan and Chen, Jieneng and Zhu, Zheren and Lin, Tianyu and Decherchi, Sergio and Cavalli, Andrea and Wang, Kang and Yang, Yang and Yuille, Alan L. and Zhou, Zongwei},
        title = { { Learning Segmentation from Radiology Reports } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {305 -- 315}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The research discusses the use of more readily available medical reports as a means to train disease segmentation. A novel loss function, the “Report-Supervision Loss”, is proposed to convert information from the radiology reports into per-voxel targets. The loss consists of two components (Volume Loss and Ball Loss). The Volume Loss utilizes an LLM to extract the size of the described tumor from the report, builds a target volume from it, and penalizes the difference between the model’s predicted tumor volume and this target volume with a chosen difference measure. The Ball Loss similarly constructs non-learnable kernels shaped like balls whose diameters match the tumor diameters in the report. After some further refinement, a pseudo ground-truth mask is constructed, which can be used to train the model with standard distance-based losses. The approach is evaluated on a private abdominal dataset and on data from multiple centers. Improved tumor detection performance is obtained when using the proposed Report-Supervision Loss.
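
    A minimal sketch of the Volume Loss idea described above, assuming a PyTorch segmenter. The tensor shapes, the unit conversion, and the L1 penalty are illustrative assumptions, not the paper's exact formulation (the authors state they experimented with several difference measures):

        import torch

        def volume_loss(pred_probs: torch.Tensor,
                        organ_mask: torch.Tensor,
                        target_volume_ml: float,
                        voxel_volume_ml: float) -> torch.Tensor:
            # pred_probs: (D, H, W) voxel-wise tumor probabilities (post-sigmoid).
            # organ_mask: (D, H, W) binary mask restricting the loss to one organ.
            # target_volume_ml: tumor volume extracted from the report by an LLM.
            # Soft predicted volume: sum of probabilities inside the organ,
            # converted from voxel counts to milliliters.
            pred_volume_ml = (pred_probs * organ_mask).sum() * voxel_volume_ml
            # Illustrative L1 penalty on the volume difference.
            return (pred_volume_ml - target_volume_ml).abs()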

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I really appreciate the attempt to utilize the information in the more readily available text reports as a form of supervised training. I think the idea is novel and can have a large impact. The current approach seems solid at first glance, although it might not account for many edge cases (e.g., tumors described without a given size, or LLMs failing to extract the tumor size). Future work can expand on these cases.

    • The approach is tested on data from multiple centres
    • The design of the two combined losses seems well suited to translating the extracted information into segmentation labels
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • For the volume loss it is mentioned that “We experimented with diverse loss functions to penalize this difference”. There is no discussion of a separate validation set. Did you experiment with losses for optimal performance on P1-Test?
    • It is strange to see the supervised (segmentation) model perform so poorly. There are numerous studies indicating much better detection performance after training on far fewer images. https://pubmed.ncbi.nlm.nih.gov/35053538/ https://www.mdpi.com/2072-6694/16/13/2403 https://www.nature.com/articles/s41591-023-02640-w
    • Is a single model used to detect both pancreatic and kidney tumors? That could answer my previous remark, but raises a new one: what is the performance boost of a tumor-specific model?
    • “The Volume Loss, applied to intermediate layers of the segmenter”: can the authors comment on what is meant here by intermediate layers, and how is this implemented? “Organ sub-segment masks improve the precision of Report-supervision Loss”: can the authors provide some evidence of this?
    • What is the criterion for detection with the segmentation model? Please report AUROC/FROC curves.
    • Why are there varying numbers of cases in the test sets?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Which variant of LLAMA 3.1 was utilized?
    • In the abstract the unseen hospital is said to be OOD. Does the hospital use different scanners? It might be unseen, but not per se OOD.
    • ChatGPT has a tendency to overuse em dashes (—). They are not that common in everyday English.
    • AbdomenAtlas2.0 is not publicly available. I think the community can greatly benefit from a public release of the dataset.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed idea and implementation is novel and has high potential impact. I find the current validation, or explanation of the validation procedure somewhat limiting, with many details omitted. The fully supervised segmentation methods (although trained on AbdomenAtlas 2.0 and not P1) has an extremely low performance, which is much lower than related literature on this task. I am willing to improve my score based on clarification on the questions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Many of the concerns were addressed.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a Report-supervision Loss that converts radiology reports into per-voxel supervision for segmentation models, thus improving segmentation accuracy using large-scale training data that has reports but no masks. The loss uses information from radiology reports (tumor presence, location, size, and quantity) to optimize segmentation, ensuring the segmented tumors are consistent with the radiology reports. Experiments on one internal dataset and an external dataset demonstrate the proposed loss outperforms two baselines: segmentation only and multi-task learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of transforming text-level report supervision into voxel-level supervision is interesting and reasonable. Existing segmentation methods typically train with masks alone. While we know reports are useful, there are no well-established methods to use them; multi-task learning is not always useful. This paper proposes a series of rules and losses to convert textual descriptions into mask supervision. Although they seem somewhat inelegant and ad hoc, they indicate a new and direct way to leverage reports.
    2. Experiments on two tumor types and two datasets, with either full or few-shot supervision, support the new method.
    3. The method considers many details, such as what happens if a training patch contains only part of the organ.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In Table 1, not all metrics improve with the proposed loss. For example, specificity decreased in two cases and sensitivity dropped in one case, even though the counterpart metric improved. This should be explained. AUC or F1-score should be computed.
    2. An ablation study should be conducted to show the contribution of each component, e.g., the volume loss and the ball loss. Which information is more useful: tumor presence, location, size, or quantity?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    In the ball loss, what if the report did not indicate the number of lesions? For example, “multiple cysts exist in the kidney”. If the report described lesions but the segmentation model did not find any lesion, how would you generate the pseudo-mask? The authors should consider qualitatively and quantitatively measuring the accuracy of the generated pseudo-masks.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is inspiring.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    All reviewers agree on acceptance and the rebuttal addressed my concerns.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a semi-supervised segmentation method for CT images that exploits diagnostic reports as ground-truth labels for training the segmentation model. Technically, the paper describes a semi-supervised training loss that constrains the volume and position of the resulting segmentation. The method’s efficacy is validated with multiple large public datasets of pancreatic and kidney tumors.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This is one of the early works on diagnostic-report-supervised segmentation for 3D CT images, although similar approaches have been explored for 2D X-ray images. A unique novelty of this paper is the training loss that considers the target tumor volume and the organ sub-segment where the tumor is found in the report.
    • Large-scale validation: this paper uses multiple large-scale datasets, including one newly curated by the authors. The results showed the method achieved higher sensitivity and robustness on external test cohorts.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    It would be preferable to add more discussion of the performance limitations that come from utilizing report information. For example, in Table 1, the sensitivity increased but the specificity decreased compared to the baseline method. It is unclear how this performance balance is determined. Also, because the sensitivity-specificity balance differs across the other tables, it is hard to compare the performance figures naively. I also wonder about the qualitative differences in the detection results.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The purpose of and approach to exploiting large unorganized datasets is written clearly, and the contribution will be beneficial for many readers. The validation of the method’s efficacy on large-scale datasets is persuasive.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This is an interesting paper.




Author Feedback

Thank you for your positive ratings (WA, A, WA) and constructive feedback. A shared concern was the limited evaluation metrics, which we have now addressed.

Q1: Results show a trade-off in a few cases: higher sensitivity but lower specificity in two cases, and the opposite in one.

Sensitivity and specificity often move in opposite directions. To better reflect overall performance, we now include F1-Score and AUC.

On the P1 test set, our method achieved F1/AUC = 0.82/0.90 (pancreas tumors) and 0.76/0.78 (kidney tumors), surpassing all baselines by at least +0.15/+0.12 and +0.06/+0.05, respectively.

On the P2 test set, it achieved F1/AUC = 0.91/0.92 (pancreas tumors), surpassing all baselines by at least +0.11/+0.08.

Q2. No DSC and NSD.

We have now reported DSC/NSD on P2, the only test set with tumor masks (pancreas). Our method achieved DSC/NSD of 0.59/0.69, surpassing the segmentation baseline by +0.08/+0.10 and MTL by +0.21/+0.26.

To R1

Q1: No description of the validation set. We randomly split the entire train set, i.e., P1-Train + AbdomenAtlas, into training (90%) and validation (10%). Note that this validation set is different from P1-Test.

Q2: Why did the supervised (segmentation) model perform poorly? We carefully read the references you provided. They used in-distribution (ID) test sets with the same hospitals, scanners, contrast, and patient populations as training. Our test sets (P1-Test, P2) are out-of-distribution, from hospitals not seen by the segmentation model. P1-Test also includes non-contrast scans, in which tumors are hard to see.

Q3: What is the performance of the universal and specialized tumor models? Our paper implemented a universal model (one model for kidney and pancreas tumors). We have now trained pancreas-only models, which did slightly better. On the P1 test set, the specialized model improved by +0.01/0.01 (F1/AUC); on P2, by +0.01/0.06.

Q4: Where is the volume loss applied? How? We applied it to decoder layer 2 of MedFormer. Before computing the loss, we apply a 1x1x1 convolution to reduce channels, a sigmoid activation, and linear interpolation to 1 mm spacing. We will make the code public for reproducibility.
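
A minimal sketch of this intermediate-layer head, assuming a MedFormer-like 3D decoder in PyTorch. The class name, argument names, and the externally computed target shape are hypothetical; only the 1x1x1 conv + sigmoid + interpolation pipeline comes from the rebuttal:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VolumeLossHead(nn.Module):
        # Maps decoder features to tumor probabilities on a 1 mm grid,
        # so voxel sums translate directly into physical volumes.
        def __init__(self, in_channels: int, num_classes: int = 1):
            super().__init__()
            self.reduce = nn.Conv3d(in_channels, num_classes, kernel_size=1)

        def forward(self, feats: torch.Tensor, shape_1mm: tuple) -> torch.Tensor:
            # feats: (B, C, D, H, W) features from decoder layer 2.
            probs = torch.sigmoid(self.reduce(feats))
            # Trilinear (3D "linear") interpolation to the isotropic 1 mm grid.
            return F.interpolate(probs, size=shape_1mm,
                                 mode='trilinear', align_corners=False)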

Q5: Do organ sub-segment masks improve the precision of the Report-supervision Loss? Yes. We ran an ablation replacing pancreas sub-segment masks with full-organ masks. Performance dropped by 0.05/0.09 (F1/AUC).

Q6: Why did the test set sizes vary? Sorry, this was a mistake. A memory bug skipped a few predictions; after fixing it, metrics shifted by <2%, leaving all conclusions unchanged.

Q7: Which LLAMA 3.1? 70B, instruct, AWQ.

To R2.

Q1: Qualitative results of tumor detection. Thanks for the suggestion. We will include visualization in the final version. An expert radiologist has reviewed the tumor detection results and confirmed that our method is more helpful for their workflow than the baseline methods.

To R3.

Q1: Ablation studies on the ball and volume losses. On the P1 test set, the combined losses achieved the best results: F1/AUC = 0.82/0.90 for pancreas tumors and 0.76/0.76 for kidney tumors. Volume loss alone: -0.06/-0.00 (pancreas) and -0.01/-0.02 (kidney). Ball loss alone: -0.11/-0.08 (pancreas) and -0.02/-0.01 (kidney).

Q2: In the ball loss, what if the report misses the number of lesions? We skip the loss in these cases, which are rare in our dataset: 3.9% of pancreas tumor reports and 10.8% of kidney tumor reports.

Q3: If the AI did not find a lesion, how does the ball loss generate the pseudo-mask? If the report mentions a tumor, the ball loss selects the region with the highest (even if low) predicted tumor probability (Sec. 2.3). When the AI output is near zero everywhere, the volume loss helps: it discourages all-zero outputs with strong gradients (Eq. 2). This is why combining ball and volume losses outperforms the ball loss alone.
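
A minimal sketch of this behavior, assuming a single reported lesion. The function name and the argmax-centered ball placement are illustrative simplifications of the paper's kernel-based construction (Sec. 2.3):

    import torch

    def ball_pseudo_mask(pred_probs: torch.Tensor, diameter_vox: float) -> torch.Tensor:
        # pred_probs: (D, H, W) tumor probabilities. The ball is centered on
        # the voxel where the model is most confident, even if that
        # probability is low everywhere.
        D, H, W = pred_probs.shape
        flat = int(torch.argmax(pred_probs))
        cz, cy, cx = flat // (H * W), (flat // W) % H, flat % W
        zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H),
                                    torch.arange(W), indexing='ij')
        dist2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        # Binary ball whose diameter matches the report.
        return (dist2 <= (diameter_vox / 2) ** 2).float()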

Q4: Qualitatively and quantitatively measure the accuracy of the generated pseudo mask. An expert radiologist will visually inspect these masks. We will add resulting figures and analyses in the final version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Although all reviewers provided positive feedback, all of them raised concerns about the effectiveness of the proposed method, because specificity decreased in two cases and sensitivity dropped in one case. In addition, for a segmentation task, the authors should also report conventional metrics such as the Dice score and distance metrics. Hence, a rebuttal from the authors is needed.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a novel loss function, Report-supervision Loss, that effectively leverages radiology reports for weakly supervised segmentation. The approach is original, well-validated across multiple centers, and addresses a practical limitation in medical imaging. While some concerns remain, they are addressable, and the core contribution is strong.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a novel report-supervision loss that transforms radiology reports into per-voxel supervision for training segmentation models. The method is evaluated using two datasets: one internal and one external. The contribution of leveraging large datasets without pixel-level annotations is particularly inspiring.


