Abstract

Automated Universal Lesion Detection (ULD) based on computed tomography (CT) images provides physicians with rapid and objective information regarding lesion locations and shapes. However, it is difficult to detect universal lesions in various regions because of the disparity in lesion sizes and the grayscale variation present in CT images. In this paper, we propose DetectDiffuse, a multiscale diffusion model driven by feature aggregation and 3D attention. First, we utilize the diffusion model to generate noisy detection boxes, incorporating a scale factor to simulate lesions at different scales and mitigate detection errors. Second, we develop a Neighborhood Aggregation (NA) module to enhance the model’s capability to distinguish between lesioned and normal tissues. This module aggregates features within and around detection boxes, reducing false detections caused by significant grayscale differences in lesions. Third, we propose a 3D Stripe Attention (SA) module leveraging dimensional disambiguation. This module uses an attention mechanism to extract information across different dimensions of CT images more effectively. We performed comparison experiments on five datasets. The results show that the proposed method outperforms the 12 compared state-of-the-art methods and improves performance by 5.82% over the best of them.
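
As a rough illustration of what aggregating features "within and around detection boxes" could mean geometrically, the sketch below enumerates the eight same-sized boxes surrounding a candidate box, in line with the 8-neighborhood mentioned in the reviews and author feedback on this page. The function name `neighbor_boxes`, the (x1, y1, x2, y2) box format, and the optional clipping are assumptions made for illustration and are not taken from the paper. The attention-based feature aggregation itself is not shown.

```python
def neighbor_boxes(box, image_size=None):
    """Return the 8 boxes of the same size that surround `box` in a 3x3 grid.

    box:        (x1, y1, x2, y2) in pixels (hypothetical format).
    image_size: optional (width, height); if given, neighbors are clipped
                to the image, so boxes at the border may shrink or collapse.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    neighbors = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the candidate box itself
            nb = (x1 + dx * w, y1 + dy * h, x2 + dx * w, y2 + dy * h)
            if image_size is not None:
                width, height = image_size
                nb = (
                    max(0, min(width, nb[0])),
                    max(0, min(height, nb[1])),
                    max(0, min(width, nb[2])),
                    max(0, min(height, nb[3])),
                )
            neighbors.append(nb)
    return neighbors


if __name__ == "__main__":
    for nb in neighbor_boxes((10, 10, 20, 30), image_size=(100, 100)):
        print(nb)
```

Keeping the neighbors the same size as the candidate box follows the rebuttal's remark that enlarging the neighborhood box to 1.5× or 2× lowered performance; how the actual module samples and weights these regions may differ.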

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1540_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiXin_DetectDiffuse_MICCAI2025,
        author = { Li, Xinyu and Ai, Danni and Fan, Jingfan and Fu, Tianyu and Song, Hong and Xiao, Deqiang and Yang, Jian},
        title = { { DetectDiffuse: Aggregation- and Attention-driven Universal Lesion Detection with Multi-scale Diffusion Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {153 -- 163}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes DetectDiffuse, a multi-scale diffusion model detector for universal lesion detection. Two novel strategies, namely Neighbor Boxes Aggregation and 3D Stripe Attention, are proposed to improve the model’s ability to distinguish between lesioned and normal tissues, and to extract information across different dimensions of CT images. Experiments on the DeepLesion dataset show that DetectDiffuse outperforms 12 compared state-of-the-art methods and improves performance by 5.82% compared with the best method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Two novel strategies are proposed to improve the model’s ability to distinguish between lesioned and normal tissues, and to extract information across different dimensions of CT images.
    2. Comprehensive comparison. DetectDiffuse outperforms 12 compared state-of-the-art methods on DeepLesion and 4 external datasets.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The first novel strategy, Neighbor Boxes Aggregation, is a straightforward aggregation of information from 8 neighboring ROIs using self-attention. I don’t understand why traditional backbone networks such as CNNs or transformers cannot do this. Besides, the paper says “As long as the feature extractor identifies a region with significant differences, it can signal to the lesion detection decoder that this region is highly likely to contain a lesion.” However, if a lesion detector simply signals each ROI that is different from its context, it will generate many false positives. A lesion detector should learn to compare lesion regions with normal regions that look similar to a lesion, instead of comparing a region with its neighboring regions.
    2. Some details in the method part are not described clearly. For example, what is stripe pooling in section 2.3? How does the multi-scale diffusion model work, and why was the diffusion model chosen as the backbone detector?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty is somewhat limited and the method description is not very clear.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal clarified my questions. Other reviewers also indicated strengths of the paper.



Review #2

  • Please describe the contribution of the paper

    The framework introduces two key modules: (1) a Neighborhood Aggregation (NA) module that aggregates features around detection boxes to reduce false positives caused by grayscale similarities, and (2) a 3D Stripe Attention (SA) module that models long-range 3D spatial information across three directions to enhance weak lesion features. Experiments on multiple datasets show that DetectDiffuse outperforms state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clever design of NA and SA modules: The Neighborhood Aggregation (NA) module is a well-motivated design that reduces false positives by comparing features inside and around detection boxes. The 3D Stripe Attention (SA) module effectively captures spatial information along three directions; if inspired by prior works, appropriate citations would further improve the clarity.

    Inclusion of Zero-Shot Evaluation: The paper validates the model’s generalization ability through zero-shot experiments on multiple external datasets, which is crucial for Universal Lesion Detection (ULD) tasks aiming at real-world deployment.

    Well-Structured Ablation Studies: The ablation experiments carefully isolate the effects of the NA and SA modules, providing convincing evidence that each component meaningfully contributes to the overall performance gains.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Limited ablation experiments for NA and SA modules: While the NA and SA modules are well-motivated, the experimental analysis around them is insufficient. There are many alternative designs (e.g., different neighborhood window sizes for NA, replacing 3D convolutions in SA with Transformers) and hyperparameters that could significantly affect performance. Some exploration, even if brief due to MICCAI page limits, would strengthen the validation of the current design choices.

    Unclear technical details on NA and SA transition: Important implementation details are missing, such as how the output of NA is transitioned into the SA module — whether features are cached to disk or processed in parallel. If processed in parallel, how to ensure batch sampling is random rather than neighboring slices (which may impact learning efficiency) is not discussed.

    Inadequate reporting and discussion of mAP metrics: The paper does not clearly define the mAP calculation (e.g., whether it is AP@0.5). Moreover, for ULD tasks, mAP alone is not the most informative metric, since clinical usage is more concerned with high sensitivity at low false positive rates. It would be important to show FPPI results from 0.5 up to at least 4.0, and to achieve over 95% sensitivity at reasonable FPPI levels.

    Lack of analysis on zero-shot results: The model shows very strong zero-shot performance compared to baselines, but the reasons behind this large gap are not analyzed. Understanding whether it is due to better feature generalization, overfitting differences, or data preparation inconsistencies would add value.

    Performance gain mainly at low FPPI: From the current FPPI@0.5 and FPPI@1 results, the method improves performance, but the margins are modest. As FPPI increases beyond 2.0, the differences among methods might shrink or even reverse. It would be important to report a more complete FPPI-sensitivity curve.
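
For readers unfamiliar with the metric discussed above, the following is a minimal sketch of how sensitivity at fixed false positives per image (FPPI) levels is commonly computed from ranked detections. It is not the paper's evaluation code. The function name `sensitivity_at_fppi` and the input format are assumptions, and it presumes each detection has already been labeled as a true or false positive (for example by IoU ≥ 0.5 matching, with at most one detection credited per lesion).

```python
def sensitivity_at_fppi(detections, num_lesions, num_images,
                        fppi_levels=(0.5, 1, 2, 4)):
    """Best sensitivity reachable at or below each FPPI level.

    detections:  list of (score, is_true_positive) pooled over all images
                 (hypothetical format; matching is assumed done beforehand).
    num_lesions: total number of ground-truth lesions.
    num_images:  total number of images (slices) evaluated.
    """
    # Sweep the score threshold from high to low by ranking detections.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    result = {level: 0.0 for level in fppi_levels}
    tp = fp = 0
    for _score, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        for level in fppi_levels:
            # Record the operating point only if it respects the FP budget.
            if fp / num_images <= level:
                result[level] = max(result[level], tp / num_lesions)
    return result


if __name__ == "__main__":
    dets = [(0.9, True), (0.8, False), (0.7, True),
            (0.6, False), (0.5, False), (0.4, True)]
    print(sensitivity_at_fppi(dets, num_lesions=4, num_images=2))
    # {0.5: 0.5, 1: 0.5, 2: 0.75, 4: 0.75}
```

Evaluating the same function over a dense grid of FPPI levels yields the full FPPI-sensitivity curve requested here. Tied scores are not handled specially in this sketch.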

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would prefer to give a borderline score, but since it is not available this year, I will assign my current score accordingly. I appreciate the contributions of this work and would be happy to raise my recommendation to accept after the rebuttal, if the authors can adequately address the concerns I have raised.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Most of my concerns are addressed.



Review #3

  • Please describe the contribution of the paper

    The paper presents DetectDiffuse, a novel multi-scale diffusion model framework for universal lesion detection in CT images that effectively addresses two major clinical challenges: false positives caused by lesion-like normal tissues and false negatives from underutilized 3D spatial information. The key innovation lies in its integration of a Neighborhood Aggregation (NA) module that suppresses false detections through attention-weighted feature comparison between lesion candidates and surrounding tissues, and a computationally efficient 3D Stripe Attention (SA) module that employs directional 1D convolutions to capture long-range spatial dependencies across three orthogonal planes. Extensive validation on the DeepLesion dataset demonstrates state-of-the-art performance with 84.71% sensitivity and 63.89% mAP, representing a 5.82% improvement over existing methods, while maintaining strong zero-shot generalization across four additional datasets (BraTS2021, COVID-19-20, LiTS, and Task08) without fine-tuning. This work makes significant methodological contributions by being the first to combine diffusion models with 3D attention mechanisms for lesion detection, offering a practical solution that balances accuracy and computational efficiency through its innovative 1D stripe attention design, with important implications for improving diagnostic workflows in clinical practice.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    ① First Diffusion Model for Universal Lesion Detection (ULD). Novelty: This is the first work to adapt diffusion models (typically used for generation) to detect lesions in 3D medical images. The multi-scale noise injection and reverse diffusion process uniquely address size-varying lesions, unlike fixed anchor boxes or center-point methods. Why Important: Diffusion models offer a probabilistic framework to refine detection boxes iteratively, mimicking radiologists’ “hypothesize-and-refine” workflow, which is new for ULD.

    ② Neighborhood Aggregation (NA) Module for Clinically Relevant FP Reduction. Novelty: The NA module introduces attention-weighted feature comparison between lesion candidates and surrounding tissues, explicitly modeling how radiologists distinguish lesions from normal tissue. Why Important: This directly tackles a major clinical hurdle (false positives wasting radiologists’ time) by suppressing detections that resemble background (e.g., vessels, bones). Prior work rarely modeled this context explicitly.

    ③ Computationally Efficient 3D Stripe Attention (SA) with Directional Disentanglement. Novelty: The SA module replaces heavy 3D self-attention with lightweight 1D convolutions along three orthogonal planes, achieving global 3D context at local computation cost. Why Important: This is the first directional attention design for ULD, enabling efficient modeling of long-range dependencies (critical for detecting subtle lesions) without transformers’ GPU bottlenecks, which is key for real-world deployment.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    ① Limited Justification for Diffusion Models in Detection. Weakness: While the use of diffusion models for lesion detection is novel, the paper does not adequately justify why this approach is superior to established detection paradigms (e.g., anchor-free methods like FCOS or CenterNet), especially given the high computational cost of diffusion. Evidence: Prior work (e.g., DiffusionDet) has shown diffusion models for natural image detection, but the claimed benefits (e.g., multi-scale noise) are not rigorously compared to simpler multi-scale techniques (e.g., FPN). No ablation studies compare the diffusion framework to a non-diffusion baseline (e.g., replacing the diffusion decoder with a standard R-CNN head). Impact: The added complexity of diffusion may not be justified if similar performance could be achieved with simpler methods.

    ② Lack of Real-World Clinical Validation. Weakness: The paper claims clinical relevance but lacks validation in real clinical workflows (e.g., radiologist-in-the-loop evaluation, deployment on hospital-grade hardware). Evidence: Performance is measured only on retrospective datasets (DeepLesion, BraTS). There is no testing on prospective data or in clinical settings, unlike works like Liao et al. (2019), which validated lesion detection with radiologist feedback. There is no analysis of inference speed, GPU memory usage, or compatibility with hospital PACS systems, all of which are critical for clinical adoption. Impact: The method’s practicality for clinical use remains unproven.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend Accept (Score 5) for this paper. The key strengths are its novel integration of diffusion models with 3D attention for lesion detection, clinically-motivated NA and SA modules that effectively address FP/FN problems, and comprehensive validation across multiple datasets showing SOTA performance. While the computational cost and clinical deployment details could be better analyzed, the methodological innovation and rigorous experiments make this a clear accept. The paper advances the field with a technically sound, clinically-relevant solution that is well-described and reproducible.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    None




Author Feedback

We thank all reviewers for their thoughtful reviews and constructive comments. Our responses to the reviewers’ comments follow.

Technical Details

[R2Q1] Unclear explanation of innovation. The NA module first learns lesion features via supervised training, then aggregates 8-neighborhood information by comparing the neighboring regions with those lesion features. Thus, instead of simply treating distinct regions as lesions, it enhances spatial awareness around lesions, filtering out normal tissues that resemble lesions and highlighting subtle lesions. Experiments show that it effectively reduces false positives. A CNN is limited by its local receptive field, which makes it hard to capture large lesions; a Transformer, though global, focuses mainly on lesions and may ignore normal tissue features because of its data-driven attention distribution.

[R2Q2] Explanation of details. Stripe pooling captures the contextual information of long stripe regions by performing pooling operations along the length, width, and height axes of the feature map separately. This enhances the model’s feature extraction capability for targets with long-distance dependencies.

[R2Q2&R3Q1] Lesions vary in scale and are sparsely distributed. The multi-scale diffusion model is better suited to this variability because it generates noise boxes with different sizes and locations. Anchor-free methods are more suitable for small, dense targets because they only predict the center point of the target. Anchor-based methods require manual anchor design, making them less adaptable to lesions with diverse shapes and sizes. We therefore adopt the diffusion model as the baseline.

[R1Q2] Unclear technical details on data transition. The transition between the two modules is sequential: features aggregated by NA are duplicated, with one copy cached in memory and the other passed to SA for 3D stripe information aggregation. The information aggregated by SA is then used to weight the features in memory.

Experiments and results

[R1Q1] Limited ablation experiments.
① The NA module adaptively generates the neighborhood box based on the noise box. In an additional experiment, we set the size of the neighborhood box to 1.5× and 2× that of the noise box, which reduced mAP@50 by 1.61% and 2.47% and average FPPI by 4.59% and 7.81%, respectively. This is because larger boxes may include other organs and hinder feature discrimination.
② Transformers offer limited performance gains over CNNs, while their quadratic complexity, compared with the linear complexity of CNNs, significantly increases computational cost. Our method, however, can run efficiently on most mainstream hardware.
③ For training, we adopted the recommended hyperparameters from the 2D baseline (DiffusionDet) and the Detectron2 framework. Our comparison method, DiffULD, also uses these hyperparameters.

[R1Q3&Q5] Insufficient discussion of metrics. We will clarify in the main text that mAP@50 is used as the evaluation metric and add FPPI@2 and FPPI@4 to the result tables. Results show that DetectDiffuse achieves 93.81% and 97.15% at FPPI@2 and FPPI@4, which are 3.46% and 3.91% higher than the best comparison method, SATr.

[R1Q4] Lack of analysis on 0-shot results. The superior 0-shot performance of our method is due to the NA module. During pre-training on the DeepLesion dataset, NA learns lesion-specific features through supervised learning. In 0-shot testing, it compares these features with surrounding tissues, filtering out normal tissue that resembles lesions and enhancing subtle lesions. Other methods only aggregate 3D features at lesion sites, ignoring non-lesion regions, which leads to poorer performance.

[R3Q2] Clinical validation. We have obtained approval from the hospital’s Ethics Committee to collect data on liver and thymus lesions and to conduct experiments. The experimental results were independently verified by three specialized physicians and were unanimously approved. Our method consumes 23.1 GB of GPU memory and takes 2-3 minutes to run inference on 200-300 slices of medical images, which meets the clinical need for speed in lesion screening.
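
The stripe pooling described in the response to [R2Q2], together with the statement that the information aggregated by SA is used to weight the cached features, can be illustrated with a small sketch. This is one possible reading under stated assumptions, not the authors' implementation: the feature map is a plain D × H × W nested list, each axis is summarized by average pooling over the other two axes, the three profiles are summed by broadcasting, and a sigmoid turns the sum into a weight. The learned layers of the real module (for example the 1D convolutions mentioned by Reviewer #3) are omitted, and the names `stripe_profiles` and `stripe_attention` are hypothetical.

```python
import math


def stripe_profiles(vol):
    """Average-pool a D x H x W volume into one 1D profile per axis."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    prof_d = [sum(vol[d][h][w] for h in range(H) for w in range(W)) / (H * W)
              for d in range(D)]
    prof_h = [sum(vol[d][h][w] for d in range(D) for w in range(W)) / (D * W)
              for h in range(H)]
    prof_w = [sum(vol[d][h][w] for d in range(D) for h in range(H)) / (D * H)
              for w in range(W)]
    return prof_d, prof_h, prof_w


def stripe_attention(vol):
    """Re-weight each voxel by a sigmoid gate built from the three profiles."""
    prof_d, prof_h, prof_w = stripe_profiles(vol)
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Broadcast-sum the profiles so every voxel sees context along all axes.
    return [[[vol[d][h][w] * sigmoid(prof_d[d] + prof_h[h] + prof_w[w])
              for w in range(W)]
             for h in range(H)]
            for d in range(D)]


if __name__ == "__main__":
    volume = [[[8, 0], [0, 0]], [[0, 0], [0, 0]]]  # toy 2 x 2 x 2 feature map
    print(stripe_profiles(volume))
    print(stripe_attention(volume)[0][0][0])
```

In a real network the same idea would normally be expressed with tensor operations and learned weights; the nested-list version is used here only to keep the example dependency-free.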




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers are positive to accept this work after the rebuttal. Following these ratings, I think this work can be published in MICCAI 2025.


