Abstract
Accurate lung cancer localization and classification in computed tomography (CT) images are vital for effective treatment. However, existing approaches still face challenges such as redundant information in CT images, ineffective integration of clinical prior knowledge, and difficulty in distinguishing subtle histological differences across lung cancer subtypes. To address these challenges, we propose Cross-Modal Detection Auxiliary Classification (CM-DAC), a framework that enhances classification accuracy. It employs a YOLO-based slice detection module to extract lesion areas, which are then processed by the Multimodal Contrastive Learning Pretrain (MCLP) module to minimize redundancy. Specifically, MCLP aligns 3D patches with clinical records via a cross-modal hierarchical fusion module, integrating image and clinical features through attention mechanisms and residual connections. Additionally, we employ multi-scale fusion strategies to further enhance histological distinction by capturing features at different resolutions. Simultaneously, a text path expands category labels into semantic vectors using a medical ontology-driven text augmentation approach. These vectors are encoded and aligned with the fusion feature vectors. Experimental results on both private and public datasets confirm that the proposed CM-DAC outperforms competitive methods, achieving superior classification performance.
The code is available at https://github.com/fancccc/CM-DAC.
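For concreteness, a minimal PyTorch sketch of the symmetric contrastive alignment described in the abstract, using the τ = 0.07 value reported in the author feedback (R1Q4); the function and tensor names are illustrative, not the repository's API:

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(fused: torch.Tensor, text: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning fused image+clinical embeddings with
    text embeddings; matching rows of `fused` and `text` are positive pairs.
    tau = 0.07 follows the author feedback (R1Q4)."""
    fused = F.normalize(fused, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = fused @ text.t() / tau                        # (batch, batch) cosine similarities / tau
    targets = torch.arange(fused.size(0), device=fused.device)
    return 0.5 * (F.cross_entropy(logits, targets)         # image-to-text direction
                  + F.cross_entropy(logits.t(), targets))  # text-to-image direction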
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0864_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/fancccc/CM-DAC
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{FanChe_Clinical_MICCAI2025,
author = { Fan, Chenchen and Elazab, Ahmed and Zhang, Songqi and Wang, Yuxuan and Liang, Qinghua and Li, Danna and Zhang, Yongquan and Xiang, Ying and Liu, Bo and Wang, Changmiao},
title = { { Clinical Prior Guided Cross-Modal Hierarchical Fusion for Histological Subtyping of Lung Cancer in CT Scans } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15974},
month = {September},
pages = {75 -- 84}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a novel end-to-end framework for lung tumor detection in CT images and histological subtyping classification, referred to as Cross-Modal Detection Auxiliary Classification. The model integrates a 2D YOLOv11 architecture for tumor detection and a contrastive learning approach to incorporate multimodal data, including imaging, clinical, and textual information. In this approach, categorical labels (e.g., invasive adenocarcinoma) are expanded into more detailed semantic descriptions, which are then encoded into vector representations. This Multimodal Contrastive Learning Pretraining module enhances the classification of lung cancer subtypes by capturing subtle differences between similar categories. The authors demonstrate that this method improves performance compared to conventional approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors present a novel approach that effectively integrates CT images, clinical data, and text-based category descriptions to distinguish subtle histological differences among lung cancer subtypes. Rather than directly classifying subtypes, the model learns robust multimodal representations that align imaging and clinical data with text descriptions, leveraging a contrastive learning framework similar to CLIP. This allows the model to encode images and clinical data during inference and compare their representations to text prototypes, selecting the most probable classification. Another key contribution is the introduction of a cross-modal hierarchical fusion model, which integrates multi-scale image features with clinical data, enhancing the model’s ability to capture meaningful correlations. This approach improves the classification of lung cancer subtypes by leveraging richer contextual information.
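To illustrate the CLIP-style inference step described above, a minimal PyTorch sketch of zero-shot classification against text prototypes; the function and tensor names are hypothetical, not the authors' code:

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(fused_embedding: torch.Tensor, text_prototypes: torch.Tensor) -> int:
    """Return the index of the class whose text prototype has the highest
    cosine similarity with the fused image+clinical embedding (CLIP-style).
    fused_embedding: (dim,); text_prototypes: (num_classes, dim)."""
    z = F.normalize(fused_embedding, dim=-1)
    prototypes = F.normalize(text_prototypes, dim=-1)
    return int((prototypes @ z).argmax())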
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The title suggests that the model is designed for lung cancer diagnosis, but the focus appears to be on histological subtyping rather than diagnosis itself. The title should be adjusted to reflect the actual task.
- The authors used both a public and an in-house dataset, but it is unclear whether they combined them for training and evaluation or used cross-validation separately. Additionally, since the datasets contain different categories, it is not specified whether they merged the categories or how they handled discrepancies. It is also unclear how many final categories/classes were considered in the study.
- The model employs 2D YOLOv11 for lung nodule detection and extracts a 32mm³ volume around the estimated centroid. However, this cube may be too small given the typical size of lung tumors. The authors also mention including a 15mm margin around the lesion, but they do not clarify how this margin was estimated without segmentation masks for the nodules. Additionally, the input image size and whether images were resampled before cropping are not specified.
- The paper does not specify the value of the temperature parameter (τ) in the contrastive loss function, which is crucial for reproducibility.
- The authors state that numerical clinical variables were normalized but do not clarify whether they used min-max normalization, standardization, or another method. This should be explicitly stated.
- The evaluation metrics should be introduced in the Methods section to provide clarity on how the model’s performance is assessed.
- Issues in Table 1 Presentation: a. Normally, performance improvements should be highlighted in green, and decreases should be in red, but the table follows the opposite convention. b. Since results are reported from cross-validation, they should be presented in the format mean ± standard deviation for proper interpretation. c. Statistical significance tests (e.g., DeLong test for AUC) should be included to validate whether performance differences are meaningful.
- The results for the LPCD dataset using ResNet18 and ViT are identical for all metrics except AUC, where there is an unusually large gap of 14.87%.
- The authors report detection results for both single-class and multi-class scenarios using YOLOv11, but these settings are not discussed in the Methods section (a minimal single-class detection setup is sketched after this list).
- The authors claim that their model demonstrates robust cross-institution generalization, but it appears that both datasets were combined and used in 5-fold cross-validation rather than being evaluated on an independent test set from a different institution. Without such a setup, their generalization claim is not well supported.
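On the single-class detection setting mentioned above: the author feedback (G5) states that YOLOv11x with single-class lesion detection was used. A minimal sketch of such a setup with the Ultralytics API, under the assumption of a hypothetical dataset config and image path:

from ultralytics import YOLO  # pip install ultralytics>=8.3

# Single-class lesion detection, per the author feedback (G5).
# "lesions.yaml" (one class: lesion) and the image path are hypothetical.
model = YOLO("yolo11x.pt")
model.train(data="lesions.yaml", epochs=100, imgsz=512)

results = model.predict("ct_slice.png", conf=0.25)
for box in results[0].boxes:
    print(box.xyxy, box.conf)  # lesion bounding box and confidence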
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
This paper presents an interesting and promising approach with strong potential. However, improving the clarity of explanations—particularly regarding the model design, dataset usage, and methodology—would enhance readability and comprehension. Additionally, a more structured and thorough presentation of results, including statistical significance tests, would strengthen the paper’s impact and provide more robust support for its claims. Addressing these aspects would significantly improve the quality of the work, which has great potential.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Some methodological details, including the number of classification classes, dataset usage, and the training and evaluation strategies, are not well explained, making it difficult to fully assess the effectiveness of the approach. That said, the proposed approach is interesting and has strong potential. If the authors address these issues in the rebuttal—clarifying methodological details, training and evaluation strategies, and including statistical significance tests—it would significantly improve the paper’s clarity, leading to a higher overall score.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal satisfactorily addressed my main concerns regarding task definition, dataset usage, evaluation metrics, and architectural details. While I still have a minor concern about how detection quality might affect downstream classification—an aspect that could benefit from further clarification—the authors have addressed nearly all other key points effectively.
Review #2
- Please describe the contribution of the paper
This paper presents a cross-modal framework called CM-DAC for accurate localization and classification of lung cancer. The method integrates CT images and clinical prior knowledge, employing a multi-scale feature fusion strategy to enhance the differentiation of lung cancer subtypes. While the work demonstrates technical merit and innovation, several aspects remain concerning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The problem is well-defined, proposing solutions to key challenges in lung cancer diagnosis, e.g., redundant information, clinical prior knowledge integration, and subtype differentiation.
- The methodology is comprehensive, encompassing a complete pipeline of detection, multi-modal fusion, and zero-shot classification.
- The approach is thoroughly validated using 5-fold cross-validation on both public and private datasets, with ablation studies demonstrating the effectiveness of each component.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The authors report that the YOLOv11n variant achieves mAP50 = 0.701 for single-class detection but only mAP50 = 0.467 in multi-class scenarios, raising questions about how the subsequent classification framework compensates for this detection weakness.
- The paper mentions expanding labels like “invasive adenocarcinoma” into more detailed descriptions but does not describe a systematic approach to this expansion or provide concrete examples (a minimal sketch of one plausible implementation follows this list).
- Table 1 shows improvements in accuracy, precision, and AUC on the LPCD dataset, but decreased recall and F1 scores compared to the TMSS method.
- The private dataset exhibits significant class imbalance (53.5% invasive, 24.1% microinvasive, 22.4% in situ adenocarcinoma). There is a critical absence of negative samples, including inflammatory lesions and other benign nodules.
- The multi-level attention fusion algorithm (Algorithm 1) appears unnecessarily complex. The necessity of certain steps, such as “spatial-to-sequence transformation” and “dynamic clinical expansion,” in the fusion process is not well justified. The authors call it “Multi-scale Fusion” at the beginning but “Multi-scale Aggregation” in the method section, and they describe the multi-level attention as hierarchical, although it seems more like a multi-scale method.
- Despite mentioning “medical ontology-driven text augmentation” in the abstract, the paper provides little detail about the practical implementation of medical ontologies.
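On the label-expansion point above: the author feedback (G6) confirms a CLIP text encoder and the template “A pulmonary nodule showing histologic features of xx”. A minimal sketch of that step, assuming the OpenAI CLIP package and the private dataset's class names; the backbone choice and exact implementation are assumptions:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

classes = ["invasive adenocarcinoma", "microinvasive adenocarcinoma", "adenocarcinoma in situ"]
prompts = [f"A pulmonary nodule showing histologic features of {c}" for c in classes]

with torch.no_grad():
    # One text prototype per histological class, used as contrastive targets.
    text_prototypes = model.encode_text(clip.tokenize(prompts).to(device))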
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- How are multiple lesions within a single slice processed? Are they handled independently or collectively, and when multiple lesions are present, how are their classification results aggregated?
- In the abstract, the abbreviation CM-DAC appears without being defined.
- In the introduction, the authors incorrectly present computed tomography as the gold standard for nodule diagnosis; this is a clear clinical misstatement, since biopsy and pathological diagnosis constitute the gold standard.
- How do the authors calculate the “average of 17.3% ± 4.1%” from [19]?
- While “clinical records” are referenced throughout the paper, their content (demographic details and morphological features) is only briefly mentioned in the dataset section. It would be helpful to demonstrate which feature is most important to the final classification. (I know it is difficult to explain in the rebuttal, but I think it would strengthen this work.)
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper proposes a valuable multi-modal framework for lung cancer diagnosis, the current presentation is not sufficiently rigorous, and the aforementioned issues need to be addressed to enhance the work’s credibility.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
A critical limitation of this study that has not been adequately addressed is the absence of benign pulmonary nodules in both training and testing datasets. While the paper claims to achieve “Accurate Lung Cancer Diagnosis,” one of the most challenging aspects in clinical practice is differentiating between benign and malignant pulmonary nodules. Without including benign cases in the evaluation, the model’s ability to perform this crucial discrimination remains unvalidated. This oversight has significant clinical implications: if deployed, the current method might suggest immediate intervention for all detected pulmonary nodules, including benign ones, which would lead to unnecessary procedures. This represents an unacceptable clinical scenario and undermines the practical utility of the proposed method. For the study to be clinically relevant, it needs to demonstrate efficacy in distinguishing between benign and malignant cases, as this differentiation is fundamental to accurate lung cancer diagnosis.
Review #3
- Please describe the contribution of the paper
The paper introduces a novel multi-modal deep learning framework for classifying histological subtypes of lung cancer using CT images and clinical data. The proposed pipeline integrates several distinct components: a YOLO-based slice detection module for extracting 3D nodule patches, a 3D ResNet for image feature extraction at multiple scales, and a fully connected layer for processing clinical data. These features are then fused using an attention-based multi-scale fusion module. The model is trained using a contrastive loss between the fused features and text embeddings derived from class labels. During inference, the model employs zero-shot prediction based on the fused representations. The method is evaluated on both a private and a public dataset, demonstrating consistent improvements over unimodal and multimodal approaches.
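To make the fusion step concrete: a minimal PyTorch sketch of one way such an attention-based multi-scale fusion could look. The 64–512 channel scales and 256-D projection come from the author feedback (G2); everything else (head count, pooling, clinical dimension) is an assumption, not the authors' exact module:

import torch
import torch.nn as nn

class MultiScaleAttentionFusion(nn.Module):
    """Illustrative sketch: project multi-scale 3D ResNet feature maps
    (64-512 channels per the author feedback) to a shared 256-D space,
    pool each scale into a token, and fuse with a clinical-feature token
    via multi-head attention plus a residual connection."""
    def __init__(self, scale_channels=(64, 128, 256, 512), clinical_dim=16, dim=256):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Conv3d(c, dim, kernel_size=1) for c in scale_channels])
        self.pool = nn.AdaptiveAvgPool3d(1)            # one token per scale
        self.clinical_fc = nn.Linear(clinical_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feature_maps, clinical):
        # feature_maps: list of (B, C_i, D, H, W) tensors; clinical: (B, clinical_dim)
        tokens = [self.pool(p(f)).flatten(1) for p, f in zip(self.projections, feature_maps)]
        img_tokens = torch.stack(tokens, dim=1)                # (B, num_scales, dim)
        clin_token = self.clinical_fc(clinical).unsqueeze(1)   # (B, 1, dim)
        seq = torch.cat([clin_token, img_tokens], dim=1)
        fused, _ = self.attn(seq, seq, seq)
        return fused.mean(dim=1) + seq.mean(dim=1)             # residual, (B, dim)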
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper addresses the important and clinically relevant task of non-invasive histological subtype classification using CT images and clinical data.
- The proposed approach is fully automated and does not rely on manual annotation of nodules, increasing its potential for scalability and clinical deployment.
- Integrating clinical data with imaging features is a strength, as it leverages complementary information across modalities for improved performance.
- The use of multi-scale imaging features allows the model to capture both coarse and fine-grained characteristics of the nodules.
- The method demonstrates consistent improvements over baseline models on both private and public datasets, highlighting its generalizability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper does not exhibit any major methodological flaws; however, there are a few minor issues that could be addressed to improve clarity and reproducibility:
- While the code is provided, additional architectural details would be helpful: specifically, the type of text encoder used for label embedding, the specific scales at which features are extracted from the 3D ResNet, and the stage of the network at which these features are obtained.
- It is unclear how the multi-class results are computed, i.e., whether weighted or macro averaging is used. Including this information, along with standard deviations across cross-validation folds, would improve the statistical transparency of the evaluation.
- In Section 3.2 (second sentence of the second paragraph), it would be beneficial to include the names of the multimodal baseline methods in addition to citing their references.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper addresses an important clinical task with a well-designed, fully-automated multi-modal deep learning approach that combines imaging and clinical data effectively. It demonstrates strong performance on both private and public datasets, and the methodology is sound. While a few clarifications are needed regarding architectural details and result reporting, these are minor and do not detract from the overall quality and novelty of the work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My initial review did not raise major concerns. The points I mentioned were intended to improve clarity and reproducibility, and the authors have addressed them. I maintain my original recommendation to accept the paper.
Author Feedback
We thank the reviewers for their feedback and recognition of our work. All reviewers acknowledged the relevance and methodological contribution: R3 noted our approach is “well-designed, fully-automated,” with “strong performance”; R1 highlighted the novelty of integrating imaging, clinical, and textual data via contrastive learning; and R2 appreciated the “well-defined problem” and “comprehensive pipeline” with effective validation. Below we address the key concerns raised by the reviewers.

G1: Task definition, dataset partitioning, and generalization scope (R1Q2, R2Q4, R1Q10). The public (LPCD) and private datasets were kept separate, with independent 5-fold CV for each. Label differences required treating them as distinct 3-class tasks. Data augmentation was applied to address class imbalance. Focusing on histological subtypes (e.g., invasive vs. in situ) is clinically critical for surgical/therapeutic decisions, so our study uses pathologically confirmed cases to aid precision treatment. Generalizability reflects robustness across datasets, not direct model transfer between institutions.

G2: Image cropping, lesion handling, and network structure (R1Q3, R2, R3Q1). The 32mm³ crop size aligns with clinical standards (>30mm = mass) and dataset statistics (mean: 23.69mm; 75th percentile: 29mm). A 15mm margin was heuristically chosen for context without segmentation. CTs were resampled to 1mm³. Multi-lesion cases were processed independently. For architecture details: text used the CLIP encoder; images used ResNet’s last 4 layers (64–512 channels), projected to 256D for fusion.

G3: Terminology, title, and presentation issues (R1Q1, R1Q7-1, R2, R3Q3). We have revised the manuscript for clarity and consistency. The final version will include an updated title to clarify subtyping, defined abbreviations (e.g., CM-DAC), corrected clinical terms (e.g., CT as imaging, not the gold standard), removed estimated statistics, standardized visuals (e.g., table colors), aligned module names, and explicitly listed baselines.

G4: Evaluation metrics and statistical analysis (R1Q6, R1Q7-2, R3Q2). Evaluation metrics were introduced in the Results section; we will move them to the Methods section for clarity. Mean ± SD across folds will be reported in the final version. We conducted t-tests on key results, confirming statistical significance. Multi-class evaluation is performed using macro averaging.

G5: Detection module design (R1Q9, R2Q1). We employed YOLOv11x for detection. Initially, multi-class detection underperformed due to class imbalance and subtype similarity, so we switched to single-class lesion detection (higher mAP). This localization module feeds our fusion classifier, which combines imaging and clinical features for subtype discrimination.

G6: Clinical normalization and label expansion (R1Q5, R2Q2, R2Q6). All numeric clinical variables were normalized using min-max scaling. To enhance semantic alignment between imaging features and class labels, we constructed radiology-inspired textual templates (e.g., “A pulmonary nodule showing histologic features of xx”) for each histological class. These were encoded using the CLIP text encoder. This approach improves semantic richness and fine-grained discrimination in the contrastive learning framework.

G7: Metric anomalies and class imbalance (R1Q8, R2Q3). In LPCD, the majority class (71.01%) dominated predictions: at a 0.5 threshold, both ResNet18 and ViT predicted mostly this class, yielding identical accuracy/precision/recall/F1. However, AUC (threshold-independent) revealed model differences, e.g., Model B ([0.6, 0.7, 0.8, 0.55]) outperforms Model A ([0.6, 0.6, 0.7, 0.55]) in ranking despite identical binary outputs. The recall/F1 drop despite AUC gains may stem from confidence shifts due to imbalance.

R1Q4: Temperature parameter τ in the contrastive loss. We adopted τ = 0.07 (standard in contrastive learning). A sensitivity analysis showed consistently optimal performance across the [0.01, 0.1] range. This setting will be stated explicitly in the final version.
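To make the G7 argument concrete, a small worked example with scikit-learn. The score vectors are the ones quoted in the rebuttal; the ground-truth labels are an assumption, since the rebuttal lists only the scores:

# Identical thresholded predictions, different AUCs (illustrating G7).
from sklearn.metrics import roc_auc_score

y_true  = [0, 1, 1, 0]              # hypothetical labels, not given in the rebuttal
model_a = [0.6, 0.6, 0.7, 0.55]     # scores quoted for Model A
model_b = [0.6, 0.7, 0.8, 0.55]     # scores quoted for Model B

# At a 0.5 threshold both models predict the positive class for every case,
# so accuracy/precision/recall/F1 are identical ...
preds_a = [int(s >= 0.5) for s in model_a]
preds_b = [int(s >= 0.5) for s in model_b]
assert preds_a == preds_b == [1, 1, 1, 1]

# ... but the threshold-independent AUC separates them:
print(roc_auc_score(y_true, model_a))  # 0.875 (one tied positive/negative pair)
print(roc_auc_score(y_true, model_b))  # 1.0   (all positives ranked above all negatives)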
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Some important details are missing and explaining them would clarify the solution understanding
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers find the proposed method effective and novel. However, the presentation needs to be articulated in a camera-ready submission.