Abstract

Diagnosis of lymph node (LN) metastasis in CT scans is an essential yet challenging task for esophageal cancer staging and treatment planning. Deep learning methods can potentially address this issue by learning from large-scale, accurately labeled data. However, even for highly experienced physicians, only a portion of LN metastases can be accurately determined in CT. Previous work conducted supervised training with a relatively small number of annotated LNs and achieved limited performance. In our work, we leverage the teacher-student semi-supervised paradigm and explore the potential of a large amount of unlabeled LNs for performance improvement. For unlabeled LNs, pathology reports can indicate the presence of LN metastases within the lymph node station (LNS). Hence, we propose a pathology-guided label sharpening loss that combines the metastasis status of the LNS from pathology reports with predictions of the teacher model. This combination assigns pseudo labels to LNs with high confidence, and the student model is then updated for better performance. In addition, to improve the initial performance of the teacher model, we propose a two-stream multi-scale feature fusion deep network that effectively fuses local and global LN characteristics to learn from labeled LNs. Extensive four-fold cross-validation is conducted on a cohort of 1052 esophageal cancer patients with corresponding pathology reports and 9961 LNs (3635 labeled and 6326 unlabeled). The results demonstrate that our proposed method markedly outperforms previous state-of-the-art methods by 2.95% (from 90.23% to 93.18%) in terms of the area under the receiver operating characteristic curve (AUROC) on this challenging task.
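To make the pathology-guided label sharpening idea concrete, here is a minimal, hypothetical sketch of the pseudo-labeling rule it describes. The function name, the confidence threshold `tau`, and the exact decision logic are illustrative assumptions, not the authors' implementation: a pathology-negative station implies its LNs are benign, while in a pathology-positive station only confident teacher predictions are sharpened into hard labels.

```python
def sharpen_labels(teacher_probs, station_positive, tau=0.9):
    """Assign hard pseudo labels to unlabeled LNs when the teacher is
    confident AND the label is consistent with the pathology report of
    the LN's station. LNs that remain ambiguous are skipped, i.e. they
    contribute nothing to the student's supervised loss.

    teacher_probs   : per-LN metastasis probability from the teacher model
    station_positive: per-LN flag, True if the pathology report indicates
                      metastasis somewhere in that LN's station
    Returns a list of (ln_index, pseudo_label) pairs.
    """
    pseudo = []
    for i, (p, pos_station) in enumerate(zip(teacher_probs, station_positive)):
        if not pos_station:
            # Report says the whole station is metastasis-free, so every
            # LN inside it can be labeled benign regardless of the teacher.
            pseudo.append((i, 0))
        elif p >= tau:
            # Station contains metastasis and the teacher is confident.
            pseudo.append((i, 1))
        elif p <= 1.0 - tau:
            # Confidently benign even inside a positive station.
            pseudo.append((i, 0))
        # Otherwise: low confidence, leave the LN unlabeled.
    return pseudo
```

In use, these pseudo labels would feed a standard cross-entropy term on the student model, alongside the usual supervised loss on the annotated LNs.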

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0793_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_Semisupervised_MICCAI2024,
        author = { Li, Haoshen and Wang, Yirui and Zhu, Jie and Guo, Dazhou and Yu, Qinji and Yan, Ke and Lu, Le and Ye, Xianghua and Zhang, Li and Wang, Qifeng and Jin, Dakai},
        title = { { Semi-supervised Lymph Node Metastasis Classification with Pathology-guided Label Sharpening and Two-streamed Multi-scale Fusion } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors develop a complete semi-supervised classification framework with a teacher-student architecture, pseudo-labeling, dual-stream, and 2.5D designs to solve the task of metastatic lymph node classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed method is compared with 3D supervised models, 2.5D supervised models, and semi-supervised methods, and shows very competitive results.

    2) The authors conducted an ablation study to verify the effectiveness of each component of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) My major concern is novelty. The proposed method lacks novelty. Even though it shows competitive performance, it is built from a set of existing techniques. Each component of the proposed method does work and shows its effectiveness, but none of these components is new. 1-1) For example, in Table 2, the 2.5D design is widely used in the task of Universal Lesion Detection (ULD) [i], the mean teacher (MT) is commonly adopted in semi-supervised learning, and multi-scale fusion (MS) is widely used in the Feature Pyramid Network (FPN). All of the above designs are very mature techniques. [i] Yan, Ke, et al. “MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation.” MICCAI 2019

    1-2) The pathology-guided label sharpening loss (PGLS) is a simple supervised loss based on pseudo labels. The pseudo labels are the high-confidence predictions of the teacher model that are consistent with the pathology report. PGLS is a simple and useful trick, but it is far from a novel contribution.

    1-3) The two-stream network (TS) is also not new. Nine years ago, prior work [ii] published at CVPR’15 adopted dual-branch networks that take a global view and a local view as inputs and fuse their outputs at the end. [ii] Zhao, Rui, et al. “Saliency detection by multi-context deep learning.” CVPR. 2015.

    2) In Table 1, what is the meaning of ‘-2.95%’ and ‘+3.36%’? What do ‘-’ and ‘+’ denote? 3) In Table 1, why use a lightweight backbone like MobileNetv3? Why not use a stronger backbone with higher capacity, like ResNet-101 or EfficientNet? 4) What data augmentation strategies are used in the proposed method? Is the local view in the two-stream network used as data augmentation? If a stronger backbone were used and local views were included as augmented samples, the two-stream design might be unnecessary and bring less improvement. 5) In Table 2, the implementation details behind the first-row results are missing and unclear. Without the two-stream design, is the single-stream network based on global or local inputs?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please check the weakness of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method has no novelty. Every part of the proposed method is an existing, mature technique. Without a doubt, the paper cannot be accepted.

    If the source code can be released, the work could be a good workshop paper or a technical report.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The proposed method has no technical novelty. 1) The authors claim that they solve an important task. However, an important task may rest on a new dataset; it does not imply a new technique. 2) The authors claim that the contribution is the overall framework. As I have already pointed out, every part of the proposed method is an existing technique; the authors have simply combined them. This manuscript could be a good technical report showing how to apply existing techniques to solve a task, but it is definitely not an academic paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a semi-supervised learning (SSL) method for lymph node (LN) metastasis classification from CT imaging. The proposed method uses a 2.5D network that incorporates global and local image features, plus a student-teacher knowledge distillation framework to learn from unlabeled data. The authors use a “pathology-guided label sharpening” procedure that uses prior knowledge from the pathology report to more intelligently generate pseudolabels for SSL.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The motivation and related work are clearly described. The authors effectively convey why this is an important and challenging problem that appears to be relatively understudied.
    • The proposed method is interesting and, critically, well-suited to the application domain. For example, the method leverages domain knowledge in the form of LN locations found in the pathology reports in order to constrain the problem and improve the quality of pseudolabels.
    • The writing is clear throughout with a logical flow from passage to passage. Also, the layout and illustrations/tables are effective.
    • The experiments appear to be soundly conducted, with thorough ablation studies and comparisons to relevant prior work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There is limited methodological novelty, but this is not a concern to me: the paper poses a unique problem and solves it well. For example, there have been many efforts in SSL for medical imaging [1] (many leveraging knowledge distillation) and many efforts leveraging multi-scale learning for medical imaging [2].

    [1] Jiao, Rushi, et al. “Learning with limited annotations: a survey on deep semi-supervised learning for medical image segmentation.” Computers in Biology and Medicine (2023): 107840. [2] Elizar, Elizar, et al. “A review on multiscale-deep-learning applications.” Sensors 22.19 (2022): 7384.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I feel that the method was described in sufficient detail to be reproduced, but an open-source code repository would be helpful.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor comments/questions:

    • I appreciate that the title is specific, but I would consider shortening it.
    • Why exactly is it called label “sharpening”? Is this because you’re using a threshold to generate “hard” (binary) pseudolabels as opposed to soft ones (probabilities)?
    • I would consider citing [1] and/or [2] when explaining the loss function in Eq 1. These prior studies explore the impact of “auxiliary supervision” with additional loss terms computed from intermediate (modality/stream-specific) features.
    • Out of curiosity, why was the MobileNetv3 architecture chosen given that one might argue there are more competitive (and comparably lightweight) alternatives available nowadays? Could the method be further improved with a stronger CNN backbone such as one from the ConvNeXt [3] family?

    [1] Holste, Gregory, et al. “Improved Multimodal Fusion for Small Datasets with Auxiliary Supervision.” 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023. [2] Kawahara, Jeremy, et al. “Seven-point checklist and skin lesion classification using multitask multimodal neural nets.” IEEE journal of biomedical and health informatics 23.2 (2018): 538-546. [3] Liu, Zhuang, et al. “A convnet for the 2020s.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a unique, challenging, and important clinical problem as well as an effective, interesting solution. The experiments are thorough and sound, and the writing is clear throughout. This is perhaps not an extremely “ambitious” submission (i.e., it may not have widespread impact), but I find very few flaws with this paper and appreciate the care that the authors took in conducting and presenting this study.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    As originally indicated, I felt that the submission should be accepted independent of the rebuttal (though the rebuttal addressed a couple of my minor concerns). The paper is interesting and sound, so I am happy to recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    This paper presents a semi-supervised learning framework for the detection of lymph-node metastasis of esophageal cancer. The method makes use of pseudo-labeling and a teacher-student setup and is shown to outperform state-of-the-art methods on a dataset of almost 10K lymph nodes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a generally well written and well executed work that proposes a novel approach to LN detection. The method is shown to outperform state-of-the-art techniques.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although this is generally a good paper, a point of improvement could be the evaluation. I would recommend omitting the accuracy @75 sensitivity / specificity, as these are difficult to interpret if we do not know the class balance. The other metrics (sensitivity @75 and specificity @75) should be enough.

    Please also elaborate on the unit that is being evaluated here. My understanding is that a region of interest containing multiple LNs (referred to as a station) is extracted from the CT. Are the metrics computed at the ‘station’ level or the LN level? I assume the former, but it is not immediately clear from the paper.

    It would also be good to include some statistical analysis. Are the improvements meaningful?

    To further improve the evaluation section, the authors could consider performing one or two experiments that provide some insight into what errors each method makes and try to relate that to the algorithmic design of the methods. This is done to some extent (“This indicates that both size and context information are helpful for LN metastasis classification, and aggregated multi-scale features also provide supportive information”) but could be backed up further with error analysis or subgroup analysis.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Except for the above-mentioned suggestions, I have no further feedback.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this is a strong well written paper that presents a novel application. The work is generally well executed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My comments have been sufficiently addressed. I would like to thank the authors for their effort.




Author Feedback

We thank all reviewers for their comments, especially for noting that our paper is well and clearly written; proposes a novel, well-suited method that solves a challenging and important clinical task; and demonstrates its effectiveness in sound and thorough experiments (both ablations and comparisons to other work).

Q1: Novelty (R1). Our main contribution and novelty are threefold: 1) Diagnosis of lymph node (LN) metastasis is an essential yet challenging and understudied task in esophageal cancer treatment. We work on a dataset of 1052 patients with 9961 LNs (the largest dataset for esophageal LN classification to date) and achieve substantially improved performance. 2) Our main methodological contribution is the proposed overall framework. Compared to previous work with limited metastasis-labeled LNs, we propose a semi-supervised model with a label-sharpening strategy that utilizes the LN-station’s metastasis status in the pathology report, which leverages more unlabeled LNs to improve performance. This is an intuitive way to incorporate clinical information into LN classification that has not been explored before. 3) Our idea of two-stream multi-scale feature fusion comes from the clinicians' assessment process, where both local (intensity, texture) and global (size, shape) features are important for diagnosis. It also overcomes the drawback that previous single networks are prone to bias from LN size. This is different from the two-branch method of [2] (cited by R1), which is designed to incorporate different scopes of context for object delineation. Given the complexity of medical imaging tasks, we view a new framework that provides a means to significantly enhance performance as just as valuable as new networks or algorithms.

Q2: Backbone in supervised learning (R1, R3). Our method is designed to be independent of the choice of backbone network; hence, we initially chose a lightweight MobileNet for computational efficiency. We replaced MobileNetv3 with ResNet-101, as suggested by R1, for both single- and two-stream settings. Results show a ~1% improvement in both settings (single: 0.8788 -> 0.8910; two-stream: 0.9068 -> 0.9162), which is consistent with our assumption that the proposed method is independent of the backbone choice. We will add comprehensive experiments using other backbones in the future.

Q3: Statistical analysis and subgroup error analysis (R5). (1) We employ the DeLong test to calculate p-values for the results in Table 1 (comparison to others) and Table 2 (ablation). E.g., in Table 2, the p-value between the mean teacher result (row 6) and our SSL result (row 7) is 0.004, which is statistically significant. We will add the statistical analysis in the final version. (2) Subgroup analysis: Using RECIST criteria, we selected the smallest 25% of malignant LNs as a subgroup. The global-stream network (which preserves LN size) has 0.702 accuracy, while the proposed two-stream method yields 0.808 accuracy. This result demonstrates the effectiveness of the two-stream design when classifying LNs that are small yet malignant. We will add more subgroup analysis in the final version.
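For readers unfamiliar with the DeLong test mentioned above, a minimal sketch of the fast DeLong procedure for comparing two correlated AUROCs (in the style of Sun & Xu, 2014) might look like the following. This is an illustration under standard assumptions, not the authors' code; all names are hypothetical.

```python
import math
import numpy as np

def midrank(x):
    """Midranks of x (1-based ranks, averaged over ties)."""
    order = np.argsort(x)
    z = x[order]
    n = len(x)
    t = np.zeros(n)
    i = 0
    while i < n:
        j = i
        while j < n and z[j] == z[i]:
            j += 1
        t[i:j] = 0.5 * (i + j - 1) + 1  # average of 1-based ranks i+1..j
        i = j
    ranks = np.empty(n)
    ranks[order] = t
    return ranks

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test comparing the AUROCs of two score vectors
    evaluated on the same binary labels. Returns (auc_a, auc_b, p_value)."""
    y = np.asarray(y_true)
    order = np.argsort(-y)                       # positives first
    m = int(y.sum())                             # number of positives
    n = len(y) - m                               # number of negatives
    preds = np.vstack([scores_a, scores_b])[:, order]
    tx = np.array([midrank(r[:m]) for r in preds])   # ranks among positives
    ty = np.array([midrank(r[m:]) for r in preds])   # ranks among negatives
    tz = np.array([midrank(r) for r in preds])       # ranks among all samples
    aucs = tz[:, :m].sum(axis=1) / (m * n) - (m + 1.0) / (2.0 * n)
    v01 = (tz[:, :m] - tx) / n                   # structural components (pos.)
    v10 = 1.0 - (tz[:, m:] - ty) / m             # structural components (neg.)
    cov = np.cov(v01) / m + np.cov(v10) / n      # covariance of the two AUCs
    var = cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1]
    z = (aucs[0] - aucs[1]) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal p-value
    return aucs[0], aucs[1], p
```

A small p-value (e.g. below 0.05, as with the 0.004 reported above) indicates that the difference between the two AUROCs is unlikely under the null hypothesis of equal discrimination.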

Q4: Other issues of R1. (1) “-” and “+” in Table 1: There is only “+” in Table 1, which represents the improvement of our method over the 2nd-best performing one. (2) 1st row in Table 2: We use the global input (which preserves LN size) as the default single-stream comparison. (3) Data augmentation: We use extensive data augmentation similar to nnUNet for both streams (given its high performance in medical imaging). Pure data augmentation cannot fully substitute for the two streams because of the large LN scale differences (LN size can differ by 10 times, ranging from 2-3mm to 20-30mm).

Q5: Other issues of R3, R5. (1) Name of label “sharpening”: Yes, your understanding is correct. (2) Two suggested refs: We will include these refs in the final version. (3) Evaluation metrics: We agree and will omit accuracy in the final version. (4) Unit of evaluation: Metrics are computed at the LN level. For each LN instance, we crop an ROI centered on that LN in CT.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    One reviewer has issues with the novelty of the method. But overall reviewers think this work addresses an important question with promising results.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


