Abstract

The manifestation of symptoms associated with lung diseases can vary at different depths for individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformers (ViTs) have shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated on sufficiently large 2D datasets, and they easily overfit on small medical image datasets. To address this limitation, we propose a Diffusion-based 3D Vision Transformer (Diff3Dformer), which utilizes the latent space of the Diffusion model to form the slice sequence for 3D analysis and incorporates clustering attention into ViT to aggregate repetitive information within 3D CT scans, thereby harnessing the power of the advanced transformer in 3D classification tasks on small datasets. Our method exhibits improved performance on two small datasets of 3D lung CT scans at different scales, surpassing the state-of-the-art 3D methods and other transformer-based approaches that emerged during the COVID-19 pandemic, demonstrating its robust and superior performance across different scales of data. Experimental results underscore the superiority of our proposed method, indicating its potential for enhancing medical image classification tasks in real-world scenarios. The code will be publicly available at https://github.com/ayanglab/Diff3Dformer.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0750_paper.pdf

SharedIt Link: https://rdcu.be/dVZiL

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_47

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0750_supp.pdf

Link to the Code Repository

https://github.com/ayanglab/Diff3Dformer

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Jin_Diff3Dformer_MICCAI2024,
        author = { Jin, Zihao and Fang, Yingying and Huang, Jiahao and Xu, Caiwen and Walsh, Simon and Yang, Guang},
        title = { { Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {504 -- 513}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a transformer-based architecture to classify 3D CT scans. Two 3D CT tasks are included in this work. The first is to discriminate novel coronavirus pneumonia (NCP) from common pneumonia (CP) on the CC-CCII dataset. The second is to predict the 1-year mortality of fibrotic lung disease on a public and an in-house dataset. The work emphasizes that it applies a diffusion auto-encoder for feature extraction, a clustering ViT model for improved feature grouping, and an interpretable slice fusion module to generate the final patient-level decision.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The writing and organization of the paper are relatively good. The authors proposed three modules to improve the performance of the classification network for 3D CT scans.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors did not aim to solve one clinical problem but two different ones, which indicates that the focus of this work is on the novelty of the proposed method regardless of the clinical problems. However, the two main proposed modules, diffusion-based representation and clustering ViT, are not original but borrowed from other literature. The effectiveness of the proposed fusion module is not reflected in the experimental results. As a result, the innovativeness of the paper is not very high.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Digging deeper to improve the novelty of this paper, or explaining in detail why the proposed modules are original innovations rather than incremental ones, would help improve this work.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The two main proposed modules in the method, diffusion-based representation learning and clustering ViT, are not original but borrowed from other literature. The effectiveness of the proposed fusion module is also not reflected in the experimental results. As a result, the innovativeness of the paper is not very high.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ response mitigates my original concern on novelty about this paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes to use Diffusion (DDPM) for input feature refinement and then feed the refined features into a clustering ViT for attention-based patient risk classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Clear pipeline design in the overall framework.
    2. The paper is very well written.
    3. It is novel to use Diffusion for enhanced feature extraction prior to conducting classification. The results somewhat validate its effectiveness, with marginally improved scores.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The presented results on the CC-CCII dataset are not that promising. Diff3Dformer only achieves results comparable to its benchmarks. Additionally, the improvement on the FLD dataset is also not significant, as shown in the ablation study. It is unclear to me whether such a trivial statistical difference is due to insufficient training of the benchmarks or to the Diffusion feature encoder enhancement.
    2. It is unclear why and how diffusion-enhanced features contribute to the overall classification performance of the model. Some analysis of the features predicted by the diffusion encoder might help to convince the audience.
    3. From my personal experience, Diffusion is also not good at handling small datasets. Yet the authors propose to use Diffusion to enhance feature extraction for small medical imaging datasets; I am a little uncertain about this.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is overall very well written, with clear organization and description. The pipeline design is concise and logical. The targeted application would also have strong clinical impact if the proposed pipeline can be deployed. It is interesting to me to use diffusion for input feature preprocessing. However, it is not very persuasive to me to use diffusion to enhance feature extraction on a small dataset since, from my understanding, diffusion also needs a large amount of training data to become robust. A small dataset will still leave a lot of residual noise in the predicted images. This might be why the authors’ improvement in the results section is fairly trivial in comparison to the benchmarks. I am interested in seeing whether a big improvement can be achieved if you apply your pipeline to a large classification task.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel in terms of the idea, but the results are not persuasive.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    2D training to enlarge the dataset is a common but useful practice. It addresses my concern about the small dataset.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a deep learning method to improve lung disease classification in 3D using CT scans.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Technically, the paper is well written. The introduction, literature review, methods, and results are well discussed. The work can be reproduced from the code that the authors have shared (if any). Clinically, lung disease classification is a challenging task. This paper proposes a method to enhance CT images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The results discussed are purely objective in nature. There are no results that can be seen in either a 3D view or a 2D MPR. There is no discussion of subjective validation. Even though recent literature is referenced, it is only described. In a few places, multiple references are cited together, e.g. [8, 9, 6, 19].

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Imaging details should be mentioned when the dataset is described.

    2. Provide a critical review of the literature.

    3. Subjective validation is not described. How were these results received by clinicians?

    4. Did you do any data augmentation to increase the samples?

    5. The word 3D is extensively used. Is it just because it is 3D in nature? If so, how is the 3D volume reconstructed from the 2D axial slices?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technically the paper is convincing with good results. However, the subjective validation can be included to make it more effective.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper investigated CT-based 3D classification by combining a diffusion encoder, and a clustering ViT model. The key idea was to extract meaningful representations via diffusion encoder, and then reduce the computational cost and improve the interpretability via clustering.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is well organized, the idea is very interesting, and the empirical results on two public benchmarks showed improvements over baselines.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Minor issues include some unclear explanations and results presentations.

    • In eq(2), I guess the $R$ is the patient-level score? To me, it ignores the varying importance of individual slices within a cluster.

    • The resolution of Supplementary Figures 1 and 2 is too low. The authors could try vector graphics.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • In eq(2), I guess the $R$ is the patient-level score? To me, it ignores the varying importance of individual slices within a cluster.

    • The resolution of Supplementary Figures 1 and 2 is too low. The authors could try vector graphics.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • CT-based 3D classification is of great interest to the community
    • The idea of using a diffusion model encoder to extract representations and a clustering vision transformer to achieve interpretability is quite novel
    • Training AI on small-scale datasets is also of great interest to the community
    • The empirical results validate the proposed ideas and outperform those of the baselines
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all reviewers for their comments. We are encouraged that the reviewers found our work novel (R3,R4), clearly organised (R1,R3,R4,R5), and clinically significant (R3,R4,R5). Below we reply to their other suggestions.

To R1 (Novelty) We will highlight novelties in the revision, which include:

  1. Discovering how Cluster-ViT can mitigate overfitting and effectively manage small datasets.
  2. Introducing a semantically meaningful Diffusion representation for downstream tasks.
  3. Proposing a novel pipeline that enables data-intensive diffusion for small-scale 3D analysis using Cluster-ViT.
  4. A fusion method that identifies the most important clusters to the model’s decision-making process, addressing unexplainable decisions in most existing 3D models.
  5. Given the common challenge of small datasets in 3D medical imaging, our target problem and method hold clinical significance and novelty.

(Fusion) As our primary aim was to enhance the explainability of predictions, we prioritised presenting how the fusion module achieves this, rather than comparing it to other unexplainable methods, due to page limits. From the fusion scheme, the contribution of each cluster for each patient’s final prediction could be clearly visualised (Appendix Fig.1). Besides, it could find the most significant clusters among patients (Appendix Fig.3). We will expand the discussion to further show the efficacy of the fusion.

To R3 (Performance) While CCII and FLD are both small datasets, FLD is extremely small, with 500 training cases compared to CCII’s 2500. AG-Swin Transformer was comparable on CCII but its performance fell dramatically on FLD. Our method stayed robust, showing better sensitivity and specificity. We will denote performance values in Fig.2 for easier observation. (Diffusion) The consistent superiority of the diffusion representation in both sensitivity and specificity across ablation studies highlighted its efficacy on small datasets. However, the size difference between the datasets suggested that diffusion’s advantages over contrastive learning were possibly limited by small datasets. We will explore whether the diffusion representation can yield greater improvements on larger datasets and perform a t-SNE comparison of different representations in our future work. (Training) Diffusion needs extensive training data. In our pipeline, it is used to extract 2D representations, and it is trained on a large 2D dataset of 93967 slices generated from a small 3D dataset. We will detail these in the revision.
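The (Training) point above, that a small set of 3D scans yields a much larger set of 2D training slices, can be illustrated with a minimal sketch. This is not the authors' code: the function name `volumes_to_slice_dataset`, the toy volumes, and the patient identifiers are all hypothetical, and the slice counts are toy values rather than the 93967 slices reported in the rebuttal.

```python
from typing import Dict, List, Sequence, Tuple

# A 2D slice is represented here as a nested sequence of pixel values (assumption).
Slice2D = Sequence[Sequence[float]]


def volumes_to_slice_dataset(
    volumes: Dict[str, Sequence[Slice2D]],
) -> List[Tuple[str, int, Slice2D]]:
    """Flatten per-patient 3D volumes into (patient_id, slice_index, slice) samples.

    Each 3D scan is treated as an ordered sequence of 2D axial slices, so a
    2D encoder can be trained on every slice as a separate sample.
    """
    dataset: List[Tuple[str, int, Slice2D]] = []
    for patient_id, volume in volumes.items():
        for slice_index, slice_2d in enumerate(volume):
            dataset.append((patient_id, slice_index, slice_2d))
    return dataset


# Hypothetical toy data: two scans with 3 and 5 slices of size 2x2.
toy_volumes = {
    "patient_a": [[[0.0, 0.1], [0.2, 0.3]] for _ in range(3)],
    "patient_b": [[[0.5, 0.5], [0.5, 0.5]] for _ in range(5)],
}
slice_dataset = volumes_to_slice_dataset(toy_volumes)
print(len(toy_volumes), "scans ->", len(slice_dataset), "slices")
```

Keeping the patient identifier and slice index with each sample makes it possible to reassemble the per-slice representations into an ordered slice sequence for the downstream 3D classifier.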

To R4 Cluster-ViT computes a centroid score over all the slices in the same cluster and revises each slice’s score using the centroid value to reduce computational complexity, which leads to similar slice scores within a cluster. Taking the mean value for fusion proved to improve fusion performance during our design. We will enhance the image quality in the Appendix.
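The mean-based fusion described in the reply to R4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's eq. (2): the function names `cluster_mean_scores` and `fuse_patient_score`, the toy scores, and in particular the choice to combine cluster means with an unweighted average are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cluster_mean_scores(
    slice_scores: Sequence[float], cluster_ids: Sequence[int]
) -> Dict[int, float]:
    """Average the slice-level scores within each cluster."""
    if len(slice_scores) != len(cluster_ids):
        raise ValueError("slice_scores and cluster_ids must have the same length")
    grouped: Dict[int, List[float]] = defaultdict(list)
    for score, cluster in zip(slice_scores, cluster_ids):
        grouped[cluster].append(score)
    return {cluster: sum(vals) / len(vals) for cluster, vals in grouped.items()}


def fuse_patient_score(
    slice_scores: Sequence[float], cluster_ids: Sequence[int]
) -> float:
    """Combine per-cluster means into one patient-level score.

    Assumption: clusters are weighted equally; the paper's actual fusion
    rule may differ.
    """
    means = cluster_mean_scores(slice_scores, cluster_ids)
    if not means:
        raise ValueError("at least one slice is required")
    return sum(means.values()) / len(means)


# Hypothetical toy input: five slices assigned to two clusters.
toy_scores = [0.2, 0.4, 0.9, 0.7, 0.8]
toy_clusters = [0, 0, 1, 1, 1]
cluster_means = cluster_mean_scores(toy_scores, toy_clusters)
patient_score = fuse_patient_score(toy_scores, toy_clusters)
print(cluster_means, patient_score)
```

Because the per-cluster means are kept as an intermediate result, the contribution of each cluster to the patient-level score can be inspected directly, which is the kind of explainability the rebuttal attributes to the fusion module.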

To R5 (Subjective validation) Our model excels in identifying key clusters for decisions (Appendix Fig.1) and visualising significant features (Appendix Fig.3). While not directly tested clinically, it allows developers to verify with experts whether the model’s high-performance predictions rely on biased, recognised, or novel features, which is not feasible with unexplainable models. This adds significant value for further clinical AI evaluation and presents a more suitable AI for disease research. We will add discussions on the practical benefits and relevance for clinicians and researchers. (Literature) We will strengthen our critical review. (Data) We did not use data augmentation for any method except the 2.5D methods, which inherently benefited from resampling augmentation. We generated eight inputs per patient without replication. We will clarify this in Section 3.2 along with the image details such as slice numbers and make the input samples available with the published codes. (3D) A 3D CT scan was aggregated from a sequence of 2D slices. We referred to methods that analyse all slices from a scan as 3D methods.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have checked the reviews of this paper and there are no issues.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    I have checked the reviews of this paper and there are no issues.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


