Abstract

Deep learning models based on medical images have made significant strides in predicting treatment outcomes. However, previous methods have primarily concentrated on single time-point images, neglecting the temporal dynamics and changes inherent in longitudinal medical images. Thus, we propose a Transformer-based longitudinal image analysis framework (LOMIA-T) to contrast and fuse latent representations from pre- and post-treatment medical images for predicting treatment response. Specifically, we first design a treatment response-based contrastive loss to enhance latent representation by discerning evolutionary processes across various disease stages. Then, we integrate latent representations from pre- and post-treatment CT images using a cross-attention mechanism. Considering the redundancy in the dual-branch output features induced by the cross-attention mechanism, we propose a clinically interpretable feature fusion strategy to predict treatment response. Experimentally, the proposed framework outperforms several state-of-the-art longitudinal image analysis methods on an in-house Esophageal Squamous Cell Carcinoma (ESCC) dataset, encompassing 170 pre- and post-treatment contrast-enhanced CT image pairs from ESCC patients who underwent neoadjuvant chemoradiotherapy. Ablation experiments validate the efficacy of the proposed treatment response-based contrastive loss and feature fusion strategy. The code will be made available at https://github.com/syc19074115/LOMIA-T.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2565_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sun_LOMIAT_MICCAI2024,
        author = { Sun, Yuchen and Li, Kunwei and Chen, Duanduan and Hu, Yi and Zhang, Shuaitong},
        title = { { LOMIA-T: A Transformer-based LOngitudinal Medical Image Analysis framework for predicting treatment response of esophageal cancer } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The presented work introduces a modified vision transformer to process time series comprised of a pre- and a post-treatment image. In particular, the network consists of three newly introduced components: first, a Tumor Region Representation Network that extracts tokens from 3D medical images. Second, a loss that compares the extracted intermediate features from pre- and post-treatment images guided by the clinical outcomes. Third, a Deep Feature Fusion Module that fuses features from the pre- and post-treatment images via cross-attention. The proposed method is tested on a small in-house dataset of 170 patients with esophageal cancer and is shown to outperform three baseline methods. Additionally, the benefit of each of the introduced components is demonstrated in an ablation study.
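
To make the second component concrete, the following is a minimal sketch of one common margin-based contrastive loss applied to paired pre- and post-treatment features. It is not taken from the paper: the function name `response_contrastive_loss`, the Euclidean distance, and the assumption that responders should show a large feature change while non-responders show a small one are all illustrative choices, and the authors' exact formulation may differ.

```python
import numpy as np

def response_contrastive_loss(z_pre, z_post, responded, margin=0.5):
    """Margin-based contrastive loss over pre-/post-treatment feature pairs.

    z_pre, z_post: arrays of shape (batch, dim) with latent features.
    responded:     array of shape (batch,), 1 if the patient responded, else 0.
    Assumption (hypothetical): responders should show a large pre/post
    feature change, non-responders a small one.
    """
    z_pre = np.asarray(z_pre, dtype=float)
    z_post = np.asarray(z_post, dtype=float)
    y = np.asarray(responded, dtype=float)
    # Euclidean distance between paired pre- and post-treatment features.
    d = np.linalg.norm(z_pre - z_post, axis=1)
    # Non-responders: pull the pair together (penalise any distance).
    pull = (1.0 - y) * d ** 2
    # Responders: push the pair apart until the distance clears the margin.
    push = y * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pull + push))
```

The default `margin=0.5` mirrors the value m=0.5 discussed later in the reviews; whether m plays exactly this role in the paper's loss is an assumption.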

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper researches a topic of high scientific interest for the medical image computing community, the processing of time series of medical images.

    • The investigated scenario of making predictions from pre- and post-cancer-treatment images is clinically well motivated.

    • The authors provide a nice overview over existing methods for longitudinal image processing.

    • The methods and experiments sections were clearly structured and adequately illustrated.

    • I found the newly introduced Treatment response-based contrastive loss intuitive and interesting.

    • The authors conduct an extensive ablation study, benchmarking the benefit of each of their proposed network modules.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I found large parts of the proposed network architecture to be highly complex without adding meaningful algorithmic innovation compared to the large number of vision transformer architectures that have already been proposed for medical image processing (see Li, Jun, et al. “Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives.” Medical image analysis (2023): 102762 for an overview). The Tumor Region Representation Network is essentially a feature extractor, while the Deep Feature Fusion Module is a cross-attention module. Crucially, I fail to see how these network modules are “clinically interpretable” as claimed by the authors.

    • I have substantial concerns regarding the evaluation of the proposed method. The sheer number of architectural choices and tunable hyperparameters makes the study design prone to overfitting the proposed method. This issue is amplified as only a single, small, proprietary dataset is used for benchmarking.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • None of the baseline methods were originally developed or tested on the datasets that are used in the presented study, raising concerns about a fair comparison with respect to the available hyperparameter tuning budget for the proposed work and baselines.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • In order to address the aforementioned weakness regarding the overfitting of the proposed method, I strongly suggest that the authors benchmark their method on a publicly available dataset, ideally in the scope of a dedicated challenge.

    • Moreover, it is unclear to me how the three baseline methods were selected. Processing longitudinal time series with deep learning has been a topic of high scientific interest, and even the prediction of cancer treatment response from pre- and post-treatment images has been widely researched (e.g. references 1-3 of this work and many others). Ideally, the authors should include more baseline methods for comparison or at least provide a rationale for why they picked the current three.

    • In the caption of Figure 1, the authors write that T_post and T^cross_post are used, while T_pre and T^cross_pre are not. I fail to understand this statement as in the remainder of the manuscript it appears that both images are processed.

    • In Section 2.1, the authors write that the tumor bounding-box is expanded by 4 pixels in the axial plane to obtain the input images, most likely leading to variable-sized inputs. Later, the authors write that the input was 48x48x32 voxels large. They should clarify whether the input images were of standardized size or not.

    • I was initially unsure what the terms “hard-split” and “soft-split” meant and only understood them after reading on. I suggest properly introducing them at first appearance.
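
The input-size question raised above (an expanded bounding box versus a fixed input size) can be illustrated with a short sketch: crop the tumour box, expand it in the axial plane, then resample to a fixed grid. The function name `crop_and_resize`, the (z, y, x) axis order, the clamping at the volume border, and the nearest-neighbour resampling are assumptions made for illustration; the interpolation actually used by the authors is not stated in the reviews.

```python
import numpy as np

def crop_and_resize(volume, bbox, pad=4, out_shape=(32, 48, 48)):
    """Crop a tumour ROI from a (z, y, x) volume and resample it to out_shape.

    bbox: (z0, z1, y0, y1, x0, x1), half-open voxel indices of the tumour box.
    pad:  in-plane (y, x) expansion in voxels; the z-extent is left unchanged.
    """
    z0, z1, y0, y1, x0, x1 = bbox
    # Expand in the axial plane only, clamped to the volume bounds.
    y0, y1 = max(y0 - pad, 0), min(y1 + pad, volume.shape[1])
    x0, x1 = max(x0 - pad, 0), min(x1 + pad, volume.shape[2])
    roi = volume[z0:z1, y0:y1, x0:x1]
    # Nearest-neighbour resampling: choose a source index for each output voxel.
    idx = [
        np.minimum((np.arange(n_out) * n_in) // n_out, n_in - 1)
        for n_in, n_out in zip(roi.shape, out_shape)
    ]
    return roi[np.ix_(*idx)]
```

With this kind of preprocessing, variable-sized tumour boxes all map to the same tensor shape, which would reconcile the two statements the reviewer found inconsistent.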

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the presented work is well motivated and researches an interesting topic, I feel that the proposed method is overengineered and lacks algorithmic novelty. These concerns are further amplified as the method has only been benchmarked on a small proprietary dataset. Ultimately, these issues keep me from recommending acceptance of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    In their rebuttal, the authors have promised to include a second longitudinal dataset in their evaluation. Moreover, they have clarified several ambiguities regarding the method’s novelty and implementation. Ultimately, these changes address my main concerns, leading me to improve my score.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a Transformer-based longitudinal image analysis framework (LOMIA-T) to contrast and fuse latent representations from pre- and post-treatment medical images for predicting treatment response of Esophageal Squamous Cell Carcinoma (ESCC). A dataset of 170 pre- and post-treatment contrast-enhanced CT image pairs from ESCC patients who underwent neoadjuvant chemoradiotherapy was used.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces a way to exploit pre- and post-treatment CT images based on a treatment response-based contrastive loss. This enhances the ability of the tumor region representation network to discern feature disparities indicative of treatment effects between pre- and post-treatment CT scans.

    Results underscore the importance of longitudinal data for improving predictive performance in radiomics approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Compared with the other longitudinal methods, there is an improvement, but not an outstanding one across all methods (see Tables 1 and 2). Adding different steps or strategies may bring value, but the contribution might require a better justification.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code will be released via a github link

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    A clearer justification of the proposed architecture (Fig. 1) would help, explaining the authors’ methodological contribution with respect to the existing literature. Is it the whole architecture that improves the method?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the proposed methods can be sound for exploiting longitudinal data, there is a huge number of papers addressing the same issue. The paper is interesting, but highlighting the contributions and the impact of each of the improvements with respect to existing papers would help to improve it.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    To utilize the temporal information in longitudinal medical images, this paper uses a token-to-token transformer with time/position embedding to extract features from pre- and post-treatment CT images. Then the pre- and post-treatment features are integrated based on a cross-attention mechanism and a strategy to remove the redundancy, for predicting treatment response of esophageal cancer.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a novel model that combines feature contrast and fusion techniques for predicting the treatment response of esophageal cancer. The paper provides clear justifications for their design choices in the model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiment is conducted using only a single dataset, which may limit the evaluation of the algorithm’s generalizability and reproducibility.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The reproducibility of this work is relatively high if they will release the source code upon acceptance as they claimed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It would be beneficial if the paper could provide an evaluation of the results obtained from nnU-Net.
    2. Figure 1.D appears to contain minor errors and seems inconsistent with Equation (4). Additionally, in Fig.1(C), the authors should clarify the meaning of ‘k=7, s=4’.
    3. The paper does not discuss the selection of hyper-parameters. For instance, the rationale behind choosing m=0.5 is not explained.
    4. The description of the dataset is not sufficient. The authors should specify the number of samples for the pCR and non-pCR groups, respectively.
    5. The paper would benefit from a more in-depth discussion about the impact of time embedding for longitudinal image analysis.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach proposed in this paper is innovative, and the authors present encouraging results concerning the contribution of temporal information.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have effectively addressed the majority of the issues raised in the reviews.




Author Feedback

  1. A single proprietary dataset for benchmarking (Reviewers #4/5). The paper validated the proposed LOMIA-T on 170 longitudinal contrast-enhanced CT image pairs of EC patients from two hospitals, all of whom underwent neoadjuvant chemoradiotherapy. Longitudinal CT images are difficult to collect, and we have collected as much data as possible. To enhance model stability, we pre-trained the network using an instance discrimination task. We appreciate the reviewers’ suggestion of using a publicly available dataset for benchmarking. After submission, we validated the proposed LOMIA-T on a public longitudinal dataset, the Osteoarthritis Initiative, which includes longitudinal radiological (X-ray and MR) images from 4796 patients over a 9-year follow-up. In accordance with the rebuttal guidelines, we do not show the new experimental results here. We are also working on updating LOMIA-T with a gated attention unit.
  2. The lack of algorithmic novelty (Reviewer #5). The algorithmic novelty lies in the treatment response-based contrastive loss and the fusion strategy. Reviewers #1/4/5 found the proposed contrastive loss intuitive and interesting. Common feature fusion strategies include concatenation of features with all- or cross-attention. However, concatenating longitudinal features (T^cross_post and T^cross_pre) can introduce redundancy. Additionally, a previous study showed that post-treatment features have more predictive value than pre-treatment ones, which is consistent with clinicians’ perception. Therefore, we propose to fuse post-treatment features (T_post) and their interaction with pre-treatment features (T^cross_post or T^cross_pre) via a skip connection, which exploits high- and low-level features. The ablation experiment (Table 2) shows the effectiveness of this fusion strategy. That is why T_post and T^cross_post are used, while T_pre and T^cross_pre are not (Fig. 1). These two improvements are consistent with clinicians’ perception and offer a degree of clinical interpretability.
  3. A clear justification of the proposed architecture (Reviewer #1). In the ablation studies, we evaluated the effectiveness of each improvement of the proposed method and found that each of them improves predictive performance. We believe that a good longitudinal medical image analysis framework can work with various backbones. In our ongoing work, we are validating the effectiveness of this framework with other backbones.
  4. The rationale of the selected baseline methods (Reviewer #5). Related studies can be categorized into feature contrast-based and feature fusion-based ones. For each category, we selected one of the state-of-the-art methods as the baseline: Siam-CNN (feature contrast-based), DiT (feature fusion-based), and MLDRL (feature contrast and fusion). Code for these methods is publicly available, and DiT and MLDRL exploited longitudinal images for predicting treatment response. Regarding the mentioned references 1-3, ref. 1’s method is based on fusion, while refs. 2-3’s methods are based on contrast and concatenation. However, only ref. 1’s code is available, and it uses a CNN backbone. For a fair comparison, we selected the Transformer-based DiT rather than ref. 1’s method.
  5. Minor errors and unclear descriptions. Reviewer #4: 1) Our dataset includes 170 cases, of which 85 achieved pCR. 2) Our previous work used nnU-Net to segment esophageal cancer, with a median DSC of 0.865. We can cite this article. 3) We tried token-based and image-based time embeddings. The latter performed slightly better. 4) Hyperparameters were determined through experiments. For m, we tried values from 0.3 to 0.8 in steps of 0.1; m=0.5 performed best on the validation set. 5) We will correct Equation 4 and add the description to Fig. 1. Reviewer #5: 1) In the preprocessing stage, we resized the ROI to a fixed size of 32×48×48. We will describe this in the image preprocessing section. 2) We will add descriptions of “hard-split” and “soft-split” at first appearance.
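
The fusion strategy described in point 2 can be sketched as follows. This is an illustrative reading of the rebuttal, not the authors' implementation: the helper names (`cross_attention`, `fuse_post_with_cross`), the single attention head without learned projections, the additive skip connection, and the mean pooling are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context

def fuse_post_with_cross(t_post, t_pre):
    """Fuse post-treatment tokens with their cross-attended counterpart.

    t_post, t_pre: token matrices of shape (n_tokens, dim).
    """
    # T^cross_post: post-treatment tokens attending to pre-treatment tokens.
    t_cross_post = cross_attention(t_post, t_pre)
    # Skip connection: keep the original post-treatment tokens alongside the
    # interaction term (additive here, which is an assumption).
    fused = t_post + t_cross_post
    # Mean-pool the tokens into one feature vector for a downstream classifier.
    return fused.mean(axis=0)
```

In this sketch only T_post and T^cross_post reach the output, matching the statement that T_pre and T^cross_pre are not used directly; pre-treatment information enters solely through the attention context.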




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The novelty of the contrastive loss is better explained. However, the differences with respect to the significant body of work in this domain have not been clarified. The authors have instead opted to focus on trying to justify specific aspects of their model based on previous work, but this still lacks clarity. Similarly, their parameters etc. are only empirically justified and not methodologically rationalized.

    Limited evaluation on a small cohort further lessens enthusiasm.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, all reviewers gave Accept. The problem addressed in this paper (using pre- and post-CT images for prediction tasks) and the designed fusion method are common in the field of cancer imaging. Employing Transformer for modality fusion aligns with the current technological trend. However, due to the very small dataset (170 samples) and the use of only cross-validation without an independent test set, there is a significant risk of overfitting, especially when using the Transformer technique. Overall, this paper is very borderline, and I recommend a Weak Accept.
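
The concern about cross-validation without an independent test set can be made concrete with a small sketch that first sets aside a stratified hold-out set and only then builds cross-validation folds from the remaining cases. The function name `stratified_holdout_and_folds`, the 20% test fraction, and the five folds are hypothetical choices for illustration; the paper's actual protocol is not specified in the reviews.

```python
import numpy as np

def stratified_holdout_and_folds(labels, test_fraction=0.2, n_folds=5, seed=0):
    """Split case indices into a held-out test set and stratified CV folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    test_idx = []
    fold_lists = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        # Shuffle the cases of this class so the split is random but balanced.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        # Deal the remaining cases of this class round-robin into the folds.
        for i, case in enumerate(idx[n_test:]):
            fold_lists[i % n_folds].append(case)
    folds = [np.array(sorted(f)) for f in fold_lists]
    return np.array(sorted(test_idx)), folds
```

Model selection would then use only the folds, with the held-out cases evaluated once at the end. For a cohort of 170 cases with 85 pCR, as reported in the rebuttal, these assumed settings would leave 34 cases for testing and 136 for cross-validation.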

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers suggest acceptance of this paper after rebuttal. The technical novelty seems ok, while the validation dataset is small (only 170 subjects), making this a borderline paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



