Abstract

Automated tools developed to detect multiple sclerosis lesions in spinal cord MRI have thus far been based on processing single MR sequences in a deep learning model. This study is the first to explore a multi-sequence approach to this task and we propose a method to address inherent issues in multi-sequence spinal cord data, i.e., differing fields of view, inter-sequence alignment and incomplete sequence data for training and inference. In particular, we investigate a simple missing-modality method of replacing missing features with the mean over the available sequences. This approach leads to better segmentation results when processing a single sequence at inference than a model trained directly on that sequence, and our experiments provide valuable insights into the mechanism underlying this surprising result. In particular, we demonstrate that both the encoder and decoder benefit from the variability introduced in the multi-sequence setting. Additionally, we propose a latent feature augmentation scheme to reproduce this variability in a single-sequence setting, resulting in similar improvements over the single-sequence baseline.
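As a rough illustration of the mean-imputation idea described in the abstract, the sketch below fills the feature slot of each missing sequence with the element-wise mean of the features from the available sequences. It is not the authors' implementation (the code was not released); the function name `mean_impute`, the dictionary layout, and the toy array shapes are assumptions made for this example.

```python
import numpy as np


def mean_impute(features, sequence_order):
    """Stack per-sequence features, filling missing ones with the mean.

    features: dict mapping sequence name -> feature array, or None if missing.
    sequence_order: fixed list of sequence names expected by the model.
    (Both the layout and the names are hypothetical, for illustration only.)
    """
    available = [features[name] for name in sequence_order
                 if features.get(name) is not None]
    if not available:
        raise ValueError("at least one sequence must be available")

    # Element-wise mean over the available sequences only.
    mean_features = np.mean(np.stack(available, axis=0), axis=0)

    # Keep real features where present, use the mean where a sequence is missing.
    filled = [features[name] if features.get(name) is not None else mean_features
              for name in sequence_order]
    return np.stack(filled, axis=0)


# Toy example: two available sequences, one missing.
toy_features = {
    "T2Sag": np.full((4, 8), 2.0),
    "STIR": np.full((4, 8), 4.0),
    "T1": None,
}
stacked = mean_impute(toy_features, ["T2Sag", "STIR", "T1"])
```

Under this sketch, when only one sequence is available the mean equals that sequence's features, so every slot receives a copy of them. This corresponds to the single-sequence inference case discussed in the abstract.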

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3549_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3549_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wal_Multisequence_MICCAI2024,
        author = { Walsh, Ricky and Gaubert, Malo and Meurée, Cédric and Hussein, Burhan Rashid and Kerbrat, Anne and Casey, Romain and Combès, Benoit and Galassi, Francesca},
        title = { { Multi-sequence learning for multiple sclerosis lesion segmentation in spinal cord MRI } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses multiple sclerosis (MS) lesion segmentation in the scenario where one or more sequences are not available. The authors propose a simple missing-modality method named mean imputation, which replaces missing features with the mean over the available sequences, and the evaluation results demonstrate the effectiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work aims to solve the MS lesion segmentation task in the real-world scenario where one or more sequences are missing. The task itself is challenging and meaningful in clinical practice.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Mean imputation is a simple strategy that replaces the missing sequence(s) with the mean of the available sequences; the effectiveness of the proposed method is questionable.

    MS lesion segmentation usually relies heavily on the T2 sequence. The evaluation results show only a marginal improvement when mean imputation is involved, which is not persuasive evidence of the benefits of the proposed method.

    Lack of qualitative results (figures) to give a clearer idea of the ROI of the segmentation task.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed work aims to solve the problem of missing sequences when performing lesion segmentation in spinal cord MRI. However, this task relies heavily on T2, and the authors have shown that mean imputation provides only a marginal improvement.

    The Dice coefficient of the method is less than 0.5, which is low for such a segmentation task. Can the authors justify this performance from the perspective of clinical practice, i.e., is this performance sufficient for use in conventional clinical practice?

    The authors may also consider evaluating the work from a clinical perspective.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The necessity of the proposed work in such a scenario is questionable. The proposed method (mean imputation) is simple and its effectiveness is marginal in terms of performance. This work lacks evaluation from a clinical perspective.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    In the rebuttal, the authors claim to be ‘introducing a problem framework & pipeline’; however, this problem has been well established and solved in a previous work:

    Liu, H., Fan, Y., Li, H., Wang, J., Hu, D., Cui, C., … & Oguz, I. ModDrop++: A dynamic filter network with intra-subject co-training for multiple sclerosis lesion segmentation with missing modalities. MICCAI 2022.

    The experimental settings and problem domains of the two works are very similar (MS lesion segmentation). The authors did not acknowledge this work and therefore omitted the corresponding comparison with it.

    Regarding the authors’ feedback on the importance of T2, the authors did not address the problem that the trained model performs well with T2Sag only and that additional modalities can reduce performance. Also, what if there is no T2 for training: can mean imputation work with only STIR, T1 and MP2RAGE? In addition, the experimental result on one dataset is not persuasive to me for the MS lesion segmentation task.

    Authors stated that ‘clinicians rely on lesion counts for patient stratification’, however, the clinical studies also rely heavily on lesion volume as quantitative metrics for brain volumetric analysis. Dice score is important in this case. For these reasons, the significance of the topic is not justified.

    For these reasons, the significance of this work is not justified properly to me. My original recommendation of reject remains.



Review #2

  • Please describe the contribution of the paper

    The authors explore a multi-modal setting for the automated detection of MS lesions in spinal cord (SC) MRI. The contribution lies in the proposed multimodal setting, which handles missing modalities through a mean imputation approach. Finally, the authors explore random perturbations of the latent space (derived from differences between the multi-modal and monomodal latent spaces) as a method to make monomodal training more robust as well. Experiments are carried out on an in-house cohort, with a test set of 58 patients. A comparison with monomodal baselines is conducted.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Original study on a multimodal setting for MS in spinal cord MRI; seems a quite unique cohort
    • Paper well written, easy to follow
    • Results show statistically significant improvement over the baseline

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • No real technical novelty; the applied methods are well established (image registration methods, previous use of multimodal imaging with missing modalities via mean imputation, etc.)
    • Single-cohort analysis; how it would generalize to other settings is unclear
    • Missing some cohort details
    • Lack of visual results

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    There is no mention of the release of code or data. It thus seems that this study cannot currently ensure reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I’ve read the paper with interest; it is clearly written and easy to follow. There are, though, some comments, mostly to clarify some aspects of the study:

    • This seems to be a private cohort. I was wondering about the differences in the imaging modalities acquired: I understood the FOV might be different, but how different is the spatial resolution?
    • Could the authors clarify how many subjects they have in each of the three subsets for pure testing?
    • I was also wondering about the rationale for including some contrasts, for instance MP2RAGE and T1w without contrast. I assume these two contrasts are redundant, so not both of them are available for each patient, correct? It seems from Figure 1 of the supplementary material that there is indeed either one or the other; please confirm. T2 axial and T2 sagittal, though, seem to be available in the same subjects: are these images treated as different or merged in the registration process? Overall, I found it unclear which sequences are used when. In Table 1 of the results I do not see MP2RAGE; is T1 referring to both T1w and MP2RAGE?
    • Could the authors clarify, for this cohort, the age range, gender, and importantly the lesion volume distribution per patient (histogram?). It is unclear how many lesions are actually present in these patients. It would also be interesting to see how this changes within the sub-cohort testing sets.
    • I was curious to understand whether one contrast provides more evident information in the weights of the model's decisions; maybe this is the T2w contrast. Overall, I think it would be interesting to discuss the contribution of each sequence to the model performance, perhaps also mentioning a minimal multi-modal setting. Though the main conclusion is that this model is already useful when inferring on a single modality, some additional explainability would be great.
    • It is a pity that no reference is made to code/model or data availability. I understand releasing healthcare data is not easy. It would, though, be beneficial to release the model or weights so that other researchers can test it out of domain. It would indeed be great to have an idea of the generalizability of the methods; could the authors report out-of-domain performance?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend this paper for acceptance; I think it is a fair conference paper. The contribution is not methodological, but the derived multi-sequence study is meaningful and has interesting outcomes. I do, though, penalise the fact that no out-of-domain analysis was conducted.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose an algorithm for multiple sclerosis lesion segmentation in MR images of the spinal cord. With the goal of using images from all available MR sequences per patient, the authors set up nnU-Net with multi-sequence input, using mean imputation for missing sequences. Subsequent experiments reveal that even single-sequence inputs at inference time profit from multi-sequence training. Based on this not immediately intuitive finding, the authors propose an augmentation scheme in latent space for training a corresponding single-sequence model, which results in comparable performance improvements.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is very well written and easy to follow. The proposed approach appears sound, seems original in its given setting, and is well described in all important aspects (dataset, model architecture, loss terms, etc.). Moreover, starting from what I assume was a chance finding (namely, that single-sequence inference profits from multi-sequence training) and a corresponding thorough analysis, the authors devise an extension to their approach (namely, noise augmentations in latent space) leading to further improvement (namely, increased performance after single-sequence training).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The only major weakness that I see is that there is no indication that the code and weights of the authors’ models and experiments will be made public. Apart from that, I have only minor comments and suggestions for improvement (see detailed comments below).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Regarding the finding that even single-sequence inference profits from multi-sequence training: It would be interesting to see if further performance improvements would also be possible in multi-sequence training (1) by even randomly dropping available sequences on purpose per sample and replacing them via mean imputation, loosely similar to dropout, to further increase data variability and thus feature robustness or (2) by using the proposed latent augmentation on top of mean imputation and multi-sequence training. While, by no means, I would expect such experiments be added, maybe this might be an interesting idea for future research.
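Suggestion (1) above could look roughly like the following sketch, which randomly hides available sequences during training while always keeping at least one, so that mean imputation then has to fill the hidden slots. This is only an illustration of the idea, not something evaluated in the paper; `drop_sequences`, `drop_prob`, and the dictionary layout are hypothetical.

```python
import numpy as np


def drop_sequences(features, drop_prob, rng):
    """Randomly mark available sequences as missing (None), keeping at least one.

    features: dict mapping sequence name -> array, or None if already missing.
    drop_prob: probability of hiding each available sequence (assumed parameter).
    rng: a numpy.random.Generator, passed in for reproducibility.
    """
    available = [name for name, feat in features.items() if feat is not None]
    if not available:
        raise ValueError("at least one sequence must be available")

    # Each available sequence survives with probability 1 - drop_prob.
    kept = [name for name in available if rng.random() >= drop_prob]

    # Never drop everything: fall back to one randomly chosen sequence.
    if not kept:
        kept = [available[int(rng.integers(len(available)))]]

    return {name: (feat if name in kept else None)
            for name, feat in features.items()}


rng = np.random.default_rng(0)
sample = {"T2Sag": np.ones((4, 8)), "STIR": np.zeros((4, 8)), "T1": None}
all_dropped = drop_sequences(sample, drop_prob=1.0, rng=rng)   # one sequence survives
none_dropped = drop_sequences(sample, drop_prob=0.0, rng=rng)  # nothing is hidden
```

The returned dictionary keeps the same keys as the input, so it could be passed to whatever imputation step normally handles genuinely missing sequences.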

    Typographic errors etc.:

    • Section 1, page 2: “deep learning for medical image” should be “medical imaging”, “medical images” or “medical image analysis”, I guess.
    • Section 2.1, page 2: “lumber vertebrae” should be “lumbar vertebrae”.
    • Section 2.1, page 2: The pointer to the suppl mat, “see Supple. Fig 1 for more details”, should, in my opinion, follow directly after the section’s first paragraph (so after “For each subject, we included all the available acquisitions.”), as the contents of the remaining 2nd and 3rd paragraphs of Section 2.1 are not reflected in Suppl Fig. 1.
    • Section 2.1, page 2: “An experienced rater revised …” – was this the same rater always? Also, was it a novel rater (as compared to the experts who produced the delineations)? If so, this could maybe be rephrased to “One independent experienced rater revised …” for clarity.
    • Section 2.3, page 4: “A convolution with 320 output channels was then applied …” – I guess this is a 1x1 convolution, or does it use a larger kernel? This should be clarified for completeness.
    • Section 2.3, page 4: “the skip connections … were taken only from the T2Sag encoder.” – What exactly is the motivation here? I guess it is the same as stated in Section 2.4 (namely, T2Sag is the most common sequence); however, for clarity I would also state it here (or, of course, the real motivation if there was another one).
    • Section 2.4, page 5: “… followed by a Leaky ReLU layer to maintain original feature scale.” – It is not immediately clear to me in which way a leaky ReLU helps correct the scaling effects that were introduced by the additive noise. Maybe this sentence could profit from an additional clarification (in the Suppl Mat if space is scarce) or a reformulation. Or else, maybe I am just not fully understanding the sentence here.
    • Section 3.1, page 4: “The Mean Imputation method was trained on all sequences” – Although this becomes quite clear from context, maybe a better formulation would be something like “… was trained on all available sequences per patient”.
    • Section 3.2, page 7: “the weights … will be applied to the mean features from all available sequences” – I am not sure if I understand what this sentence wants to say. Is this referring to the fact that, through mean imputation, there is a “cross-talk” between the weights of all sequences in backpropagation? Again, maybe a slight reformulation (mentioning backprop) would help; or else, maybe I am misunderstanding the intent here.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper falls well in the realm of interest for the MICCAI community, in that it tackles a relevant problem (MS lesion segmentation) with an original approach. The proposed approach is well described and thoroughly evaluated. Although no indication is made that the code, weights, and/or data will be made publicly available, at least the method description appears detailed enough to me to make it reproducible.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I stand with the justification in my pre-rebuttal recommendation (see above). Regarding the rebuttal, authors addressed all major concerns of the remaining reviewers adequately.




Author Feedback

We thank the reviewers for their constructive feedback. We address below only the major concerns. Otherwise, we thank the reviewers for detailed feedback on phrasing & clarity (R4) and requested cohort information (R1) which we will include in the paper.

R3’s main criticism was that Mean Imputation yielded only marginal improvements.

  • While using multiple sequences at inference shows only marginal improvements, this is not the main contribution of our paper. The main contributions are: 1) introducing a problem framework & pipeline as the first application of a multimodal approach to spinal cord (SC) lesion segmentation, serving as a foundation for future studies; 2) identifying that multimodal training helped monomodal inference. Specifically, there are notable differences between a T2Sag model and Mean Imputation using only T2Sag at inference, consistent across both metrics and all 3 test cohorts, which is convincing. Half of Sec. 3 is dedicated to analysing this interesting result, demonstrating potential benefits in both the encoder and decoder, and proposing latent augmentation to replicate the multimodal benefit in a monomodal setting. These contributions were acknowledged by the other reviewers.

R3 questioned the necessity of the work given the relative importance of T2 for lesion segmentation (are multiple sequences necessary?)

  • This is an important and open question as radiologists are currently recommended to acquire at least 2 sequences (NAIMS), and multiple scans require further time. As the first study on multimodal segmentation in SC data, our results can inform future research on this topic.

R3 and R1 noted the lack of clinical and out-of-domain evaluations, respectively.

  • Our study is the first application of multiple sequences for this task in SC data. Our aim was to explore various aspects of the problem like the data pipeline and modelling, and to analyse the counter-intuitive results in Sec. 3.2. While evaluation on other datasets was not within the scope, we acknowledge its importance and plan to investigate it in future work.

R3 questioned the clinical benefit based on the Dice score (median=0.5).

  • This seems low compared to brain studies, but the SC poses greater challenges due to more artefacts, partial volume effects, and limited resolution. As outlined in Sec. 3.3, the obtained Dice is similar to clinicians in a referenced study. Additionally, lesion-wise scores are higher than the Dice, which is more important in current clinical practice as clinicians rely on lesion counts for patient stratification. Finally, an ongoing clinical study in our team on a simpler segmentation method for SC lesions has demonstrated clinical benefits in lesion sensitivity.
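To make the distinction between the voxel-wise Dice and lesion-wise scores concrete, the toy example below computes both on a one-dimensional mask. It is a minimal sketch: the rule that any overlap counts as a detected lesion, the 1-D layout, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def dice_score(reference, prediction):
    """Voxel-wise Dice: 2 * |A and B| / (|A| + |B|)."""
    reference = np.asarray(reference, dtype=bool)
    prediction = np.asarray(prediction, dtype=bool)
    total = reference.sum() + prediction.sum()
    if total == 0:
        return 1.0  # convention assumed here when both masks are empty
    return 2.0 * np.logical_and(reference, prediction).sum() / total


def find_runs(mask):
    """Return (start, end) index pairs of consecutive True runs in a 1-D mask."""
    padded = np.concatenate(([0], np.asarray(mask, dtype=int), [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)
    ends = np.flatnonzero(diff == -1)  # exclusive end index
    return list(zip(starts, ends))


def lesion_sensitivity(reference, prediction):
    """Fraction of reference lesions touched by at least one predicted voxel."""
    prediction = np.asarray(prediction, dtype=bool)
    lesions = find_runs(reference)
    if not lesions:
        return float("nan")
    return float(np.mean([prediction[s:e].any() for s, e in lesions]))


# Two reference lesions (4 voxels each); the prediction hits both imperfectly.
reference = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
prediction = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

dice = dice_score(reference, prediction)                 # 8/13, about 0.615
sensitivity = lesion_sensitivity(reference, prediction)  # 1.0
```

Here both lesions are detected even though the voxel overlap is imperfect, which is one way a lesion-wise score can exceed the Dice.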

Reviewers highlighted the lack of released code and data, affecting reproducibility.

  • Due to an NDA with an industry partner, the code and models cannot be released. However, the data can be provided upon request and we will add this clarification in the paper. Moreover, we have ensured that the algorithm and experiments are described as clearly and comprehensively as possible to facilitate reproducibility.

R1&R3: Lack of visual segmentation results.

  • We agree with the reviewers that this element is missing. By reorganising the supplementary materials, we can include a figure with segmentation outputs of the model.

R1 asked about the rationale for including contrasts and clarification on “which sequences are used when”.

  • As this study explores multimodal SC lesion segmentation, we included all available data. Indeed we do not expect a patient to have both T1 and MP2RAGE acquisitions but we aim for a model to handle scenarios where either one is available.
  • To clarify Table 1: in the last row, for example, the model’s input is 5 images, 1 per marked sequence. We treat axial and sagittal T2 as separate “sequences” because of significant visual differences.
  • MP2RAGE is MP2 in Table 1 - we will clarify this in the paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    In spite of the flaws pointed out by Reviewer 2, the paper has some merits, e.g. new clinical applications, and is worthy of more exposure. Therefore, I am inclined to agree with the other two reviewers’ opinions and accept it.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    In spite of the flaws pointed out by Reviewer 2, the paper has some merits, e.g. new clinical applications, and is worthy of more exposure. Therefore, I am inclined to agree with the other two reviewers’ opinions and accept it.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    R2 pointed out some weaknesses of the study. However, the study is clinically valuable and results can inform future research on this topic. Therefore, the paper is suggested for acceptance by the MICCAI conference.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    R2 pointed out some weaknesses of the study. However, the study is clinically valuable and results can inform future research on this topic. Therefore, the paper is suggested for acceptance by the MICCAI conference.


