Abstract

Post-surgical evaluation and quantification of residual tumor tissue from magnetic resonance images (MRI) is a crucial step for treatment planning and follow-up in glioblastoma care. Segmentation of enhancing residual tumor tissue from early post-operative MRI is particularly challenging due to small and fragmented lesions, post-operative bleeding, and noise in the resection cavity. Although considerable progress has been made on the adjacent task of pre-operative glioblastoma segmentation, more targeted methods are needed to address these specific challenges and to detect small lesions. In this study, a state-of-the-art architecture for pre-operative segmentation was trained on a large in-house multi-center dataset for early post-operative segmentation. Various pre-processing and data sampling techniques, as well as architecture variants, were explored to improve the detection of small lesions. The models were evaluated on a dataset annotated by 8 novice and expert human raters, and their performance was compared against the human inter-rater variability. The trained models’ performance was shown to be on par with that of human expert raters. As such, automatic segmentation models have the potential to be a valuable tool in a clinical setting, offering an accurate and time-saving alternative to the current standard manual method for residual tumor measurement after surgery.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3387_paper.pdf

SharedIt Link: https://rdcu.be/dV5xi

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72089-5_27

Supplementary Material: N/A

Link to the Code Repository

https://github.com/dbouget/validation_metrics_computation https://github.com/raidionics/Raidionics

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hol_Glioblastoma_MICCAI2024,
        author = { Holden Helland, Ragnhild and Bouget, David and Eijgelaar, Roelant S. and De Witt Hamer, Philip C. and Barkhof, Frederik and Solheim, Ole and Reinertsen, Ingerid},
        title = { { Glioblastoma segmentation from early post-operative MRI: challenges and clinical impact } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {284 -- 294}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the area of post-operative brain tumor segmentation. The authors leverage a large dataset to explore the challenges of this task and underscore the vital role of high-quality ground truth data. The authors rigorously examine several factors that impact segmentation performance: 1) sampling strategies for addressing class imbalance caused by reduced post-operative tumor size, and 2) the effect of kernel size and network depth. Their findings suggest that shallower architectures with larger kernels perform better in this context. Additionally, the study explores the impact of ground truth quality by comparing results against expert and novice annotations, indicating that top-performing models align closely with expert consensus. The paper is well-written and presents findings with clarity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The inclusion of a secondary task, classifying residual tumor, is insightful. It highlights the potential for certain models and sampling strategies to achieve high Dice metrics while exhibiting poor specificity. This finding is unexpected and underscores the importance of multifaceted evaluation.

    • The sampling strategies that were evaluated reflect experience from past BraTS challenges, but keeping the same architecture for each comparison was insightful and helpful for further studies.

    • The comparison between the human annotators’ consensus annotations and the methods reveals important aspects regarding the difficulty of the problem and the state of the automatic methods, which were similar to the expert consensus.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The criterion used to choose exp5 wasn’t clear, and at least one other method, exp4, was very close to it (or better in some of the metrics).

    • The selection of the architecture and some of the methods are not explained, which limits the impact of the work. For instance, why did the authors not consider a loss better suited to imbalanced data, for example a surface loss [1, 2]? Also, I agree with the use of nnU-Net, as explained by the authors, but the choice of using AGU-Net is not clear.

    [1] H. Kervadec, J. Bouchtiba, C. Desrosiers, É. Granger, J. Dolz, I. Ben Ayed, “Boundary loss for highly unbalanced segmentation,” Medical Image Analysis, vol. 67, 2021.

    [2] F. Sun, Z. Luo, S. Li, “Boundary Difference over Union Loss for Medical Image Segmentation,” MICCAI, 2023.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The document is well-written and organized. Regarding the text, only minor aspects should be considered: 1) On page 2, tenth line from the bottom, the authors write “Models trained with bot architectures”; shouldn’t it be “with both architectures”? 2) bAcc is not defined in the captions of Tables 1 and 2.

    2. The paper states its contributions at the end of Section 1 on page 3, indicating the study of class imbalance and of small and fragmented lesions as the first one. In sub-section 2.2, this goal is restricted to the sampling strategy and the architecture of the neural network (kernel size and depth), and the experiments reflect approaches found in works on pre-operative brain tumor segmentation (BraTS challenges), so this comparative study is interesting; however, I was surprised that no attention was given to the loss function. Considering those works, we find good principles that are followed, such as ensuring that each batch contains at least one sample with the foreground class and computing the Dice loss per batch instead of per sample. Some works also combine cross-entropy and Dice losses to avoid the impact of batches composed only of the background class (a minimal sketch of this combined loss idea is given after this list of comments), and another set of approaches uses a loss that accounts for the boundary of the object, such as those in references [1, 2] above, to cope with class imbalance. Can we effectively evaluate the sampling strategy or the architecture without proper consideration of the loss function? The two seem connected, and this is a limitation of the present work.

    3. In sub-section 3.2, the authors refer to evaluating performance in terms of the quality of the segmentation and, as an auxiliary task, the classification of residual tumor versus gross total resection. These are not properly defined. Also, how the performance metrics of both should be interpreted in Tables 1 and 2 is not clearly explained.

    4. The use of nnU-Net is motivated in the text, but the choice of AGU-Net is not. Also, from sub-section 3.2, we are informed that the training of the nnU-Net was done in another work (reference [1] of the paper) that is not visible due to the blind review. However, since the paper’s comparison with the nnU-Net depends on that training, we are limited in a proper appraisal of the proposed study. For example, is the poor performance of the nnU-Net in the classification task, which is relevant, due to its training or to its architecture? Could the nnU-Net have achieved a better, or even the best (in this study), performance if it had used the same sampling strategy and architecture adjustments?

    5. The lack of information on the dataset seems to affect the analysis. For instance, a proper comprehension of the result of exp5 on the test set, where the segmentation is compared with the different types of annotation consensus, requires knowing how the GT of the training set was obtained. Was it produced by a senior expert? Did this expert also participate in the annotation of the smaller test set? If so, how does their annotation differ from the annotation consensus?
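    To make the principles mentioned in point 2 concrete, the following is a minimal sketch of a combined cross-entropy and batch-wise soft Dice loss (assuming PyTorch; the function name, loss weighting, and smoothing constant are illustrative and not taken from the paper under review):

        import torch
        import torch.nn.functional as F

        def combined_ce_dice_loss(logits, target, smooth=1e-5, ce_weight=0.5):
            # logits: (B, C, D, H, W) raw network outputs; target: (B, D, H, W) integer labels.
            ce = F.cross_entropy(logits, target)

            probs = torch.softmax(logits, dim=1)
            num_classes = logits.shape[1]
            target_onehot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()

            # Summing over the batch and spatial dimensions yields one Dice value per class,
            # so the Dice term stays well-defined even when some samples contain no foreground.
            dims = (0, 2, 3, 4)
            intersection = (probs * target_onehot).sum(dims)
            cardinality = probs.sum(dims) + target_onehot.sum(dims)
            dice = (2.0 * intersection + smooth) / (cardinality + smooth)
            dice_loss = 1.0 - dice[1:].mean()  # exclude the background class from the Dice term

            return ce_weight * ce + (1.0 - ce_weight) * dice_loss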

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The most interesting aspect of the paper is the second task on the smaller dataset, which shows that although a method may achieve a good segmentation Dice, it may still exhibit low specificity in classifying the residual tumor.

    However, 1) the elements chosen for the study (sampling, number of levels, kernel size) make sense given today’s knowledge of how convolutional networks work, but there are also techniques known to be effective for class imbalance that were not considered, and no justification was given for this selection.

    2) The choice of the architecture used in the study is not properly justified, and the cause of the problem imputed to the nnU-Net was not studied.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Some of my concerns were addressed by the authors in the rebuttal. If this information is added to the paper, it may be helpful for those starting to work on this problem. However, I still see the impact of the work as limited, so I have only slightly changed my opinion.



Review #2

  • Please describe the contribution of the paper

    The authors conduct several experiments varying patch size, architecture, and preprocessing strategies for post-operative residual brain tumor segmentation. To this end, they leverage existing multi-institutional, multi-rater datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • extensive literature review in the introduction
    • high quality of writing
    • deep learning experiments are motivated and not ad-hoc
    • the idea of comparing model performance to human performance is good
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • even though the idea of a human baseline is commendable, the execution is poor. A single novice producing false positives can “poison” the union segmentation employed as a reference label. Therefore, the results in Table 2 are highly questionable.

    • There seems to be no technical innovation? If there is one, the authors need to communicate it more clearly. Using U-Nets for glioma segmentation is not novel at all (even though most papers are concerned with pre-operative tumors).

    • It is a bit unclear what serves as training, validation and test set here. How were they defined? What is the external test set of 73 patients?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It would be nice to provide the source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • it would be good to contrast network performance to inter-rater scores reported in reference #18
    • a better reference should be chosen for Dice similarity comparisons in Tab. 2, e.g. the expert consensus
    • it should be better communicated what serves as training, validation and test set

    • In the tables, the best-performing methods should be highlighted
    • Maybe the experiment description should be in the table caption or somewhere close to it (exp1-7 convey no information to the reader)
    • Authors should avoid the term ground truth, as they just have reference annotations here (see https://arxiv.org/abs/2301.00243)

    conference fit: (not affecting the review recommendation):

    • A clinical journal might be an even better fit than MICCAI, even though I appreciate the work
    • another good fit might be BrainLes workshop
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Besides the lack of innovation and the weaknesses described above, the experiment design with the potentially broken reference label casts a shadow of doubt over the reported results in Tab. 2.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This study explores the development and validation of advanced machine learning models for the segmentation of glioblastoma from early post-operative MRI scans. Leveraging state-of-the-art architectures, the paper reports the adaptation of these models to handle the specific challenges posed by early post-operative scenarios, such as the presence of small and fragmented lesions. The authors compare the performance of their models against human expert raters, suggesting that their automated approach can match expert-level segmentation, offering a potentially valuable tool for clinical settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel Application of Existing Architectures: The paper adeptly applies established architectures like the nnU-Net and AGU-Net to the less-studied domain of early post-operative MRI segmentation. The adaptation to address specific challenges like the detection of small lesions and managing class imbalance is noteworthy.

    2. Comprehensive Evaluation: The evaluation of the models is robust, involving multiple metrics (Dice score, Hausdorff distance) and comparisons against both novice and expert human raters. This comprehensive benchmarking, especially the use of an inter-rater test set annotated by different levels of expertise, underscores the clinical relevance of the findings.

    3. Demonstration of Clinical Feasibility: The study addresses the clinical feasibility by comparing the automatic segmentation results with manual methods currently used in practice, providing a clear argument for the utility of the proposed method in real-world settings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Incremental Innovation: The advancements over prior works, particularly those presented by the same teams in earlier publications, are somewhat incremental. This may limit the perceived novelty of the research.

    2. Lack of Open Resources: For a study of this nature, the availability of the dataset, code, and pre-trained model weights is crucial for reproducibility and further research. The absence of these resources is a significant limitation.

    3. Insufficient Comparison with Own Previous Work: Although the paper mentions previous own studies, it lacks a detailed comparison with these, especially those employing similar methodologies. This would be vital to highlight the specific contributions and advancements made in the current study.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    A significant limitation concerning the reproducibility of the results presented in this paper is that the implementation code, the pre-trained models, and the dataset are not publicly available. Providing access to these resources is crucial to enhance the transparency and reproducibility of the research.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Clarify and Expand on Innovations: The paper would benefit from a more detailed discussion of its innovations, especially in relation to previous work by the authors. Highlighting specific advancements made in adapting the segmentation models to the challenges of early post-operative MRI scans can help underline the novelty of your approach. It would be particularly helpful to elucidate the adaptations made to handle small and fragmented lesions, as well as class imbalance.

    2. Open Access to Resources: To enhance the credibility and impact of your work, I recommend making the dataset, codebase, and pre-trained weights accessible to the research community. Open access to these resources would facilitate reproducibility and encourage further research and development.

    3. Enhanced Comparative Analysis: The manuscript would be strengthened by a more comprehensive comparative analysis with previous studies. This could include detailed benchmarks against existing methods, particularly those using similar architectures. Visual aids such as additional tables or charts to compare these results would provide clearer insights into the strengths and limitations.

    4. Discuss Generalizability: Please consider providing more information on the generalizability of the models used to other types of tumors or to different medical imaging modalities. Using a subset of only 20 patients as the test set may not be enough for assessing generalizability.

    5. Statistical Analysis: A deeper statistical analysis to support why certain models outperformed others could enhance the scientific rigor of the findings. This analysis could include confidence intervals or p-values to statistically quantify differences in performance, providing a more solid foundation for the claimed improvements (one possible approach is sketched after this list).
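    As one possible illustration of the paired statistical comparison suggested in point 5, a minimal sketch using a Wilcoxon signed-rank test and a bootstrap confidence interval (assuming SciPy and NumPy; the per-patient Dice arrays are hypothetical placeholders, not results from the paper):

        import numpy as np
        from scipy.stats import wilcoxon

        # Hypothetical per-patient Dice scores of two models on the same test patients.
        dice_model_a = np.array([0.71, 0.64, 0.80, 0.55, 0.77, 0.69, 0.73, 0.60])
        dice_model_b = np.array([0.68, 0.66, 0.78, 0.52, 0.74, 0.70, 0.71, 0.58])

        # Paired, non-parametric test of whether the two models differ systematically.
        stat, p_value = wilcoxon(dice_model_a, dice_model_b)
        print(f"Wilcoxon statistic = {stat:.3f}, p = {p_value:.4f}")

        # Bootstrap 95% confidence interval for the mean per-patient Dice difference.
        rng = np.random.default_rng(0)
        diffs = dice_model_a - dice_model_b
        boot = [rng.choice(diffs, size=diffs.size, replace=True).mean() for _ in range(10000)]
        ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
        print(f"95% CI for mean Dice difference: [{ci_low:.3f}, {ci_high:.3f}]")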

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The decision to give a weak acceptance originates from the paper’s solid methodological approach, thorough evaluation metrics, and its relevance to clinical applications. However, the incremental nature of the innovation and the lack of open resources slightly diminish its impact. Improving upon these aspects could potentially elevate the manuscript to a strong acceptance after the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

First, we sincerely thank the reviewers for their thorough reviews and for providing us with insightful and valuable feedback. The rebuttal focuses on addressing the major concerns raised, and all requested changes will be included in the camera-ready version of the manuscript upon acceptance.

  1. “Lack of technical innovation might limit the perceived novelty of the research.” As stated by one of the reviewers, the use of U-Nets is not novel for pre-operative glioblastoma segmentation, but there are still very few works employing these models for early post-operative segmentation. Early post-operative images present distinctive challenges, such as small and fragmented tumors, and the aim of the study was to investigate specific sampling strategies and architectural modifications to improve their detection. However, none of the experiments led to significant improvements in the scores. The novelty therefore lies in the effort to shed light on these challenges and on the importance of the quality of the ground truth references, through a thorough comparison with human expert raters. As the focus of the research community is shifting from pre-operative to early post-operative segmentation, and a post-operative segmentation task is part of the BraTS challenge for the first time this year, the results of the study, and in particular the comparison with a human baseline, should be of interest to the research community.

  2. “Questionable quality of “ground truth” and consensus agreement annotations.” First, all annotations labelled as “ground truth” were annotated by single experts, independent of the inter-rater annotations. In the consensus agreement annotations, a voxel was only labelled as positive if annotated by at least half of the annotators, which in theory should make them more robust than the ground truth annotations (a minimal sketch of this majority-vote rule is given after this list of responses). The motivation behind the inter-rater analysis was to contrast the model performance against human rater performance using a reference completely independent of both, which is why the ground truth annotations, rather than the expert consensus annotation, were used as the reference in Table 2. Investigating the quality of the ground truth annotations by comparison with the consensus agreement annotations was also an important motivation for this analysis, illustrating the difficulty of the task through the disagreement between human raters.

  3. “Source code and data are not provided, limiting the reproducibility of the study.” The source code for the data processing, model implementation, and validation pipeline, as well as the weights for the best-performing model, are openly available on GitHub and will be referenced in the camera-ready manuscript. The dataset cannot be shared openly due to patient privacy, but access can be granted through collaborative projects.

  4. “Lacking justification for selection of experiments, architectures, and loss functions.” The motivation for the experiments on sampling strategies for handling class imbalance, and on different network depths and kernel sizes to avoid suppressing thin and fragmented residual tumor lesions, is explained in Section 2.2 of the paper. Regarding the choice of architecture, a previous study showed that the nnU-Net tends to produce more false positives, achieving better voxel-wise segmentation performance at the cost of poor patient-wise classification performance. The AGU-Net achieved similar segmentation performance with reasonable classification performance. The attention component of the AGU-Net architecture was also deemed important to help the network locate the area of the residual tumor, which is why it was the architecture of choice for this study. Initial experiments with different loss functions were conducted but did not lead to any improvement; further experiments were not prioritized due to limited time but are part of planned future work.
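  For clarity, the majority-vote rule described in point 2 above can be written as the following minimal sketch (assuming NumPy; the function and array names are illustrative):

      import numpy as np

      def consensus_annotation(rater_masks):
          # rater_masks: binary array of shape (n_raters, D, H, W), values in {0, 1}.
          # A voxel is labelled positive only if at least half of the raters annotated it.
          n_raters = rater_masks.shape[0]
          votes = rater_masks.sum(axis=0)
          return (votes >= n_raters / 2).astype(np.uint8)

      # Example: with 4 raters, a voxel needs at least 2 positive votes to enter the consensus.
      masks = np.random.default_rng(0).integers(0, 2, size=(4, 8, 8, 8))
      consensus = consensus_annotation(masks)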




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While the paper has limitations as detailed by the reviewers, the translation value is high and the clinical problem of classifying residual tumor in post-op MRI scans has received little attention. The paper can fit into the Translational track as a poster.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper evaluates methods for segmenting glioblastoma from post-operative MRI. It tackles important clinical questions (post-operative setting, small lesion detection, class imbalance) and can thus be of interest to the community. The evaluation is quite extensive. However:

    • it is important to improve the reproducibility of the method, in particular for such experimental studies
    • some parts of the experimental design remain unclear (in particular in relationship to the results of Table 2)
    • the methodological choices need to be better motivated

    The authors need to carefully address all the concerns raised by the reviewers in the final version, including clarification of the experimental design (in particular in relation to Table 2), clarification of the methodological choices, and a more extensive description of the dataset.



