Abstract

Deep learning-based (DL) models have shown superior representation capabilities in medical image segmentation tasks. However, realizing these representation powers requires DL models to be trained on extensively annotated data, and the high annotation cost hinders this, thus limiting their performance. Active learning (AL) is a feasible solution for efficiently training models under low annotation budgets. It works by querying unlabeled data for new annotations to continuously train models, so the performance of AL methods largely depends on the query strategy. However, designing an efficient query strategy remains challenging due to the limited information available from unlabeled data for querying. Another challenge is that few methods exploit the information in segmentation results for querying. To address these challenges, we first propose a Structure-aware Feature Prediction (SFP) module and an Attentional Segmentation Refinement (ASR) module to enable models to generate segmentation results with sufficient information for querying. Incorporating these modules enhances the models' ability to capture information related to anatomical structures and boundaries. Additionally, we propose an uncertainty-based querying strategy that leverages the information in segmentation results. Specifically, uncertainty is evaluated by assessing the consistency of anatomical structure and boundary information within segmentation results, calculated as a Structure Consistency Score (SCS) and a Boundary Consistency Score (BCS). Data is then queried for annotation based on this uncertainty. Incorporating the SFP- and ASR-enhanced segmentation models and the uncertainty-based querying strategy into a standard AL pipeline yields a novel method, termed Structure and Boundary Consistency-based Active Learning (SBC-AL).
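For intuition, below is a minimal, hypothetical Python sketch of the uncertainty-based querying step described above. The model interface and the consistency functions (`scs_fn`, `bcs_fn`) are illustrative assumptions, not the authors' actual implementation.

    # A minimal sketch (not the authors' code) of uncertainty-based querying in
    # the spirit of SBC-AL: samples whose segmentation results show low structure
    # (SCS) and boundary (BCS) consistency are treated as uncertain and queried first.
    import numpy as np

    def query_for_annotation(model, unlabeled_pool, scs_fn, bcs_fn, budget):
        """Return the `budget` most uncertain samples from the unlabeled pool.

        `scs_fn` and `bcs_fn` are hypothetical callables that score the consistency
        of a segmentation result (higher = more consistent = less uncertain).
        """
        scores = []
        for image in unlabeled_pool:
            pred = model.predict(image)                 # segmentation result for this sample
            scores.append(scs_fn(pred) + bcs_fn(pred))  # combined consistency score
        order = np.argsort(scores)                      # lowest consistency first
        return [unlabeled_pool[i] for i in order[:budget]]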

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3047_paper.pdf

SharedIt Link: https://rdcu.be/dY6fW

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_27

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zho_SBCAL_MICCAI2024,
        author = { Zhou, Taimin and Yang, Jin and Cui, Lingguo and Zhang, Nan and Chai, Senchun},
        title = { { SBC-AL: Structure and Boundary Consistency-based Active Learning for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {283 -- 293}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes to add two modules on top of a segmentation model to better capture anatomical structures and boundaries. These modules improve the model performance (by refining the output segmentation) and are used to measure uncertainty for the AL query strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper’s focus on anatomical structures and boundary consistency for AL selection is interesting.

    • The paper provides some ablation studies regarding the segmentation model used and proposed module.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The Boundary Consistency Score (BCS) does not look novel. It looks like the general definition of the Hausdorff distance (for reference, see the sketch after this list).

    2) Similarly, the SFP and its DisBlocks look very similar to the decoder part of UNets. The novelty seems limited.

    3) How many times were the experiments repeated? Were different initial labelled sets tested? The initial set can have a big impact on the model performance, so experiments should be repeated with different sets to improve result robustness.

    4) Comparing with a diversity-based AL method such as Coreset and hybrid methods such as BADGE would make the results stronger.

    5) Table 1 does not contain standard deviation information, and from the description, the experiments do not seem to have been repeated several times. However, this is particularly important in AL, where the initial labelled set can have a big impact on the future model performance.

    6) Showing the results of SBC-AL with random sampling would give more weight to the results, as it is unclear whether the improved performance comes from the training or the AL selection.

    7) Why are the comparative methods different for ACDC and KiTS19 (Tables 1, 2, and 4)? Results for VAAL and Mean STD are missing from Table 2.
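    For reference on point 1) above, here is a minimal example of the standard (symmetric) Hausdorff distance between two boundary point sets, computed with SciPy's directed_hausdorff; the point sets are made up purely for illustration.

        import numpy as np
        from scipy.spatial.distance import directed_hausdorff

        # Two toy boundary point sets (rows are 2D points); purely illustrative.
        a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
        b = np.array([[0.0, 0.2], [1.1, 0.0], [0.9, 1.0]])

        # The symmetric Hausdorff distance is the max of the two directed distances.
        hd = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
        print(f"Hausdorff distance: {hd:.3f}")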

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • Size of test set not mentioned. Same for validation set size (assuming the hyperparameters were tuned using a validation set).

    • How were the initial labelled sets chosen? This is not mentioned.

    • Training with the comparative methods is not explained. Were the methods obtained with the same UNet model without the ASR and SFP modules? Were the same hyperparameters used, if only the data changed?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The term “SFE module” is used twice but never introduced. Is that a typo for SFP?

    • Numbering the equations would make it easier to refer to them

    • For clarity, state over what the sums are computed in the provided equations

    • In Table 1, why is the Dice for 6.67% train data (init. set) different from the result of all the other methods except VAAL, but similar to all other methods with 100% train data? The latter is surprising if SBC-AL has additional modules and losses during training (these should also improve training with all data).

    • In terms of results, a plot with the AL curves would be easier to read.

    • It might be interesting to show the results for individual classes in the same dataset (and not just the average results over all classes)

    • To better understand the benefits of the ASR module, close-up images of the UNet’s output segmentation and of the ASR’s output segmentation would be beneficial.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The experimental component of the paper should be expanded: adding more comparative methods (Coreset, BADGE, etc.), comparing them in all datasets, repeating the experiments with different initial sets, and showing results with SBC-AL training but random sampling
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a novel Active Learning (AL) approach for efficient segmentation of medical images. It addresses the labeling bottleneck by attempting to find new candidate images for labeling with the highest possible uncertainty, according to an initially trained model. The authors validate their method on two publicly available datasets and show that it outperforms other commonly used AL strategies when training on only subsets of the total label set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents some innovative solutions for dealing with unlabelled data. Not only that, but they appear to be effective: the SBC-AL strategy outperforms all the other commonly used AL strategies that they tested.

    The authors perform a good literature review of the current field of Active Learning, and their paper is well referenced.

    The methodology they report is very comprehensive and rigorously defined. Their method contains a lot of moving parts, and the authors are quite thorough in their descriptions of all the different modules and how they relate and interact with one another.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is that the method is simply too complicated. I understand that the complexity may be required to achieve the reported results, but between the SFP and ASR modules, the SCS and BCS consistency scores, and the various loss functions, there is simply too much complexity for a MICCAI submission.

    To be fair to the authors, they did do a thorough job explaining all the modules, consistency checks, etc., but because of that, there was not enough room in the paper for any meaningful discussion of the advantages and limitations of the proposed system. There is also no argument made for the generalizability or scalability of the method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The complexity inherent to the proposed method would make it quite challenging to replicate. Other than that, the authors provide a good amount of detail regarding their model architecture.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The main suggestion I have is to include some more figures. The paper has only a single, rather small, figure of the overall architecture, which does not do a good enough job illustrating what the different modules do. Including images of what the SFP and ASR would output for a given input image would be hugely important for helping the reader understand what these modules do. The same goes for the SCS and BCS: providing illustrations of high- vs. low-consistency samples would be very helpful. A picture is worth a thousand words, and I believe that including such illustrations would allow you to cut down on the lengthy descriptions you provide (which at times repeat themselves). In the same vein, I think the authors could stand to reduce the mathematical definitions they provide to just the bare essentials. The paper is needlessly rigorous at times, and some of that material can and should be moved to the supplementary materials.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method shows promise in its reported results; however, the lengthy explanations in the paper need to be trimmed down to make room for more figures and a better discussion of the impact and limitations of the system.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    If the authors are indeed able to make room for some more figures to elucidate their proposed methods, then I am happy to keep my recommendation to accept this paper for publication. I found their method to be well developed and implemented, it just took a bit too long to grok, largely due to the lack of figures detailing the system architecture and outputs.



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel mechanism to effectively select the best samples in active learning. The mechanism involves studying the uncertainty in segmentation predictions, including structure and boundary. The approach has been evaluated with two popular benchmarks for different ratios.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper discusses an important problem of effectively picking samples for annotation in active learning, which can in turn substantially reduce the cost.
    • The authors are aware of the directions being considered in the specified research and have discussed it in the introduction section. The modules SFP and ASR are intuitively designed to extract reliable consistency scores.
    • The usage of both structure and boundary consistency is also reasonable.
    • The experimental setup with ACDC and KiTS19 for different ratios, and the study with different backbones, is appreciated.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The notion of using segmentation quality to understand the input sample is not entirely novel; it is a common technique in model calibration, test-time adaptation, and out-of-distribution detection.
    • Is there an ablation study to understand the impact of the structure and boundary consistency scores?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors address an important problem in the medical imaging community with an interesting and feasible solution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I am staying with my original decision, as I believe the segmentation-driven uncertainty technique has potential to be useful for active learning and related tasks.




Author Feedback

We appreciate the reviewers’ feedback (R#1, R#3, R#4) and address their concerns below.

  • Figures (R#1): we will improve the figures of the SFP and ASR architectures and include figures showing high- and low-consistency samples. We will also discuss the advantages and limitations of the proposed system in the camera-ready paper.

  • Figures (R#1, R#4): due to the page limit, we only showed segmentation results quantitatively in tables. We will show output segmentation results and plots with the AL curves in the camera-ready paper.

  • Novelty notion (R#3): we believe that using segmentation results to understand input samples, as in SBC-AL, is novel. It has not previously been adopted to evaluate uncertainty for querying in AL by considering the consistency of both anatomical structures and boundaries.

  • Ablation study (R#3): we showed the results for the overall uncertainty, but in the early AL stage we evaluated the correlation of SCS and BCS by calculating them separately for unlabelled data. This can help to understand their differences and serve as an ablation study. We will discuss it in the camera-ready paper.

  • Novelty (R#4): we believe SBC-AL is novel in its motivations and designs. Our Boundary Consistency Score (BCS) is motivated by the Hausdorff distance, but it has not been adopted for querying in AL before. The Hausdorff distance is used to evaluate segmentation accuracy on labeled data, whereas BCS is used to evaluate uncertainty on unlabeled data; the calculation of BCS also differs from the Hausdorff distance. SFP and ASR are proposed as novel and efficient modules: they have not been published before, and their motivation is to facilitate evaluating uncertainty and calculating consistency scores during querying. Overall, we have proposed a well-designed and well-motivated AL method, whose novelty has been highlighted by the other reviewers.

  • Experimental details (R#4): the initial labeled set was generated randomly and was the same for all AL methods in each experiment. We repeated the experiments 5 times with different initial sets. In the paper we reported the average values of these experiments, and we will include standard deviations in the camera-ready paper.

  • Experimental results (R#4): the same model was used for all AL methods with the same initial set. With 100% of the training data, the results are the upper limit of the model's performance and are used for comparison. With 6.67% of the data, different AL methods have the same Dice since the segmentation model is initialized from the same initial set. VAAL has a higher Dice because it has a UNet encoder with self-encoding training. We will further explain the plausibility of our results in the camera-ready paper.

  • Comparison methods (R#4): in Table 4 we aim to show that our querying is superior across different segmentation backbones. Given the MICCAI page limit, we therefore showed several strong AL baselines and compared our querying against them. Similarly, we did not show the results of VAAL and Mean STD in Table 2; they are inferior to the other baselines, so we only showed the stronger methods for better comparison. We will improve these descriptions in the camera-ready paper.

  • Random sampling (R#4): the suggestion of random sampling is helpful. Our solution was to evaluate the effectiveness of our query strategy and segmentation networks separately. We used the same networks for our querying strategy and the others, demonstrating the superiority of ours (Tables 1 and 2). We also evaluated the effectiveness of our modules by end-to-end training of these modules without AL querying (Table 3).

  • Individual classes (R#4): in KiTS19 we used the kidney labels and avoided the tumor labels to avoid the effect of unbalanced data. In ACDC, samples are queried based on the uncertainty of individual samples by calculating consistency scores for all foreground regions, not for any specific class, so we only showed average results over all classes.

  • Typos and equations (R#4): we will correct the typos regarding SFP and update our equations in the camera-ready paper.
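As a side note on the reproducibility practice discussed in the rebuttal (repeating runs with different initial labeled sets), here is a minimal hypothetical sketch; `run_al_experiment` is an assumed stand-in for the full training-and-querying pipeline, not the authors' code.

    # Hypothetical sketch of repeating an AL experiment with different randomly
    # drawn initial labeled sets and reporting mean and std of the final Dice.
    import random
    import statistics

    def repeated_al_runs(dataset_indices, init_size, run_al_experiment, n_repeats=5):
        dices = []
        for seed in range(n_repeats):
            rng = random.Random(seed)                          # distinct seed per run
            init_set = rng.sample(dataset_indices, init_size)  # fresh initial labeled set
            dices.append(run_al_experiment(init_set))          # returns final Dice
        return statistics.mean(dices), statistics.stdev(dices)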




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper can be accepted. However, the authors have to further enhance the quality of the paper by addressing the reviewers' comments. It is also important to present the results graphically.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    R4's request for additional results is not feasible. After reading the reviews and rebuttal, I believe the paper has merit and can be accepted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



