Abstract

Lung ultrasound (LUS) has become an indispensable tool at the bedside in emergency and acute care settings, offering a fast and non-invasive way to assess pulmonary congestion. Its portability and cost-effectiveness make it particularly valuable in resource-limited environments where quick decision-making is critical. Despite its advantages, the interpretation of B-line artifacts, which are key diagnostic indicators for conditions related to pulmonary congestion, can vary significantly among clinicians and even for the same clinician over time. This variability, coupled with the time pressure in acute settings, poses a challenge. To address this, our study introduces a new B-line segmentation method to calculate congestion scores from LUS images, aiming to standardize interpretations. We utilized a large dataset of 31,000 B-line annotations synthesized from over 550,000 crowdsourced opinions on LUS images of 299 patients to improve model training and accuracy. This approach has yielded a model with 94% accuracy in B-line counting (within a margin of 1) on a test set of 100 patients, demonstrating the potential of combining extensive data and crowdsourcing to refine lung ultrasound analysis for pulmonary congestion.
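The counting metric reported in the abstract (accuracy within a margin of 1 B-line) can be illustrated with a small sketch. The code below is not the authors' implementation; it assumes, purely for illustration, that B-lines are obtained as connected components of a binary segmentation mask and shows how a "within 1" counting accuracy could be computed against reference counts. The function names, the min_area threshold, and the dummy data are hypothetical.

```python
import numpy as np
from scipy import ndimage


def count_b_lines(mask: np.ndarray, min_area: int = 50) -> int:
    """Count B-lines in a binary segmentation mask (H x W).

    Illustrative assumption: each sufficiently large connected component
    of the predicted mask is treated as one B-line.
    """
    labeled, n_components = ndimage.label(mask)
    if n_components == 0:
        return 0
    sizes = ndimage.sum(mask, labeled, index=range(1, n_components + 1))
    return int(np.sum(np.asarray(sizes) >= min_area))


def within_one_accuracy(pred_counts, ref_counts) -> float:
    """Fraction of frames whose predicted count is within +/-1 of the reference."""
    pred, ref = np.asarray(pred_counts), np.asarray(ref_counts)
    return float(np.mean(np.abs(pred - ref) <= 1))


# Hypothetical usage with dummy data (not the paper's dataset):
rng = np.random.default_rng(0)
masks = (rng.random((10, 128, 128)) > 0.995).astype(np.uint8)  # stand-in masks
preds = [count_b_lines(m) for m in masks]
refs = [max(0, p + int(rng.integers(-1, 2))) for p in preds]   # stand-in references
print(f"within-1 counting accuracy: {within_one_accuracy(preds, refs):.2f}")
```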

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3582_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3582_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Asg_Can_MICCAI2024,
        author = { Asgari-Targhi, Ameneh and Ungi, Tamas and Jin, Mike and Harrison, Nicholas and Duggan, Nicole and Duhaime, Erik P. and Goldsmith, Andrew and Kapur, Tina},
        title = { { Can Crowdsourced Annotations Improve AI-based Congestion Scoring For Bedside Lung Ultrasound? } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper investigates the potential of utilizing crowdsourced data, beyond the amount that expert clinicians can provide, for training U-Net-based models for B-line segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This study introduces a method of using crowdsourced annotations to improve B-line segmentation in lung ultrasound analysis.
    2. This research has practical implications in clinical settings, especially for managing heart failure patients, by providing a reliable method for monitoring pulmonary congestion.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper only compares the results of five Attention U-Nets trained on datasets of different sizes. What is the main algorithmic contribution?
    2. The experimental results shown in Table 1 appear to have no significant differences; how do the authors justify their conclusions?
    3. The experiments were conducted only on a private dataset, and there seems to be a certain bias. Do the authors have any plans to make the dataset publicly available?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See strengths and weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a method of using crowdsourced annotations to improve B-line segmentation in lung ultrasound analysis, which seems interesting. However, the main algorithmic contribution is not clear, and the experimental results are not sufficient to support the authors' claims.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Good response; I hope to see the released dataset as soon as possible.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a gamified crowdsourcing technique for B-line identification and segmentation in lung ultrasound analysis. The authors collected over 550,000 B-line annotation opinions within three weeks. The findings suggest that incorporating larger amounts of crowdsourced data into the training dataset can improve accuracy in B-line segmentation and quantification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The investigation of the potential of crowdsourcing for medical image analysis is interesting.
    • The methodology and implementation details are well presented.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiments are insufficient.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • In Section 1, paragraphs 3-4, I would recommend that the authors add a table, in either the main text or the supplementary materials, summarizing the data size and performance of the related works presented, for better clarity.
    • I feel the challenge for clinicians annotating the B-lines is not well presented. How long does it take one clinician to annotate the B-lines of one patient? How many expert hours were spent on annotation?
    • Please add the number/proportion of frames used for training to Table 1. What is the performance of the Attention U-Net trained on a dataset with pure expert annotations?
    • Some solutions may be worth comparing in the future, for instance: 1) training a model on the initial expert-annotated dataset and using its inference results on the rest of the data either as the final prediction or as a cue for crowdsourcing; 2) semi-/weakly-supervised approaches; 3) using synthetic data.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although using crowdsourcing to obtain high-quality annotations of large-scale datasets is interesting, this paper lacks enough experiments/evidence to support the necessity or superiority of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed most of my concerns, and the presentation has been improved. Although the experiments remain borderline with respect to the MICCAI criteria, I understand that the rebuttal period does not allow new experiments to be conducted, so I will raise my score to weak accept.



Review #3

  • Please describe the contribution of the paper

    This paper demonstrates the utility of leveraging crowdsourcing to enhance the labels of a dataset. The authors train a highly effective network for counting B-lines, which are an important feature in lung ultrasound. It would be interesting to see this work in clinical translation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors use an effective modeling approach for the problem and utilize crowdsourcing as a way to supplement the labels for the dataset. Overall, the work combines knowledge of deep learning models with a creative approach to enhancing the data labels.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My only major criticism is in the description of the pre-processing. I don’t really understand what’s being described. I think this can be addressed easily with revision.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Will the dataset be made available? I think the process is straightforward to follow and could be reproducible with the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • A really compelling figure would be an image of the results: show an example image, the determined ground-truth label, and the model's output segmentation.
    • Typo: the word ‘consenus’ instead of ‘consensus’ in the first sentence of Section 2.4.
    • Polar coordinates are typically expressed as an angle (often ‘theta’) and a radius (typically ‘r’); the references to X and Y are confusing to me. Are the images projected to a rectangular grid?
    • I’d be interested in seeing particularly ‘difficult’ cases in this cohort. I’m impressed with the performance on your dataset.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I want to accept because I think this paper is impressive and deserves inclusion. However, I think it is critical that the preprocessing section be clarified for reproducibility purposes. I also strongly recommend showing the result images.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I think the authors responded well to the criticism.




Author Feedback

We thank the reviewers for their insightful comments and suggestions. We address critiques below:

  1. The paper only compares the results of five Attention U-Nets trained on datasets of different sizes; what is the main algorithmic contribution? (R4): We investigated four network architectures (U-Net, Attention U-Net, UNETR, and Swin UNETR) and selected Attention U-Net based on performance (Section 2.5) for further experiments. Our main contribution is achieving a statistically significant improvement in the performance of deep segmentation models by supplementing expert-labeled data with up to two orders of magnitude more crowd-labeled data, sourced using a novel gamified approach with extensive quality control.
  2. Pre-processing clarity needed (R5): We have completely rewritten the image pre-processing section (2.4) to clarify it. We agree that the previous description of angle and radius using the X and Y naming was confusing and have rectified it. The previous naming arose from the projection of the original polar scan coordinates onto a Cartesian coordinate system.
  3. Attention U-Net performance with pure expert annotations (R6): The results after training on the very small set (400) of expert annotations are 18% correct B-line count (vs. 55-71% from crowdsourcing), 24% within 1 count of error (vs. 84-94%), and 29% within 2 counts of error (vs. 90-99%).
  4. The experimental results shown in Table 1 appear to have no significant differences; how do the authors justify their conclusions? (R4): We inadvertently omitted noting the statistical significance of the conclusions and have now added it to the paper. As expected for a low data volume, training on 400 expert labels resulted in low accuracies (see response 3). To ensure fairness, we used the smallest crowd-augmented dataset, nearly a 13-fold increase (~5K frames = 13x400 expert-labeled frames), as the baseline for significance testing. Training with 25x expert-labeled data significantly improved accuracy compared to the 13x baseline across all three measures (p=0.001 for exact B-line count, p<0.0001 for within 1 B-line count, and p<0.0001 for within 2 B-line counts). Results for training with 32x and 40x expert-labeled data were also significant at the α=0.05 level with FWER control. (An illustrative sketch of such an FWER-controlled comparison appears after this list.)
  5. Show an example image, the determined ground-truth label, and the model’s output segmentation (R5): We have added these in what is now Figure 3, while staying within the page limit.
  6. …challenge for clinicians annotating the B-lines is not well presented. How long does it take one clinician to annotate the B-lines of one patient? How many expert hours were spent on annotation? (R6): To construct ground-truth annotations via expert consensus, 400 frames were annotated by 5 experts. Each expert took an average of 15 seconds per frame, for a total of 8.3 expert hours (400 frames x 5 experts x 15 s). By crowdsourcing annotations for our full dataset (31K frames), we saved 650 expert hours. Per patient, 1 expert hour would be required, assuming an 8-scan protocol with 6-second scans, a frame rate of 20 fps, and annotating every 4th frame, as done in this study.
  7. Add the number/proportion of frames used for training to Table 1 (R6): Done. We have also renamed the “models” to reflect their training dataset sizes. (This is now part of Fig. 4.)
  8. …there seems to be a certain bias (R4): The data is a convenience sample from a single institution. We are working to rectify the demographic bias (46% female and 66% white) as we grow our collection of data.
  9. In Section 1…add a table, in either the main text or the supplementary materials, summarizing the data size and performance of the related works presented (R6): We will include this summary table in the supplementary materials.
  10. Data availability (all): A subset of the data (113 consented patients) is available with a data usage agreement. Unfortunately, we do not yet have institutional approval for sharing the rest.
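Regarding the significance testing described in response 4, the snippet below sketches one way a family of accuracy comparisons against the 13x baseline could be checked with family-wise error rate (FWER) control. It is illustrative only: the frame counts and correct-prediction counts are hypothetical placeholders, a two-proportion z-test is used for simplicity (a paired test such as McNemar's would also be reasonable, since all models share the same test set), and Holm's method stands in for whichever FWER correction the authors applied.

```python
# Illustrative sketch with hypothetical numbers: comparing "within-1 B-line count"
# accuracy of models trained with 25x/32x/40x crowd-augmented data against the
# 13x baseline, with Holm correction controlling the family-wise error rate.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

n_test = 400                 # hypothetical number of test frames
baseline_correct = 336       # hypothetical: 84% within-1 accuracy for the 13x model

p_values = []
for label, correct in [("25x", 368), ("32x", 372), ("40x", 376)]:  # hypothetical counts
    _, p = proportions_ztest([correct, baseline_correct], [n_test, n_test])
    p_values.append(p)
    print(f"{label} vs 13x baseline: raw p = {p:.4f}")

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print("Holm-adjusted p-values:", np.round(p_adjusted, 4))
print("Significant at FWER alpha = 0.05:", reject)
```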




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


