Abstract

Eye gaze that reveals human observational patterns has increasingly been incorporated into solutions for vision tasks. Despite recent explorations of leveraging gaze to aid deep networks, few studies exploit gaze as an efficient annotation approach for medical image segmentation, which typically entails heavy annotation costs. In this paper, we propose to collect dense weak supervision for medical image segmentation with a gaze annotation scheme. To train with gaze, we propose a multi-level framework that trains multiple networks from discriminative human attention, simulated with a set of pseudo-masks derived by applying hierarchical thresholds on gaze heatmaps. Furthermore, to mitigate gaze noise, a cross-level consistency is exploited to regularize overfitting to noisy labels, steering models toward clean patterns learned by peer networks. The proposed method is validated on two public medical datasets for polyp and prostate segmentation. We contribute a high-quality gaze dataset, entitled GazeMedSeg, as an extension to the popular medical segmentation datasets. To the best of our knowledge, this is the first gaze dataset for medical image segmentation. Our experiments demonstrate that gaze annotation outperforms previous label-efficient annotation schemes in terms of both performance and annotation time. Our collected gaze data and code are available at: https://github.com/med-air/GazeMedSeg.
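To make the hierarchical-thresholding idea concrete, here is a minimal sketch of deriving multi-level pseudo-masks from a normalized gaze heatmap. This is an illustration, not the authors' implementation; the default thresholds (0.3, 0.5) are the values the rebuttal reports for Kvasir-SEG, and the normalization step is an assumption.

```python
import numpy as np

def multilevel_pseudo_masks(gaze_heatmap, thresholds=(0.3, 0.5)):
    """Binarize a gaze heatmap at several hierarchical levels.

    Lower thresholds keep more of the heatmap (dilated targets);
    higher thresholds keep less (eroded targets). Each resulting
    pseudo-mask supervises one network in the multi-level framework.
    """
    # Normalize to [0, 1] so thresholds are comparable across images.
    h = gaze_heatmap.astype(np.float32)
    h = (h - h.min()) / (h.max() - h.min() + 1e-8)
    return [(h >= t).astype(np.uint8) for t in thresholds]
```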

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1675_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1675_supp.pdf

Link to the Code Repository

https://github.com/med-air/GazeMedSeg

Link to the Dataset(s)

https://github.com/med-air/GazeMedSeg

BibTex

@InProceedings{Zho_Weaklysupervised_MICCAI2024,
        author = { Zhong, Yuan and Tang, Chenhui and Yang, Yumeng and Qi, Ruoxi and Zhou, Kang and Gong, Yuqi and Heng, Pheng-Ann and Hsiao, Janet H. and Dou, Qi},
        title = { { Weakly-supervised Medical Image Segmentation with Gaze Annotations } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

The authors use gaze estimation for weakly supervised image segmentation. Since using a single threshold to binarize the gaze heatmap does not perform well, they use multiple thresholds (levels). The overall loss function also considers cross-level consistency. On two datasets, improved segmentation performance is shown compared to previous label-efficient annotation schemes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The use of gaze estimation for computer vision in general, and for medical image analysis in particular, is a timely topic with good potential. There are still few studies on medical image segmentation.

    The authors try to overcome the difficulty of using a single threshold to binarize the gaze heatmap. They use an ensemble approach with multiple thresholds, but also take cross-level consistency into consideration. The proposed method is general and can be easily combined with typical processing pipelines. The proposed method outperforms previous label-efficient annotation schemes.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The need for extra hardware, which may not even be cheap, limits the use of gaze-estimation-based tools in general.

It can be suspected that the annotators may have some uncertainties during annotation. One may be uncertain, for instance, whether the boundaries have really been sufficiently “seen”. Does this mean that the segmented objects are potentially “marked” larger than they are, just so the annotators can be sure? This also raises the issue of inter-observer variability, which is per se a problem in some medical image analysis tasks and may be magnified by using gaze estimation. This inter-observer variability, perhaps even among annotators of the same expertise level, may further influence the performance, as can be seen in Figure 5(c).

Only a standard U-Net was trained for testing. Additional, more sophisticated architectures would strengthen the performance evaluation.

The achieved segmentation performance is clearly below that obtained using the full labeling ground truth. This is generally not addressed in the paper.

The constant c used to stabilize the training is not further detailed. How stable is the overall process for different values of c? How critical is it to find a suitable value?

It is unclear how the thresholds are chosen. This may be a critical issue that should be detailed.

The proposed approach does not really exploit the hierarchical nature of the thresholded gaze heatmaps technically. For instance, given the sequential nature, is it really necessary and meaningful to consider all pairs (i, j) in Eq. (3)?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

The writing could be improved. Eq. (1) is in fact one possible realization of a simple, general weighted-mean scheme; it would be better to present it a bit more conceptually. While the paper is overall easily readable, the last part of Section 3.2, related to Eq. (2), requires more explanation: it is too compact, and several things are left undefined there.

It is a bit surprising that labeling methods like BoxTeacher take more time than the gaze approach.

Figure 5 caption: “Effects of the hyper-parameter m and λ (we use m = 2 and λ = 3 by default)”. Better: “hyper-parameters m (λ = 3 by default) and λ (m = 2 by default)” for clarity.

Many references are given as arXiv preprints; most of these papers were later published at conferences or in journals.

Figure 3: “annotation” is misspelled.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Gaze estimation is a timely topic to consider. The work intends to overcome the inherent difficulty of thresholding the gaze heatmap. A main drawback is failing to achieve the segmentation performance using the full labeling ground truth. There are several unclarified important details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

The authors addressed all my comments. In particular, they emphasized that their approach is one of weakly-supervised learning. One could expect gaze annotation to enable much richer supervision than points or boxes alone; from this perspective, the work is somewhat disappointing. The performance evaluation is also limited to a single standard U-Net. Still, since gaze estimation for medical image analysis is a timely topic and there is little such work, this paper presents interesting work and can be accepted.



Review #2

  • Please describe the contribution of the paper

The paper proposes to efficiently collect gaze annotations to train deep networks with a novel gaze annotation scheme for medical image segmentation. The proposed method can be seamlessly integrated into a standard training pipeline. Experimental results show that gaze annotation achieves a good trade-off between segmentation results and annotation time compared with other annotation forms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

A novel gaze annotation scheme is proposed to collect dense annotations for the segmentation task in an annotator-friendly and efficient manner. A multi-level approach is proposed for the training process, which trains multiple deep networks to integrate information from different levels of human discriminative attention.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The paper trains m deep networks simultaneously, supervised by pseudo-masks generated from m different thresholds, simulating multi-level human attention. But this is essentially a way to mitigate noise; what I want to know is whether there is a way to adaptively select the optimal threshold. Another concern is that the dataset is annotated by too few experts, which inevitably leads to personal bias.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

The motivation is clear, and the approach is sensible, straightforward, and highly applicable. The annotated data also contribute to the development of the discipline. However, the dataset is annotated by too few experts, which inevitably leads to personal bias.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The motivation is clear, and the approach is sensible, straightforward, and highly applicable. The annotated data also contribute to the development of the discipline. However, the dataset is annotated by too few experts, which inevitably leads to personal bias.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

The authors’ answer resolves some of my questions well, but I still have some concerns about the way the data are constructed and the innovation of the corresponding network model, so I keep my score the same.



Review #3

  • Please describe the contribution of the paper

The authors proposed a weakly-supervised learning method, based on gaze annotation, for binary segmentation tasks. The proposed method jointly trains an ensemble of models, each of which uses a different pseudo-mask generated from the same heatmaps (gaze annotations) with varying thresholding levels. Additionally, a consistency loss, which measures the similarity of features from different models, is employed to compel the models to learn consistent features and reduce the noise present in the labels. The proposed method was evaluated on two datasets and demonstrated performance close to that of fully-labeled methods, but with significantly less labeling effort.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper writing is clear.
    2. The effectiveness of the method was sufficiently evaluated.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Novelty is trivial. Some method details are missing; for example, how inference is performed with the trained models is not clear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Some details about the method are missing, as specified in the detailed comments below:

    1. How are the heatmaps for gaze attention generated?
    2. How does the method perform inference?
    3. How are the thresholds determined?
    4. Fig. 5(b) is difficult to interpret. Table 1: it is unclear why point labeling is more time-consuming than gaze labeling, considering that gaze annotation requires annotating multiple points.
    5. The authors claim m=2 is sufficient, but it is not clear what the threshold levels are for the corresponding pseudo-mask generation.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, motivation, quality of paper writing, experiments

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We are grateful for the reviewers’ appreciation: our topic is “timely new” (R1); our method is “novel”, “sensible and highly applicable” (R4), and “sufficiently evaluated” (R5); and our gaze data “contribute to the discipline development” (R4). We thank all reviewers for their suggestions on writing and details, which will be revised accordingly.

——— Common questions:

— R1 & R4 & R5: clarifying threshold selection. We consider two scenarios: (1) Without any full labels, two thresholds are chosen based on radiologists’ feedback to generate pseudo-masks closest to GT, with one tending to erode (under-activate) targets and the other to dilate them. The pair simulates multi-level attention, and the two levels compensate for each other (Sec. 3.2 and 3.3). (2) Given a labeled subset, we adaptively select such a pair based on the Dice and object size of the generated masks compared to GT. On Kvasir-SEG, thresholds of 0.3/0.5 (Fig. 1) and 0.35/0.47 are chosen by the two methods, respectively. We use the first method due to the negligible difference. For m > 2, additional thresholds are interpolated.
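A hedged sketch of how the adaptive selection in scenario (2) could be implemented: each candidate threshold is scored by the mean Dice and the relative object size of its pseudo-masks on the labeled subset, and the best eroding/dilating pair is kept. The function names and the candidate grid are hypothetical; the paper’s exact procedure may differ.

```python
import numpy as np

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def select_threshold_pair(heatmaps, gt_masks, grid=np.linspace(0.2, 0.6, 9)):
    """Pick one under-activating and one dilating threshold by Dice vs. GT."""
    stats = []
    for t in grid:
        masks = [h >= t for h in heatmaps]
        mean_dice = np.mean([dice(m, g) for m, g in zip(masks, gt_masks)])
        # Relative object size: > 1 means the pseudo-mask dilates the target.
        rel_size = np.mean([m.sum() / (g.sum() + 1e-8)
                            for m, g in zip(masks, gt_masks)])
        stats.append((t, mean_dice, rel_size))
    eroding = max((s for s in stats if s[2] < 1.0),
                  key=lambda s: s[1], default=None)
    dilating = max((s for s in stats if s[2] >= 1.0),
                   key=lambda s: s[1], default=None)
    return (eroding and eroding[0], dilating and dilating[0])
```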

— R1 & R4 noted personal bias, which we address with a regularized annotation pipeline (Sec. 2) and a multi-level method that models subjectivity (Sec. 3.1). In our study with 3 annotators (Fig. 5c), all consistently outperform SOTA, showing that personal bias hardly affects our conclusions. Such bias is an essential factor in human-AI interaction, and we are among the few to study and release gaze data from multiple observers. We will investigate it further in future work.

——— R1:

— “A main drawback is failing to achieve the performance of full labels”. This may be a misinterpretation of our setting and key results, which we will revise for clarity. We focus on weakly-supervised segmentation, widely acknowledged as label-efficient but with performance upper-bounded by full labeling [4,6,20,25]. This performance/efficiency trade-off is critical, so evaluations should cover both aspects, not just performance. Our results highlight that: (1) compared to SOTA, our method excels in both aspects (Table 1), narrowing the weak-full performance gap while being 7 times more efficient; (2) under the same annotation time, our method significantly outperforms SOTA and full labeling (Fig. 3, where full labeling is excluded due to its poor performance within the limited annotation time).

— Uncertainties in gaze. In Sec. 1, we analyze various uncertainties and summarize two key properties of gaze, “discriminative” and “noisy”, which indeed motivate our method. The case noted by R1 belongs to the “discriminative” property, where some parts may be marked larger or smaller than they are; the solution is discussed in Sec. 3.1. Our method effectively handles these uncertainties, as shown in our results.

— R1 suggested a hierarchical method. Since we use m = 2, the suggested method and ours coincide. We retain our current version since it performed slightly better in our earlier experiments with m > 2 by using all other networks for compensation.

— Clarifying the constant c. The constant c is a common trick in correspondence-based losses. We follow SOTA [7, arXiv:2011.10043] and set c to 0 to remove negative correlations.
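Since Eqs. (2)-(3) are not reproduced on this page, the following is only a loose sketch of what a correspondence loss with negative correlations clamped at c = 0 could look like; the feature shapes, the cosine-similarity correspondence, and the max-over-matches reduction are all assumptions, not the paper’s definition.

```python
import torch
import torch.nn.functional as F

def cross_level_consistency(feat_a, feat_b, c=0.0):
    """Consistency between feature maps of two peer (level) networks.

    feat_a, feat_b: (B, C, H, W). Pairwise cosine correlations below c
    are clamped away, so only positive correspondences drive the loss.
    """
    a = F.normalize(feat_a.flatten(2), dim=1)   # (B, C, HW)
    b = F.normalize(feat_b.flatten(2), dim=1)
    corr = torch.einsum('bcm,bcn->bmn', a, b)   # pairwise cosine similarities
    corr = corr.clamp(min=c)                    # drop negative correlations
    # Encourage each location in A to have a strong match somewhere in B.
    return 1.0 - corr.amax(dim=-1).mean()
```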

— The concern about extra hardware is discussed in Sec. 5, and the suggestion of more backbones will be helpful for future work.

——— R5:

— Method details. As described in Sec. 3.1, we apply a Gaussian kernel to the gaze points to generate heatmaps, and we ensemble the outputs of all models at inference. We will highlight these points for clarity.
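A minimal sketch of these two clarified details: Gaussian smoothing of gaze points into a heatmap, and probability-averaging across the trained level networks at inference. The kernel width sigma and the softmax-averaging ensemble rule are assumptions; the rebuttal only states that a Gaussian kernel and an output ensemble are used.

```python
import numpy as np
import torch
from scipy.ndimage import gaussian_filter

def gaze_to_heatmap(points, shape, sigma=25.0):
    """Rasterize gaze fixations and smooth them with a Gaussian kernel."""
    canvas = np.zeros(shape, dtype=np.float32)
    for x, y in points:                    # (x, y) in pixel coordinates
        canvas[int(y), int(x)] += 1.0
    heat = gaussian_filter(canvas, sigma=sigma)
    return heat / (heat.max() + 1e-8)      # normalize to [0, 1]

@torch.no_grad()
def ensemble_predict(models, image):
    """Average the softmax outputs of all level-specific networks."""
    probs = torch.stack([m(image).softmax(dim=1) for m in models]).mean(0)
    return probs.argmax(dim=1)             # final segmentation mask
```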

— Novelty. We propose a multi-level method to simulate discriminative human attention and a correspondence-based cross-level consistency for compensation. The two modules are mutually reinforcing in balance (Sec. 3.3) and are motivated by the key properties of gaze-supervised segmentation (Sec. 1), which have not been studied before.

— Efficiency. Dense gaze points are tracked at a high frequency (1 kHz), which is acknowledged [9,23] to be more efficient than points or boxes, saving significant manual operation time.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes using gaze estimation as weak supervision for image segmentation. This is an interesting avenue for collecting large amounts of data with little burden on the clinicians, and the reviewers are all positive, with increased rankings after the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


