Abstract

Shortcut learning is a phenomenon where machine learning models prioritize simple, potentially misleading cues in the data that do not generalize beyond the training set. While existing research primarily investigates this phenomenon in image classification, this study extends the exploration of shortcut learning to medical image segmentation. We demonstrate that clinical annotations such as calipers, as well as the combination of zero-padded convolutions and center-cropped training sets, can inadvertently serve as shortcuts, impacting segmentation accuracy. We identify and evaluate shortcut learning on two different but common medical image segmentation tasks. In addition, we suggest strategies to mitigate the influence of shortcut learning and improve the generalizability of the segmentation models. By uncovering the presence and implications of shortcuts in medical image segmentation, we provide insights and methodologies for evaluating and overcoming this pervasive challenge, and we call the community's attention to shortcuts in segmentation. Our code is publicly available at https://github.com/nina-weng/shortcut_skinseg .
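
To make the padding mechanism concrete, here is a minimal, illustrative sketch (ours, assuming a standard PyTorch setup; not code from the paper): even on a constant input, a zero-padded convolution produces border-dependent activations, so a network trained only on center-cropped targets can read off absolute pixel position instead of the anatomy.

    import torch
    import torch.nn as nn

    # Illustrative example (assumed PyTorch setup, not the authors' code):
    # a single zero-padded convolution applied to a constant image already
    # yields border-dependent activations, i.e. absolute position leaks
    # into the feature maps.
    x = torch.ones(1, 1, 8, 8)                     # constant image: no content cues
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
    nn.init.constant_(conv.weight, 1.0)            # uniform 3x3 kernel for readability
    with torch.no_grad():
        y = conv(x)
    print(y[0, 0])
    # Interior outputs are 9, edge outputs 6, corner outputs 4: the zero
    # padding tells the model where it is in the image, which can be
    # exploited as a "segment the center" shortcut when all training
    # masks are centered.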

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0423_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0423_supp.zip

Link to the Code Repository

https://github.com/nina-weng/shortcut_skinseg

Link to the Dataset(s)

https://challenge.isic-archive.com/data/#2017

BibTex

@InProceedings{Lin_Shortcut_MICCAI2024,
        author = { Lin, Manxi and Weng, Nina and Mikolaj, Kamil and Bashir, Zahra and Svendsen, Morten B. S. and Tolsgaard, Martin G. and Christensen, Anders Nymark and Feragen, Aasa},
        title = { { Shortcut Learning in Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces the reader to two different examples of “Shortcut Learning” in the context of medical image segmentation. They define Shortcut Learning as an ML model latching on to easy-to-learn features of the training data that may not exist at test time, leading to non-robust models. The examples they use are 1) learning the pixels of a caliper icon instead of the underlying anatomy, and 2) the use of zero-padded convolutions on center-cropped training sets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper was very well written and easy to follow. While many papers have demonstrated the use of shortcut learning in classification tasks, the authors here demonstrate its effect on segmentation, which appears to be novel. For both examples (caliper placement and zero-padded convolutions+center-cropping), the authors effectively demonstrate the effect of the shortcut learning taking place, they clearly explain the mechanism by which they occur, and they suggest steps to mitigate their effects. This paper would be useful in educating someone about the dangers of Shortcut Learning and the serious problem it presents to the robustness of models trained under certain conditions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is that the problems the authors are trying to warn us of are fairly self-evident. Furthermore, the mitigation strategies they suggest are equally obvious. Taking the second Shortcut as an example: the authors seek to point out that center-cropping your training images (when using zero-padded convolutions) presents a shortcut. The mitigation strategy they propose is to use random cropping instead of center-cropping. From my experience, the use of RandomResizedCrop is very prevalent in ML papers; I have never read a paper that used center-cropping. In other words, what they present as a problem in ML has largely already been solved. Their other example details the problem of training a model on data with burnt-in calipers. To me, any burnt-in annotations present in my training data would raise alarm bells; however, I acknowledge that this might not be the case for everyone, and therefore this paper might still be useful in educating someone who was not aware of ML models’ ability to learn such shortcuts.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    They mention some of the hyperparameters and settings used to train their models, but it is not a complete list. I would have a hard time implementing their exact model with only the information provided in the paper and supplementary materials. The datasets were well documented.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This is tricky. I believe the authors did a good job showing what the paper set out to prove; I just doubt the overall importance/usefulness of the paper’s goal to begin with. While I believe Shortcut Learning is an issue that people working in the ML space should be aware of, this paper does not progress the field in a meaningful way. My suggestion for the authors, if they wanted to continue their work in this field, would be to perform a meta-analysis of Shortcut Learning in the current ML landscape. I think it would be more impactful to focus on actual examples of Shortcut Learning occurring in real-world ML solutions, rather than artificially creating your own examples for demonstration purposes. Such real-world examples would require more sophisticated and interesting mitigation strategies, whereas the mitigations to the artificial problems presented in this paper are so simple that they do not really add much to the field beyond a basic warning to the reader.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe the 2 examples of Shortcut Learning presented in this paper (as well as their mitigations) to be self-evident and already well known in the modern ML community. I don’t think this paper adds enough new information to the field of ML to warrant acceptance to MICCAI.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Upon reading the author’s rebuttal, I have updated my recommendation slightly to “Weak Reject”. I still think the content of the paper is fairly well-known, however as the authors point out, it may be more widespread than I previously imagined.

    Personally, I still find the issue of shortcut learning (and data leakage in general) to be fairly self-evident, however I can only speak for myself and not the entirety of the MICCAI community. If the meta-reviewers feel that the community needs to be made more aware of this issue, then I do recommend accepting this paper for publication, as it is well written and very clearly explained.



Review #2

  • Please describe the contribution of the paper

    The paper points out that in classification and segmentation tasks in medical imaging, the training dataset may be carelessly biased. For example, tubing or calipers left in during image acquisition might be learned by the model and guide segmentation or classification. Similarly, the practice of centering and cropping regions of interest in the training set biases models to use these cues during segmentation. The authors propose to mitigate these biases with random cropping as augmentation during training.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is perfectly clear and makes a simple yet important point. The proposed remediation is also simple and effective.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Such sources of bias are well known but nevertheless keep getting ignored by the community.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I believe code availability would be a strong positive for the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The article is crystal clear and makes an important point, even though its scientific contribution is technically weak. The contribution is akin to a best-practice recommendation to always look for shortcut cues when building training datasets, and to ensure random-crop augmentations are used during training to reduce shortcut usage.

    It would have been helpful to add some investigation into what the trained models truly learn when using the shortcut cues, e.g., by using Grad-CAM or similar.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I’m not sure this sort of investigation and recommendation is novel. Many well-known articles have pointed out that the wrong cues are sometimes learned, e.g., the well-known image of a dog mistaken for a wolf because snow is present in the background, or the tank urban legend:

    https://gwern.net/tank

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper investigates the issue of shortcut learning for segmentation, which refers to the phenomenon where models exploit correlations in data, potentially leading to biased or unfair outcomes. The research specifically addresses this problem in two contexts: calipers and texts in fetal ultrasound images, and center crop images for skin segmentation. The authors also evaluate some mitigation strategies, demonstrating their effectiveness and robustness across both tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper in general is well written and organized.
    • Although most previous works focus on classification tasks, the paper investigates the issue for image segmentation, showing that the problem indeed exists and making it relevant for the MICCAI community.
    • The problem is explored in two different scenarios, and the mitigation strategies are simple yet significantly enhance the overall results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper does not exhibit major weaknesses; however, I seek clarification on the following points:

    • In the case of skin lesion segmentation, the images are captured with a dermatoscope, positioning the lesion at the center of the image. Could the authors elaborate on potential scenarios where we could encounter non-centered lesions, thus making their mitigation strategy useful?
    • Considering the ISIC dataset, prior research has examined biases beyond centered lesions, including rulers, markings, and skin hair [1]. Have the authors addressed any of these biases in their analysis?
    • Could the authors provide insights into the potential implications for Vision Transformers (ViTs)? Given that images are split into individual patches, might the center-cropping phenomenon be mitigated?
    • Furthermore, could the authors discuss potential biases introduced by such a dilated crop? For instance, the model may exhibit a preference for darker areas in the image, potentially misinterpreting them as lesions (as suggested by the fourth column in Figure 3). It would be intriguing to observe the model’s performance in scenarios without lesions and with diverse pigmented skin tones.

    [1] (De)Constructing Bias on Skin Lesion Datasets. Bissoto A. et al.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I thoroughly enjoyed reading your paper, which presents a significant contribution to the field of image segmentation. I would like to offer some constructive suggestions for further enhancing the discussion. First, while the paper does not present any major weaknesses, it would greatly benefit from a deeper exploration of the biases present in the ISIC dataset. I encourage you to consider potential biases that may arise when applying your approach to images featuring richly pigmented skin tones or only healthy skin (and to show them in Figure 3). Finally, it would be interesting to see whether ViTs are also affected by those shortcuts.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents findings that shortcut learning affects not only classification tasks but also segmentation tasks. The authors explore the problem in two scenarios and propose mitigation strategies that effectively enhance the outcomes. This insight is poised to contribute significantly to future advancements in medical segmentation techniques.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for their responses. Considering the work’s potential benefits for the MICCAI community, I will maintain my previous score.




Author Feedback

Thank you for your valuable feedback!

Some reviewers are concerned that the shortcuts discussed in this paper are self-evident. As we illustrate next, they are not.

== Our findings are surprising [R3-4] == We show that, besides the well-known effect of shortcuts on classification, shortcuts also affect segmentation. This is novel and surprising, which is actually highlighted by the “tank urban legend” brought forward by R4: this blog claims that models affected by shortcuts would fail in segmentation (point 6 under “Could it happen”). As pointed out there, you would indeed expect false positive segmentations due to shortcuts to fail drastically at outlining something meaningful. However, false negatives (failure to segment) can most definitely happen as a result of shortcuts – as shown by the examples brought forward in our paper.

This point is missed by the blog, and we have made the same mistake in the past, which makes this important. We thank R4; the blog helped fine-tune our understanding and we will emphasize this in the final version.

== Our findings are important for medical imaging [R3] == We first note that ‘center-cropping’ in our paper refers not to augmentation, but to dataset construction, where the region of interest is centered. We apologize for this confusion. This has been dubbed the ‘common photographer bias’ in computer vision (Kirillov et al., CVPR’23), but is also prevalent in MICCAI challenges (e.g., SegRap, PENGWING, BONBID-HIE, FH-PS-AOP, Acoustic-AI) and popular medical datasets, see our Fig. 6 as well as Fig. 1 of Ma et al. (Nat. Commun’23) and Fig. 5 of Isensee et al. (Nat. Methods’21).

Our illustrations are not artificial: similar effects are evident in published models, see e.g. Fig. 2 in (Wang et al., MICCAI’23) and Fig. 7 in (Dai et al., MedIA’22), which show models like U-Net and CA-Net failing to recognize border-adjacent lesions. Our research explains these failures, offers a mitigation strategy, and suggests that many model performances reported in MICCAI publications and challenges may not reflect performance in the wild, where cropping can be less consistent. This should be a community concern.

Our suggested mitigations are not common in prior works: less than 15% of skin lesion segmentation papers since 2014 used random cropping augmentation (Mirikharaji et al., MedIA’23), possibly because random cropping risks discarding key objects (Cho et al., MICCAI’23), making it less favorable for accuracy on center-cropped validation sets. Removing calipers in fetal ultrasound segmentation is also not standard in MICCAI papers, e.g. (Sophia et al., MICCAI’21; Pu et al., JBHI’22; Zhou et al., MICCAI’23).
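
To make the suggested mitigation concrete, here is a minimal sketch of random-crop augmentation for segmentation (assuming a PyTorch/torchvision pipeline; the helper name is ours and hypothetical, not from the paper): the same randomly sampled window must be applied to both image and mask, so the region of interest is no longer reliably centered during training.

    import torch
    from torchvision import transforms
    import torchvision.transforms.functional as TF

    def joint_random_crop(image: torch.Tensor, mask: torch.Tensor, size=(192, 192)):
        """Hypothetical augmentation helper: sample one crop window and apply
        it to both image and mask, so the lesion is no longer guaranteed to
        sit at the image center during training."""
        i, j, h, w = transforms.RandomCrop.get_params(image, output_size=size)
        return TF.crop(image, i, j, h, w), TF.crop(mask, i, j, h, w)

    # Usage sketch inside a Dataset.__getitem__:
    # image, mask = joint_random_crop(image, mask)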

== Details ==

  1. Indeed, our mitigation is simple yet effective (R3), based on understanding shortcut mechanisms, which is our main contribution.
  2. The requested GradCAM saliency map (R4) is found in the supplements and shows the model utilizing calipers/texts in segmentation.
  3. Telemedicine cell phone applications are an example that might suffer from the center-cropping shortcut (R5), as image acquisition is less controlled. In addition, some dermatology datasets do not have centered lesions (e.g., Groh et al., CVPR’21 and Roxana et al., Sci. Adv‘22), where our experiments offer practical insights (R3).
  4. We will publish our code and metrics as public tools for evaluating segmentation robustness (R3, 4).
  5. We fully acknowledge additional biases in dermatological datasets (R5) and discuss them under related work. We do not expect these biases to affect our main conclusion on segmentation shortcuts, as centralization happens regardless of rulers or hair.
  6. As ViTs apply no padding and process image patches, this could mitigate the phenomenon (R5), since the absolute spatial location of a pixel is no longer implicitly included. However, the positional encoding in ViTs might still encode the center bias and lead to a similar shortcut, leaving the shortcut effect in ViTs an interesting open problem; see the sketch below.
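
A speculative sketch of point 6 (ours, not from the paper): even without zero padding, a ViT-style patch embedding typically adds a learned absolute positional embedding to every patch token, so positional information remains available and could in principle re-enable a center-bias shortcut.

    import torch
    import torch.nn as nn

    # Speculative illustration (ours, not from the paper): a ViT-style patch
    # embedding adds a learned positional embedding, so absolute patch
    # position is still injected even though no zero padding is used.
    num_patches, dim = 64, 32                           # e.g. an 8x8 grid of patches
    patch_tokens = torch.randn(1, num_patches, dim)     # content features per patch
    pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # learned absolute positions
    tokens = patch_tokens + pos_embed                   # position injected additively
    print(tokens.shape)                                 # (1, 64, 32): every token carries its location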




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, the reviewer turned the reject into a weak reject. The overall evaluation of this paper is positive. Hence, I would recommend ‘Accept’.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After the rebuttal, the reviewer turned the reject into a weak reject. The overall evaluation of this paper is positive. Hence, I would recommend ‘Accept’.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The work is interesting and will benefit the medical imaging community. One reviewer has raised their score and the overall rating is positive.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The work is interesting and will benefit the medical imaging community. One reviewer has raised their score and the overall rating is positive.


