Abstract

Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focus on those background features instead of on the image content. We propose a simple, yet effective random background augmentation method called BackMix, which samples random backgrounds from other examples in the training set. By enforcing the background to be uncorrelated with the outcome, the model learns to focus on the data within the ultrasound sector and becomes invariant to the regions outside this. We extend our method in a semi-supervised setting, finding that the positive effects of BackMix are maintained with as few as 5% of segmentation labels. A loss weighting mechanism, wBackMix, is also proposed to increase the contribution of the augmented examples. We validate our method on both in-distribution and out-of-distribution datasets, demonstrating significant improvements in classification accuracy, region focus and generalisability. Our source code is available at: https://github.com/kitbransby/BackMix
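The background swap described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation (that is in the linked repository). The function name `backmix`, the per-image binary sector masks, the common image resolution, and the choice to zero out donor sector pixels that fall outside the recipient's sector are all assumptions made for this sketch.

```python
import numpy as np

def backmix(images, sector_masks, rng):
    """Hypothetical sketch: give each image the background of another batch member.

    images:       float array of shape (N, H, W), all resized to one resolution
    sector_masks: binary array of shape (N, H, W), 1 inside the ultrasound sector
    rng:          numpy.random.Generator used to pick background donors
    """
    n = images.shape[0]
    donors = rng.permutation(n)  # donor index for each image (may pick itself)
    fg = sector_masks.astype(images.dtype)

    # Assumption: keep only the donor's background, so that donor sector
    # pixels never leak into the recipient's background region.
    donor_background = (1.0 - fg[donors]) * images[donors]

    # Own sector content inside the mask, donor background outside it.
    mixed = fg * images + (1.0 - fg) * donor_background
    return mixed, donors
```

Class labels stay with the foreground image, so the background carries no information about the target after mixing.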



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2248_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/kitbransby/BackMix

Link to the Dataset(s)

WASE-Normals, Medstar, StG, MAHI and UoC datasets are proprietary. CAMUS dataset: https://www.creatis.insa-lyon.fr/Challenge/camus/databases.html

BibTex

@InProceedings{Bra_BackMix_MICCAI2024,
        author = { Bransby, Kit M. and Beqiri, Arian and Cho Kim, Woo-Jin and Oliveira, Jorge and Chartsias, Agisilaos and Gomez, Alberto},
        title = { { BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a background-based augmentation method to reduce the shortcut learning issue.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The shortcut learning issue is very important in medical image analysis, and the authors provide a simple and effective solution.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Are the classification accuracy and %E / %F trade-off metrics? Why does the proposed method usually make the classification accuracy drop while increasing the %E / %F?

    2. The original CutMix method is applied to the whole image. How about applying CutMix to the background region only?

    3. How do you apply BackMix if the candidate samples have foregrounds and backgrounds of different shapes?

    4. The experiments are conducted on ResNet-18. What are the results on other backbones?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Are the classification accuracy and %E / %F trade-off metrics? Why does the proposed method usually make the classification accuracy drop while increasing the %E / %F?

    2. The original CutMix method is applied to the whole image. How about applying CutMix to the background region only?

    3. How do you apply BackMix if the candidate samples have foregrounds and backgrounds of different shapes?

    4. The experiments are conducted on ResNet-18. What are the results on other backbones?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty and the experimental settings.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces the BackMix background augmentation method for echo frames, which randomly samples backgrounds from other training samples to prevent the model from learning spurious correlations. The authors trained ResNet18 models using BackMix on one public dataset and externally validated the model on another public dataset. The ResNet18 with BackMix outperformed the other relevant methods in terms of classification accuracy and region focus on the external testing set. The authors further extended their methods to a semi-supervised setting and proposed a loss weighting mechanism, wBackMix. Even with only 5% of segmentation labels, the positive effects of wBackMix persisted.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents a novel and easy-to-implement augmentation method for echo frames. Specifically, it is domain specific, aiming to exclude metadata and background pixels outside of the sector from influencing outcome predictions. Notably, the authors achieved substantial improvements in external classification accuracy and region focus using this method on public datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Firstly, the experiments utilized only a single CNN architecture. This raises concerns about whether the proposed augmentation method could improve model generalizability for different model architectures (e.g., vision transformers). In comparison, other studies (such as references [11] and [19]) have demonstrated their augmentation methods across various model architectures. Secondly, how to determine the appropriate lambda value for different f values in wBackMix remains unclear. When the validation set originates from the same source as the training set, hyperparameter tuning on the internal validation set tends to favor lambda = 0. Although the authors provided specific cases for f=0.05 and lambda = 1 or 2 in Table 2, a general guideline is lacking.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    One potential extension of this work is to combine BackMix with foundation models such as SAM. SAM could be used to generate a segmentation mask for the sector, which could later be used for BackMix.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel and easy-to-implement augmentation method for echo frames. The method can be easily incorporated in the training step and has been demonstrated to improve model generalizability. However, the experiments utilized only a single CNN architecture, raising concerns about whether it would work for other model architectures. Secondly, it is not clear how to pick the hyperparameter for the wBackMix method in the semi-supervised setting.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a simple and effective data augmentation method, BackMix, to mitigate the shortcut learning introduced by background information appearing in echocardiography images during view classification. BackMix creates new images by superimposing the foreground (echo view) of one training image onto a background randomly drawn from another. They also introduce a weighted loss function to focus more on augmented images. Their experiments demonstrate that the proposed method can improve classification results and push the network to focus more on the foreground (echo view).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Simple yet efficient augmentation technique: with the help of background and foreground masks, it is quite simple to generate new images with mixed backgrounds. In addition, the experiments demonstrate superior performance compared with other augmentation strategies.
    2. Comprehensive experiments with different settings: the authors also performed experiments with various percentages of training data used for augmentation, as well as different choices of the parameter lambda. They also evaluated the model on out-of-distribution data to demonstrate generalisability.
    3. No preprocessing at inference time: the BackMix technique demands training-time preprocessing (echo sector segmentation), while at inference time there is no extra preprocessing to conduct.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Demand for extra preprocessing of training data: the BackMix technique requires ultrasound sector segmentation, which needs a human check. Even with 5% of training data used for BackMix augmentation, the number of samples for mask generation and manual check is about 728, which is quite large.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors are going to release the codes. The training dataset is public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Unclear illustrations: the echo examples (Fig.2, 4, 5) superimposed with attention maps make it difficult to see the original ultrasound images. It would be better that the authors add one row of original echo images to show the distinct appearance of different views.
    2. Some questions: in Fig.4, the label of second column should be A4C? Or A2C? In Fig.5, only examples of PSAX, A2C and PLAX views are present. Are there examples of A4C?
    3. What’s the distribution of different views in in-distribution data and out-of-distribution data?
    4. It would be interesting to see the improved performance for each view separately. It can help readers to understand which view has benefited most from mixed background. It may be inspiring for the improvement of BackMix in the future, too.
    5. Ambiguity in the equation for %F in Section 2.4: do the authors want to calculate the sum of Zp over highly activated pixels, or do they simply want to count the number of such pixels? The mathematical notation here is not clear.
    6. Are the standard augmentations mentioned in 2.1 also applied to Black, Noise, Shuffle etc. methods?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written and easy to understand. The proposed BackMix method efficiently improved the in- and out-of-distribution performance and reduced the correlation between background information and class category.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I would like to thank the authors for their responses to my questions under the comments section. However, it seems that they didn’t pay attention to the character limit so that their last response to my comment on weakness was cut-off. I will keep my score unchanged.




Author Feedback

We thank the reviewers for their thoughtful comments. We are pleased that the reviewers found our work ‘novel’ (R1), ‘effective’ (R3, R5), ‘well-written’ (R5), and praised the ‘substantial improvements’ (R1) and ‘comprehensive experiments’ (R5) presented in our study. We are particularly pleased that the reviewers highlighted the improved ‘generalizability’ (R1) and reduced preprocessing at inference (R5), as we’d argue these are the main contributions of our paper which have clinical impact. We address their feedback below.

R1 & R3 noted that a single CNN architecture is used for experiments. The purpose of our work is to address a data-specific problem in echo, which is architecture independent. This differs from [7, 11] which propose generalized computer vision augmentations. Focusing on a single architecture allowed for in-depth validation using both i.d and o.o.d datasets and in a semi-supervised setting, which is of more relevance to our problem. Furthermore, because the data used in this work is typically paired with ResNet [a, b], this backbone is more relevant. We will add these arguments to Section 3.3.
[a] Huang et al “TMED 2” ICML 2022. [b] Wessler et al “Automated Detection of Aortic Stenosis” JASE 2023

R3 questions the accuracy-focus trade-off. For o.o.d data, there is no trade-off as improved focus on the sector results in improved accuracy. Decreased classification performance on i.d data is expected as the spurious shortcuts aiding performance are not learnt due to improved sector attention (see Section 3.3). The discovery of background shortcuts which artificially boost i.d performance is a contribution of our work, rather than a limitation.

R3 suggests applying CutMix to the background only. We considered this idea but decided against it for two reasons: (1) the background is mostly black pixels, so a background-only CutMix algorithm would mostly mix empty crops, which is inefficient; (2) mixing full backgrounds maintains a realistic image, and we found augmentations that cause unrealistic disruptions to the image reduced performance (Section 3.3).

R1 comments that determining the best λ value in wBackMix remains unclear, and the validation set favors λ=0. There may be a small misinterpretation of Table 2 by R1, as the optimum choice of λ for f=0.05 is 1 or 2, suggesting that weighting the loss in favor of the augmented examples has a positive impact on results, but the choice of λ does not have a big impact. A large λ may restrict learning from unaugmented examples, while a small λ does not place any emphasis on learning from the augmented ones. We recommend a grid search to identify the best configuration. We’ll add this to Section 3.2.
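The role of λ described above can be illustrated with a small sketch. The exact wBackMix formulation is given in the paper and is not reproduced on this page, so the weighting used here (weight 1 for unaugmented examples, 1 + λ for augmented ones, normalised by the total weight) is an assumption chosen only to match the stated behaviour: λ = 0 places no extra emphasis on augmented examples, and a large λ lets them dominate. The function name `weighted_cross_entropy` and its arguments are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, is_augmented, lam):
    """Hypothetical loss weighting in the spirit of wBackMix (not the paper's formula).

    logits:       float array of shape (N, C)
    labels:       int array of shape (N,)
    is_augmented: bool array of shape (N,), True for BackMix-augmented examples
    lam:          non-negative weight added to augmented examples
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]

    # Assumed scheme: unaugmented examples weigh 1, augmented ones 1 + lam.
    weights = 1.0 + lam * is_augmented.astype(float)
    return float((weights * per_example).sum() / weights.sum())
```

Under this assumed scheme, the grid search the authors recommend amounts to training once per candidate λ and comparing validation results.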

R3 asks what happens when applying BackMix when images have different shapes. This is a scenario we encountered as different scanners often acquire images at different resolutions. We followed typical preprocessing [17,18] by resizing all images to the same resolution. We’ll add this information to Section 3.1.

R5 had additional recommendations to improve clarity. Our responses below: (1) We will update Fig 2,4,5 to improve visibility of the ultrasound. (2) We will provide examples of A4C in Fig 4. (3) We’ll include a class distribution in Section 3.1.
(4) We did not find any significant variance in performance between views. (5) In %F equation, we are counting the number of highly activated pixels.
(6) Standard augmentations are applied to all methods.
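Response (5) resolves the reviewer's question in favour of counting pixels rather than summing activations. The difference between the two readings can be made concrete with a short sketch. The threshold value, the normalisation by all highly activated pixels, and the function names are assumptions for illustration; the definition the authors actually use is the one in Section 2.4 of the paper.

```python
import numpy as np

def focus_by_count(saliency, sector_mask, threshold=0.5):
    """Fraction of highly activated pixels that lie inside the sector (count-based)."""
    high = saliency > threshold            # assumed definition of "highly activated"
    total = high.sum()
    if total == 0:
        return 0.0
    return float((high & (sector_mask > 0)).sum() / total)

def focus_by_sum(saliency, sector_mask, threshold=0.5):
    """Alternative reading: share of activation mass (sum of Zp) inside the sector."""
    high = saliency > threshold
    total = saliency[high].sum()
    if total == 0:
        return 0.0
    return float(saliency[high & (sector_mask > 0)].sum() / total)
```

The two functions agree only when all highly activated pixels have equal activation, which is why the reviewer asked which one was intended.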

R3 cites ‘novelty’ as the first reason for a weak rejection, however there is no mention of novelty within “weaknesses” or “constructive comments”, neither has this been raised by any other reviewer so we are unable to address this critique in any way. Our novel contributions are listed in Section 1.

R5 questions the resources needed to manually check the sector masks. We’d argue that a human check is not necessary, however we wanted to verify that our masks were reliable, which they were




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


