Abstract

Self-supervised learning (SSL) has emerged as a promising paradigm for medical image analysis by harnessing unannotated data. Despite their potential, existing SSL approaches overlook the high anatomical similarity inherent in medical images, which makes it challenging for them to capture diverse semantic content consistently. This work introduces a novel and generalized solution that implicitly exploits anatomical similarities by integrating codebooks in SSL. The codebook serves as a concise and informative dictionary of visual patterns, which not only aids in capturing nuanced anatomical details but also facilitates the creation of robust and generalized feature representations. In this context, we propose CoBooM, a novel framework for self-supervised medical image learning that integrates continuous and discrete representations. The continuous component ensures the preservation of fine-grained details, while the discrete aspect facilitates coarse-grained feature extraction through the structured embedding space. To assess the effectiveness of CoBooM, we conduct a comprehensive evaluation on medical datasets encompassing chest X-rays and fundus images. The experimental results reveal a significant performance gain in classification and segmentation tasks.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3580_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sin_CoBooM_MICCAI2024,
        author = { Singh, Azad and Mishra, Deepak},
        title = { { CoBooM: Codebook Guided Bootstrapping for Medical Image Representation Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a solution that exploits anatomical similarities by integrating codebooks into SSL.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study performed sufficient experiments to give convincing results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors need to highlight their contribution and novelty compared with BYOL. It seems the authors only introduced self-attention layers in addition to vanilla BYOL.
    2. Why does the Quantizer use Euclidean distance as the similarity measurement? Did the authors try other similarity measures?
    3. Is the codebook binary? If so, what is the difference between this ‘codebook’ and the hash codes used in retrieval tasks?
    4. The results in Table 4 should include statistical analysis to show significance, especially in the ‘ALL’ context.
    5. How did the authors acquire the diagnostic maps? More details are needed.
    6. From Table 1 it seems that the decoder is not necessary. The authors should clarify this.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Highlight the contribution and carefully revise the paper according to the comments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A mixture of existing studies, which does not attract my interest.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I carefully reviewed the other reviewers’ comments and the authors’ response, and I have decided to change my score to Weak Accept, as the authors have addressed most of my concerns.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a novel SSL strategy that aims to enhance the integration of anatomical information by incorporating codebooks and a specialized fusion module for integration of such features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a novel approach to SSL that utilizes a codebook to learn image representations. Furthermore, the proposed DiversiFuse module improves upon existing methods and achieves better results in various contexts/datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the authors claim that incorporating a codebook provides anatomical information that should enhance representation learning, how this was taken into account in the study is not entirely transparent. Specifically, the source of the utilized codebook/codewords is not explicitly mentioned in the manuscript, nor is the nature of the anatomical information it contains or the manner in which it was calculated. The description of the datasets and the learning scheme could be improved to aid understanding of the achieved results.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The text is difficult to follow due to the unclear nature of the figure and its explanation; both would benefit from being simplified and organized more clearly and concisely. In addition, the learning scheme should be explained more clearly. For instance, it is unclear whether L_cb is calculated using all the codewords. Also, as mentioned before, providing details of the codebooks utilized is necessary. Furthermore, there are several typographical errors and presentation issues in the manuscript: for example, is DiversiFuse a subsection of the Methodology or of the “Qunatizer” section?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology may be innovative, but it is not well grounded within the text. There is a lack of clarity regarding how the proposed approach effectively integrates the information proposed for inclusion, and the method lacks clarity and setup details.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a self-supervised learning framework for medical image representation that integrates continuous and discrete representations using a codebook. The proposed method is evaluated on several datasets (NIH Chest X-ray, SIIM, MuReD, and ODIR) and compared with several state-of-the-art algorithms, which it outperforms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method includes a Quantizer module that uses a predefined codebook to quantize the low-dimensional continuous feature maps from the target encoder, selecting the codewords with the smallest Euclidean distance. The DiversiFuse sub-module within the Quantizer captures complex patterns and dependencies within the data using a multi-head cross-attention mechanism that minimizes the similarity score between discrete and continuous representations, enhancing the model’s ability to capture complex patterns and improving overall performance.

    The proposed CoBooM model outperforms other SOTA methods on four medical image datasets.
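    The multi-head cross-attention fusion described above could be sketched roughly as follows. This is a minimal illustrative sketch, not the authors’ implementation: the learned projection matrices are omitted, and the head count and function names are assumptions.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention(continuous, discrete, n_heads=4):
        """Multi-head cross-attention sketch: queries come from the continuous
        features, keys/values from the discrete (quantized) features.
        Learned Q/K/V projections are omitted for brevity."""
        n, d = continuous.shape
        dh = d // n_heads
        q = continuous.reshape(n, n_heads, dh).transpose(1, 0, 2)  # (H, N, dh)
        k = discrete.reshape(n, n_heads, dh).transpose(1, 0, 2)
        v = k
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))     # (H, N, N)
        out = (attn @ v).transpose(1, 0, 2).reshape(n, d)          # back to (N, D)
        return out
    ```

    Fusing the quantized features into the continuous stream this way lets each continuous feature attend to the discrete codeword representations rather than simply concatenating the two.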

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Section 5 presents the evaluation results, where three different implementations of the CoBooM model are used (Ours w/o Dec., Ours w/ Dec., and Ours w/o DF). In both tables, there are scenarios where the first model (Ours w/o Dec.) is better and scenarios where the second model (Ours w/ Dec.) is better. Furthermore, the discussion on Page 6 (Table 1) of improvements over SOTA methods involves both models: 65.1% for 1% NIH (Ours w/o Dec.); 57.5% for SIIM (Ours w/ Dec.). It is not clear or conclusive which model is better. The authors should commit to one model for comparison and discuss the ablation study separately.

    Similar comments apply to the results in Table 2, where it is not clear which model is better (w/o or w/ Dec.). The authors also mix the best results of these two models in the discussion: 65.8% for 1% NIH (w/o Dec.); 84.8% for 10% MuReD (w/ Dec.); 75.8% for 10% ODIR (w/o Dec.).

    Did the authors use 5-fold cross-validation to evaluate the robustness of the proposed model (randomly picking training and testing data)?

    There are several typos: Page 3 “medial”, Page 4 “Qunatizer”, Page 6 “the our”.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    If the authors do not plan to release the source code, they should provide pseudo-code for the Quantizer module and the DiversiFuse sub-module.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should provide either source code or pseudocode for the Quantizer module and the DiversiFuse sub-module, pick the best method (w/o Dec. or w/ Dec.) to evaluate and compare with other SOTA methods, evaluate the robustness of the proposed method, and present the ablation study separately.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed Quantizer module that integrates continuous and discrete representation is interesting and could be beneficial to other research.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors address all of my concerns



Review #4

  • Please describe the contribution of the paper

    This paper proposes a novel self-supervised learning model that uses a Quantizer module to discretize embedding feature vectors and a DiversiFuse module to compute multi-head attention over the feature vectors, feeding the weighted vectors into the decoder.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The novelties are clear, including the proposal of two novel modules, Quantizer and DiversiFuse, in the self-supervised learning framework.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall, the paper is well-written and the contribution is clear.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall the paper is well-written, with only minor issues.

    1. Introduction: citation BYOL [13] overflows the page margin. Using a LaTeX command like \sloppy or the breakcites package might resolve this issue.
    2. Figure 1: in \sigma(y_d^q, z_c^k), z_c^k does not seem to be defined within the text. Could this variable be y_c^k? Please clarify. In l_r(x, x’), the right closing parenthesis is missing.
    3. Section 3.1 title, “Qunatizer”, is a typo.
    4. Section 4, Experiment Setup: the codebook size is fixed at 1024×512; it would be really interesting to know how model performance changes with different codebook sizes.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the paper is clear, and a performance improvement can be seen in the quantitative results compared to other state-of-the-art works.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their encouraging comments and provide a rebuttal to address their concerns. Novel contributions (R3): While our model draws inspiration from non-contrastive SSL, we introduce the following novel components: (1) discretization of the continuous representation space using a codebook, enabling quantization into discrete codewords, where each codeword represents a group of similar attributes within the continuous space; (2) introduction of DiversiFuse, a sub-module of the Quantizer module (R4), which enhances the continuous representation with discrete features through cross-self attention. This allows the model to take advantage of both discrete and continuous features, resulting in informative representations (R3). Codebook details: The codebook is not binary. It comprises K codewords, which are randomly initialized vectors of size D (R3). Once trained, these become representatives of the continuous features (y_φ) in the quantized space. For quantization, a pair-wise distance is computed between each feature vector in y_φ and each codebook vector e_k to find the nearest codeword for each feature vector (R3, R4). Following common practice ([24,32] in the paper), we use the Euclidean distance, motivated by the simplicity and efficiency of K-means clustering (R3). This partitions the y_φ space into discrete regions represented by the codewords; as a result, the model represents common anatomical features with a compact set of discrete codebook vectors (R4). For further clarification on the codebook, please refer to [24,32] in the paper. Codebook learning: The model is trained in two phases, pre-training and downstream training. The codebook is trained during the pre-training phase, where the loss L_cb contains e_k, i.e., the quantized features in the form of the aforementioned nearest codewords. L_cb effectively amounts to an MSE loss between the continuous and discrete representations (R4).
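The quantization step described in the rebuttal above (nearest-codeword lookup under Euclidean distance, with L_cb reducing to an MSE between continuous and discrete representations) can be sketched as follows. The function names and NumPy layout are illustrative assumptions, not the authors’ code; the 1024×512 codebook size is taken from the reviews.

```python
import numpy as np

def quantize(y, codebook):
    """Map each continuous feature vector in y (N, D) to its nearest
    codeword in codebook (K, D) under Euclidean distance, as described
    in the rebuttal. Returns the quantized features and codeword indices."""
    # Pairwise squared Euclidean distances, shape (N, K)
    d = ((y[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def codebook_loss(y, codebook):
    """L_cb as an MSE between the continuous features and their nearest
    codewords (a simplification of the loss sketched in the rebuttal)."""
    y_q, _ = quantize(y, codebook)
    return float(((y - y_q) ** 2).mean())

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 512))  # K=1024 codewords of size D=512
y = rng.normal(size=(8, 512))            # a batch of continuous target-encoder features
y_q, idx = quantize(y, codebook)
loss = codebook_loss(y, codebook)
```

In practice the codeword vectors would be updated during pre-training (e.g., via this loss with a straight-through gradient, as in VQ-style methods), so that trained codewords become representatives of recurring anatomical patterns.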
Decoder’s significance: As mentioned in the Ablation Studies section, the classification results are comparable w/ and w/o the decoder; however, in segmentation tasks, the results are superior w/ the decoder (Tables 1 and 2). By reconstructing the input image from the output of the DiversiFuse sub-module, the decoder encourages the model to capture fine-grained details that are critical for segmentation (R3, R7). We also thank the reviewer for highlighting the mixing of best results across different implementations of our model. We consider the model w/ decoder our best model and will make the discussion of the results consistent in the camera-ready version (R7). Results: A paired t-test comparing our model with the best baseline method, DiRA (Table 2, SIIM dataset), yielded a significant p-value of 0.012, indicating performance differences. The 5-fold cross-validation on SIIM yields 57.7 in Table 1 and 59.2 in Table 2 (R7). The diagnostic maps are obtained during the downstream phase using Grad-CAM with the available ground-truth details for a subset of NIH samples, including the annotated regions and labels (R3). Additional details on data and training: As mentioned in Section 4, we pre-trained on the NIH and EyePACS datasets, using the official splits for NIH and 35,126 samples from EyePACS. In the downstream phase, ODIR has 7,000 samples and MuReD 2,208, with 20% allocated as the test set. SIIM provided 12,047 samples, using equal numbers of positive and negative samples and allocating 20% for validation. Downstream training used 100 epochs, a batch size of 16, and a learning rate of 1e-3 with the Adam optimizer (R4). Typo corrections: We will address the citation overflow and correct all typos in the camera-ready version (R4, R6, R7). In Fig. 1, we will rectify the variable inconsistency (y_c^k instead of z_c^k) and add the missing closing parenthesis (R6).
Codes will be released upon acceptance.
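The paired t-test mentioned in the rebuttal's results paragraph can be computed along these lines. The per-fold scores below are hypothetical placeholders (the rebuttal reports only aggregate numbers), and a critical-value comparison stands in for an exact p-value to keep the sketch standard-library only.

```python
import math

# Hypothetical per-fold segmentation scores for illustration only;
# the actual fold-level numbers are not reported in the rebuttal.
coboom_folds = [58.1, 57.2, 59.0, 57.9, 58.3]
dira_folds = [56.0, 55.4, 56.8, 55.9, 56.2]

# Paired t-test: test whether the mean per-fold difference is zero.
diffs = [a - b for a, b in zip(coboom_folds, dira_folds)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
t_stat = mean / math.sqrt(var / n)                   # paired t statistic

# Two-tailed critical value for df = n - 1 = 4 at alpha = 0.05 is ~2.776,
# so |t| above this threshold corresponds to p < 0.05.
significant = abs(t_stat) > 2.776
```

With real fold-level scores, a library routine such as a paired-sample t-test from a statistics package would return the exact p-value directly.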




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have demonstrated this through sufficient and convincing experiments, providing a novel formulation and application that could potentially enhance clinical image analysis. While there are concerns regarding the novelty compared to existing methods like BYOL and the detailed implementation of their Quantizer module, the authors have addressed these in their rebuttal satisfactorily. The implementation of self-attention layers and codebooks represents a thoughtful approach to improve learning in SSL frameworks. The authors are encouraged to further clarify the computational advantages and expand the statistical analysis to solidify the significance of their findings. However, the contributions as they stand are promising and merit recognition at the conference, potentially inspiring further research in the domain.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


