Abstract

Contemporary medical contrastive learning faces challenges from inconsistent semantics and sample-pair morphology, leading to dispersed and converging semantic shifts. The variability in text reports, written by multiple authors, further complicates semantic consistency. To tackle these issues, we propose a two-step approach. First, text reports are converted into a standardized triplet format, laying the groundwork for our novel concepts of “observations” and “verdicts.” This approach refines the {Entity, Position, Exist} triplet into binary questions, each guiding towards a clear “verdict.” We also innovate in visual pre-training with a Meijering-based masking strategy that focuses on features representative of the local context of medical images. By integrating this with our text-conversion method, our model advances cross-modal representation in a multimodal contrastive learning framework, setting new benchmarks in medical image analysis.
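To give a concrete feel for the triplet refinement described above, here is a minimal, hypothetical sketch of turning an {Entity, Position, Exist} triplet into a binary “observation” question with a yes/no “verdict.” The template strings and function name are illustrative assumptions, not the paper’s actual phrasing rules.

```python
# Hypothetical illustration only: one {Entity, Position, Exist} triplet becomes
# a binary question ("observation") with a yes/no "verdict". The template below
# is an assumption, not the authors' implementation.
def triplet_to_observation(entity: str, position: str, exist: bool):
    """Turn one report triplet into a binary question and its verdict."""
    observation = f"Is there {entity} in the {position}?"
    verdict = "yes" if exist else "no"
    return observation, verdict

# Example: triplet_to_observation("pneumothorax", "left apical zone", True)
# -> ("Is there pneumothorax in the left apical zone?", "yes")
```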

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0290_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0290_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gow_Masks_MICCAI2024,
        author = { Gowda, Shreyank N. and Clifton, David A.},
        title = { { Masks and Manuscripts: Advancing Medical Pre-training with End-to-End Masking and Narrative Structuring } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework, based on MaskVLM and MedKLIP, for self-supervised pre-training on multi-modal medical data (X-ray images and medical reports). The contribution is the use of Meijering filter-based masking for image encoding.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The only contribution is the use of the Meijering filter during masking.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of novelty. The framework is based on previous work, MaskVLM [18] and MedKLIP [30], with very limited changes. Fig. 1, the overview pipeline, is almost the same as in the MaskVLM paper [18], except that X-ray images, instead of natural images, are used. Section 2.1 is adapted from MaskVLM but for X-ray images. Section 2.2 is adapted from MedKLIP. Sections 2.3 and 2.4 are adapted from MaskVLM. All the loss functions (Equations 1, 2, 3, and 4) appear to be copied from the MaskVLM paper with minor changes to variable subscripts.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Regarding reproducibility, it is hard to understand how exactly the masking using Meijering filtering is performed. Is the process to first enhance the image with the Meijering filter and then perform random masking? Or is the masking itself based on the Meijering filtering? If the latter, can the authors further clarify the steps?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Fig. 1: the figures seem to be of low resolution, and the text is hard to read in the printed version. Variables such as v_m, v, w_m, and w are not explained in the paper. Fig. 2: the caption mentions two strategies while there are four subplots in the figure, which is confusing. It would be clearer if the authors further annotated which subfigure corresponds to which masking strategy. Sec. 2.1: it is mentioned that the masking, “further in supplementary materials, empirically outperforms random masking as evidence by our ablation study”. Since the Meijering filtering is the major contribution of this paper, we recommend showing the ablation study in the main body of the paper. Moreover, in the supplementary material, the ablation study is only qualitative (Fig. 1); a quantitative ablation study is recommended here.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Reject — must be rejected due to major flaws (1)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation is based on the lack of novelty in the paper. The only contribution is the Meijering filter, and there is no quantitative ablation study comparing Meijering filter to random masking to show the effectiveness.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors study pre-training strategies in medical image analysis. They propose novel pre-processing methods built on MaskVLM’s structure, which show promising results in both segmentation and classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method outperforms several well-established methods on multiple datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Despite the overall effectiveness of the proposed method, the authors did not compare it against MaskVLM with its original masking strategy in Tables 1-5. The ablation studies in Fig. 4 and 5 (which I think should instead be Tables 6 and 7) only somewhat show that the learned features are a good starting point; it would be more straightforward to demonstrate the effectiveness by testing MaskVLM in the same setting.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would increase my rating to “accept” once the authors show superiority over MaskVLM with its original masking strategy.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my concern, and thus I raised my score as promised.



Review #3

  • Please describe the contribution of the paper

    The manuscript deals with the pre-training of deep learning models that take both text and images as inputs. The authors propose to (i) convert the text into triplets (entity, location, truth) and (ii) filter the images with a tubularity filter.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Extensive experimental evaluation shows that the pretrained model performs better on the downstream tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The way the paper is written, it is hard to see the main goal until quite late.

    The method is not described very clearly, some symbols are not well defined (e.g., the relationship between f, g, d, and phi in Eq. 1 is missing), and there are not enough details to reproduce the work (e.g., text generation, model fine-tuning).

    Using only 224x224 images seems small.

    The benefit of using the tubularity filter is only justified experimentally.

    Minor points: the images are too small, and exponential notation should not be written as, e.g., “1e-5”.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The method is clearly complex and uses a lot of existing software, so achieving full reproducibility would be difficult. However, more information would be helpful, as well as access to code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I appreciate the amount of work that has been done, especially in the experimental evaluation. However, the description could be more focused and more clear.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good experimental results.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a multi-modal self-supervised pre-training framework for chest X-ray classification. The work is built upon two existing frameworks, namely MaskedVLM [23] and MedKLIP [30]. The overall training flow (Eq. 4 in paper #290) is inspired by MaskedVLM, and the medical record encoding strategy is partially inspired by MedKLIP. The major contributions of this work are a novel masked (medical) image modeling strategy based on the Meijering filter and a new manuscript generation method to rephrase the medical report triplets extracted by MedKLIP. The proposed framework (M&M) was evaluated on seven chest X-ray datasets and compared to various previous methods in supervised (using different ratios of data) and zero-shot classification settings, and a consistent, large improvement is observed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The presentation is clear, and the manuscript is easy to follow.
    2. The major contributions, although not transformative, are well justified. I like the idea of applying the Meijering filter for masked (medical) image modeling pre-training. The proposed report rephrasing method from triplets seems simple but outperforms the method proposed by MedKLIP. I want to praise the technical soundness as well as the straightforward and effective components.
    3. Comparisons with other methods and ablation studies are comprehensive. M&M consistently outperforms other methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. No open-source code provided.
    2. The submitted version did not cite MaskedVLM in Section 2.3 Conditional Reconstruction.
    3. A few limitations could be discussed and added to the manuscript.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No open-source code was provided. There is a critical component (report generation) that does not seem fully reproducible from the manuscript alone.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please also cite MaskedVLM [23] in Section 2.3. This section is essentially the same as Section 3.1 (Joint Reconstruction) in the MaskedVLM paper. You should appropriately acknowledge that this is from MaskedVLM.
    2. While the Meijering filter seems to improve the performance in the discussed lung X-ray classification, the reviewer is unsure if this component can be beneficial to other modalities, especially MRI, which is noisier and more heterogeneous. This kind of enhancement/vessel filter (i.e., Meijering) may not work well on MRI. This limits the impact of this component, as it may not be used as a standalone measure to improve MIM for MRI. This might be a limitation and should be discussed in the manuscript or supplementary materials if there is no space.
    3. The authors should emphasize the efficiency in the conclusion (using only 100 epochs vs. others’ 800 epochs). Also, which similar studies use 800 epochs for training? Please be specific in Section 3.2. If there is a future journal version, it would be interesting to see whether the performance can be further improved by extending training to 800 epochs.
    4. The improvement over vanilla MAE (over 8%) is a little too good to be true. The reviewer wonders whether the vanilla MAE was only trained for 100 epochs and therefore had not converged; the MAE paper shows that after 100 epochs, MAE has not converged yet (Fig. 7, MAE paper). Were the otherwise optimal settings of vanilla MAE (e.g., masking ratio) used in the ablation studies? The authors could highlight that M&M converges faster and should point out that the compared methods are trained for the same number of epochs, not trained to convergence.
    5. There is a numbering issue with the tables. The tables of the ablation studies should be Tables 6 and 7; however, their captions say Fig. 4 and 5, and they are referred to as Table 4 and Table 5 in the text (Section 3.6).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes two incremental components (i.e., a new MIM method and a medical record rephrasing method) on top of existing frameworks (i.e., MaskedVLM and MedKLIP). Although they are not transformative approaches, the work is very solid and comprehensively evaluated. The reviewer also praises the technical soundness. I recommend a clear acceptance without reservation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    After reading the author’s rebuttal and other reviewers’ comments, I’d like to reiterate my rating of “5 - Accept”. I’d like to increase my confidence from 3 to 4 after the rebuttal. I rank this paper 1 (best) in my stack of rebuttal papers (n=3). I’d like to thank the authors for their efforts in preparing the manuscript and rebuttal.

    I’d also like to defend the authors regarding “novelty”. I still believe that there is no “transformative” novel approach in this paper, as stated in my initial review. However, the incremental change over the previous image masking method and text pre-processing method in the self-supervised pre-training task, together with the overall framework design presented by this work, can be regarded as “novel” engineering work. All components are well justified in the manuscript.

    Meanwhile, there are many examples similar to this paper that have incremental engineering novelty and were accepted to top venues, such as [1]. [1] proposed the most suitable masking strategy for video MAE pre-training, just as this work proposes a better masking strategy and a better text pre-processing strategy for X-ray multi-modal SSL pre-training. So, I disagree with the strong reject rating given simply because of a “lack of novelty”.

    Although there might be some limitations (for example, the method may not work well for other modalities, though currently most multi-modal medical image SSL pre-training focuses on X-ray), as long as the authors are willing to acknowledge them, I do not think this should be a deal-breaker.

    Again, I would like to emphasize that I recommend a “5 - accept” rating without any reservations. This final recommendation is based on the technical soundness, comprehensive evaluation, author’s rebuttal, as well as other reviewer’s comments.

    Ref: [1] Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS, 35, 10078-10093.




Author Feedback

Thank you for your valuable feedback and suggestions. We appreciate the recognition of our experimental results [R3, R6, R7] and proposed pre-processing methods [R3, R9]. We address the reviewers’ concerns below:

Lack of Reproducibility [R3, R7, R9]: We plan to release our code upon paper acceptance to address reproducibility concerns.

Need for Meijering Filter [R3, R9]: We hypothesize the need for a ridge filter in Section 2.1. Ridge filters convert X-ray images into forms suitable for reconstruction, addressing the fine-grained nature of medical data. We use the Meijering filter output to guide our random masking. The ablation study, mistakenly referenced as being in the supplementary material, is in fact in Figure 4 of the main paper. This strategy outperforms other masking strategies such as MAE (random masking) [10], AttMask (region masking) [17], and AutoMAE (learnable masking) [3], showing a significant AUC improvement of 9.08 over random masking. The training here follows MaskVLM, and only the visual masking has been changed, for a fair comparison. We will correct the reference error, as this ablation is in fact already in the main paper and not in the supplementary material.
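As a rough illustration of the statement that the Meijering filter output guides the random masking, the sketch below samples a patch mask with probabilities proportional to the pooled ridge response (computed with skimage.filters.meijering). This is one plausible reading under stated assumptions, not the authors’ released code; the patch size, masking ratio, and patch pooling are placeholders.

```python
# Hedged sketch: Meijering-guided patch masking, assuming the ridge response
# steers which patches are masked. Names and hyperparameters are illustrative.
import numpy as np
from skimage.filters import meijering

def meijering_guided_mask(image, patch_size=16, mask_ratio=0.75, rng=None):
    """Sample a boolean patch mask with probability proportional to ridge strength."""
    rng = np.random.default_rng() if rng is None else rng
    ridge = meijering(image.astype(np.float32))  # ridge/tubularity response, higher on vessel-like structures
    h, w = image.shape[0] // patch_size, image.shape[1] // patch_size
    # Average the ridge response inside each non-overlapping patch.
    scores = ridge[: h * patch_size, : w * patch_size]
    scores = scores.reshape(h, patch_size, w, patch_size).mean(axis=(1, 3)).ravel()
    scores = np.clip(scores, 0.0, None) + 1e-8   # small floor so every patch keeps a nonzero chance
    probs = scores / scores.sum()
    n_mask = int(mask_ratio * h * w)
    chosen = rng.choice(h * w, size=n_mask, replace=False, p=probs)
    mask = np.zeros(h * w, dtype=bool)
    mask[chosen] = True
    return mask.reshape(h, w)  # True = patch is masked before reconstruction
```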

Lack of Novelty [R9]: We stress our innovations at the beginning of Section 2. “In Figure 1, we outline our approach, adopting MaskVLM’s [18] architecture. Our innovations primarily lie in visual and textual data pre-processing, rather than the architecture itself, which serves as a foundation for our training methodology.” Preprocessing significantly enhances existing strategies, evidenced by Figures 4 and 5 (ablation tables) in the main paper. While Section 2.2 acknowledges our foundation in MedKLIP, with notable similarities, it is in Fig 3 (b) and Figure 5 (ablation) where we detail our modifications and improvements, specifically over the KE-Triplet outputs from MedKLIP.

Comparison with original MaskVLM [R6]: In MaskVLM, the authors use random image and text masking followed by multimodal alignment. In Fig. 4, the ‘random masking’ numbers would correspond to MaskVLM, but with our proposed report generation used for text masking. Without our report generation, i.e., applying random masking directly to the existing reports, we get 58.87, 14.96, and 66.69 (AUC, F1, and ACC, respectively), which is much lower than the proposed M&M. We did not include this in order to properly ablate our contributions within the limited space of the paper. We will highlight this in the final version. Thank you.

Comparing with vanilla MAE [R7]: We ran the vanilla MAE for 800 epochs, but the reconstructions were subpar. The vanilla MAE typically excels with datasets containing millions of images, a scale our datasets don’t match. We tried different masking ratios to no avail. Our ridge filter-based masking decreases the need for such extensive pre-training, achieving comparable results in just 100 epochs. Extending training to 300, 500, or 800 epochs did not improve accuracy. We will clarify this.

Discussion of Limitations and Errors [R7]: We will discuss limitations, particularly concerning the variability in medical images and potential failure modes. We will also correct the missing citation for MaskVLM and fix table numbering errors.

Paper writing and size of images [R3]: We will first motivate the approach before discussing it, to improve clarity. Apologies, ϕ was used to reduce the complexity of the equations; we should have stated this clearly. We followed the notation used by MaskVLM. This will be fixed, for example, ϕ_txt = g^de_txt(g_txt(f_im(I), f_txt(T_m))). Regarding the size of the images, we use 224x224 for a fair comparison with other approaches such as MedKLIP; larger sizes may result in unfair gains for our approach.
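To make the expanded notation above concrete, here is a minimal sketch of the composition ϕ_txt = g^de_txt(g_txt(f_im(I), f_txt(T_m))). The callables mirror the MaskVLM-style names from the rebuttal but are placeholders, not the actual modules.

```python
# Sketch of the composition spelled out in the rebuttal, with placeholder callables:
# phi_txt = g_de_txt(g_txt(f_im(I), f_txt(T_m)))
def conditional_text_reconstruction(I, T_m, f_im, f_txt, g_txt, g_de_txt):
    v = f_im(I)             # visual features from the (unmasked) image
    w_m = f_txt(T_m)        # text features from the masked report
    fused = g_txt(v, w_m)   # cross-modal text encoder conditioned on the image
    return g_de_txt(fused)  # text decoder output, i.e. phi_txt
```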

Clarity Issues in Figures 1, 2 [R9]: We apologize for any confusion and will improve the resolution and captions in these figures to clarify that ‘v’ and ‘v_m’ refer to extracted and masked visual features, respectively, and ‘w’ and ‘w_m’ to text features.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


