Abstract

Multi-modality learning, exemplified by the CLIP model pre-trained on language-image pairs, has demonstrated remarkable zero-shot capabilities and has gained significant attention in the field. However, directly applying language-image pre-trained CLIP to medical image analysis encounters substantial domain shift, resulting in significant performance degradation due to inherent disparities between natural (non-medical) and medical image characteristics. To address this challenge and uphold or even enhance CLIP’s zero-shot capability in medical image analysis, we develop a novel framework, Core-Periphery feature alignment for CLIP (CP-CLIP), tailored for handling medical images and their corresponding clinical reports. Leveraging the core-periphery organization widely observed in brain networks, we augment CLIP with a novel core-periphery-guided neural network. This auxiliary CP network not only aligns text and image features into a unified latent space more efficiently, but also ensures that the alignment is driven by domain-specific core information in medical images and clinical reports. In this way, our approach effectively mitigates the domain shift and further enhances CLIP’s zero-shot performance in medical image analysis. More importantly, CP-CLIP exhibits strong explanatory capability, enabling the automatic identification of critical regions in clinical analysis. Extensive experiments across five public datasets underscore the superiority of CP-CLIP in zero-shot medical image prediction and critical-area detection, demonstrating its promise for multimodal feature alignment in medical applications.
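To make the described design concrete, the snippet below is a minimal, hypothetical PyTorch sketch of the idea: a single core-periphery (CP) masked projection head, shared by the image and text branches, maps both CLIP embeddings into one latent space before a CLIP-style contrastive loss. The mask construction, layer sizes, and all names (core_periphery_mask, SharedCPHead, clip_style_loss) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the CP-CLIP idea described in the abstract; the exact
# CP mask scheme and dimensions are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def core_periphery_mask(dim: int, core_frac: float = 0.25,
                        periphery_p: float = 0.1) -> torch.Tensor:
    """Binary weight mask: core-core links are dense, links touching the
    periphery are kept only with a small probability (assumed scheme)."""
    n_core = int(dim * core_frac)
    is_core = torch.zeros(dim, dtype=torch.bool)
    is_core[:n_core] = True
    dense = is_core.unsqueeze(0) & is_core.unsqueeze(1)   # core-core block
    sparse = torch.rand(dim, dim) < periphery_p            # everything else
    return (dense | sparse).float()


class SharedCPHead(nn.Module):
    """One CP-masked linear head applied to both modalities."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.register_buffer("mask", core_periphery_mask(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked linear projection: only CP-allowed connections are used.
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)


def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched image-report pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Usage with placeholder CLIP embeddings (batch of 8, 512-d):
head = SharedCPHead(512)
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
loss = clip_style_loss(head(img_feat), head(txt_feat))  # same head for both
```

The point of sharing one head, as the abstract suggests, is that both modalities are constrained to the same CP-structured subspace rather than learning two independent projections.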

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3750_paper.pdf

SharedIt Link: https://rdcu.be/dV1Vi

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72384-1_9

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yu_CPCLIP_MICCAI2024,
        author = { Yu, Xiaowei and Wu, Zihao and Zhang, Lu and Zhang, Jing and Lyu, Yanjun and Zhu, Dajiang},
        title = { { CP-CLIP: Core-Periphery Feature Alignment CLIP for Zero-Shot Medical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {88--97}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a novel method, CP-CLIP, for zero-shot medical image analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is quite innovative and has been experimentally validated on multiple datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Is the choice of datasets representative?
    2. The experimental results are not very ideal and are poorly presented.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Table 3 should include more case analyses, as there are more than two datasets used in your experiments.

    2. Presenting ablation studies in table format might be better and more conducive to statistical analysis.

    3. The performance evaluation metrics for the experimental results are too simplistic. For diagnostic tasks, multiple quantitative metrics could be used for assessment. The current version feels somewhat lacking in terms of experimental results.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presentation of figures and tables in this paper is somewhat lacking.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The quality of the article is slightly better than expected, taking into account the feedback from other reviewers and the author’s responses.



Review #2

  • Please describe the contribution of the paper

    The authors tackle the challenge of training/fine-tuning CLIP on a medical dataset to enhance its zero-shot ability, with a design specifically for handling medical images and their reports, which exhibit distinct patterns compared to the original CLIP dataset. Given the small size of the medical image dataset, the authors employ an additional core-periphery-guided neural network as an extra processing step to refine the image and text embeddings, bringing them into a unified feature space before aligning them in a manner similar to CLIP. The authors demonstrate improved performance on 4 out of 5 datasets compared to MedCLIP, and superior performance on all datasets compared to CLIP.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and highly comprehensible, addressing a problem of considerable interest to the community. The proposed method offers a simple module that can be readily attached, resulting in performance improvements. Notably, this method appears applicable not only to CLIP but also to other architectures, potentially including MedCLIP. By employing the same network for both image and text embeddings, the method ensures a more effective alignment. The heatmaps presented by the authors showcase compelling results with enhanced localization properties compared to CLIP. Overall, the method demonstrates promising results across multiple datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited Novelty: The novelty of the CP network, a core methodology proposed in the paper, appears to be limited as it shares similarities with the approach outlined in paper “CP-CNN: Core-Periphery Principle Guided Convolutional Neural Network” and is also connected to “Core-Periphery Principle Guided Redesign of Self-Attention in Transformers”.

    Overstatements: The claim that the method is tailored for handling medical images and reports seems overstated. While CP is a method for generating sparse neural networks, there aren’t specific components within it tailored specifically to medical images and reports. The motivation mentions that CLIP does not perform well due to scarce data; however, I don’t see any specific component that addresses this point. Maybe I am missing something.

    The paper asserts that CP with shared weights facilitates a unified latent space for further feature alignment. However, since feature alignment is a key aspect of the paper’s proposition, it would be necessary to compare it with CLIP + CP network without shared parameters to validate this claim effectively. Without such a comparison, the statement lacks substantiation.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This method could be tested with other CLIP-type algorithms, which would strengthen your claims and would be much more interesting to the community given the simplicity of the proposed methodology.

    For the visualizations, it would be better if we could see the best and the worst cases. Is it always the case that CLIP finds shortcuts and CP-CLIP avoids them?

    Although the results look convincing, the method underperforms MedCLIP on the ChestXray dataset, which has more samples than all the other four combined. More discussion of this point would make the results more convincing.

    See weaknesses

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors leading to this score were the limited novelty and the high similarity to methods already existing in the literature. The feature-alignment selling point is also not convincing, and there is no proof of why it is needed in the first place. The result on the ChestXray dataset, which has the most data yet underperforms MedCLIP, also contributed to the overall score for this paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I would like to thank the authors for taking the time to write a well-constructed rebuttal. However I still have unaddressed concerns.

    1. Feature alignment + limited novelty: The authors mention that their key innovation lies in designing a single common network (the CP network), used as an auxiliary network in the CLIP model, to bring image and text features into a unified space. However, the CP network is not entirely new, and the feature-alignment idea is missing the important ablation of the opposite case, i.e., different CP networks per modality. Their argument does not elucidate why this shared-network design is even necessary.
    2. Underperforming MedCLIP on ChestXray: Although this is not a major concern, their argument still does not explain why ChestXray, a substantially larger dataset, shows poorer performance for CP-CLIP compared to MedCLIP. Further questions arise, such as why multiple prompts do not affect the results on the smaller datasets, or whether there is a reason the authors did not use multiple prompts. Data-wise, ChestXray is much bigger than the other four datasets, and the authors’ arguments do not address my concern. Based on these remaining issues, I stick to my original decision of weak reject.



Review #3

  • Please describe the contribution of the paper

    This paper proposed to tackle the zero-shot medical image classification problem by utilizing image and text information with a vision-language model named CP-CLIP. Multiple public medical image datasets were collected to demonstrate the performance of the proposed model. Grad-CAM was used to visualize the model performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper are multi-fold: 1) it addresses the challenges of zero-shot tasks; 2) it uses multi-organ, multi-center, large-scale public data; 3) the writing and illustration of the model are clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses of the paper concern the experimental implementation details: 1) pre-processing needs to be elaborated in the revised version; 2) for classification, AUC or G-mean would be better evaluation metrics, which needs to be addressed in the revised version.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper proposed to tackle the zero-shot medical image classification problem by utilizing image and text information with a vision-language model named CP-CLIP. Multiple public medical image datasets were collected to demonstrate the performance of the proposed model. Grad-CAM was used to visualize the model performance.

    The main strengths of the paper are multi-fold: 1) it addresses the challenges of zero-shot tasks; 2) it uses multi-organ, multi-center, large-scale public data; 3) the writing and illustration of the model are clear.

    The main weaknesses of the paper concern the experimental implementation details: 1) pre-processing needs to be elaborated in the revised version; 2) for classification, AUC or G-mean would be better evaluation metrics, which needs to be addressed in the revised version.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper tackles the challenging task of zero-shot medical image classification using a vision-language model. Multi-center and multi-organ images were collected to demonstrate the proposed model.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Minor weaknesses have been addressed in the rebuttal.




Author Feedback

We thank the reviewers for their valuable comments and their recognition of the strengths in our work:

  1. Innovative method, validated by experiments. (R4)
  2. Well-written, highly comprehensible, and promising results. (R5, R6)
  3. Challenging zero-shot tasks, covering multi-organ and multi-center large-scale public data. (R6)

Here, due to character constraints, we address only the main concerns. We will incorporate all other suggestions in the revision and publish our code.

  1. Is the choice of datasets representative? (R4) We believe it is. To verify our method, we implemented extensive experiments on five publicly accessible datasets covering multiple organs and centers, which was appreciated by Reviewer R6.

  2. The experimental results are not well presented and seem less than ideal (R4). We would like to point out that the other two reviewers (R5/R6) think our paper is well-written and comprehensible. We would also like to emphasize that we are focusing on challenging zero-shot tasks. Compared to the baselines, our method shows promising improvements, which the other reviewers appreciated. We hope the reviewer recognizes that our task is zero-shot, without supervision labels.

  3. Table 3 should include more case analyses, as there are more than two datasets used in your experiments (R4). We only have two tables in the paper; we believe the reviewer is referring to Figure 3. Please refer to the appendix for more cases.

  4. Statistical analysis and more evaluation metrics. (R4) MICCAI rules prohibit new experiments, but we will add a discussion of other metrics.

  5. Limited novelty. (R5) We differentiate our method from the existing CP-ViT and CP-CNN methods in two key aspects. First, our method aims to achieve better feature alignment, while the other two aim to make networks sparse. Second, our CP-guided neural network is added to CLIP as an auxiliary network, while the other two redesign the ViT and CNN themselves.

  6. Overstatements/ Comparison with CLIP+CP network without shared parameters. (R5) The CP component helps map image and text features to a unified latent space, facilitating feature alignment. This component can be extended to other applications, and we will revise the statement in the revision.

There is a slight misunderstanding by the reviewer about the paper. Our key innovation lies in designing a single CP network that maps both image and text features into a unified latent space. Having two CP networks, each processing image and text features separately would contradict our design and purpose of mapping both image and text features into the same latent space.
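To illustrate the distinction under discussion (a sketch on our part, continuing the hypothetical SharedCPHead and clip_style_loss definitions from the snippet after the abstract, not the authors' code), the shared design applies one CP head to both modalities, whereas the ablation requested by the reviewer would use a separate head per modality:

```python
# Continues the earlier hypothetical sketch (SharedCPHead, clip_style_loss).
import torch

img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)

# Shared design defended in the rebuttal: one CP head projects both modalities,
# so image and text embeddings land in the same latent space by construction.
shared = SharedCPHead(512)
loss_shared = clip_style_loss(shared(img_feat), shared(txt_feat))

# Unshared ablation the reviewer asks for: two independent heads, one per modality.
img_head, txt_head = SharedCPHead(512), SharedCPHead(512)
loss_separate = clip_style_loss(img_head(img_feat), txt_head(txt_feat))
```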

  7. More comparisons / underperformance relative to MedCLIP on ChestXray. (R5) MICCAI rules prohibit conducting new experiments, but we will address this aspect in the discussion. Unlike MedCLIP, which utilizes multiple prompts in its experiments, we consistently employ a single text prompt for all experiments. This distinction contributes to the observed performance differences.

  8. Experimental implementation details and other evaluation metrics (R6). We will add more details and publish our code. Other evaluation metrics will be added to the discussion.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewer #4 and Reviewer #5 initially provided mixed feedback, highlighting both the novelty and limitations of the CP-CLIP method. Reviewer #4 had concerns about dataset representativeness and the presentation of results, rating the paper as a weak reject. Post-rebuttal, R4 shifted to a weak accept, acknowledging the clarity of the author’s responses and improvements over standard models. Reviewer #5 praised the comprehensibility and novel application but criticized the limited novelty and lack of specific adaptation to medical images and kept the original rating. Reviewer #6 gave a stronger endorsement, highlighting the paper’s methodological strengths and clinical applicability across diverse datasets. The author’s rebuttal addressed many concerns effectively, particularly clarifying the experimental setup and dataset choice. The combination of the reviewers’ updated scores and detailed author feedback suggests a consensus leaning towards acceptance, contingent upon addressing the outlined concerns in the final manuscript.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper concerns the development of medical vision-language pre-training techniques. It is a very active research area, with many papers published in the last two years. Major criticisms include limited methodological innovation compared to the existing literature and underperformance relative to MedCLIP in the evaluation. Another constructive comment is that comparing against more methods (currently only MedCLIP) would better support the paper.

    Considering the reviewers’ comments, as well as reading the rebuttal and ranking against other papers, this paper may not be of high enough rank in my batch of papers.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Based on the reviewers’ feedback and the authors’ rebuttal, I recommend accepting this paper, provided the authors integrate the additional information from their rebuttal into the final manuscript. The paper introduces CP-CLIP for zero-shot medical image analysis, validated across multiple datasets. Although the methodological novelty was initially questioned, the authors effectively addressed these concerns in their rebuttal, clarifying experimental setups and dataset choices. The reviewers’ updated scores and detailed author responses indicate a consensus leaning towards acceptance, contingent on incorporating the outlined improvements.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



