Abstract

Contrastive Language-Image Pre-training (CLIP)-based models enable zero-shot classification in radiology but often struggle to detect normal cases due to rigid intra-sample alignment, which leads to poor feature clustering and increased false-positive and false-negative rates. We propose OFF-CLIP, a simple and effective refinement that introduces an off-diagonal loss term to explicitly promote the clustering of normal samples. In addition, it applies sentence-level filtering to remove typical normal phrases embedded within abnormal reports. OFF-CLIP requires no architectural changes and does not compromise abnormal classification performance. On the VinDr-CXR dataset, normal classification shows a notable 0.61 AUC improvement over the state-of-the-art baseline CARZero. It also improves zero-shot grounding by increasing pointing-game accuracy and providing more reliable and precise anomaly localization. These results demonstrate that OFF-CLIP serves as an efficient plug-and-play enhancement to existing medical vision-language models. The code and pre-trained models are publicly available at https://github.com/Junhyun-Park01/OFF-CLIP.
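
To make the off-diagonal idea concrete, below is a minimal PyTorch sketch of an InfoNCE-style loss whose target matrix gains off-diagonal positives for normal-normal pairs, so normal samples are pulled together instead of being pushed apart as negatives. This is an illustrative sketch, not the paper's exact formulation: the function name offclip_style_loss, the soft-target construction, and the equal weighting of the two directions are assumptions.

    import torch
    import torch.nn.functional as F

    def offclip_style_loss(img_emb, txt_emb, is_normal, temperature=0.07):
        """Sketch: InfoNCE with an off-diagonal term for normal samples.

        img_emb, txt_emb : (B, D) L2-normalized image/text embeddings
        is_normal        : (B,) bool tensor, True where the study is normal
        Illustrative assumption: the paper's exact weighting may differ.
        """
        logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix

        # Standard CLIP target: matched (diagonal) pairs are the positives.
        targets = torch.eye(logits.size(0), device=logits.device)

        # Off-diagonal term: normal-normal pairs also become positives,
        # explicitly encouraging normal samples to cluster in latent space.
        normal_pairs = is_normal.unsqueeze(0) & is_normal.unsqueeze(1)
        targets = torch.maximum(targets, normal_pairs.float())
        targets = targets / targets.sum(dim=1, keepdim=True)  # rows sum to 1

        # Symmetric soft-target cross-entropy (image-to-text and text-to-image).
        loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)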

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3740_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Junhyun-Park01/OFF-CLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ParJun_OFFCLIP_MICCAI2025,
        author = { Park, Junhyun and Moon, Chanyu and Lee, Donghwan and Kim, Kyungsu and Hwang, Minho},
        title = { { OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {382--391}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces an off-diagonal loss term to enhance normal-sample clustering in CLIP, which reduces false positives, and removes misaligned normal statements from abnormal reports, which reduces false negatives. Extensive experiments demonstrate the validity of the proposed method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The method proposed in this paper introduces no additional parameters and efficiently reduces false positives by modifying only the CLIP loss function.
    2. The text-filtering strategy also brings a performance improvement at low cost.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Some of the methods compared in the experiments are rather dated; a number of papers in this field from the past one or two years are neither cited nor compared.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method in this paper is innovative and does not introduce additional parameter burden. The method may have instructive implications for subsequent work in this field. Numerous experiments show the effectiveness of the method for performance improvement.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces OFF-CLIP, a refinement to the CLIP model that improves normal case detection in radiology. It addresses issues of poor normal sample clustering and high false positives/negatives by enhancing normal sample alignment through an off-diagonal term loss and reducing misalignment with sentence-level text filtering. OFF-CLIP significantly improves normal classification, achieving a 0.61 AUC increase over CARZero, while maintaining or improving abnormal classification performance. It also enhances anomaly localization in zero-shot grounding tasks, making it a robust and efficient enhancement for medical vision-language models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The core motivation and method of the paper are very reasonable. The issue of FPs and FNs in medical image-report pairs has always been a challenge in VLM training. The approach of addressing these problems from the perspective of pseudo-labeling is interesting and feasible.
    2. The paper has open-sourced the code and provides sufficient details for reproduction.
    3. The experimental results significantly demonstrate that the proposed method is effective and has been validated on multiple datasets and tasks.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While the paper’s starting point is promising, I personally believe that it falls into a “chicken or egg” dilemma. The paper requires a robust sentence-level classifier, but if we already have a good sentence-level classifier, we could use its classification results as supervisory signals to guide the training of the vision encoder, without the need for the traditional CLIP architecture. Additionally, the paper lacks an in-depth discussion of this classifier, which is actually crucial for pre-training. How do classifiers with varying performance impact pre-training? This requires further clarification.
    2. The issue of FP and FN has already been explored in X-ray studies. For example, MedCLIP[1] uses UMLS to map reports to different entities, achieving more fine-grained pseudo-labeling (whereas this paper only focuses on normal vs. abnormal). The authors should clarify the advantages of using a sentence-level anomaly detector compared to MedCLIP and provide experimental comparisons.
    3. Personally, I believe the main contribution of the paper comes from the pre-trained sentence-level classifier or, rather, some predefined prior rules and knowledge. This type of knowledge, which enhances VLM training, is actually quite obvious and has been widely discussed in the literature [1, 2, 3, 4]. I would suggest that the authors explore how such predefined rules (such as the LLM-based anomaly detector discussed in the paper) can be efficiently integrated into the CLIP architecture. In future work, the comparison should ideally involve CLIP models that also incorporate prior knowledge (possibly through different loss designs or prompt usage), rather than CLIP models that lack any prior knowledge.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation of the paper is very reasonable, and the experiments are quite convincing. However, there are some unclear aspects, particularly regarding the design of the comparative experiments. I understand this might be due to space limitations. Therefore, I am inclined to give a weak accept. If the authors could provide a comparison of the strengths and weaknesses of their approach with methods like MedCLIP, I might reconsider raising my score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes OFF-CLIP to address limitations in radiology contrastive language-image pre-training models. It improves normal detection and anomaly localization, outperforming the baseline.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Identifies key limitations in radiology CLIP models and proposes effective solutions.
    • Achieves significant improvement in normal detection and anomaly localization.
    • Framework agnostic design may be applicable to other models.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • OFF-CLIP was only tested on the CARZero baseline. Its effectiveness across different architectures remains uncertain, which restricts the generalizability conclusion.
    • OFF-CLIP relies on pretrained models like GPT-4o for text prompting and a sentence-level anomaly classifier for pseudo-labels. Their quality impacts OFF-CLIP’s performance.
    • This paper lacks a real-world clinical utility assessment. It’s unclear how well OFF-CLIP would perform in actual medical practice scenarios.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Thank you for your thoughtful and constructive comments, and we sincerely appreciate your positive evaluation of our work, which led to its early acceptance.

To Reviewer #1,

  1. While our original manuscript aimed to cover relevant vision-language models, we acknowledge the omission of important works such as MedCLIP, MedKLIP, and BiomedCLIP. In the revised version, we will update the Related Work section to include these models and clarify their distinctions from OFF-CLIP.

To Reviewer #2,

  1. We address vision-language alignment through binary anomaly classification, labeling each sentence as normal or abnormal. Constructing a fine-grained, multi-disease classifier is challenging (the “egg”), while binary classification is simpler due to clearer semantic boundaries. Thus, a sentence-level classifier serves as a practical foundation. Our classifier is directly adopted from prior work [1], which used GPT-3.5 to generate sentence-level labels without manual supervision. Specifically, it utilized a RadBERT encoder trained with supervised contrastive learning; a minimal sketch of how such a classifier drives the filtering step appears after this list. We will clarify its architecture and integration in the revised manuscript.
  2. MedCLIP’s pseudo-labeling is constrained to 14 predefined UMLS entities, which limits its ability to generalize to rare or unseen conditions. In contrast, OFF-CLIP avoids reliance on fixed taxonomies and instead leverages CLIP’s open-vocabulary capability. Abnormal cases often exhibit high variability in severity, location, and appearance, making them unsuitable for consistent clustering. Normal cases, however, are visually and semantically uniform, allowing for effective latent space clustering. OFF-CLIP exploits this asymmetry by clustering only normal samples, thereby improving anomaly detection. We will clarify this design rationale in the revised manuscript.
  3. We appreciate your insightful comment and agree on the importance of integrating prior knowledge more directly into the CLIP architecture. Our current work serves as a first step: by leveraging a simple yet effective sentence-level classifier to cluster normal samples, we highlight the utility of such priors. In future work, we aim to explore deeper integration of both language-based (LLM) and vision-side priors. We also agree that future comparisons should include CLIP models that incorporate prior knowledge through various mechanisms to better contextualize our method’s impact.
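
As a concrete illustration of the classifier-driven filtering mentioned in point 1 above, the following minimal Python sketch drops sentences flagged as normal from a report whose study-level label is abnormal. The helper is_normal_sentence stands in for the pretrained RadBERT-based sentence classifier and is hypothetical, as is the naive period-based sentence splitting.

    from typing import Callable

    def filter_abnormal_report(report: str,
                               is_normal_sentence: Callable[[str], bool]) -> str:
        """Remove typical normal phrases embedded in an abnormal report.

        is_normal_sentence stands in for the pretrained sentence-level
        anomaly classifier; its interface here is an assumption.
        """
        # Naive period-based split; real reports may need a proper splitter.
        sentences = [s.strip() for s in report.split(".") if s.strip()]
        kept = [s for s in sentences if not is_normal_sentence(s)]
        return ". ".join(kept) + ("." if kept else "")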

To Reviewer #3,

  1. We acknowledge that our validation focuses on the CARZero baseline. However, most medical CLIP models, including CARZero, use InfoNCE loss over a similarity matrix. Since OFF-CLIP introduces off-diagonal term loss and text filtering in this training structure, we expect it to yield similar benefits across other architectures. While MICCAI policy limits new results post-submission, we have tested OFF-CLIP on other models and observed comparable improvements.
  2. OFF-CLIP relies on a sentence-level classifier and GPT-4o prompts. However, both are robust: our classifier achieves AUC 0.977 [1], and GPT-4o generates consistent, high-quality prompts in our setup. While more ablations are valuable, we believe the risk of performance degradation is minimal. We will clarify this and note it as future work.
  3. We acknowledge the limitation of not directly evaluating real-world clinical utility. However, our validation sets (VinDr-CXR, CheXpert, PadChest, and Open-I) are manually annotated by board-certified radiologists and excluded from training. OFF-CLIP demonstrates strong performance on zero-shot classification, achieving balanced false positives and false negatives, with overall AUCs ranging from 0.80 to 0.90. While this does not fully reflect real-world deployment, we believe it serves as a strong proxy and highlights OFF-CLIP’s potential clinical applicability.

[1] Kim, K. et al. Integrating ChatGPT into secure hospital networks: A case study on radiology report analysis. Conference on Health Inference and Learning, pp. 72–87, PMLR, 2024.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


