Abstract

How do we transfer Vision-Language Models (VLMs), pre-trained in the source domain of conventional echocardiography (Echo), to the target domain of portable Echo with only few-shot fine-tuning? Learning image causality is crucial for few-shot learning in portable echocardiography quality assessment (PEQA), because causal and topological consistency is domain-invariant. However, the significant domain shift and the scarcity of well-labeled data in PEQA make it challenging to obtain reliable measurements of image causality. We investigate the central challenge of this task: learning a consistent representation of domain-invariant causal semantic features. We propose a novel VLM-based PEQA network, Causality-Adapting Visual Scoring CLIP (CausCLIP), which embeds causal disposition to measure image causality for domain-invariant representation. Specifically, the Causal-Aware Visual Adapter (CVA) identifies hidden asymmetric causal relationships and learns interpretable, domain-invariant causal semantic consistency, thereby improving adaptability. Visual-Consistency Contrastive Learning (VCL) focuses on the most discriminative regions by registering visual-causal similarity, enhancing discriminability. Multi-granular Image-Text Adaptive Constraints (MAC) adaptively integrate task-specific, multi-granular semantic information, enhancing robustness in multi-task learning. Experimental results show that CausCLIP outperforms state-of-the-art methods, achieving absolute improvements of 4.1%, 9.5%, and 8.5% in the view category, quality score, and distortion metrics, respectively.
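
The three modules above slot into an adapter-augmented CLIP pipeline with one prediction head per metric (view, quality, distortion). The sketch below shows one plausible wiring in PyTorch; since no code is released (see below), every class name, shape, and layer choice here is an illustrative assumption rather than the authors' implementation.

```python
# Minimal sketch of a CausCLIP-style pipeline. All names and shapes are
# assumptions for illustration; this is NOT the authors' released code.
import torch
import torch.nn as nn

class CausalVisualAdapter(nn.Module):
    """Hypothetical CVA: re-weights CLIP patch tokens by a learned causality score."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-patch causal score

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, k*k, dim) grid of visual tokens from the CLIP image encoder
        causality = torch.sigmoid(self.score(patches))  # (B, k*k, 1)
        return patches * causality                      # emphasize causal regions

class CausCLIPSketch(nn.Module):
    """Three task heads over causally re-weighted features: view, quality, distortion."""
    def __init__(self, dim: int = 512, n_views: int = 5, n_scores: int = 5, n_dists: int = 6):
        super().__init__()
        self.adapter = CausalVisualAdapter(dim)
        self.view_head = nn.Linear(dim, n_views)
        self.score_head = nn.Linear(dim, n_scores)
        self.dist_head = nn.Linear(dim, n_dists)

    def forward(self, patches: torch.Tensor):
        feats = self.adapter(patches).mean(dim=1)  # pool the adapted tokens
        return self.view_head(feats), self.score_head(feats), self.dist_head(feats)

model = CausCLIPSketch()
patches = torch.randn(2, 49, 512)  # stand-in for a 7x7 grid of CLIP patch embeddings
view_logits, score_logits, dist_logits = model(patches)
```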

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0745_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_CausCLIP_MICCAI2024,
        author = { Li, Yiran and Cui, Xiaoxiao and Cao, Yankun and Zhang, Yuezhong and Wang, Huihui and Cui, Lizhen and Liu, Zhi and Li, Shuo},
        title = { { CausCLIP: Causality-Adapting Visual Scoring of Visual Language Models for Few-Shot Learning in Portable Echocardiography Quality Assessment } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a method for predicting image quality in portable cardiac ultrasound images. It begins by fine-tuning CLIP on conventional ultrasound images, utilizing text prompts that describe the classification of echocardiography views, assessment of quality scores, and identification of ultrasound distortions. Subsequently, a new module is applied to discover hidden asymmetric causal relationships between text and vision features. These causal features then augment and adapt the visual and textual data when training on the new domain (portable ultrasound). This is achieved using a so-called “multigranular loss function,” which, as I understand it, is composed of text and image components that receive higher loss weights if identified as causal features. The paper evaluates this method using an internal benchmark for image quality of portable cardiac ultrasound and compares it against existing image quality methods in computer vision.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (*) The paper addresses an interesting problem of domain adaptation from conventional to portable ultrasound imaging.

    (*) It proposes the use of CLIP with augmented text prompts as a solution, which is a novel approach for this specific application.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (*) The paper is difficult to follow, filled with acronyms, and lacks intuitive explanations and domain-specific examples. Instead of stating the summary advantages at the end of each sub-section, I recommend using more intuitive terms and providing examples. For instance, no examples of asymmetric causal relationships from the echocardiography domain are provided. Consequently, I am either unconvinced or do not fully understand why discovering causal relationships is crucial and how it is not included in the standard CLIP fine-tuning.

    (*) Figure 2 resembles a poster rather than an overview figure. I suggest simplifying it to make it more intuitive. The flow of the figure is unclear, including what follows what, and the brief caption provides little clarification.

    (*) The roles of few-shot and zero-shot learning are not well-explained in the paper, which should be an integral part of the method description. I found myself needing to read the paper several times to grasp these concepts.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    see below my question about data

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (*) Do you also plan to release the conventional and portable ultrasound data alongside the code?

    (*) Can you define and elaborate more on the “multigranular anchor” and “multigranular loss function”? I believe these terms should also be briefly included in the paper for clarity.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presentation of the method lacks clarity, and I am not fully convinced about its practical utility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I thank the authors for their response, and I appreciate the discussion with all other reviewers. As highlighted in my earlier review, while I recognize some technical contributions in this paper, I find it to be poorly written and organized. A major revision is necessary, including the addition of intuitive examples that support the main claim—that asymmetric causal relationships need to be explicitly incorporated into the model (although I would argue that these could potentially be learned during training). Improved presentation and organization of the content could make the paper more convincing.



Review #2

  • Please describe the contribution of the paper

    While the paper addresses the important issue of quality assessment in portable echocardiography, pinpointing its contribution in its current form is somewhat challenging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    While the paper addresses the important issue of quality assessment in portable echocardiography, pinpointing its primary strengths in its current form is somewhat challenging.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a discrepancy between the title and the actual content of the paper. Despite being titled as a paper on visual language models, the content predominantly focuses on visual processing, with limited information on language processing. This inconsistency raises some questions.

    The paper repeatedly mentions “inter-image causality” without providing a clear definition or explanation. Additionally, it is unclear how the “k*k causality map” mentioned in section 2 relates to this concept.

    The caption for Figure 1 is somewhat confusing as it lacks sufficient information about the contents of the figure. It repeats the first sentence from the introduction section and introduces the term “domain-invariant causal semantic consistency,” which is not clearly defined within the context of the caption.

    The author claims to have improved discriminability and collaborative optimization as part of their contributions. However, it’s unclear how these terms are specifically applied or what they entail within the context of the paper.

    Section 2 of the paper contains numerous mathematical notations that lack clear explanations.

    The order of components in Figure 2 appears unusual, as “a” comes after “b”, and the caption does not provide sufficient detail about the contents of the figure.

    The concept of “interpretable causal features” mentioned in Section 2 is unclear regarding whether it is the result of the author’s intervention or Grad-CAM.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I suggest that the author should provide sufficient information about the language processing aspect, similar to the level of detail provided for the visual components. Additionally, I’ve outlined the weaknesses of the paper as constructive comments aimed at helping the author enhance their work. I recommend that the author carefully review these comments for potential improvements.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The decision predominantly hinges on the paper’s lack of clarity, as elucidated in the weaknesses I’ve outlined.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Some of the comments have been addressed; however, a clear explanation of collaborative optimization with respect to MAC and Tab. 2, as the authors pointed out, will be needed in the final manuscript. Also, as other reviewers pointed out, the overall clarity and organization of the paper, especially the figures and their captions, need some improvement.



Review #3

  • Please describe the contribution of the paper

    The paper describes a novel method to learn domain-invariant causal semantic features during vision-language model training, to overcome limitations when fine-tuning on a new limited dataset. The method was validated on the task of training for VQA on conventional ultrasound and fine-tuning on portable ultrasound. The method was benchmarked against other methods and showed superiority.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The quantitative results look very promising, given that even the zero-shot configuration outperforms many of the other methods. Also, the dataset size is comparatively large, and it would be great if it could be made available for further research.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some details about the data and experiment design were not quite clear and the Figures could be improved/made more readable. Unfortunately the code and the data will not be published.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is not quite clear whether all the implementation details are provided to fully reproduce the results (i.e., the causality map), especially without any data published. However, the authors described the training process with the hyperparameters.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In general, this is an interesting paper, and exploring causal relationships for better classification is regarded as an important topic. I have some questions and remarks:

    • The relationship between the quality, the view, and the distortion could be described in a bit more detail.
    • Both figures take quite some time to understand. I would recommend simplifying Fig. 1 a bit and making the different steps (and where they belong) clearer in Fig. 2.
    • How were the 16,000 images labeled, were they labeled manually? Labeling ultrasound image quality is an extremely subjective task, how did you ensure the quality?
    • What happens if there are multiple distortions?
    • As far as I understood the negative vision prompts are other views? Are the other components besides the view in the negative text prompts augmented?
    • Can the authors elaborate more on how different the portable echo is compared to standard echo? Are the quality attributes directly transferable?
    • Minor: Intro (p.2 first section): “Can PEQA benefit from specific knowledge during the transfer learning? Therefore, finding a causality-driven domain transfer method is crucial …” this is not the answer to the question
    • Minor: Multiple typos and grammar errors, please revise (CauseClip, on on etc)
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are some questions open but no major drawbacks and the results seem to be promising and important to the community.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Some of the remarks were addressed by the authors; however, I agree with the other reviewers that the concept presentation should be made clearer prior to publication, for example by improving the figures as pointed out.




Author Feedback

Thanks to the reviewers for your valuable comments on our work. In particular, we thank R4 for directly accepting the paper. We appreciate your recognition of:

  1. Our contributions (R1: “a novel approach for this specific application”; R4: “show superiority”; R4: “promising and important to the community”)
  2. Sufficient experiments (R4: “the quantitative results look very promising”)

We will release the code, datasets, and detailed implementation for reproducibility to address all constructive comments.

Novelty: This is the first time that asymmetric causality has been achieved in quality assessment. It comprises two aspects and provides clinical interpretability:

  1. Asymmetric causality: the interaction between statistical features and the real topological structure. The invariant topology generates consistency under feature interventions. For example, while the A3C view necessarily includes the aorta, the presence of the aorta alone does not confirm the view as A3C, since the A5C view also displays the aorta. This asymmetry helps infer the determinative features of the A3C view, beyond just the presence of the aorta (see the toy sketch after this list).
  2. Domain-invariant consistency: the relationship between causal semantics and cross-domain features, enabling generalization to new distributions. For example, the aorta in the A3C view is barely visible in low-quality portable ultrasound, but its presence can still be determined thanks to its consistency with high-quality conventional ultrasound.
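
The asymmetry in point 1 can be made concrete with conditional frequencies: P(aorta | A3C) is 1, while P(A3C | aorta) is below 1 because the A5C view also shows the aorta. The toy Python snippet below uses made-up (view, has_aorta) annotations purely to illustrate this directional difference:

```python
# Toy illustration of the asymmetric relationship in the A3C/aorta example.
# The sample annotations are invented; only the asymmetry pattern matters.
samples = [("A3C", True), ("A3C", True), ("A5C", True), ("A2C", False), ("A4C", False)]

def cond_prob(cond, event):
    matching = [s for s in samples if cond(s)]
    return sum(event(s) for s in matching) / len(matching)

p_aorta_given_a3c = cond_prob(lambda s: s[0] == "A3C", lambda s: s[1])  # 1.0
p_a3c_given_aorta = cond_prob(lambda s: s[1], lambda s: s[0] == "A3C")  # ~0.67
print(p_aorta_given_a3c, p_a3c_given_aorta)  # the two directions differ
```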

Q1: Why is discovering causal relationships crucial (R1)? A1: Causality is the key factor in cross-domain topological consistency; without it, the domain mismatch compounded by quality variations can hardly be resolved. Our experiments demonstrate its importance.

Q2: How is it not included in standard CLIP fine-tuning (R1)? A2: Standard CLIP optimizes statistical relationships but does not model asymmetric causal relationships.

Q3: Clarifications about multi-granular anchor and multi-granular loss function (R1): A3:

  • The multi-granular anchor addresses the training imbalance across text prompts. We set anchors based on three key metrics (category, quality, and distortion) and create additional multi-granular text prompts to balance multi-task learning.
  • The multi-granular loss function adaptively adjusts the weight of each prompt, ensuring alignment with causality (a minimal sketch follows this list).
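
As a rough illustration of A3, one way such an adaptively weighted multi-granular loss could look is the learnable-uncertainty weighting below (one weight per granularity, Kendall-style). This specific scheme is an assumption for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularLoss(nn.Module):
    """Hypothetical adaptive weighting over category/quality/distortion terms."""
    def __init__(self, n_tasks: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # one learnable weight per granularity

    def forward(self, logits, targets):
        total = torch.zeros(())
        for i, (z, y) in enumerate(zip(logits, targets)):
            w = torch.exp(-self.log_vars[i])  # adaptive per-task weight
            total = total + w * F.cross_entropy(z, y) + self.log_vars[i]
        return total

loss_fn = MultiGranularLoss()
logits = [torch.randn(4, n) for n in (5, 5, 6)]            # category, quality, distortion heads
targets = [torch.randint(0, n, (4,)) for n in (5, 5, 6)]
print(loss_fn(logits, targets))
```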

Q4: Clarification about the image labels, multiple distortions, negative vision prompt and domain differences (R4): A4:

  • Two expert sonographers performed the labeling and established a rigorous quality-control protocol to ensure objectivity.
  • When multiple distortions are present, we select the predominant one, as stated in the 8th sentence in Sect. 2.
  • The negative vision prompts are generated from views different from the input image.
  • Portable Echo provides less clear detail, but its quality attributes are transferable.

Q5: Limited information on language processing (R5): A5: Language processing is crucial to our framework: it establishes the first prompt pattern specific to echocardiography quality assessment and focuses on the interaction of key metrics. For example, an image with “offset” distortion is accompanied by “bad” quality, as stated in the 11th sentence in Sect. 2.
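
To make the prompt pattern more tangible, the templates below show one possible shape of such multi-granular echo prompts; the exact wording used in the paper is not given on this page, so these strings are purely illustrative:

```python
def build_prompts(view: str, quality: str, distortion: str) -> list[str]:
    """Hypothetical multi-granular prompt templates: one per metric plus a combined one."""
    return [
        f"an echocardiogram of the {view} view",                                    # category
        f"a {quality} quality echocardiogram",                                      # quality
        f"an echocardiogram with {distortion} distortion",                          # distortion
        f"a {quality} quality {view} echocardiogram with {distortion} distortion",  # combined
    ]

# The "offset implies bad quality" coupling mentioned above:
print(build_prompts("A3C", "bad", "offset"))
```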

Q6: Clarification about the k×k causality map (R5): A6:

  • The k×k causality map represents the probability of a feature appearing at a specific location, corresponding to the invariant topology in causality. For example, the aorta appears on the right of the A3C view with 100% probability, but with 0% probability in the A2C view (see the sketch below).
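
Under that description, a k×k causality map could be estimated simply as the per-cell frequency with which a structure is detected across a view's images. The NumPy sketch below is a guess at such an estimator, with fabricated detection masks just to reproduce the 100%-vs-0% pattern of the aorta example:

```python
import numpy as np

def causality_map(detections: np.ndarray) -> np.ndarray:
    """detections: (N, k, k) binary masks marking where a structure (e.g. the
    aorta) was found in N images of one view. Returns per-cell frequencies."""
    return detections.mean(axis=0)  # (k, k) probabilities in [0, 1]

k = 7
a3c = np.zeros((100, k, k)); a3c[:, :, -1] = 1  # aorta always in the rightmost column
a2c = np.zeros((100, k, k))                     # aorta never detected
print(causality_map(a3c)[:, -1].mean())  # 1.0 in A3C
print(causality_map(a2c).max())          # 0.0 in A2C
```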

Q7: Clarification about discriminability and collaborative optimization (R5): A7:

  • Fig. 3 supports discriminability by visualizing our CausCLIP, which focuses on the most discriminative regions and is more sensitive to detail features.
  • Tab. 2 supports the collaborative optimization of MAC in achieving balanced training across metrics. Removing MAC leaves “category” unchanged, while “quality” and “distortion” decline significantly.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Borderline accepted paper. The authors should comprehensively revise the manuscript according to the reviewer’s suggestions.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a method for predicting image quality in portable cardiac ultrasound images by fine-tuning CLIP on conventional ultrasound images using text prompts for classification, quality assessment, and distortion identification. It introduces a novel module to uncover hidden asymmetric causal relationships between text and vision features, enhancing the training process on portable ultrasound images through a “multigranular loss function.” Despite its innovative approach and promising quantitative results, there are some weaknesses including that the paper suffers from unclear explanations, numerous acronyms, and a lack of intuitive examples and detailed performance evaluation. There is also a discrepancy between the title and content, with a predominant focus on visual processing and limited information on language processing. Given the strengths and weaknesses of this paper, I would suggest accepting this paper.



