Abstract

Echocardiography is a critical imaging technique for diagnosing cardiac diseases, requiring accurate view recognition to support clinical analysis. Despite advancements in deep learning for automating this task, existing models face two major limitations: they support only a limited number of cardiac views, insufficient for complex cardiac diseases, and they inadequately handle out-of-distribution (OOD) samples, often misclassifying them into generic categories. To address these issues, we present EchoViewCLIP, a novel framework for fine-grained cardiac view recognition and OOD detection. Built on our collected large-scale dataset annotated with 38 standard views and OOD data, EchoViewCLIP integrates a Temporal-informed Multi-Instance Learning (TML) module to preserve temporal information and identify key frames, along with a Negation Semantic-Enhanced (NSE) Detector to effectively reject OOD views. Additionally, we introduce a quality assessment branch to evaluate the quality of detected in-distribution (ID) views, enhancing the reliability of echocardiographic analysis. Our model achieves 96.1% accuracy on the 38-view recognition task. The code is available at https://github.com/xmed-lab/EchoViewCLIP.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4443_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/xmed-lab/EchoViewCLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{SonSha_EchoViewCLIP_MICCAI2025,
        author = { Song, Shanshan and Qin, Yi and Yang, Honglong and Huang, Taoran and Fei, Hongwen and Li, Xiaomeng},
        title = { { EchoViewCLIP: Advancing Video Quality Control through High-performance View Recognition of Echocardiography } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose EchoViewCLIP, a model for echocardiography view classification and out-of-distribution (OOD) detection. It classifies echocardiographic videos into 38 standard views, detects potential OOD views, and assesses the quality of in-distribution views. The model is trained and evaluated on an in-house labeled dataset. EchoViewCLIP leverages a CLIP-based vision-language (ViL) model combined with multi-instance learning to incorporate temporal information across the full video. To detect OOD views, the authors introduce negation semantic-enhanced learning. The model achieves strong performance on their in-house dataset and is also evaluated on the external CAMUS dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a somewhat novel and creative framework, with the commendable aim of not only classifying echocardiographic views but also performing OOD detection and quality assessment. The results on the in-house dataset are promising when compared to the state-of-the-art baseline. It is positive that they plan to share their code and models. The ablation studies are informative and clearly demonstrate the advantages of the individual components of the proposed framework.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is unclear how the method actually performs OOD detection. To me, it reads as though the authors treat OOD detection as a classification problem, simply replacing the “Other” class with “OOD” — a strategy they themselves critique in the introduction: “Second, existing models struggle to handle OOD videos that deviate from predefined view categories (see Fig. 2). These models often misclassify them into arbitrary classes or assign them to a generic ‘Others’ class. This introduces noise during training and undermines model reliability in real-world scenarios, where non-standard views are frequently encountered.” While the authors introduce a Negation Semantic-Enhanced (NSE) module, it remains unclear how this is actually utilized for OOD detection. The paper should clarify how the proposed approach differs from simply assigning videos to an “OOD/Other” class. Although the authors claim that the NSE module enables OOD rejection by leveraging negation semantics, the implementation appears to operate at a per-class level by computing sim_pos × (1 − sim_neg) for each predefined class. This still yields class-wise scores and does not inherently support OOD detection unless a rejection mechanism (e.g., thresholding) is explicitly introduced — which the paper does not describe. As it stands, the approach seems more like a re-weighting of class confidence scores than a principled method for identifying truly out-of-distribution inputs. It is also unclear why the quality assessment module is not used as an additional signal for OOD detection. (A minimal sketch of the re-weighting-versus-rejection distinction appears after this list.)

    • Regarding quality assessment, the paper provides very limited information. It is unclear what the quality module outputs, how it is trained, or what loss function is used.

    • The evaluation of OOD detection compared to state-of-the-art baselines is also poorly explained. It is difficult to interpret the meaning of this sentence in Section 3: “For OOD detection, it can be noted that for models lacking inherent OOD detection capabilities, we followed previous methods by adding additional OOD training data.” What does this entail in practice? Are the baseline models trained with additional labeled OOD samples? If so, how is fairness ensured in the comparison?

    • Figure 1 is somewhat confusing. It does not clearly illustrate how the outputs from the different visual experts are used for view classification, OOD detection, and quality assessment. This is not clearly explained in the text either.

    • The evaluation of the quality assessment module on the CAMUS dataset is difficult to interpret from Figure 3. The statement “As shown in Fig. 3, the results demonstrate that leveraging our high-performance model and its feature distribution significantly improves the effectiveness of the ID view quality control task, particularly for distinguishing between ‘good’ and ‘poor’ quality views.” is not obviously supported by the figure. What is being tested, and how can we observe this improvement? The setup is vague. The only clarification is: “Furthermore, we evaluated the quality control ROC curves on the public CAMUS quality classification dataset, comparing two approaches: direct classification and utilizing our view recognition features (hpos, hneg, and TML weights).” However, the paper does not specify which model is used in each case, what the CAMUS quality labels look like, or how the comparison was carried out. These details are necessary for interpreting the results.

    • The authors should evaluate the view classification performance on a public benchmark dataset.
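    To make the distinction the reviewer draws concrete, here is a minimal sketch in Python/PyTorch contrasting class-wise re-weighting with an explicit rejection threshold. The per-class formula sim_pos × (1 − sim_neg) is the one quoted in the review; the random similarities, the shapes, and the 0.5 threshold are illustrative assumptions (the authors' rebuttal below states that a threshold on the summed scores is used).

```python
# Illustrative only: hypothetical per-class scores for a 38-view classifier.
# Contrasts (a) treating s = sim_pos * (1 - sim_neg) purely as re-weighted
# class confidences with (b) adding an explicit rejection threshold, which is
# what turns the scores into an OOD detector. Values and threshold are
# assumptions for illustration, not the paper's verified implementation.
import torch

torch.manual_seed(0)
num_classes = 38

# Hypothetical similarities in [0, 1] between one video and each class's
# positive / negated text prompts.
sim_pos = torch.rand(num_classes)
sim_neg = torch.rand(num_classes)

# Per-class scores as written in the review: sim_pos * (1 - sim_neg).
scores = sim_pos * (1.0 - sim_neg)

# (a) Re-weighting only: every input is forced into one of the 38 views.
pred_class = int(scores.argmax())

# (b) Rejection mechanism: accept the prediction only if the evidence for
# some in-distribution view is strong enough; otherwise flag the video as OOD
# (the rebuttal describes a 0.5 threshold on the summed scores).
threshold = 0.5
total = scores.sum().item()
is_ood = total < threshold

print(f"argmax class: {pred_class}, summed score: {total:.3f}, OOD: {is_ood}")
```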

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Strange sentence in Sec. 1 Introduction: “Among these, ViFi-CLIP [14] achieves SOTA performance by averaging pooled frame-level features and finetuned on CLIP [13].“

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though the framework appears novel, creative, and well-performing, there are major ambiguities that need to be addressed before I could consider recommending anything other than rejection. There are several methodological uncertainties, particularly regarding how the outputs from the different visual experts or modules are used. For instance: how is OOD detection actually carried out, and what does the quality assessment module output? Both the text and Figure 1 need to be clarified to explain these components more clearly. Moreover, the method should preferably be evaluated on a public echocardiography view classification dataset, especially if the authors do not plan to release their own dataset. While there are some attempts to evaluate on CAMUS (for quality assessment), these are difficult to interpret.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the main concerns I raised in my review. They clarified the OOD detection mechanism, and explained the role and training of the quality assessment module. These responses resolve several key ambiguities, and I appreciate that the authors explicitly state they will incorporate these clarifications into the revised manuscript. However, some elements, such as architectural flow and the quality control results, could still benefit from clearer presentation. Assuming the paper is revised to reflect the rebuttal clarifications, I believe the work makes a meaningful contribution to the area of echocardiographic view classification and OOD detection. I am therefore recommending acceptance.



Review #2

  • Please describe the contribution of the paper
    • The paper proposed a novel framework, EchoViewCLIP, designed to recognise an extended set of 38 cardiac views while effectively detecting out-of-distribution (OOD) samples from echocardiographic videos.

    The authors integrated two key modules within EchoViewCLIP: a Temporal-informed Multi-Instance Learning (TML) module to effectively identify key frames within each cardiac video and a Negative Semantic-Enhanced Detector (NSED) module to classify whether the video belongs to an out-of-distribution (OOD) category.

    • The proposed EchoViewCLIP also outputs a quality indicator for in-distribution cardiac views, helping assess the reliability of the view classification.

    • Apart from the methodology, the paper also contributes a large-scale echocardiography video dataset comprising 20,617 videos, each annotated with one of 38 view labels and OOD indicators. The annotations were performed by nine experienced cardiologists, enhancing the dataset’s reliability and clinical relevance.

    • The proposed EchoViewCLIP achieved superior performance in cardiac view recognition and competitive performance in OOD detection, outperforming or matching state-of-the-art image- and video-based methods in both tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of formulating cardiac view classification (typically static) as an action recognition task (dynamic) is insightful, as it leverages the temporal context across frames. This approach helps the model become more robust against failures caused by a single noisy or ambiguous frame, which is particularly relevant given that cardiac videos are typically available in clinical practice.

    • The justification for the proposed components, TML and NSE modules, within the EchoViewCLIP framework is well-supported by the ablation study, which provides strong validation of their necessity and contributions to the overall performance.

    • The curated large-scale dataset is a notable strength of the paper, subject to making it publicly available, as it could serve as a valuable benchmark for future research in echocardiographic view classification and OOD detection.

    • The performance evaluation against a broad range of image- and video-based state-of-the-art methods clearly reflects the transparency and credibility of the claimed superiority of the proposed framework.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The name of the proposed model includes “CLIP,” but there is no direct explanation or connection to the original CLIP model (Contrastive Language-Image Pre-training) in the paper. It would be helpful to clarify the reasoning behind using the “CLIP” suffix in the model’s name for readers without background knowledge of CLIP.

    • The paper mentions that the model classifies 38 standard cardiac views, but the actual list of these views is not provided. Including the full set of view labels would significantly improve the clarity and reproducibility of the study, as readers would better understand the classification task and its clinical relevance.

    • The proposed EchoViewCLIP has presentation issues; addressing these would provide more comprehensive insight into the model (a hypothetical sketch of one possible reading of the TML notation appears after this list).
      – For TML, it is not clear how it utilizes multiple instances. Can a single video be labelled with multiple views?
      – The notations used in the equations and the diagram are not explained and are difficult to grasp.
      – In Section 2.1, “frame-wise visual representations h_pos = {h_1, h_2, ..., h_T}…”, what does “pos” refer to: positive or position?
      – What are the dimensions of h_pos, q, M, f_pos, k_pos, and t_pos?
      – How do the overall temporal semantic features t_pos differ from ViFi-CLIP’s average-pooled frame-level features?
      – It is unclear how the concatenation C(k_pos, t_pos) takes place.
      – Equation (2) shows the computation of f_pos, but it is hard to connect to the method diagram, and the data flow between components is not fully understandable.
      – Some notations in the diagram, such as E_pos, E_neg, or TML_slc, have no explanation.
      – I am guessing C represents the number of views in Equation (4), but a clear definition should be presented.
      – Although TML is considered one of the novelties of the paper, no architectural information is provided.
      – NSE is also a major component of focus, but it is missing from the diagram; similarly, no architectural information is provided.
      – In the diagram, two-stage training is depicted with different colored arrows, but an explanation of how the model is trained in two stages is missing from the text.

    • Manuscript issues that need to be addressed:
      – The AUROC for OOD recognition of EchoViewCLIP is stated as 0.992 in the text but 0.993 in the table.
      – Inconsistent abbreviation: NSE or NSED?
      – The full form of TML or NSE followed by the abbreviation in parentheses is used quite frequently. Using the abbreviation after the first mention would save space and allow other information to be added.
      – In-distribution is abbreviated as ID in the caption of Fig. 2 without being defined first; the definition only appears later in “ID Quality Control Analysis”.
      – Fig. 1 appears without being referenced in the text; referring to it would clarify what the reader is expected to see in the figure.
      – Please check for simple grammar mistakes, e.g., “Videos that does not” in Fig. 1 should be “Videos that do not”.
      – In the introduction, “Second, existing models struggle to handle OOD videos that deviate from predefined view categories (see Fig. 2)”: did you mean Fig. 1?
      – In “We introduce two specialized visual experts to learn positive and negative semantics independently and NSE text branch…”, should it be ‘and’ or ‘in’ before “NSE text branch”?
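    As a reading aid for the notation questions above, here is a purely hypothetical sketch (Python/PyTorch) of how a TML-style module might weight frame features instead of average pooling, and how f_pos = C(k_pos, t_pos) could be formed. All dimensions, the attention form, and the temporal summary are assumptions for illustration, not the authors' verified architecture; the rebuttal only confirms that TML weights frame features rather than averaging them.

```python
# Hypothetical reconstruction (not the paper's verified design) of a
# TML-style weighting of frame features, illustrating one plausible reading
# of h_pos, q, k_pos, t_pos, and f_pos = C(k_pos, t_pos).
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 32, 512                       # T frames, D-dim frame embeddings (assumed)
h_pos = torch.randn(T, D)            # frame-wise visual features {h_1, ..., h_T}

# ViFi-CLIP-style baseline: plain average pooling over frames.
t_avg = h_pos.mean(dim=0)

# TML-style alternative: a learnable query q scores each frame, and the
# resulting weights select "key" frames via a weighted sum.
q = nn.Parameter(torch.randn(D))
frame_weights = F.softmax(h_pos @ q / D ** 0.5, dim=0)      # shape (T,)
k_pos = frame_weights @ h_pos                                 # key-frame summary, (D,)

# One possible temporal summary t_pos (here simply the mean; the paper's
# exact temporal branch may differ).
t_pos = h_pos.mean(dim=0)

# f_pos = C(k_pos, t_pos): concatenate and project back to D dimensions.
proj = nn.Linear(2 * D, D)
f_pos = proj(torch.cat([k_pos, t_pos], dim=-1))               # shape (D,)

print(f_pos.shape, frame_weights.shape)
```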

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented model demonstrates potential in methodological contribution, with strong motivation and relevance to clinical practice. However, the clarity in presenting the model components and their novelty remains limited, making it difficult to assess the true extent of its innovation. Addressing the concerns outlined in the major weaknesses—particularly with regard to architectural explanations and data flow—would help eliminate ambiguity and provide the transparency needed to better evaluate the novelty and effectiveness of the proposed framework.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thank you for the rebuttal and the efforts. While I appreciate the clarifications provided for a few comments, I find that the core concerns I raised regarding the model architecture and data flow were not addressed. Although the authors emphasize the novelty of their method in the paper, the architectural contribution and its practical implications are not clearly or convincingly articulated. As a result, it remains difficult to assess the true innovation of the work based on the current explanation.

    Given these unresolved issues, I believe that a major revision would be required to bring the paper to a publishable standard, particularly with regard to the transparency of the architecture and the articulation of methodological novelty. Unfortunately, such a substantial revision may fall outside the scope of what is feasible during the MICCAI rebuttal phase.

    Therefore, I encourage the authors to take the time to revise the manuscript more extensively, with a focus on improving architectural clarity, data flow explanation, and overall readability. I believe these efforts will significantly strengthen the quality and impact of the work for future submissions.



Review #3

  • Please describe the contribution of the paper

    The paper presents EchoViewCLIP, a novel framework for cardiac view recognition and out-of-distribution (OOD) detection in echocardiography videos. The main contributions include a Temporal-informed Multi-Instance Learning (TML) module for capturing temporal dynamics and identifying key video frames for precise view recognition, and a Negation Semantic-Enhanced (NSE) Detector to effectively reject OOD views. Additionally, the framework integrates an in-distribution quality assessment branch to evaluate video quality and enhances echocardiographic analysis reliability. The model supports a wider range of cardiac views (38 views) than prior models and achieves a 96.8% accuracy, and upon acceptance, the code and model will be released open-source.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel Formulation: The introduction of the TML module helps preserve temporal dynamics, which is crucial for effective view classification in echo videos. This module allows the model to focus on key frames, improving the accuracy and robustness of video categorization by leveraging spatiotemporal information.
    • Handling Out-of-Distribution Data: The NSE Detector is an innovative approach for OOD detection, utilizing semantic contrast to effectively reject non-standard views, thereby reducing noise and enhancing model reliability in practical scenarios.
    • Large-scale Dataset: The authors curated a comprehensive dataset of echocardiography videos annotated with 38 view classes and OOD samples, labeled by expert cardiologists, providing a robust benchmark for training and validation, aligning with real-world clinical scenarios.
    • State-of-the-Art Performance: EchoViewCLIP achieves superior performance over existing models in both view recognition and OOD detection tasks, demonstrating clinical feasibility and reliability.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Novel Application Context: While the application to echocardiography view recognition is novel, it might face challenges in generalizing across different types of medical imaging scenarios where temporal features and OOD detection might not be equally applicable.
    • Complexity of Implementation: The introduction of multiple modules (TML, NSE, and quality assessment) may lead to increased model complexity, potentially making it harder to translate into scalable clinical workflows without significant computational resources.
    • Limited Comparative Study on Quality Assessment: Though the model includes quality assessment capabilities, there is limited attention devoted to comparing this aspect against existing techniques for echocardiographic quality evaluation.

    • Quality Review (QR) Block Integration:
    • The QR block is a crucial component for ensuring comprehensive video quality assessment but is not adequately reflected throughout the entire pipeline. A more systematic integration and evaluation within both the QR block and the in-house test set should be included. The QR evaluation is presented primarily through ROC curves. Additional metrics, such as precision, recall, and F1 score, should be provided to offer a more complete assessment of the QR block’s effectiveness and accuracy.

    • Public Dataset Mention:
    • The use of the CAMUS dataset is referenced, but details about its role, relevance, and how it interacts with EchoViewCLIP’s evaluation are insufficiently covered. A more thorough discussion on why this dataset was selected and how it complements the in-house dataset is necessary.

    • Text Encoding Justification:
    • The necessity of text encoding in the design should be clarified. While it helps bridge video data with semantic labels, the underlying logic for integrating text encoding and how it enhances the view recognition task needs further explanation.

    • Significance of Add-On Modules and Statistical Tests:
    • Improvements from additional modules in Table 2 should be backed by statistical significance testing or cross-validation results indicating robust performance gains.
    • In Table 1, it should be stated whether statistical tests (such as t-tests or ANOVA) were conducted to validate claims that EchoViewCLIP significantly outperforms other state-of-the-art approaches. This data point would add rigor to comparative claims.

    • Figures Readability and Interpretation:
    • The figures, especially Fig. 4, are noted to be small and difficult to interpret. Larger, more detailed diagrams are needed to aid reader comprehension.
    • The baseline method in Fig. 4 is not identified, nor is the rationale provided for its selection. Clearer justification for choosing this particular baseline and showing how temporal weights influence the depicted results will improve interpretability.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation for acceptance is primarily driven by the paper’s innovative approach to view recognition and OOD detection, aspects that are both novel and clinically important. The strength of the proposed modules (TML and NSE) enhances cardiac imaging analysis, and the comprehensive dataset supports its applicability in real-world settings. Despite some implementation complexity, the performance gains and the promise of open-source code further support the recommendation. Moreover, the significant improvement over existing models solidifies its contribution to medical imaging and echocardiography research domains. However, my concerns need to be addressed clearly before I can make a final decision.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Thanks for addressing my concerns.




Author Feedback

We sincerely thank the reviewers for their valuable feedback. The reviewers praised our motivation as insightful and commendable, our method as novel and creative, and our experiments as informative, robust, and well-executed, highlighting the use of a large-scale dataset and promising results. The main concerns involved method details and paper clarity. Below, we provide point-by-point responses, which will be included in our revision.

[How OOD Detection is Performed, R1] (1) We would like to clarify that ours does not treat OOD detection as an N+1 (“others”) classification; we introduce a thresholding rejection mechanism. Specifically, NSE rejects OOD samples by applying a threshold to the sum of class-wise scores: sum(per_class(sim_pos × (1 − sim_neg))) < 0.5 (this threshold can be adjusted as needed). (2) Quality assessment mainly focuses on ID samples; therefore, it is not used as an additional signal for OOD detection.

[OOD Evaluation, R1] Ours effectively rejects OOD samples using only ID training data and performs better, whereas the other methods require OOD-specific supervision (our model trained with OOD data is shown in Ablation (b)).

[Quality Assessment Setting, R1] The quality assessment branch outputs a quality label (good, middle, poor). It is trained as a classification task using doctor-annotated labels (see Dataset) and a cross-entropy loss.

[Public Benchmark, R1] To the best of our knowledge, there is no publicly available dataset that supports both ID (more than 10 classes) and OOD view classification.

[CAMUS’s Role, R2] CAMUS is used to address potential subjectivity in manual quality scoring on the in-house data; we included it for further validation to ensure robustness.

[The CAMUS Evaluation, R1, R2] (1) The improvements are shown in the lower-right of Fig. 3: good (+0.03), middle (+0.01), and poor (+0.02). (2) Both CAMUS studies are based on our view classification model; the difference is whether view recognition features from our ID and OOD experts are used during QC expert fine-tuning. This aims to validate that accurate view classification can improve quality control. The output is categorized as good, middle, or poor.

[Limited Study on Quality Assessment, R2, R1] Our main focus is SOTA ID and OOD view classification, and showing that ID quality control can also be enhanced when view classification is accurate. The quality control experiments are meant to validate this argument rather than to propose another dedicated quality-control method. More validation will be included in the revision if permitted.

[Complexity of Implementation, R2] Our model’s size is equivalent to the CLIP-based video model (1029 MB), and each part is modularized for clinical needs. Hence, it can be translated into an efficient clinical workflow.

[Different Scenarios, R2] This work focuses on the important problem of multi-view echocardiography analysis; generalizing to more scenarios would be a promising future direction.

[CLIP and the Necessity of Text Encoding, R3, R2] Our framework is a CLIP-based model, and we will add an introduction of CLIP. Text encoding is necessary because labels include both the view type and the modality. It helps the model capture relationships between classes (e.g., PLAX color and PLAX 2D, or PLAX color and A4C color) rather than treating all 38 labels as completely independent classes. This allows the model to better understand semantic correlation across classes and improves performance (see the sketch below).

[List of 38 Views, R3] We will add this list.

[Presentation Issues, R3] Here we clarify some unclear details. Multi-instance: a single video has a single label but multiple frames, some of which may not be standard. “pos”: positive. Difference from ViFi-CLIP: ViFi-CLIP uses average pooling, whereas we use TML to weight frame features. Manuscript issues: thank you for the detailed review; we will correct all of these in the revision.

[Equations and Figures Readability and Interpretation, R3, R1, R2] Thank you for the advice; we will refine them based on all your suggestions.
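To illustrate the text-encoding rationale stated in the rebuttal (labels combine a view type and a modality, so related classes such as PLAX color and PLAX 2D share semantics), here is a minimal CLIP-style sketch in Python/PyTorch. The prompt template, the subset of views, and the stand-in text encoder are assumptions for illustration only; the paper's actual prompts and encoder may differ.

```python
# Illustrative sketch: class labels combine a view type and a modality, so
# CLIP-style text prompts let related classes share semantics instead of
# being treated as independent one-hot labels. The template and the stand-in
# text encoder are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

views = ["PLAX", "A4C", "A2C"]            # subset of the 38 view types (assumed)
modalities = ["2D", "color Doppler"]
labels = [(v, m) for v in views for m in modalities]
prompts = [f"an echocardiography video of the {v} view in {m} mode" for v, m in labels]

D = 512

def text_encoder(texts):
    # Stand-in for a CLIP text encoder: returns one D-dim vector per prompt.
    # A real encoder would produce embeddings where shared words ("PLAX",
    # "color") induce correlated class vectors.
    g = torch.Generator().manual_seed(0)
    return torch.randn(len(texts), D, generator=g)

text_feats = F.normalize(text_encoder(prompts), dim=-1)      # (num_classes, D)
video_feat = F.normalize(torch.randn(D), dim=-1)              # pooled video feature

logits = video_feat @ text_feats.T                            # cosine similarities
probs = logits.softmax(dim=-1)
pred = labels[int(probs.argmax())]
print("predicted (view, modality):", pred)
```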




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper enjoyed the most robust and engaged reviewer pool of my whole stack. My congratulations to the authors for their good fortune, and my compliments to the reviewers. The vote was 2-1 in favor of acceptance, and given the engagement of all reviewers, and weighting slightly for experience, I recommended Acceptance. However, the “against” reviewer raised valuable and salient points. I urge the authors to carefully assess these comments, and modify the paper, to the degree feasible, to address them before camera-ready deadline.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a novel view classification work for echocardiography in a CLIP based architecture with NSE for OOD detection and TML for temporal adaption. The main concerns about the OOD detection, quality control evaluation, and architecture details were partially addressed in the rebuttal and need further clarification in the revised paper.


