Abstract

In the realm of medical data analysis, medical cross-modal hashing (Med-CMH) has emerged as a promising approach to facilitate fast similarity search across multi-modal medical data. However, due to human subjective deviation or semantic ambiguity, the presence of noisy correspondence across medical modalities exacerbates the challenge of the heterogeneous gap in cross-modal learning. To eliminate clinical noisy correspondence, this paper proposes a novel medical cross-modal prompt hashing (MCPH) method that combines multi-modal prompt optimization with a noise-robust contrastive constraint to address noisy correspondence. Benefitting from the robust reasoning capabilities inherent in medical large-scale models, we design a visual-textual prompt learning paradigm to collaboratively enhance alignment and contextual awareness between the medical visual and textual representations. By providing targeted prompts and cues from the medical large language model (LLM), i.e., CheXagent, multi-modal prompt learning facilitates the extraction of relevant features and associations, empowering the model with actionable insights and decision support. Furthermore, a noise-robust contrastive learning strategy dynamically adjusts the intensity of contrastive learning across modalities, thereby enhancing the contrast strength of positive pairs while mitigating the influence of noisy correspondence pairs. Extensive experiments on multiple benchmark datasets demonstrate that our MCPH surpasses the state-of-the-art baselines.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2150_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2150_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Liu_Medical_MICCAI2024,
        author = { Liu, Yishu and Wu, Zhongqi and Chen, Bingzhi and Zhang, Zheng and Lu, Guangming},
        title = { { Medical Cross-Modal Prompt Hashing with Robust Noisy Correspondence Learning } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a medical cross-modal prompt hashing framework with a noise-robust contrastive constraint.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The challenge addressed, mismatched pairs (i.e., noisy correspondence), is interesting, and the authors designed a reasonable method to solve it.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The motivation for encoding the features into binary codes should be elaborated in detail. 2) The method part of this paper is difficult to understand. Some notations are defined but never used again, e.g., $D_i = \{(I_i, T_i; C_i; l_i)\}_{i=1}^{N}$. It is hard to find where the captions generated by the LLM are used.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Please motivate the hashing method in the introduction section. It is confusing that the main problem and the algorithm design are not really related to the hashing part, even though this is a paper about hashing and retrieval. 2) Please improve the method part and check that all notations are well defined, explained, and used in the following paragraphs.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the idea is interesting and the experiments look fine. But the writing quality is not good enough.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a cross-modal hashing method to encode medical images and texts into a common binary code space, facilitating fast and low-storage image-to-text and text-to-image retrieval. This method addresses the challenge of training with noisy image/text pair correspondences. For that, the authors have proposed a learning scheme that utilizes a pre-trained medical LLM to prompt image and text encoders, thereby effectively handling the noisy image/text correspondences in training data. The method is evaluated on two public datasets: Open-I and MIMIC-CXR, corrupted by inducing noisy correspondence by randomly altering image/text training pairs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) There are very few recent works on cross-modal hashing for medical image/text retrieval [1,2]. The use of caption prompts (generated from LLM) to robustly train against noisy correspondences for cross-modal hashing in medical image/text pairs is a novel concept. 2) The paper thoroughly reviews related works in deep cross-modal hashing and effectively justifies how existing works have not focused on handling noisy correspondences, which is a relevant problem in large image/text pair datasets. 3) The paper is clearly written and easy to follow, maintaining a balanced tone without making overclaims.

    [1] Xu, Liming, et al. “Multi-manifold deep discriminative cross-modal hashing for medical image retrieval.” IEEE Transactions on Image Processing 31 (2022): 3371-3385. [2] Zhang, Yong, et al. “Deep medical cross-modal attention hashing.” World Wide Web 25.4 (2022): 1519-1536.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The authors have noted that the field of cross-modal hashing has remained largely unexplored, which is true in a way. However, the paper lacks a strong motivation explaining how cross-modal hashing with medical data differs from that with natural image/text data and why it needs specific attention.

    2) The method is tested under a simple noise condition where image and text pairs are randomly shuffled in the training set. However, in real-world settings, noise could depend on specific instances or categories. Additionally, noise might manifest as incorrect descriptions of specific words, rather than complete swaps of entire text/image pairs.
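
The simple noise condition described in point 2 can be illustrated with a minimal sketch. It assumes the training set is a list of (image, text) tuples and that noise is injected by reassigning the captions of a randomly chosen fraction of pairs; the function name `inject_noisy_correspondence`, the `seed` parameter, and the rotation scheme are hypothetical choices for illustration, not the authors' protocol.

```python
import random


def inject_noisy_correspondence(pairs, noise_rate, seed=0):
    """Return (noisy_pairs, corrupted_indices) for a list of (image, text) pairs.

    Assumption for illustration: a fraction `noise_rate` of pairs is selected
    at random and their captions are rotated among themselves, so every
    selected image receives the caption of a different selected pair.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(pairs)
    k = int(round(noise_rate * n))
    if k < 2:
        # A rotation needs at least two pairs; otherwise nothing is corrupted.
        return list(pairs), []
    chosen = sorted(rng.sample(range(n), k))
    texts = [pairs[i][1] for i in chosen]
    rotated = texts[1:] + texts[:1]  # shift captions by one position
    noisy = list(pairs)
    for idx, new_text in zip(chosen, rotated):
        noisy[idx] = (pairs[idx][0], new_text)
    return noisy, chosen


# Usage: corrupt 20% of ten toy pairs.
toy_pairs = [(f"img{i}", f"txt{i}") for i in range(10)]
noisy_pairs, corrupted = inject_noisy_correspondence(toy_pairs, 0.2, seed=0)
```

Instance-dependent or word-level noise, as raised in the review, would need a different corruption function; this sketch covers only whole-caption swaps.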

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have used a public dataset and the method should be fairly reproducible based on the description in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Instead of relying on image and text encoders pre-trained on natural image/text data, the authors could have utilized encoders specifically pre-trained on medical data such as MedVIT and Med-BERT. Maybe it is not necessary to use the caption prompts? 2) The category ‘Q’ is not properly explained in section 3.1. 3) Please cite relevant works to justify “However, the effectiveness of these methods depends on….” 4) Please consider using ‘Z’ to denote the sequence of local token embeddings, as ‘z’ may confuse the reader into thinking it represents only one token, similar to ‘g’. 5) Some suggestions for Figure 2: a) It is not clear if all local tokens pass through LTA or just a single token. b) Please consider adding ‘h’ and ‘f’ in the figure to indicate outputs of hashing layers. c) Description of TE is missing. d) Please make the caption more descriptive. Reference: [3]. 6) Is there any measure of how fast the retrieval is? 7) The evaluation metric is not properly explained; if you followed a previous paper, please cite it.

    [3] Liu, Yishu, et al. “Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval.” Proceedings of the 31st ACM International Conference on Multimedia. 2023.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is particularly relevant in the current research context, as there is a growing shift towards multimodal medical data, and methods that offer both low storage and faster retrieval are increasingly important. Also, noisy correspondence is a relevant issue in large datasets. In this work, the use of caption prompts to robustly train a model for cross-modal hashing in the presence of noisy correspondence is novel and previously unexplored. While the major limitation is that the method is only tested with a very simple noise setup, this is an important research problem, and the method justifies the author’s intent.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    To eliminate clinical noisy correspondence, this paper proposes a novel medical cross-modal prompt hashing (MCPH) method that combines multi-modal prompt optimization with a noise-robust contrastive constraint to address noisy correspondence.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This method dexterously incorporates multi-modal prompt optimization with a noise-robust contrastive constraint to enhance noisy correspondence learning.
    2. They build a contrastive learning strategy that aims to enhance the contrast strength of positive pairs while mitigating the influence of noisy correspondence pairs.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In Table 1 and Table 2, why is there no standard deviation for the baseline methods?
    2. The scales of the parameters alpha and beta in Fig. 3 are different. Is it because the scales of these two loss terms are different?
    3. The standard deviation should also be added to Table 3.
    4. The ablation studies and parameter analysis only report results for Open-I using 16 bits with a 20% noise rate. Do similar conclusions hold for other conditions and datasets?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    see above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method for eliminating clinical noisy correspondence across medical modalities is novel.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper presents Medical Cross-Modal Prompt Hashing (MCPH), a method that improves similarity search in multi-modal medical data by leveraging visual and textual prompts from a large language model, CheXagent. MCPH enhances alignment between visual and textual medical data through visual prompt learning (VPL) and textual prompt learning (TPL), which enrich the contextual understanding while reducing noise through a noise-robust contrastive learning strategy. Extensive experiments on benchmark datasets show that MCPH outperforms existing methods in eliminating noisy correspondence and improving cross-modal retrieval accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed Medical Cross-Modal Prompt Hashing (MCPH) is a novel approach that addresses noisy correspondence in medical data analysis using multi-modal prompt optimization and noise-robust contrastive learning.

    MCPH incorporates a noise-robust contrastive learning strategy that dynamically adjusts contrastive learning intensity. This strategy effectively strengthens positive pairs’ contrast while mitigating noisy pairs’ influence, offering a strong improvement over existing methods.

    The work compares MCPH against several state-of-the-art transformer-based methods (DCHMT, DAPH, MITH).
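
The idea of adjusting contrastive intensity per pair, as described in the second strength above, can be sketched as a weighted InfoNCE-style loss. This is an illustrative assumption and not the loss defined in the paper: the function name `weighted_info_nce`, the per-pair `pair_weights` in [0, 1], and the `temperature` value are all hypothetical. The sketch only shows how down-weighting a suspected noisy pair reduces its contribution.

```python
import math


def weighted_info_nce(similarity, pair_weights, temperature=0.1):
    """Weighted image-to-text InfoNCE over an N x N similarity matrix.

    similarity[i][j] is the similarity between image i and text j; the
    diagonal holds the (possibly noisy) annotated pairs. pair_weights[i]
    scales how strongly pair i is pulled together (hypothetical weighting).
    """
    total, weight_sum = 0.0, 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        max_logit = max(logits)
        # Numerically stable log-sum-exp over all texts for image i.
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        loss_i = log_denominator - logits[i]  # -log softmax of the paired text
        total += pair_weights[i] * loss_i
        weight_sum += pair_weights[i]
    return total / weight_sum if weight_sum > 0 else 0.0


# Usage: the second pair looks mismatched (its diagonal similarity is low).
sim = [[0.9, 0.1],
       [0.8, 0.2]]
uniform_loss = weighted_info_nce(sim, [1.0, 1.0])
downweighted_loss = weighted_info_nce(sim, [1.0, 0.0])
```

With uniform weights the mismatched pair dominates the loss; setting its weight to zero removes that contribution. How such weights would be estimated is the substance of the method and is not covered by this sketch.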

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper lacks specific details on how the visual and textual prompts are optimized. A deeper explanation of the tuning process would help clarify how these prompts contribute to noisy correspondence elimination.

    More details are needed on the hashing prompts and on the rationale for the network design.

    The paper does not present a complexity analysis of the MCPH framework. Including this would help assess the computational feasibility and scalability of the approach, especially for larger-scale datasets.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Clinical impact demonstration: the paper could address the clinical impact in the discussion part.
    2. More theory for prompt hashing.
    3. More details for the complexity analysis.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The theoretical grounding.
    2. The complexity analysis.
  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We express our gratitude to all the reviewers for their insightful feedback. We hope that our clarification can eliminate any doubts and gain enhanced recognition.

Response to Reviewer #1:

(1) Clarification of binary code encoding: The motivation for binary code encoding in multi-modal hashing retrieval lies in handling large-scale, heterogeneous datasets. By learning hash functions that maintain the semantic relationships between data points, binary encoding ensures that similar items remain close in the Hamming space, which improves the accuracy and speed of data retrieval across diverse modalities.
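
As an illustration of the point above, the following minimal sketch shows why binary codes make retrieval cheap: ranking reduces to XOR and bit counting. It assumes sign thresholding of real-valued features and toy 4-bit codes; the helper names `binarize`, `hamming_distance`, and `retrieve` are hypothetical and do not come from the paper.

```python
def binarize(features):
    """Pack the signs of real-valued features into an integer hash code.

    Assumption for illustration: a non-negative value maps to bit 1 and a
    negative value to bit 0.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code


def hamming_distance(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database_codes, top_k=3):
    """Return indices of the top_k database codes closest in Hamming space."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: (hamming_distance(query_code, database_codes[i]), i),
    )
    return ranked[:top_k]


# Usage: a text query retrieves images from a toy database of hash codes.
text_query = binarize([0.9, -0.2, 0.4, -0.7])
image_db = [
    binarize([0.8, -0.1, 0.3, -0.5]),   # same sign pattern as the query
    binarize([-0.6, 0.7, -0.2, 0.9]),   # opposite sign pattern
    binarize([0.5, 0.2, 0.1, -0.3]),    # differs from the query in one bit
]
closest = retrieve(text_query, image_db, top_k=2)
```

Because each code fits in a machine word, both storage and comparison cost stay small as the database grows, which is the efficiency argument made in the response.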

(2) Review of notations: We have carefully revised all notations to ensure their correctness in our manuscript.

Response to Reviewer #3:

(1) Clarification of symbol Q: Category Q represents the disease labels for each image-text pair in the medical multi-modal dataset.

(2) Citation of relevant works: We have cited these relevant works in our new version, ensuring a comprehensive understanding of multi-modal hashing retrieval.

(3) Modification of the sequence of local token embeddings from ‘z’ to ‘Z’: To prevent misunderstanding, we have changed the notation for the sequence of local token embeddings from ‘z’ to ‘Z’.

(4) Clarification of Figure 2: We have also added a brief description of each operation in the figure caption to provide more context for readers.

(5) Explanation of evaluation metrics: To make a fair comparison, we follow the evaluation metrics provided by [22] to verify the effectiveness of our method compared against various baseline methods.

[38] Liu, Y. et al: Multi-granularity interactive transformer hashing for cross-modal retrieval. In: Proceedings of ACM MM. pp. 893–902 (2023)

Response to Reviewer #4:

(1) Modification of Tables 1, 2, and 3: To make a fair comparison, we follow the evaluation metrics provided by [22] to verify the effectiveness of our method compared against various baseline methods.


(2) Different settings of alpha and beta: We have explored different values of the parameters alpha and beta to examine their impact on the performance of our multi-modal hashing retrieval system. By systematically varying alpha and beta, we aim to identify the optimal settings that maximize retrieval accuracy and efficiency.

(3) More ablation studies and parameter analysis: Due to page limit constraints, we only report the results for Open-I using 16 bits with a 20% noise rate. In practice, our MCPH significantly outperforms all state-of-the-art baselines with different settings.

Response to Reviewer #5:

(1) More implementation details: We have included additional details about the implementation in our supplementary materials. Furthermore, the code will be made publicly available upon acceptance of this paper.

(2) Complexity analysis: Following your kind suggestions, we will add a complexity analysis to the supplementary materials to verify the computational feasibility of our MCPH framework.




Meta-Review

Meta-review not available, early accepted paper.


