Abstract

The Prostate Imaging Reporting and Data System (PI-RADS) is pivotal in the diagnosis of clinically significant prostate cancer through MRI. Current deep learning-based PI-RADS scoring methods often fail to incorporate the common PI-RADS clinical guideline (PICG) used by radiologists, potentially compromising scoring accuracy. This paper introduces a novel approach that adapts a multi-modal large language model (MLLM) to incorporate PICG into a PI-RADS scoring model without additional annotations or network parameters. We present a two-stage fine-tuning process that adapts an MLLM originally trained on natural images to MRI images while effectively integrating the PICG. Specifically, in the first stage, we develop a domain adapter layer tailored for processing 3D MRI inputs and instruct the MLLM to differentiate MRI sequences. In the second stage, we translate the PICG into guiding instructions for the model to generate PICG-guided image features. Through a feature distillation step, we align the scoring network's features with the PICG-guided image features, which enables the model to effectively incorporate the PICG information. We develop our model on a public dataset and evaluate it on an in-house dataset. Experimental results demonstrate that our approach effectively improves the performance of current scoring networks. Code is available at: https://github.com/med-air/PICG2scoring
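
Feature Distillation Sketch

The second-stage alignment can be illustrated with a minimal PyTorch-style sketch: a lightweight scoring network is trained with a standard classification loss plus a term that aligns its image features with the PICG-guided features produced by the fine-tuned MLLM. The 3D backbone, the MSE alignment loss, the projection layer, and all names below are illustrative assumptions rather than the authors' exact implementation; refer to the linked repository for the actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringNetwork(nn.Module):
    """Lightweight PI-RADS scoring classifier with a placeholder 3D encoder."""
    def __init__(self, feat_dim: int = 512, num_scores: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_scores)

    def forward(self, volume):
        feat = self.backbone(volume)          # image features to be aligned
        return feat, self.classifier(feat)

def training_step(scoring_net, proj, volume, score_label, mllm_feat, alpha=0.5):
    """Classification loss plus alignment with PICG-guided MLLM features.

    mllm_feat: PICG-guided image features precomputed by the frozen, fine-tuned MLLM.
    proj: hypothetical linear projection mapping scoring features to the MLLM
          feature dimension; it is only needed during training, so the deployed
          classifier keeps its original parameter count.
    """
    feat, logits = scoring_net(volume)
    cls_loss = F.cross_entropy(logits, score_label)
    align_loss = F.mse_loss(proj(feat), mllm_feat.detach())
    return cls_loss + alpha * align_loss

At inference time only the scoring network is used; the MLLM and the projection layer are discarded, which is consistent with the claim that the approach adds no parameters to the deployed scoring model.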

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2830_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/med-air/PICG2scoring

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_Incorporating_MICCAI2024,
        author = { Zhang, Tiantian and Lin, Manxi and Guo, Hongda and Zhang, Xiaofan and Chiu, Ka Fung Peter and Feragen, Aasa and Dou, Qi},
        title = { { Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an approach for incorporating PI-RADS clinical guidelines (PICG) into prostate cancer PI-RADS scoring using a multi-modal large language model (MLLM). The authors present a two-stage fine-tuning process that aims to adapt MLLMs originally trained on natural images to the prostate MRI domain while integrating PICG information. The model is developed on a public dataset and evaluated on a private dataset. Results show the approach improves the performance of current PI-RADS scoring networks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written.
    2. The method is well-motivated.
    3. The approach boosts the performance of previous methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation is unclear.
      • It is unclear why the authors adapt the MLLM for prostate cancer PI-RADS scoring.
      • It is unclear why the authors propose the two-stage training process.
    2. The differences with previous works are not discussed in detail.
      • The main differences (besides the application area) between this work and previous methods using MLLM for medicine are not clearly explained.
      • The differences between the proposed method and previous two-stage methods are not thoroughly discussed.
    3. The experiments are conducted on a private dataset, which hinders the generalizability of this work. Meanwhile, the details of this private dataset are limited. More information should be provided on the patient cohort, image acquisition protocols, and annotation process to better assess potential biases and the generalizability of the results. I strongly recommend the authors provide an experiment on a public dataset.

    4. The experiments are limited.
      • Comparisons are only made to prior PI-RADS scoring models without PICG incorporation. Contrasting with other methods that integrate clinical guidelines, even if they require extra annotations, would help contextualize the advantages of this approach.
      • The statistical significance of the reported improvements in scoring performance is not assessed. This is especially important to include given the moderate scale and private nature of the test sets.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The experiments are conducted on a private dataset. However, the basic introduction of the private data is missing, e.g., the patient cohort, image acquisition protocols, and annotation process.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to Weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The application of MLLM to prostate cancer PI-RADS scoring is new. However, the differences with previous works are not discussed in detail, and the experiments are performed on a private dataset.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposed a novel method that adapts a multi-modal large language model (MLLM) to incorporate Prostate Imaging Reporting and Data System (PI-RADS) clinical guidelines (PICG) into PI-RADS scoring without additional annotations or network parameters. The proposed method consists of two stages: adapting the MLLM to the prostate MRI domain and generating PICG-guided image features. The experimental results showed that the proposed method had superior performance compared with other state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The advantage of the proposed method is that it incorporates PICG into the scoring network through the MLLM without increasing the parameters of the scoring network or requiring additional training data. Moreover, the results showed that the performance of PI-RADS scoring of lesions compared favorably with state-of-the-art methods.
    The paper was well-organized and well-written. Figures 2 and 3 helped in understanding the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the results section, there was no comparison of the number of parameters and training data with the existing methods. Therefore, it is difficult to confirm the advantage of the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned in the weakness of the paper, it would be better to compare the number of parameters and training data with the existing state-of-the-art methods.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method improved prostate cancer PI-RADS scoring performance by appropriately using a trendy method such as a multi-modal large language model.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces an approach to improve the accuracy of MRI-based prostate cancer diagnosis by integrating Prostate Imaging Reporting and Data System (PI-RADS) clinical guidelines directly into a multi-modal large language model (MLLM). This integration allows the model to utilize the essential clinical guidelines without additional annotations or additional network parameters. By adapting the MLLM to the MRI domain and using feature distillation, the paper demonstrates that this method enhances the performance of existing PI-RADS scoring systems, enabling more accurate and reliable cancer detection in clinical settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Creatively integrates PI-RADS clinical guidelines with a multi-modal large language model, enhancing MRI-based prostate cancer diagnosis without additional annotations.
    2. Incorporates a domain adapter layer specifically for 3D MRI images, demonstrating how to adapt language models trained on general images to medical imaging tasks.
    3. Validated on public and private datasets, the approach shows improved performance over existing methods, proving its effectiveness and utility.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper does not provide a thorough analysis of the interpretability of the AI model. Given that it integrates clinical guidelines, an evaluation of how these integrations impact the transparency and interpretability of the model’s decisions would be valuable.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper might benefit from a clearer and more structured explanation of the methods and technologies used. It is recommended that the paper include a more comprehensive analysis of the model’s interpretability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes the innovative integration of PI-RADS clinical guidelines with a multi-modal large language model, which is a promising approach for enhancing MRI-based prostate cancer diagnosis. However, the lack of detailed interpretability analysis and limited discussion on model generalizability are concerning.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal, I maintain my rating.



Review #4

  • Please describe the contribution of the paper

    The paper describes a method that uses textual guidelines to improve the performance of PI-RADS scoring. The LLaMA-Adapter V2 model is used to generate a feature vector from the image and the PI-RADS guidelines, which provides a distillation-type loss that improves the performance of scoring networks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of multiple scoring networks shows that the approach generalizes well. Using the expensive MLLM during training rather than inference makes effective use of this computationally heavy model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is not clear how sensitive the method is to the actual guideline text, versus simply having the MLLM process the image and generate a feature vector at all. Is it merely the presence of this secondary, relevant task that brings the improvement?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Section 3.2 confused me in a few respects:

    • The PI-RADS networks cite 22, 10, 11, but the table cites 22, 11, 28.
    • Were these scoring approaches trained by the authors on this training dataset, or were pretrained weights used? A similar question applies to VGG. In Sec. 3.3, the “initialized” state of the MLLM is referred to; presumably this is the pretrained state, not a completely random state of the MLLM? Also, the last line of this section, about features without pretraining, is confusing.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Paper presents an interesting approach to exploiting large foundation models that likely can be generalized outside of this specific example.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal, I maintain my accept rating.




Author Feedback

We thank all chairs and reviewers for their time. We are encouraged by the positive comments on our novelty and experiments, and we believe our work is a strong submission to MICCAI this year. We highly appreciate the AC’s careful reading of our paper and favorable consideration of it.

— R1. Q: How sensitive is the work to the actual guideline, versus simply having the MLLM process the image? A: Our model shows better generalizability than all baselines on a test set with distribution shifts. We attribute this to our introduction of clinical guidelines, since similar conclusions appear in existing works (Liu et al., NeurIPS’23; Dai et al., NeurIPS’23). Further experiments are an interesting direction for future study. Q: The pre-trained weights of the baselines and the MLLM. A: We apologize for the typos. We used ImageNet pre-trained weights for our baselines. “Initialized state” refers to the LLaMA pre-trained weights, not a random state. We will rephrase these.

— R3. Q: There was no comparison of the number of parameters and training data with existing methods. A: Our PI-RADS score classifier has the same parameter count as the baseline. During training, we distill features from the MLLM into a lightweight classifier; upon deployment, only the lightweight classifier is used. The training MRI data is the same for all baselines.

— R4. Q: Why adapt the MLLM for prostate cancer PI-RADS scoring, and why propose the two-stage training process? A: (1) PI-RADS scoring is crucial for prostate cancer diagnosis, which requires solutions that are not only effective but also reliable. Existing methods achieve reliable decisions through network modifications and extra annotations. Since clinicians score images according to guidelines, we naturally adapt an MLLM to encode the textual guidelines to guide the image-scoring task. (2) Our two-stage training is designed for our clinical task: the guideline has different rules for different MRI modalities, so we added a stage that teaches the MLLM to distinguish MRI modalities before encoding the guidelines. Q: Differences with previous works. A: Unlike others built from scratch (Liu et al., NPJ Digit. Med’23) or focused on the effectiveness of clinical text generation (Li et al., ICML’23), we aim for a guideline-aware, reliable solution and consider the limited computational resources in clinics. We use “rules” to regularize the image feature space with text and distill these features into lightweight models. In addition, we suggest a scheme to teach the model to discriminate different MRI modalities in the first stage. Our novel two-stage training handles 3D volumes (most MLLM inputs are 2D) and requires no extra annotations. Q: Provide an experiment on a public dataset. A: Due to space limits, we prioritize presenting our model’s generalizability. We trained our model and all baselines on a public dataset (Natarajan et al., 2020) and evaluated them on our in-house dataset; Table 1 shows our model’s remarkable performance on this unseen dataset. Q: Contrasting with other methods that integrate clinical guidelines, and statistical significance of the reported improvements. A: (1) Comparisons with existing rule-based models might require extra annotations, which risks an unfair comparison. As shown in Table 1, even without any additional labels or modification of network structures, our proposed method shows impressive generalizability over all baselines, which validates its effectiveness. (2) Table 1 reports the average and standard deviation of model performance over three runs. We will include dataset statistics in the final version.

— R5. Q: Include a more comprehensive analysis of the model’s interpretability. A: Our model shows better generalizability than all baselines on a test set with distribution shifts (Table 1), which is indirect evidence that the model integrates the guidelines in its decisions. We will consider a more comprehensive analysis in future work.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I don’t think it is a strong submission to MICCAI this year.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    I don’t think it is a strong submission to MICCAI this year.


