Abstract

Accurately predicting the 5-year prognosis of lung cancer patients is crucial for guiding treatment planning and providing optimal patient care. Traditional methods relying on CT image-based cancer stage assessment and morphological analysis of cancer cells in pathology images have encountered challenges in reliability and accuracy due to the complexity and diversity of information within these images. Recent rapid advancements in deep learning have shown promising performance in prognosis prediction; however, utilizing CT and pathology images independently is limited by their differing imaging characteristics and the unique prognostic information each provides. To address these challenges, this study proposes a novel framework that integrates the prognostic capabilities of both CT and pathology images with clinical information, employing a multi-modal integration approach via multiple instance learning and leveraging large language models (LLMs) to analyze clinical notes and align them with image modalities. The proposed approach was rigorously validated using external datasets from different hospitals, demonstrating superior performance over models reliant on vision or clinical data alone. This highlights the adaptability and strength of LLMs in managing complex multi-modal medical datasets for lung cancer prognosis, marking a significant advance towards more accurate and comprehensive patient care strategies.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2173_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2173_supp.pdf

Link to the Code Repository

https://github.com/KyleKWKim/LLM-guided-Multimodal-MIL

Link to the Dataset(s)

https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71394

BibTex

@InProceedings{Kim_LLMguided_MICCAI2024,
        author = { Kim, Kyungwon and Lee, Yongmoon and Park, Doohyun and Eo, Taejoon and Youn, Daemyung and Lee, Hyesang and Hwang, Dosik},
        title = { { LLM-guided Multi-modal Multiple Instance Learning for 5-year Overall Survival Prediction of Lung Cancer } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a multi-modal integration framework for improving the 5-year prognosis prediction of lung cancer patients. By synergistically combining CT and pathology images with clinical information, the framework employs multiple instance learning and leverages Large Language Models (LLMs) to analyse clinical notes, aligning them with imaging modalities. The integration of these diverse data types promises to enhance the accuracy and comprehensiveness of lung cancer prognosis, addressing current limitations faced by traditional and single-modality approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The paper’s text is well-organized, facilitating easy comprehension of the methodologies, experimental setups, and results. (2) A detailed ablation study was conducted on the impact of each modality on the results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Insufficient innovation: combining CT and pathology images and using LLMs is not, by itself, a sufficiently innovative contribution to support the paper. (2) The algorithms used for comparative analysis in Table 1 are insufficiently described and do not include comparisons with the latest state-of-the-art methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) It would be beneficial to more clearly highlight any novel computational techniques or unique integrative strategies that differentiate your framework from existing methodologies. Clarifying these aspects could help reinforce the innovative nature of your research. (2) It is recommended to extend the range of comparative analyses to include more recent and relevant state-of-the-art methods. This will provide clearer benchmarking of the model’s performance and underscore its value in the current research landscape.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation for a weak reject is primarily due to the paper’s limited innovation and insufficient comparative analysis. While the integration of multiple modalities and the use of LLMs for clinical note analysis are commendable, they do not sufficiently advance beyond current technologies to meet the high standards of novelty expected at MICCAI. Additionally, the lack of a robust comparative framework against state-of-the-art methods makes it difficult to ascertain the proposed model’s relative performance and impact.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The original contribution of the proposed paper is the integration of the prognostic capabilities of CT and pathology images with clinical information. For that purpose, multi-modal integration via multiple instance learning is used. Large language models are used to analyze clinical data and align it with the imaging modalities. The concept is rigorously validated on external datasets from different hospitals and is aimed at 5-year survival prediction for patients with lung cancer. The concept yielded an AUC of 0.877 (p<=0.05) and a recall of 0.964 on the external validation set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One strength (novelty) of the paper is the original multi-modal architecture that extracts features from CT slice images and patches of histopathological images under the guidance of text features extracted from clinical information by a large language model (LLM). The multimodal alignment module is the key part of that architecture. It aligns text features with LLM-guided CT features and histopathological features. That enables their concatenation into a bag of uniform dimension with an additional class token that summarizes information from the entire bag. This class token is passed through a fully connected layer to determine the final survival output. Another strength is the rigorous validation of the proposed concept on external datasets from different hospitals.
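    As an illustration of the bag-and-class-token design described above, the following is a minimal sketch under stated assumptions: it presumes all instance features (CT slices, pathology patches, text) have already been projected to a common dimension by the alignment module, and the class name, feature dimension, and transformer depth are illustrative choices rather than the authors' implementation.

```python
# Minimal sketch (not the authors' released code): a MIL head that prepends a
# learnable class token to a bag of already-aligned instance features and
# predicts 5-year overall survival from that token.
import torch
import torch.nn as nn

class MILBagClassifier(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, depth: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.fc = nn.Linear(dim, 1)  # binary 5-year overall-survival logit

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (B, N, dim) -- N instances from all modalities, aligned to a common dim
        cls = self.cls_token.expand(bag.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, bag], dim=1))  # (B, N+1, dim)
        return self.fc(x[:, 0])                          # class token -> survival logit

# Example bag: 30 CT-slice features + 64 pathology-patch features + 1 text feature
bag = torch.randn(2, 30 + 64 + 1, 512)
logits = MILBagClassifier()(bag)  # shape: (2, 1)
```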

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It appears to me that the main weakness of the proposed methodology comes from the presentation style, which is not always easy to follow. That affects the understanding of the proposed methodology, in particular the multi-modal alignment module (MAM). The MAM uses text features, derived via the LLM from text prompts, to guide the mapping of features from CT and histopathological images. That is the most significant part of the proposed concept because it enables the integration of the prognostic capabilities of CT and pathology images with clinical information. Based on the presentation in Section 2.3, I had the impression that the described MAM is automated and synchronized with the other modules of the proposed multimodal MIL framework. Nevertheless, in Section 4 the authors discuss the limitations of the proposed methodology and point out that the study relies on manually generated prompts for LLM guidance. Thus, as pointed out, automated prompt generation remains a challenge for further research. As it is, the study is at the concept-generation level.

    In Table 2, quantitative analysis is presented in terms of AUC, accuracy, precision, and recall for the internal and external validation sets and various combinations of the three modalities (pathology, CT, and text). A strange phenomenon occurred after combining the text modality with either the CT or pathology modality, or with both of them: the values of either precision or recall dropped after the introduction of the text modality. The exception is the combination of all three modalities on the external validation set. This phenomenon was not discussed in the paper. Given that recall is an important metric in the medical field, the authors should provide some analysis/explanation of this phenomenon.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The presentation style is not always easy to follow. That affects the understanding of the proposed methodology, in particular the multi-modal alignment module (MAM). The MAM uses text features, derived via the LLM from text prompts, to guide the mapping of features from CT and histopathological images. That is the most significant part of the proposed concept because it enables the integration of the prognostic capabilities of CT and pathology images with clinical information. Based on the presentation in Section 2.3, I had the impression that the described MAM is automated and synchronized with the other modules of the proposed multimodal MIL framework. Nevertheless, in Section 4 the authors discuss the limitations of the proposed methodology and point out that the study relies on manually generated prompts for LLM guidance. Thus, as pointed out, automated prompt generation remains a challenge for further research. As it is, the study is at the concept-generation level.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The presentation style is not always easy to follow. That affects the understanding of the proposed methodology, in particular the multi-modal alignment module (MAM). The MAM uses text features, derived via the LLM from text prompts, to guide the mapping of features from CT and histopathological images. That is the most significant part of the proposed concept because it enables the integration of the prognostic capabilities of CT and pathology images with clinical information. Based on the presentation in Section 2.3, I had the impression that the described MAM is automated and synchronized with the other modules of the proposed multimodal MIL framework. Nevertheless, in Section 4 the authors discuss the limitations of the proposed methodology and point out that the study relies on manually generated prompts for LLM guidance. Thus, as pointed out, automated prompt generation remains a challenge for further research. To reduce this issue, the authors should provide the code related to the presented study; that would also help improve the reproducibility of the proposed methodology.

    In Table 2, quantitative analysis is presented in terms of AUC, accuracy, precision, and recall for the internal and external validation sets and various combinations of the three modalities (pathology, CT, and text). A strange phenomenon occurred after combining the text modality with either the CT or pathology modality, or with both of them: the values of either precision or recall dropped after the introduction of the text modality. The exception is the combination of all three modalities on the external validation set. This phenomenon was not discussed in the paper. Given that recall is an important metric in the medical field, the authors should provide some analysis/explanation of this phenomenon.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is weak accept. I propose acceptance due to: (1) the originality of the contribution, namely the integration of the prognostic capabilities of CT and pathology images with clinical information, aimed at 5-year survival prediction for patients with lung cancer; (2) the rigorous validation of the proposed concept on external datasets from different hospitals. I propose weak acceptance because the multi-modal alignment module does not operate autonomously but relies on manually generated prompts for LLM guidance. That affects the performance and reproducibility of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My justification for proposing weak acceptance was the fact that the multi-modal alignment module does not operate autonomously but relies on manually generated prompts for LLM guidance, which affects the performance and reproducibility of the proposed method. In their rebuttal, the authors acknowledged that automated prompt generation has been extensively researched recently. Therefore, their reasoning is that “it would be more appropriate to consider our manually generated prompt as a type of prompt-generation method with room for improvement, rather than at the concept generation level.” The authors provided convincing argumentation for the drop in precision on the internal validation. Regarding my comment on reproducibility, the authors promised to release the code after the paper is accepted. Overall, I am satisfied with how the authors answered my comments and raise my proposal from weak accept to accept.



Review #3

  • Please describe the contribution of the paper

    The authors propose a multimodal method for predicting 5-year overall survival by using CT scans, Pathology images, and Clinical Notes. The Open AI Dataset Project (AI-Hub, S. Korea, www.aihub.or.kr) dataset is used for training and validation together with an external dataset used for testing. The multimodal approach is shown to be more effective at solving the proposed task than the baseline approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors show the technical understanding needed to work with three different modalities. The evaluation shows that the multi-modal approach significantly improves the results compared to the baseline approaches. The use of an external dataset increases the confidence that the model generalises beyond the training distribution (see comments to the authors on questions related to the external dataset).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I have seen 5-year overall survival framed as a binary classification problem less frequently than survival time predicted as a regression task. I am not sure how much clinical utility can be extracted from telling a patient whether they are going to be dead within 5 years; I think patients are more interested in how much time they are predicted to live. This is not an expert opinion, but a concern that I ask to be addressed (see comments to the authors).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The pipeline is too complex to be reasonably expected to be reproduced by other researchers without the authors publishing their code (data preparation, modelling, and evaluation). “Reasonably expected” is used for a scenario when another research group proposes another model and is faced with a choice of whether to benchmark the method described in this paper or just reference it.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Page 1: Please check the 60% 5-year survival rate statement. Are you talking about a specific group within lung cancer patients for whom surgery or treatment went particularly well? If so, please clarify that.

    Page 2, line 2: Please explain the TNM staging system. For example “The TNM Staging System includes the extent of the tumor (T), extent of spread to the lymph nodes (N), and presence of metastasis (M)” from https://www.facs.org/quality-programs/cancer-programs/american-joint-committee-on-cancer/cancer-staging-systems/#:~:text=The%20TNM%20Staging%20System%20includes,the%20original%20(primary)%20tumor.&text=The%20M%20category%20tells%20whether,other%20parts%20of%20the%20body).

    Page 2: Please check DeepConvSurv (2020) followed by DeepCorrSurv (2017). The publication years are in reverse order.

    Penultimate line of the introduction section: please clarify in what context you report the p-value.

    Section 2.2: the use of D - length of a vector and in 1D, 2D, 3D is confusing. Consider using a different letter for the vector length. Alternatively, write 1-dim, 2-dim, 3-dim instead of 1D, 2D, 3D respectively.

    Section 2.2, Pathology encoder: $D_0$ is used once, is it the same length $D$ as in the CT encoder?

    Section 2.2, text encoder: please motivate the choice of CLIP as a text encoder instead of using text-only encoders, e.g. LLaMa 2, BERT, T5, etc. I do not think you use the vision encoder from CLIP, please correct me if I am wrong and clarify in the paper.

    Section 2.3: “In the MAM, we set the dimension of each modality’s encoded features M to D”. I think this sentence is confusing because capital Latin letters are used for different things. As far as I understand, M is a vector of encoded features, while D is a scalar representing size. Please consider using (1) capital Latin letters for scalar sizes, (2) bold-font lower-case Latin letters for 1-dimensional vectors, and (3) bold-font capital Latin letters for matrices/tensors. I think this will improve the notation consistency of the paper.

    Section 2.3: Is $Z^{text/CT}$ the same as $Z^{text \times CT}$ used in the caption of Figure 1? If yes, please use one of them consistently throughout the paper.

    Section 2.4, possible typo: “the class token is passed through as fc layer” -> “the class token is passed through an fc layer”

    Section 3.1: please add the link straight away “The Open AI Dataset Project (AI-Hub, S. Korea, www.aihub.or.kr)” and clarify if it’s public. Does it have anything to do with OpenAI, the creators of ChatGPT? Or just a name clash (if yes, maybe mention it)?

    Section 3.1: please clarify if the clinical notes were in English or Korean. I checked the web page and it is all in Korean.

    Section 3.1: please clarify the number of cases. 908 cases split into internal train and test? How many were in the external test set?

    Section 3.2, CT: please specify the slice thickness of the CT scans

    Section 3.2, pathology: please specify the resolution (microns per pixel) at which the 3x224x224 patches were extracted, e.g. 0.25, 0.5, 1 micron per pixel (usually correspond to 40x, 20x, 10x scanner magnifications)

    Section 3.3: “The method of previous studies achieved AUCs of 0.648 and 0.699, with and without clinical information, respectively.” I think the with/without should be swapped since Table 1 shows that the result with text is better while the quoted sentence suggests that it’s worse.

    Section 3.3: Please check if DeLong’s test is applicable for ROC AUC comparisons in a binary classification scenario.

    Section 3.3: Did you predict 5yOS as a binary indicator? If yes, please specify it early on in the paper and reiterate again. People, who are used to survival analysis literature will expect Kaplan-Meier survival curves and not AUC. Please motivate the choice.

    Section 3.3: Please clarify where the external validation set came from (if it does not prevent anonymity) and if it is public. External dataset seems pretty similar in terms of the data distribution (pathology, clinical notes) according to the t-sne plots.

    Section 3.3, ablation studies: “The internal validation set was randomly split as 25% of the training set, with accuracy, precision, and recall derived by applying the threshold value that maximizes Youden’s index in the internal validation.” Did you choose the optimal threshold on the internal validation set and then use this threshold for calculating accuracy, precision, and recall on the external set? If you chose the optimal threshold on the internal validation set, then the accuracy, precision, and recall on the internal set are inflated, since they are as good as they possibly can be. Please clarify this (a sketch of the protocol in question follows these comments).

    Table 2, caption: Consider changing for clarity “DeLong’s test [2] between with and without text guidance” -> “DeLong’s test [2] between ‘with-text’ and ‘without-text’ guidance”

    Section 4: please clarify the use of prompts. From reading everything before, I thought that the raw clinical notes are passed into the pre-trained CLIP text encoder.

    Please order the cited references in ascending order to make look-up easier, e.g. page 1 [11, 6] -> [6, 11]. I think it’s easier to look up the references when they are sorted.
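    For concreteness regarding the threshold question raised above (Section 3.3, ablation studies), here is a minimal sketch, under assumptions, of the protocol being asked about: the Youden-optimal cutoff is selected on the internal validation set and then reused unchanged on the external set. The data, variable names, and scikit-learn usage are illustrative placeholders, not the authors' pipeline.

```python
# Sketch of the assumed thresholding protocol: pick the Youden-optimal cutoff on
# the internal validation set, then reuse that fixed cutoff on the external set.
import numpy as np
from sklearn.metrics import roc_curve, accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_val, p_val = rng.integers(0, 2, 200), rng.random(200)  # internal validation (placeholder)
y_ext, p_ext = rng.integers(0, 2, 150), rng.random(150)  # external test (placeholder)

fpr, tpr, thresholds = roc_curve(y_val, p_val)
thr = thresholds[np.argmax(tpr - fpr)]                   # Youden's J = TPR - FPR

for name, (y, p) in {"internal": (y_val, p_val), "external": (y_ext, p_ext)}.items():
    pred = (p >= thr).astype(int)
    print(name,
          accuracy_score(y, pred),
          precision_score(y, pred, zero_division=0),
          recall_score(y, pred, zero_division=0))
```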

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are many works trying to predict survival from unimodal data (just CT or just histology), while multimodal approaches are rarer due to the high technical barrier of knowing and being comfortable with the SOTA techniques in multiple fields. The use of multimodal data for 5-year overall survival prediction is a natural approach, since it most closely mimics the work of clinical MDTs and uses all available data. The successful results of this work show that this multimodal approach might be a step on the way forward.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments so my original decision of “acceptance” stays.




Author Feedback

We deeply appreciate the reviewers’ time and effort in providing detailed feedback and suggestions. We have addressed the key points raised by each reviewer below.

(R3) Manually generated prompt: Since we use a fixed text encoder, the text features used for guidance vary depending on the input prompt, making the optimal prompt crucial. We found that combining all clinical information into a single-sentence prompt is more effective than applying each piece individually (e.g., “a photo of {information}”). However, as you mentioned, performance could be improved with learnable prompt techniques (automated prompt generation), which have been extensively researched recently. Thus, it would be more appropriate to consider our manually generated prompt as a type of prompt-generation method with room for improvement, rather than at the concept generation level.

Dropped precision or recall: Except for the internal validation of the model using all three modalities, we presented results where AUC, accuracy, and recall values increased while precision slightly decreased. The excellent performance in AUC and recall suggests that the text guidance and the addition of modalities contribute to higher performance. The slight decrease in precision can be understood as a typical trade-off between precision and recall. Additionally, our proposed method demonstrated improved precision and recall in external validation, proving the overall effectiveness of our model.

Reproducibility: We will release our code after final acceptance to provide sufficient information for reproducibility.

(R4) Typos and detailed comments: We will actively incorporate them and provide more detailed explanations to aid understanding.

CLIP as text encoder: We treated CLIP’s text encoder as an LLM. CLIP can extract matching information from both text and images. Thus, we used the CLIP model with the expectation that the features extracted from our clinical prompt would effectively capture the matching clinical information in the image modalities through cross-attention. Other LLMs could also perform well, since they extract the semantic meaning of text more effectively than simply converting clinical information into a single vector.

(R5) Novelty: First, unlike most multi-modal classification models, such as references 20 and 21, which simply concatenate the extracted features from each modality without specific guidance, our proposed model uses cross-attention modules within the MAM to align text and image features, enhancing generalizability by exploiting text information, which has less distribution disparity between internal and external datasets. The LLM effectively extracts the semantic meaning of text, and CLIP’s training on text-image pairs helps the model better understand the visual elements embedded in text. Additionally, we ensured that the model does not become biased towards the text information alone and prevented any loss of information by re-attending to text features with unique information from the image modality. When we used a list-based vector alignment instead of CLIP’s text encoder for the text input, the AUC was lower (0.829) than ours (0.877), further proving the efficacy of the LLM. Second, from a clinical aspect, our research also has novelty in its use of CT, pathology, and text for 5yOS prediction, which has not been attempted before. Each modality provides unique information for 5yOS, such as TNM stage from CT and lung cancer cell type from pathology images. Although some recent studies have used FDG PET, inflammatory, or genomic data, our study’s strength lies in using essential yet more commonly available data.

Comparison with SOTA: As mentioned earlier, to our knowledge, there are no SOTA algorithms directly comparable in the exact context of our study. Most multi-modality research employs parallel or serial concatenation of the extracted features from each modality. Therefore, we compared our method against such approaches, as shown in Table 1.
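To illustrate the two mechanisms discussed in this rebuttal (a single-sentence clinical prompt encoded by CLIP’s frozen text encoder, and cross-attention that lets text features guide image features), here is a minimal sketch under assumptions: the Hugging Face checkpoint name, the prompt wording, the attention direction (image queries attending to text), and the residual connection are all illustrative choices, not the paper’s exact design or the released code.

```python
# Minimal sketch, under stated assumptions, of (1) a single-sentence clinical
# prompt encoded by CLIP's frozen text encoder and (2) a cross-attention step in
# which image-instance features attend to the resulting text feature.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel  # assumed CLIP backend

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# (1) Single-sentence prompt assembled from clinical fields (hypothetical wording).
prompt = "A 67-year-old male smoker with stage IIB lung adenocarcinoma treated by lobectomy."
with torch.no_grad():
    tokens = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    text_feat = text_encoder(**tokens).pooler_output        # (1, 512), EOS-token embedding

# (2) Cross-attention: image instances (queries) attend to the text feature (key/value).
class TextGuidedAlign(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        guided, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return img_feats + guided                            # residual text guidance

ct_feats = torch.randn(1, 30, 512)                           # e.g., 30 CT-slice features
aligned_ct = TextGuidedAlign()(ct_feats, text_feat.unsqueeze(1))  # (1, 30, 512)
```

An analogous block could guide pathology-patch features, and, as the rebuttal notes, a second attention step from text back to image features could be added to avoid biasing the model toward text alone.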




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper’s key contribution is integrating prognostic data from CT and pathology images with clinical information using multiple instance learning and employing large language models to analyze and align clinical data with imaging results. The paper studies an important research problem, and the technical contribution is deemed substantial. The proposed method is validated with an extensive study. The authors’ responses have cleared the technical concerns from Reviewers #3 and #4. The response about the technical novelty to Reviewer #5 is acceptable in my opinion.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A

