Abstract

This paper introduces HuLP, a Human-in-the-Loop for Prognosis model designed to enhance the reliability and interpretability of prognostic models in clinical contexts, especially when faced with the complexities of missing covariates and outcomes. HuLP offers an innovative approach that enables human expert intervention, empowering clinicians to interact with and correct models’ predictions, thus fostering collaboration between humans and AI models to produce more accurate prognoses. Additionally, HuLP addresses the challenge of missing data with a tailored neural-network-based methodology. Traditional methods often struggle to capture the nuanced variations within patient populations, leading to compromised prognostic predictions. HuLP imputes missing covariates based on imaging features, aligning more closely with clinician workflows and enhancing reliability. We conduct our experiments on two real-world, publicly available medical datasets to demonstrate the superiority and competitiveness of HuLP. Our code is available at https://github.com/BioMedIA-MBZUAI/HuLP.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3209_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3209_supp.pdf

Link to the Code Repository

https://github.com/BioMedIA-MBZUAI/HuLP

Link to the Dataset(s)

https://hecktor.grand-challenge.org/

https://chaimeleon.grand-challenge.org/

BibTex

@InProceedings{Rid_HuLP_MICCAI2024,
        author = { Ridzuan, Muhammad and Shaaban, Mai A. and Saeed, Numan and Sobirov, Ikboljon and Yaqub, Mohammad},
        title = { { HuLP: Human-in-the-Loop for Prognosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents HuLP, a human-in-the-loop deep learning model that uses medical images to impute missing electronic health record (EHR) data that is then used to prognosticate patient survival. The images are embedded into a number of features each representing a clinical concept (e.g. tumor stage, age, gender, etc). An expert user can intervene with the predicted concepts and indicate the presence or absence of the concept, and the final embedding is used in a survival model to predict patient hazard from the concepts. Using two datasets of lung cancer CT images + EHR, and head-and-neck cancer PET/CT images + EHR, the proposed method is on par with or better than other baseline models in terms of survival c-index. Ablation experiments show the benefit of using image and EHR data for prognostication.
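    For concreteness, a minimal sketch of such a concept-bottleneck pipeline — not the authors' implementation; all module names and dimensions are illustrative placeholders:

        import torch
        import torch.nn as nn

        class ConceptSurvivalNet(nn.Module):
            """Image -> concept probabilities -> (optional expert override) -> hazard."""
            def __init__(self, feat_dim=128, n_concepts=8, n_time_bins=20):
                super().__init__()
                self.encoder = nn.Sequential(            # stand-in for a CNN backbone
                    nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
                self.concept_head = nn.Linear(feat_dim, n_concepts)
                self.hazard_head = nn.Linear(n_concepts, n_time_bins)

            def forward(self, image, intervention=None):
                p = torch.sigmoid(self.concept_head(self.encoder(image)))
                if intervention is not None:             # expert asserts 0/1 where certain;
                    keep = ~torch.isnan(intervention)    # NaN = keep the model's prediction
                    p = torch.where(keep, intervention, p)
                hazard = torch.softmax(self.hazard_head(p), dim=-1)
                return p, hazard

        model = ConceptSurvivalNet()
        img = torch.randn(2, 1, 32, 32)
        interv = torch.full((2, 8), float("nan"))        # intervene on one concept only
        interv[0, 3] = 1.0
        p, hazard = model(img, interv)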

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Tackles the big problem of missing data in medical research.

    • Performs cross-validation and reports uncertainties.

    • Generally good ablation experiments.

    • Allows incorporation of human expertise.

    • Image-based imputation (cf. naïve imputation) is a good idea and appears novel and useful.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The description of the survival modelling is vague.

    • The presentation of ablation experiments and results is sometimes vague and unclear.

    • No clear benefit over the baseline Fusion [16], neither in the resulting c-index nor in a more efficient/easier methodology. It is unclear whether the statement about disjoint embeddings is a drawback for Fusion. See detailed comment no. 8.

    • Ablation experiment for Table 2 is unclear and potentially uninformative.

    • There are no experiments to analyze the effect of noise in the human intervention, e.g. the results when an intervention is wrong. The experiments only simulate perfect ground truth intervention.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No open code and an unclear experimental description make reproducibility difficult as it stands.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It would help with clarity and understanding if you include all notations in Fig 1, i.e. write out x, y, P, … at the appropriate places. Please adjust.

    2. Please specify how the continuous features from the EHR (e.g. age) were discretized.

    3. It is unclear from the text “To facilitate test-time intervention, p is randomly replaced with the hard ground truth labels [0, 1] with a probability of 0.25” whether this is done only for missing values or for all values. Please clarify.

    4. Continuing on the bullet above: For the ablation study in Table 2 (w/ and w/o test-time interv.), it seems p was used also for non-missing data (first paragraph on page 7). I.e. “w/ test-time interv” = true p for non-missing data and random p for missing, and “w/o test-time interv” = random p for all. Is this correct? This seems like a very artificial scenario, and comparing the two will of course show better results when using the true data for non-missing values. If my understanding of the experiment is correct, this experiment seems uninformative of the benefit of user intervention.

    5. Please include the dimensions of the different components c_i, c_{Fi}.

    6. Please clarify the modelling of patient hazard. It is stated (eq 2) that H is the cumulative hazard function, and h is the estimated hazard. Please explain. Is this based on Cox proportional hazards or similar?

    7. What is sigma in Eq 5? Please clarify.

    8. Please include columns in Table 1 to indicate the use of image and EHR data (e.g. using check marks as in a typical ablation table). It is unclear if e.g. DeepHit in row 1 uses both images and EHR, or only EHR.

    9. Please clarify what is meant by statement (results section on page 6) “the learning of fusion from EHR and image embeddings were disjoint”. What is meant by “disjoint”?

    10. Please clarify why/how your method ensures disentangled embeddings (results on page 6).

    11. The columns in Table 3 show 30% to 70% missing data. The c-index is lowest for the least amount of missing data (30%). This seems counterintuitive. Are the columns permuted? Please clarify.

    12. The statement in the discussion (page 8) about “…HuLP empowers clinicians to actively engage in the model, refining its predictions…” should be clarified to refer to “concept predictions”, not to be confused with “prognostic predictions”. While it is true that clinicians can engage with the model, they do so not to refine the prognostic prediction directly, but rather only to help fill in missing data. The engagement is thus quite limited in scope and not driven by the final prognostic prediction of the model. The statement should thus be toned down.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is generally well written and includes mostly relevant and well performed ablation experiments. The topic is very interesting, as missing data in medical research is a huge problem. Leveraging the image data instead of naïve imputation seems like a good and feasible approach. Furthermore, the effect of the imputation on prognostic performance is a worthy task. Clarifications on some experiments are needed as they are muddy and unclear in the current version. Also, the benefit of the proposed method should be clarified over baselines.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors developed HuLP: a prognostic model for survival prediction. The main contribution is that the model can handle missing values by deriving this information from the scan, as well as that it permits intervention by a clinician when making predictions. Both of these characteristics make a model more suitable for possible implementation in clinical practice. The model is used in two experiments for cancer survival prediction (lung cancer with CT scans and head-and-neck cancer with PET/CT scans) based on imaging and electronic health record data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the innovation in its novel architecture to incorporate human input. The authors address two highly relevant aspects of clinical implementation of image-based AI: interaction between model and clinician, and missing data. A novel approach is proposed to address these issues.

    The authors implemented a novel loss that is driven by not only the prognostic accuracy, but also by the correlation between imaging data and clinical parameters. The authors perform relevant experiments: they compare their model with several baseline models and methods for missing value imputation, and evaluate model performance with and without human intervention. Additionally, their model is able to handle missing values in an advanced way: by imputing the clinical information from the scans.

    The figures and tables are very clear and well-organized.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I miss a short related-work section with a motivation from the literature on the novelty of HuLP, to highlight how HuLP builds on previous models that incorporate clinician input and deal with missing data. Also, the choice of the baseline models (DeepHit, DeepMTLR and Fusion) is not motivated at all.

    The authors do not significantly outperform some of the baselines, e.g. DeepHit using image data only for HECKTOR. Also, there are some possible downsides:

    • It can be expected that the method of the authors is more complex and difficult to implement and optimize than some of the other baselines.
    • The included clinical variables influence the image features that are extracted, which could make the model rely heavily on the exact clinical information that is included.
    • The practical feasibility of human intervention within a clinical setting is unclear, whereas results show that the model needs the intervention to achieve better performance (Table 2). This could mean that the model could be optimized further instead of relying on manual intervention.

    In the light of the current performance, the added value of this method over simpler approaches is not yet proven.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N.A.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major feedback

    Advice to clarify:

    • The human intervention now relies on changing the probability of the presence or absence of a certain concept by replacing p with [0, 1]. How large is the influence of manually changing these probabilities, given that the authors describe that p is already randomly replaced with [0, 1] for 25% of the p values? It also seems like these p values (the presence or absence of the concepts) are already known from the clinical information, so what, in that case, is the value of manual adjustment?
    • The authors describe that the baseline imputation methods may oversimplify things, thereby introducing a bias. However, they do not address that their method might also still suffer from bias if the missing values are not missing at random.
    • The authors used k=1 for the kNN imputation, whereas this is probably not the optimal value for this imputation method. Why was k=1 picked? Were multiple values of k evaluated? (A short sketch of this baseline follows after this list.)
    • The test-time intervention is insufficiently described. How was the test-time intervention implemented (i.e. regarding the results in Table 2)? Was this evaluated with a clinician, or did someone of the research team conduct this intervention? How was this person instructed? Was this person blinded for the initial model prediction? For what percentage of predictions was the intervention conducted?
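    For reference, a self-contained scikit-learn illustration of the kNN-imputation baseline questioned above, with made-up data; k=1 copies each missing value from the single nearest patient in feature space:

        import numpy as np
        from sklearn.impute import KNNImputer

        X = np.array([[62., 1., np.nan],      # age, sex, stage (stage missing)
                      [65., 0., 2.],
                      [80., 1., 3.]])
        X_imputed = KNNImputer(n_neighbors=1).fit_transform(X)
        print(X_imputed)                      # the NaN is filled from the closest row

    Larger values of k would average over several neighbors, which is one reason the choice of k=1 deserves justification.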

    Minor feedback

    Advice to add a measure of spread (e.g. SD) to the results in Table 3.

    Advice to clarify:

    • The authors arbitrarily split the embeddings into two halves: why make this split arbitrarily? Is there a better way to do this?
    • The authors combine the tumor substages (e.g. T1a, T1b, T1c) into a main category (e.g. T1). Why is that? The substages could hold valuable information and the model should be able to handle the additional distinction.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors developed a novel approach for survival prediction that facilitates human intervention and handles missing data in an advanced way. However, the current results do not yet support the use of this model over other, simpler approaches.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a prognostic model conditioned on images that supports manual imputation of intermediate features, thus, human-in-the-loop. These features are made to correspond to categorical clinical/EHR variables when available, or derived from the image / implicitly imputed by the model when missing. A custom loss function, inspired by DeepHit, is proposed. The model is evaluated on two multimodal datasets, showing comparable performance to the prior art - higher C-index on one dataset and lower on another. Compared to simpler data imputation techniques, the proposed model shows favourable performance in multiple scenarios.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Excellent clarity and organization of the paper.
    2. The methodology is overall sound and well-presented; it tackles the important problem of missing data in multimodal medical computing. The suggested method has the potential to facilitate clinical translation of prognostic models.
    3. Generally convincing experimental results - evaluation is done on several datasets, the baseline methods are up to date, and the core analysis includes statistical testing.
    4. The to-be-released source code contributes to improved repeatability of the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Justification for using the proposed method architecture is not clearly explained. In particular, it is not clear why conditioning the EHR features on an image is a good idea compared to, for example, the architecture of DeepFusionV2 (https://doi.org/10.1007/978-3-030-98253-9_26). The same remark applies to the use of positive and negative concepts (c_i-, c_i+). While the idea may be guessed, the motivation should be clearly stated.
    2. Sensitivity of the results to the choice of hyper-parameters is not sufficiently clarified. In particular, the impact of the coefficients “a” and “b” from the loss function should be analyzed. Additionally, only one encoder model was evaluated.
    3. The potential impact of data exclusion in HECKTOR data has to be discussed - “Features with over 80% missing data are dropped”. Can it be the reason why the proposed method outperforms the DeepFusion?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    -

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Abstract: “the superiority of HuLP”. This is not supported by the results. Please tone down the sentence.
    2. Keywords: “Imputation-free”. The proposed method can be seen as doing implicit data imputation. Please remove or replace the keyword.
    3. Page 2: “This capability significantly enhances the model’s decision-making process,”. In its current form, the sentence seems to refer to the Table 2 results, where no statistical testing is done. Please rephrase or remove the word “significantly”.
    4. “FC layer γ(·) with n discrete time bins”. Please specify the value of “n” used in the experiments.
    5. Page 6: “HuLP distinguishes … producing rich, disentangled embeddings of the clinical features from the images”. It is difficult to see why the embeddings are disentangled given that the EHR-correlated features are conditioned on the image-based embeddings in the architecture. Please reconsider this sentence and rephrase.
    6. Table 1: Please state explicitly which statistical test was used.
    7. Table 2: The scores without test-time intervention are considerably lower than those of the prior art. This should be briefly discussed.
    8. “and extract of meaningful” -> “and extraction of / to extract “.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Clarity and organization
    • Scientifically sound methodology,
    • Relevant and up-to-date baselines, and a reasonable experimental section
    • Source code
    • Partially unclear justification of the method design choices
    • Missing analysis of sensitivity to hyper-parameter values
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers for their positive impressions of our work and address the major points raised by them:

[R1,4] Justification/motivation for the proposed architecture over simpler methods: HuLP is built on the idea that “predicting the future (the will-be) is more difficult than describing the present (the what-is).” In clinical contexts, clinicians are often trained and more confident in diagnosis (i.e. detecting diseases from EHR and medical images) than prognosis (i.e. predicting future survival of patients). This is because the former is apparent, while the latter is uncertain and depends on ingesting abundant data to make accurate individual predictions. Conversely, a trained neural network may be limited by the number of samples per concept to make confident diagnoses, but can complement a clinician’s workflow by being able to ingest all available information to predict individual patient survival. This is the basis on which we allow clinicians to intervene in the model’s prediction of EHR concepts and for the model to dynamically adjust its prognosis in response. In doing so, both clinicians and HuLP can benefit each other.

[R1,6] Clarification on test-time intervention: There appears to be some confusion regarding the role of p (and human input) during test time. To clarify, p is not random but learned during training. During training, p is randomly replaced with the ground truth label [0,1] with 25% probability per concept. During testing, a clinician may impart expert knowledge by replacing p with [0,1] to denote certainty in the absence/presence of a concept. For uncertain concepts, they may leave p as is. In Table 2, we emulate clinicians’ intervention by replacing p with the ground truth labels. For missing concepts, we treat them as clinicians being uncertain of the absence/presence of the concepts and thus leave p as is.
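A minimal sketch of this mechanism, assuming concept probabilities p of shape (batch, concepts) and labels encoded as 0/1 with NaN for missing (illustrative names, not the released code):

    import torch

    def maybe_replace_with_ground_truth(p, labels, prob=0.25):
        """Training: swap each concept prob for its hard label with probability 0.25."""
        swap = (torch.rand_like(p) < prob) & ~torch.isnan(labels)
        return torch.where(swap, labels, p)

    p = torch.rand(4, 3)
    labels = torch.tensor([[1., 0., float("nan")]] * 4)
    p_train = maybe_replace_with_ground_truth(p, labels)   # training-time behavior
    p_test = torch.where(~torch.isnan(labels), labels, p)  # Table 2 emulation: known
                                                           # concepts asserted, missing
                                                           # ones left as is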

[R1,4,6] Justification for the choice of baselines: DeepHit and DeepMTLR are chosen because they are both top-performing discrete survival methods in prognosis. We find these baselines appropriate as our HuLP implementation is also discrete. Fusion is chosen as a multimodal baseline and also because it won the HECKTOR competition, the same dataset used in this work. Compared to Fusion, HuLP distinguishes itself in its ability to perform implicit imputation from the images - in line with clinician workflows - rather than performing hard imputations. Additionally, rather than learning separate embeddings for EHR and images, HuLP differentiates itself by jointly learning from both, in such a way that the EHR guides which imaging features the model extracts to inform prognosis.

[R1,4,6] Further optimization and experiments: The reviewers rightfully remark that there are multiple facets of possible investigations to optimize the method to improve the results. We did not investigate beyond the design choices mentioned in the paper, as we found the results to already be promising. Our work thus presents a methodological novelty with the potential to be optimized further. R6 mentions that analyzing the effect of incorrect human intervention would be a good ablation. We could not explore this in the context of this work but will keep the valuable recommendation in mind for future work.

[R6] Clarification on survival model: Our model is not based on CoxPH. In CoxPH, the hazard function is assumed to be the product of a baseline hazard function and a risk score. In HuLP, the softmax output is interpreted as the hazard, which can be converted to a cumulative hazard function through a cumulative sum: H(T|x) = sum_{t=1}^{T} h(t|x) (Eq. 2).
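Numerically, the conversion amounts to a running sum over the softmax output. A short illustration with made-up numbers; the exp(-H) step is the standard continuous-time identity for a survival curve and is not stated in the rebuttal:

    import numpy as np

    logits = np.array([0.2, 1.1, -0.3, 0.5])      # one patient, 4 time bins
    h = np.exp(logits) / np.exp(logits).sum()     # softmax output read as hazard h(t|x)
    H = np.cumsum(h)                              # cumulative hazard H(T|x)
    S = np.exp(-H)                                # implied survival curve S(T|x)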

[R4] Data exclusion: For fair comparison, we performed the same exact preprocessing, including data exclusion, for ALL baselines and HuLP experiments.

[R6] Code availability: To maintain anonymity, we put a placeholder link in the initial submission. The GitHub repo is updated in the camera-ready version.




Meta-Review

Meta-review not available, early accepted paper.


