Abstract

Predicting the probability of human error is an important problem with applications ranging from optimizing learning environments to distributing cases among doctors in a clinic. In both of these instances, predicting the probability of error is equivalent to predicting the difficulty of the assignment, e.g., diagnosing a specific image of a skin lesion. However, the difficulty of a case is subjective since what is difficult for one person is not necessarily difficult for another. We present a novel approach for personalized estimation of human difficulty, using a transformer-based neural network that looks at previous cases and whether the user answered them correctly. We demonstrate our method on doctors diagnosing skin lesions and on a language learning data set, showing generalizability across domains. Our approach utilizes domain representations by first encoding each case using pre-trained neural networks and subsequently using these as tokens in a sequence modeling task. We significantly outperform all baselines, both for cases that are in the training set and for unseen cases. Additionally, we show that our method is robust to the quality of the embeddings and that performance increases as more answers from a user become available. Our findings suggest that this approach could pave the way for truly personalized learning experiences in medical diagnostics, enhancing the quality of patient care.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3180_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/38OJR6

BibTex

@InProceedings{Kam_Is_MICCAI2024,
        author = { Kampen, Peter Johannes Tejlgaard and Christensen, Anders Nymark and Hannemose, Morten Rieger},
        title = { { Is this hard for you? Personalized human difficulty estimation for skin lesion diagnosis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a transformer-based approach for personalized human difficulty estimation, i.e., the problem of estimating how difficult it will be for a user to answer a new problem, given the problem and the history of solved problems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written
    • The problem of human difficulty estimation is relevant, and the authors manage to provide possible applications
    • The method outperforms the evaluated baselines
    • Explanations are clear and exhaustive
    • Comparison to several SOTA methods
    • A complete analysis of the results is provided, including limitations of the method regarding integration of skill level into the initialization.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Only two datasets are used, with only one being from the medical domain.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The method seems to perform much better than the baselines, although evaluation on a second medical dataset would be desirable. I understand the limitations in data availability for the task that limit this evaluation. Other than this, I have only two small comments:

    • Backward quotation marks were used
    • Page 2: Space missing before “Our code will be made…”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method sets a new state of the art in individualized human difficulty estimation, and proves to be robust to variations in embeddings. Evaluation is extensive and conclusive.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper contributed a novel approach for personalized estimation of human difficulty, utilizing a transformer-based neural network that considered previous cases and user performance. Demonstrations included doctors diagnosing skin lesions and language learning datasets, showcasing generalizability across domains. The approach encoded each case using pre-trained neural networks and utilized them as tokens in a sequence modeling task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the way it has been structured. The authors mentioned the contributions very clearly by providing the rationale behind them. Additionally, the authors wrote the related work thoughtfully and outlined the problems with existing works, as well as described how they planned to overcome them. The methodology is also described in a detailed manner with all required mathematical formulations. Regarding the datasets used in the work, all public datasets are mentioned clearly in the paper. The authors also mentioned that they can produce the code upon acceptance. Furthermore, the Results section presents all the required information needed to prove the efficiency of the proposed work in comparison to existing work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I couldn’t find any significant weaknesses in the paper. As the authors claim, they presented a novel approach for personalized estimation of human difficulty, utilizing a transformer-based neural network that considers previous cases and user responses. I believe that discussing the current limitations of their work in detail would be beneficial for other researchers to further explore this method. These details could be provided in a supplementary file.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Which pre-trained models were used in the proposed work, according to the statement “We combine the domain understanding of pre-trained models with sequence modelling”? Why were long skip connections used? How were the hyperparameters selected during training of the model? What is the significance of the balanced accuracy?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main strength of the paper is the way it has been structured. The authors mentioned the contributions very clearly by providing the rationale behind them. Additionally, the authors wrote the related work thoughtfully and outlined the problems with existing works, as well as described how they planned to overcome them. The methodology is also described in a detailed manner with all required mathematical formulations. Regarding the datasets used in the work, all public datasets are mentioned clearly in the paper. The authors also mentioned that they can produce the code upon acceptance. Furthermore, the Results section presents all the required information needed to prove the efficiency of the proposed work in comparison to existing work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper tackles the problem of predicting personalized task difficulty for tasks like skin lesion diagnosis and language learning, aiming to optimize task distribution in professional settings. The authors introduce a transformer-based model that uniquely assesses task difficulty by integrating domain-specific pre-trained embeddings with a sequential analysis of user interactions. This model, which deviates from traditional generalized difficulty estimates, is evaluated using two datasets: a Skin Lesions Dataset with 34,485 images from 82 medical students, and the Duolingo SLAM Dataset for language learning. These datasets are split into training, validation, and testing phases, ensuring realistic distribution and balance. The model demonstrates superior performance in accuracy and balanced accuracy compared to baseline methods such as the Elo rating system and Bayesian Knowledge Tracing, marking significant advancements in personalized difficulty assessment.
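For readers unfamiliar with the Elo rating baseline named in this summary, the following is a minimal sketch of how an Elo-style system can turn a user rating and a case rating into a probability of a correct answer. It is not the authors' implementation: the function names, the logistic scale of 400, and the step size `k=32` are assumptions borrowed from the conventional chess formulation, and this page does not state which settings the paper used.

```python
def expected_correct(user_rating, item_rating):
    """Elo-style probability that the user answers the item correctly.

    Assumption: the conventional base-10 logistic curve with scale 400.
    """
    return 1.0 / (1.0 + 10 ** ((item_rating - user_rating) / 400.0))


def elo_update(user_rating, item_rating, correct, k=32.0):
    """Update both ratings after one answer (k is an assumed step size)."""
    p = expected_correct(user_rating, item_rating)
    delta = k * ((1.0 if correct else 0.0) - p)
    # A correct answer raises the user's skill and lowers the item's difficulty.
    return user_rating + delta, item_rating - delta


# Hypothetical example: a new user and a new case, both starting at 1500.
user, item = 1500.0, 1500.0
print(expected_correct(user, item))   # 0.5
user, item = elo_update(user, item, correct=True)
print(user, item)                     # 1516.0 1484.0
```

Such a baseline keeps a single scalar per user and per case, which is the kind of generalized difficulty estimate the summary contrasts with the sequence-based model.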

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Innovative Application of Transformers: The use of a transformer-based model to assess personalized task difficulty is a pioneering approach. This method allows for a nuanced understanding of individual user responses over time, leveraging the power of sequence modeling and deep learning to predict user performance more accurately.

    - Robust Performance Across Domains: The model demonstrates its versatility and effectiveness by performing well across diverse domains, such as medical diagnostics and language learning. This cross-domain applicability is a strong indication of the model’s robustness and its potential utility in various professional and educational settings.

    - Handling of Varied Data Inputs: The model is capable of processing an extensive range of input data sizes - from users with few answers to those with extensive historical interaction data - without a decline in performance. This flexibility makes it particularly valuable in real-world applications where data availability can vary significantly.

    - Superiority Over Existing Methods: The paper highlights that the proposed model outperforms traditional methods like the Elo rating system and Bayesian Knowledge Tracing in terms of accuracy and balanced accuracy. This improvement is crucial for practical applications where enhanced prediction accuracy can lead to better personalized experiences and outcomes.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper identifies a limitation in the model related to the initialization of new users’ skill levels. Currently, the model lacks an explicit mechanism to set initial skill levels for new users, which could hinder rapid and precise user assessment. The authors recognize this issue and suggest it as a focus for future enhancements, aiming to improve the model’s applicability and adaptability in practical settings.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The reviewer wonders if the authors have considered methods to enhance the interpretability of their model’s decisions. Could techniques such as attention visualization or feature importance metrics be implemented to clarify which features or tokens significantly influence the difficulty predictions? Such improvements could help end-users better understand and trust the model’s outputs.

    • Additionally, the reviewer is curious about the feasibility of integrating a feedback mechanism into the model. Did the authors explore the potential of such a feature to enhance the accuracy of learning adaptations and enable the model to adapt over time to changes in user skill levels or task characteristics? How feasible would this integration be within their current framework?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s innovative application of transformer-based models for personalized difficulty estimation justifies a strong recommendation for acceptance. Its pioneering approach extends the use of sequence modeling to effectively predict individual performance across varied domains such as medical diagnostics and language learning. This cross-domain robustness, combined with the model’s ability to manage diverse input data volumes, from sparse to extensive user interactions, underscores its practical utility and theoretical significance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

For reviewer #3: We would like to thank the reviewer for the time and energy spent reviewing our submission, and for the comments made. We fully agree with the reviewer that an additional (open-source) medical dataset would have been ideal. However, we did not manage to find one. Finally, thank you for the grammatical corrections; they are helpful for preparing the camera-ready version.

For reviewer #4

We thank the reviewer for the comments and time taken to read our submission. With regard to the limitations of the method, we do have a brief discussion of them in the discussion section. Most notable is the inability to include prior knowledge about a physician's or student's skill. Below we address your questions. 1) We employ two different pre-trained encoders, one for each dataset/domain: for the skin lesion diagnostic dataset it is a ResNet50 trained to embed the images according to diagnostic label, and for the Duolingo dataset it is a DistilBERT model used for autoencoding sentences. 2) We tried long skip connections and saw that they increased performance slightly; see Table 5 for the ablation study. 3) Hyperparameters and model choices were selected based on the ablation study results shown in Table 5. 4) We are slightly unsure what is meant by the question on the significance of balanced accuracy. However, we chose it as the primary metric to account for class imbalances in the two datasets.
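To illustrate point 4, here is a minimal sketch of why balanced accuracy is less sensitive to class imbalance than plain accuracy. Balanced accuracy is the mean of the per-class recalls. The helper names and the toy labels below are made up for illustration and are not taken from the paper or its datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced data: 8 correct answers (1) and 2 incorrect (0).
y_true = [1] * 8 + [0] * 2
# A trivial predictor that always says "the user will answer correctly".
y_pred = [1] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The trivial predictor looks reasonable under plain accuracy but scores at chance level under balanced accuracy, which is the behaviour one wants when most answers in a dataset fall into one class.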

For reviewer #5: We would like to thank the reviewer for the nice comments, and appreciate the time taken to review our submission.

It would be interesting to investigate the interpretability of the model, e.g., through attention weights. However, we would venture a guess that the result would not be very understandable for non-domain experts.

We think that the possibility of integrating feedback into the model is an interesting avenue of research. One could look into online learning or something similar. Simple retraining as the number of available samples increases is also a possibility.




Meta-Review

Meta-review not available, early accepted paper.


