Abstract

In clinical In-Vitro Fertilization (IVF), identifying the most viable embryo for transfer is important for increasing the likelihood of a successful pregnancy. Traditionally, this process involves embryologists manually assessing embryos’ static morphological features at specific intervals using light microscopy. This manual evaluation is not only time-intensive and costly, due to the need for expert analysis, but also inherently subjective, leading to variability in the selection process. To address these challenges, we develop a multimodal model that leverages both time-lapse video data and Electronic Health Records (EHRs) to predict embryo viability. A key challenge of our research is to effectively combine time-lapse video and EHR data, given their distinct modality characteristics. We comprehensively analyze our multimodal model with various modality inputs and integration approaches. Our approach will enable fast and automated embryo viability predictions at scale for clinical IVF.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2517_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2517_supp.pdf

Link to the Code Repository

https://github.com/mibastro/MMIVF

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Kim_Multimodal_MICCAI2024,
        author = { Kim, Junsik and Shi, Zhiyi and Jeong, Davin and Knittel, Johannes and Yang, Helen Y. and Song, Yonghyun and Li, Wanhua and Li, Yicong and Ben-Yosef, Dalit and Needleman, Daniel and Pfister, Hanspeter},
        title = { { Multimodal Learning for Embryo Viability Prediction in Clinical IVF } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this study, the authors evaluated two different methods for predicting embryo viability by combining time-lapse videos and Electronic Health Records (EHRs). They also included additional features extracted from off-the-shelf methods to the multimodal model. The dataset used in this study was fairly large, consisting of 24027 embryos over 3695 IVF treatments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -The study’s integration of time-lapse videos and EHRs to predict embryo viability is a fairly novel approach.

    -The writing is clear and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Availability: While the dataset used in this study is relatively large, neither the dataset nor the code was publicly available.

    • Suboptimal model performance reported: The authors mentioned that they used the pre-trained DeiT-Ti [28] spatial transformers without fine-tuning to generate per-frame embeddings. Meanwhile, the F1-scores reported in the top half of Table 2 were not high (0.284-0.338). Given the fairly large dataset that the authors have (6 million images), I would expect the authors to include the performance of the models when the spatial transformer is fine-tuned on domain-specific data.

    • Lack of errors or confidence intervals reported on performance numbers: Several reported performance numbers are quite counterintuitive; for example, the AUC and F1 score when all the modalities were combined (v+v’+e+e’) were lower than those using v+v’. The authors should have reported confidence intervals for the performance numbers to show whether the differences were statistically significant.

    • Lack of comparison to other relevant work, for example [1].
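
On the confidence-interval point in the list above: one common way to attach an interval to a single AUC estimate is a percentile bootstrap over the test set. The following is a minimal sketch under that assumption, not the authors' evaluation code. The function names (`roc_auc`, `bootstrap_auc_ci`) and the toy labels and scores are hypothetical placeholders.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Ties count as half a win (Mann-Whitney U formulation).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so AUC is undefined; draw again
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy placeholder data (NOT from the paper).
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6]

point_estimate = roc_auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores)
```

If the intervals of two modality combinations overlap heavily, an apparent ranking between them may not be meaningful, which is the concern raised in this review.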

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    For future improvements:

    • I would suggest the authors compare their work with other relevant work to highlight the similarities and differences.

    • The authors should also report the standard error or confidence intervals and provide more explanation on why some performance numbers did not match expectations.

    References: [1]. Liu, Hang, et al. “Development and evaluation of a live birth prediction model for evaluating human blastocysts from a retrospective study.” Elife 12 (2023): e83662.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the proposed method to combine videos with EHRs was fairly novel, the performance improvement from using video was not clearly demonstrated. Moreover, no error estimates were reported to build confidence in the reported numbers.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Many thanks to the authors for addressing my concerns. I am happy to change my suggestion to “Weak Accept” for the paper.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors developed a multimodal model that leverages both time-lapse video data and Electronic Health Records (EHRs) to predict embryo viability for clinical IVF treatments. They explored two different approaches to integrate the video and EHR data modalities, including a multimodal transformer model and a two-stage approach using tabular models. They conducted comprehensive experiments with various combinations of modalities and integration methods to analyze the effectiveness of their multimodal model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Addresses an important problem in IVF treatments by automating embryo viability prediction to increase the chances of successful pregnancy.
    • Utilizes multimodal data (videos and EHRs) to leverage complementary information for the prediction task.
    • Explores different multimodal integration approaches, providing insights into effective ways to combine diverse data modalities.
    • Incorporates additional morphological features extracted from videos using existing methods to enhance the model’s performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The limited size of the training data with treatment outcomes may hinder the performance of large-scale models, as acknowledged by the authors.
    • The multimodal transformer did not outperform the two-stage approach when using only EHR and interpretable features, suggesting room for improvement in multimodal integration.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Should consider exploring self-supervised pre-training techniques to better leverage the large-scale video data and improve the multimodal transformer’s performance.
    • Additional analysis or ablation studies on the importance of different modalities and their combinations would be valuable.
    • Investigating techniques for better confidence calibration could improve the F1-score for treatment success prediction.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important problem in IVF treatments and proposes a novel multimodal approach to leverage diverse data sources for embryo viability prediction. Experiments and analysis provide valuable insights into the strengths and limitations of different multimodal integration methods. While the multimodal transformer did not outperform the two-stage approach in some cases, the authors acknowledge the limitations and suggest potential improvements.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors added details about their fine-tuning attempt on domain-specific data and plan to investigate techniques for better confidence calibration.



Review #3

  • Please describe the contribution of the paper

    The paper explores two different methods for integrating multimodal data for embryo viability prediction. One method uses a transformer-based multimodal model, processing EHRs and videos. The other approach is a two-stage process that first processes video data to extract morphological features, converting them into a tabular format that is then fed into tabular models together with EHRs. The paper presents the effectiveness of a multimodal model for embryo viability prediction in IVF treatments, demonstrated through various experiments. The multimodal transformer method showed improved performance with semantic features but struggled with tabular and video inputs. The model performed better without video, potentially due to limited training videos. It excelled when trained with Embryo-vision outputs. The two-stage method performed best with EHR and interpretable features, even without visual data. However, it scored low in predicting treatment success, which could be due to calibration issues.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Based on the authors’ literature review, the use of multimodal data, including time-lapse videos and EHRs, for predicting embryo viability is a new approach in the field. This stands out from previous studies that primarily used single data types. The authors’ methods show potential for improved prediction accuracy.
    2. The paper provides a comprehensive analysis of two methods for embryo viability prediction: the transformer-based multimodal model and the two-stage approach. The first method integrates different types of data such as time-lapse videos and EHRs, while the two-stage method first processes video data to extract morphological features, which are then combined with EHRs in tabular models. The authors offer a detailed performance analysis with different modality combinations and effectively compare the multimodal transformer method with the two-stage approach, providing valuable insights into the advantages and disadvantages of each method.
    3. The paper is well-structured and provides sufficient explanation of the methods used, successfully demonstrating how they contribute to the overall goal of improving embryo viability prediction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors could apply augmentation techniques to the existing video frames to artificially increase the amount of data, specifically for the minority class, which could improve model performance.
    2. Fine-tuning the Pretrained Model: The authors utilized the pretrained model DeiT as a spatial transformer without fine-tuning it on their specific task. Fine-tuning could potentially improve performance, as it adapts the model to the specific characteristics of the task at hand.
    3. The authors could also experiment with different pretrained transformer models and compare the results. This could provide insights into the best-fit model for this specific task. Or they should provide a justification for the choice of DeiT.
    4. The difficulty in finding the best threshold for prediction confidence is a common problem in machine learning tasks. The authors could use methods like ROC curve analysis to find the optimal threshold. Alternatively, they could experiment with a wide range of threshold values on a validation set and choose the one that achieves the best performance.
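
The second option in item 4, sweeping thresholds on a validation set, can be sketched as follows. This is an illustrative sketch only: `f1_at_threshold`, `best_threshold`, and the validation labels and probabilities are hypothetical and not taken from the paper.

```python
def f1_at_threshold(labels, probs, threshold):
    """F1-score when predicting positive for probabilities >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels, probs):
    """Return (threshold, f1) maximizing F1 over the observed probabilities."""
    candidates = sorted(set(probs))
    return max(((t, f1_at_threshold(labels, probs, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical validation-set labels and predicted probabilities (NOT from the paper).
val_labels = [0, 0, 0, 1, 0, 1, 1, 0]
val_probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.15]

threshold, f1 = best_threshold(val_labels, val_probs)
f1_default = f1_at_threshold(val_labels, val_probs, 0.5)
```

In this toy example no probability reaches the default cutoff of 0.5, so `f1_default` is zero even though the scores rank the positives reasonably well. The selected threshold should be fixed on validation data and then applied unchanged to the test set.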
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is suggested to consider augmenting video data to boost model performance.
    2. The DeiT model could benefit from fine-tuning for this specific task.
    3. It could be insightful to test other pretrained transformer models or justify the choice of DeiT.
    4. To find an optimal prediction confidence threshold, methods like ROC curve analysis or testing threshold values could be utilized.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s use of multimodal data for embryo viability prediction, its thorough comparison of the two proposed prediction methods, and its clear structure make it a noteworthy contribution. Although there are suggestions for further improvements like data augmentation, model fine-tuning, testing other pretrained models, and optimizing threshold estimation, the paper’s positive aspects are sufficient for recommending acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    same justification




Author Feedback

We sincerely thank all reviewers for their valuable comments. We are grateful that our work was recognized as well-written (R1, R4), novel (R1), and providing valuable insights (R3, R4). Regarding the major weaknesses that the reviewers identified, we want to clarify all possible misunderstandings as follows:

  • Availability (R1-Q1) Since our dataset involves human subjects, we cannot make our dataset publicly available. Instead, we will make our training and evaluation code publicly accessible with detailed documentation. This will enable users to modify and employ our model with their specific datasets.

  • Why not fine-tune a spatial transformer on domain-specific data? (R1-Q2, R3-Q1, R4-Q2) As Reviewer 3 recognized, the size of our supervised training dataset is limited for fine-tuning large models. We attempted to fine-tune a spatial transformer, but it resulted in poor performance. Although we have collected a large amount of data, the samples with labels for supervised learning are limited. Moreover, the model processes one video, which consists of 200-300 frames, as a single sample. In other words, the number of videos available for training is approximately 200-300 times smaller than the number of images. Consequently, we opted to use pretrained frame representations rather than fine-tuning the spatial transformer. We acknowledged this limitation in our paper and proposed self-supervised pretraining as a solution for future study.
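
The frames-versus-videos argument can be made concrete with rough arithmetic. The inputs below are the approximate figures quoted on this page: about 6 million images (Review #1) and 200-300 frames per video (author feedback). The calculation is only a back-of-the-envelope sketch, and it does not account for the rebuttal's further point that only some samples carry labels.

```python
# Back-of-the-envelope estimate; inputs are the approximate figures quoted above.
total_frames = 6_000_000          # "6 million images" (Review #1)
frames_per_video = (200, 300)     # per-video frame range (author feedback)

# More frames per video means fewer videos, so the bounds swap.
min_videos = total_frames // frames_per_video[1]
max_videos = total_frames // frames_per_video[0]

print(f"roughly {min_videos:,} to {max_videos:,} video-level samples")
```

This range is consistent with the 24027 embryos mentioned in Review #1, and it shows why a dataset that looks large at the frame level is modest at the video level.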

  • Lack of confidence intervals (R1-Q3) The ROCAUC of end-to-end multimodal learning did not fluctuate after the training loss converged. This observation was consistent across different modalities. In contrast, we observed higher performance variation in the two-stage approaches. We conjecture that this is due to early convergence of the two-stage models, leading to different solutions after optimization. Here, we report confidence intervals from 10 trials of the two-stage approaches. Note that this is not an additional experiment, but a repetition of the experiment reported in the main paper to quantify its statistical variability. TabTransformer showed a CI range of (+/-) 0.021~0.045 and TabNet showed a CI range of (+/-) 0.012~0.025 for embryo-ROCAUC across different modalities. This supports the advantage of the end-to-end multimodal learning approach, which is the main aim of our paper, since lower performance variation is preferable. We will add CIs with relevant discussion if allowed.
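
An interval over repeated runs, as described above, is commonly computed as mean ± t·s/√n. The sketch below assumes a two-sided 95% Student-t interval; the rebuttal does not state which interval construction or confidence level was used. The ten AUC values are made-up placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (n = 10 runs).
T_CRIT_95_DF9 = 2.262


def ci_half_width(values, t_crit):
    """Half-width of the confidence interval for the mean: t * s / sqrt(n)."""
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return t_crit * s / math.sqrt(len(values))


# Placeholder AUCs from 10 hypothetical repeated runs (NOT values from the paper).
run_aucs = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.67, 0.70, 0.71, 0.69]

mean_auc = statistics.mean(run_aucs)
half_width = ci_half_width(run_aucs, T_CRIT_95_DF9)
print(f"AUC = {mean_auc:.3f} +/- {half_width:.3f}")
```

The critical value depends on the number of runs, so `T_CRIT_95_DF9` would need to change if a different number of trials were used.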

  • Lack of comparison to [1] (R1-Q4) Thank you for the comment. We will add the paper [1] to our introduction. It is worth noting that [1] tackles a similar problem to ours, but the experimental setting is different and, therefore, not directly comparable. While our work aims to combine video modality and EHR data, exploring different modality fusion techniques, [1] uses a single image and EHR data without modality fusion. Moreover, they do not integrate different modalities but rather aggregate the predictions of unimodal models (image model and EHR model) for the final prediction. However, we agree that [1] is relevant to our work, so we will include this work in the introduction.

[1]. Liu, Hang, et al. “Development and evaluation of a live birth prediction model for evaluating human blastocysts from a retrospective study.” Elife 12 (2023): e83662.

  • Apply data augmentation. (R4-Q1) We applied rotation and flip augmentation but avoided further augmentations that may change embryo appearance.
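
Rotation and flip augmentation of a time-lapse video can be sketched as below. The rebuttal states only that rotation and flip were used. The restriction to multiples of 90 degrees, the choice to apply one shared transform to every frame of a video, and the helper names (`rotate90`, `hflip`, `augment_video`) are assumptions made for this illustration.

```python
import random


def rotate90(frame):
    """Rotate a 2D frame (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*frame[::-1])]


def hflip(frame):
    """Mirror a 2D frame left to right."""
    return [row[::-1] for row in frame]


def augment_video(frames, rng):
    """Apply one randomly drawn rotation/flip to every frame of a video."""
    k = rng.randrange(4)          # number of 90-degree rotations
    flip = rng.random() < 0.5     # whether to mirror
    out = []
    for frame in frames:
        for _ in range(k):
            frame = rotate90(frame)
        if flip:
            frame = hflip(frame)
        out.append(frame)
    return out


# Toy 2x2 "video" with two frames (placeholder pixel values).
toy_video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
augmented = augment_video(toy_video, random.Random(0))
```

These transforms only rearrange pixels, so they leave the embryo's appearance unchanged, which matches the stated reason for avoiding stronger augmentations.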

  • Other constructive comments (R3-comments, R4-Q3,Q4) Thank you for your constructive comments on our direction. As you suggested, we will investigate self-supervised pre-training on larger scale data and techniques for better confidence calibration to improve F1 Scores.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


