Abstract

Cell tracking remains a pivotal yet challenging task in biomedical research. The full potential of deep learning for this purpose is often untapped due to the limited availability of comprehensive and varied training data sets. In this paper, we present SynCellFactory, a generative method for cell video augmentation. At the heart of SynCellFactory lies the ControlNet architecture, which has been fine-tuned to synthesize cell imagery with photorealistic accuracy in style and motion patterns. This technique enables the creation of synthetic, annotated cell videos that mirror the complexity of authentic microscopy time-lapses. Our experiments demonstrate that SynCellFactory boosts the performance of well-established deep learning models for cell tracking, particularly when original training data is sparse.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3680_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3680_supp.zip

Link to the Code Repository

https://github.com/sciai-lab/SynCellFactory

Link to the Dataset(s)

http://celltrackingchallenge.net/2d-datasets/

BibTex

@InProceedings{Stu_SynCellFactory_MICCAI2024,
        author = { Sturm, Moritz and Cerrone, Lorenzo and Hamprecht, Fred A.},
        title = { { SynCellFactory: Generative Data Augmentation for Cell Tracking } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a novel method for generating synthetic 2D microscopy video datasets for cell tracking using a motion model and ControlNets. The motion model simulates cell movements, and the ControlNets render realistic video frames, yielding datasets with cell detection and tracking annotations. This enriched dataset can be used as an additional data augmentation method to improve the performance of deep neural networks for cell tracking.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors use ControlNet, a text-to-image diffusion model, to generate synthetic videos that incorporate detection and tracking information. They utilize colored dots to represent cells at different developmental stages and lines to depict cell movement / division across consecutive time frames. This information directs the diffusion model to create a synthetic video dataset featuring realistic cell mitosis and movement. Unlike other methods, this approach does not depend on simulated ground truth segmentation masks. Furthermore, it requires only a small amount of annotated data for training and can generate an unlimited number of synthetic videos, with adjustable cell counts and sequence lengths.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the general method seems interesting and reasonable, there are several doubts and ambiguities that would need to be resolved and addressed in the rebuttal letter:

    • I was surprised to see that the authors use Cellpose to generate pseudo ground truth. It may be questionable to generate synthetic data + pseudo masks if you cannot be sure that the masks you’re generating are 100% trustworthy. How does your approach handle errors that remain in the results generated by Cellpose?

    • In the positional ControlNet it is unclear how the cell cycle stage is determined. Is this just done randomly or does it follow an adequate model of the cell cycle stages learned from training data as done in approaches like CellCycleGAN? Are there any biological constraints that are considered?

    • Movement ControlNet: how is it ensured that the information supplied to the network is actually used for fulfilling the task? Please clarify.

    • While your model could potentially create long sequences, it’s limited to a video length of 12 frames. Is there any justification / reason for this choice / limitation?

    • I guess Cellpose was most likely already trained on the CTC data. Did you prove that the retraining indeed helped to improve the performance?

    • In the introduction you mention “[22] generates an entire video at once using a 3D diffusion model guided by optical flow; crucially, their approach does not produce pseudo ground truth labels, while SynCellFactory does”. Would there be any hurdle of just applying cellpose as well to their images to generate pseudo ground truth? So I think this is not really a unique selling point for your method and the statement should be relaxed.

    • The cell division dynamics look a bit unrealistic. This is especially visible in the videos for the PhC-C2DL-PSC dataset. This point is also relatively vaguely described in the main text and would need additional clarification.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provide the code for the motion model and ControlNet as well as the hyperparameter setting for ControlNet. It is thus likely that the results are reproducible. Moreover, the training data sets are from the Cell Tracking Challenge and thus also available to the public upon registration.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Since the segmentation ground truth is generated using Cellpose as a separate step, there may be concerns about the quality of the segmentation masks. In the paper, only the TRA score is used as a quantitative evaluation. However, it would also be beneficial to evaluate the segmentation quality, such as using the SEG score provided by the Cell Tracking Challenge dataset. This evaluation could be performed both with and without the proposed data augmentation method to better understand its impact on the segmentation quality.

    • To prove the effectiveness of using synthetic data as data augmentation, the study currently employs only one deep learning cell tracking method. It would be advantageous to expand the evaluation to include multiple methods.

    • The motion model currently utilizes a circular disk to simulate the area of a cell, which might not accurately represent cell shapes in some datasets since cells have significantly different shapes. To improve the model’s applicability across various datasets, it would be beneficial to consider adapting the motion model to different cell shapes, like integrating more flexible shape models or using shape parameters that can be learned based on the characteristics of cells in the training datasets.

    • It’s mentioned by the authors that the quality of the generated videos declines as the sequence length increases. In section 2.3, 12 is chosen as the desired length without explanation. It would be better to introduce an ablation study to systematically evaluate the effects of different sequence lengths on the quality of the generated videos and training of the cell tracking network.

    • In the supplementary material, a figure of cell hallucinations close to the image boundary is given, but without sufficient description.

    • It would be beneficial to mention the computational effort (device, training time, GPU occupation) for training and the resolution of the generated videos.

    • In the ablation study of various synthetic data ratios, the results of DIC-C2DH-HeLa, Fluo-N2DL-HeLa, PhC-C2DH-U373 and PhC-C2DL-PSC show a trend of improvement as the percentage of synthetic data increases. This raises the question of the potential outcomes if the model were trained exclusively on synthetic data.

    • Table 1: it would be good to also check whether the results are significantly different using a statistical test (see the sketch after this list). For most of the results, the differences from the version without the SynCellFactory augmentation look almost negligible considering the reported standard deviations.

    • In Fig. 3, you show the impact of different mixing ratios. While you explain the plots in the main text, an interpretation or hypothesis of the observed behavior is missing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The general method and application are important and reasonable. However, as indicated in the detailed comments on the weaknesses / constructive criticism, there are several ambiguities that would need to be resolved before acceptance/publication of the method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks for clarifying a few things. I think the clarifications make sense, and the remaining small issues/changes can potentially be addressed prior to the camera-ready version.



Review #2

  • Please describe the contribution of the paper

    The paper proposes an ensemble of three learning-based modules that together create time-lapse series of 2D images showing an artificial population of living cells (i.e., cells are moving, changing their shapes and dividing), with segmentation and tracking annotations in the celltrackingchallenge.net format. Training the motion module seems to amount to merely extracting parameters for its motion model from external annotated data. The other two modules deliver synthetic raw images (from different inputs/conditioning) and are based on ControlNet.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The software can generate artificial, very reasonable-looking time-lapse images with annotations. The output quality might not be enough to use the generated data for benchmarking tracking tools, but it is certainly good enough to serve as augmentation to real data for training some tracking tools.

    • The software seems to be very simple to use, as promised in the paper and partially verified by me from their source code.

    • The application of this augmentation is demonstrated to deliver (minor) improvements of a tracking method across several different real image sequences.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method produces raw time-lapse images showing synthetic cells, lineage information, and detection markers, but not segmentation masks (of the cells), which would otherwise have been an ideal outcome. The Authors compensate for the absence of generated segmentation masks by running a (custom-trained) Cellpose on the raw images and matching the obtained segmentation masks with detection markers afterward. Not only may matching issues result in a naive (circular, thus inaccurate) segmentation mask, but the segmentation is also not guaranteed to always produce accurate outlines of the tracked synthetic cells.

    • The motion model may be relatively simple in the sense that it may not check for collisions during the motion. This is, however, only a guess of mine, as the motion model is not detailed in the paper (understandably, for lack of space), and I was not sure when checking the source code (still, I thank the Authors for sharing the source code openly!). The CN-Mov network, on the other hand, may be trained to handle collision cases so that the created synthetic images simply look good.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I thank the Authors for sharing their source code with reviewers. Do you intend to share it with the public as well?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    If I may ask, I would be in favour of providing more details about the motion model, and also a bit of evaluation of the quality of the generated textures (people today often use the FID, Fréchet Inception Distance). To gain space for this, I would shrink the section “Tracking Metrics” (in Sec. 3.2) to one or two sentences, and probably also remove completely the section “CTC Results” (in Sec. 3.3).

    A typo? In the last sentence of the Sec. 2.3 I found “…desired video length of 12 frames is desired”, shouldn’t “12” be replaced with “t”?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach has the potential to impact the cell tracking community, which is why I wish to see it published. The impact could be further strengthened if the Authors released their source code or offered it as a service on the web somewhere (e.g., Google Colab).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I thank the Authors for their rebuttal.



Review #3

  • Please describe the contribution of the paper

    This manuscript presents SynCellFactory, a generative machine learning method to simulate annotated and realistic 2D microscopy videos of cell migration. It dynamically combines the output of three different models: (I) a motion model, which samples a random population of objects following 2D stochastic Brownian motion and provides their localisations in a time-lapse sequence; (II) a first diffusion model (CN-Pos), which generates a realistic microscopy image of cells at time T given an image of cell localisations; and (III) a second diffusion model (CN-Mov), which generates the corresponding realistic and coherent image at time T-1 given the inferred realistic image at time T, the cell positions, and the expected cell displacement from T-1 to T. SynCellFactory is benchmarked using 7 different datasets from the Cell Tracking Challenge and an existing machine learning tracking algorithm.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    SynCellFactory tackles two major challenges in microscopy image processing: the lack of annotated live-cell time-lapse image data, and cell tracking. For this, it proposes a new pipeline that takes advantage of very recent diffusion models for generative AI, a technology still largely unexplored in the field of microscopy, together with well-known cell motility distribution parameters, to produce promising simulations of cells migrating in 2D.

    Although the generated videos are often not perfect, the authors show that using them as a data augmentation method to train a cell tracking algorithm contributes to improved results. This test has been performed on datasets from seven different experimental setups, where the microscopy imaging parameters and cell types differ.

    Additionally, SynCellFactory proposes a modular approach, which means that one should be able to modify the motion model so that other types of experiments (e.g., wound healing or collective cell migration) could easily be simulated.

    The manuscript is generally well written and clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The current approach only supports the simulation of a single channel, while in microscopy multiple channels may be acquired.

    • The capacity and limitations of the simulator are not related to imaging parameters. For example, the poor quality of the results for PhC-C2DL-PSC might be improved if the pixel size or the resolution of the images were increased. Also, what are the limitations in terms of cell density or cells getting close to each other? Particularly for encoding the motion for CN-Mov.

    • The authors refer several times to the robustness, photorealism, and accuracy of the results, but these are not quantitatively assessed. Additionally, SynCellFactory identifies cells both in mitosis and right after cytokinesis, but what about apoptosis? Also, the mitosis event in cells is morphologically quite well differentiated. Do the simulations capture the rounding of cells and other known morphological cues of mitosis? (When looking at the images in Figure 2, Fluo-C2DL-Huh7 T3, this does not seem to occur.)

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The following comments are meant to improve the quality of the current text in the manuscript:

    • Abstract: The authors refer to the complexity of microscopy images, but this can mean many different things besides tracking complexities, such as imaging artefacts, so I would try to be more specific or rewrite it. Also, in the abstract it is not clear whether SynCellFactory provides annotated videos or only the video simulation. The same happens in the last paragraph of the first page: “We embrace the power of conditioned 2D diffusion models [19, 28] to generate high-quality synthetic cell videos that mimic the appearance and behavior of real cell data sets.” I would explicitly say that cell video generation comes together with the annotations.

    • In general, there is no mention of the maximum video length or number of cell tracks that SynCellFactory is able to robustly simulate. For example, is there any chance the method considers cells entering and leaving the field of view? Or, on Page 2, the authors say: “Utilizing as little as one annotated cell video for training, SynCellFactory can generate an extensive library of annotated videos in a consistent style, effectively augmenting the available training data.” How long can these videos be?

    • The text mentions pseudo ground truth labels to refer to some annotations provided by CellPose. I would explicitly describe what pseudo ground truth labels mean in this specific work.

    • There is a repeated sentence at the end of the second paragraph in page 2: “In [10] …”

    • I would smooth the following statement in page 2: “… can further enhance accuracy of an already leading deep learning cell tracking method.”

    • Interchange Figures 1 and 2. As it is now, the results are shown even before the method is described, which can be confusing and distracting.

    • I would extend and work on the description of ControlNet. For example, what are c_txt, c_img and i_tgt? Where do you use the text conditioning?

    • Figure 2: the current ordering is confusing. I would try using different panels (a, b, c) to accompany the explanation in the caption. Also, the colors chosen (red, green and blue) are not adequate for color-blind readers.

    • In Page 5, the text says: “The disks are colored according to the stage in the cell cycle, changing colors through the phases of Mitosis and reverting post-cytokinesis.” What do you mean exactly? what are these colors?

    • Page 6: “In particular, we found random cropping of the images and random 90 degree rotations beneficial.” What is the crop size? How does that relate to the image resolution and cell size?

    • Please define the alpha coefficient.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Availability of large annotated datasets is one of the major limitations for the advance of machine learning techniques for cell tracking. In this sense, the authors present a solution with promising results using new methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My decision is the same as before. I think the manuscript presents an innovative pipeline for tracking, an area of bioimage analysis that is still not fully evolved and adapted to data-driven and machine learning approaches.




Author Feedback

We thank all reviewers for their valuable feedback and appreciate their constructive comments. Due to space limitations, we will address major concerns here and aim to incorporate as many reviewer comments as possible in a revised manuscript. Abbreviations: ground truth (GT); Cellpose (CP); SynCellFactory (SCF)

Response to Reviewer 1:

Q: Concerns about CP pseudo GT and error handling (R3). A: R1 is correct that SCF does not strive to guarantee highly accurate segmentation labels. Indeed, synthetic segmentation GT is not our focus, and any segmentation method can replace CP. Instead, SCF targets reliable tracking GT (detections, tracklets). We ensure each cell has a mask by integrating the simulated locations in our correction step (see p. 5, Sec. 2.4). Experiments show these masks suffice to train and enhance models reliant on segmentation GT (EmbedTrack).

Q: Cell cycle stages (R4) A: Cell cycle stages are inferred from the motion model. Splitting occurs randomly during motion model inference without strict biological constraints, except for the split duration which is based on the training data. We agree with R1 and R4 that refining cell cycle modelling is a promising future direction.

Q: Details on Video Length (R4) A: The minimum duration of a full mitosis cycle is t = 6 in the tested datasets. We doubled the minimum cycle length (t = 12) for our experiments. Video quality declines after t = 30.

Q: CP training dataset A: Publications do not list CTC as training data for CP and CP 2.0. In tests, our fine-tuned models outperformed all pre-trained CP models based on segmentation metrics.

Q: Comparison with [22] A: Serna-Aguilera et al. [22] do not provide tracking pseudo GT because their 3D model does not afford spatial conditioning. While segmentation pseudo GT can be generated using CP, tracking GT cannot. We will revise the introduction to state: “…crucially, their approach does not produce tracking pseudo-GT labels, while SCF does.”

Q: Varied cell shapes A: The disks in the motion model identify cell position, not cell shape. ControlNets then generate realistic shapes. A complex shape model would increase motion model complexity, including when cells interact.

Q: Computational Effort and Resolution A: On a single A100 40GB GPU: training one dataset ~ 9h, sampling one timelapse ~ 3min. Generated videos match original resolutions, ranging from 512x512 (DIC-C2DH-HeLa) to 1024x1024 (Fluo-C2DL-Huh7).

Q: Movement ControlNet information A: The Movement ControlNet employs a learning design and a loss function that gently enforces conditioning (see ref. [28]), ensuring effective task completion as evidenced by the quality of results.

Response to Reviewer 3:

Q: Motion Model Detail A: In the motion model, collision detection and resolution occur when two cells overlap. Using a hard-sphere model, positions are adjusted with a repulsion vector until overlaps are resolved. As R3 searched for, the relevant code is located at ‘/motion_module/motion_help.py line 514’.

Q: Include FID (R4) A: While genAI metrics like FID correlate with human perception, this manuscript focuses on the use of the generated videos as data augmentation, prioritising the tracking metric as the sole relevant measure for our pipeline.

Q: Code Availability A: We pledge to release the code as open source on GitHub.

Response to Reviewer 4:

Q: Density and Distance Limitations A: R4’s insights on PSC prompt further analysis of density and cell distance impacts on our pipeline, though no systematic study has yet assessed its limits. Qualitative tests showed that the cell density limits are broadly based on the initial and final frame densities of the training video.

Q: Apoptosis and morphological cues of mitosis A: Our motion model is deliberately simplistic and excludes apoptosis and morphological mitosis cues. A more complex model would capture cell cycle nuances more accurately but would need more biological priors. We opted for simplicity in the SCF model for broader applicability.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


