Abstract
Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets.
To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to a toxic compound (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images).
Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results demonstrate the model’s effectiveness, achieving 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis. Dataset and code will be available at: https://github.com
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4285_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/sarathsp1729/Zebrafish-development
Link to the Dataset(s)
https://github.com/sarathsp1729/Zebrafish-development
BibTex
@InProceedings{SivSar_Automated_MICCAI2025,
author = { Sivaprasad, Sarath and Wang, Hui-Po and Jäckel, Anna-Lisa and Baumann, Jonas and Baumann, Carole and Herrmann, Jennifer and Fritz, Mario},
title = { { Automated Detection of Abnormalities in Zebrafish Development } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a dataset of microscopic image sequences that capture zebrafish embryonic development under control conditions and exposure to compounds, with annotations from experts at the image and sequence level. Two classification tasks are formulated (fertility detection and toxicity assessment) and evaluated both as image-level predictions and as predictions over time. A transformer-based model is defined as an initial benchmark that takes one image at a time and considers context from the sequence. The performance of this model in both tasks exceeds 90% at the image level and shows room for improvement in the whole-sequence evaluation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The motivation for the proposed dataset is clearly explained and contrasted to existing datasets, highlighting the need for larger scale and more detailed annotations.
- Evaluations at the whole sequence level demonstrate some challenges and opportunities to develop better models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited novelty of the transformer model: it is based on the original ViT architecture [5]; the spatial processing is simply the patch-based formulation, and the temporal information is given by a positional embedding, which is also part of the original formulation of Transformers [23]. Related works that have used Transformer models for processing sequential image data are missing (even if from other domains). Two references that use Transformer models on zebrafish data are mentioned in the related work, but the text does not state their tasks or how the proposed model differs from them.
- Lack of comparisons to other models: the experiments are limited to one model, for which ablations are not reported either (e.g., to study the effect of the spatial and temporal embeddings).
- Missing justification for some experimental details: 1) The high-resolution images are downsampled for processing. What are the implications of this? Consider revising claims such as “The model is designed to process sequences of high-definition microscopic images”. 2) There were 4 classes annotated for the toxicity task, but it seems like the model predicts only alive vs. anomalous samples.
- The tasks described in section 3 do not mention the evaluation scenario of early detection, which is then introduced in the results. Early detection is framed as a relevant and challenging task; this framing, and how to properly evaluate it (accuracy at different time steps), could be clarified earlier in the paper.
- Even though the initial results provide a benchmark for the tasks, it is hard to appreciate how challenging these tasks actually are without comparing to other methods (mentioned before) or reporting numbers from previous methods on existing datasets.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Precision is used to report performance in the abstract but only accuracy is reported in the results.
- When claiming the contributions, the fine-grained temporal annotations are part of the dataset (so it can be combined with the first contribution).
- The related work could benefit from comparisons to other transformer models/CNN models, rather than spending one paragraph describing why zebrafish is used as a model organism (may be relevant for a different audience though).
- Taking the logit as the confidence of a prediction may be a simplification, considering that these may not be calibrated (how close the estimated confidence is to the true probability).
- Explain how this number was computed: “1.3 hours in which human experts can accurately predict the label”. Does it follow a prospective analysis? In a retrospective analysis, the fact that the AI needs to see more of the sequence to reach certain accuracy may not be a problem because the data is already collected and the processing is fast. This comparison can be clarified.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experimental validation of the proposed benchmark for the dataset misses comparisons and has simplifications that were not justified. The novelty of the approach is limited, and some aspects of the dataset are not exploited (the high spatial and temporal resolution), ending up with a model that makes image-level predictions and smooths them over the sequence.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
In the response, it is claimed that the transformer model is the first viable modern solution; however, this claim is not substantiated. The lack of comparisons against existing methods makes it difficult to evaluate whether other methods are viable as well or whether this one outperforms alternatives. While ViTs are modern architectures, they have been widely used since 2021, and their application alone does not establish novelty or superiority without proper benchmarking.
The issue of model calibration, which was brought up in earlier feedback, is not addressed in the revised manuscript beyond a brief mention.
The justification for not including baseline comparisons (due to the novelty of the dataset) is also unconvincing. Prior dataset papers published at MICCAI typically include benchmarks, even when the data is new. Without other methods for reference, it is difficult to assess the difficulty of the tasks and the value of the dataset for developing novel models that address existing challenges. The task appears highly problem-specific, which is entirely valid, but may make the dataset more relevant to specialized communities rather than the broader medical imaging or ML/CV audience.
Review #2
- Please describe the contribution of the paper
The authors present a new data set that can be used for defect detection in early embryonic development in zebrafish, in particular to assess the egg fertility and toxicological effects. The images are time resolved with 5 and 15 minute intervals for analysis of temporal effects of fertility and toxicology, respectively. The data set also comprises annotations and the authors use it to train a transformer-based architecture for the detection of toxicological effects on the development.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors provide a new annotated data set for fertility and toxicological testing in zebrafish embryos, which is a nice contribution for the community and for further testing.
- A transformer-based architecture is exemplarily applied on the data set. While the architecture itself is not novel, it’s great to see that the data set is apparently large enough to successfully train larger transformer-based models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major points:
- The authors claim that the annotation was performed based on standardized guidelines. It would be interesting to know more details about these specific guidelines.
- In Fig. 2 it is unclear how temporal information is used / propagated in the Vision Transformer. Is it only an additional token that contains an encoding of the time stamp? If yes, are the embryos synchronized somehow to ensure that the time points also match across different sequences or images? Is information from previous frames considered in any way?
- Except for the temporal integration (which is a bit unclear, as mentioned before), the transformer architecture seems like a standard vision transformer. If there are any other novel aspects of the architecture you used, it would be advisable to emphasize them more prominently.
- Rescaling of the input images to the transformer changes the aspect ratio of the images. Is this potentially a problem? Cropping the images to a square ROI surrounding the embryo might yield more accurate results.
- It seems both tasks are just binary classifications. It is not fully clear why you use o=1 and o=2 despite the same number of classes in both tasks. This also applies to the two different losses that you used.
- In Figure 4, you present the human vs. prediction performance. While the predictions seem to (almost) continuously increase, there is a drop in human performance between timepoints 25 – 100. Is there any explanation for this behavior?
Minor points:
- There are some inconsistencies in the specifiers (e.g., z_t being bold or non-bold, x_t and y_t with t being a vector or scalar, and the like). Make sure to double-check consistency of the specifiers.
- “avarying” -> varying
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
It is a nice endeavor to create a new open-source data set and this will certainly be useful for the community. The demonstrated application of vision transformers for the task of classifying fertility and toxicological impacts also nicely indicates potential use-cases for the data set. Nevertheless, the methodological novelty remains limited (or is at least not clearly enough presented) and there are some other unclear parts that would need a bit more attention before being suitable for publication.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My questions have been addressed and I think that the proposed changes to the manuscript will make it suitable for publication.
Review #3
- Please describe the contribution of the paper
The paper introduces a new dataset for abnormality detection in zebrafish embryos, focusing on fertility detection and toxicity assessment. The paper also includes a baseline benchmark vision transformer model setup.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper presents a substantial, expert-annotated dataset featuring temporal image sequences of zebrafish development. The dataset supports two key tasks: fertility detection (1,344 sequences; 130,368 images) and toxicity assessment (288 sequences; 55,296 images). This is the first large-scale resource of its kind offering detailed temporal annotations for tracking these developmental outcomes.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper employs a standard Vision Transformer baseline, demonstrating high accuracy on the overall classification tasks. The key area identified for improvement is the speed of accurate early prediction, where the model lags behind experts. Considering the model’s lack of novelty, the contribution rests on the newly introduced dataset. The paper itself appears to be more aligned with biological research than with medical imaging or clinical relevance.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The core contribution of this work is a comprehensive new dataset featuring temporal image data for zebrafish embryo classification. The paper further includes a baseline transformer model which, despite its simplicity, achieves promising performance and serves to underscore important avenues for future research, such as improving the timeliness of accurate classification. It also raises concerns regarding its relevance to MICCAI, as the paper appears to be more focused on biological analysis than on medical imaging or clinical applications.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their insightful feedback. We’re encouraged that our dataset is seen as a valuable contribution, enabling further testing (R1) and providing the first large-scale, detailed annotations for zebrafish development (R1, R2, R3). Our additional ViT baseline validates the setup (R1, R3) and underscores the need for advanced methods (R2). Though it is a simple baseline, it presents the first viable modern solution and reveals key differences between human and AI evaluations in this domain. We address key concerns below and will incorporate remaining editorial suggestions in the next version. Upon publication, we will release the full dataset and detailed protocols along with it.
Details on temporal info. (R1, R2): We use a learnable temporal embedding on the class token, not previous frames. Imaging is fully automated and synchronized for consistent time alignment. Prior frames are intentionally not used during inference to enable context-free predictions—this reflects the actual lab setting where early history is unavailable at spawn time. It is critical for assessing an embryo’s state at any single time point without needing sequential history.
Output dimensions o=1/2 & calibration (R1, R2): For the fertility detection task, binary classification with o=1 is the natural choice. We evaluate model calibration in Figure 3c (addressing R2’s question on using probabilities) and use these probabilities for inference. In contrast, toxicity assessment involves a broader range of potential outcomes. As reviewers noted, four classes are annotated in the dataset. We merge them into ‘alive’ vs. ‘anomaly’ for baseline evaluation, but keep o>1 to support future extensions into finer-grained sublethal effect classification—both for our ongoing work and for future users of the dataset. This rationale is discussed in the manuscript (page 6), as it enables easy scaling of the framework and fair comparison across this baseline and future models.
Cropping and aspect ratio (R1), high-definition raw vs. downsampling (R2): We opted for resizing over cropping because the sample is not centered; live samples, especially, move around. Although this alters the aspect ratio, we observed empirically strong results. Cropping remains a valid alternative for future work. As described in the experiment section, training hyperparameters follow [14]. Images are downsampled to the standard transformer input size. The original 1344×820 images—referred to as “high-definition”—will be released for downstream use. We revise the wording around “model” for clarity.
Evaluation time (1.3 hours) and human accuracy dip (R1, R2): The 1.3-hour estimate is derived from expert annotations—specifically, the earliest time point at which image-level labels consistently match final sequence labels (Figure 4). Experts evaluate data prospectively, mirroring real-world lab workflows. The dip in human accuracy between timepoints 25–100 reflects the inherent ambiguity of early-stage phenotypes. Since all embryos appear alive initially (t=1), and the dataset has a large number of normal samples, performance is good at first. Accuracy improves again as development progresses and phenotypes become more distinguishable, causing a dip in between.
Comparison (R2): As noted by R2, no directly comparable baseline exists due to the novelty of our dataset. Prior zebrafish+transformer models—[8] (on unsupervised segmentation) and [24] (on keypoints)—target different objectives/modalities. We therefore use ViT, a state-of-the-art architecture for image-based modeling, and benchmark against expert performance as the strongest real-world reference point.
Relevance to the MICCAI community (R3): Zebrafish are an established preclinical translational model in biomedical research. Our work aligns with MICCAI’s focus on medical imaging for healthcare and drug discovery. The CLINICCAI track explicitly bridges preclinical models and therapeutics. Zebrafish-based studies have appeared in MICCAI’s main conference and workshops.
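To make the mechanism described in the feedback above concrete — a standard ViT whose class token carries a learnable per-time-step embedding, so each frame can be classified without access to previous frames — the following minimal NumPy sketch shows how such an input token sequence could be assembled. All names, shapes, and the random stand-ins for learned parameters are illustrative assumptions, not the authors’ actual code (which was not public at review time).

```python
import numpy as np

def build_vit_tokens(image, t, patch=16, dim=64, num_steps=96, rng=None):
    """Assemble ViT input tokens for one frame observed at time step t.

    Hypothetical sketch: the class token receives a learnable per-time-step
    embedding, letting the model condition on *when* in development the
    frame was taken without seeing any previous frames.
    """
    rng = rng or np.random.default_rng(0)
    H, W = image.shape
    # Patchify: split the frame into non-overlapping patch x patch squares.
    patches = image.reshape(H // patch, patch, W // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    # Random stand-ins for parameters that would be learned in training.
    W_embed  = rng.standard_normal((patch * patch, dim)) * 0.02  # patch projection
    cls_tok  = rng.standard_normal(dim) * 0.02                   # class token
    temporal = rng.standard_normal((num_steps, dim)) * 0.02      # one embedding per time step
    pos      = rng.standard_normal((patches.shape[0] + 1, dim)) * 0.02
    tokens = patches @ W_embed
    cls = cls_tok + temporal[t]            # temporal embedding added to the class token only
    seq = np.vstack([cls, tokens]) + pos   # prepend class token, add positional embeddings
    return seq

frame = np.zeros((224, 224))
seq = build_vit_tokens(frame, t=12)
print(seq.shape)  # (197, 64): 1 class token + 196 patch tokens
```

The resulting sequence would then be fed to an ordinary transformer encoder; only the extra `temporal[t]` lookup distinguishes this input pipeline from a plain ViT, which matches the limited-novelty point raised by the reviewers.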
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
My recommendation is based on:
- Vox populi: there were 2 For votes and 1 Against vote. The Against vote wished for inclusion of comparison methods, which I chose to downweight because the paper is largely about a new dataset. A valid comment was that this work might fit better elsewhere than in the main MICCAI section (e.g., a dataset workshop). I believe that datasets merit attention at ML conferences, because they are so hard to create well and so fundamental to the whole ML enterprise. I sympathize that, like all dataset papers, this paper gets caught by the de facto requirement at ML conferences for an algorithm, and the tendency to judge the algorithm rather than the dataset. If accepted, I urge the authors to clarify a few things:
- What does the extremely high performance of the algorithm say about the value of the dataset for other users? Are the problems posed by the dataset too easy? (noted by a reviewer).
- You discuss prior datasets, one of which appears much larger and with videos, which of course have very high temporal resolution. Please clarify the value-add of this new dataset.
- If not accepted and a reworking is possible, should more page space be put towards describing the dataset? What is the balance of importance in your eyes between dataset and algorithm?
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers raised concerns about the technical novelty of the paper, as it is an application of the transformer model. The application domain is more oriented toward the biological domain than toward the medical or clinical domains of MICCAI.