Abstract

This paper is not about a novel method. Instead, it introduces VesselVerse, a large-scale annotation dataset and collaborative framework for brain vessel annotation. It addresses the critical challenge of annotated data availability for supervised segmentation and provides a valuable resource for the community. VesselVerse represents the largest public release of brain vessel annotations to date, comprising 950 annotated images from three public datasets across multiple neurovascular imaging modalities. Its design allows for multi-expert annotations per image, accounting for variations across diverse annotation protocols. Furthermore, the framework facilitates the inclusion of new annotations and refinements to existing ones, making the dataset dynamic. To enhance annotation reliability, VesselVerse integrates tools for consensus generation and version control mechanisms, enabling the reversion of errors introduced during annotation refinement. We demonstrate VesselVerse’s usability by assessing inter-rater agreement among four expert evaluators.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0087_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0087_supp.zip

Link to the Code Repository

https://i-vesseg.github.io/vesselverse/

Link to the Dataset(s)

COSTA Dataset: https://imed.nimte.ac.cn/costa.html
CAS Dataset: https://codalab.lisn.upsaclay.fr/competitions/9804
IXI (Original) Dataset: https://brain-development.org/ixi-dataset/
IXI (Some Annotations) Dataset: https://xzbai.buaa.edu.cn/datasets.html
SMILE-UHURA Dataset: https://www.synapse.org/Synapse:syn47164761/wiki/620033
TubeTK Dataset: https://public.kitware.com/Wiki/TubeTK/Data
TopCoW Dataset: https://topcow24.grand-challenge.org/data/

BibTex

@InProceedings{FalDan_VesselVerse_MICCAI2025,
        author = { Falcetta, Daniele and Marciano, Vincenzo and Yang, Kaiyuan and Cleary, Jon and Legris, Loïc and Rizzaro, Massimiliano Domenico and Pitsiorlas, Ioannis and Chaptoukaev, Hava and Lemasson, Benjamin and Menze, Bjoern and Zuluaga, Maria A.},
        title = { { VesselVerse: A Dataset and Collaborative Framework for Vessel Annotation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {656 -- 666}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents “VesselVerse”, a platform/framework that consists of a 3D Slicer plug-in for brain vessel annotation, a dataset summarizing three public MRA and CTA datasets with (semi-)manual and algorithm-generated annotations, and a consensus approach based on STAPLE to consolidate the provided annotations. In the second part of the paper, a user study is presented that assessed the quality of the different annotation approaches and compared these ratings with the automatic consensus generation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper strives toward a community-driven effort for generation, improvement, and versioning of (vessel) datasets and corresponding annotations.
    • The dataset provides (semi-)manual annotations for 1,130 MRA/CTA volumes.
    • The framework considers many highly relevant aspects for data consolidation (if that is the right word), such as the possibility to add new annotations, correct old annotations with versioning, and submit new datasets.
    • Code for the framework is provided for review in an anonymized form, which I believe is essential for a project like this.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The submission process for improvements and a corresponding potential validation process for submitted, corrected annotations are not clear / not discussed. New annotations seem to be “subject to validation” according to Fig. 2, and it is mentioned that “each annotation is subjected to validation checks for annotation quality, …”. Still, this process is missing detail from my perspective: how are these validation checks conducted and by whom? What counts as a valid / invalid edit? Since edited annotations may lead to corrupted annotation versions, inconsistent branching, etc., I believe this aspect is highly relevant for a paper that presents a framework as a contribution and should be described and discussed in more detail. From my perspective, the paper would benefit strongly from deriving clear guidelines for contributions (datasets, annotation edits, or new annotations) to the VesselVerse platform.
    • While the framework specifically describes the challenge of differing structures of interest and differing annotation protocols, it does not seem to consider this in its design for consensus generation, versioning, or similar (to the best of my understanding).
    • For the second part of the paper, I was not able to follow what the main contributions are meant to be and what the experiments are supposed to demonstrate: the value of the VesselVerse dataset itself? the utility of the STAPLE algorithm for consensus generation? the behaviour of different experts? This could potentially be highlighted more clearly, also to help the reader understand the results in this context.
    • The comparison between the raters’ rankings and the STAPLE consensus is not very clear to me (see comments below), both in terms of the setup and of the insights that can be derived from the results, especially since the majority of “experts” are algorithm-derived.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I understand that the following review is rather lengthy, but especially since this paper does not focus on a methodological contribution, I wanted to make sure I understand and discuss its components.

    • As mentioned above, the paper discusses that datasets may follow different annotation protocols, and different applications may have different requirements w.r.t. their labels. It is briefly mentioned that users may “select [masks] from different annotation protocols or create a consensus”; later, it is also discussed that multiple annotation protocols may exist simultaneously. However, it is not really discussed how this is implemented from a user’s perspective. It seems that a structured approach to this, also for corrected or adapted annotations, is important to actually utilize the framework.
    • pg. 4/pg. 6: While I understand that (human) expert annotations and automatically generated masks may be treated as the same structures implementation-wise, I found this representation in the text suboptimal. It makes it partly more difficult to follow the text and puts a manually verified (?) segmentation on the same level as a “Frangi-Filter-generated” segmentation. A motivation for this was not clear to me.
    • pg. 4: Consensus generation with STAPLE: While I am not an expert in STAPLE, to the best of my understanding it can be used with different forms of priors. Is any additional information / prior integrated in the consensus generation? Additionally, since a threshold is required to obtain the final mask, information about the selection of this threshold parameter should be added to the paper.
    • pg. 5: Version control: See weaknesses above. Additionally, it should be highlighted how potential “conflicts” are handled, especially if there may be considerable time delays between commits for validation and integration into VesselVerse. Is there something like a “gold standard” annotation for each dataset (potentially also connected to a specific annotation protocol)? Minor: An additional aspect that could be mentioned is that works that utilize the VesselVerse dataset also need to clearly state which version they used for their experiments.
    • pg. 6: The process of generating (semi-)manual (?) annotations should be described in more detail. It is not clear what “assisted manual annotations” means, which annotation protocol the annotator followed, what expertise they had, whether it was a single annotator or several annotators, etc. This information should be added to understand the quality of this manual annotation. Generally, with the provided descriptions, it is not clear what quality the 600 annotated datasets (whether human- or algorithm-derived annotations) have, making it difficult to assess the added benefit of the VesselVerse dataset.
    • pg. 6/7: It was not fully clear to me how the validation protocol was set up, and how many image-annotation pairs were rated by how many raters. Some specific questions: did each rater assess 20 (images) * 5 (algorithms) segmentations? How reliable are the results if only around 3-4 images per dataset are assessed? Were the algorithms randomly selected? What is the difference between the “overall quality” and the “quality score”? It seems that no specific annotation “target” protocol was defined, which seems to limit the expressiveness of the quality assessment.
    • pg. 8: While Kendall’s tau is positive in most cases, the results indicate at most moderate alignment [1]. For both W and tau, it would help interpretation to add ranges for high or low agreement (an illustrative computation of both statistics is sketched after this list). From the description of the experiments, I am not sure I can follow the statement “the consistent positive correlation in MRA datasets validates that STAPLE’s consensus-based approach effectively captures the collective expertise of radiologists, validating its relevance as a consensus tool withing our framework” (minor: typo in “withing”).
    • pg. 8: I am not fully clear on the role of the second experiment. Why wasn’t the manual annotation simply included in the first set of experiments? The motivation for excluding nnUnet and A2V in the second experiment but not in the first is also not clear to me.
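
    For reference, a minimal sketch of how Kendall’s W (concordance across raters) and Kendall’s tau (rank correlation against a consensus ranking) can be computed. The rank matrix and the consensus ranking below are invented for illustration; they are not values from the paper.

    ```python
    # Illustrative computation of Kendall's W and Kendall's tau for rater rankings.
    import numpy as np
    from scipy.stats import kendalltau

    # rows = raters, columns = annotation sources, entries = ranks (1 = best)
    ranks = np.array([
        [1, 2, 3, 5, 4],
        [2, 1, 3, 4, 5],
        [1, 3, 2, 5, 4],
        [2, 1, 4, 5, 3],
    ])
    m, n = ranks.shape

    # Kendall's W (no ties): W = 12 * S / (m^2 * (n^3 - n)),
    # with S the sum of squared deviations of the per-item rank sums from their mean.
    rank_sums = ranks.sum(axis=0)
    S = ((rank_sums - rank_sums.mean()) ** 2).sum()
    W = 12 * S / (m**2 * (n**3 - n))

    # Kendall's tau between one rater's ranking and a hypothetical consensus ranking
    # (e.g., one derived from the STAPLE performance estimates).
    tau, p_value = kendalltau(ranks[0], np.array([1, 2, 3, 4, 5]))

    print(f"Kendall's W = {W:.2f}; tau(rater 1 vs consensus) = {tau:.2f} (p = {p_value:.2f})")
    ```

    W ranges from 0 to 1, with values near 1 indicating strong concordance among raters and values near 0 indicating little agreement; tau ranges from -1 to 1.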

    Additional comments:

    • Tab. 1/Fig. 1: It seems that the datasets share images, given that COSTA annotations are available for the IXI dataset. I recommend making this connection clear in the overview.
    • pg. 4: “Similarly to the approach used by SMILE-UHURA [1], VesselVerse, broadens the concept of expert annotators to encompass model-generated segmentations, thereby treating segmentations from any algorithm as expert-level annotations.” - I found this comment somewhat confusing: to the best of my understanding, SMILE-UHURA does indeed also provide results from different (semi-)automatic segmentation algorithms; however, there is still a very clearly distinguished, human-verified reference annotation for this dataset.

    [1] https://journals.lww.com/anesthesia-analgesia/Fulltext/2018/05000/Correlation_CoefficientsAppropriate_Use_and.50.aspx

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe that the framework proposed by the authors may have substantial value for the community; however, I am missing 1) information / concepts regarding the implementation of the consolidated database and how this will be managed subsequently, and 2) the value of the validation study conducted. I am not sure whether MICCAI is the right venue for this contribution as there is simply very limited space to discuss important software- and framework-related choices, and there is further limited opportunity for validation of a complex software system in such a review process.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces VesselVerse, a dataset that comprises 1,130 annotated images drawn from multiple public datasets covering various imaging modalities (e.g., TOF-MRA and CTA). In addition to presenting a rich dataset, the paper details a collaborative annotation framework that incorporates multi-expert inputs, version control, and a consensus generation mechanism based on the established STAPLE algorithm. This dual contribution addresses two main challenges in medical imaging: the scarcity of high-quality, annotated data and the inherent variability in expert annotations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Dataset Size and Diversity: VesselVerse stands out by aggregating annotations from three different public datasets and covering multiple modalities.
    2. Collaborative Annotation Framework: The authors employ a 3D Slicer extension for vessel annotation revision.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. How are the manual annotations obtained? The paper does not clearly specify the process used for manual annotation. For instance, the TopCoW dataset includes voxel-wise annotations, but it is unclear how annotations for the other public datasets, which lack such data, were generated. Was the human expert responsible for reannotating TopCoW to maintain consistency across datasets, or was a different approach used for those without existing annotations?
    2. Dataset Distribution Variance. The dataset is compiled from various public sources spanning different imaging modalities. Although it comprises a total of 1,130 images, the inherent variance in distribution presents challenges for training deep learning models.
    3. Visualization and Dice Score of Annotations: Currently, the assessment of the annotations is based solely on expert ratings, which may not be intuitive for readers. Including visual comparisons of the different annotations, along with quantitative metrics such as the Dice score, would more effectively demonstrate the quality of and consistency between the expert annotations (a minimal Dice computation is sketched after this list).
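
    For illustration, a minimal Dice computation between two binary masks; the file names are placeholders, not files from the VesselVerse release.

    ```python
    # Minimal Dice overlap between two binary vessel masks. File names are placeholders.
    import numpy as np
    import nibabel as nib

    def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
        """Dice coefficient between two binary arrays (1.0 if both are empty)."""
        a, b = mask_a.astype(bool), mask_b.astype(bool)
        denom = a.sum() + b.sum()
        return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

    expert = nib.load("expert_annotation.nii.gz").get_fdata() > 0
    auto = nib.load("algorithm_annotation.nii.gz").get_fdata() > 0
    print(f"Dice(expert, algorithm) = {dice(expert, auto):.3f}")
    ```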
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a large 3D imaging dataset with vessel annotations, featuring input from multiple experts. This contribution is valuable to the medical imaging community, particularly for advancing segmentation techniques and exploring annotation uncertainty. However, the limited evaluation of the provided annotations reduces its overall impact. Consequently, I lean toward a weak acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The work proposes both an annotated dataset with 1,130 images and a collaborative framework for vessel annotation in medical images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Data annotation remains a challenge, especially for data-driven applications in medical imaging. A framework with version control that allows tracking individual annotation contributions is extremely helpful to harmonize different annotation protocols and resolve annotation discrepancies within the same annotation protocol.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • STAPLE is a well-established algorithm, but it was proposed over 20 years ago; newer consensus algorithms are available.
    • I worry about consensus annotation biases if X annotations followed protocol A and Y annotations followed protocol B with X>Y, for example.
    • What is an expert annotator?
    • Just to confirm: in the experiment, each evaluator just rated 5 annotations?
    • Typo? expect judgment -> expert judgment
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important gap in the medical imaging literature. Though the description and methodology could be slightly better described and/or justified, I believe this paper is of interest to the MICCAI society.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their valuable feedback and very positive reception of the VesselVerse framework, and we look forward to presenting and discussing it with the MICCAI community. We appreciate this opportunity to clarify misinterpretations and correct inaccuracies. Please find our answers below:

  • Validation process and quality control (R1, R2): Submitted annotations undergo systematic checks for quality, spatial, and metadata consistency before repository commitment (Section 3.1). An expert performs validation through a workflow where contributions are committed to a staging area and then reviewed before integration. As the framework will be public, a dedicated website will provide guidelines for contributions and detailed validation criteria.
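
A sketch of the kind of spatial/metadata consistency check described above (illustrative only; this is not the VesselVerse validation code, and the file names are hypothetical):

```python
# Illustrative pre-commit checks for a submitted annotation: geometry and label
# consistency with the source image. Not the actual VesselVerse implementation.
import numpy as np
import nibabel as nib

def validate_annotation(image_path: str, annotation_path: str) -> list:
    """Return a list of problems; an empty list means the submission passes."""
    img, ann = nib.load(image_path), nib.load(annotation_path)
    problems = []
    if img.shape != ann.shape:
        problems.append(f"shape mismatch: {img.shape} vs {ann.shape}")
    if not np.allclose(img.affine, ann.affine, atol=1e-3):
        problems.append("affine (orientation/spacing) mismatch")
    labels = np.unique(np.asarray(ann.dataobj))
    if not np.isin(labels, [0, 1]).all():
        problems.append(f"annotation is not binary (labels found: {labels})")
    return problems

# A contribution would only be promoted from the staging area if this list is empty:
# issues = validate_annotation("sub-001_angio.nii.gz", "sub-001_vessels.nii.gz")
```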

  • Manual annotation methodology (R1, R3): Manual annotations followed a semi-automatic approach where Frangi filter initializations (Section 3.2) were refined by an expert using 3D Slicer. We denote the process as “assisted manual annotations” (MA) in Table 2. While we used a single expert for consistency in this initial release, our framework’s key strength is enabling multiple experts to improve existing annotations or contribute annotations following potentially different protocols.
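
As an illustration of such an initialization, a sketch of a Frangi-filter vesselness map thresholded into an initial mask that an expert could then refine in 3D Slicer; the scale range, threshold, and file names are assumptions for this sketch, not the settings used for VesselVerse.

```python
# Sketch of a Frangi-filter initialization for assisted manual annotation.
# The sigma range and the 0.05 threshold are illustrative assumptions.
import numpy as np
import nibabel as nib
from skimage.filters import frangi

mra = nib.load("sub-001_tof_mra.nii.gz")            # hypothetical TOF-MRA volume
volume = mra.get_fdata().astype(np.float32)

# Frangi vesselness enhances bright tubular structures over a range of scales.
vesselness = frangi(volume, sigmas=range(1, 4), black_ridges=False)

# Threshold the vesselness map to obtain an initial binary mask for manual refinement.
initial_mask = (vesselness > 0.05).astype(np.uint8)
nib.save(nib.Nifti1Image(initial_mask, mra.affine), "sub-001_vessels_init.nii.gz")
```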

  • Consensus generation algorithm (R2): We chose STAPLE for its proven reliability in medical imaging. However, our framework’s modular design can accommodate other consensus algorithms in potential future releases while maintaining backward compatibility. We aim for community contributions in this area.
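
For reference, a minimal sketch of STAPLE-based consensus generation using SimpleITK's STAPLE implementation; the mask file names and the 0.5 probability threshold are illustrative assumptions, not VesselVerse defaults.

```python
# Illustrative STAPLE consensus over several binary vessel masks using SimpleITK.
import SimpleITK as sitk

mask_paths = ["expert.nii.gz", "frangi.nii.gz", "nnunet.nii.gz"]  # hypothetical inputs
masks = [sitk.ReadImage(p, sitk.sitkUInt8) for p in mask_paths]

# STAPLE estimates a per-voxel probability that the foreground label is correct,
# weighting each input by its estimated sensitivity/specificity.
probability = sitk.STAPLE(masks)  # default foreground value of 1 assumed for binary masks

# Threshold the probability map to obtain the final binary consensus mask.
consensus = sitk.BinaryThreshold(probability, lowerThreshold=0.5, upperThreshold=1.0,
                                 insideValue=1, outsideValue=0)
sitk.WriteImage(consensus, "consensus.nii.gz")
```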

  • Protocol differences and dataset variance (All): VesselVerse addresses protocol variability by enabling users to select specific protocols to generate consensus masks (Section 2). Fig 1 illustrates differences between COSTA and VesselVerse labels, while experiment results (Section 4) confirm evaluators’ preferences aligned with their clinical backgrounds. We also recognize the variance across datasets and imaging modalities as both a challenge and an opportunity: such diversity enables more robust model training. Our versioning system allows researchers to track exactly which dataset versions were used in their publications (R1’s concern) while allowing each user to select their preferred dataset or annotation style, with harmonization mechanisms available when needed.

  • Evaluation methodology and terminology (All): Our validation includes diverse clinical perspectives from 4 specialists (Section 4), each assessing five annotations per image across 20 images. The camera-ready version will provide further details about the experts. This was omitted in the submission for anonymization purposes. As part of the evolving nature of the dataset and framework, we expect to perform a more extensive evaluation with larger cohorts and further metrics (R3).

  • Regarding R1’s observation about the term expert, we acknowledge that grouping human and algorithm-generated annotations as “experts” may cause confusion. Our intention was to enable a democratic computational approach to consensus generation in which all annotation sources are considered. At the same time, we highlight that the metadata associated with each annotation distinguishes between the two categories.

  • Visualization and metrics (R3): We agree that visual comparisons and additional quantitative metrics would enhance the manuscript. The page limit provides a hard constraint on this. However, we plan to include more visualization examples and expanded quantitative comparisons between different annotation sources on the project’s website. VesselVerse will evolve through community contributions, with this initial release serving as the foundation for a living dataset that improves through collective expertise.

  • Erratum: We learned of an error in the reported number of images for TopCoW. The number will be corrected in the camera-ready version.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


