Abstract

Biomedical image analysis challenges have become the de facto standard for publishing new datasets and benchmarking state-of-the-art algorithms. Most challenges use commercial cloud-based platforms, which can limit customization and involve disadvantages such as reduced data control and increased costs for extended functionality. In contrast, Do-It-Yourself (DIY) approaches can emphasize reliability, compliance, and custom features, providing a solid basis for low-cost, tailored designs in self-hosted systems. Our approach emphasizes cost efficiency, improved data sovereignty, and strong compliance with regulatory frameworks such as the GDPR. This paper presents a blueprint for DIY biomedical imaging challenges, designed to give institutions greater autonomy over their challenge infrastructure. Our approach comprehensively addresses both organizational and technical dimensions, including key user roles, data management strategies, and secure, efficient workflows. Key technical contributions include a modular, containerized infrastructure based on Docker, integration of open-source identity management, and automated solution evaluation workflows. Practical deployment guidelines are provided to facilitate implementation and operational stability. The feasibility and adaptability of the proposed framework are demonstrated through the MICCAI 2024 PhaKIR challenge, in which multiple international teams submitted and validated their solutions via our self-hosted platform. This work can serve as a baseline for future self-hosted DIY implementations, and our results encourage further studies in the area of biomedical image analysis challenges.



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2800_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/remic-othr/PhaKIR_DIY

Link to the Dataset(s)

N/A

BibTex

@InProceedings{KlaLeo_DIY_MICCAI2025,
        author = { Klausmann, Leonard and Rueckert, Tobias and Rauber, David and Maerkl, Raphaela and Yildiran, Suemeyye R. and Gutbrod, Max and Palm, Christoph},
        title = { { DIY Challenge Blueprint: From Organization to Technical Realization in Biomedical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {88 -- 98}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose software and a setup that allow challenge organizers to host challenges on their own hardware using an open-source toolkit described in their contribution. They briefly discuss the results of a limited user questionnaire collected after using the tool themselves for a challenge last year.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is noteworthy in its deviation from a typical research paper. Instead, it opens up for discussion a subject that is implicitly agreed to be a cornerstone of today’s open and collaborative problem-solving in medical imaging: the competition for the best solution to a current clinical or technical problem. Since a variety of tools for the same purpose exist, it cannot be stated that the proposed solution is different from, or even more novel or better than, any of those; yet it is open source and apparently released with no commercial interest.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The idea is not novel, but this is of a lesser concern as it is made up for by the fact of the public release.

    Also, this is not a research publication per se. It lacks a hypothesis, validation criteria, and a formal test. Still, this is also not my main criticism, since this is somewhat expected (I would not know how to propose a proper scientific evaluation, or even how to define testing criteria, let alone how to “double-blind test” this tool against another). However, there are several points that I would still like to point out.

    1. The authors state that the tool fulfilled all criteria that have been defined, apparently in an attempt to provide some validation-like statement. However, this is of course a tautology as they have defined the criteria themselves.
    2. I think it would be easy to define further stakeholders beyond the groups defined in the publication, and their needs would not be fulfilled.
    3. I don’t think that it is a good idea to solve both authentication and authorization in the tool. Instead, it should offer a way to defer authentication to trusted apps/organisations to allow integration into backend systems that are presumably available at the major players that might consider using such a self-hosting system.
    4. Integration with large challenge platforms should also be addressed. One of the biggest values of the established platforms is that they serve as a book-keeping ledger about challenges past and present, and go-to information hubs about solutions. Perhaps this is already solved; would be good to read a comment about this.
    5. I’m missing a comment on requirements and solutions around the scheduling of Docker-based jobs. How is this handled? How are DoS attacks prevented, and how is it ensured that participants receive sufficient compute? At least some numbers from their own challenge would be needed to get an idea: how many participants has it seen, how large was the dataset, how many jobs were run…? I seek to understand this better, but there is really not enough information here to get the picture.
    6. Same for the administrative overhead: which expertise was specifically needed, and how many person-months were required to sustain the operation of the tool? Unfortunately, there is almost no information at all about the IT side of things beyond merely enumerating the central open-source tools employed for the tasks.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this is definitely not a standard submission to MICCAI and neither proposes something novel nor even something strictly scientific, I do think that there is enough value in the fact itself of proposing a challenge DIY toolkit. My major criticism is the very sparse information. From the fact that the tool has apparently already been used, and successfully so, I hope that this can be remedied. For this submission, I strongly favor a poster presentation over an oral presentation; if possible, the tool should be demonstrated, too.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript proposes a DIY framework for self-hosted biomedical image analysis challenges, emphasizing data sovereignty and compliance. It outlines organizational workflows, technical components (containerization, IAM, automated evaluation), and validates the approach through a MICCAI 2024 case study. The modular architecture integrates open-source tools like Docker, CVAT, and MinIO, demonstrating feasibility while reducing reliance on commercial platforms. Advantages include customization, cost efficiency, and granular control over sensitive medical data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This manuscript has the following advantages:

    1. The manuscript enables institutions to maintain ownership and compliance with GDPR/HIPAA.
    2. It eliminates commercial platform fees through open-source tooling.
    3. It supports custom workflows, including tailored evaluation metrics and submission processes.
    4. It enables on-premises hosting, which minimizes third-party data exposure risks.
    5. Its Git-based versioning and containerization standardize algorithm comparisons.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. What specific criteria governed the curation of the challenge dataset (size, modalities, geographic diversity)? How were potential biases in medical imaging equipment or patient demographics addressed?
    2. The SFTP-based data distribution lacks details about access latency for large 3D volumes (e.g., CT or MRI scans). Was performance benchmarked against cloud alternatives?
    3. How was inter-annotator agreement measured for CVAT-based labels, given known variability in medical image interpretation?
    4. The evaluation engine uses Python scripts for metric computation. How does this approach handle non-deterministic AI model outputs or hardware-induced variability?
    5. For Docker containerized submissions, what safeguards prevent adversarial attacks exploiting Docker vulnerabilities?
    6. The paper advocates OAuth 2.0 via Authentik but does not specify how token-hijacking risks were mitigated in a medical data context.
    7. Can the authors comment on a threat analysis for this project, covering threats of data and models being used without correct access?
    8. How did the automated evaluation’s throughput (submissions/hour) compare to commercial platforms like Grand Challenge?
    9. Why weren’t emerging standards like MLFlow or DVC integrated for experiment tracking, given their adoption in reproducible AI research?
    10. For multi-modal submissions (e.g., combined imaging + EHR data), does the architecture support heterogeneous data pipelines?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    a. The manuscript enables institutions to maintain ownership and compliance with GDPR/HIPAA. b. As mentioned above, there are a few points/questions that need to be addressed by the authors for a strong acceptance of the manuscript.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors provide a ‘blueprint’ for hosting a biomedical image analysis challenge via a ‘do-it-yourself’ approach rather than relying on a cloud provider like grand-challenge.org or Synapse. A case study is presented in which a challenge follows this blueprint, thereby demonstrating feasibility.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is an important area of study for our community. Challenges are becoming more important each year, and there are limited existing options for hosting a challenge which often requires substantial funding and offers limited flexibility. It is in this community’s interest for viable DIY options to be available for challenge hosting, and this article does a good job laying out how this can be accomplished.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the paper does a good job covering many of the important topics in DIY challenge hosting, there is simply not enough space for each item to be described in a high degree of detail. Therefore, a reader who wishes to follow this blueprint will need to spend considerable time reading the material which this paper references.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper serves an important purpose to illustrate to any potential challenge organizer that self-hosted options are available outside of the major cloud providers. This won’t make sense for many organizers who might lack the expertise, but for those who wish to pursue highly innovative challenges using features that are not yet supported by existing platforms, this paper would be highly useful.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their feedback. Our response is organized into four topics summarizing key reviewer questions (R).

1. Validation and Evaluation – R 2: Is defining your own criteria tautological? – R 4: What dataset curation criteria and bias mitigation were used? – R 1+4: How was annotation quality ensured? – R 2: How do you guarantee reproducible, deterministic results and what is your evaluation throughput compared to cloud platforms?

We acknowledge that our self-defined criteria may appear tautological. They served as design goals to demonstrate feasibility within the case study, not for formal validation. We will clarify this in the discussion.

Dedicated dataset and challenge papers will detail the case study. The dataset spans multiple imaging modalities and centers; the final version will reference these papers with exact numbers and selection criteria.

Ground truth was generated via a three-stage consensus: one annotator followed by two sequential expert reviewers to ensure consistency. We will clarify this in the manuscript.

Our evaluation engine supports deterministic inference; for stochastic models, fixed random seeds must be used. Though we have not benchmarked against commercial platforms, our modular pipeline supports horizontal scaling to meet any throughput target without architectural changes.
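
As an illustration of the fixed-seed requirement, the following is a minimal sketch, assuming PyTorch-based submissions (the function name and default seed are placeholders, not part of the published pipeline), of pinning the common sources of randomness before inference:

import os
import random

import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    # Must be set before the first cuBLAS call for deterministic GEMM kernels.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Pin all common sources of randomness so repeated evaluation runs agree.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Fail loudly on non-deterministic CUDA kernels instead of silently diverging.
    torch.use_deterministic_algorithms(True)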

2. Security and Access Control – R 2+4: Why handle both authentication and authorization in-tool rather than defer to institutional SSO? – R 4: How did you mitigate token-hijacking risks and perform threat analysis? – R 2: What safeguards prevent Docker-based attacks and DoS?

Authentik is our IAM. In standalone mode, it delivers integrated AuthN + AuthZ. When an institutional OIDC/SAML IdP exists, we can switch to proxy mode: the external IdP keeps AuthN and user records, while Authentik enforces RBAC and context-aware AuthZ.
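
To make the deferred-AuthN setup concrete, here is a hedged sketch of how a backend service could validate bearer tokens issued by the IdP using PyJWT; the JWKS URL and audience are hypothetical placeholders, not values from the paper:

import jwt
from jwt import PyJWKClient

# Hypothetical issuer endpoint and client ID; replace with the deployed IdP's values.
JWKS_URL = "https://auth.example.org/application/o/challenge/jwks/"
AUDIENCE = "challenge-portal"

def verify_token(token: str) -> dict:
    # Fetch the signing key matching the token's key ID from the IdP's JWKS endpoint.
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    # Verify the signature, reject expired tokens and tokens issued for another client.
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        options={"require": ["exp", "iat"]},
    )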

Docker submissions run in isolated containers with resource quotas and without network access (offline) to mitigate malicious code and DoS. Authentik supports short-lived tokens with rotation, PKCE, SameSite cookies, scope limitation, and token revocation. We will add a “Threat Considerations” paragraph. For ultra-sensitive data, a federated learning approach avoids exporting raw data.
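
As a rough sketch of such isolation, using the Docker Python SDK (the image name, limits, and mount paths are placeholders, not the exact configuration used in the challenge):

import docker

client = docker.from_env()

# Launch a participant submission with no network access and hard resource caps.
container = client.containers.run(
    image="teamXY/submission:latest",      # placeholder submission image
    network_mode="none",                   # offline: no outbound connections
    mem_limit="16g",                       # cap RAM usage
    nano_cpus=4_000_000_000,               # limit to ~4 CPU cores
    pids_limit=256,                        # curb fork bombs
    volumes={"/data/test": {"bind": "/input", "mode": "ro"}},  # read-only test data
    detach=True,
)
result = container.wait()                  # returns e.g. {"StatusCode": 0}
logs = container.logs().decode()           # collect stdout/stderr for the organizers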

3. Integration and Extensibility – R 2: Can the system support additional stakeholder roles? – R 2+4: Is integration with established challenge directories possible? – R 4: Why not include experiment tracking (e.g. MLflow) or multi-modal pipelines?

Our role-based IAM is extensible and supports additional stakeholder roles (e.g., data curators). New roles with custom permissions can be configured without architectural changes. We will highlight this in the manuscript.

While full integration with public challenge directories is not yet implemented, organizers can list their challenges there and link back to their DIY site.

Evaluation can indeed be seen as a series of experiments and is compatible with SOTA tracking tools; we will mention this. Submission containers support arbitrary file pointers, enabling heterogeneous, multi-modal pipelines without system changes.

4. Deployment and Maintenance Effort – R 4: How is large-scale data distribution handled (latency, throughput)? – R 2+4: What are the person-month and expertise requirements for setup and ongoing administration? – R 2: How are resource quotas and submission scheduling managed to prevent overload?

We distribute data via SFTP at ~1–2 Gbit/s, sufficient for up to ~200 GB; for >1 TB or time-critical transfers, we recommend S3-compatible multipart downloads (>40 Gbit/s). To prevent overload, we enforce per-team submission limits and container quotas. As participants train locally, there is no continuous data streaming; organizers run inference only.
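
For the S3-compatible path, the following is a minimal sketch of a parallel multipart download with boto3 against an S3-compatible store such as MinIO; the endpoint, credentials, bucket, and object names are placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

# S3-compatible endpoint (e.g., MinIO); all identifiers below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.org",
    aws_access_key_id="CHALLENGE_KEY",
    aws_secret_access_key="CHALLENGE_SECRET",
)

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MiB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MiB parts
    max_concurrency=16,                    # parallel part downloads
)

s3.download_file("challenge-data", "train/videos.tar", "videos.tar", Config=config)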

Initial setup requires moderate DevOps skills and may take up to 2 weeks; ongoing maintenance averages a few hours per week. Deployment guides in our repository help reduce the learning curve for new organizers.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


