Abstract

Medical imaging datasets often contain heterogeneous biases, ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks, yet identifying and characterizing them is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework in which a hyper-network predicts UNet parameters from latent vectors representing both image and annotation variability. The hyper-network parameters and the collection of latent vectors, one per training sample, are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters in which low-density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and then TotalSegmentator, a large-scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset, allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters that perform the segmentation task in accordance with the underlying “learned” systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
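
To make the setup concrete, below is a minimal PyTorch-style sketch, not the authors' implementation, of the core idea: a small hyper-network maps a per-sample 2-D latent vector to the parameters of a segmentation network, which is then applied to the image. A single 3D convolution stands in for the full UNet, and all module names and sizes are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): a hyper-network maps a 2-D
# per-sample latent vector to the weights of a tiny segmentation head. In the
# paper the generated network is a full UNet; here one 3D conv layer stands in
# for it to keep the example short. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    def __init__(self, latent_dim=2, in_ch=1, out_ch=2, k=3):
        super().__init__()
        self.out_ch = out_ch
        self.shape = (out_ch, in_ch, k, k, k)       # weight shape of the generated conv
        n_weights = out_ch * in_ch * k ** 3
        self.mlp = nn.Sequential(                   # hyper-network: latent vector -> parameters
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_weights + out_ch),      # conv weights + biases
        )

    def forward(self, image, latent):
        params = self.mlp(latent)                   # predict all parameters from lambda
        w = params[:-self.out_ch].view(self.shape)
        b = params[-self.out_ch:]
        return F.conv3d(image, w, b, padding=1)     # apply the generated network to the image

hyper = HyperNet()
image = torch.randn(1, 1, 32, 32, 32)              # dummy 3D CT patch
lam = torch.zeros(2)                                # latent vector lambda ("default style")
logits = hyper(image, lam)                          # segmentation logits under that style
```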

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2332_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ImFusionGmbH/HyperSORT

Link to the Dataset(s)

N/A

BibTex

@InProceedings{JouSam_HyperSORT_MICCAI2025,
        author = { Joutard, Samuel and Stollenga, Marijn and Balle Sanchez, Marc and Azampour, Mohammad Farid and Prevost, Raphael},
        title = { { HyperSORT: Self-Organising Robust Training with hyper-networks } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15972},
        month = {September},
        pages = {275--285}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The long-tail effect and erroneous annotations have long been significant challenges hindering the advancement of medical image analysis (MIA) segmentation. The authors constructed this scenario through synthetically mislabeled data and validated it on 3D abdominal multi-organ CT scans, while also presenting results on TotalSegmentator without quantitative evaluation. A hypernetwork framework was developed to generate the parameters of the task network, thereby mitigating the impact of labeling inaccuracies. This mechanism introduces a per-sample variable encoding annotation preference (denoted lambda), which is decoded through an MLP to produce the task network parameters used at inference. During training, the latent space was regularized to characterize annotation preferences by jointly optimizing the segmentation loss and an L1 regularization on lambda. This regularization guides the task model to adopt distinct segmentation strategies for data with varying annotation preferences, effectively alleviating the performance degradation caused by segmentation biases. Experimental validation was conducted using synthetic data from AMOS liver CT segmentation and a dataset (TotalSegmentator) that, according to the authors, lacks accurate annotations. A well-trained model was employed to evaluate the TS dataset, with subsequent analysis revealing inherent limitations in its annotation quality.
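
    As summarized above, training jointly optimizes the hyper-network weights and one latent vector lambda per training sample under a segmentation loss plus an L1 penalty on lambda. The following hedged sketch illustrates such a loop; it reuses the toy HyperNet from the sketch after the abstract, and the optimizer, loss weight, and tensor shapes are assumptions rather than the paper's settings.

```python
# Hedged sketch of the joint optimization described above: per-sample latent
# vectors live in an embedding table and are trained together with the
# hyper-network under segmentation loss + L1 regularization of the latents.
# Hyper-parameter values below are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
# HyperNet: the toy hyper-network defined in the sketch after the abstract.

n_samples, latent_dim = 100, 2
latents = nn.Embedding(n_samples, latent_dim)        # one lambda per training case
nn.init.zeros_(latents.weight)
hyper = HyperNet(latent_dim=latent_dim)
opt = torch.optim.Adam(
    list(hyper.parameters()) + list(latents.parameters()), lr=1e-3
)
l1_weight = 0.01                                      # strength of the L1 prior on lambda (assumed)

def train_step(idx, image, target):
    lam = latents(idx).squeeze(0)                     # latent code of this training sample
    logits = hyper(image, lam)                        # parameters predicted from lambda, then applied
    seg_loss = F.cross_entropy(logits, target)        # fit the (possibly biased) annotation
    reg = lam.abs().sum()                             # L1 keeps "standard" samples near lambda = 0
    loss = seg_loss + l1_weight * reg
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# one toy iteration
train_step(torch.tensor([0]),
           torch.randn(1, 1, 32, 32, 32),
           torch.randint(0, 2, (1, 32, 32, 32)))
```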

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The HyperNet framework demonstrates its potential as a novel training paradigm, substantiated by its systematic implementation in this work.

    2. This study establishes HyperNet’s capability in addressing annotation discrepancy characterization tasks, particularly in disentangling annotation style variations within mislabeled datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited Innovation. This work exhibits substantial methodological gaps compared to existing hypernetwork research. It neither introduces technical/methodological enhancements to conventional hypernet frameworks nor systematically validates performance alterations of training strategies in MIA segmentation contexts. Crucially, the absence of error correction mechanisms for classified annotation biases (e.g., morphological dilation/erosion miscalibrations) to optimize segmentation outcomes further diminishes its scientific contribution.

    2. Incomplete Experimental Validation. The experimental protocol fails to conduct comparative analyses with established annotation error mitigation approaches (e.g., meta-learning frameworks, teacher-student paradigms, uncertainty-aware learning, or explicit error characterization methods). Despite extensive prior work in annotation discrepancy modeling, the authors neglect benchmarking against state-of-the-art techniques under identical synthetic data conditions. A rigorous evaluation should quantitatively compare the proposed method’s capacity in characterizing annotation style variations against contemporary approaches.

    3. Misaligned Experimental Design. The synthetic error simulation—uniform 15% dilation/erosion across entire volumes—constitutes an oversimplified approximation of real-world annotation variances. Clinical annotations typically exhibit localized morphological inconsistencies and radiologist-specific systematic biases (e.g., selective exclusion of tumor-adjacent parenchymal distortions). The claimed capability to model diverse annotation styles remains unsubstantiated, as neither regional error patterns nor clinically observed systematic biases (as referenced in the introduction) were experimentally addressed.

    4. Methodological Flaws in Validation. The experimental design deviates from standard protocols for annotation error resilience evaluation. Conventional methodology requires training on corrupted annotations while testing against clean ground truth to isolate error mitigation efficacy. The authors’ omission of this critical validation pipeline, coupled with the absence of comparative segmentation performance metrics (Dice, HD95, etc.), fundamentally undermines claims of methodological innovation.

    5. Unsubstantiated Analytical Claims. The extensive critique of TotalSegmentator annotation quality lacks empirical validation through gold-standard verification. A scientifically sound approach would involve collaborative re-annotation of TS subsets by domain experts to establish partial ground truth, enabling quantitative evaluation of annotation discrepancy characterization accuracy. Current assertions about TS dataset limitations remain conjectural due to insufficient statistical evidence.

    6. Structural and Expository Deficiencies. The manuscript suffers from methodological opacity, particularly regarding hypernetwork optimization dynamics and latent space regularization mechanisms. The experimental section fails to substantiate core innovation claims through controlled ablation studies or interpretable visualizations of annotation style disentanglement. This expository ambiguity, combined with disjointed narrative logic, severely impedes technical reproducibility and scholarly impact assessment.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Absence of Methodological Novelty. The proposed framework demonstrates limited innovation given the proliferation of existing annotation error correction techniques (e.g., meta-learning architectures, uncertainty-aware learning paradigms, and teacher-student distillation) and prior hypernetwork applications in MIA tasks. While hypernetworks for medical image analysis have been preliminarily explored, this work introduces no novel mechanisms—whether architectural modifications, regularization strategies, or error-type-specific optimization protocols—to advance beyond established methodologies.

    2. Critical Experimental Deficiencies. The experimental design fails to provide quantitative comparisons with baseline methods in annotation bias mitigation or segmentation performance metrics (e.g., Dice coefficients, Hausdorff distances). Notably, even basic segmentation results on synthetic datasets—essential for validating the framework’s ability to counteract simulated annotation errors—are conspicuously absent. This omission precludes objective assessment of whether the method achieves its claimed functionality of annotation-style-adaptive segmentation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal emphasizes their primary contribution as providing semi-automated analysis of the TS dataset. While I do not dispute the utility of this effort (which may align more appropriately with Computer-Assisted Intervention [CAI] than Medical Image Computing [MIC] paradigms), its scholarly impact predominantly constitutes a dataset-level enhancement rather than methodological innovation. Under CAI evaluation criteria, the initial Reject (2) rating appears disproportionate; I propose revising the decision to Borderline Reject (3) due to the following technical considerations:

    1. Although the authors assert their main contribution lies in TS dataset analysis, the manuscript fails to report performance improvements from the refined annotations. A methodologically rigorous demonstration would require: (a) retraining baseline models (e.g., U-Net) using the enhanced TS annotations, and (b) quantitative comparisons of output consistency (e.g., inter-annotation Dice coefficients) or confidence calibration (e.g., uncertainty maps) between the original and refined datasets.

    2. I formally retract the prior critique regarding methodological novelty, acknowledging my erroneous citation of unpublished works in hypernetwork applications. While accepting the authors’ counterarguments on innovation claims, this concession does not substantiate the work’s standalone scholarly merit. Regrettably, conference management systems preclude score modification post-rebuttal, necessitating this revised assessment.



Review #2

  • Please describe the contribution of the paper

    This paper introduces HyperSORT, a framework using the recent concept of “hyper-networks” to model the “image annotation process” (or rather the image annotation quality) with a UNet backbone segmentation architecture conditioned on a latent vector that represents both the image and annotation variability (the annotation-conditioning hidden variable). The key contribution is the unsupervised learning of this latent variable.

    During training, HyperSORT jointly learns the parameters of the hyper-network and the empirical distribution of the annotation conditioning hidden variable.

    They validate the proposed method on two 3D abdominal CT public datasets: (1) synthetically perturbed annotations of the AMOS dataset, and (2) TotalSegmentator, using its two versions, V2 being a refined version of V1 (hence some “poor” V1 annotations are updated in V2).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Writing: Great effort in Fig. 1, with diagrams that make the complex concepts clear. Great effort in linking mathematical notations and general concepts (e.g., Modelization of the Labeling Process).

    Method: Very interesting use of the pioneering notion of “hyper-networks”; creative thinking.

    Results: Very diverse and clearly commented

    Great efforts to comment on potential use-cases of the proposed method. For example: “UNet parameters from preferred clusters can be used to correct erroneous annotations from the training set, making of HyperSORT a particularly convenient tool for bootstrapping scenarios.”

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Writing: The abstract could be much clearer. Conclusion provides different views of the aims compared to the abstract: “to finely stratify the training set and help identifying both erroneous cases and systematic biases while producing performing robustly trained networks.”

    Some key phrases remain not specific enough. Example: “The new paradigm introduced here instead leverages hyper-networks by learning and discovering relevant implicit conditioning within the training set.” => for which training task, to do what?

    Methods: “We used 2-dimensional latent vectors to facilitate the visualization and analysis of our results.”: this seems quite strange. Ease of visualisation should not drive the size of a latent vector.

    Training details: “Additional details can be found in our public repository”: it is not usual to read such statement when using such recent concept, knowing that details can be critical to understand sensitivity to important hyper-parameters or design choices.

    Results:

    Additional comments: Modelization -> Modeling; Eccentric -> off-center

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clarify aims and targeted use-case(s) + some design choices + training details

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Post rebuttal comments:

    • Use of 2-D latent space: Q: “We used 2-dimensional latent vectors to facilitate the visualization and analysis of our results.” => A: “Although this should indeed be further investigated, we actually find it interesting that 2 dimensions are already sufficient to capture various annotation styles.” I would support this as an interesting finding to focus on if accepted.
    • Q: Training details: “Additional details can be found in our public repository”: it is not usual to read such statement when using such recent concept, knowing that details can be critical to understand sensitivity to important hyper-parameters or design choices. => A: Greatly appreciated that the authors shared their code via the rebuttal.
    • Use of “TotalSegmentator (TS)”: I believe we are in a transition period with respect to such large-scale automated segmentation solutions, so it is interesting for MICCAI to challenge the bias/quality of such tools in the hands of clinicians.



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel method ‘HyperSORT’ for training image segmentation networks on datasets with noisy/less reliable labels. Unlike traditional methods that learn a mapping between the input scan and the label, HyperSORT models the annotation process, preventing the model from overfitting to the training set’s noise. This is achieved by training a hypernetwork to estimate the segmentation network parameters conditioned on a latent variable that encodes the ‘annotation style’ of each training sample. The authors demonstrate both qualitatively and quantitatively, on a synthetic (AMOS) and a real-world (TotalSeg) dataset, the method’s ability to (1) model the different annotation styles present in the training data and (2) detect systematic errors in the annotation process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strengths of the paper are outlined below:

    • The authors introduce a novel formulation of the segmentation task by modeling the annotation process as a function, rather than overfitting to the potentially noisy training labels.
    • In contrast to prior work that employs hypernetworks conditioned on explicit variables (e.g., voxel spacing), this study proposes using a learnable latent variable to condition the hypernetwork. This latent variable implicitly captures variability in both the images and annotations within the training set.
    • The method is validated in both a controlled synthetic environment and a real-world scenario on the task of liver segmentation from CT scans, demonstrating its effectiveness in modeling annotation styles and detecting systematic labeling errors.
    • The authors evaluate the method using an external test set (CT-1K) with no overlap with the training data (TotalSeg), highlighting the approach’s strong generalizability.
    • The paper includes comprehensive qualitative and quantitative evaluations, and compares the proposed approach to a recent quality control method (Quality Sentinel), further supporting its claims.
    • Limitations of the method are clearly acknowledged, including its inability to distinguish between hard cases and erroneous annotations.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper could be strengthened by addressing the following points:

    • An ablation study would be valuable to assess the impact of key design choices, such as the size of the hypernetwork and the dimensionality of the latent variable.
    • Although the primary focus is on detecting annotation errors and style variations, including a strong segmentation baseline (e.g., nnU-Net) would provide a useful reference point for evaluating segmentation performance.
    • Liver segmentation, while a relevant example, is relatively straightforward compared to more challenging tasks like soft-tissue organ segmentation (e.g., intestines), where the variability in the organ shapes across patients is higher. Discussing the expected performance of the method in such complex scenarios would enhance the paper’s scope and applicability.
    • Further clarification is needed on how clusters are identified in the latent space—specifically, whether a clustering algorithm was used, and if so, which one.
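
    As a purely hypothetical illustration of how clusters could be extracted from a learned 2-D latent space (the paper's actual clustering procedure is not specified; the algorithm choice and number of clusters below are assumptions):

```python
# Hypothetical illustration only: the paper does not specify its clustering
# procedure. Here, k-means groups the learned 2-D latent vectors; each cluster
# centroid could then be fed to the hyper-network as a "representative style".
import numpy as np
from sklearn.cluster import KMeans

latent_vectors = np.random.randn(100, 2)          # stand-in for the learned per-sample latents
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(latent_vectors)
labels = kmeans.labels_                            # cluster assignment of each training sample
centroids = kmeans.cluster_centers_                # candidate lambda values for each annotation style
```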
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel method for a relevant medical imaging task which is the training of robust segmentation networks in presence of sub-optimal labels. The authors rigorously evaluated the proposed method on two datasets including a controlled and a real-world setting and compared it to a relevant quality control method which showed its potential in addressing the task in question. While there are areas for further explorations (such as ablation studies and broader applicability to more complex segmentation tasks) the method shows strong potential and addresses an important challenge in the field. Therefore, I recommend this paper for acceptance.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed most of my concerns regarding adding a reference performance with a strong baseline and discussion about the performance for more challenging tasks. Therefore, I recommend to accept the paper.




Author Feedback

We thank our 3 reviewers for their time and valuable feedback. We would like to reaffirm our belief that the core scientific contributions of our work are of interest to the MICCAI community. For brevity, we reference reviewer comments as RX.Weakness/Strength.N (e.g., R2.W.3).

CONTRIBUTIONS:

Our main contribution is the joint training of the hyper-networks (HN) and latent vectors to identify unknown annotation biases. This is, as far as we know, novel (R1.S “creative thinking”, R3.S.1-2) and significantly different from existing uses of HNs and dataset QC methods (see Section 2). Thus, in the absence of concrete references to similar published work in R2.W.1, we respectfully contest R2’s point on the novelty of HyperSORT (HS).

On the “lack of solution to correct biases” mentioned in R2.W.1, we refer to the end of Section 3 or Section 4.1, where we suggest the use of \lambda = 0 (or any fixed value) as a corrective method (see Fig 2 and 3).
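
For illustration, a hedged sketch of this corrective use of a fixed latent value follows; it reuses the toy HyperNet and tensors from the earlier sketches, and the choice of \lambda = 0 as the preferred style simply follows the suggestion above.

```python
# Hedged sketch of the corrective use of a fixed latent value mentioned above:
# a chosen lambda (e.g. 0, or a preferred cluster centroid) is plugged into the
# hyper-network to re-segment training cases and produce corrected pseudo-labels.
# Reuses the toy HyperNet ('hyper') and volume ('image') from earlier sketches.
import torch

lam_star = torch.zeros(2)                  # fixed "preferred"/default annotation style
with torch.no_grad():
    logits = hyper(image, lam_star)        # segment the training case with the chosen style
    pseudo_label = logits.argmax(dim=1)    # candidate corrected annotation
```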

EXPERIMENTS:

R2.W.3: The synthetic experiment (“proof-of-concept”) is a sanity check showing that HS works in a controlled environment, with known main biases. Our main experimental focus is on TotalSegmentator (TS), a large dataset containing “real” unknown biases and errors. As TS is one of the most used public datasets, we believe that our contribution will positively impact the MICCAI community. Hence, we respectfully refute the lack of realistic evaluation in R2.W.3.

We acknowledge the relevance of the approaches mentioned in R2.W.4; several of them were already introduced in Section 2, and we have added a reference to W. Dong, Front. Comput. Sci. 2025. Yet, HS’s objective is orthogonal to these works, which aim at producing a single model robust to annotation noise. Instead, HS characterizes annotation biases at the data-sample level and is able to generate multiple models. This point is essential to understand the soundness of our evaluation protocol (described as a strength by R1 and R3.S.3-5, but questioned by R2). We will clarify this throughout the manuscript, in particular in the abstract (R1.W).

Our evaluation on CT1k, a large independent dataset, validates the generalization/performance of UNet parameters produced by HS (R3.S.4). Our early experiments with nnU-Net did not show any performance gap (R3), and using a robust training loss [8] also did not yield a noticeable performance boost. We showed, however, that HS can identify several groups and predict the corresponding UNet parameters for each.

We were also surprised by R2’s comment on the lack of “interpretable visualizations of annotation style disentanglement” (R2.W.6), since Figure 2 illustrates that the different clusters correspond to the expected synthetically injected biases. Similarly, Figure 3 illustrates a systematic leakage bias visible in multiple cases from the red cluster.

The “scientifically sound approach [...] involv[ing] collaborative re-annotation of TS subsets” requested by R2 already corresponds to the TS-V2 annotations that we use. These corrections were used as (pseudo-)ground truth to validate the error detection capabilities of our method. We also note that our validation is on par with existing work (e.g., J. Fournel et al., MedIA 2021, [6]).

REPRODUCIBILITY:

R1: We added additional details on HS’s architecture and anonymously released our code (https://anonymous.4open.science/r/HyperSORT-E2E2), allowing reviewers to assess its clarity and completeness.

MISC:

R1: We chose a fixed latent space dimension of 2 for the sake of simplicity and visualization. Although this should indeed be further investigated, we actually find it interesting that 2 dimensions are already sufficient to capture various annotation styles.

A point of discussion on the applicability of HS to more complex structures has been added to Section 4.3 (R3).

We are grateful for the reviewers’ feedback which has strengthened our manuscript. We believe these revisions address all concerns and reaffirm the novelty and validity of our contributions.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces HyperSORT, a novel framework based on hypernetworks and latent space clustering for identifying and characterising annotation biases in large-scale medical segmentation datasets. Reviewers 1 and 3 support acceptance, highlighting the originality of the approach, the clear visualisation of bias patterns, and the potential impact of this work on quality control and refinement of datasets like TotalSegmentator. Reviewer 2 raised concerns regarding the paper’s methodological alignment and its limited scope in demonstrating downstream improvements. However, the rebuttal addressed these points, including the availability of code, validation on an independent dataset, and clarification of the novelty and experimental design.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewers had mixed opinions, although all of them found merit in the presented article. I also believe that the main idea has potential and could be of interest to the MICCAI community. However, as pointed out by Reviewers 2 and 3, the quantitative evaluation is not entirely convincing, and I thus believe that the article cannot be published in its current form. In particular:

    • The L1 loss is introduced to regularize the estimate of \lambda (i.e., to avoid infinite solutions), but it remains unclear whether, in cases of imbalanced ‘annotation styles’ or small segmentation errors, this strategy is sufficient to correctly capture all styles or errors.

    • There is no proper analysis of the dimensionality of \lambda or of the architecture of the hyper-network, which seems critical to the proposed approach.

    • There is no comparison with a simple baseline, such as a standard U-Net without a hyper-network. Would the latent space of the original U-Net also capture information about annotation styles and errors, as the \lambda space is intended to? This is fundamental to validating the article’s claims.

    • Finally, an important missing experiment is to show whether setting \lambda=0 allows one to correct or modify a segmentation to reflect the most common annotation style, as suggested by the authors.


