Abstract

Quantitative performance metrics play a pivotal role in medical imaging by offering critical insights into method performance and facilitating objective method comparison. Recently, platforms providing recommendations for metrics selection as well as resources for evaluating methods through computational challenges and online benchmarking have emerged, with an inherent assumption that metrics implementations are consistent across studies and equivalent throughout the community. In this study, we question this assumption by reviewing five different open-source implementations for computing the Hausdorff distance (HD), a boundary-based metric commonly used for assessing the performance of semantic segmentation. Despite sharing a single generally accepted mathematical definition, our experiments reveal notable systematic differences in the HD and its 95th percentile variant across implementations when applied to clinical segmentations with varying voxel sizes, which fundamentally impacts and constrains the ability to objectively compare results across different studies. Our findings should encourage the medical imaging community towards standardizing the implementation of the HD computation, so as to foster objective, reproducible and consistent comparisons when reporting performance results.
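To make the source of such discrepancies concrete, here is a minimal sketch (written for this page, not taken from any of the compared libraries) of a voxel-based HD/HD95 computation; the function names surface_points and hd_and_hd95 are our own, and the sketch deliberately exposes the choices the mathematical definition leaves open: how the boundary is extracted, which structuring element is used for erosion, how voxel spacing is handled, and over which set of distances the 95th percentile is taken.

```python
# Minimal sketch (not the code of any of the compared libraries) of a voxel-based
# HD / HD95 computation from 3-D binary masks. It highlights choices that the
# mathematical definition leaves open: boundary extraction, the structuring
# element used for erosion, voxel spacing, and the percentile convention.
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree

def surface_points(mask, spacing):
    """Boundary voxels (mask minus its erosion), scaled to physical units."""
    mask = np.asarray(mask, dtype=bool)
    structure = ndimage.generate_binary_structure(3, 1)  # 6-connectivity; itself a choice
    boundary = mask & ~ndimage.binary_erosion(mask, structure=structure)
    return np.argwhere(boundary) * np.asarray(spacing, dtype=float)

def hd_and_hd95(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """HD and two plausible HD95 variants; assumes both masks are non-empty."""
    pts_a = surface_points(mask_a, spacing)
    pts_b = surface_points(mask_b, spacing)
    d_ab = cKDTree(pts_b).query(pts_a)[0]  # distances from A's boundary to B's boundary
    d_ba = cKDTree(pts_a).query(pts_b)[0]  # distances from B's boundary to A's boundary
    hd = max(d_ab.max(), d_ba.max())
    # Two percentile conventions found in practice, which can give different values:
    hd95_pooled = np.percentile(np.concatenate([d_ab, d_ba]), 95)
    hd95_per_direction = max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
    return hd, hd95_pooled, hd95_per_direction
```

Swapping any of these choices, e.g. the pooled versus the per-direction percentile, can change the reported HD95 for the same pair of masks.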

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2469_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

https://zenodo.org/records/7442914

BibTex

@InProceedings{Pod_HDilemma_MICCAI2024,
        author = { Podobnik, Gašper and Vrtovec, Tomaž},
        title = { { HDilemma: Are Open-Source Hausdorff Distance Implementations Equivalent? } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper evaluated and compared five open-source implementations of the Hausdorff distance (HD) metric. Furthermore, the authors demonstrate statistically significant over- or under-estimation of the computed HD and suggest choosing the implementation carefully.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Metrics are crucial for benchmarking algorithms. This work identifies the complexity and variability in HD computation, especially in 3D images, where surface elements should be taken into account.
    2. The authors examined the computation process across the implementations and compared the two main components: boundary calculation and p-th percentile HD calculation.
    3. Experiments across the different HD implementations provided insightful results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The validation could be further strengthened: the experiments contain only one dataset, and most of the target structures have relatively compact shapes.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Most of the evaluated organs have a compact shape. It would be interesting to compare these HD implementations on tubular structures (e.g., vessels). Here is a readily available dataset: https://figshare.com/articles/dataset/Aortic_Vessel_Tree_AVT_CTA_Datasets_and_Segmentations/14806362
    2. Do other distance-based metrics (e.g., ASD, NSD) have similar issues? Please comment on this point.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The HD is a common metric in the segmentation community. This work shows the variance across different implementations, which is an important step towards standardizing metrics.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors compare multiple open-source implementations of the Hausdorff distance and the impact of the differences in their choice of surface definition and percentile definition. The use case chosen is radiotherapy planning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very strong motivation for a reasonable question regarding a commonly employed evaluation metric
    • In-depth description of the sources of difference, with clear associated visualisations
    • Comparisons of results for both the HD and the 95th-percentile HD
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The mesh-based surface calculation is used as the default reference, but there is no clear justification for this choice; it seems only natural that a mesh-based implementation would by default be closer to this reference.
    • There is no discussion of several other aspects that may impact the calculation of surfaces, such as the structuring element chosen for the erosion.
    • Absence of any indication of alternative solutions to accommodate issues with non-isotropic set-ups.
    • Absence of any indication of compute time, which may be relevant for application purposes.
    • No assessment of the impact on evaluations for the specific application case chosen, or of whether the observed differences/outliers correspond more to specific organs/shapes/volumes than to others.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper poses an interesting question but may fall short in addressing the practical impact of the observed differences: based on the reported experiments, would there be differences, for instance, in the choice of algorithms or in the assessment of a new trainee's segmentation performance? How would that be (partially) alleviated by the choice of summary statistics? In which contexts/organs would it matter most?

    In addition to these main questions, a few other points come to mind:

    • The authors do not discuss the handling of cases in which the metric is undefined, and whether this is consistently addressed across implementations.
    • The authors dismiss the implementation of the DSC rather easily, but there may also be some subtleties there related to non-definition and the use of probabilities.
    • With 60 scans and 30 OARs, should we not get 1800 segmentation masks?
    • There should be more introspection on the presence of outliers according to the chosen masks: do they correspond to “noisy” segmentations, to segmentations of smaller organs, or to those with more complex shapes?
    • There is no indication of compute time or discussion around it, although it is a potentially relevant factor for application at large scale.
    • Why consider MONAI and MetricsReloaded separately if the implementations are similar? If there is a clear difference in implementation that does not yield any difference in results, this should be highlighted; otherwise, they should probably be grouped together.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Key factors leading to this rating are:

    • Interesting question relevant for the community
    • Slightly biased analysis, with a lack of practical implications of the findings in the stated application
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper delves into the question of whether (and why) there are differences in the Hausdorff distance functions from libraries widely used by the medical imaging and machine learning community. To this end, the authors explored the implementations and conducted experiments to understand to what extent these implementation differences yield differences in the metric.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The research question addressed in this paper is of great importance. When researchers use the HD, we assume that it is the same across papers, but it is actually not.
    • The implementations in five commonly-used libraries were checked, and the differences were clearly explained in the paper.
    • Besides the implementation differences, the authors also ran experiments on real data to quantify the extent of the problem, i.e., the difference in the Hausdorff distance from one implementation to another.
    • The paper is nicely written, it is easy to follow, and the illustrations are clear.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I couldn’t find any important weakness.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The analysis seems reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Introduction

    • Related work. I don’t fully agree with the statement “boundary-based metrics such as the HD offer desirable properties such as shape-awareness”. I don’t see how the HD tells me anything about the shape of a segmentation (i.e., whether the segmentation is square, compact, has holes, etc.). In fact, in the next line it is written “it is also important to acknowledge their limitations, such as inclination towards overlooking holes”, which illustrates my point: the HD doesn’t tell you much about the shape. Please comment on this, remove it, or justify it.
    • To make the paper more enjoyable for the readers, I suggest a very minor update: change the citations so that they are sorted, for instance, from “[17, 16, 3, 28, 9]” to “[3, 9, 16, 17, 28]”. This makes it easier for readers to check the references.

    Method

    • Why is it necessary, “to simulate a practical scenario”, to first have a voxel segmentation, then convert it to a mesh, then convert it back to voxels, and then compute another mesh? I found this unclear.

    Experiments / Results

    • Are “Metrics Reloaded” and “MONAI” actually different? If so, what are their differences? I’m asking this because the results (Table 1) are the same, and they come from the same GitHub repository.

    Typos

    • Abstract: “a boundary-based metrics” -> “a boundary-based metric”.
    • Introduction: “A quantitative metrics” -> “A quantitative metric”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important question, and it answers it by doing an analysis (looking at the code) as well as measures its impact on real data. I couldn’t find any important flaw.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers (R1, R3, R4) for raising relevant questions and providing invaluable feedback. Below, we address the comments from each reviewer:

[R1, R3] We appreciate the reviewers’ emphasis on the importance of separate analyses for different organ groups and the suggestion to use a publicly available dataset for evaluating tubular structures. Our dataset comprises 30 organs, ranging from very small (lacrimal gland, cochlea, and optic nerves) to medium-sized (parotid and submandibular glands, brainstem) to large ones (mandible, spinal cord). Some organs are tubular (e.g., carotid arteries, spinal cord, brainstem), some are complex (e.g., mandible), and others exhibit unique properties. Although we omitted detailed group-specific analyses for brevity, we can confirm that our implementation yields higher errors on tubular organs compared to small or large ones. We acknowledge the importance of stratifying our analysis by organ group (as suggested by R3: noisy, small, complex, and tubular) and will address this in a follow-up publication.

[R1] The reviewer raises an important question regarding the impact of the reported observations on other distance-based metrics. We confirm that the identified differences in Hausdorff distance implementations affect other metrics such as MASD, ASSD, and NSD. We are preparing a separate publication that will thoroughly analyze these implementations and provide a stratified analysis across different organ groups.
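For illustration only (not from the paper or any library), the following sketch shows how other boundary-based metrics can be built from the same directed surface-distance arrays d_ab and d_ba as in the HD sketch near the top of this page, so any difference in boundary extraction propagates to them as well; the NSD here is a simplified, point-based version that ignores surface-element areas, and the tolerance tau is a hypothetical parameter.

```python
# Sketch (not any library's code) showing that other boundary-based metrics are
# built from the same directed surface-distance arrays as the HD, so differences
# in boundary extraction propagate to them as well. d_ab and d_ba are the arrays
# from the HD sketch near the top of this page; tau is a hypothetical tolerance.
import numpy as np

def assd(d_ab, d_ba):
    """Average symmetric surface distance over all boundary points."""
    return (d_ab.sum() + d_ba.sum()) / (d_ab.size + d_ba.size)

def nsd(d_ab, d_ba, tau):
    """Simplified, point-based normalized surface distance (ignores element areas)."""
    return ((d_ab <= tau).sum() + (d_ba <= tau).sum()) / (d_ab.size + d_ba.size)
```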

[R3] We use a mesh-based representation of the segmentation masks because of its well-defined surface, which allows us to compute surface element areas and to precisely calculate distances from query points to the surface. We use triangle meshes for simplicity, but we acknowledge that different mesh types could be employed. Our reference implementation subdivides surface elements into smaller triangles to mitigate the potential effects of mesh type and to enhance calculation accuracy.
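As a rough illustration of the general idea (and explicitly not the authors' reference implementation), the sketch below extracts a triangle mesh from a binary mask with marching cubes in physical coordinates and approximates point-to-surface distances by querying a KD-tree built on the mesh vertices; the reference described above instead computes exact point-to-surface distances and subdivides surface elements, which this vertex-based shortcut does not do.

```python
# Sketch of the general idea only -- NOT the authors' reference implementation.
# A triangle mesh is extracted from a binary mask with marching cubes (in
# physical coordinates via `spacing`), and point-to-surface distances are then
# approximated by querying the mesh *vertices* with a KD-tree.
import numpy as np
from skimage import measure
from scipy.spatial import cKDTree

def mesh_vertices(mask, spacing):
    """Vertices of the marching-cubes surface of `mask` in physical coordinates."""
    verts, _, _, _ = measure.marching_cubes(mask.astype(np.uint8), level=0.5, spacing=spacing)
    return verts

def approx_distances_to_surface(query_points, mask, spacing):
    """Approximate distances from physical-space query points to the mask surface."""
    return cKDTree(mesh_vertices(mask, spacing)).query(query_points)[0]
```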

[R3] We appreciate the suggestion to discuss the choice of structuring elements. To maintain our focus on comparing implementations against a reference baseline, we did not experiment with boundary extraction procedures for calculations in image space, but only describe what is used in the implementations.
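As a small synthetic illustration of why the structuring element matters (our own example, not taken from the paper), erosion-based boundary extraction yields different boundary sets for 6-connected and 26-connected structuring elements:

```python
# Minimal synthetic illustration (not from the paper) of how the structuring
# element used for erosion-based boundary extraction changes the boundary set.
import numpy as np
from scipy import ndimage

zz, yy, xx = np.mgrid[:15, :15, :15]
mask = (zz - 7) ** 2 + (yy - 7) ** 2 + (xx - 7) ** 2 <= 25  # digital ball of radius 5

cross = ndimage.generate_binary_structure(3, 1)  # 6-connected structuring element
full = ndimage.generate_binary_structure(3, 3)   # 26-connected structuring element

boundary_cross = mask & ~ndimage.binary_erosion(mask, structure=cross)
boundary_full = mask & ~ndimage.binary_erosion(mask, structure=full)

# The extracted boundaries differ in size (and location), which can shift the HD.
print(boundary_cross.sum(), boundary_full.sum())
```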

[R3] Please note that our dataset included up to 30 organs per case, with some cases missing segmentations due to a limited field of view or other artifacts, which is why the total number of comparisons is not 1800.

[R3] We acknowledge the importance of computation time and will address this in our follow-up publication with an extended analysis.

[R3, R4] The reviewers correctly note that the results for MONAI and MetricsReloaded are identical. Given that these implementations do not share the same code and differ in their implementations of some other metrics, we chose not to report them in the same column.

[R3] The reviewer highlights the importance of how different implementations handle cases where one segmentation is missing. We analyzed this but could not include it in the conference paper due to page limits. We will address this in our follow-up publication. Additionally, the reviewer’s point about the implementation of the DSC is well taken. We confirm that all implementations yield identical values for binary segmentations (we did not test soft Dice).

[R4] We agree with R4 and will change the term “shape-awareness” to “boundary-awareness.”

[R4] To clarify, we first convert the voxel segmentation to a mesh solely for slight smoothing. We were also interested in the discretization effect produced by different voxel sizes, which we did not report in this paper for brevity.




Meta-Review

Meta-review not available, early accepted paper.


