Abstract

Gamma camera imaging of the novel radiopharmaceutical [99mTc]maraciclatide can be used to detect inflammation in patients with rheumatoid arthritis. Because this clinical imaging application is new, data are especially scarce: only one dataset of 48 patients is available for developing classification models. In this work we classify inflammation in individual joints of patients' hands using only this small dataset. Our methodology uses diffusion models to augment the available training data for this classification task, which is otherwise small and imbalanced. We also explore augmentation with a publicly available natural image dataset in combination with a diffusion model. We use a DenseNet model to classify the inflammation of individual joints in the hand. Compared with non-augmented baseline accuracy, sensitivity, and specificity of 0.79 ± 0.05, 0.50 ± 0.04, and 0.85 ± 0.05, respectively, our method improves these metrics to 0.91 ± 0.02, 0.79 ± 0.11, and 0.93 ± 0.02. When we use an ensemble model and combine natural image augmentation with [99mTc]maraciclatide augmentation, performance increases further to 0.92 ± 0.02, 0.80 ± 0.09, and 0.95 ± 0.02 for accuracy, sensitivity, and specificity, respectively.
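As a point of reference for the metrics quoted above, the following minimal sketch (not from the paper; the per-joint label arrays are hypothetical) shows how accuracy, sensitivity (true positive rate), and specificity (true negative rate) are computed for binary per-joint inflammation predictions.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR), and specificity (TNR) for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # TPR
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # TNR
    return accuracy, sensitivity, specificity

# Hypothetical per-joint labels for one hand image (1 = inflamed, 0 = not inflamed)
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```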

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3427_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Cob_Improved_MICCAI2024,
        author = { Cobb, Robert and Cook, Gary J. R. and Reader, Andrew J.},
        title = { { Improved Classification Learning from Highly Imbalanced Multi-Label Datasets of Inflamed Joints in [99mTc]Maraciclatide Imaging of Arthritic Patients by Natural Image and Diffusion Model Augmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes several domain-specific data augmentation techniques to improve classification performance in the detection of joint inflammation in nuclear imaging. Due to the low availability of nuclear imaging data, applying semi-supervised methods is challenging. Instead, the paper proposes two generative data-augmentation approaches based on a diffusion model: (1) generation from segmentation masks, and (2) a two-stage approach starting from natural hand images. The results show that both approaches seem to improve the classification performance, with augmentation of the original data being preferable. Additionally, an extension of the two approaches with Perlin noise-based modification of the masks is explored, yet it does not show any clear improvement.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Clear and well-structured paper.
    2. Niche and interesting domain of hand radioisotope imaging.
    3. Generally, a principled approach to method development - data preprocessing, validation, experiments.
    4. The developed techniques can, in principle, be repurposed for other nuclear imaging applications.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited methodological novelty. The study is more about the data augmentation techniques (which are, however, non-trivial).
    2. Lack of statistical analysis, which is particularly important given the small sample size (n=48). The test subset is reportedly 20%, i.e. 7 samples, which means a single-joint performance score is quantised in steps of roughly 14%. Are the observed multi-joint accuracy and TPR improvements of 12% and 15%, respectively, significant?
    3. The justification (or hypothesis) for using Perlin noise to augment the masks is not provided. The referenced papers do not show any evidence, and rather report that, in a similar context, Perlin noise performs similarly to Gaussian noise.
    4. It is not clear how exactly the final results (Tables 1 and 2, Figure 6) are computed, e.g. how are the metrics, particularly TPR and TNR, averaged across the 15 regions? (One plausible aggregation convention is sketched after this list.)
    5. Despite the several performance indicators presented, the results are not sufficiently discussed. What do the metrics tell in conjunction? What may be the reason behind the 11k data not providing improvements comparable to the M* data? Is the missing wrist region, whose inflammation is dominant in the original (and therefore test) data, the reason behind the inferior performance?
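As an illustration of the aggregation question in point 4, here is a minimal sketch of the difference between macro- and micro-averaging TPR/TNR over joint regions; the confusion-matrix counts are randomly generated placeholders, and the paper does not state which convention it uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder confusion-matrix counts per joint region: columns = [TP, FN, TN, FP]
counts = rng.integers(0, 20, size=(15, 4))
tp, fn, tn, fp = counts.T

# Macro-averaging: compute TPR/TNR per region, then take the unweighted mean,
# so every region counts equally regardless of how many joints it contains.
macro_tpr = np.mean(tp / np.maximum(tp + fn, 1))
macro_tnr = np.mean(tn / np.maximum(tn + fp, 1))

# Micro-averaging: pool the counts over all regions first, so regions with
# many positive joints (e.g. the wrist) dominate the result.
micro_tpr = tp.sum() / (tp.sum() + fn.sum())
micro_tnr = tn.sum() / (tn.sum() + fp.sum())

print(f"macro TPR {macro_tpr:.2f} / micro TPR {micro_tpr:.2f}")
print(f"macro TNR {macro_tnr:.2f} / micro TNR {micro_tnr:.2f}")
```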
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The method is, generally, clearly described.
    • Repeatability of the reference annotations is unclear. Only one reader assigned the labels. Considering the small sample size and lack of statistical analysis, it could have considerably affected the reliability of the final results.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Abstract: “improving the sensitivity by 58%”. Comparing sensitivity levels makes sense only at fixed specificity levels, which is not the case here. Please rephrase.
    2. Page 3. The authors say “common accuracy metric as accuracy can be misleading for imbalanced datasets” and still proceed to report “accuracy” as one of the core performance indicators. The reported accuracy values should be removed or replaced with balanced accuracy.
    3. Page 3: “were trained 3 times each with different random initialisation and then the metrics were averaged”. Additionally, five-fold models were developed and evaluated over 15 classes. Please explain how exactly the numbers in Tables 1 and 2 were computed. Also, what are the numbers after ±?
    4. The procedure of Perlin noise augmentation (“For each individual joint/joint region…”) relies on several hyper-parameters, yet different values of those were not tested (according to the text). This limitation should be discussed.
    5. Were the classification (DenseNet) and diffusion (Palette) models optimized from random / ImageNet / other initialization? Please state this explicitly, since it is important in the context of this study (i.e. small sample size).
    6. Figure 6: please add grid lines and zoom in on the meaningful metric ranges. Otherwise, the plots are difficult to read and lack informativeness.
    7. Page 7: “The Perlin variation increases performance over the non Perlin version in all metrics except TNR which decreases slightly”. Considering the ± values (presumably, standard deviation), the improvements are most likely not significant. Please avoid overstatements until statistical testing is done (a minimal paired-test sketch is given after this list). In the same paragraph, “the TPR increases 9% from the non Perlin m* augmented model”: as in comment 1, this comparison and quantitative estimation of the improvement is only meaningful at a fixed specificity level. If this is the case, please state so explicitly; otherwise, remove it.
    8. Table 2: considering the overlap between the intervals (presumably, if ± denotes std. dev.), it is not clear what the bold values are supposed to indicate. Please explain this in the caption or undo the highlighting.
    9. Could the worse performance of the 11k-based models be attributed to the absence of wrist data in that dataset? This should be discussed along with other potential reasons for the observed phenomenon. According to Figure 1, in the studied dataset, inflammation in the wrist is 3-4 times more common than in other joints. A more detailed analysis of the classification models’ performance, such as separately looking into the metrics for the wrist and other joint regions, would have been highly beneficial.
    10. Several metrics are presented yet they are never properly discussed. If the authors find them complementary, this needs to be elaborated in the discussion.
    11. “five-fold cross fold” -> “five-fold cross-validation”
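Regarding the statistical testing requested in comment 7 above, one option for a paired comparison of two classifiers evaluated on the same test joints is McNemar's test. The sketch below is illustrative only: the correctness indicators are made-up placeholders, not results from the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical correctness indicators for the same test joints under two models
# (1 = joint classified correctly, 0 = misclassified)
baseline_correct  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
augmented_correct = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0])

# 2x2 contingency table of (baseline correct?, augmented correct?)
table = np.array([
    [np.sum((baseline_correct == 1) & (augmented_correct == 1)),
     np.sum((baseline_correct == 1) & (augmented_correct == 0))],
    [np.sum((baseline_correct == 0) & (augmented_correct == 1)),
     np.sum((baseline_correct == 0) & (augmented_correct == 0))],
])

# The exact (binomial) version is appropriate for small discordant counts
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")
```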
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Clarity and organization of the paper
    • Interesting imaging domain
    • Proposed synthetic data generation method, which is adequate and may be of interest to nuclear MIC community
    • Unclear derivation of the key results
    • Lack of statistical analysis given small sample size
    • Lack of clear conclusion
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I appreciate the feedback provided by the authors and their focused attempt to address my concerns. While the authors have promised to clarify practically all the minor points and provided additional information on the relevant past experiments, I believe two of my major concerns have not been sufficiently addressed: (1) The motivation behind using Perlin noise has not been clarified at all. It is one of the core parts of the paper, and lacking this justification undermines the relevance of the study design; (2) As part of their response, the authors correctly pointed out my miscalculation of the test set sample size (7 -> 10 patients). However, even with this corrected sample size, the need for conducting statistical testing to provide a conclusive statement remains. I find this issue to be particularly important given the results - an improvement in accuracy (which is dominated by one class) and an improvement in F1 (yet with a large increase in the confidence interval). The authors did not comment on this issue in their response.

    I have read through the feedback from the other reviewers and noticed that none of them provided critical comments on the analysis part of the paper. Nonetheless, I would like to highlight the aforementioned issues, as they are essential to the paper delivering reliable and clear scientific evidence. Other than that, the paper is of good quality, interesting, and principled in its engineering aspects.

    While I still do not see the paper as fully satisfying the acceptance criteria, I would like to raise my decision to “borderline”.



Review #2

  • Please describe the contribution of the paper

    Introduces an augmentation technique using a diffusion model to incorporate synthetic data points into small medical image datasets (n = 96 patient data points).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Addresses the pertinent problem of limited data points in medical imaging. Detailed explanation of each augmentation technique. Application of diffusion models to small medical image datasets adds novelty. Demonstrates understanding of clinical feasibility.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Issues with figure clarity, particularly in Figures 4 and 5. Lack of motivation for the choice of DenseNet architecture. Unclear definition of “G-mean,” potentially leading to confusion among readers.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Consider revising the figures for better clarity, adjusting the y-axis range in Figure 5 and increasing the font size in Figure 4. Provide more motivation for selecting DenseNet over other architectures. Clarify the definition of “G-mean” to avoid confusion; perhaps use the geometric-mean definition.
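For reference, the geometric-mean definition suggested above is the standard one for imbalanced binary classification: G-mean is the square root of the product of sensitivity and specificity. A minimal sketch follows (the sensitivity and specificity values are placeholders), with balanced accuracy shown for comparison since it is raised elsewhere in the reviews.

```python
import math

# Placeholder per-class rates from a binary joint-inflammation classifier
sensitivity = 0.80  # TPR: fraction of inflamed joints correctly detected
specificity = 0.95  # TNR: fraction of non-inflamed joints correctly rejected

# G-mean: geometric mean of sensitivity and specificity; it collapses towards 0
# if either class is ignored, which makes it robust to class imbalance.
g_mean = math.sqrt(sensitivity * specificity)

# Balanced accuracy: arithmetic mean of the same two rates.
balanced_accuracy = 0.5 * (sensitivity + specificity)

print(f"G-mean: {g_mean:.3f}, balanced accuracy: {balanced_accuracy:.3f}")
```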

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant problem in medical imaging and proposes a novel augmentation technique. Despite weaknesses in figure clarity and motivation for the architecture choice, the paper contributes meaningfully to the field. Therefore, the recommendation for acceptance is based on its significance and potential impact.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Seems to clarify points raised in the reviews.



Review #3

  • Please describe the contribution of the paper

    This paper uses data from a novel radiotracer in rheumatoid arthritis (RA) of the hand joints, and demonstrates segmentation and classification methods for the joints, augmenting the data with 500 images from the 11k dataset and achieving a 0.95 specificity. To achieve this, the model uses DenseNet and diffusion networks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Rheumatoid arthritis of the hands is challenging due to the multi-joint aspects of the tissues, especially when novel imaging methods are used.

    The main strength of this paper is in showing how to perform segmentation of the multiple focal points that RA can present, especially when limited datasets are available. The authors use data augmentation to create 500,000 training samples from an original dataset of 98 2D RA technetium-PET images.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The primary difficulty is the inability to assess the novelty of the radiotracer, and thus of the patterns in the hand, due to the necessity for anonymity. The dataset also includes only 48 patients, which creates concerns about the generalizability of the models.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors used Perlin noise and data from the 11k dataset to help the results become more repeatable, and also repeated their measurements several times with multiple initialisations. The methods are reasonably well described.

    The main flaw with reproducibility is that the 48-patient dataset is not available outside of this work, and it is uncertain exactly how novel the radiotracer images are when compared with PET imaging used in RA generally.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This work used a clinician for annotations, so bonus points for that.

    The largest question about this work is its generalisation beyond the 48 patients, even though the data were augmented with transformations to a much larger size.

    The use of 2 extra dorsal/palmar views from one patient might confuse the results, and I’m unsure that it adds much.

    The main thing that I would have liked to have seen would be a comparison with any other RA method that is more readily available and not just limited to 48 patients in the world. The authors might be able to find RA images from other datasets.

    Additionally, describing whether code or images would be online after publication would strengthen this work and its reproducibility.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work had mostly clear descriptions (with the exception of a few lines), and put substantial effort into ensuring that its measurements were reasonably accurate, such as by repeating the experiments. The machine learning networks are interesting.

    The numbers are slightly low for 2D images, and without knowing many details of the novel radiotracer, it’s difficult to assess the novelty of some of the aspects.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their feedback. For brevity, we are not addressing all of the minor issues in this rebuttal, but all are noted and will be addressed in the final paper.

R#1 raises concerns about the generalisability of our model with only 48 patients: we would like to emphasise that this is exactly why our work innovates in data augmentation. Nonetheless, more data will soon be obtained from another site to provide further assessment of our methods in future work. In agreement with the reviewer, the novel tracer cannot be revealed at this stage in order to comply with MICCAI anonymity requirements.

R#4 requests motivation for the use of DenseNet and clarification of the definition of the G-mean metric. We chose DenseNet because of its established performance on a wide range of benchmarks. For G-mean, R#4 asks that we perhaps use the geometric mean definition instead. It will be a minor update of our paper to use the terminology “G-mean score” instead, otherwise leaving it defined exactly as in the present paper.

R#5 asks that accuracy be either removed or replaced with balanced accuracy. In our view, no single metric can give a comprehensive assessment of the model’s performance for our dataset, hence we present a variety of metrics. Whilst balanced accuracy is a simple calculation to make (we had in fact done this in an earlier version of the work), we believe the model can be more accurately assessed using the 5 metrics we have provided. As a minor update, we will briefly describe the complementarity of these metrics (such as how TNR and TPR show how the model performs with respect to type I and type II errors, respectively).

R#5 queried whether the 11K augmentation could be worse than the m* augmentation because the 11K augmentation does not contain the wrist. We investigated this before submission, and yes, to some extent the lack of wrist augmentation does affect the results, but it is not the complete story; as this was a secondary finding of our work, it was not included in the initial manuscript. We can add a single line stating the effect.

R#5 prefers the terminology “cross validation” instead of “cross folds”: we agree and will make the update to clarify that the models were trained using five-fold cross validation. The numbers in Tables 1 and 2 correspond to the mean and standard deviation of the performance metrics, found over 5 different training runs; please see our further clarifications below. We will clarify this in a minor update to the paper.

R#5 requests clarification regarding parameter initialisation for our models and the use of any hyperparameter optimisation in the Perlin augmentation. All models were trained from Kaiming uniform initialisation. The hyperparameters were chosen as a trade-off between computational cost and variability of the segmentations. We will state these in a minor update of the paper.

R#5 requests a statistical analysis given our small dataset and some of our results statements, and asks us not to overstate our results, quoting our paper in their review: “Perlin variation increases performance over the non Perlin version in all metrics except TNR which decreases slightly”. We can easily clarify in this line that these results are promising even if not yet strictly established to be significant.

R#5 does appear to misapprehend the size of the test dataset and the results presented. To clarify, the models were trained in a five-fold cross validation, each test dataset being 20% of the data (~10 patients, ~20 images per test dataset), but all results are the mean performance of the five models evaluated on the five separate test datasets, so the results are based on all 48 patients (96 images), not just 7 samples as R#5 seems to indicate.

R#5 requests that Figure 6 be updated to use grid lines and zoom in on the relevant sections, and that we rephrase an aspect of the paper regarding an increase in specificity levels. We agree with R#5 on these points and will make these minor changes.
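To make the training and evaluation protocol described above concrete, here is a minimal sketch of a DenseNet classifier trained from Kaiming-uniform (random) initialisation under five-fold cross-validation, with metrics reported as mean ± standard deviation over the folds. The data handling, training loop, and per-fold metric are placeholders and are not the authors' code.

```python
import numpy as np
import torch.nn as nn
from torchvision.models import densenet121
from sklearn.model_selection import KFold

NUM_JOINT_REGIONS = 15  # multi-label output: one inflammation logit per joint region

def build_model() -> nn.Module:
    # weights=None: no ImageNet pretraining, i.e. train from random initialisation
    model = densenet121(weights=None)
    model.classifier = nn.Linear(model.classifier.in_features, NUM_JOINT_REGIONS)
    # PyTorch's default init is already a Kaiming-uniform variant;
    # re-applying it here just makes the stated choice explicit.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
    return model

patient_ids = np.arange(48)  # placeholder patient identifiers
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(patient_ids):
    model = build_model()
    # ... train on patient_ids[train_idx], evaluate on patient_ids[test_idx] ...
    fold_scores.append(0.0)  # placeholder per-fold metric (e.g. sensitivity)

print(f"{np.mean(fold_scores):.2f} ± {np.std(fold_scores):.2f}")
```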




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


