Abstract

The accurate evaluation of left atrial fibrosis via high-quality 3D Late Gadolinium Enhancement (LGE) MRI is crucial for atrial fibrillation management but is hindered by factors like patient movement and imaging variability. The pursuit of automated LGE MRI quality assessment is critical for enhancing diagnostic accuracy, standardizing evaluations, and improving patient outcomes. Deep learning models aimed at automating this process face significant challenges due to the scarcity of expert annotations, high computational costs, and the need to capture subtle diagnostic details in highly variable images. This study introduces HAMIL-QA, a multiple instance learning (MIL) framework designed to overcome these obstacles. HAMIL-QA employs a hierarchical bag and sub-bag structure that allows for targeted analysis within sub-bags and aggregates insights at the volume level. This hierarchical MIL approach reduces reliance on extensive annotations, lessens computational load, and ensures clinically relevant quality predictions by focusing on diagnostically critical image features. Our experiments show that HAMIL-QA surpasses existing MIL methods and traditional supervised approaches in accuracy, AUROC, and F1-Score on an LGE MRI scan dataset, demonstrating its potential as a scalable solution for LGE MRI quality assessment automation. The code is available at: https://github.com/arf111/HAMIL-QA
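
To make the hierarchical bag/sub-bag idea concrete, below is a minimal PyTorch sketch of a two-level attention-MIL forward pass: patch embeddings are attention-pooled within each sub-bag, and the resulting sub-bag embeddings are pooled again to produce a volume-level prediction. This is an illustrative reconstruction of the idea described in the abstract, not the authors' released implementation (see the linked repository for that); the module names, the simple tanh attention, and the embedding size are assumptions.

```python
# Illustrative two-level (hierarchical) MIL sketch -- NOT the authors' code.
# Assumptions: a 2D encoder maps patches to `dim`-d embeddings; AB-MIL-style
# tanh attention is used for pooling at both the sub-bag and bag level.
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Attention-weighted average over a set of instance embeddings."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                          # x: (num_items, dim)
        w = torch.softmax(self.score(x), dim=0)    # attention weights, (num_items, 1)
        return (w * x).sum(dim=0)                  # pooled embedding, (dim,)


class HierarchicalMIL(nn.Module):
    """Patches -> sub-bag embeddings -> volume-level prediction."""

    def __init__(self, encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder = encoder                     # e.g. a small 2D CNN backbone
        self.sub_bag_pool = AttentionPool(dim)     # pools patches within a sub-bag
        self.bag_pool = AttentionPool(dim)         # pools sub-bags within the volume
        self.classifier = nn.Linear(dim, 1)        # diagnostic vs. non-diagnostic logit

    def forward(self, sub_bags):                   # list of tensors, each (n_i, C, H, W)
        sub_embs = [self.sub_bag_pool(self.encoder(patches)) for patches in sub_bags]
        volume_emb = self.bag_pool(torch.stack(sub_embs))
        return self.classifier(volume_emb)         # one prediction per volume
```

In this reading, "hierarchical" simply means that pooling happens twice: once over patches within a sub-bag, and once over sub-bags within the scan.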

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3535_paper.pdf

SharedIt Link: https://rdcu.be/dVZes

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_26

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3535_supp.pdf

Link to the Code Repository

https://github.com/arf111/HAMIL-QA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sul_HAMILQA_MICCAI2024,
        author = { Sultan, K. M. Arefeen and Hisham, Md Hasibul Husain and Orkild, Benjamin and Morris, Alan and Kholmovski, Eugene and Bieging, Erik and Kwan, Eugene and Ranjan, Ravi and DiBella, Ed and Elhabian, Shireen Y.},
        title = { { HAMIL-QA: Hierarchical Approach to Multiple Instance Learning for Atrial LGE MRI Quality Assessment } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {275--284}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study aims to automate LGE MRI quality assessment by developing HAMIL-QA, a multiple instance learning (MIL) framework. This is critical for enhancing diagnostic accuracy and clinical outcomes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The authors present a novel multiple-instance learning method to evaluate LGE-MRI quality.

    • They evaluated results against fully supervised, classic AB-MIL, and DTFD-MIL baselines.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Was the dataset of 424 LGE-MRI scans collected by the authors? Is it publicly available? When you say 424 scans, are those from different patients (424 patients), or does the same patient have multiple scans?

    • The authors mention a class imbalance problem because most scans fall in the 2 to 4 range. Can you quantify the data distribution before and after?

    • What is the reason behind using only ResNet10? Can you show results for other networks, such as ResNet18, ResNet50, and VGG19?

    • Can you think of a reason behind the large performance drop in the F1 score for DTFD-MIL in Table 1?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The structure and clarity of this paper need considerable improvement; it reads more like a draft than a complete research work. A thorough revision is recommended to meet academic standards.

    2. The paper lacks innovation, merging various components without a clear purpose or novelty. Identifying and emphasizing any novel aspects is crucial for its contribution to the field.

    3. The effectiveness of this study is questionable due to the lack of an ablation study and the reliance on a private dataset. Incorporating publicly available datasets and conducting an ablation study are necessary for a comprehensive evaluation and to ensure the validity of the approach.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See above

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors introduce a multiple-instance learning framework that enhances the quality assessment of LGE MRIs. This improvement is critical as it serves as the foundation for various diagnoses, including the detection and quantification of myocardial scar and the assessment of cardiomyopathies, among other applications.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written, making it easy to read, and addresses a problem with broad applications for managing cardiovascular disease (CVD).

    The method section provides a detailed process for approaching multiple-instance learning, which enhances reproducibility—a valuable aspect of the research.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Generally, the paper appears to focus on assessing the overall quality of the left atrium, which can indicate various conditions. However, in the abstract, they mention specific conditions such as left atrial fibrosis and atrial fibrillation. This divergence somewhat distorts the paper’s focus.

    The paper claims that the use of image patches reduces computational complexity, but it does not give much detail on how.

    The paper’s main focus is on a hierarchical approach, but it does not clearly explain what hierarchy means in this context. Additionally, the conclusion diverges slightly by referring to the method as a dual module, making it difficult to grasp the approach’s hierarchical nature.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It would be beneficial if the authors could contextualize the scope of their work. Are they explicitly aiming to improve the quality assessment of LGE MRI for conditions like Atrial Fibrillation or fibrosis in general? Clarifying the scope would aid in evaluating the potential effectiveness of the proposed approach.

    I would advise against using the term “cognitive process” as it may imply that all radiologists interpret images in a similar manner. Radiologists often have unique approaches based on their individual experiences and expertise levels. Instead, I recommend referencing these studies (https://pubmed.ncbi.nlm.nih.gov/27322975/, https://www.academicradiology.org/article/S1076-6332(18)30388-X/abstract), which explore the diversity in radiologist expertise, to gain a more nuanced understanding of the subject.

    I can’t recall where in the paper the YGT in Fig 1 is referred to, nor the arrows coming from it. It would be better to include those to make it clearer.

    Adding details on whether the authors re-implemented the three models considered for comparison would enhance this work’s evaluation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The well-organized structure of the paper and its focus on addressing a crucial problem with wide application in cardiovascular diseases are commendable aspects of the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the raised comments; however, it is important that they include clarification on computational efficiency in the camera-ready version, as this is one of the main arguments for their paper’s merit.



Review #3

  • Please describe the contribution of the paper

    The paper addresses the issue of quality assurance of late gadolinium enhancement (LGE) in cardiac MRI. The authors propose a hierarchical multiple instance learning (MIL) neural network to improve the identification of diagnostic or non-diagnostic images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a Hierarchical MIL Framework (HAMIL-QA). The model uses a two-tier hierarchical bag and sub-bag structure for targeted analysis within sub-bags and aggregates insights at the volume level.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Further quantitative assessment of the model is needed to directly address its claimed benefits. The reader is not able to conclude on their own that the proposed model is efficient, generalizable, or addresses the issue of limited data.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • While the paper focuses on MIL, somewhat similar architectures that have achieved strong results in various imaging tasks should be evaluated, such as vision transformers and/or masked autoencoders to create a stronger case for MIL.
    • As one of the arguments of using MIL is computational efficiency, the hardware configuration, number of trainable parameters, and training/inference time should be reported for the evaluated models.
    • Additional experiments are needed to address the benefits the model brings. For example, report results after applying suitable out-of-distribution augmentations and after retraining the models while progressively decreasing the training set size.
    • In terms of evaluating generalizability, it might be
    • A recent paper on quality assurance of LGE MRI has been published (Quality assurance of late gadolinium enhancement cardiac MRI images: a deep learning classifier for confidence in the presence or absence of abnormality with potential to prompt real-time image optimisation, Zaman S.). It will be ideal to briefly compare between the two approaches.
    • It will be useful for the reader to see examples of both diagnostic and non-diagnostic images.
    • In Table 1, it is not clear why 6 sub-bags and 60 instances are selected. There is a selective bias, as ABMIL and DTFD-MIL perform better under different parameters.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have overall presented promising results on quality assurance of cardiac LGE. Some additional experimental results are needed so the reader can reach the same conclusions the authors have.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate the reviewers’ feedback and the opportunity to respond.

NOVELTY, PURPOSE (#R1) & AIM (#R3): The primary aim of the paper is to develop an automated method for assessing the diagnostic quality of LGE MRI scans of the left atrium (LA), focusing on fibrosis of the LA wall. Our novelty lies in using a hierarchical framework with two levels, the Sub-Bag Level and the Bag Level, inspired by a common expert assessment process. Experts often examine individual MRI slices to identify potential fibrosis (though methods may vary), similar to how our sub-bag module analyzes each slice. They then integrate these observations to assess the entire scan, which our bag module mimics by aggregating slice-level evaluations to provide an overall diagnostic quality score. We apologize for any confusion regarding the hierarchical and dual-module terminologies. We will clarify these and revise the terminology in the camera-ready version.

MODEL EFFICIENCY (#R3, #R4): We did an ablation experiment in which we found our model is 700x and 89x more computationally efficient than the fully supervised model and the two other models, respectively, since our model processes 2D image patches instead of a full volume or 3D patches. MIL models are conceptually less complex as they aggregate information at the bag level rather than processing each instance individually.

COMPARISON (#R4): As for the comparison with MIL, training ViTs with limited data is challenging due to their lack of inductive biases. ViTs often require pre-training on large-scale datasets like ImageNet, but the significant domain gap makes them suboptimal for fine-tuning on medical datasets with limited sample sizes (Ref: https://arxiv.org/pdf/2202.0670). Hence, we opted for the MIL approach. Also, the paper suggested by #R4 focuses on the left ventricle (LV), classifying LGE certainty to prompt real-time image optimization. In contrast, our paper focuses on the left atrium (LA), assessing the diagnostic quality of LGE MRI scans to evaluate fibrosis. For future work, we plan to incorporate self-supervised learning strategies like Masked Autoencoder.

SELECTIVE BIAS (#R4): We treated the number of sub-bags and instances as hyperparameters. To ensure fair comparison, we standardized parameters across all models. While these models might perform better with different parameters, using the same settings ensures that performance differences are due to model architectures rather than hyperparameter tuning. The ablation study, detailed in the supplementary material, showed that 6 sub-bags and 60 instances provided the best performance.

Some other comments we’d like to address:

#R1: The reason for using ResNet10 as a backbone network is its compactness and lower complexity, which reduces the risk of overfitting. Our main contribution focuses on the MIL part, making the choice of backbone network independent.

#R1: The dataset is private; all 424 scans are from unique patients. No publicly available dataset exists for LGE MRI quality assessment of fibrosis in the LA wall. We plan to release the data with an MOU for access. We will add the data distribution regarding class imbalance in the camera-ready version.

#R1: The F1 score for DTFD-MIL is sensitive to hyperparameter settings, and using 0.5 as the default threshold might not be optimal. Hence, to provide a more comprehensive performance evaluation, we also reported the AUROC score, which shows that our model performed better overall.

#R4: We will include examples of both diagnostic and non-diagnostic images in the camera-ready version.

#R4: We plan to conduct experiments using different sizes of training data in future work to address limited-data scenarios.

#R3: We apologize for not clarifying Y_GT and for implying uniformity with “cognitive process.” We will address these issues in the camera-ready version.

#R3: For ABMIL and DTFD-MIL, we took the official implementations and incorporated them with our dataset. We will publish our code once the paper is accepted.
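
As a rough illustration of where the efficiency argument above comes from (a fixed budget of 2D patches rather than a full 3D input), the following hypothetical sketch splits a volume's slices into sub-bags and samples a handful of 2D patches from each. The grouping rule, the patch size, and the 6 x 10 decomposition of the reported "6 sub-bags and 60 instances" setting are assumptions for illustration, not the authors' exact sampling pipeline.

```python
# Hypothetical sub-bag sampling sketch -- assumptions only, not the paper's pipeline.
import numpy as np


def sample_sub_bags(volume, num_sub_bags=6, patches_per_sub_bag=10,
                    patch_size=64, rng=None):
    """Group slices into contiguous sub-bags and crop random 2D patches from each."""
    rng = rng or np.random.default_rng()
    depth, height, width = volume.shape
    slice_groups = np.array_split(np.arange(depth), num_sub_bags)  # contiguous slice groups
    sub_bags = []
    for group in slice_groups:
        patches = []
        for _ in range(patches_per_sub_bag):
            z = int(rng.choice(group))                             # pick a slice in this group
            y = int(rng.integers(0, height - patch_size))          # random top-left corner
            x = int(rng.integers(0, width - patch_size))
            patches.append(volume[z, y:y + patch_size, x:x + patch_size])
        sub_bags.append(np.stack(patches))                         # (patches_per_sub_bag, P, P)
    return sub_bags                                                # len == num_sub_bags


# Example: a 44-slice scan yields 6 sub-bags x 10 patches = 60 small 2D inputs,
# a far smaller computational footprint than feeding the full 3D volume.
bags = sample_sub_bags(np.zeros((44, 640, 640), dtype=np.float32))
```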




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers are happy to accept this paper after the rebuttal, as their concerns have been solved very well.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers are happy to accept this paper after the rebuttal, as their concerns have been solved very well.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    #R1 was very critical and did not vote again. However, when looking at the Author’s rebuttal I feel that certain points were addressed; comparison to other works can be found in the paper (Table 1) and ablation study in the supplemental material.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    #R1 was very critical and did not vote again. However, when looking at the Author’s rebuttal I feel that certain points were addressed; comparison to other works can be found in the paper (Table 1) and ablation study in the supplemental material.


