Abstract

Contrast-enhanced magnetic resonance imaging (MRI) is pivotal in the pipeline of brain tumor segmentation and analysis. Gadolinium-based contrast agents, as the most commonly used contrast agents, are expensive and may have potential side effects, and it is desired to obtain contrast-enhanced brain tumor MRI scans without the actual use of contrast agents. Deep learning methods have been applied to synthesize virtual contrast-enhanced MRI scans from non-contrast images. However, as this synthesis problem is inherently ill-posed, these methods fall short in producing high-quality results. In this work, we propose Conditional Autoregressive Vision Model (CAVM) for improving the synthesis of contrast-enhanced brain tumor MRI. As the enhancement of image intensity grows with a higher dose of contrast agents, we assume that it is less challenging to synthesize a virtual image with a lower dose, where the difference between the contrast-enhanced and non-contrast images is smaller. Thus, CAVM gradually increases the contrast agent dosage and produces higher-dose images based on previous lower-dose ones until the final desired dose is achieved. Inspired by the resemblance between the gradual dose increase and the Chain-of-Thought approach in natural language processing, CAVM uses an autoregressive strategy with a decomposition tokenizer and a decoder. Specifically, the tokenizer is applied to obtain a more compact image representation for computational efficiency, and it decomposes the image into dose-variant and dose-invariant tokens. Then, a masked self-attention mechanism is developed for autoregression that gradually increases the dose of the virtual image based on the dose-variant tokens. Finally, the updated dose-variant tokens corresponding to the desired dose are decoded together with dose-invariant tokens to produce the final contrast-enhanced MRI. CAVM was validated on the BraSyn-2023 dataset with brain tumor MRI, where it outperforms state-of-the-art methods.
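The stepwise dose increase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `step_fn` and `decode_fn` are hypothetical stand-ins for the autoregressive module and the decoder, tokens are plain lists of floats, and the three steps are an assumption loosely matching the 33%, 66%, and standard doses mentioned in the reviews.

```python
from typing import Callable, List, Sequence

Tokens = List[float]


def autoregressive_dose_rollout(
    dose_variant: Tokens,
    dose_invariant: Tokens,
    step_fn: Callable[[Sequence[Tokens]], Tokens],
    decode_fn: Callable[[Tokens, Tokens], Tokens],
    num_steps: int = 3,
) -> Tokens:
    """Raise the virtual dose step by step, then decode once at the end.

    step_fn (hypothetical) sees every earlier dose-variant state and
    returns the tokens for the next, higher dose.
    decode_fn (hypothetical) combines the final dose-variant tokens with
    the untouched dose-invariant tokens.
    """
    history = [list(dose_variant)]  # state at the non-contrast "dose"
    for _ in range(num_steps):
        history.append(step_fn(history))
    return decode_fn(history[-1], dose_invariant)


# Toy stand-ins so the sketch runs; a real model would use learned networks.
def toy_step(history: Sequence[Tokens]) -> Tokens:
    return [t + 1.0 for t in history[-1]]


def toy_decode(variant: Tokens, invariant: Tokens) -> Tokens:
    return [v + i for v, i in zip(variant, invariant)]


image = autoregressive_dose_rollout([0.0, 0.0], [10.0, 20.0], toy_step, toy_decode)
```

Only the dose-variant tokens change inside the loop; the dose-invariant tokens are consumed once, at decoding time, which mirrors the description in the abstract.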

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1909_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/Luc4Gui/CAVM

Link to the Dataset(s)

https://www.synapse.org/Synapse:syn51156910/wiki/622356

BibTex

@InProceedings{Gui_CAVM_MICCAI2024,
        author = { Gui, Lujun and Ye, Chuyang and Yan, Tianyi},
        title = { { CAVM: Conditional Autoregressive Vision Model for Contrast-Enhanced Brain Tumor MRI Synthesis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors of this study present a Conditional Autoregressive Vision Model (CAVM) for synthesizing contrast-enhanced brain tumor MRIs from non-contrast images. The core contribution of CAVM is its autoregressive strategy, which incrementally increases the dosage of a virtual contrast agent to synthesize images of increasing contrast doses, culminating in the desired standard-dose image.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The visual representation provided in Figure 1 is truly impressive, effectively conveying essential information and capturing the essence of the proposed methodology.
    2. The approach has been rigorously evaluated against state-of-the-art models, showcasing substantial improvements supported by quantitative data and visual examples from the BraSyn-2023 dataset.
    3. The CAVM introduces a novel decomposition tokenizer and a dose-variant autoregression strategy, leading to high-quality synthesis of contrast-enhanced brain tumor MRI scans without using actual contrast agents.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors base their model training on the generation of corresponding lower dose (33% dose) and higher dose (66% dose) images using a method described in reference [18]. However, this method has not been validated on the dataset used in this study. Therefore, it is uncertain whether the generated images contain noise and errors. This uncertainty significantly impacts the feasibility and accuracy of the proposed method in the paper.
    2. The authors have not provided a detailed description or mathematical formulation of the loss function, nor have they explained how this loss function is extended to 3D data and how it constrains the training of the network. This omission of critical technical details hinders the reader’s ability to understand and evaluate the implementation and effectiveness of the proposed method.
    3. The ablation studies conducted do not take into account the effects of high-dose images. It is recommended that the authors incorporate high-dose images in the ablation experiments to more comprehensively assess the model’s performance.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In Table 2, the authors only present the SSIM and PSNR results for brain tumor images. Compared to other tables, this table lacks results for tumor segmentation Dice scores and outcomes for healthy tissue. To fully assess the model’s performance, it is recommended that these crucial results be included.
    2. In addition to the input images, the authors also use corresponding masks but do not validate the specific impact of these masks on the synthetic outcomes. It is recommended that the authors include validations of the mask’s impact on synthesis.
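For readers unfamiliar with the metrics named in these comments, the following is a minimal sketch of PSNR and the Dice score on flattened voxel lists. The optional `region` argument is a hypothetical way to restrict PSNR to tumor or healthy-tissue voxels; it is not taken from the paper, and SSIM is omitted for brevity.

```python
import math
from typing import Optional, Sequence


def psnr(
    reference: Sequence[float],
    synthesized: Sequence[float],
    region: Optional[Sequence[bool]] = None,
    data_range: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in dB, optionally limited to a region."""
    if len(reference) != len(synthesized):
        raise ValueError("images must have the same number of voxels")
    pairs = [
        (r, s)
        for i, (r, s) in enumerate(zip(reference, synthesized))
        if region is None or region[i]
    ]
    if not pairs:
        raise ValueError("no voxels selected")
    mse = sum((r - s) ** 2 for r, s in pairs) / len(pairs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def dice(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice overlap between two binary segmentation masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

Reporting `psnr` once with a tumor mask and once with its complement would give the tumor and healthy-tissue figures the reviewer asks for.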
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and easy to follow, the experiments are sufficient, and the validation results look promising. However, the experimental part is somewhat lacking. Moreover, there is a fundamental flaw in the paper regarding the generation of low-dose and high-dose images. The authors have not adequately demonstrated the efficacy and reliability of the methods used to generate these differing dose images, which could compromise the accuracy of the overall results and the credibility of the experiments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors have not addressed the issues I raised. The incompleteness of the experiments presented makes it impossible to verify the effectiveness of the proposed method. The rebuttal still fails to provide a satisfactory explanation.



Review #2

  • Please describe the contribution of the paper

    This article addresses the issue of generating contrast-enhanced MRI images from non-contrast MRI images, making the assumption that non-contrast imaging and contrast-enhanced imaging are similar under low doses of contrast agents. It proposes a conditional autoregressive vision model (CAVM), which generates contrast-enhanced images with different doses of contrast agents gradually in an autoregressive manner, ultimately generating virtual images with standard doses of contrast agents. On the BraSyn-2023 dataset, CAVM demonstrates superior performance compared to several state-of-the-art medical imaging models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper adopts a reasonable approach by hypothesizing a minimal difference between contrast-enhanced and non-contrast images under low contrast agent dosages. Leveraging a self-autoregressive model, it progressively generates contrast-enhanced images under higher contrast agent dosages, akin to the “Chain-of-Thought” method in natural language processing. Furthermore, to better utilize all previous token states, CAVM introduces a staircase-shaped self-attention mask, which not only enhances synthesis quality but also accelerates inference speed. Experimental results on the publicly available BraSyn-2023 dataset demonstrate that CAVM surpasses several state-of-the-art medical image synthesis methods in synthesis quality, particularly in the synthesis of tumor regions.
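One plausible reading of the staircase-shaped self-attention mask mentioned above is a block-causal mask, in which every token of a given dose step may attend to all tokens of that step and of earlier steps. The sketch below builds such a mask under that assumption; the function name and the layout (equal-sized token blocks per dose step) are hypothetical and may differ from the paper's design.

```python
from typing import List


def staircase_mask(num_steps: int, tokens_per_step: int) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Tokens are laid out step by step; a query in dose step i sees every key
    whose dose step is <= i, which produces a staircase of square blocks.
    """
    size = num_steps * tokens_per_step
    return [
        [(k // tokens_per_step) <= (q // tokens_per_step) for k in range(size)]
        for q in range(size)
    ]


mask = staircase_mask(num_steps=3, tokens_per_step=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Unlike a token-level causal mask, this block-wise form lets all tokens of one dose step be processed together, which may be one reason such a design helps inference speed.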

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I believe that there are still some shortcomings in this article, as follows: (1) The model employs a structure known as the Decomposition Tokenizer to effectively decompose non-contrast-enhanced MRI images into components affected by contrast agent dosage and those unaffected. Within this framework, the fDV unit utilizes the two stages of the Swin UNETR encoder, while the fDI unit incorporates the four stages. Based on this setup, it is postulated that the fDV output represents the contrast agent dosage-affected portion, while the fDI output corresponds to the non-dosage-affected part. However, this definition remains unproven. (2) Given the extensive utilization of autoregressive and attention mechanisms in the model, it is crucial to assess its inference speed compared to other state-of-the-art (SOTA) models. Additionally, while the model proposes modifications to the self-attention masks through CAVM, there seems to be a lack of experimental validation regarding its efficiency and the enhancement in model performance. (3) The paper lacks a thorough investigation and discussion on the limitations of the model. This could encompass crucial aspects such as the model’s inference speed and its generalization capabilities. A comprehensive analysis of these factors would provide a more rounded understanding of the model’s performance and applicability in real-world scenarios.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Here are some constructive suggestions to address the shortcomings mentioned: (1) Regarding the correlation between the fDV unit and the contrast agent dosage-affected portion, as well as the correlation of the fDI unit with the non-dosage-affected part, a supplementary experiment could be conducted to validate these assumptions. This experiment should provide quantitative metrics and visualizations of the outputs to support the decomposition achieved by the model. (2) A comprehensive evaluation of the model’s inference speed, compared to other state-of-the-art (SOTA) models, is crucial. This evaluation should involve benchmarking the model on standard datasets and reporting the average inference time per image or per batch. Additionally, experiments should be conducted to assess the efficiency of the proposed modifications to the self-attention masks through CAVM. By measuring the impact of these modifications on both inference speed and model performance, we can quantitatively compare the model with its baseline version. (3) A detailed discussion on the limitations of the model, particularly focusing on inference speed and generalization capabilities, is essential. It is important to identify potential bottlenecks in the model’s architecture or computational requirements that may hinder its real-world applicability. To address these limitations, suggestions for future improvements should be provided, such as exploring lighter-weight model architectures, optimizing the computational efficiency of the attention mechanisms, or developing techniques to enhance the model’s generalization ability across different datasets and scenarios.
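The average inference time per image suggested in point (2) could be measured with a small harness along these lines. This is a generic sketch, not a protocol from the paper: `infer_fn` is a hypothetical callable wrapping whichever model is being timed, and the warm-up and repeat counts are arbitrary defaults.

```python
import statistics
import time
from typing import Any, Callable, Sequence


def mean_inference_time(
    infer_fn: Callable[[Any], Any],
    inputs: Sequence[Any],
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Average per-input wall-clock time in seconds.

    Each input is run `repeats` times and the fastest run is kept, which
    reduces the influence of unrelated system activity.
    """
    for item in inputs[:warmup]:  # warm caches before timing
        infer_fn(item)
    best_times = []
    for item in inputs:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            infer_fn(item)
            runs.append(time.perf_counter() - start)
        best_times.append(min(runs))
    return statistics.mean(best_times)


# Toy usage: time a trivial function over four small inputs.
seconds_per_input = mean_inference_time(sum, [[1, 2, 3]] * 4)
```

When timing a model that runs on a GPU, the device would need to be synchronized before each clock reading, since kernel launches are typically asynchronous.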

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes a method for generating contrast-enhanced MRI images using a conditional autoregressive vision model (CAVM) to address the problem of generating contrast-enhanced MRI images from non-contrast MRI images. Experimental results demonstrate that CAVM surpasses several state-of-the-art medical imaging models in synthesis quality, particularly excelling in synthesizing tumor regions.

    However, this work still has some potential limitations and areas for improvement:

    Lack of evidence: Although the authors hypothesize that the fDV unit represents the part influenced by the contrast agent dosage, and the fDI unit represents the part unaffected by the dosage, this hypothesis lacks empirical validation. Supplementary experiments can verify this hypothesis and provide quantitative metrics and visual outputs to support the model’s decomposition. Inference speed: Although the model extensively utilizes autoregressive and attention mechanisms, its inference speed compared to other

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I think the authors’ work is novel, although the experiments have shortcomings. Regarding the question raised in my first comment, the authors have decided, due to space constraints, to carry out my proposed follow-up experiments in the subsequent journal version, which is reasonable.



Review #3

  • Please describe the contribution of the paper

    The manuscript presents a conditional autoregressive vision model to improve the synthesis of contrast-enhanced brain tumor MRI.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed CAVM gradually increases the contrast agent dosage and produces higher-dose images based on previous lower-dose ones until the final desired dose is achieved.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. More qualitative results should be presented to demonstrate its efficiency.
    2. Failure cases should also be presented, with detailed explanations of them.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. More qualitative results should be presented to demonstrate its efficiency.
    2. Failure cases should also be presented, with detailed explanations of them.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is difficult to synthesize contrast-enhanced images from non-contrast images, and the manuscript should provide more results to demonstrate its efficiency.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    All my concerns have been addressed.




Author Feedback

R1

Q: The method [18] for generating lower- and higher-dose images is not validated on the dataset used in this work. A: First, this method has been validated on an in-house dataset in [18], and in our work as there is no lower- or higher-dose image in the public BraSyn-2023 dataset, we have previously visually inspected that the generated images are sensible (e.g., see Fig. 1). Based on these qualitative/quantitative evaluations on two different datasets, we believe that this method [18] is reliable. Second, as our goal is to synthesize standard-dose images, rigorous validation of the method [18] itself is beyond the scope of this paper. It is possible that generated lower- and higher-dose images could contain small errors and noises, but our ablation results show that the use of lower- and higher-dose images generated by [18] allows improved synthesis quality for our task, and thus these potential errors and noises do not have a large impact on our method.

Q: A detailed description of the loss functions is needed. A: We are sorry that we did not describe the loss functions in detail as they are commonly used ones in existing image synthesis works. Briefly, the L1 and adversarial losses are borrowed from ResViT, and the L2 loss is the typical mean squared error. We will provide a more detailed description and make our code publicly available.
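As a rough illustration of how the three loss terms named in this answer might be combined, consider the sketch below. It is an assumption-laden example rather than the authors' formulation: the least-squares form of the adversarial term, the weights `w_l1`, `w_l2`, and `w_adv`, and the choice to sum all terms into one generator loss are all hypothetical, since the rebuttal only states that the L1 and adversarial losses follow ResViT and that the L2 loss is the mean squared error.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between synthesized and real voxels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between synthesized and real voxels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def generator_adversarial_loss(disc_scores_on_fake: Sequence[float]) -> float:
    """Least-squares generator term (assumed form): push scores toward 1."""
    return sum((d - 1.0) ** 2 for d in disc_scores_on_fake) / len(disc_scores_on_fake)


def total_generator_loss(
    pred: Sequence[float],
    target: Sequence[float],
    disc_scores_on_fake: Sequence[float],
    w_l1: float = 1.0,   # hypothetical weight
    w_l2: float = 1.0,   # hypothetical weight
    w_adv: float = 1.0,  # hypothetical weight
) -> float:
    """Weighted sum of the three terms; the weighting scheme is an assumption."""
    return (
        w_l1 * l1_loss(pred, target)
        + w_l2 * l2_loss(pred, target)
        + w_adv * generator_adversarial_loss(disc_scores_on_fake)
    )


loss = total_generator_loss([1.0, 3.0], [0.0, 1.0], [1.0, 0.0])
```

Because every term is an average over voxels or discriminator outputs, the same definitions apply unchanged to flattened 3D volumes, which is relevant to the reviewer's question about extension to 3D data.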

Q: Ablation studies should take into account the effects of higher-dose images. A: Thank you for your reminder. We hypothesize that the use of higher-dose images can reduce the changes needed by the individual autoregressive steps and thus they are beneficial. As new results in the rebuttal will lead to automatic rejection, we will include these ablation results in the extended journal version.

Q: Evaluation metrics for the ablation studies are incomplete. A: We did not include the Dice scores and outcomes for healthy tissue in Table 2 due to the page limit and because they have similar trends to the SSIM and PSNR results in the tumor. We will add these results in the journal version.

Q: The impact of the tumor mask on the result is not validated. A: We have empirically observed that the mask is beneficial to the synthesis quality for the proposed and competing methods. Therefore, for each method the mask is included as input and this allows fair method comparison. Note that without the mask input, our method still outperforms the competing methods. As new results are not allowed and there is no space for these previous results not shown, they will be included in the journal version.

R2

Q: It is not proven whether the tokenizers fDI and fDV truly produce dose-invariant and dose-variant components, respectively. A: It is indeed interesting to experimentally confirm the assumptions above. We have previously observed that the tokens in the autoregression change after each step, suggesting that the tokens given by fDV are dose-variant. But as new results are not allowed in the rebuttal, this issue will be more comprehensively explored in the journal version. The insightful suggestion also motivates potential improvement of our method in future work, where the decomposition can be encouraged with modified model and loss designs.

Q: Inference speed needs to be evaluated, as well as the benefit of the modified self-attention design. A: We have empirically observed that the proposed method has similar inference speed to the other methods thanks to the modified self-attention design. Without the design, the inference speed is much slower. As our main goal is to improve the synthesis quality and new results are not allowed, detailed results on speed will be reported in the journal version.

Q: Limitations should be discussed. A: We will add discussions on limitations and future works that address the limitations.

R3

Q: More qualitative results and failure cases should be presented and discussed. A: As new results are not allowed, we will discuss failure cases as much as allowed.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a conditional autoregressive vision model (CAVM) designed to generate contrast-enhanced MRI images from non-contrast MRI images in a gradual, autoregressive manner. It features a novel autoregressive strategy involving a decomposition tokenizer (which decomposes an image into dose-invariant and dose-variant tokens) and a decoder (which updates dose-variant tokens and uses dose-invariant tokens to produce the final synthesis). The proposed methodology is innovative, and the experimental results are promising compared to several medical synthesis models. However, the lack of comparison on downstream segmentation tasks is a drawback. Additionally, it is unclear how much the proposed method relies on the input mask, which is not required by other methods being compared. This is a borderline work. I slightly tend to accept it.



