Abstract

Transformer-based methods have demonstrated impressive results in medical image restoration, attributed to the multi-head self-attention (MSA) mechanism in the spatial dimension. However, the majority of existing Transformers conduct attention within fixed and coarsely partitioned regions (e.g., the entire image or fixed patches), resulting in interference from irrelevant regions and fragmentation of continuous image content. To overcome these challenges, we introduce a novel Region Attention Transformer (RAT) that utilizes a region-based multi-head self-attention mechanism (R-MSA). The R-MSA dynamically partitions the input image into non-overlapping semantic regions using the robust Segment Anything Model (SAM) and then performs self-attention within these regions. This region partitioning is more flexible and interpretable, ensuring that only pixels from similar semantic regions complement each other, thereby eliminating interference from irrelevant regions. Moreover, we introduce a focal region loss to guide our model to adaptively focus on recovering high-difficulty regions. Extensive experiments demonstrate the effectiveness of RAT in various medical image restoration tasks, including PET image synthesis, CT image denoising, and pathological image super-resolution. Code is available at https://github.com/Yaziwel/Region-Attention-Transformer-for-Medical-Image-Restoration.git.
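
The region-restricted attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single head, reuses the token features as queries, keys, and values (no learned projections), and takes the region labels as given rather than computing them with SAM. The additive mask on the attention scores follows the description in Review #1 below; the function name region_attention is hypothetical.

```python
import math

def region_attention(tokens, region_ids):
    """Single-head self-attention restricted to tokens sharing a region id.

    tokens: list of feature vectors (lists of floats). For brevity the same
    vectors serve as queries, keys, and values (an assumption of this sketch).
    region_ids: one integer label per token, e.g. from a segmentation mask.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, q in enumerate(tokens):
        # Scaled dot-product scores; tokens from other regions receive an
        # additive -inf mask so their softmax weight becomes exactly zero.
        scores = [
            sum(a * b for a, b in zip(q, k)) * scale
            + (0.0 if region_ids[i] == region_ids[j] else -math.inf)
            for j, k in enumerate(tokens)
        ]
        peak = max(scores)  # finite: a token always shares a region with itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return out

# Four tokens in two regions: the first two never mix with the last two.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 4.0]]
region_ids = [0, 0, 1, 1]
restored = region_attention(tokens, region_ids)
```

Each output token is a convex combination of the tokens in its own region only, which is the property the abstract refers to as eliminating interference from irrelevant regions.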

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0515_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/Yaziwel/Region-Attention-Transformer-for-Medical-Image-Restoration.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yan_Region_MICCAI2024,
        author = { Yang, Zhiwen and Chen, Haowei and Qian, Ziniu and Zhou, Yang and Zhang, Hui and Zhao, Dan and Wei, Bingzheng and Xu, Yan},
        title = { { Region Attention Transformer for Medical Image Restoration } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. The author introduced a novel region-based multi-head self-attention (R-MSA) mechanism into the transformer to reduce interference from unrelated regions during attention operation, and designed a novel focal region loss to force the model to prioritize high-difficulty regions.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The author employed a new AI model named SAM to provide region-size guidance for the restoration process, which is new.
    2. The author introduced a novel region-based multi-head self-attention (R-MSA) mechanism into the transformer, which can ensure that only pixels within the same semantic region complement each other, thereby eliminating interference from irrelevant regions.
    3. The technology developed by the author can be widely used in different image restoration tasks, like PET Image Synthesis, CT Image Denoising, and Pathological Image Super-Resolution.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation is insufficient. Not all transformers require patching. Even if patching is necessary, the author does not adequately explain the consequences of random patching. The manuscript lacks references and experimental results to support the mentioned “two flaws in attention computation.”
    2. The region-based multi-head self-attention is inaccurate. The author added the interpolated and reshaped segmentation mask to the QK^T to constrain the attention range within individual semantic regions, which is inaccurate since the receptive field of the target spatial features is larger than the original regions.
    3. The dataset for the task of CT Image denoising is too small, with only one for testing, which may be insufficient to effectively train and evaluate deep learning models. By the way, the author didn’t mention whether they used the leave-out strategy to evaluate the performance.
    4. The proposed method RAT does not exhibit a significant improvement over suboptimal alternatives in synthetic metrics such as PSNR and SSIM. For example, RAT (PSNR: 40.9487 dB) is comparable to CycleWGAN (40.6238 dB), and RAT (SSIM: 0.9712) is similar to AR-GAN (SSIM: 0.9702) in Table 1 (as for Table 2, too). Besides, no statistical tests were done, such as t-tests.
    5. The author overclaims the role of focal region loss, since the improvement of the case using focal loss over L1 loss (in the Ablation Study part (3.4)) is not significant, and there are no statistical tests done.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The author should compare more transformer-based methods rather than CNN-based methods.
    2. The design of focal region loss, relying solely on MAE loss (δ=1) to weigh different regions and designate high-difficulty areas, may be imprecise. This approach overlooks potential complexities inherent in different regions of the image. Factors such as texture intricacies, structural anomalies, or the presence of critical features might influence the difficulty of a region. Therefore, a more comprehensive methodology is necessary to accurately identify and address high-difficulty regions, ensuring a more robust and precise training process.
    3. Statistical tests should be done, as the improvement is not significant.
    4. Error maps should be displayed as well, as it can sometimes be challenging to directly evaluate different synthetic images, even with provided zoomed views. 
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The motivation is not sufficiently justified. The primary motive for the proposed method is to address the interference from irrelevant areas when calculating attention in Transformers. However, a major strength of Transformers is their ability to capture global or long-range information. Is including information from dissimilar areas truly detrimental to feature extraction?
    2. Furthermore, by the time Transformer attention is calculated, the input has already undergone CNN downsampling, where pixels have been aggregated.
    3. Lastly, the experimental results are not convincing. The authors have chosen to compare with many CNN-based methods, but there are already numerous pure Transformer-based approaches available. Moreover, the quantitative improvements are minimal and exhibit high variance, making it uncertain whether the improvements are due to the model or merely due to variability in the data.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The author’s response has mostly addressed my concerns, but I still have reservations about the insignificant improvement in results. In Table 1, RAT shows an improvement of 0.37 dB in PSNR compared to the second-best method, but the variance is 2.5233. How can we be sure that this improvement is stable and not due to variance perturbation with such a high variance?



Review #2

  • Please describe the contribution of the paper

    The paper introduces the Region Attention Transformer (RAT), which leverages a novel region-based multi-head self-attention mechanism (R-MSA) to enhance medical image restoration. This method dynamically partitions input images into semantic regions, thereby reducing interference from irrelevant areas and improving restoration quality. The addition of a focal region loss to prioritize difficult regions further underscores the novelty in methodological contribution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The use of R-MSA for dynamic semantic region partitioning is a notable innovation. This approach allows the model to focus attention within semantically similar regions, which is a significant improvement over traditional methods that use fixed or coarsely partitioned regions.
    2. The paper presents extensive experiments across multiple medical imaging tasks (PET, CT, and pathological image super-resolution), where RAT consistently outperforms existing methods. This empirical evidence strongly supports the efficacy of the proposed methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the paper compares with several methods, it lacks comparison with very recent state-of-the-art methods in the domain, which could provide a clearer picture of the RAT’s performance.
    2. There is no discussion on the scalability of the proposed method, especially concerning computational resources and time, which are critical in clinical settings.
    3. The datasets used are relatively standard; however, the paper does not discuss potential biases or limitations in these datasets, which could impact the generalizability of the results.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Consider adding comparisons with more recent state-of-the-art methods to better position your method within the current research landscape.
    2. Include a discussion on the computational efficiency and scalability of RAT, as these are crucial for real-world applications.
    3. Expand on the potential biases in the datasets used and how they might affect the applicability of RAT in varied clinical scenarios.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation stems primarily from the lack of broader comparisons and discussions on scalability and dataset biases, which are important to establish the robustness and applicability of the proposed methods. Enhancing these areas could potentially elevate the paper’s impact and relevance to both the academic and clinical communities.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper provides a novel framework named Region Attention Transformer (RAT) for medical image restoration which 1) incorporates semantic knowledge derived from SAM for dynamic patch partitions; and 2) introduces a focal region loss to guide the model to adaptively focus on recovering high-difficulty regions. Experiments on various medical modalities have demonstrated the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of involving prior knowledge of image grouping from SAM is novel, which is cost-efficient.
    2. Based on this adaptive partition, the proposed focal region loss can benefit the learning of difficult regions.
    3. Extensive experiments are conducted which cover multiple medical modalities, showing the superiority of the proposed framework.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. For the architecture of RAT, further explanations and ablation studies on the concatenation of R-MSA and W-MSA are needed, e.g., what if solely R-MSA is applied?
    2. Will the failure of SAM partition heavily affect the restoration performance? It would be expected for the authors to discuss this.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors claimed to release the source code in the abstract. The method details are clearly described and hyperparameters are provided which help reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Additional questions:

    1. What if the quality of an image is extremely low and SAM provides undesired partitions?
    2. Also if there is an artifact solely segmented by SAM as one of the patches, with no context information contained in this patch, how will the restoration performance be? In other words, how robust is the proposed method to artifacts?
    3. How will the number of segments be determined for SAM? How is the influence of this hyperparameter to the performance?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments are comprehensive. The proposed method shows technical novelty in patch partition which is generally neglected by previous research. The paper is well-written and easy to follow. Additional ablation studies on the design of RAT block are needed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Strong Accept — must be accepted due to excellence (6)

  • [Post rebuttal] Please justify your decision

    After reading the response, most of my concerns have been resolved by the authors, so I would like to raise my rating.




Author Feedback

We sincerely appreciate the reviewers for acknowledging our methodological contribution and providing constructive comments for further clarification. Our feedback is as follows.

Q1(R4): Insufficient motivation. A1: The interference issue from irrelevant regions in attention calculation (where a token is fused with many other irrelevant tokens in attention operation, resulting in interference) is also highlighted by sparse attention [1,2] and deformable attention [3]. Both methods aggregate similar tokens for attention calculation to avoid interference. As stated in [1], standard self-attention may assign high scores to irrelevant tokens. Moreover, data is often insufficient in medical tasks to properly train standard self-attention. In such cases, more concentrated attention is needed to avoid interference. We will add these references in the final submission. [1] Zhao G, et al. Explicit sparse transformer: Concentrated attention through explicit selection. [2] Mei Y, et al. Image Super-Resolution with Non-Local Sparse Attention. [3] Xia Z, et al. Vision transformer with deformable attention.

Q2(R1&R4): Lack of comparison with SOTA Transformer-based methods. A2: In our paper, we have compared RAT with Transformer-based methods such as CTformer (2023) and SwinIR (2021). In fact, RAT also outperforms Eformer (2021), Restormer (2022), and Spach Transformer (2023). We will include these conclusions in the future.

Q3(R1): Computational time. A3: The average inference time of RAT on an AAPM CT image is 3.73 s, consisting of 3.58 s for the SAM branch and 0.15 s for the restoration branch. In the future, we will use Efficient-SAM to improve efficiency.

Q4(R1): Add dataset bias. A4: The pathological images in the TMA dataset exhibit diversity with bias, as they contain six different stains. RAT demonstrates the best performance and robustness.

Q5(R3): Failure of SAM. A5: SAM is robust to diverse degradations. Minor failures do occur in our study, but they do not affect the overall results very much.

Q6(R3): Only applying R-MSA. A6: R-MSA is responsible for intra-region attention and W-MSA is utilized for cross-region connection. There will be a performance drop without the cross-region connection.

Q7(R4): Inaccurate attention range. A7: This is common in segmentation tasks, where the segmentation network increases the receptive field of each token by feature extraction but does not change its semantic label. The final output segmentation map still corresponds to the input image at a token-to-token level. In our study, although tokens from the target spatial feature are enriched with neighboring information by the CNN encoder, the semantic label of each token generally remains consistent with the interpolated segmentation mask. When computational resources permit, we will try to conduct R-MSA at the original resolution.

Q8(R4): Small CT dataset. A8: We just follow EDCNN and CTformer to use the AAPM dataset. The one testing patient consists of 421 images, each sized 512×512. We will implement our method on a larger CT dataset in further studies.

Q9(R4): Limited improvement. A9: As shown in the paper of SwinIR, EDCNN and reference [2], an improvement around 0.05 dB in PSNR can be regarded as significant improvement in image restoration. RAT obtains 0.37/0.19/0.1 dB over the second-best methods across 3 tasks. We conduct significance tests and our improvements over all comparison methods are statistically significant (p<0.05).
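
To illustrate how a small mean PSNR gain can be statistically significant despite a large spread across test images, here is a minimal sketch of a paired t-statistic. The rebuttal does not say which significance test was used, so the paired t-test is an assumption, and the PSNR values below are synthetic placeholders, not results from the paper. The function name paired_t_statistic is hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-image scores.

    The test operates on per-image differences, so variation shared by both
    methods (easy vs. hard images) cancels out of the statistic.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Synthetic, illustrative PSNR values in dB (NOT taken from the paper).
baseline = [36.0, 38.5, 40.0, 41.5, 43.0, 44.5]
proposed = [36.3, 38.9, 40.4, 41.8, 43.4, 44.9]

t_stat, dof = paired_t_statistic(proposed, baseline)
spread = statistics.stdev(baseline)  # image-to-image spread of the scores
```

With these synthetic numbers the per-image differences are tightly clustered, so the statistic lands well above 2.571, the two-sided 5% critical value for 5 degrees of freedom, even though the scores themselves spread over several dB. Whether the same holds for the reported experiments depends on the per-image differences, which are not listed here.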

Q10(R4): Imprecise focal region loss. A10: We acknowledge that only using MAE loss to assess the difficulty of each region is too simple. However, the incorporation of dynamic weighting can still represent a modest advancement compared to the commonly employed L1 or MSE loss. In the future, we will take the texture and structure into consideration for difficulty identification.
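
The idea of weighting regions by their MAE can be sketched as follows. The paper's exact formulation is not reproduced on this page, so the weighting rule here (each region's share of the summed per-region MAE) is an assumption chosen for illustration, as is the function name focal_region_loss. Inputs are flat lists of pixel values with one region label per pixel.

```python
def focal_region_loss(pred, target, region_ids):
    """Region-weighted MAE: regions with larger error get larger weight.

    Assumed weighting (not confirmed by the source): weight_r is the MAE of
    region r divided by the sum of all per-region MAEs.
    """
    # Group absolute errors by region label.
    errors = {}
    for p, t, r in zip(pred, target, region_ids):
        errors.setdefault(r, []).append(abs(p - t))
    region_mae = {r: sum(e) / len(e) for r, e in errors.items()}
    total = sum(region_mae.values())
    if total == 0.0:
        return 0.0  # perfect prediction: nothing to weight
    weights = {r: m / total for r, m in region_mae.items()}
    return sum(weights[r] * region_mae[r] for r in region_mae)

# Region 0 is nearly correct, region 1 is badly off.
target = [1.0, 1.2, 3.0, 3.0]
pred = [1.0, 1.0, 2.0, 4.0]
region_ids = [0, 0, 1, 1]
loss = focal_region_loss(pred, target, region_ids)
```

In this toy case the per-region MAEs are about 0.1 and 1.0, and the weighted loss is dominated by the harder region, so it exceeds the unweighted mean of the two region MAEs (0.55). In a training framework the weights would typically be treated as constants when computing gradients, which is also an assumption of this sketch.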

Other minor issues will be addressed in the final submission.

Thank you once again to all the reviewers for your constructive comments!




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Most of the reviewers’ concerns were addressed after the rebuttal, which made R3 and R4 both increase their scores. R1’s comments have also been answered in the rebuttal, but there are no additional comments given by R1.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I acknowledge the technical contribution of the proposed region-based multi-head self-attention and the extensive validation on various medical image restoration tasks. However, the incorporation of SAM segmentation to provide semantic clues significantly increased the computational cost, requiring 3.58 seconds for SAM and 0.15 seconds for restoration, as mentioned in the authors’ feedback to Reviewer #1. This additional computation overhead makes the performance improvement from the proposed model seem marginal, and without statistical tests to verify its significance, the benefit is questionable. This paper is a borderline work to me. I slightly tend to accept it.



