Abstract

In clinical practice, tri-modal medical image fusion can provide a more comprehensive view of lesions than existing dual-modal techniques, helping physicians evaluate a disease’s shape, location, and biological activity. However, because of the limitations of imaging equipment and considerations of patient safety, the quality of medical images is usually limited, which leads to sub-optimal fusion performance and restricts the depth of the physician’s image analysis. There is therefore an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems simultaneously remains extremely challenging. In this paper, we propose TFS-Diff, a model that simultaneously performs tri-modal medical image fusion and super-resolution. Specifically, TFS-Diff is based on the random iterative denoising process of diffusion model generation. We also develop a simple objective function and propose a fusion super-resolution loss, which effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. In addition, a channel attention module is proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding the information loss caused by multiple image processing steps. Extensive experiments on public Harvard datasets show that TFS-Diff significantly surpasses the existing state-of-the-art methods in both quantitative and visual evaluations. Code is available at https://github.com/XylonXu01/TFS-Diff.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3901_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3901_supp.pdf

Link to the Code Repository

https://github.com/XylonXu01/TFS-Diff

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xu_Simultaneous_MICCAI2024,
        author = { Xu, Yushen and Li, Xiaosong and Jie, Yuchan and Tan, Haishu},
        title = { { Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion Model } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15007},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript presents a model that simultaneously performs tri-modal medical image fusion and super-resolution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed TFS-Diff is based on the random iterative denoising process of diffusion model generation, and the loss function combines a simple objective function with a fusion super-resolution loss.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. What are the real effects of the model in clinical scenarios? More clinical-task-related assessments should be conducted.
    2. The tri-modal medical imaging task requires three different modalities. If only two of them are available in a given scenario, how is this handled?
    3. What is the generalization ability of the proposed method on external studies if there is significant heterogeneity between the internal and external studies?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. What are the real effects of the model in clinical scenarios? More clinical-task-related assessments should be conducted.
    2. The tri-modal medical imaging task requires three different modalities. If only two of them are available in a given scenario, how is this handled?
    3. What is the generalization ability of the proposed method on external studies if there is significant heterogeneity between the internal and external studies?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clinical-task related assessments and external studies are missing in the manuscript.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    All my concerns have been addressed.



Review #2

  • Please describe the contribution of the paper

    This paper proposes TFS-Diff, a model that simultaneously performs tri-modal medical image fusion and super-resolution. It consists of two key innovations: the TMFA block and the PSF loss. TMFA is used to fuse features, and the PSF loss is used to balance pixel-level accuracy and structural similarity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The paper proposes a tri-modal method to simultaneously achieve image fusion and super-resolution.
    2. The proposed method is simple and easy to implement.
    3. The experimental results presented all achieve the best performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The novelty of the paper deserves consideration by the reviewers. The backbone used in the article is based on SR3, and this model is based on a latent diffusion method, although TMFA is used to extract tri-modal features during feature extraction. In addition, the PSF loss proposed by the authors is essentially a trade-off between MSE loss and SSIM loss. Therefore, the reviewers consider these innovations to be trivial.
    2. The description of the motivation in the article needs further clarity. The authors mainly explain the motivation for fusion, but the motivation for super-resolution needs further consideration. The claim that image quality problems lead to suboptimal fusion performance is not sufficient reason for super-resolution. For example, MR has the FastMRI method and CT has sparse-view reconstruction. I think it is reasonable to consider different imaging modalities according to the problems mentioned by the authors.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. The writing of the paper needs to be improved and there are typing errors. The authors should proofread.
    2. The paper states “However, no deep learning fusion methods for bimodal medical images…” where the use of “bimodal” is confusing. There are obviously many bimodal image fusion methods.
    3. The paper uses the PSF loss, which combines MSE loss and SSIM loss; ablation experiments with MSE loss only or SSIM loss only could be performed.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Comprehensive consideration based on the clarity, quality, novelty and reproducibility of the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper combines tri-modality image fusion with super-resolution using conditional diffusion model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has some novelties in terms of TMFA block and a new loss function. However, the super-resolution part is similar to SR3. So, overall the novelty is good but not impressive.

    With extensive experiments and ablation studies, the proposed method is well validated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some parts of the paper are not clearly written.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Why particularly tri-modality? It seems the model can work for any number of modalities.

    Why low resolution is R^HW and high resolution is R^3HW? Isn’t it 8x, 4x, 2x?

    Fig. 1 and Section 2.1 are not very clear to me, especially how the diffusion process is incorporated. How many steps for denoising? Did you do an inference of diffusion model and compare with ground truth with L_PSF? Usually, inference of diffusion model is quite long time. How can you manage this during training?

    Reference [20] is latent diffusion model but not DDPM.

    How many cases/images were used for training and testing?

    Can this method work for 3D medical image?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a paper with some novelties and good validation. However, some parts need further explanation.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The rebuttal is well-prepared, and all my concerns were addressed with a satisfactory level of detail. As a result, I have raised my score.



Review #4

  • Please describe the contribution of the paper

    This paper introduces a model that simultaneously realizes tri-modal medical image fusion and super-resolution. The study proposes a conditional diffusion model for tri-modal medical image fusion, introducing a block that improves the model’s capability to fuse features from medical images of different modalities, together with the PSF loss, which balances pixel-level accuracy and structural similarity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Good background review. 2) Introducing a new fusion and super-resolution loss function. 3) Topic is important.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The authors should elaborate more on these:

    • Does upsampling to the target resolution through bicubic interpolation sampling affect the results?
    • How TFS-Diff adopts the U-Net structure

    2) Figure 1 does not contain much information.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It is an interesting paper, but the methods used in the paper need to be elaborated further. Some parts are hard to understand.
    • Figure 1 does not contain much information. As it is an important part of the paper, I recommend that the authors redraw this figure and include more details.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The topic is interesting and important.
    • The results of the experiments are promising.
  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers and editors for their great efforts in the review process.

Reviewer5 (Q1-6) A1: In theory, our model can fuse any number of modalities. We focus on the tri-modal case because: 1) Currently, no studies achieve tri-modal fusion and super-resolution simultaneously, and we take the first step in this direction. 2) Tri-modal fusion is better than the existing two-modality fusion for clinical assistance. Using four modalities simultaneously for diagnosis is uncommon and may lead to information redundancy, affecting clinical judgment. A2: In R^3HW, 3 is the number of channels, not the super-resolution factor. R^HW is a single-channel image. A3: We will make Figure 1 clearer. The TMFA Block is similar to an encoder, integrating features from different modalities before diffusion. After the images are encoded by the TMFA, they enter the conditional diffusion process to yield the fused images. The model undergoes a 4000-step diffusion process. We perform inference on the validation set after every 10 epochs. For the loss function L_PSF, the U-Net is used to predict the noise during the diffusion process, so L_PSF is calculated by comparing the actual noise added (noise in ground truth) with the noise predicted by the U-Net. A4: We will correct it and comprehensively check our paper. A5: We used 84, 10, and 25 images as the training, validation, and test sets, respectively; we will add this information. A6: TFS-Diff is suitable for 3D images; we may address this in future work.
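The noise-prediction setup described in A2 and A3 can be illustrated with a minimal sketch. It is not the authors' implementation: the linear beta schedule, the toy image size, the helper names `q_sample` and `training_pair`, and the use of plain channel concatenation in place of the TMFA block are all assumptions made for illustration. Only the 4000 steps, the single-channel sources (R^HW), and the three-channel target (R^3HW) come from the rebuttal.

```python
import numpy as np

T = 4000  # number of diffusion steps stated in the rebuttal

# Assumption: a linear beta schedule; the rebuttal does not state the schedule.
betas = np.linspace(1e-6, 1e-2, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal retention per step


def q_sample(x0, t, noise):
    """Forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alpha_bars[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise


def training_pair(x0, cond, rng):
    """Build one (network input, regression target, timestep) triple."""
    t = int(rng.integers(0, T))
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, noise)
    # Assumption: condition by channel-wise concatenation (SR3-style),
    # standing in for the TMFA encoding described in the rebuttal.
    net_input = np.concatenate([cond, x_t], axis=0)
    return net_input, noise, t


rng = np.random.default_rng(0)
H, W = 32, 32  # toy size; sources are assumed already upsampled to target size
sources = [rng.random((1, H, W)) for _ in range(3)]  # three single-channel modalities
cond = np.concatenate(sources, axis=0)               # shape (3, H, W)
x0 = rng.random((3, H, W))                           # stand-in for the fused ground truth
net_input, target_noise, t = training_pair(x0, cond, rng)
print(net_input.shape, target_noise.shape, t)
```

A network trained this way regresses `target_noise` from `net_input` and `t`, which matches the rebuttal's statement that the loss compares the added noise with the noise predicted by the U-Net.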

Reviewer6 (Q1-3) A1: Our tri-modal work is significant for clinical applications. We will modify the Introduction and carefully introduce the clinical scenarios involving medical image segmentation, prediction, and detection. More clinical-task-related assessments in medical image fusion may involve constructing and labeling new datasets, and we will address them in our next work. A2: The model can fuse images from two modalities. When image fusion involves only two modalities, we simply concatenate the images from the two modalities and then train the model. A3: The quantitative results are all derived from the test set, demonstrating the good generalization of our method. During training, to further enhance the model’s generalization ability, we adopted dropout, L2 regularization, and data augmentation techniques (such as random rotation and scaling). We will share all the fusion results via GitHub.
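The concatenation described in A2 can be sketched as follows. This is an illustrative example only: the helper names `stack_modalities` and `random_rot90` are hypothetical, and restricting rotation to multiples of 90 degrees is an assumption, since the rebuttal mentions random rotation and scaling without giving details.

```python
import numpy as np


def stack_modalities(images):
    """Concatenate same-sized single-channel images into a (C, H, W) array.

    Works for two, three, or more modalities; C is the number of inputs.
    """
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"all modalities must share one shape, got {shapes}")
    return np.stack(images, axis=0)


def random_rot90(stacked, rng):
    """Rotate all channels together so the modalities stay spatially aligned.

    Assumption: only multiples of 90 degrees are used in this sketch.
    """
    k = int(rng.integers(0, 4))
    return np.rot90(stacked, k=k, axes=(1, 2)).copy()


rng = np.random.default_rng(0)
first, second, third = (rng.random((8, 8)) for _ in range(3))
two_modal = stack_modalities([first, second])            # shape (2, 8, 8)
three_modal = stack_modalities([first, second, third])   # shape (3, 8, 8)
augmented = random_rot90(three_modal, rng)
print(two_modal.shape, three_modal.shape, augmented.shape)
```

Applying the same transform to every channel at once keeps the modalities registered to each other, which a fusion model depends on.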

Reviewer7 (Q1-5) A1: Our model is the first to simultaneously realize tri-modal medical image fusion and super-resolution. Although introducing SR3 as the backbone and proposing PSF may not be impressive innovations, the proposed simple yet effective model successfully solves tri-modal fusion and super-resolution simultaneously. Extensive qualitative and quantitative experiments consistently confirm the superiority of our model. A2: Although Fast MRI and sparse-view reconstruction in CT can improve image quality, they mainly focus on single modalities and cannot address the simultaneous enhancement of multiple low-resolution images. Integrating these tasks into a single end-to-end task is not easy, but doing so can: 1) Reduce error propagation: Processing the two tasks in separate steps carries the risk of errors propagating, potentially introducing artifacts into the fused results. 2) Increase efficiency: An end-to-end model allows parameters to be shared between the two tasks, reducing the overall computational burden. 3) Enhance data learning: By training both tasks simultaneously, the model can learn more robust and generalized features that are optimal for both super-resolution and fusion. When the tasks are split, this synergy is lost because each model learns features specifically for super-resolution or fusion, possibly missing information beneficial to both. A3-5: We will meticulously review our paper to ensure there are no grammatical errors or citation mistakes. We will replace “bimodal” with “tri-modal”. In the ablation experiment for the PSF loss, we substituted the PSF loss with MSE loss.
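Review #2 characterizes the PSF loss as a trade-off between MSE loss and SSIM loss. A minimal sketch of such a trade-off is given below. The exact form and weighting of the authors' loss are not stated on this page, so the weight `lam`, the function names, and the single-window (global) SSIM used instead of the usual sliding-window SSIM are all assumptions.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM computed over the whole array as one window (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mse_ssim_loss(pred, target, lam=0.5, data_range=1.0):
    """Weighted sum of MSE and (1 - SSIM); lam is a hypothetical weight."""
    mse = np.mean((pred - target) ** 2)
    ssim_term = 1.0 - global_ssim(pred, target, data_range)
    return lam * mse + (1.0 - lam) * ssim_term


rng = np.random.default_rng(0)
target = rng.random((16, 16))
pred = rng.random((16, 16))
print(mse_ssim_loss(target, target))      # identical inputs give a loss near zero
print(mse_ssim_loss(pred, target))        # mismatched inputs give a positive loss
print(mse_ssim_loss(pred, target, 1.0))   # lam = 1 reduces to plain MSE
```

Setting `lam` to 1 recovers the MSE-only variant used in the ablation mentioned in A3-5, and setting it to 0 gives the SSIM-only variant that Review #2 suggests testing.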




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The work has a good rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The work has a good rebuttal.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


