Abstract

Pathology image assessment plays a crucial role in disease diagnosis and treatment. In this study, we propose a Patch alignment-based Paired medical image-to-image Translation (PPT) model that takes a Hematoxylin and Eosin (H&E) stained image as input and generates the corresponding Immunohistochemistry (IHC) stained image in seconds, bypassing the laborious and time-consuming IHC staining procedure and facilitating timely and accurate pathology assessment. First, our PPT model introduces a FocalNCE loss in patch-wise bidirectional contrastive learning to ensure high consistency between input and output images. Second, we propose a novel patch alignment loss to address the misalignment commonly observed in paired medical image datasets. Third, we incorporate content and frequency losses to produce IHC stained images with finer details. Extensive experiments show that our method outperforms state-of-the-art methods, demonstrates clinical utility in a pathology expert evaluation on our dataset, and achieves competitive performance on two public breast cancer datasets. Lastly, we release our H&E to IHC image Translation (HIT) dataset of canine lymphoma with paired H&E-CD3 and H&E-PAX5 images, the first paired pathological image dataset at a high resolution of 2048×2048. Our code and dataset are available at https://github.com/coffeeNtv/PPT.
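For a concrete picture of how these objectives could fit together, the following is a minimal PyTorch sketch. The loss weights, the FFT-amplitude reading of the frequency loss, and all function names are illustrative assumptions, not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def frequency_loss(fake, real):
        # One plausible reading of a "frequency loss": L1 distance between
        # the 2D Fourier amplitudes of generated and real IHC images.
        fake_fft = torch.fft.fft2(fake, norm="ortho")
        real_fft = torch.fft.fft2(real, norm="ortho")
        return F.l1_loss(torch.abs(fake_fft), torch.abs(real_fft))

    def ppt_objective(l_gan, l_focalnce_bidir, l_patch_align, l_content, l_freq,
                      w_nce=1.0, w_patch=1.0, w_content=1.0, w_freq=1.0):
        # Weighted sum of the objectives named in the abstract; the weights
        # here are placeholders, not the values used in the paper.
        return (l_gan + w_nce * l_focalnce_bidir + w_patch * l_patch_align
                + w_content * l_content + w_freq * l_freq)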

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0817_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0817_supp.pdf

Link to the Code Repository

https://github.com/coffeeNtv/PPT

Link to the Dataset(s)

https://github.com/coffeeNtv/PPT

BibTex

@InProceedings{Zha_Highresolution_MICCAI2024,
        author = { Zhang, Wei and Hui, Tik Ho and Tse, Pui Ying and Hill, Fraser and Lau, Condon and Li, Xinyue},
        title = { { High-resolution Medical Image Translation via Patch Alignment-Based Bidirectional Contrastive Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This study focuses on image translation within the histopathology domain, specifically from H&E to IHC images. The proposed model employs a contrastive loss between paired input and output images and incorporates additional losses to enhance the translation task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The research direction of virtual staining in pathology is currently a prominent and vital topic for the community. The authors have developed and published the HIT dataset for canine lymphoma, featuring two sets of 2048×2048 high-resolution paired images: H&E-CD3 and H&E-PAX5. Unfortunately, the dataset was not made available to the reviewer. Providing a link to the dataset would have facilitated a more detailed evaluation. Moreover, the availability of the dataset developed by the authors would present an intriguing opportunity for further research in this field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This methodology draws inspiration from previously developed algorithms; however, the term ‘bidirectional contrastive loss’ appears misleading, as it essentially comprises two contrastive-loss components: one between the input image and the ground truth (GT) and one between the generated image and the GT. Additionally, the patch alignment loss appears to be a pixel-wise loss, which contradicts the study’s claims. The content loss is calculated using a pre-trained VGG-19 model, which is trained on visual domain data; a model pre-trained on pathology domain data would likely be more beneficial.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Several improvements are suggested for this study as follows:

    1. The authors should validate the effectiveness of the proposed FocalNCE loss by comparing the current model’s performance with models using PatchNCE loss.
    2. The term ‘bidirectional FocalNCE loss’ is misleading as it involves two types of FocalNCE: one between the input and generated image, and another between the generated image and GT. The inclusion of two terms for FocalNCE loss in Equation (3) needs clarification. Why are two terms necessary when one might suffice? The authors need to justify this choice and evaluate the model’s performance concerning these terms.
    3. The proposed Patch loss appears to be a pixel-wise loss, contradicting the authors’ claims. Clarification on the use of this terminology is required.
    4. The authors have employed the VGG-19 network to calculate the content loss, but since VGG-19 is trained on natural-image data, experimenting with pathology domain-specific models might yield more relevant results. Additional experimentation with various feature extractors could elucidate the impact of this choice.
    5. The model is evaluated using the FID score, whose Inception feature extractor is trained on ImageNet and may not accurately reflect performance in histopathology image evaluation. A recent MICCAI study from last year suggests that such FID scores do not correlate well with virtual staining model performance. Computing the FID with a pathology-trained feature extractor could better bridge this gap (see the sketch after this list).
    6. In Table 2, it is unclear why the authors did not compare the challenging stain generator, such as CD3, with comparative models like CycleGAN, CUT, etc., as done for the PAX5 dataset in Table 1. Including such comparisons for CD3 and other datasets could provide deeper insights into the proposed model’s performance on these challenging IHC markers.
    7. The authors should use consistent dataset naming, as current variations create confusion. For better consistency and clarity, the reviewer suggests using MIST-HER2 instead of HER2 for Table 2 and Figure 4.
    8. The visualizations for other datasets in Figure 4 should include comparisons with competitive models such as CycleGAN, CUT, and Pix2PixHD to demonstrate the impact of the proposed model more effectively.
    9. While expert evaluation on the PAX5 dataset is provided, extending this analysis to other models would more comprehensively demonstrate the proposed model’s effectiveness. Evaluating only the proposed model does not provide a complete picture.
    10. More appropriate metrics should be utilized for evaluating the models, as the current metrics may not adequately represent or reflect the quality of the generated IHC images. This is also supported by recent MICCAI studies.
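
    To make point 5 concrete, the following is a minimal sketch of the standard Fréchet distance computation, which is agnostic to the feature extractor; feats_real/feats_fake are hypothetical NumPy arrays of embeddings that could come from a pathology-trained backbone instead of ImageNet Inception:

        import numpy as np
        from scipy import linalg

        def fid_from_features(feats_real, feats_fake):
            # Frechet distance between Gaussians fitted to the two feature
            # sets (the standard FID formula); the feature extractor that
            # produced the embeddings is the variable under discussion.
            mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
            c1 = np.cov(feats_real, rowvar=False)
            c2 = np.cov(feats_fake, rowvar=False)
            covmean = linalg.sqrtm(c1 @ c2)
            if np.iscomplexobj(covmean):
                covmean = covmean.real  # drop tiny imaginary parts from sqrtm
            diff = mu1 - mu2
            return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))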

    The reviewer hopes that these modifications will help the authors improve their current study.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this study promises to release a dataset crucial for medical imaging, several aspects require further consideration. The proposed methodology needs more detailed explanation. Moreover, additional experiments, evaluations, and results are necessary to robustly substantiate the proposed model.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The reviewer agrees that this kind of dataset could help the community. However, I would like to keep my score unchanged due to remaining concerns previously raised. While the authors have addressed many issues satisfactorily, some concerns still persist, preventing me from changing my score.



Review #2

  • Please describe the contribution of the paper

    The paper proposes paired image-to-image translation to generate immunohistochemistry (IHC) images from H&E pathology images. A new approach for high-resolution virtual staining is presented which uses a bidirectional contrastive loss and content+frequency losses for better IHC generation. The paper also presents a novel patch alignment loss to ensure non-registered pairs do not affect the paired generation process. Experiments are performed on 3 different datasets (including the HIT dataset released to the community as part of the paper) and show state-of-the-art results with respect to the chosen baselines. Ablations are performed to highlight which components contribute to the final results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and the motivations for the design choices are well explained. Virtual staining is an ill-posed and difficult but relevant problem, since being able to generate these IHCs can help produce data for training models even if it is not directly useful in clinical practice as an IHC replacement.
    2. The use of patch-level alignment, together with both content and frequency objectives for ensuring tissue alignment, is an interesting idea. Using frequency losses and patch-level instead of pixel-level alignment in pathology is well motivated.
    3. The results are promising on multiple datasets across different perceptual metrics. The ablations are useful in showing the relative contribution of each proposed loss. Finally, the expert assessment is a nice framework for qualitatively evaluating staining patterns, cellular features, and morphological characteristics, and it subjects the proposed approach to a two-expert panel assessment.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Six baseline methods are compared on the PAX5 dataset, but only 2 are compared on the CD3, BCI, and HER2 datasets. No reasons are given for why the rest of the methods are not evaluated on the other problems.
    2. Misaligned H&E-IHC slides are a big problem in training image-to-image models in pathology. The paper makes a lot of good design choices in terms of using patches and additional perceptual constraints. However, the level of alignment obtained using these methods is not explicitly evaluated. Also, the proposed method might not be able to handle non-rigid deformations in tissue. It is also not clear how big a problem this is in the datasets used for experimentation.
    3. Since this is paired generation, the usual losses compare the generated output to the ground truth. In this case, the contrastive loss brings both the H&E features to the generated IHC and the generated IHC features to the ground-truth IHC. This H&E-to-IHC feature alignment is an interesting idea, but it would further strengthen the paper to add an ablation showing what happens if the contrastive loss is not applied in all directions but only between the generated IHC and the ground truth.
    4. [Minor] The figure does not show the full detail of how GANs are used to generate IHCs and how these new objectives interact with the adversarial GAN objective. This is described in the equations, but it would be a good idea to modify the figure to show a complete picture of the whole method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please respond to why all baselines are not used for comparison across all datasets. Also it would be good to hear the author’s thoughts on points 2, 3 in the weaknesses section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting method to do virtual IHC staining based on H&E images. The task is ill-defined and a challenging one and the paper presents an array of new objectives to make it work in pathology. I’m currently on the borderline for this paper due to the missing baselines on 3 different problems. Would like the authors to provide a response to that and other questions raised above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I would like to maintain my rating of a weak accept. I would also like to urge the authors to add the complete set of results from all baselines for all datasets in the camera ready if the paper is accepted. I’m willing to take the authors’ word for it on their claim that all other models do worse than their approach on all datasets.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a Patch alignment-based Paired medical image-to-image Translation (PPT) model for virtual IHC generation from H&E images. The authors release an H&E-IHC dataset of canine lymphoma with a resolution of 2048×2048.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Release of the H&E-IHC dataset.
    2. Use of the additional public BCI/MIST datasets for supporting experiments.
    3. The authors invite two pathologists to evaluate the generated results, which is more convincing than the PSNR/FID image quality metrics.
    4. Comprehensive model comparison.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Relatively low SSIM values (no more than 0.5).
    2. The quantitative metrics used (SSIM/PSNR/FID/LPIPS) can hardly reflect performance in terms of cell-type prediction accuracy.
    3. The expert evaluation was performed only on results from the proposed method, so the scores are not comparable across methods.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors use a 9-residual-block generator with a 1024×1024 input resolution. Would it be possible to choose a lighter generator or reduce the image size to decrease GPU memory usage?
    2. What image sizes are provided to pathologists for scoring? Will images with different fields of view affect the pathologists’ judgment?
    3. The description of the dataset is limited. The authors state that the released dataset is paired H&E-IHC; how do they ensure minimal anatomical differences between the two slides? Do they de-stain and re-stain the same tissue section, or use an adjacent section? How many patients do the data come from?
    4. In Figure 3, why are the baseline models separated into two groups? Why not compare all baseline models with the proposed one from both aspects?
    5. The authors use a 4×4 patch size for the patch alignment loss. Is this the optimal size? Is there any supporting material available to illustrate this?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors conducted extensive experiments on both an in-house dataset and public online datasets for translation between H&E and IHC. Moreover, the authors promised to publish the H&E-IHC image dataset, which is the first paired high-resolution pathological image dataset of lymphoma.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed my main concerns. I’m willing to keep my score.




Author Feedback

We thank all reviewers for their time and effort in reviewing our paper and providing insightful suggestions, and we are grateful for the reviewers’ recognition of our novel patch alignment-based translation method, the open-sourced code for replicability, and the release of the first paired pathology image dataset with a high resolution of 2048×2048. We address the major comments below:

1. Baseline methods for comparison [R1,R3]: We experimented with all baselines on our CD3 and PAX5 datasets, and our model performed best among all methods. Results on CD3 are consistent with the trend on PAX5. Notably, Pyramid (CVPR 2022) and ASP (MICCAI 2023) have proven their effectiveness on their own datasets against all other baselines we compared. Therefore, due to the page limit, we only showed results from the best SOTA methods, Pyramid and ASP, on the CD3 dataset, as comparison with the best SOTA conveys enough information to indicate the effectiveness of our model.

2. Ablation details: (1) [R1,R3] We conducted ablation studies on the directions and components of the contrastive loss, and our design achieves the best performance. The loss between H&E and IHC preserves detail, while the others preserve staining style. Due to the page limit, we only show ablation results for the major components for conciseness. (2) [R3] We adopted the FocalNCE loss in our method design because PatchNCE performed worse than FocalNCE in our experiments. Due to the page limit, we only included the major results.

3. Clarification [R3]: (1) We made a clear statement in the manuscript: “we apply the FocalNCE loss between the output and the ground truth images in a bidirectional manner”, and this is also indicated in Eq. 3. (2) Our patch loss is calculated patch-wise, and we conducted ablation studies on the patch loss to indicate its effectiveness. Please see point 7.
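For illustration only, here is a sketch of what a bidirectional, focal-weighted InfoNCE could look like; the modulating factor (1 - p)^gamma is an assumption in the spirit of focal loss, and the authoritative definition of FocalNCE is Eq. 3 of the paper:

    import torch
    import torch.nn.functional as F

    def focal_nce(query, positive, negatives, tau=0.07, gamma=2.0):
        # InfoNCE with a focal-style weight that emphasizes hard patches.
        # query, positive: (N, D); negatives: (N, K, D). Embeddings are
        # assumed to be L2-normalized patch features.
        pos = (query * positive).sum(-1, keepdim=True) / tau               # (N, 1)
        neg = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1) / tau  # (N, K)
        p = F.softmax(torch.cat([pos, neg], dim=1), dim=1)[:, 0]
        return (-((1.0 - p) ** gamma) * torch.log(p + 1e-8)).mean()

    def bidirectional_focal_nce(f_out, f_gt, negs_out, negs_gt):
        # "Bidirectional": apply the loss output -> GT and GT -> output.
        return focal_nce(f_out, f_gt, negs_gt) + focal_nce(f_gt, f_out, negs_out)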

4. Model pre-trained on natural images [R3]: We observed performance increases when we applied a VGG pre-trained on natural images in our ablation studies. Notably, natural and medical images share similar low-level features such as edges, textures, and shapes, so a VGG pre-trained on natural images can also be useful in medical imaging; the same choice was adopted in Pyramid (CVPR 2022). A pathology pre-trained VGG would be desirable but is hardly achievable due to the scarcity of medical images.
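For reference, a minimal sketch of a VGG-19 content (perceptual) loss of the kind discussed here; the cut-off at relu4_2 is an illustrative choice, and swapping the backbone for a pathology-pretrained encoder is exactly the variation the reviewer suggests:

    import torch.nn as nn
    from torchvision import models

    class VGGContentLoss(nn.Module):
        def __init__(self):
            super().__init__()
            vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
            # Keep layers up to relu4_2 and freeze them as a fixed extractor.
            self.features = vgg.features[:23].eval()
            for p in self.features.parameters():
                p.requires_grad_(False)

        def forward(self, fake, real):
            # L1 distance in feature space instead of pixel space.
            return nn.functional.l1_loss(self.features(fake), self.features(real))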

5. Method evaluation: (1) [R3] We use the same FID as ASP (MICCAI 2023) to evaluate pathology images, and our model achieved the best results on our datasets. (2) [R3] As in Pyramid (CVPR 2022), we use expert evaluation as an additional, complementary result to the quantitative metrics, indicating the effectiveness of our method from the pathological perspective. (3) [R3,R4] Current SOTA methods all have low SSIM, which cannot fully reflect image similarity, so we evaluated our model on 4 different metrics. Moreover, since quantitative metrics may not reflect performance in clinical settings, we conducted an expert evaluation as a complement. (4) [R4] We used 1024×1024 images for scoring on five pathological factors, and our test set covered samples from various fields of view to ensure a comprehensive evaluation.

6. Dataset details: (1) [R1] We perform two rounds of SIFT-based registration in preprocessing, with manual annotation of key points, and poor-quality pairs were removed by pathologists to ensure proper alignment. Figures in the manuscript indicate that our datasets do not suffer from the non-rigid deformation problem. Our datasets will be released for public use. (2) [R4] We use the adjacent slide with two rounds of registration in preprocessing to ensure proper alignment in our datasets. We used 37 cases with 40 section slides for data collection.
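For reference, one plausible SIFT-plus-RANSAC registration pass of the kind described, using standard OpenCV calls; the Lowe-ratio threshold and RANSAC tolerance are illustrative, and the authors additionally rely on manually annotated key points:

    import cv2
    import numpy as np

    def register_pair(moving, fixed):
        # Estimate a homography from SIFT matches and warp `moving`
        # (e.g. the IHC section) onto `fixed` (the H&E section).
        sift = cv2.SIFT_create()
        g1 = cv2.cvtColor(moving, cv2.COLOR_BGR2GRAY)
        g2 = cv2.cvtColor(fixed, cv2.COLOR_BGR2GRAY)
        kp1, des1 = sift.detectAndCompute(g1, None)
        kp2, des2 = sift.detectAndCompute(g2, None)
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = fixed.shape[:2]
        return cv2.warpPerspective(moving, H, (w, h))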

7. Patch size choice [R4]: 4×4 is the optimal size in terms of both metrics and visual appearance. We did not include results for window sizes ranging from 4 to 64 due to the page limit; we found that larger sizes resulted in blur and artifacts.
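A minimal sketch of a patch-wise alignment loss with a configurable window, illustrating why comparing 4x4 patch statistics tolerates small residual misalignment that a pixel-wise loss would penalize; this is an assumed form, not the paper's exact loss:

    import torch.nn.functional as F

    def patch_alignment_loss(fake, real, patch_size=4):
        # Average-pool both images into patch_size x patch_size cells and
        # compare the pooled statistics rather than raw pixels.
        return F.l1_loss(F.avg_pool2d(fake, patch_size),
                         F.avg_pool2d(real, patch_size))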

8. Figures [R4]: We compared all methods in Fig. 2. Due to the page limit, we separated them into two sets in Fig. 3 for easier viewing.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal has adequately addressed reviewers’ comments.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The rebuttal has adequately addressed reviewers’ comments.


