Abstract

Test-Time Adaptation (TTA) shows promise for addressing the domain gap between source and target modalities in medical image segmentation. TTA also allows a model to fine-tune itself quickly during testing, so it can adapt to the continuously evolving data distributions encountered in clinical environments. We therefore introduce Spatial Test-Time Adaptation (STTA), which, for the first time, integrates inter-slice spatial information from 3D volumes into TTA. The continuously changing distribution of slice data in the target domain can lead to error accumulation and catastrophic forgetting. To tackle these challenges, we first reduce error accumulation with an ensemble of multi-head predictions based on data augmentation. Second, for pixels with unreliable pseudo-labels, we apply regularization through entropy minimization on the ensemble of multi-head predictions. Finally, to prevent catastrophic forgetting, we use a cache mechanism during testing that restores neuron weights from the source pre-trained model, effectively preserving source knowledge. STTA has been validated bidirectionally across modalities on abdominal multi-organ and brain tumor datasets, achieving a relative improvement of approximately 13% in Dice score over SOTA methods in the best case. The code is available at: https://github.com/lixiang007666/STTA.
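
As a rough illustration of the second component (entropy minimization on the augmentation-based multi-head ensemble), the minimal PyTorch-style sketch below shows one way such a loss could be computed. It is not the released implementation; the names (encoder, heads, augment_fns) and the threshold p_th are illustrative assumptions, and geometric augmentations would additionally require an inverse transform before the predictions are averaged.

    import torch
    import torch.nn.functional as F

    def ensemble_entropy_loss(encoder, heads, augment_fns, x, p_th=0.9):
        # x: a batch of 2.5D slabs, shape (B, 3, H, W); one prediction head per augmented view.
        probs = []
        for head, aug in zip(heads, augment_fns):
            logits = head(encoder(aug(x)))
            probs.append(F.softmax(logits, dim=1))       # (B, C, H, W)
        p_bar = torch.stack(probs, dim=0).mean(dim=0)    # ensemble of the multi-head predictions

        conf, _ = p_bar.max(dim=1)                       # per-pixel confidence of the ensemble
        unreliable = (conf < p_th).float()               # pixels whose pseudo-labels are unreliable
        entropy = -(p_bar * torch.log(p_bar + 1e-6)).sum(dim=1)
        return (entropy * unreliable).mean()             # regularize only the unreliable pixels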

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1458_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/lixiang007666/STTA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_CacheDriven_MICCAI2024,
        author = { Li, Xiang and Fang, Huihui and Wang, Changmiao and Liu, Mingsi and Duan, Lixin and Xu, Yanwu},
        title = { { Cache-Driven Spatial Test-Time Adaptation for Cross-Modality Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a test-time adaptation pipeline with three major components: 1) a student-teacher approach (specifically, “mean teacher”), 2) entropy minimization to refine the pseudo-labels obtained from the teacher network, and 3) a cache mechanism that combines the weights from the source model and the test-time adapted student model. This pipeline is applied to two datasets, showing good results. Furthermore, an ablation study is conducted.
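
    For context, the “mean teacher” scheme mentioned above keeps the teacher as an exponential moving average (EMA) of the student. A minimal sketch (the decay value and function name are illustrative assumptions, not taken from the paper):

        import torch

        @torch.no_grad()
        def update_teacher(teacher, student, ema_decay=0.99):
            # Teacher weights track an exponential moving average of the student weights.
            for t_param, s_param in zip(teacher.parameters(), student.parameters()):
                t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)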

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Important and relevant topic to the MICCAI community.
    • Experiments on two publicly-available datasets (BraTS and a dataset commonly used in domain adaptation).
    • An ablation study showing the contribution to the performance of each component of the pipeline.
    • The code is available
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Novelty seems a bit limited for MICCAI. Each of the three core components of the pipeline is a method from a published paper [9,13,22]. The novelty seems to lie in putting these three existing methods together to solve a 2.5D segmentation task. While this can be relevant, I think it is not enough for MICCAI.
    • The formulas are very loosely defined, making it unclear what they mean.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The details of the pipeline are well described in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    On certain statements

    • Page 2, “We present the STTA method, conceptualized for clinical environments”. I couldn’t find what part of the method makes it “conceptualized for clinical environments”.
    • Page 4, “Our hypothesis is that lower confidence suggests a larger domain gap, while relatively high confidence levels indicate a smaller domain gap [19]”. While this could be true in natural images (as in [19]), it is not always true in medical images. In a single image, you could have areas that are inherently uncertain due to low image contrast, resolution, and/or the quality of the labels. I suggest adding some discussion regarding this in the paper. Could this mean that it is only a good idea to make this hypothesis for certain medical images?
    • Page 5 “with shallow layers defined as the two blocks adjacent to the input and output”. Shallow layers are the ones near the input, not near the output. Those are deep layers.

    Method

    • Eq. 6: The upper part of the equation will activate when j = th. “j” refers to the slices, and in “implementation details” it is written that “th” is set to the number of slices. So, it seems that the upper part of the equation will activate only in the last slice. Is this correct?

    Unclear / confusing math

    • The variable “x” refers to the images, and its superscript indicates whether it comes from the target or source domain. However, for source-dataset images the subscript indexes “images”, whereas for target-dataset images it indexes “slices”. In other words: x^s_i is an image, whereas x^t_j is a slice. It would be clearer if the subscript indicated the same thing in both cases. It would also help if the variables were defined more explicitly, e.g., by stating that x^s_i \in R^{h x w x d}.
    • Eq. 2: I suggest simplifying the notation a bit. I personally found it challenging to read variables defined as \tilde{y}’_j^t (one variable with four indices/symbols attached). For example, this paper is about test-time adaptation, which assumes a pretrained model; thus the superscript ^t indicating “target dataset” is unnecessary, since images/labels from the source domain are never dealt with here. Considering MICCAI’s small template, it could also help to remove Eq. 1, which adds no information beyond “there is a pretrained model trained in a standard way on a source dataset”. Additionally, it is unclear what the quotation mark ’ means.
    • Regarding Eq. 2, conf(f_theta…) > p_th: 1) I would remove the “conf” and specify that these are softmax probability values that are treated here as confidence; otherwise it looks like “conf” is a function. 2) Since the input of conf(..) is x^t_j (a slice), this will output a 2D matrix \in R^{w x h}; however, this 2D matrix is compared with a single value p_th \in [0,1]. Is this done element-wise? Or is “conf” a function that takes the 2D matrix, performs some computation, and outputs a single number that is then compared with the threshold? As I mentioned before, defining the dimensions of the variables would help here (see the illustrative sketch after this list).
    • Eq. 3: It seems to me that \bar{p}_{c,n} is the same as one of the “y”s in Eq. 2. Is this correct? Furthermore, why does “p” have a bar, and why is it bolded? What do these indicate?
    • Eq. 6: Here, “Concat” is not really a concatenation. If you concatenated the two Ws, the resulting tensor would have twice the size of the individual Ws. Here, it seems that concat(A,B) is used to indicate that the final network is composed of the weight set A and the weight set B. This should be defined differently to avoid confusion.
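
    For reference, the element-wise reading of the Eq. 2 threshold (later confirmed in the rebuttal) can be sketched as follows; the tensor names and the threshold value are illustrative assumptions:

        import torch

        def select_pseudo_labels(p_bar: torch.Tensor, p_th: float = 0.9):
            # p_bar: softmax probabilities from the teacher ensemble, shape (B, C, H, W).
            conf, pseudo_label = p_bar.max(dim=1)   # per-pixel confidence and argmax label, each (B, H, W)
            reliable_mask = conf > p_th             # element-wise comparison with the scalar threshold
            return pseudo_label, reliable_mask      # only pixels in the mask are treated as reliable
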
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Due to not being able to see if/where the major novelty of the paper lies (as the core methods are already published), and because of the unclear aspects I highlighted above, I give a “weak reject”, as I think that having the rebuttal will be beneficial.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Post-rebuttal paper stack: 3 (n=4)

    After reading the authors’ response and the other reviews, I decided to change from weak reject to weak accept. The main reason is that novelty, which was one of my main concerns, has been clarified.

    However, many of my other concerns remain, and I suggest the authors consider them, together with what the other reviewers mentioned, for the camera-ready version if the paper is finally accepted.

    1) While I understand your definition of “shallow layers”, please note that your definitions of “shallow” and “layer” are non-standard (see my original comment). I suggest keeping them consistent with the rest of the literature. 2) “Our method is a form of TTA research, which fine-tunes models during testing in clinical settings to adapt to diverse medical data from various devices, operators, and patient groups.” It adapts to the dataset, but not to the devices, patient groups, etc. specifically (or, if so, how?). Any TTA method could then be argued to be relevant for clinical settings. I suggest removing this. 3) “The prediction confidence originates from the source model. Whether the low confidence is due to domain gaps or the inherent uncertainty of medical images, the pseudo-label enhancement strategy (Eq. 2) is applicable.” Yes, but does it make sense to apply it, considering that low confidence does not always indicate a domain gap? The borders in low-contrast medical images typically have low confidence.



Review #2

  • Please describe the contribution of the paper

    This paper introduces Spatial Test-Time Adaptation (STTA), a method for improving medical image segmentation by addressing domain gaps between source and target modalities. STTA enables models to adapt quickly during testing to handle the evolving data distribution in clinical environments. The key contribution lies in integrating inter-slice spatial information from 3D volumes into Test-Time Adaptation (TTA), thereby reducing error accumulation and preventing catastrophic forgetting. A cache mechanism is introduced during testing to preserve source knowledge effectively. Experimental results on abdominal multi-organ and brain tumor datasets demonstrate significant performance improvements over state-of-the-art methods, with a relative increase of approximately 13% in Dice.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written and a pleasure to engage with.
    • The authors introduce a novel STTA method aimed at mitigating domain gap issues.
    • Through a thorough evaluation across multiple datasets, the authors showcase the model’s performance and juxtapose it against existing TTA methodologies.
    • The performance improvements remain consistent across various datasets, while the ablation study delves deeper into the individual contributions of each component.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors utilized 2.5D modeling to address the challenge of exploiting spatial information, which is limited in 2D models. However, it would be intriguing to observe a comparative analysis among 2D, 2.5D, and 3D methodologies using STTA.
    • Regarding STTA, does it involve duplicating the decoder of the teacher model K times? A thorough ablation study examining the selection of K seems warranted here.
    • The paper lacks clarity regarding whether this approach utilizes ensemble prediction with the teacher model during inference or relies solely on the student model. If multiheaded prediction (ensemble) is indeed employed during inference, it raises questions about the fairness of comparing STTA with other TTA methods.
    • Could the authors elaborate on the overall objective function utilized in this study, as well as the training procedure for the student-teacher model?
    • In the caching mechanism, how is the threshold chosen to differentiate between shallow and deep caches for source model weights? How does this threshold impact the overall model performance?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned in section 6.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • This paper introduces a novel approach for TTA domain adaptations, aimed at bridging the gap between source and target domains.
    • The authors rigorously tested and evaluated their methods against existing TTA approaches, achieving notable performance improvements in two medical imaging segmentation tasks.
    • The paper is well-written, making it easy to follow. Additionally, the authors have generously shared their code for reproducibility.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I would like to thank the authors for providing additional details in the rebuttal and for addressing the reviewers’ comments and feedback.

    The rebuttal addressed some of my major comments, but I am still not satisfied with the explanations of 1) the deep and shallow parts of the cache mechanism and how they can be adjusted based on dataset compatibility, and 2) how the cache enhances the student model’s long-term adaptability. I hope the authors provide further evidence in the supplementary materials for the camera-ready version.

    Based on these reasons, I would like to keep my rating as a weak accept.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a test-time adaptation (TTA) framework for cross-modality medical image segmentation. The key components include 1) a slab-based input to exploit spatial correspondence among three adjacent slices; 2) pseudo-labeling with consistency regularization based on multiple perturbations (augmentations); 3) a cache mechanism that regularly restores deep-layer weights fully, or shallow-layer weights partially, from the pre-trained source model. The framework is evaluated on two datasets (abdominal CT/MRI and brain FLAIR/T2 MRI). The ablation studies justify the effectiveness of each component.
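
    The slab-based input in 1) can be illustrated with a short sketch: each target slice is fed together with its two neighbours as a 3-channel image. This is not the authors’ code, and replicating boundary slices is an assumption about how volume edges are handled:

        import torch

        def make_slabs(volume: torch.Tensor) -> torch.Tensor:
            # volume: (D, H, W) -> slabs: (D, 3, H, W); slab j stacks slices (j-1, j, j+1) as channels.
            padded = torch.cat([volume[:1], volume, volume[-1:]], dim=0)   # replicate boundary slices
            return torch.stack([padded[:-2], padded[1:-1], padded[2:]], dim=1)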

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. TTA is an important topic and can be useful in many real-world settings.
    2. The proposed framework is evaluated on two datasets, with consistent improvement observed.
    3. Open-source code is provided.
    4. Applying the cache mechanism in TTA is a great idea as the model is not able to access the ground truth labels during adaptation, and it seems to work very well in TTA segmentation.
    5. I also like the idea of feeding three adjacent slices as a multi-channel input, but it would be great to elaborate on the choice of slice number (how about 5, 7, or even larger?). Please refer to more details in my comments below.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Selected comparison methods are mostly Batch Norm-based or entropy/uncertainty-based methods, which is not comprehensive enough. For example, many pseudo-label-based methods [1,2] and others [3] are also strong baselines.
    2. Some of the parameters (e.g., input slice number and augmentation times k) are not well studied.
    3. Some improvements could be made for clarity. Please refer to more details in my comments below.

    [1] Karani, Neerav, et al. “Test-time adaptable neural networks for robust medical image segmentation.” Medical Image Analysis 68 (2021): 101907.
    [2] Chen, Cheng, et al. “Source-free domain adaptive fundus image segmentation with denoised pseudo-labeling.” Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Part V. Springer, 2021.
    [3] Valanarasu, Jeya Maria Jose, et al. “On-the-fly test-time adaptation for medical image segmentation.” Medical Imaging with Deep Learning (MIDL). PMLR, 2024.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I divide this section into two parts. The first part is the confusion/questions the reviewer currently has and the suggestions the reviewer hopes the authors could take in the final version. Answers/revisions to these topics may have a positive impact on the reviewers’ final recommendation (if applicable). The second part is some comments/suggestions for future improvement, in which the reviewer does NOT expect any reaction on those items as some may need extra experiments / more space, which is impractical for a conference submission.

    Part I:

    1. Eq. 2: Based on my understanding, there should be an inverse transformation when the multi-head prediction ensemble is derived from different augmentations. Is that correct? I cannot find the inverse-transformation term in Eq. 2.
    2. Eq.6: The order of shallow and deep layer parameters should be the same in the two scenarios.
    3. What is the value of parameter k? It is not introduced in the manuscript. Is the same k used for both datasets?
    4. It is claimed that for the abdominal dataset, each CT scan was cropped to match the spatial dimension of MRI. How is this cropping determined? Is it center-cropped?
    5. Typos and errors: In Section 2.1 “In previous work”, should be “works”. In Section 3.1 Implementation details, the 5% quantile is not a usual expression. I guess what the authors meant to say is, “conf^S indicates the 5th percentile of softmax confidence from the source model …”.
    6. When leveraging the spatial information, why not include more slices? As additional slices are only treated as extra input channels, this would not significantly increase the computation.
    7. Could you explain why the cache mechanism is only applied to the student model? Have you conducted ablation studies on applying the cache mechanism to both the student and teacher models?

    Part II: For future improvement, it would be great to have:
    • Ablation studies on the input slice number.
    • Ablation studies on the parameter k (number of decoders and data augmentations).
    • More comparisons with other TTA methods (e.g., the references in the weakness section).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper proposes utilizing multiple components to mitigate pseudo-label drift and catastrophic forgetting. Based on the evaluation on two datasets and some ablation studies, the contribution of the paper is justified, though it would be good to have more ablation studies on some of the critical components/hyperparameters. The reviewer still has some questions and hopes the authors can provide further clarification accordingly. Therefore, I recommend a weak accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the authors’ rebuttal and the other reviewers’ comments, I’d like to reiterate my rating of “4 - Weak Accept”, while maintaining my confidence score of 3. I rank this paper 2 (second best) in my stack of rebuttal papers (n=3). I appreciate the authors’ time and effort in preparing the manuscript and rebuttal.

    I’d like to re-emphasize that I did not ask for more comparisons in my initial review. It is just a suggestion for future work (e.g., a potential journal version).

    The rebuttal has addressed some of my addressable concerns. However, I’m not convinced by the author’s reply about the input slice number, as they stated, “Given that most source domain models have 1 or 3 input channels, we opted for 3 slices as input during adaptation.” Unless a model pretrained on natural images is employed, I do not see why you need to limit the input channel to three.

    Why not higher score:

    As stated in my initial review of Section 6 Weakness, the comparison with existing methods is limited. This is not addressable as more experiments are not allowed. So this inherent limitation leads to a rating of “weak accept” instead of clear or strong accept.

    I hope the authors can incorporate all clarification/changes committed in this rebuttal in the final version accordingly.




Author Feedback

We thank all reviewers for their overall support of our paper: “important topic” (R4&5), “a novel method” (R1), “improvements remain consistent” (R1&4&5), and “ablation study showing the contribution” (R1&5). Below, we address two general concerns, followed by specific responses to each reviewer.

G1: Parameter k. (R1&4) We set k=6 (the number of teacher model heads), with the input including the five augmentations mentioned in section 2.2 and the original image. Performance improves with increasing k, plateauing at k=7. Considering performance and memory trade-offs, we opted for k=6. Details will be included in the revision.

G2: Deep and shallow layer in cache mechanism. (R1&5) For encoder-decoder architectures like U-Net or DeeplabV3, we define “shallow layers” as those near the encoder’s input and the decoder’s output end (a layer consists of Conv2d, BN, and ReLU). The threshold for dividing layers into “deep and shallow” can be adjusted for dataset compatibility. We determined that setting one shallow cache layer is optimal, with details to be refined in the revision.
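
A minimal sketch of the cache restoration described above, assuming a simple name-based shallow/deep split (is_shallow) and a blending factor alpha, both of which are illustrative rather than the paper’s exact rule in Eq. 6:

    import copy
    import torch

    def snapshot_source(model):
        # Cache the source model's weights once, before test-time adaptation begins.
        return copy.deepcopy(model.state_dict())

    @torch.no_grad()
    def restore_from_cache(student, source_cache, is_shallow, deep_cache_step, alpha=0.1):
        # deep_cache_step is True when the deep cache fires (e.g., at the last slice of a volume).
        for name, param in student.named_parameters():
            src = source_cache[name]
            if deep_cache_step and not is_shallow(name):
                param.copy_(src)                                # deep cache: fully restore deep-layer weights
            elif is_shallow(name):
                param.mul_(1.0 - alpha).add_(src, alpha=alpha)  # shallow cache: partial restore toward source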

R1Q1: Comparative analysis of 2D, 2.5D, and 3D. The baseline in the first row of Table 3 is the 2D approach. Introducing 2.5D improves performance, but 3D requires significantly more computational resources. R1Q2: The fairness of ensemble. STTA utilizes the ensemble predictions of the teacher model, following the effective multi-branch ensemble approach in TTA. Similar practices have been adopted by comparative methods such as UPL-TTA [22] and URMA [11]. R1Q3: Training procedure. We will add the overall loss to Algorithm 1 in section 2.2.

R4Q1: More comparisons. Apart from the non-TTA method in [2], [1] and [3] are strong baselines for future comparisons. R4Q2: Optimize equations. The inverse transformation is displayed in Fig. 1, and we will update Eq. 2 to include it. We will also reorder Eq. 6. R4Q3: More slices. Testing indicates that accuracy improves with an increase in the number of slices. However, given that most source domain models have 1 or 3 input channels, we opted for 3 slices as input during adaptation. R4Q4: Cache application. The cache mechanism enhances the student model’s long-term adaptability without compromising the teacher model’s stability. Our tests indicate that applying it to both models disrupts the teacher’s weight updates and reduces performance.

R5Q1: Novelty. STTA is the first TTA method to incorporate 3D spatial information and implement a multi-head ensemble with data augmentation. It introduces a novel cache mechanism (recognized by R4 as innovative “Applying cache mechanism in TTA is a great idea”), addressing catastrophic forgetting, unlike [9] where caching accelerates inference in AIGC. While we adopted the deep and shallow cache division from [9], STTA has devised a unique cache access method, validated effectively in experiments. Additionally, we adapted the MT [13] for TTA tasks and used L_Ment [22] to enhance the ensemble effects. R5Q2: Unclear math. We will make equations simpler and clearer according to your insightful suggestion. Also, to be clear, the quotation mark ’ is used to distinguish between the teacher and student models; we compare the confidence matrix with the threshold p_th element-wise; ‘p’ with a bar represents the ensemble output; “concat” will be changed to “merge” to avoid confusion. R5Q3: Clinical environment design. Our method is a form of TTA research, which fine-tunes models during testing in clinical settings to adapt to diverse medical data from various devices, operators, and patient groups. R5Q4: Confidence hypothesis. The prediction confidence originates from the source model. Whether the low confidence is due to domain gaps or the inherent uncertainty of medical images, the pseudo-label enhancement strategy (Eq.2) is applicable. We will refine this statement. R5Q5: Clarification on Eq.6. The deep cache is activated only at the last slice of each 3D volume among the continuous input volumes.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper presents a novel method called Spatial Test-Time Adaptation (STTA) for improving medical image segmentation by addressing domain gaps between source and target modalities. The STTA framework allows models to adapt dynamically during testing by integrating inter-slice spatial information from 3D volumes into the adaptation process. The strengths of the paper include addressing a critical issue in TTA, consistent improvements across datasets, providing open-source code, and the innovative use of a cache mechanism. Weaknesses include limited comparison methods, insufficient parameter studies, and the need for clarity improvements in certain sections. Overall, the paper provides a valuable contribution to TTA for medical image segmentation, which outweighs the weaknesses. I would suggest accepting this paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper received all positive reviews after the rebuttal. I agree with the comments on its effectiveness and novelty. The clarity issues should be addressable given the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



