Abstract

Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. Task-Incremental Learning (TIL) offers a privacy-preserving training paradigm in which tasks arrive sequentially instead of being gathered centrally, in line with strict data-sharing policies. However, task evolution can span a wide scope involving shifts in both image appearance and segmentation semantics with intricate correlation, causing concurrent appearance and semantic forgetting. To address this issue, we propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs to mimic past task data, focusing on two aspects: modeling image-mask correspondence and promoting scalability across diverse tasks. Specifically, we introduce a novel Bayesian Joint Diffusion (BJD) model for high-quality synthesis of image-mask pairs, with their correspondence explicitly preserved by conditional denoising. Furthermore, we develop a Task-Oriented Adapter (TOA) that recalibrates prompt embeddings to modulate the diffusion model, making data synthesis compatible with different tasks. Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show clear advantages in alleviating concurrent appearance and semantic forgetting. Code is available at https://github.com/jingyzhang/CGR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0187_paper.pdf

SharedIt Link: https://rdcu.be/dZxdb

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_8

Supplementary Material: N/A

Link to the Code Repository

https://github.com/jingyzhang/CGR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_Comprehensive_MICCAI2024,
        author = { Li, Wei and Zhang, Jingyang and Heng, Pheng-Ann and Gu, Lixu},
        title = { { Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {80--90}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs to mimic past task data, while focusing on two aspects: modeling image-mask correspondence and promoting scalability for diverse tasks. The authors evaluate their method on incremental tasks (cardiac, fundus and prostate segmentation).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Text is well written and structured, Figures look good
    • Source code is provided
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Citations should be ordered when multiple citations are provided at once
    • Figure 1 looks good, but seems very small compared to its large caption
    • Regarding the generation of the memory: how close is it to the training data? As replay in medical environments is problematic due to privacy regulations, a metric showing how “far from” or “close to” the already-seen training distributions the generated memory lies would be helpful. A figure showing what this memory looks like (some generated samples) would also be interesting.
    • Table 1: In addition to mean Dice, Backward Transfer (BWT) and Forward Transfer (FWT) would be more suitable CL metrics, giving an overview of generalizability as well as the amount of forgetting over time. I would suggest replacing HD with BWT and FWT.
    • The EWC/LwF/PLOP hyperparameters are not mentioned in the experimental setup.
    • Judging from Figure 2, EWC performance on Prostate does not look too good. It seems that the network is too rigid, i.e. the lambda hyperparameter is too high, making it difficult to learn on new tasks. How did the authors select this hyperparameter? An ablation for EWC/LwF/PLOP is missing.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    see list of minor/major weaknesses above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Ablations on hyperparameters for CL methods like EWC/LwF/PLOP are missing. Further, the experiments should be evaluated using CL metrics that capture the amount of forgetting and generalizability over time, such as Backward and Forward Transfer. Without these ablations and metrics, it is difficult to assess the true performance of the proposed method over time, i.e. in a CL setup.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed the points in the rebuttal as best they could; therefore, I raised my rating to weak accept.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a task-incremental segmentation framework to handle a wide range of tasks, using generative replay with conditional denoising to synthesize image-mask pairs and a task-oriented adapter to recalibrate CLIP embeddings for generation. The model shows leading results compared to other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper explains the advantage of TIL compared to CIL/DIL, and formulates the main challenge as “concurrent appearance and semantic forgetting”, which provides reasonable motivation for the proposed method.

    • The proposed Bayesian Joint Diffusion model has good novelty and clear theoretical explanation, which uses conditional denoising to successfully preserve image-mask correspondence and restore good image-mask pairs for TIL replay.

    • The paper demonstrates experiments with two TIL task orders and ablation studies on the BJD and TOA modules, showing order robustness and the effectiveness of the proposed modules.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • One concern is the generated image-mask quality and potential forgetting of the diffusion model. During generative replay, although the segmentation model is able to recall old knowledge from synthesized image-mask pairs, the diffusion model itself is also incrementally learning new tasks and may therefore also suffer forgetting. The proposed method uses generated samples when updating the BJD model, which can alleviate BJD forgetting; however, since it trains on self-generated samples, I do not think the forgetting issue can be completely avoided. It would be clearer if the authors could provide some evaluation of BJD generation quality.

    • The proposed method relies heavily on CLIP embeddings and a diffusion model, both of which work for 2D image input only. This limits its application in clinical practice, as radiological scans such as CT/MRI are 3D images. Although a 2D model can handle segmentation over each 2D slice, it will miss significant semantic information in the axial direction and thus underperform, especially for vessel, nerve and tumor segmentation.

    • In the dataset preprocessing for the experiments, the prostate CT and cardiac MRI slices are resized to 256x256, which is fine for natural images but may not be accurate for radiological scans, as the physical size in CT/MRI may also provide important semantic information for segmentation. I would suggest cropping 256x256 patches at a resampled spacing for training.

    • The datasets and tasks are still a little simple for evaluating TIL models. Each task contains no more than three target classes, which is not enough to show the robustness of the generative replay on more complicated tasks, e.g. multi-organ segmentation (TotalSegmentator, Decathlon, etc.) and tumor segmentation (large variation in size and shape).

    • The parameter growth rate is not provided.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The author uses three public datasets, provides online code and data preparation details and training hyperparameters. I think the paper has good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • I suggest adding an evaluation of the generation quality of the BJD model at the final step, in order to demonstrate the robustness and forgetting rate of the generative model.

    • A 2D model is limited in clinical practice. Since the motivation of TIL is to handle a wide variation of tasks, the authors may consider exploring the possibility of a 2D-to-3D adapter on CLIP and a 3D medical diffusion model.

    • Again, compared to CIL, the advantage of TIL is adapting to a wide range of tasks. The authors should use more complicated datasets such as multi-organ segmentation (BCV, TotalSeg, FLARE, StructSeg) and tumor segmentation (BraTS, LiTS, KiTS) to further demonstrate the effectiveness and robustness of the proposed method, instead of only the single organ/sub-organ segmentation datasets in the paper.

    • It’s better to use cropping instead of resizing for radiological scans, which keeps more semantics for segmentation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has reasonable motivation and novel method. The experiments show the effectiveness of the method, but still contains some limitations to demonstrate its robustness and generalizability. Therefore, I recommend the paper as “weak accept”.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After reading the authors’ response, I keep my rating as “weak accept”.

    The motivation is reasonable and the method is novel and reproducible. The results on the three datasets demonstrate its effectiveness. The model should also be able to adapt to 3D inputs, as stated in the rebuttal.

    But still, the two concerns are not fully resolved. 1. Performance closely approaching the joint-training upper bound does not correlate strongly with generative quality, as the training process is complicated and has many variables beyond generation quality. 2. As I mentioned in my review, the three datasets used in the paper involve only single-target segmentation and cannot demonstrate the potential, robustness and generalizability of the proposed method. Multi-organ or tumor segmentation is still needed to fully evaluate a task-incremental learning (TIL) method.



Review #3

  • Please describe the contribution of the paper

    The authors tackle the challenge of task-incremental learning, which combines both class-incremental and domain-incremental learning. Novel methods are proposed for 1) jointly modeling image AND mask during the diffusion process and 2) an adapter to recalibrate prompts during the diffusion process to better guide the image generation for each task. The code is made available.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novelty. The proposed methods are new and explore interesting directions.
    • Evaluation. These methods are compared to many other methods for different tasks on publicly available datasets. Qualitative results are also provided.
    • Ablations. Ablation studies are performed and show the usefulness of individual method components.
    • Code made available.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Too little information on generated samples:
    • No mention of how many synthetic image/mask pairs are generated for each task. Is it comparable to GAR?
    • How does this parameter affect the performance?
    • Limited information on the quality of generated image/mask pairs
    • Computational requirements of BJD during training compared to no BJD
    • Since the image-mask correspondence is modelled in the diffusion process, why not ditch the segmentation model and instead use the diffusion model to generate a mask with guidance from the test image that needs to be segmented? How do these two approaches compare? (e.g. performance, inference time…)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets used are publicly available and the code has been released, which greatly help with reproducibility. While some hyperparameters are given in the paper, having configuration files or the command used for each experiment would ease the reproduction of the experiments. In particular, I could see no mention of the number of images synthesized for each task.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See Weaknesses

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is novelty, a few questions need to be answered in the rebuttal before this can become a full Accept (see weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I thank the authors for their detailed answers. On the topic of using BJD for segmentation directly, I agree with the authors that the higher inference time is not suitable for clinical practice. However, within the scope of a technical paper, we can evaluate the method beyond what would be used in practice. While the metrics provided by the authors show that the downstream model performs better with BJD, these are only indirect measures of the generated joint image-mask pairs. Since this is one of the main contributions, I would have enjoyed a further evaluation in that direction, i.e. segmenting test images by generating masks using these images for the conditioning and computing a Dice score. However, I also acknowledge that per the guidelines, we cannot ask for this experiment. Since my other questions have been answered, I have updated my rating.




Author Feedback

We thank the reviewers for their comments. They highlighted that our method has good novelty (R3&R4), a clear explanation (R3), leading results (R3&R4), available code (R4&R5) and well-structured text (R5).

Q1: Quality evaluation of generated data (R3&R4&R5) In addition to visualizing generated data (Fig. 3), we quantitatively evaluate their quality by comparing our method’s segmentation performance to JointTrain (Table 1). Since JointTrain and our method train the same segmentation model, with the only difference being the use of raw versus generated data, their performance gap reflects how closely the generated data match the raw data distribution. Our method has comparable results to JointTrain, revealing high generation quality close to the raw data.

Q2: Extensibility to 3D image (R3) Our BJD is a flexible diffusion pipeline without constraints on model architecture. Therefore, a potential way for 3D image replay is to adopt a 3D diffusion model as suggested, still following the BJD pipeline using loss Eq. (3) without modification. We will study it in the future.

Q3: Dataset and pre-processing (R3) Given the page limit, we evaluated our method on fundus, cardiac and prostate datasets to include extensive ablations in the submission. We chose them because of their challenging settings with diverse imaging conditions and objectives, and because they are widely used. Our promising results show great potential on other data, such as the suggested multi-organ or tumor datasets. We will include them in future work, and adopt the suggested cropping operation.

Q4: Number of generated samples per task (R4) We empirically set this number to 3000, the same as GAR, because the results on the validation set increase rapidly up to 3000 samples, after which they plateau.

Q5: Computational cost of BJD vs. non-BJD counterparts (NJD) (R4) While BJD incurs higher FLOPs than NJD during training, it does not involve extra learnable parameters, thus avoiding higher FLOPs during inference.

Q6: Why not use BJD for segmentation directly (R4) While our BJD has segmentation potential, it leads to tremendously higher inference time compared to classical segmentation models, even with DDIM acceleration. Such high time consumption may be acceptable to generate training samples for model development, but it is prohibitive for deploying BJD directly on segmentation tasks, which often require a fast response.

Q7: Hyperparameter analysis on EWC, LwF and PLOP (R5) The hyperparameters of EWC, LwF and PLOP were empirically set to 5000, 10 and 0.01, respectively. Notably, this setting involves a trade-off between the model’s plasticity on incoming tasks and memorizability of previous tasks. Lower hyperparameter values may improve results on the incoming prostate task, but at the cost of huge drops on previous tasks (even complete failure). These methods cannot balance plasticity and memorizability even with alternative settings, while our method achieves both.

Q8: Evaluation over time using BWT and FWT (R5)
  • BWT is measured after each training stage to quantify forgetting over time. Since new experiments are not allowed in the rebuttal, we use the results of the last training stage (Table 1) to intuitively analyze the potential forgetting throughout the learning period, i.e., BWT over time. Our method shows the least forgetting in the last training stage, suggesting the best BWT in this stage, which has undergone gradual memory fading as tasks are incrementally involved. In the earlier stages with less memory fading, the forgetting would be relieved inherently and our best BWT trajectory would be retained over time. We will quantitatively evaluate this in future work.
  • FWT appears ill-suited to our task-incremental setting, where tasks vary widely without objective restrictions, making it impossible to anticipate future tasks in advance. For example, a model trained on fundus and cardiac segmentation cannot be directly applied to an unseen prostate segmentation task.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All the reviewers agree to accept it, and I also believe that this is a good article.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All the reviewers agree to accept it, and I also believe that this is a good article.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The rebuttal addresses reviewers’ concerns well. The paper is well-motivated with good evaluations. The paper can be more convincing with comparisons on larger-scale datasets to show its potential for the real-world deployment.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The rebuttal addresses reviewers’ concerns well. The paper is well-motivated with good evaluations. The paper can be more convincing with comparisons on larger-scale datasets to show its potential for the real-world deployment.


