Abstract

Longitudinal monitoring of multiple sclerosis (MS) lesions provides crucial biomarkers for assessing disease progression and treatment efficacy. However, it remains challenging to detect and segment numerous MS lesion instances accurately. One key limitation lies in the common average blending of sliding-window predictions during inference, where unreliable patch-level outputs often lead to many false-positive results. To address this issue, we propose a Calibrated Inter-patch Blending (CIB) framework for new MS lesion segmentation, leveraging patch-level segmentation performance as blending weights. Specifically, our CIB model incorporates a multi-scale design with two additional prediction heads: one estimates the overall segmentation performance of the input patch, while the other predicts the performance of smaller grids within the patch. This dual-head architecture enables the model to capture both global and local contextual information, reducing over-confident lesion predictions. During inference, the predicted segmentation scores serve as calibration weights for adaptively blending patch predictions. Extensive experiments on the MSSEG-2 dataset demonstrate that our CIB model can significantly enhance both new MS lesion detection (e.g., a 12.82% F1 gain) and segmentation (e.g., a 4.01% Dice gain) across various backbones. Our code will be made public.
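A minimal illustration of the blending step described above, assuming a sliding-window setting: instead of uniformly averaging overlapping patch predictions, each patch's probability map is accumulated with a weight derived from its predicted patch-level score and (upsampled) grid-level scores. The sketch below is only illustrative; `model`, its three outputs, and the way the two score levels are combined are assumptions, not the paper's exact implementation.

```python
import numpy as np

def calibrated_blend(volume, model, patch=80, stride=40):
    """Sliding-window inference with calibrated inter-patch blending (sketch).

    `model(crop)` is assumed to return (prob, patch_score, grid_scores):
    a voxel-wise probability map of shape (patch, patch, patch), a scalar
    predicted Dice for the whole patch, and a small cube of predicted
    Dice scores for sub-grids of the patch (e.g. shape (4, 4, 4)).
    """
    D, H, W = volume.shape
    acc = np.zeros((D, H, W), dtype=np.float32)   # weighted probability sum
    wsum = np.zeros((D, H, W), dtype=np.float32)  # sum of blending weights

    for z in range(0, D - patch + 1, stride):
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                crop = volume[z:z+patch, y:y+patch, x:x+patch]
                prob, patch_score, grid_scores = model(crop)

                # Upsample grid-level scores back to voxel resolution.
                cell = patch // grid_scores.shape[0]
                grid_map = np.kron(grid_scores, np.ones((cell, cell, cell)))

                # Calibration weight; uniform averaging corresponds to
                # weight == 1 everywhere. Combining the two levels by a
                # product is an assumption for illustration.
                weight = patch_score * grid_map

                acc[z:z+patch, y:y+patch, x:x+patch] += prob * weight
                wsum[z:z+patch, y:y+patch, x:x+patch] += weight

    return acc / np.clip(wsum, 1e-6, None)
```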

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0702_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Yejin0111/CIB

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YeJin_New_MICCAI2025,
        author = { Ye, Jin and Dao, Son Duy and Wu, Yicheng and George, Yasmeen and Nguyen-Duc, Thanh and Schmidt, Daniel F. and Shi, Hengcan and Chong, Winston and Cai, Jianfei},
        title = { { New Multiple Sclerosis Lesion Segmentation via Calibrated Inter-patch Blending } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {366 -- 376}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a method for improving the segmentation of new MS lesions using a calibrated inter-patch blending framework. Instead of uniformly averaging the patches to obtain a full segmentation mask, the authors propose to use patch-level and grid-level segmentation performance estimates as weights for blending patches. Additionally, they propose patch-level and grid-level loss functions as regularizers, employing intermediate features from the UNet segmentation network during training. Overall, they show that calibrating the final segmentation predictions with patch/grid-level weights results in better performance with fewer false positives.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    i) The paper tackles a relevant problem in the MS community – segmenting new/modified lesions in MS plays a critical role in quantifying longitudinal changes in the lesions.

    ii) The authors present several ablation studies and demonstrate the modularity of their approach by integrating their framework in previous works.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    i) Paper organization is weak (more details below)

    ii) The motivation for why previous approaches to recomposing patches during inference result in high false positives is unclear.

    iii) The results lack proper statistical analysis.

    iv) A substantial part of their training setup is borrowed from previous work, CoactSeg [1]; however, crucial information on the preprocessing and data augmentations used is missing.

    [1] Wu, Yicheng, et al. “Coactseg: Learning from heterogeneous data for new multiple sclerosis lesion segmentation.” International conference on medical image computing and computer-assisted intervention. Cham: Springer Nature Switzerland, 2023.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    i) References in the introduction don’t support the claim. In the Introduction, the authors say the previous blending approaches result in high false positives due to the sparsity and size of lesions. The references cited are the CoactSeg [1], MSSEG-2 challenge [2], and loss-function [3] papers; these do not provide any information on why previous blending approaches obtain high FP rates. Readers would benefit from literature on the default way of combining patches during inference and on why it results in high FPs.

    ii) Paper organization. The authors need to work on structuring the paper: (i) some results regarding inter-patch blending are already presented in Sec. 2.1 “Problem Definition”; (ii) it was difficult to understand, until reaching Table 3, that the authors were actually proposing two things: calibrated inter-patch blending and regularization at the patch/grid level during training.

    iii) Providing error intervals. The results do not contain any error intervals; the readers don’t know whether only one model was trained with a single seed, whether multiple seeds were used, or whether cross-validation was done. Unless error intervals are provided, it is hard to gauge improvement without knowing whether the results vary widely across folds/seeds. As their dataset contained only 78 samples, having error intervals becomes all the more important.

    iv) Section 2.2: It is unclear to me how the L_patch loss is calculated. L_patch is defined as MSE(d_hat_i, d_i) and d_i is already the Dice between predicted patch and the GT lesion mask. I am not sure how MSE is calculated here. More details on precisely which entities are being compared in MSE loss are required.

    v) Section 2.2: Do the authors have any insight of how the results change when intermediate features from different layers are used?

    vi) Section 3 “Implementation Details”: What data augmentations were used and how were the images cropped to 80x80x80? If the images were resampled, it is important to mention which interpolation method was used during resampling as this might modify the voxels of small lesions.

    vii) Section 3 “Implementation Details”: For the two-time point dataset, did the authors do any registration of the baseline and follow-up scans? If yes, this should also be mentioned in the paper

    viii) Section 3.1: The authors say that their approach works “across diverse architectures”. I believe this is a strong claim, as the baselines they compared against are UNet-based and don’t include, say, transformer or Mamba architectures. This claim should be softened a bit.

    [1] Wu, Yicheng, et al. “CoactSeg: Learning from heterogeneous data for new multiple sclerosis lesion segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.
    [2] Commowick, O., Cervenansky, F., Cotton, F., Dojat, M.: MSSEG-2 challenge proceedings: Multiple sclerosis new lesions segmentation challenge using a data management and processing infrastructure. In: MICCAI 2021, p. 126 (2021).
    [3] Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Tversky loss function for image segmentation using 3D fully convolutional deep networks. In: MLMI 2017, pp. 379–387. Springer (2017).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Poor paper organization, insufficient literature review on patch recomposition during inference, lack of proper statistical analysis and missing crucial details on training

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have made an effort to address my concerns, hence I am increasing my scores.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a Calibrated Inter-patch Blending (CIB) framework for new MS lesion segmentation from 3D MRI. The key idea is to mitigate false positives arising from traditional sliding-window average blending by weighting patch predictions during inference. This is achieved through two auxiliary heads that estimate patch- and grid-level segmentation performance. These predictions are then used as weights to calibrate the final segmentation output. The authors demonstrate performance improvements across multiple backbones on the MSSEG-2 dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Addresses a practical challenge in MS lesion segmentation: inter-patch blending via simple averaging is indeed suboptimal, especially in the context of sparse and small lesions, making the proposal timely and relevant.

    2) Simple but effective idea: The use of patch/grid-level calibration to weight segmentation outputs is a conceptually simple solution, yet shows performance benefits.

    3) Architecture-agnostic design: The proposed heads can be plugged into existing models with minimal overhead (~7.35k parameters, 7.18 GFLOPs), making the approach practical for integration.

    4) Demonstrated improvements in both Dice and F1 scores across multiple models (CoactSeg, SNAC, Neuropoly), indicating generalizability.

    5) False positive reduction is well-motivated and quantitatively supported by FDR metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Limited novelty in formulation: The idea of calibrating or weighting predictions is not new in segmentation literature. Similar ideas exist in test-time augmentation ensembling or uncertainty-weighted fusion. The authors fail to clearly delineate how their method conceptually differs from existing uncertainty- or confidence-based fusion strategies.

    2) Performance estimation supervision is trained using ground truth Dice scores during training — this is potentially circular and simplistic, and the generalization of these learned “performance predictors” to unseen data is questionable. There is no clear analysis of calibration error or uncertainty quality of these predictions.

    3) Over-reliance on empirical improvements without sufficient theoretical or interpretability insight. It’s unclear why the grid-level performance prediction in particular works well; no visualization or analysis is provided beyond a few qualitative examples.

    4) No baseline comparison to confidence or uncertainty-based weighting schemes, such as MC-Dropout or entropy-based ensembling. This weakens the claim of novelty and practical superiority.

    5) Ablation study limited in scope: While the contribution of each head is shown, the grid head’s benefit is mostly empirical, with no discussion of failure cases or when it might hurt performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1- In Table 1, you directly report the real segmentation performance. I can understand this for the training phase, but what about the testing phase? Are these results from training only?

    2- You mention that the method “utilizes intermediate features from the decoder to estimate the grid-level performance.” Which decoder layers did you select for the patch head, and what was the rationale behind this choice? Have you experimented with different layers?

    3- You state that “S is a similarity function, set as a common Dice score.” Why did you choose the Dice score? Have you tried other similarity functions?

    4- You mention training the baseline model using only the patch head for 10k iterations (i.e., λ₁ = 1 and λ₂ = 0). Could you clarify the reasoning behind this training setup?

    5- The reported results of SNAC [15] and Neuropoly [14] differ from those published in CoactSeg [23]. However, the reported results for CoactSeg [23] remain consistent. Could you explain this discrepancy?

    6- Is the improvement gained by using the heads for regularization and as weights statistically significant? Have you performed a statistical test (e.g., p-value)?

    7- The improvement from using the heads for regularization and weighting seems limited in terms of F1 and Dice scores. Could you comment on the practical significance of these gains?

    8- I could not find the full MS lesion segmentation results of CoactSeg [23] on the MSSEG-v1 dataset reported in the original paper. Did you compute these results yourself?

    9- You state that “The combined model achieves a higher F1 score of 75.80%, yielding a 1.02% improvement in F1 and setting a new benchmark for new MS lesion segmentation on MSSEG-2.” However, the MSSEG-2 dataset includes both training and test sets, which can be requested from the challenge website. Why did you only perform cross-validation on the training set instead of evaluating on the official test set to establish a new benchmark?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper tackles an important and clinically relevant problem in MS lesion segmentation. The proposed calibrated inter-patch blending idea is intuitive and yields empirical gains. However, the novelty is incremental, and the technical formulation lacks rigorous justification and comparison to existing alternatives like uncertainty-aware fusion or confidence estimation. Furthermore, the evaluation, while promising, does not fully validate the generalizability or robustness of the method. With stronger baselines, deeper analysis of calibration quality, and improved reproducibility, this work would merit a higher score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors do not give reasonable answers or justifications to the comments and questions in the reviews



Review #3

  • Please describe the contribution of the paper

    The paper proposes a new technique for patch blending in segmentation networks that weights each patch by its estimated segmentation performance. The technique outperforms the state of the art in multiple sclerosis lesion segmentation from MRI.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea is very interesting and efficient. It is tested on MS lesion segmentation, but it can be applied to other applications of CNNs in medical imaging. The authors present experiments on standard databases of MRI images of MS patients, both for single-time-point lesion segmentation and for new lesion detection.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The sizes of the patches and grids used in the experiments are chosen arbitrarily; they should instead depend on the actual constraints of the CNN that necessitate patch-level processing.
    2. There is no study on the influence of the patch size or the patch overlap on the results.
    3. As in most papers on MS lesion segmentation, the authors aim at improving the Dice and F1 scores, but do not give an estimate of the accuracy needed in a real clinical setting.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. It seems that the grid-level refinement can be applied to other scenarios, independently of patch overlap. Have the authors explored this? It could be interesting to add a comment in the paper.
    2. What is the accuracy needed in this problem? In the case of new lesion detection, what number of new lesions, and within what time frame, is critical for a differential diagnosis?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a simple idea that leads to good results in the experiments shown by the authors

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their comments. All reviewers recognize our strong performance and simple yet effective design. Below, we first address common concerns.

Common

  1. Clinical Relevance of Metrics (R1, R3) Dice evaluates global overlap, while F1 measures lesion-level detection accuracy [4]. Both metrics are widely used in MS clinical practice, as lesion numbers and areas serve as critical biomarkers for assessing longitudinal lesion activities. In our experiments, our method (67.8 Dice, 74.8 F1) outperforms CoactSeg (63.8, 62.0) and even exceeds average human performance (65.9, 70.0), supporting its potential for clinical practice.
  2. Motivation (R2, R3) New MS lesions are small and sparse, leading to high inter-rater disagreement (Table 2, bottom) and aleatoric uncertainty [r1], which result in unreliable predictions. For example, CoactSeg shows 44.4 FDR. These noisy outputs are then averaged in patch-based 3D recomposition. Thus, our key idea is to blend patch outputs adaptively using estimated performance. This significantly reduces FDR (31.2) and improves segmentation across three base works. Table 1 validates the idea using ground-truth scores.
  3. Novelty (R2, R3) Using predicted performance (Dice) for patch and grid calibration is new in 3D segmentation, especially for tiny lesions. We propose a two-level uncertainty estimation, while prior work [26, r2] only considers a single level. Moreover, the grid head is customized for small-lesion detection (see the bottom of page 4). Using these scores for blending is also non-trivial: Tab. 5 shows that our SEG-weight calibration outperforms the alternatives. In addition, we have indeed included an uncertainty comparison (see Tab. 3): the ablation model is trained to predict the performance, similar to aleatoric uncertainty estimation works [r1], and our method achieves better performance.
  4. Statistical Results (R2, R3) We show the variances below (Model: Dice; F1). CoactSeg: 63.8 ± 26.4; 62.0 ± 27.4. Ours: 67.8 ± 18.3; 74.8 ± 23.1. The results show that our method not only improves mean performance but also reduces variance, indicating its robustness and stability.
  5. L_patch (R2, R3) As shown in Fig. 3 and Eq. 4,5, we use MSE between predicted and ground truth Dice for training, similar to [r3 (p.17)].
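As a concrete reading of point 5 above, here is a minimal PyTorch sketch of the patch-level supervision; the function names and signatures are illustrative assumptions rather than the released code, and the grid-level loss would apply the same MSE per grid cell.

```python
import torch
import torch.nn.functional as F

def soft_dice(pred, target, eps=1e-6):
    """Dice between a predicted patch probability map and its GT lesion mask."""
    inter = (pred * target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def patch_head_loss(pred_patch, gt_patch, predicted_dice):
    """L_patch as described in the rebuttal: MSE between the head's predicted
    Dice (d_hat_i) and the actual Dice (d_i) of the current patch prediction
    against the ground-truth mask."""
    d_i = soft_dice(pred_patch, gt_patch).detach()  # regression target, no gradient
    return F.mse_loss(predicted_dice, d_i)
```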

R1:

  1. Patch/Grid Size Thank you for the insightful comment. We agree that optimal patch/grid size depends on input resolution and clinical scenarios. We will explore this following nnUNet in future work.

R2:

  1. Organization To clarify, the preliminary results in Sec. 2.1 motivate our work by showing the upper-bound performance when real Dice scores are used for patch- and grid-level blending. Based on this, we introduce prediction heads to estimate these scores dynamically and design a training strategy accordingly.
  2. Preprocessing B-spline (image) and nearest-neighbor (label) interpolations are used for resampling. The two-time-point scans are pre-registered by MSSEG-2. Augmentations include random flips and rotations. We will add the implementation details and release the code.

R3:

  1. Unseen Domain We agree that domain generalization is important. However, our focus is on reducing FPR in instance segmentation via confidence-based learning and blending. Our current design sufficiently demonstrates our idea.
  2. Effect of Grid Heads As noted at the bottom of Page 2, grid-level supervision is essential, as small false positives can remain even when patch-level Dice is high (Fig. 2, right). Grid-level calibration helps isolate and suppress these fine-grained errors.
  3. Inconsistent Results We retrain SNAC and Neuropoly using their public code, as pre-trained weights are unavailable. CoactSeg, however, provides an official checkpoint.
  4. MSSEG-2 Test Set Only the MSSEG-2 training set is currently accessible.

[r1] What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
[r2] Uncertainty-calibrated test-time model adaptation without forgetting
[r3] Segment Anything




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    While two of the reviewers have proposed a weak reject, and only one is actually proposing to accept without rebuttal, I think the concerns raised by reviewers #2 and #3 should be addressed (specifically, those that do not require further experiments).

    There are concerns about the paper’s structure, clarity, and readability that could be easily addressed through rebuttal. Furthermore, proper citations and references for some of the claims could also be provided to address the reviewers’ concerns.

    In that sense, I would like to invite the paper for rebuttal to properly address the comments and feedback from the reviewers.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two reviewers recommended acceptance and one recommended rejection. There are both strengths and weaknesses. But overall, I feel the strengths outweigh the weaknesses, as the method has presented some new and interesting ideas in the context of new MS segmentation, and the results support that the proposed method improves the segmentation accuracy.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviews remain divided after the rebuttal phase. Notably, the paper still requires substantial revision, particularly in conducting a comprehensive review of relevant literature and providing critical implementation details regarding data augmentation and preprocessing, both of which may fall outside the permissible scope of revision under the MICCAI review policy. Additionally, the claimed versatility of the proposed plug-in module, which is presented as being compatible with any network beyond UNet, appears overstated without sufficient evidence. The extended preliminary discussion in Section 2.1 also feels excessive and disrupts the overall structure of the paper. Based on these points, I recommend rejection.


