Abstract

Stroke is a leading cause of death and disability worldwide, necessitating accurate lesion segmentation for effective diagnosis and treatment. Multimodal images provide complementary insights into stroke detection and progression. However, existing segmentation methods often struggle to fully leverage the distinct and dynamic sensitivities of these modalities. Current approaches, including encoder-decoder networks and SAM-based models, are either limited to single-modality data or rely on suboptimal fusion techniques, hindering their ability to adapt to the distinct nature of stroke lesions. To address these challenges, we propose SAM-driven Multimodal Fusion Network (SMF-Net) for enhanced stroke lesion segmentation. SMF-Net incorporates a multimodal Siamese image encoder based on the Swin Transformer to extract modality-specific features, alongside two novel fusion strategies: (1) Complementary dynamic fusion module, which uses pairwise co-attention and dynamic learnable weights to model interdependencies and adaptively combine multimodal features; and (2) Context-aware intermediate layer fusion module, a lightweight, multi-layer fusion mechanism that captures multiscale features while preserving modality-specific information. Extensive experiments on an open benchmark dataset demonstrate that SMF-Net outperforms previous stroke lesion segmentation methods through effective multimodal integration.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2479_paper.pdf

SharedIt Link: https://rdcu.be/eHwRS

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04947-6_57

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AtlMek_SMFNet_MICCAI2025,
        author = { Atlaw, Meklit AND Chen, Geng AND Jiang, Haotian AND Wen, Xuyun AND Cui, Hengfei AND Xia, Yong},
        title = { { SMF-Net: Unlocking Multimodal Insights for Enhanced Stroke Lesion Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        page = {598 -- 608}
}

Reviews

Review #1

Please describe the contribution of the paper

This work proposes SMF-Net, a SAM extension for segmenting ischemic stroke lesions in multiparametric MRI (DWI, ADC, and FLAIR). The strategy consists of implementing CIF (Intermediate Layer Contextual Fusion), DLF (Dynamic Learnable Fusion), and CDF (Complementary Dynamic Fusion) modules on a Swin Transformer architecture. These modules aid multimodal information fusion at different processing levels. A SAM decoder then uses the fused information to produce the lesion segmentations. The proposed approach was trained and validated on the train split of a public database (ISLES22).
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written. The authors clearly state why it is important to work on multimodal techniques for ischemic stroke lesion segmentation and accurately list the limitations of SAM-based models for multimodal analysis. The methodological components are clearly explained and validated, and each yields a substantial improvement in the segmentation metrics when added.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- More details are needed about how the information captured by the Siamese encoder is integrated: how and where are the skip connections integrated?
- Redundancy in the evaluation metrics: F1 and Dice are, to my knowledge, the same metric, yet the reported results differ. The authors also include IoU, which is closely related to Dice.
- The authors did not validate on the hidden test set of ISLES22, which would have allowed comparison with other state-of-the-art approaches that are not SAM-based.
- The work lacks a statistical analysis to support conclusions about the differences observed between the implemented strategies.
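The metric redundancy raised in the list above can be checked directly. The following is a minimal sketch, assuming binary masks flattened into 0/1 lists with at least one positive in both prediction and ground truth; the function names and the toy masks are illustrative and do not come from the paper. It shows that Dice and F1 coincide when computed on the same masks, and that IoU is a monotonic function of Dice.

```python
def confusion_counts(pred, target):
    """Count true positives, false positives and false negatives for binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    return tp, fp, fn


def dice(pred, target):
    # Dice = 2*TP / (2*TP + FP + FN)
    tp, fp, fn = confusion_counts(pred, target)
    return 2 * tp / (2 * tp + fp + fn)


def f1(pred, target):
    # F1 is the harmonic mean of precision and recall.
    tp, fp, fn = confusion_counts(pred, target)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def iou(pred, target):
    # IoU = TP / (TP + FP + FN)
    tp, fp, fn = confusion_counts(pred, target)
    return tp / (tp + fp + fn)


# Toy masks for illustration only (not data from the paper).
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 1, 0, 1, 0]

d = dice(pred, target)
print(d, f1(pred, target))            # identical on the same masks
print(iou(pred, target), d / (2 - d))  # IoU = Dice / (2 - Dice)
```

The identity holds per mask pair. Reported Dice and F1 values can still diverge if one is averaged per case and the other is computed on pooled voxels, so the aggregation scheme matters when comparing the two.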
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- In Section 2.2, the authors mention “To effectively model modality relationships while preserving modality-specific characteristics, we use a multimodal Siamese image encoder with shared weights, ensuring consistent feature extraction…” What do they mean by consistent feature extraction?
- In Fig. 2, reduce the redundancy by replacing IoU and F1-score with other metrics.
- Explain in depth how the information captured by the Siamese encoder is integrated into the SAM decoder.
- In Section 3.2, what do they mean by “we selected axial slices with a margin of 3”?
- Are the weights used for model inference those from the last epoch?
- In Fig. 3, the segmentations seem very similar. Adding the DSC might help the visual inspection. This similarity also reinforces the need for a statistical analysis to determine differences between approaches.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The work introduces a novel strategy for ischemic stroke segmentation. The results show considerable performance gains from using the proposed approach w.r.t. other alternatives in the literature. Nevertheless, this validation was restricted to the train split of the ISLES22 dataset, hindering its comparability with other strategies from the literature that are not SAM-based. Also, other important drawbacks are the redundancy of the results (DSC, F1, IoU) and the lack of statistical analysis.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

To summarize, I had some minor comments regarding: (1) the redundancy in evaluation metrics (specifically, the concurrent presentation of the F1-score and Dice coefficient), and (2) the lack of detail on how and where skip connections from the Siamese encoder are integrated. The authors addressed these minor concerns in the rebuttal, which alleviates my initial reservations. Furthermore, I also suggested that the authors conduct additional statistical analyses and consider submitting their model to the ISLES22 hidden test set to enable more meaningful comparisons with existing state of the art approaches. However, recognizing that these additional experiments may not be feasible at this stage and that these issues do not constitute major concerns, I recommend acceptance of the paper.

Review #2

Please describe the contribution of the paper
This paper proposes a SAM-driven Multimodal Fusion Network (SMF-Net) for stroke lesion segmentation. It incorporates a multimodal Siamese image encoder based on the Swin Transformer to extract modality-specific features, together with two fusion strategies: a complementary dynamic fusion module and a context-aware intermediate-layer fusion module.
- The integration of Segment Anything Model (SAM) with a multimodal segmentation pipeline.
- A dual-branch encoder based on Swin Transformer
- Complementary Dynamic Fusion Module (CDFM): Pairwise co-attention and learnable dynamic weights to model interdependencies between modalities and combine features
- Context-Aware Intermediate-Layer Fusion Module (CIFM): A multi-scale fusion technique applied at intermediate network layers to preserve modality-specific information.
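The "learnable dynamic weights" idea listed above for the CDF module can be illustrated with a toy example. This is a minimal sketch under strong assumptions: this page does not specify how the weights are parameterised, so the softmax normalisation, the function names, and the feature values below are all hypothetical, and the pairwise co-attention step is omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(features, logits):
    """Combine per-modality feature vectors with softmax-normalised weights.

    `features` is a list of equal-length vectors, one per modality.
    `logits` holds one scalar per modality; in a trained network these would
    be learned parameters (assumption, not confirmed by the source).
    """
    weights = softmax(logits)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Hypothetical feature vectors for DWI, ADC and FLAIR.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Equal logits reduce to a plain average of the modalities.
print(weighted_fusion(features, [0.0, 0.0, 0.0]))

# A large logit for the first modality makes the fusion follow it closely.
print(weighted_fusion(features, [10.0, 0.0, 0.0]))
```

The point of the sketch is only that a weighted combination lets the network shift emphasis between modalities, whereas plain averaging or concatenation fixes their contribution in advance.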
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- SAM multimodal fusion via Siamese Swin Transformers is a novel formulation, with CDF (pairwise co-attention + adaptive weighting) and CIF (multilayer attention-based fusion) as innovative modules.
- This work outperforms single-modality SOTA methods on the ISLES 2022 dataset.
- Direct application to stroke segmentation with real-world imaging modalities enhances its clinical relevance.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The comparison in Fig. 2 presents SMF-Net evaluated with three input modalities (DWI, FLAIR, and ADC), while the baseline models (MedSAM, Swin-LightMedSAM, and Swin-Unet) were originally designed for single-modality input. The authors have not clarified whether the baselines were adapted to accept multimodal input. If not, how do the authors justify this comparison?
- The authors state that “a combination of MSE loss, dice loss, and focal loss” is used in the paper. The details of and rationale for this composite loss are missing.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

In Fig. 1, the topmost CIF block has some overlaid text and looks confusing.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The proposed SAM-based multimodal fusion strategy has strong clinical relevance for stroke lesion segmentation.
- Novel formulation of different innovative modules.
- Clarity is required on multimodal vs. unimodal input for segmentation.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

Based on the satisfactory responses to our queries, the revised version of the paper incorporating the changes could be accepted.

Review #3

Please describe the contribution of the paper

This paper proposes a SAM-based framework aimed at leveraging multiple MRI modalities as input to improve stroke lesion segmentation performance. Specifically, it introduces two interesting fusion strategies: one that integrates features at the end of the encoder using pairwise co-attention, and another that incorporates modality features at the skip connections.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This work addresses a clear limitation of the original SAM and MedSAM models, which handle only single-modality input, extending their applicability to multiple MRI modalities.
2. The proposed fusion strategies (CDF and CIF) thoughtfully combine features from different modalities. The features after the full encoder and the features at the skip connections are both considered.
3. Ablation studies clearly demonstrate incremental improvements in stroke lesion segmentation by incorporating additional MRI modalities. These studies also effectively highlight the individual contributions of each part of the proposed fusion strategy.
4. Architectural diagrams provided are clear and informative.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

There is a lack of clarity regarding the previous methods used for comparison—specifically, whether they were tested using only a single modality and which modality was selected.

If the reviewer understands correctly, and the authors did not modify these methods to support multi-modality input during comparison, then all the compared methods—except the proposed one—operate with only single-modality input. While such comparisons may illustrate that integrating multiple modalities can enhance segmentation performance, this is already a well-established finding in the medical imaging community.

Therefore, the effectiveness and value of the proposed complex fusion strategies remain uncertain.

It is not difficult to identify multi-modality baselines for comparison. The use of multiple modalities is standard practice in traditional CNN-based models such as nnU-Net, and also supported in transformer-based architectures like Swin UNETR [1]. Since this work includes a training split of the dataset, incorporating these models into the experimental comparison would be reasonable.

If the authors aim to focus exclusively on SAM-based models, a straightforward and meaningful baseline would be to run MedSAM on each modality independently and combine the results via majority voting. Any of these comparisons would provide useful insight into the effectiveness of the proposed fusion strategies. However, the paper does not include any such comparisons, making it difficult to evaluate the actual contribution of the proposed methods.
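The majority-voting baseline suggested above can be sketched as follows. This is a minimal illustration, assuming each single-modality model has already produced a binary mask flattened to a 0/1 list of the same length; the function name and the toy masks are hypothetical, and no MedSAM call is shown.

```python
def majority_vote(masks):
    """Voxel-wise majority vote over equally sized binary masks.

    A voxel is labelled lesion when strictly more than half of the masks
    mark it as lesion; with three modalities that means at least two votes.
    """
    n = len(masks)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]


# Toy per-modality predictions (illustrative values only).
dwi = [1, 1, 0, 0, 1]
adc = [1, 0, 0, 1, 1]
flair = [0, 1, 0, 0, 1]

fused = majority_vote([dwi, adc, flair])
print(fused)
```

Such a late-fusion baseline needs no retraining, which is why it would be a cheap reference point for judging learned fusion modules.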

Additionally, since the proposed model is trained on the ISLES2022 dataset, it would be helpful to clarify whether MedSAM was also fine-tuned on the same dataset.

Reference: [1] Hatamizadeh, Ali, et al. “Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images.” International MICCAI Brainlesion Workshop. Cham: Springer International Publishing, 2021.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, this paper presents a reasonable target—extending SAM to handle multi-modality input. The proposed fusion strategies appear thoughtfully designed. However, as discussed in the weaknesses, the experimental results do not clearly demonstrate the actual effectiveness of the proposed fusion mechanisms. Without appropriate multi-modality baselines or detailed comparisons, it is difficult to assess the contribution and advantage of the method over existing approaches.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have not fully resolved my concern. While they note that structured fusion across all three modalities has not been investigated in this exact setting, prior work has explored alternative fusion functions for related tasks—including brain-disease segmentation. A comparison with those approaches would strengthen the paper. Even so, the proposed fusion strategies are thoughtfully designed, and I remain inclined toward acceptance.

Author Feedback

We sincerely thank all reviewers for their thoughtful and constructive feedback.

Baseline Input Choices and Multimodal Comparison (R1, R3) We initially tested a multi-channel input approach by stacking modalities, but this configuration led to reduced baseline performance, possibly due to early fusion without modality-specific processing or structured integration. We then evaluated different unimodal inputs independently and found that DWI consistently produced the best results. This aligns with clinical observations, as DWI is considered the most sensitive modality for acute stroke detection, which is why DWI was selected as the baseline modality. Given our focus on structured fusion strategies, we did not implement majority voting, and to our knowledge, no prior work has explored structured fusion using all three modalities in this context. Moreover, from our ablation experiments, we observe that performance improves steadily as structured fusion components are introduced, further highlighting the importance of both modality selection and architectural design in achieving improved segmentation.

Loss Function Justification (R1) We use Dice loss for overlap accuracy, Focal loss to address class imbalance by emphasizing hard samples, and MSE loss to supervise IoU prediction. We will further clarify this rationale in the final paper.
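The three-term loss described above can be sketched as follows. This is a minimal illustration, assuming per-voxel foreground probabilities and a 0/1 ground truth given as flat lists; the equal term weights, the focal exponent `gamma=2.0`, the 0.5 threshold, and all function names are assumptions, since the rebuttal does not give these details. The MSE term compares a scalar IoU predicted by the model with the IoU actually achieved by the thresholded mask.

```python
import math


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P.T| / (|P| + |T|), computed on probabilities."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """Binary focal loss without class-balancing alpha (assumed form)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        pt = p if t == 1 else 1.0 - p     # probability of the true class
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)


def true_iou(probs, target, threshold=0.5):
    """IoU between the thresholded prediction and the ground truth."""
    pred = [1 if p >= threshold else 0 for p in probs]
    inter = sum(1 for a, b in zip(pred, target) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(pred, target) if a == 1 or b == 1)
    return inter / union if union else 1.0


def composite_loss(probs, target, predicted_iou, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of Dice, focal and IoU-regression (MSE) terms."""
    w_dice, w_focal, w_mse = weights
    mse = (predicted_iou - true_iou(probs, target)) ** 2
    return (w_dice * dice_loss(probs, target)
            + w_focal * focal_loss(probs, target)
            + w_mse * mse)


# Toy inputs for illustration only.
probs = [0.9, 0.2, 0.7, 0.1]
target = [1, 0, 1, 0]
print(composite_loss(probs, target, predicted_iou=0.8))
```

The focal term down-weights voxels that are already classified confidently, which is how it emphasises hard samples in a heavily imbalanced lesion-versus-background setting.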

Figure Revisions (R1, R2) We will revise Figure 1 to fix the overlapping text in the topmost CIF block. Figure 3 will be updated to include Dice scores for improved visual comparison.

Non-SAM Baseline Comparisons (R2) We focused on SAM-based methods because they have shown strong performance in medical image segmentation [11] and offer a consistent framework for evaluating our fusion strategy. This allows us to assess the effect of our method more reliably, without results being influenced by differences in model architecture. We agree that including non-SAM baselines in future work would make the comparison more complete.

Redundant Metrics (R2) We thank the reviewer and confirm that the Dice scores are accurate after rechecking. However, we identified minor errors in some of the reported F1-scores, which will be corrected in the final version. To enhance clarity and provide a more informative evaluation, we will remove the F1-score and include NSD in the final paper.

Skip Connection Integration from Siamese Encoder (R2) As shown in Figure 1, the skip connection mechanism in the decoder involves concatenating the deeper CIF outputs (scales 2–4), which are processed through a convolution block and passed to the first upsampling stage. The shallowest CIF output is integrated at a later upsampling stage to refine spatial detail. At the final scale, features are fused using the CDF module, which also functions as a deep skip connection and provides the image embedding to the decoder. The decoder structure remains identical to that used in MedSAM [11]. Additionally, “consistent feature extraction” refers to the use of a shared-weight multimodal Siamese image encoder, where identical convolution and transformer blocks are applied across all modalities. This design ensures consistent features and effective cross-modal integration.

Statistical Analysis (R2) We acknowledge the importance of statistical analysis and will incorporate appropriate significance tests in the final version where feasible.
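One option for such a significance test is a paired comparison of per-case Dice scores between two methods. The sketch below uses an exact sign-flip permutation test; the choice of test, the function name, and the toy scores are assumptions for illustration, as the rebuttal does not say which test will be used. Exact enumeration visits 2^n sign patterns, so it is only practical for a small number of cases.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so the p-value is the fraction of sign patterns whose
    absolute summed difference is at least as large as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Made-up per-case scores for two methods (not results from the paper).
toy_a = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88]
toy_b = [0.70, 0.65, 0.80, 0.75, 0.60, 0.78]

p_value = paired_sign_flip_test(toy_a, toy_b)
print(p_value)
```

A paired test is appropriate here because both methods are evaluated on the same cases, so per-case difficulty cancels out in the differences.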

Experiment Details (R2, R3) Regarding the training setup, “a margin of 3” refers to including three axial slices before and after the lesion during training to provide sufficient negative context, which was found to improve model performance. This margin is applied only during training, while full 3D volumes are used for evaluation. For inference, we confirm that the final epoch weights were used, as training had stabilized, ensuring fairness and reproducibility across evaluations. Additionally, MedSAM was trained on the ISLES2022 dataset using the same setup as our proposed method.
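The slice-selection rule described above can be sketched as follows. This is a minimal illustration under one reading of the rebuttal: the lesion extent is taken as the range from the first to the last lesion-containing axial slice, and three slices are added on each side, clipped to the volume. The function name and the per-slice boolean input are hypothetical.

```python
def select_training_slices(has_lesion, margin=3):
    """Return indices of axial slices kept for training.

    `has_lesion[i]` is True when slice i contains lesion voxels. The kept
    range runs from `margin` slices before the first lesion slice to
    `margin` slices after the last one, clipped to the volume bounds.
    Volumes without any lesion yield an empty list (assumed behaviour).
    """
    lesion_idx = [i for i, flag in enumerate(has_lesion) if flag]
    if not lesion_idx:
        return []
    start = max(lesion_idx[0] - margin, 0)
    stop = min(lesion_idx[-1] + margin, len(has_lesion) - 1)
    return list(range(start, stop + 1))


# Toy volume with 14 axial slices and lesion on slices 5 to 7.
has_lesion = [False] * 5 + [True] * 3 + [False] * 6
selected = select_training_slices(has_lesion)
print(selected)
```

Keeping a few lesion-free slices around the lesion gives the model negative examples near the lesion boundary without training on every empty slice in the volume.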

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The authors provided a thorough and convincing response to the reviewers’ concerns, reinforcing the paper’s contributions. Key clarifications include the rationale for using DWI as the baseline modality, the design of the composite loss function, and the integration of skip connections in the decoder. The commitment to revising figures, adding statistical tests, and streamlining metrics further strengthens the presentation.

The paper’s core strength lies in its novel SAM-based multimodal fusion framework (CDF and CIF), which addresses a critical gap in stroke lesion segmentation. The ablation studies effectively validate the proposed modules, and the clinical relevance is well-justified. While comparisons with non-SAM baselines could be explored in future work, the current focus on SAM-based methods ensures a consistent evaluation of fusion strategies.

Overall, the paper presents a meaningful methodological advancement with clear clinical impact. The authors’ thoughtful rebuttal addresses the reviewers’ concerns, and the revisions will enhance the paper’s clarity and rigor. Recommend acceptance.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

Although the reviewers still have a few concerns regarding this work, they believe that it has merit and unanimously recommend its acceptance. Having read the paper and the reviewers’ comments, I do not have any major concern that conflicts with the reviewers’ final scores, and I thus recommend acceptance of this work.


SMF-Net: Unlocking Multimodal Insights for Enhanced Stroke Lesion Segmentation

Author(s):