Abstract
Lesion segmentation in medical images is a key task for the intelligent diagnosis of lung diseases. Although existing multimodal methods have achieved significant progress in medical image segmentation by combining image and text information, these methods still rely on textual input during the inference phase, limiting their applicability in real-world scenarios. To address this limitation, this paper proposes an innovative Memory-Guided UNet model (MG-UNet). MG-UNet introduces a learnable memory bank that automatically extracts and stores textual information during the training phase. In the decoding stage, the proposed memory-guided decoder retrieves knowledge relevant to the current image from the memory bank, thereby eliminating the need for textual input during inference. Extensive experiments were conducted on the QaTa-Cov19 and MosMedData+ datasets to validate the effectiveness of MG-UNet. The experimental results demonstrate that MG-UNet not only outperforms existing unimodal and multimodal methods in terms of segmentation performance but also excels in text-free inference scenarios using only 15% of the training data, surpassing the current best unimodal methods. This characteristic significantly reduces the reliance on annotated data for medical image segmentation, offering greater flexibility and scalability for practical clinical applications. The code will be available soon.
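For concreteness, below is a minimal sketch of how a learnable memory bank might be queried with image features alone at inference time, as the abstract describes. The authors' code is not yet released, so the module name, dimensions, number of slots, and the residual fusion step are all assumptions for illustration, not the paper's implementation:

```python
# Hypothetical sketch of memory-bank retrieval via cross-attention.
# All names and shapes are assumed; MG-UNet's actual design may differ.
import torch
import torch.nn as nn

class MemoryRetrieval(nn.Module):
    def __init__(self, dim: int = 256, memory_slots: int = 64):
        super().__init__()
        # Learnable memory bank: during training it would absorb text
        # features; at inference it is queried with image features alone.
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N, dim) flattened decoder features.
        B = image_feats.size(0)
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)  # (B, M, dim)
        # Query the memory with image features: text-free retrieval.
        retrieved, _ = self.attn(query=image_feats, key=mem, value=mem)
        return image_feats + retrieved  # residual fusion (an assumption)
```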
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1062_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{DinShu_MGUNet_MICCAI2025,
author = { Ding, Shuaipeng and Li, Mingyong and Wang, Chao},
title = { { MG-UNet: A Memory-Guided UNet for Lesion Segmentation in Chest Images } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15960},
month = {September},
pages = {357--366}
}
Reviews
Review #1
- Please describe the contribution of the paper
The main contribution of this paper is the design of MG-UNet, a novel multimodal segmentation framework for chest lesion segmentation that leverages a learnable memory bank to store and retrieve textual information. Unlike existing multimodal approaches that require text input during training and inference, MG-UNet can perform text-free inference by retrieving previously learned knowledge from the memory bank. Additionally, the authors introduce an intermittent memory bank updating (IMBU) strategy to adapt the model from text-guided training to text-free inference progressively. The proposed method performs strongly on two public datasets, outperforming state-of-the-art unimodal and multimodal baselines, even when trained on limited data.
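To make the described progression from text-guided training to text-free inference concrete, here is a hypothetical training schedule in the spirit of the IMBU strategy. The reviews do not specify the actual update rule, so the alternation period and the text-vs-memory branching below are guesses, hedged as such in the comments:

```python
# Hypothetical "intermittent" update schedule; the paper's actual IMBU
# rule is not described in the reviews, so this is an assumed sketch.
def train_epoch(model, loader, optimizer, epoch: int, update_period: int = 2):
    use_text = (epoch % update_period == 0)  # intermittently enable text
    for images, texts, masks in loader:
        optimizer.zero_grad()
        if use_text:
            # Text-guided step: text features flow in and the memory
            # bank parameters receive gradients (get "written").
            preds = model(images, texts=texts)  # assumed signature
        else:
            # Text-free step: the model must rely on memory retrieval,
            # rehearsing the inference-time condition during training.
            preds = model(images, texts=None)
        loss = model.loss(preds, masks)  # assumed loss interface
        loss.backward()
        optimizer.step()
```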
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper proposes a novel segmentation method for lung disease imaging that uses text information during training yet retains text-free inference capability.
- The paper proposed a novel intermittent memory bank updating (IMBU) strategy.
- The model achieves state-of-the-art or highly competitive results on two public datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Insufficient explanation of MG-UNet vs. MG-UNet*: the paper lacks clear procedural details on how MG-UNet* (the text-free inference variant) is implemented in practice.
- The memory bank is essentially a learned embedding pool, not a fundamentally new architecture.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents MG-UNet, a multimodal segmentation framework that introduces a learnable memory bank to store textual features and enables text-free inference, addressing a practical limitation of existing vision-language models in clinical settings. The proposed intermittent memory bank updating strategy is simple yet effective, and the model achieves strong performance with lower computational cost on two public datasets, including under low-data scenarios. While the overall idea is moderately novel and practically useful, the paper lacks clarity in several key areas, including how the transition between MG-UNet and MG-UNet* is handled and how the memory bank is reused during inference. Despite these weaknesses, the paper fills an important gap in the literature and offers a compelling solution for scalable, text-free deployment of multimodal models. I recommend Weak Accept, contingent on clarification during the rebuttal.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper proposes MG-UNet, a novel memory-guided UNet architecture for lesion segmentation in medical chest images. The key contribution is the introduction of a learnable memory bank that stores textual information during training and retrieves relevant information during inference. This approach enables text-free inference while retaining the benefits of multimodal learning. Specifically:
- The memory bank eliminates the need for textual input during inference by automatically retrieving relevant knowledge based on the current image.
- MG-UNet demonstrates strong segmentation performance, surpassing unimodal and multimodal methods while using only 15% of the training data.
- The text-free inference capability and reduced reliance on annotated data make MG-UNet practical and flexible for real-world clinical applications.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed memory-guided decoder and text-free inference are innovative contributions that address a key limitation of existing multimodal methods that require textual input during inference. The memory bank design introduces a novel way of leveraging textual information during training while removing its dependency during deployment.
- The paper provides extensive experiments on two medical imaging datasets (QaTa-Cov19 and MosMedData+), demonstrating the superiority over both unimodal and multimodal state-of-the-art methods. The significant performance improvement using only 15% of the training data highlights the method’s efficiency and robustness.
- By eliminating the need for textual input during inference, MG-UNet becomes more practical for clinical scenarios where multimodal data (e.g., paired text and images) may not always be available.
- The ability to achieve strong performance with limited annotated data is a key advantage, addressing a common bottleneck in medical image segmentation tasks.
- The paper clearly identifies and addresses a limitation in multimodal methods (text dependency during inference), providing a focused and meaningful contribution.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Lack of Explanation for Mathematical Notations. The mathematical equations are presented without sufficient explanation of the notations. For instance, the meaning of variables like D_i^sa is not clarified, making it difficult for readers to fully understand the methodology.
- Unclear figure explanations: the explanations accompanying the figures are insufficient. For example, Figure 2 does not include the original image, making it difficult for readers to visually compare the segmentation results, and Figure 1 lacks detailed annotations or descriptions explaining the components of the model.
- The paper aims to eliminate the need for textual input during inference, but the memory bank is still constructed using textual information during training. This raises the question of why textual input is preferred over images or some other form of representation for building the memory bank, given the goal of text-free inference.
- While the paper emphasizes scalability, it does not provide detailed insights into how the size of the memory bank might impact computational efficiency or performance during inference.
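The reviewer's point about memory bank size can be made concrete with a quick back-of-envelope estimate. All shapes below are assumed for illustration and are not taken from the paper; the point is only that retrieval cost scales linearly with the number of memory slots:

```python
# Assumed shapes: a 14x14 decoder feature map, 64 memory slots, width 256.
# Cross-attention between N image tokens and M memory slots of width d
# costs O(N * M * d) multiply-accumulates for scores, and again for values.
N, M, d = 14 * 14, 64, 256
attn_macs = N * M * d   # Q K^T score computation
out_macs = N * M * d    # weighted sum over the value vectors
print(f"~{(attn_macs + out_macs) / 1e6:.1f} M MACs per head")  # ~6.4 M
```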
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces a novel and practical solution to a meaningful problem in medical image segmentation, with strong experimental validation and clear advantages over state-of-the-art methods. Though some weaknesses in explanation and design choices slightly reduce its impact, the methodology and results are sufficient to warrant publication.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper presents a novel deep learning network that combines visual and textual information for chest lesion segmentation during training. Through a staged training strategy, the model gradually transitions from relying on textual inputs to operating without them, so that no text is required at inference. This strategy not only improves segmentation accuracy but also achieves greater adaptability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors propose a memory bank updating mechanism that stores textual information about lesions during the training phase. In the inference phase, the method retrieves the relevant knowledge without requiring textual input. This strategy not only improves segmentation accuracy but also achieves greater adaptability.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
I don’t think this work has any significant weaknesses. My only concern is that the textual annotations appear to include lesion position information (e.g., “lower left lung” and “lower right lung,” as shown in Fig. 1(b)). If lesions originally located in the lower part of the image are shifted to the upper part in your cropped image, how do you ensure that the positional information in the textual annotations remains consistent with the visual content of the cropped images? Wouldn’t this introduce a mismatch between image content and annotation?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Very interesting work. You’ve used Q_v, K_v, V_v for visual feature representations and K_m, V_m for textual feature representations. My suggestion is to keep the same symbols in the following sections and equations.
- Can you elaborate on the decoder feature D_i in Sec. 2.3?
- In your equations (5)-(8), it would be better to label the symbols, e.g., F_i^sa, F_i^ca, and F_i, as in your Fig. 1. (A hedged reconstruction of these forms is sketched after this list.)
- As described in Sec. 3.1 (Datasets), I didn’t see annotations in Fig. 1(a), but there is one textual annotation in Fig. 1(b).
- As described in Sec. 3.1 (Implementation), all images are cropped to a size of 224x224. What is the original resolution of the images? The textual annotations appear to include lesion position information (e.g., “lower left lung” and “lower right lung,” as shown in Fig. 1(b)). If lesions originally located in the lower part of the image are shifted to the upper part in the cropped image, how do you ensure that the positional information in the textual annotations remains consistent with the visual content of the cropped images? Wouldn’t this introduce a mismatch between image content and annotation?
- Is the arrow direction for Params in Table 1 wrong?
- You should indicate which rows come from QaTa-COV19 and which come from MosMedData+ in Fig. 2.
- Please explicitly indicate which methods are unimodal and which are multimodal in Table 1 and Fig. 2.
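For readers puzzling over the same notation, here is a hedged reconstruction of what equations (5)-(8) plausibly compute, based only on the symbol names the reviewer quotes (Q_v, K_v, V_v for visual features; K_m, V_m for memory/text features; F_i^sa, F_i^ca, F_i for the decoder outputs). The actual definitions are in the paper and may differ:

```latex
% Assumed forms only; reconstructed from the reviewer's symbol names.
F_i^{sa} = \mathrm{softmax}\!\left(\frac{Q_v K_v^{\top}}{\sqrt{d}}\right) V_v
  \quad \text{(self-attention over visual features)}
\qquad
F_i^{ca} = \mathrm{softmax}\!\left(\frac{Q_v K_m^{\top}}{\sqrt{d}}\right) V_m
  \quad \text{(cross-attention from image queries to memory)}
\qquad
F_i = \mathrm{Fuse}\!\left(F_i^{sa},\, F_i^{ca},\, D_i\right)
  \quad \text{(fusion with the decoder feature } D_i \text{; assumed)}
```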
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This work is interesting and presents a novel approach to effectively integrating textual and visual information. The proposed memory bank updating mechanism is particularly noteworthy—it enables the model to leverage relevant information from historical training data during inference, even in the absence of textual input. This design not only enhances segmentation accuracy but also improves the model’s adaptability across different scenarios.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A