Abstract

Recent advancements in medical vision-language models have increasingly accentuated the substantial potential of incorporating textual information for better medical image segmentation. However, existing language-guided segmentation models were developed under the assumption that the attributes/clauses of textual prompts are uniformly complete across all images, neglecting the unavoidable incompleteness of texts/reports in clinical applications and thus making them less feasible. To address this, we, for the first time, identify such incomplete textual prompts in medical image referring segmentation (MIRS) and propose an attribute robust segmentor (ARSeg) by constructing attribute-specific features and balancing the attribute learning procedure. Specifically, based on a U-shaped CNN network and a BERT-based text encoder, an attribute-specific cross-modal interaction module is introduced to establish attribute-specific features, thereby eliminating the dependency of decoding features on complete attributes. To prevent the model from being dominated by attributes with lower missing rates during training, an attribute consistency loss and an attribute imbalance loss are designed for balanced feature learning. Experimental results on two publicly available datasets demonstrate the superiority of ARSeg against SOTA approaches, especially under incomplete and imbalanced textual prompts. Code is available at https://github.com/w7jie/ARSeg.
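Based only on the description above, the following is a minimal, hypothetical sketch of what an attribute-specific cross-modal interaction step could look like: per-attribute text embeddings (e.g., from BERT) cross-attend to image features so that each attribute yields its own feature, and missing attributes are simply masked out. The class name, tensor shapes, and masking scheme are illustrative assumptions, not the paper's exact module.

```python
# Hypothetical sketch of attribute-specific cross-modal interaction (ACI).
# Names and shapes are assumptions for illustration, not the paper's exact module.
import torch
import torch.nn as nn


class AttributeSpecificInteraction(nn.Module):
    """Cross-attends each attribute's text embedding to image features,
    yielding one feature per attribute so the decoder never relies
    on the full set of attributes being present."""

    def __init__(self, img_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, img_dim)            # align text to image dim
        self.cross_attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)

    def forward(self, img_feat, attr_txt, attr_mask):
        # img_feat:  (B, N, C)  flattened image tokens from the CNN encoder
        # attr_txt:  (B, K, D)  one pooled text embedding per attribute
        # attr_mask: (B, K)     1 if the attribute is present in the report, else 0
        q = self.proj_txt(attr_txt)                            # (B, K, C)
        attn_out, _ = self.cross_attn(query=q, key=img_feat, value=img_feat)
        # Zero out features of missing attributes so decoding degrades gracefully.
        return attn_out * attr_mask.unsqueeze(-1)              # (B, K, C)


if __name__ == "__main__":
    aci = AttributeSpecificInteraction(img_dim=256, txt_dim=768)
    img = torch.randn(2, 14 * 14, 256)                         # CNN bottleneck tokens
    txt = torch.randn(2, 4, 768)                               # 4 attributes from BERT
    mask = torch.tensor([[1, 1, 0, 1], [1, 0, 0, 1]], dtype=torch.float)
    print(aci(img, txt, mask).shape)                           # torch.Size([2, 4, 256])
```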

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3688_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/w7jie/ARSeg

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanQij_Towards_MICCAI2025,
        author = { Wang, Qijie and Lin, Xian and Yan, Zengqiang},
        title = { { Towards Robust Medical Image Referring Segmentation with Incomplete Textual Prompts } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {638 -- 647}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper focuses on medical image referring segmentation that integrates incomplete textual information. Based on incomplete attribute-specific prompts, the authors propose a consistency loss and a balancing loss, enhancing robustness in clinical scenarios.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method and figures are described clearly.
    2. The problem of incomplete attributes is interesting.
    3. The improvements in quantitative and qualitative results are significant.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper only uses a CNN encoder and decoder; how would the method perform with ViT-based architectures?
    2. What are the detailed calculations in Eqs. (2) and (3)? It is unclear whether the cross-attention is sufficiently novel.
    3. The definitions of M and c in Eqs. (7) and (8) are missing.
    4. It would be better if the authors verified the method on ViTs.
    5. In Table 5, why does ACI-only perform better on MosMedData+ but worse on QaTa-COV19? Please explain the effectiveness of ACI and ACBL on both datasets.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see weakness

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript proposes ARSeg, a vision-language model designed for balanced attribute feature learning under imbalanced attribute distributions in textual prompts. An attribute-specific cross-modal interaction module is introduced to generate attribute-specific features for decoding, enabling the decoder to accurately extract features guided by individual attributes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Existing language-guided segmentation models assume that textual prompts contain complete and uniformly available attributes, which is often not the case in clinical practice due to incomplete texts or reports. To address this limitation, the manuscript introduces ARSeg—an attribute-robust segmentor designed for medical image referring segmentation. ARSeg constructs attribute-specific features and balances the learning process to handle missing attributes effectively. To prevent the model from being biased toward attributes with lower missing rates, the method incorporates an attribute consistency loss and an attribute balancing loss to promote balanced feature learning.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Writing Weakness: The architecture overview in Figure 2 is difficult to interpret, even after reviewing the methodology section. The distinction between the offline and online stages is unclear. Additionally, the handling of ground truth data needs further clarification.
    Contribution Weakness: Many studies have used multimodal data for medical image segmentation. The manuscript should clearly outline how the proposed method differs from existing approaches.
    Comments on Experiments: The manuscript should provide more details on how to select λ₁ and λ₂ in formula (9). Furthermore, it is essential to compare the segmentation results of the proposed method with SAM-Med2D.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The four modules of the vision-language model (image encoder, text encoder, fusion, and pretraining objectives) should be presented more clearly. Additionally, the role of text in improving segmentation performance needs to be explained more thoroughly.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The new contributions are acceptable, and the responses to the reviewers are acceptable.



Review #3

  • Please describe the contribution of the paper

    The main contribution of this paper is the proposal of ARSeg, a robust framework addressing incomplete textual prompts in medical image referring segmentation. By introducing an attribute-specific cross-modal interaction module to decouple attribute dependencies and a dual-loss framework combining an attribute consistency loss and an attribute balancing loss, ARSeg achieves stable segmentation performance even with incomplete or imbalanced textual inputs. Experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate its superiority over state-of-the-art methods, with a 3-5% improvement in Dice scores under text-missing scenarios, offering a more reliable solution for clinical applications.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. First to systematically address incomplete textual prompts in medical image referring segmentation, a critical but overlooked challenge in clinical practice where reports often lack structured attributes.
    2. Innovative method design: an attribute-specific cross-modal interaction module is introduced to establish attribute-specific features, thereby eliminating the dependency of decoding features on complete attributes.
    3. Comprehensive and rigorous experimental validation: the performance advantage is verified on multi-scenario datasets, and the method outperforms SOTA approaches under both complete and incomplete text conditions, especially when attributes are missing.
    4. Detailed ablation experiments: the necessity of the ACI and ACBL modules is verified through ablation experiments.
    5. Clear technical implementation: U-Net and BERT are integrated through a modular architecture, and the meta-vector update and loss functions are clarified with mathematical formulas to enhance interpretability.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The manuscript would benefit from additional language polishing to improve clarity and precision. Several instances require attention, such as the ambiguous phrase “which is under-explored” on page 2 (the text should clearly state what aspect is under-explored) and the notation “Fik/Ftk” on page 4 (which should use “or” instead of the slash for proper grammatical form). Additionally, spelling errors such as “segmentationl” in the abstract should be corrected to “segmentation.”
    2. The figure caption for Figure 2 is too short; more description should be added.
    3. Figure 2 does not clearly show the specific connection paths of the attribute consistency loss and the attribute balancing loss. It is suggested to: 1) label the calculation nodes of the attribute consistency loss and the attribute balancing loss in the decoder section, for example, using arrows to indicate their input sources; and 2) display the input dependencies by connecting the relevant components with dashed lines or arrows of different colors: the attribute consistency loss takes Fik and Ftk as inputs, and the attribute balancing loss takes ci, ct, Fik, and Ftk as inputs.
    4. In the “ablation experiment” or “method” section, it is necessary to clarify the complete architecture of the baseline model and specify whether the baseline has completely removed ACI and ACBL or only partially adjusted them. Does the baseline retain the image encoder (U-Net), text encoder (BERT), and other basic modules consistent with the complete model (ARSeg), so that the only controlled variable in the comparative experiment is the removal of ACI and ACBL?
    5. The formatting of this manuscript could be further improved. For example: 1) the punctuation after equation (5) should be a comma, as in the other equations; 2) the units of Dice and mIoU should be harmonized; 3) the dataset name should be uniformly QaTa-COV19.
    6. The values of the missing rate vector R (e.g., 0.1/0.4/0.7) are artificially set, and it is not indicated how they reflect the distribution of missingness in real clinical data. If the actual missing rate differs significantly from the experimental settings, this may affect the reliability of the conclusions.
    7. Please refine the overall writing by avoiding the use of personalizing words such as “We”.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is innovative, and the experiment is sufficient to support the scientific conclusion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for their valuable comments and thoughtful feedback and appreciate their recognition of the interesting and critical problem (R1, R3) and the effective method (R1, R3, R4). We will thoroughly revise the manuscript for better readability of the figures, equations, and methodology. Major concerns are addressed as follows:

Q: The values of the missing rate vector R are artificially set. (R1)
A: Text missingness in real clinical data is hard to obtain. Thus, we follow the common missing rates adopted in incomplete multi-modal MRI segmentation (e.g., PASSION (ACM MM 2024)). To avoid possible bias, we further traversed R for comprehensive evaluation.

Q: How would ARSeg perform on ViT-based architectures? (R3)
A: As shown in Fig. 2, in ARSeg both the attribute-specific cross-modal interaction (ACI) and the additional attribute-related losses are independent of the encoder/decoder architecture. Thus, we believe ARSeg is highly applicable to addressing the issue of incomplete textual prompts under ViT-based architectures. In Tables 1 and 2, both CNN-based and ViT-based SOTA approaches were included, and ARSeg achieved the best performance.

Q: Why does ACI-only perform better on MosMedData+ and worse on QaTa-COV19? (R3)
A: As shown in Fig. 3, the feature distributions of MosMedData+ and QaTa-COV19 vary dramatically, with the former having much smaller lesion sizes than the latter. Thus, given a textual prompt, building cross-modal interaction on MosMedData+ is harder than on QaTa-COV19, and ACI plays a more critical role for MosMedData+. Comparatively, it is easier to build a close relation with text attributes on QaTa-COV19, making ACBL more prominent in addressing text incompleteness. Though ACI and ACBL differ in importance on MosMedData+ and QaTa-COV19, both improve the baseline performance, proving their effectiveness on different datasets.

Q: How does ARSeg differ from existing multimodal approaches? (R4)
A: Existing approaches can be categorized into multi-modal MRI segmentation and language-guided segmentation. Though incomplete multi-modal MRI segmentation has been extensively studied, bridging the gap across different MRI modalities is relatively easier. Comparatively, in language-guided segmentation, the modality gap between text and image is more severe. In this paper, we for the first time identify and formulate the challenge of incomplete textual prompts in language-guided segmentation, a common situation in clinical practice that has been completely ignored. Results in Tables 1 and 2 demonstrate the superiority of ARSeg over existing methods for robust language-guided segmentation.

Q: How to select λ₁ and λ₂ in Eq. (9)? (R4)
A: Due to the page limitation, these ablation studies were not included. In our experiments, λ₁ and λ₂ were set by grid searching within {0.1, 0.2, 0.3, 0.4, 0.5}. Finally, they were set to 0.5 and 0.1, respectively. It is also noted that ARSeg’s performance is stable across different settings (see the illustrative sketch after this feedback).

Q: How do texts improve segmentation performance? (R4)
A: On the one hand, the pioneering work LViT (TMI 2023) has pointed out that “it may be more feasible to use complementary and easy-to-access information (i.e., medical notes) to make up for the quality defects of medical images” and validated this through experiments, with LViT (text and image) outperforming LViT-TW and nnUNet (image only) at 83.66% vs. 81.12% vs. 80.42% Dice on QaTa-COV19. On the other hand, compared to point/bbox/mask prompts that rely on heavy interaction, text prompts (e.g., directly captured from audio) are low-cost. In summary, texts provide complementary clues to images for better segmentation in an efficient manner.

Q: Comparison with SAM-Med2D. (R4)
A: A direct comparison is infeasible as SAM-Med2D does NOT support text prompts. Following previous works, only language-guided segmentation methods were included for comparison. In fact, adding a text encoder and ARSeg would enable SAM-Med2D to learn from text prompts.
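The following is a minimal sketch of the loss weighting and missing-prompt simulation described in the feedback above. The function names and the forms of the auxiliary losses are placeholder assumptions; only the reported weights (λ₁ = 0.5, λ₂ = 0.1) and the idea of a per-attribute missing rate vector R follow the text.

```python
# Hypothetical sketch of the Eq. (9)-style loss weighting and the simulation of
# incomplete textual prompts via a missing-rate vector R. Names are illustrative.
import torch


def total_loss(seg_loss, attr_consistency_loss, attr_balancing_loss,
               lambda1: float = 0.5, lambda2: float = 0.1):
    """Combine the segmentation loss with the two attribute-related losses,
    using the weights reported in the rebuttal (0.5 and 0.1)."""
    return seg_loss + lambda1 * attr_consistency_loss + lambda2 * attr_balancing_loss


def simulate_missing_attributes(attr_mask: torch.Tensor, missing_rates: torch.Tensor):
    """Randomly drop each attribute k with probability missing_rates[k],
    mimicking an artificially set missing-rate vector R (e.g., 0.1/0.4/0.7)."""
    keep = (torch.rand_like(attr_mask) >= missing_rates).float()
    return attr_mask * keep


if __name__ == "__main__":
    mask = torch.ones(8, 4)                                    # 8 samples, 4 attributes
    rates = torch.tensor([0.1, 0.4, 0.7, 0.4])                 # per-attribute missing rates
    dropped = simulate_missing_attributes(mask, rates)
    loss = total_loss(torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.2))
    print(dropped.sum(dim=0), loss.item())                     # kept counts, ~0.97
```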




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work focuses on an interesting topic, medical referring segmentation with incomplete prompts, and the technical design is sound. Reviewers also acknowledge that the performance is improved. Therefore, an acceptance is given.


