Abstract
Tissue segmentation is essential for pathology image analysis. Conventional deep-learning-based segmentation methods require large amounts of annotated data and are constrained by predefined classes, making them less flexible in adapting to diverse clinical requirements and user-specific queries. Language-guided referring segmentation (LGRS) models can identify and segment specific objects based on user-provided descriptions. However, existing LGRS models lack the capability to explicitly reject nonexistent targets and struggle to segment multiple target regions effectively. Motivated by these considerations, we propose LTSE, a language-guided tissue referring segmentation assistant for pathology images, which inherits the powerful multi-modal alignment capabilities of Multi-modal Large Language Models (MLLMs) to perform tissue segmentation according to user instructions. Specifically, we expand the original vocabulary with multiple [SEG] tokens to support multiple mask references and a [REJ] token to reject empty targets. In addition, we enhance adaptability and accuracy in multi-target segmentation by developing an Adaptive Expert Mixture (AEM) module that dynamically selects specialized expert decoders based on the textual and visual characteristics of the input data. We curate BCSS-Ref, the first vision-language pathology dataset for the tissue referring segmentation task, with matched images, masks, and textual information, and experimental results demonstrate the superiority of our method in comparison with existing studies.
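The paper's code is not released (the repository link below is N/A), so the vocabulary expansion described in the abstract can only be illustrated. The sketch below is a minimal, self-contained stand-in: a toy vocabulary is extended with several `[SEG_i]` tokens (one per referred region) and a single `[REJ]` token for declining nonexistent targets. All names (`BASE_VOCAB`, `expand_vocab`, the token spellings) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of the [SEG]/[REJ] vocabulary expansion described in
# the abstract. A real implementation would extend an MLLM tokenizer and
# resize the embedding matrix; here a plain dict stands in for the vocabulary.

BASE_VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2}  # stand-in for an MLLM vocabulary


def expand_vocab(vocab, num_seg_tokens=3):
    """Append [SEG_i] tokens and one [REJ] token; return the new vocab and ids."""
    vocab = dict(vocab)
    new_ids = {}
    for i in range(num_seg_tokens):
        tok = f"[SEG{i}]"          # one [SEG] token per possible mask reference
        vocab[tok] = len(vocab)
        new_ids[tok] = vocab[tok]
    vocab["[REJ]"] = len(vocab)    # explicit rejection token for empty targets
    new_ids["[REJ]"] = vocab["[REJ]"]
    return vocab, new_ids


vocab, special_ids = expand_vocab(BASE_VOCAB)
print(special_ids)  # → {'[SEG0]': 3, '[SEG1]': 4, '[SEG2]': 5, '[REJ]': 6}
```

At inference time, emitting `[REJ]` instead of a `[SEG_i]` token is what lets the model decline a query whose target is absent from the image, rather than forcing a spurious mask.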
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2309_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{TanJia_LTSE_MICCAI2025,
author = { Tang, Jiao and Qian, Bo and Wan, Peng and Shao, Wei and Zhang, Daoqiang},
title = { { LTSE: Language-guided Tissue Referring Segmentation in Pathology Images with Adaptive Expert Mixture } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {413 -- 423}
}
Reviews
Review #1
- Please describe the contribution of the paper
- The paper contributes a new visual-text pathology tissue segmentation dataset for language-guided referring segmentation (LGRS).
- The authors propose an Adaptive Expert Mixture (AEM) module with the ability to select expert decoders given the textual and visual data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The research topic, pathology tissue referring segmentation using multimodal LLMs, is a novel and practical task.
For evaluation on this novel task, the paper proposes a new vision-language pathology dataset for tissue referring segmentation. The dataset contains matched images, masks, and textual information.
The authors propose an Adaptive Expert Mixture (AEM) module with the ability of selecting expert decoders given the textual and visual data.
The proposed method has better performance when compared with conventional LGRS methods and MLLM-based LGRS methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Can the authors explain more carefully the importance of rejecting nonexistent (empty or irrelevant) targets in practice? For example, in the sentence “there is a high possibility that the endothelial cells may be mis-segmented as tumor cells if the target of tumor cells are not rejected”, what is the meaning of “if the target of tumor cells are not rejected”?
How are the text and mask data generated? The authors should have a brief description.
Can the authors explain more clearly what the differences among the decoder experts are? How are their weights initialized? And how do different decoder experts attend to different textual and visual characteristics of the input data?
In the abstract, the authors mention that “the existing LGRS models lack the capability to explicitly reject nonexistent targets”. In the experiment section, the authors mention that “GSVA [26] has the ability to reject empty targets”. What makes your proposed method different from GSVA in rejecting empty targets?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- It is not clear how the data was made. Especially on the masks and the texts.
- It is not clear how different decoder experts pay attention to the characteristics of visual data and text data.
- It is not clear how the proposed method is different from GSVA in rejecting empty targets.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed all of the reviewer’s questions.
Review #2
- Please describe the contribution of the paper
The paper proposes LTSE, a language-guided tissue segmentation model for pathology images. It extends multi-modal LLMs with [SEG] and [REJ] tokens for multi-region segmentation and target rejection. An Adaptive Expert Mixture (AEM) module improves accuracy by selecting specialized decoders. A new dataset, BCSS-Ref, is introduced, and experiments show LTSE outperforms existing methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- novelty: Based on GSVA, the authors involve the Adaptive Expert Mixture (AEM) module to dynamically select specialized decoders, achieving higher segmentation performance.
- Experiment: The manuscript shows powerful qualitative and quantitative results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited dataset: The data comes from a single dataset, so it is unknown whether this method generalizes to other pathology images.
- Limited comparison: The authors only use LLaVA-7B-v1-1 as the backbone and do not conduct experiments and comparisons with other backbone models, as the original GSVA paper does, so it is unknown whether this method can be adapted to other MLLM backbones.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I am interested in whether existing pathology-specific MLLMs, such as Quilt-LLaVA [1] or WSI-LLaVA [2], can further improve the segmentation task, given that they have already learned pathology knowledge. [1] Seyfioglu, M.S., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt-LLaVA: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. In: Proceedings of the IEEE/CVF CVPR (2024) [2] Liang, Y., Lyu, X., Ding, M., Chen, W., Zhang, J., Ren, Y., He, X., Wu, S., Yang, S., Wang, X., et al.: WSI-LLaVA: A multimodal large language model for whole slide image. arXiv preprint arXiv:2412.02141 (2024)
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors introduce the Adaptive Expert Mixture (AEM) module into a general segmentation method, GSVA, achieving better pathology segmentation. The experimental results show the effectiveness of the method.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I will keep my weak accept due to the limited dataset and base model.
Review #3
- Please describe the contribution of the paper
This paper introduces LTSE for flexible language-guided tissue-referring pathology image segmentation. An Adaptive Expert Mixture module is proposed to enhance overall performance. Extensive experiments on the curated BCSS-Ref dataset demonstrate the effectiveness of the entire framework.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The introduction of GSVA for pathology image segmentation is highly practical, as it effectively handles diverse scenarios, including multiple targets and empty target cases. This flexibility enhances the potential of AI-based medical assistants.
The paper contributes a new dataset, BCSS-Ref, specifically designed for language-guided tissue-referring pathology image segmentation. This dataset is likely to advance research in this domain.
- The paper is well-structured, clearly written, and easy to follow.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The proposed Adaptive Expert Decoders combine outputs from multiple decoders using weights predicted by a gate network. This approach, while effective, appears similar to existing techniques [1][2][3][4]. However, the paper lacks a discussion of related work to contextualize its novelty.
- The use of multiple decoders in the Adaptive Expert Decoders likely increases computational complexity. The paper does not provide details on the model’s parameters or FLOPs, which are critical for assessing its efficiency.
- Given that the framework leverages a generalizable MLLM, it is unclear whether it supports unseen classes outside the training set.
- The experiments are conducted only once. Due to the limited dataset size, the results lack statistical robustness. Since the dataset is split into multiple folds, evaluating performance across these folds would provide more convincing results.
Although the experiments account for different magnification levels, all data originates from the BCSS-Ref dataset, raising generalization concerns.
[1]Ou, Yanglan, et al. “Patcher: Patch transformers with mixture of experts for precise medical image segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022.
[2]Jiang, Yufeng, and Yiqing Shen. “M4oE: A foundation model for medical multimodal image segmentation with mixture of experts.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024.
[3]Zhang, Xinru, et al. “A foundation model for brain lesion segmentation with mixture of modality experts.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024.
[4]Wang, Guoan, et al. “Sam-med3d-moe: Towards a non-forgetting segment anything model via mixture of experts for 3d medical image segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a novel yet practical pathology segmentation task, and the proposed dataset is likely to contribute meaningfully to research in the field. However, the novelty of the proposed method is questionable, and the experiments could be improved. Based on these considerations, I recommend this score.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Thank you to the authors for the detailed rebuttal. After careful consideration, I maintain that both the MoE in [1] and the proposed MoE utilize dynamic weighting, with their gate networks processing all input information, which raises concerns about the approach’s novelty. Additionally, the limited dataset and lack of statistical testing reduce the persuasiveness of the results.
Author Feedback
For R#2 and R#4, “Limited dataset and comparison”: Currently, there are no publicly available datasets for referring segmentation on pathology images, as such data require extensive annotation effort from experts. We will release our dataset and provide more options in future work. Additionally, for backbone selection, we will explore other backbone architectures in future studies to enhance the adaptability and robustness of our method.
For R#3, “Importance of rejecting empty targets”: Rejecting empty targets is crucial for ensuring that the model behaves reliably when it encounters unknown or unseen classes during inference. For instance, if labeled endothelial nuclei are not provided in the training data, a traditional closed-set segmentation model will misclassify them as another nucleus type, which would affect downstream clinical prediction tasks.
“Dataset generation”: Our BCSS-Ref dataset is constructed from the BCSS semantic segmentation dataset, which consists of WSIs captured at 40x magnification. To generate the text and mask data, we first had a pathology expert review the original semantic annotations in BCSS. The expert then provided detailed descriptions for important regions, such as “abnormal proliferation of cells indicating a tumor.” These texts were carefully curated by the expert to capture critical pathological features. We then extracted instance-level masks from the annotated regions and combined them with the corresponding expert-provided texts to form the BCSS-Ref dataset.
“Adaptive Expert Mixture”: Each individual expert in the AEM module is a ViT-H SAM mask decoder, with weights initialized from the ViT-H SAM backbone. The gate network within the AEM module dynamically assigns weights to each expert decoder based on both the textual and visual features. Since the references contain rich semantic cues, including category combinations, spatial location, and area information, different experts can learn to focus on distinct aspects through these adaptive weights, enabling more precise segmentation in complex contexts.
“Reject mechanism”: Our method shares a similar rejection mechanism with GSVA but incorporates dynamically adjusted expert decoders for more precise segmentation.
For R#4, “Novelty of MoE”: Our Adaptive Expert Decoders differ from existing techniques by leveraging dynamically weighted expert decoders specifically tailored for complex, multi-target pathology segmentation. Unlike standard mixture models, our approach incorporates both visual and rich textual cues (e.g., category combinations, spatial information) through a gate network, allowing more precise specialization of individual decoders.
“Model’s parameters”: Our model has approximately 2.59B trainable parameters, with a total computational cost of approximately 994.87 GFLOPs per forward pass.
“Unseen classes”: Given the zero-shot capability of MLLMs, our framework has the potential to segment unseen classes. We will note this as a future direction in the revised manuscript.
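The rebuttal describes the AEM gate network assigning weights to expert decoders from fused textual and visual features. Since no code is released, the sketch below is only an illustrative toy: pooled visual and text feature vectors are concatenated, a linear gate scores each expert, and the final mask logits are the softmax-weighted sum of the experts' outputs. The shapes, the linear gate, and the dummy "expert decoders" are all assumptions; the paper's experts are ViT-H SAM mask decoders.

```python
import numpy as np

# Toy sketch of the gating idea behind the AEM module as described in the
# rebuttal: a gate scores each expert decoder from fused visual + textual
# features, and the output is the weighted mixture of expert mask logits.

rng = np.random.default_rng(0)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def aem_decode(visual_feat, text_feat, experts, gate_w):
    """Combine expert decoder outputs using gate-predicted weights."""
    gate_in = np.concatenate([visual_feat, text_feat])       # fuse modalities
    weights = softmax(gate_w @ gate_in)                      # one weight per expert
    outputs = np.stack([e(visual_feat, text_feat) for e in experts])
    # Contract the expert axis: weighted sum of per-expert mask logits.
    return np.tensordot(weights, outputs, axes=1), weights


# Dummy "expert decoders": each maps the features to an 8x8 mask-logit map.
experts = [lambda v, t, s=s: rng.standard_normal((8, 8)) + s for s in range(3)]
gate_w = rng.standard_normal((3, 8))                         # 3 experts, 8-dim gate input

mask_logits, w = aem_decode(rng.standard_normal(4), rng.standard_normal(4),
                            experts, gate_w)
print(mask_logits.shape)  # (8, 8) — same shape as one expert's output
```

Because the gate sees both modalities, experts can in principle specialize on cues such as category combinations or spatial references mentioned in the text, which is the specialization behavior the rebuttal claims.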
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Paper Summary: This paper tackles the challenge of tissue referring segmentation in pathology images by leveraging multi-modal large language models. The authors extend an existing MLLM backbone with multiple [SEG] tokens to support multi-region segmentation and a [REJ] token to explicitly reject nonexistent targets. To further improve adaptability and accuracy, they introduce an Adaptive Expert Mixture (AEM) module that dynamically selects specialized mask decoders based on visual and textual features. The work is supported by the first vision-language pathology dataset, BCSS-Ref, and is evaluated at two magnification levels, showing consistent gains in intersection-over-union metrics and near-perfect empty-target rejection.
Key Strengths: The paper presents a novel formulation by enriching the token vocabulary for multi-target and rejection handling, and its AEM module offers a flexible decoding strategy that outperforms unified decoders. The creation of BCSS-Ref represents a valuable resource for the community, combining images, masks, and textual descriptions. Empirically, the method delivers clear qualitative and quantitative improvements over both traditional LGRS approaches and recent MLLM-based segmentation assistants. The manuscript is also well structured and clearly written, facilitating understanding of its contributions.
Key Weaknesses: All experiments are confined to a single dataset, raising questions about generalization to other pathology domains. Comparisons are limited to a single MLLM backbone, leaving open whether the approach transfers to alternative vision-language models. Critical details about data generation, how masks and textual prompts were produced, are omitted. The internal workings of the AEM gate network and the distinctions among expert decoders (their initialization and feature specialization) are not fully explained. Moreover, computational complexity (parameter counts or FLOPs) is not reported, and the evaluation lacks cross-fold statistical analysis to demonstrate result robustness.
Review Summary: Reviewers unanimously praise the paper’s innovative use of multi-segmentation and rejection tokens, its Adaptive Expert Mixture design, and the introduction of a specialized BCSS-Ref dataset. They agree that the experiments convincingly demonstrate performance gains and that the writing is clear. Divergences occur around the depth of novelty discussion, some find the method sufficiently distinct, while others request contextualization against recent pathology-specific MLLMs, and the completeness of methodological details. Concerns include lack of clarity on dataset creation, expert decoder mechanisms, computational costs, and statistical rigor. Balancing these perspectives suggests the work has strong merit but needs targeted clarifications.
Decision: Invite to rebuttal to clarify methodological details and address generalization and robustness concerns.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is a borderline paper with marginal novelty and modest performance gains, but I am leaning toward acceptance. Reviewer 3 appears to have misunderstood some aspects of the work.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A