Abstract

Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrate that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
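
To make the late-fusion data flow described above easier to picture, the following is a minimal reader's sketch assembled from the abstract alone; every submodule (single-conv branches, a 1x1-conv stand-in for FFBI, one cross-attention layer as a stand-in for LFFI) is an illustrative placeholder, not the authors' implementation.

```python
# Reader's sketch of an FMISeg-style late-fusion pipeline (NOT the authors' code).
# All submodules below are simplified stand-ins for the modules named in the abstract.
import torch
import torch.nn as nn

class FMISegSketch(nn.Module):
    def __init__(self, dim=96, text_dim=768):
        super().__init__()
        self.enc_lf = nn.Conv2d(1, dim, 3, padding=1)        # low-frequency branch (semantic context)
        self.enc_hf = nn.Conv2d(1, dim, 3, padding=1)        # high-frequency branch (textural detail)
        self.ffbi = nn.Conv2d(2 * dim, dim, 1)               # stand-in for bidirectional LF<->HF fusion
        self.text_proj = nn.Linear(text_dim, dim)            # project linguistic features to visual dim
        self.lffi = nn.MultiheadAttention(dim, 4, batch_first=True)  # stand-in for language-guided filtering
        self.seg_head = nn.Conv2d(dim, 1, 1)                 # per-pixel lesion logits

    def forward(self, img_lf, img_hf, text_feats):
        f = self.ffbi(torch.cat([self.enc_lf(img_lf), self.enc_hf(img_hf)], dim=1))
        b, c, h, w = f.shape
        q = f.flatten(2).transpose(1, 2)                      # (B, HW, C) visual queries
        kv = self.text_proj(text_feats)                       # (B, L, C) linguistic keys/values
        f_guided, _ = self.lffi(q, kv, kv)                    # text guidance suppresses irrelevant regions
        f = f_guided.transpose(1, 2).reshape(b, c, h, w)
        return self.seg_head(f)

model = FMISegSketch()
mask_logits = model(torch.randn(1, 1, 224, 224),             # LF image from a wavelet split
                    torch.randn(1, 1, 224, 224),             # HF image from a wavelet split
                    torch.randn(1, 24, 768))                  # token features from a text encoder
print(mask_logits.shape)                                      # torch.Size([1, 1, 224, 224])
```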

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3678_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/demoyu123/FMISeg

Link to the Dataset(s)

QaTa-COV19 dataset: https://www.kaggle.com/datasets/aysendegerli/qatacov19-dataset
MosMedData+ dataset: https://medicalsegmentation.com/covid19/

BibTex

@InProceedings{YuBo_Frequencydomain_MICCAI2025,
        author = { Yu, Bo and Yang, Jianhua and Du, Zetao and Huang, Yan and Li, Chenglong and Wang, Liang},
        title = { { Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {277 -- 287}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a Frequency-domain Multi-modal Interaction Segmentation (FMISeg) model that introduces high-frequency (HF) and low-frequency (LF) dual branches for image feature extraction and fuses them with linguistic features extracted by CXR-BERT. Evaluations on the QaTa-COV19 and MosMedData+ datasets validate the effectiveness of the proposed method.
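
    Since the contribution relies on linguistic features extracted by CXR-BERT, a minimal sketch of that text-encoding step is shown below; the HuggingFace checkpoint name, sequence length, and use of the last hidden states are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch: extracting token-level linguistic features with CXR-BERT.
# Checkpoint name, max length, and pooling choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

name = "microsoft/BiomedVLP-CXR-BERT-specialized"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
text_encoder = AutoModel.from_pretrained(name, trust_remote_code=True)

report = "Bilateral patchy ground-glass opacities in the lower lobes."
tokens = tokenizer(report, padding="max_length", max_length=24,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(input_ids=tokens.input_ids,
                       attention_mask=tokens.attention_mask,
                       output_hidden_states=True)
text_feats = out.hidden_states[-1]    # (1, 24, hidden) token-level features for cross-attention
print(text_feats.shape)
```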

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors claim

    • Since the HF features contain textural details while LF features encode high-level semantic contexts, their combination can enhance the representation of the raw image and boost the segmentation accuracy of lesions.
    • To effectively model the interaction between linguistic and visual features, the proposed method introduces a Language and Frequency-domain Feature Interaction (LFFI) module in the decoder. This module first establishes bidirectional interaction between linguistic and visual features through a cross-attention mechanism.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Suspected plagiarism. 1) The experimental results of several previous methods in Table 1 are identical to those reported in the LGA [7] paper (https://link.springer.com/chapter/10.1007/978-3-031-72390-2_57), including U-Net, UNet++, nnUNet, TransUNet, Swin-Unet, LViT-T, and LGA. The authors should not copy LGA’s experimental results directly; they should either point readers to LGA’s paper for those numbers or report results reproduced by themselves. 2) Using the ideas of others and passing them off as the authors’ own. ① There is existing literature on (wavelet-based) low- and high-frequency fusion for segmentation, which the authors should mention:
      Shan, L., Li, X., Wang, W.: Decouple the high-frequency and low-frequency information of images for semantic segmentation. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1805–1809 (2021)
      Zhou, Y., Huang, J., Wang, C., Song, L., Yang, G.: XNet: wavelet-based low and high frequency fusion networks for fully- and semi-supervised semantic segmentation of biomedical images. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 21028–21039 (2023)
      Huang, W., Liu, Y., Sun, L., Chen, Q., Gao, L.: A novel dual-branch pansharpening network with high-frequency component enhancement and multi-scale skip connection. Remote Sens. 17(5), 776 (2025)
      ② The idea of using CXR-BERT may come from TGCAM [13], but the authors didn’t refer to it.
    • Incomplete comparison. The authors didn’t compare the proposed FMISeg to SAM 2, which has been accepted by ICLR 2025, and its variants in the medical domain, such as MedSAM-2 and BioSAM-2.
    • The source code cannot be validated during review and the submission doesn’t mention open access. Could the authors provide an anonymized link to the source code for review?
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Before submitting a manuscript, authors should finalize all research work, including experimental validation and necessary modifications. This ensures a complete and accurate representation of the findings.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (1) Strong Reject — must be rejected due to major flaws

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are acts of plagiarism, such as the reuse of research data and the use of ideas without citing the original sources. Furthermore, I suspect this is a fabricated paper with fake experimental results, since the proposed method does not appear to have been actually implemented.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns are resolved after reading the authors’ rebuttal.



Review #2

  • Please describe the contribution of the paper

    The authors used a dual-encoder dual-decoder model by decomposing the image into high-frequency domain and low-frequency domain using wavelet transformation. Considering that the authors used a CNN (ConvNeXt) model, it is a good way to utilize the two frequency domains if we consider CNN as a learnable frequency filter for the image. The authors also used FFBI to integrate the information between the two domains and obtained good results.

    The authors also implemented a multi-modal model that leverages language information through cross-attention between the language domain and image domain at the decoder. It is not clear which textual information the authors used (i.e., radiology reports or EHR clinical reports), but even if they used radiology reports, it can be considered a contribution to strengthening disease quantification (semantic segmentation) through the multi-modal utilization of the given information. If an EHR was used, it can be considered a contribution to strengthening the results by utilizing clinical context.
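
    As a concrete illustration of the wavelet-based frequency split described above, here is a minimal sketch using a single-level 2D Haar DWT; the wavelet type, decomposition level, and the way the sub-bands are recombined are assumptions rather than the paper's exact preprocessing.

```python
# Hedged sketch: splitting an image into low-frequency (approximation) and
# high-frequency (detail) components with a single-level 2D Haar DWT.
# Wavelet choice and sub-band handling are illustrative assumptions.
import numpy as np
import pywt

image = np.random.rand(224, 224).astype(np.float32)     # stand-in for a normalized CXR / CT slice

cA, (cH, cV, cD) = pywt.dwt2(image, "haar")              # approximation + 3 detail sub-bands

# Low-frequency image: keep only the approximation, zero the details, reconstruct.
img_lf = pywt.idwt2((cA, (None, None, None)), "haar")
# High-frequency image: zero the approximation, keep the details, reconstruct.
img_hf = pywt.idwt2((np.zeros_like(cA), (cH, cV, cD)), "haar")

print(img_lf.shape, img_hf.shape)                        # (224, 224) (224, 224)
```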

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors efficiently leveraged multi-domain information (high-frequency, low-frequency, and language) through cross-attention to demonstrate good performance in COVID-19 pneumonia segmentation.

    2. The authors also provide a good comparison covering image-only SoTA models, image-text multi-modal SoTA models, and CNN, hybrid, and Transformer architectures.

    3. The ablation study on each element of the overall model (frequency-domain features, FFBI, LFFI) was well done and clarifies what contributed to the model's performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In the motivation of the study, the authors claim that high-frequency features contain textural detail and low-frequency features encode high-level semantic context, but they do not provide any evidence for this. A more rigorous explanation or documented evidence would support this claim.

    2. It is unclear whether the clinical text information used by the authors was clinical text from a non-imaging EHR or a radiologic report. This is important because the contribution of this study as a multi-modal model depends on whether the model is fed information from the same domain as the medical image itself (i.e., a radiologic report) or clinical information outside of the image as context (i.e., electronic health records).

    3. The authors conducted their experiments at 224 x 224 resolution. The authors should explain why they used this resolution and whether they used ImageNet-pretrained or NIH DenseNet-121-pretrained weights, since they did not mention the initialization method (e.g., random initialization). The authors performed experiments on CXR and lung CT data. The raw resolution of a CXR depends on the detector size but is on average about 2000 x 3000, and CT typically has a resolution of 512 x 512. Resizing to 224 x 224 incurs substantial compression loss and could discard information about the targeted lesions.

    4. The authors trained and validated within each dataset and did not perform external validation for either model. While this shows a good fit of the proposed methodology, the lack of external validation limits the assessment of its robustness and the possibility of real-world deployment.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I liked the methodology of using high-frequency features, low-frequency features, and the language domain, and their efficient integration through cross-attention. However, there are some critical limitations, such as the use of a low 224 x 224 resolution in the medical domain and the lack of external validation. In addition, details were lacking in the description of some of the experiments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    While the authors’ rebuttal shows that the FFBI and LFFI modules help improve segmentation performance, it still doesn’t prove that high-frequency features capture fine detail and low-frequency features capture high-level context. In addition, the experiment at 224 x 224 resolution makes it difficult to prove that the model captures fine-grained detail well. Within the medical field, 224 resolution is known to impose limitations, necessitating the exploration of 512 or higher resolutions in extended studies.

    Nevertheless, the authors’ multi-modal segmentation strategy is valuable and fits well within MICCAI’s scope. Although the absence of external validation is disappointing, as the authors commented in their rebuttal, this is still a significant contribution that demonstrates the efficacy of their strategy on datasets from diverse imaging modalities.



Review #3

  • Please describe the contribution of the paper

    This paper addresses two key challenges in language-guided medical image segmentation: the insufficient discriminative representation of visual features and the inability to suppress semantically irrelevant information.

    To tackle the first issue, the authors propose a frequency-domain dual-branch encoder combined with a Frequency-domain Feature Bidirectional Interaction (FFBI) module, enabling the effective fusion of high-frequency (texture) and low-frequency (semantic) information while preserving visual encoding quality.

    To address the second issue, they introduce a Language and Frequency-domain Feature Interaction (LFFI) module in the decoder, which employs cross-attention and adaptive filter weighting to suppress irrelevant visual regions based on textual guidance.

    Experiments on two public datasets demonstrate that this targeted design leads to improved segmentation performance over state-of-the-art methods.
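
    To make the FFBI idea above concrete, the sketch below shows one plausible bidirectional interaction between LF and HF feature maps via mutual cross-attention; the module name, dimensions, and attention-based design are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a bidirectional LF<->HF interaction in the spirit of FFBI.
# This is an illustrative stand-in, NOT the paper's exact module.
import torch
import torch.nn as nn

class BidirectionalFreqInteraction(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.lf_from_hf = nn.MultiheadAttention(dim, heads, batch_first=True)  # LF queries attend to HF
        self.hf_from_lf = nn.MultiheadAttention(dim, heads, batch_first=True)  # HF queries attend to LF
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)                     # merge the two enriched streams

    def forward(self, f_lf, f_hf):
        b, c, h, w = f_lf.shape
        lf = f_lf.flatten(2).transpose(1, 2)            # (B, HW, C)
        hf = f_hf.flatten(2).transpose(1, 2)
        lf_enriched, _ = self.lf_from_hf(lf, hf, hf)     # inject textural detail into the LF stream
        hf_enriched, _ = self.hf_from_lf(hf, lf, lf)     # inject semantic context into the HF stream
        lf_enriched = lf_enriched.transpose(1, 2).reshape(b, c, h, w)
        hf_enriched = hf_enriched.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([lf_enriched, hf_enriched], dim=1))

fused = BidirectionalFreqInteraction()(torch.randn(1, 96, 14, 14), torch.randn(1, 96, 14, 14))
print(fused.shape)  # torch.Size([1, 96, 14, 14])
```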

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. While techniques such as wavelet transform and image-text cross-attention are not entirely new, this paper clearly identifies key limitations in prior work and provides a well-reasoned, targeted design to address them. The proposed method is simple yet effective, avoiding unnecessary architectural complexity while achieving meaningful improvements.

    2. The paper includes comprehensive experiments, including strong comparisons with state-of-the-art methods and well-structured ablation studies. The presentation is clean and well-organized, with clear descriptions and logical flow, which makes the technical contributions easy to follow and evaluate.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While it is understandable that space may be limited, the paper would benefit from more detailed examples or visualizations showing how the dual-branch frequency encoder combined with late fusion specifically improves segmentation quality, especially in terms of fine-grained lesion boundaries. Such evidence would help reinforce the authors’ core claims.

    2. In the comparative and ablation experiments, it would be valuable to include a direct comparison between early fusion and late fusion strategies, or at least a discussion of their respective advantages. Given that the paper also emphasizes the benefits of late fusion, this additional analysis would provide useful insights and make the evaluation more complete and compelling.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend acceptance of this paper based on its clear problem formulation, well-motivated and targeted method design, and strong empirical results. The proposed approach effectively addresses key limitations in previous language-guided segmentation methods, and the overall architecture is both conceptually sound and practically efficient. The experiments are thorough, and the paper is clearly written and well-structured, making it a solid contribution to the field.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    All of my concerns have been properly addressed. Additionally, since both the QaTa-COV19 and MosMedData+ datasets come with official training and test splits, it is acceptable for the authors to cite previously reported results—as long as the sources are clearly acknowledged, as suggested by another reviewer.




Author Feedback

Thank you for all reviewers’ constructive comments and valuable suggestions. We provide detailed replies to the questions raised below:

R1[Q1]R2[Q1]: Thanks for your suggestions. Our paper provides qualitative comparisons in Fig. 2. Our method achieves more accurate segmentation, especially on small or blurry lesions, due to the well-designed FFBI and LFFI modules. In Table 2, the LF-only model outperforms the HF-only model, supporting our claim that LF features capture high-level context while HF features retain fine details. We will refine our paper to make our claims clearer.

R1[Q2]: We agree that comparing early and late fusion strategies would strengthen the evaluation. Since FMISeg requires extracting robust LF and HF features, injecting textual features into the encoders may disrupt the inherent LF and HF representations. In Table 1, late fusion methods [8-10,12,13] outperform early fusion methods [6,7,11], supporting our design choice.

R2[Q2,Q3]: The clinical text comes exclusively from radiologic reports that accompany the chest X-ray or CT images. The text is directly related to the lesions, including their locations, shapes, and sizes. FMISeg adopts ConvNeXt-Tiny backbones initialized with ImageNet pretrained weights. We agree that higher resolutions (e.g., 512×512) may improve fine-grained segmentation; however, we followed previous studies [6-13] and adopted a resolution of 224×224 for fair comparison. We will refine Sec. 3.2 to make these details clearer.
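
A minimal sketch of the backbone initialization and input resolution described in this rebuttal point, assuming the torchvision ConvNeXt-Tiny weights; the exact backbone wiring used in FMISeg may differ.

```python
# Hedged sketch: ImageNet-pretrained ConvNeXt-Tiny backbone and 224x224 inputs,
# as described in the rebuttal. Library choice (torchvision) is an assumption.
import torch
from torchvision import models, transforms

weights = models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1
backbone = models.convnext_tiny(weights=weights).features   # keep only the convolutional stages

# How a real CXR / CT slice would be prepared for this backbone (illustrative).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),        # resolution used for fair comparison with prior LMIS work
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])

x = torch.randn(1, 3, 224, 224)           # stand-in for a preprocessed image
feats = backbone(x)                        # (1, 768, 7, 7) final-stage feature map
print(feats.shape)
```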

R2[Q4]: We agree that external validation is essential for generalization. However, the two evaluated datasets differ significantly in imaging modality (X-ray vs. CT), so cross-dataset evaluation without domain adaptation may lead to misleading conclusions. Thus, our study opted to evaluate each model within its corresponding domain.

R3: We greatly appreciate your critical comments. We affirm that no plagiarism or data fabrication occurred in our paper. FMISeg is original, thoroughly implemented, and evaluated on public benchmarks. To reproduce the results in the paper, the code is available at https://anonymous.4open.science/r/FMISeg-BB71. Regarding the reuse of results from LGA and the missing references, we clarify these concerns below.

R3[Q1]: Following prior works [7,9,11,13], we reused the results of unimodal methods [1-4,22,23] from LViT [6] to ensure fair and consistent comparisons. The comparison results in LGA [7] were also originally reported in LViT, not reproduced by the authors. We disagree with the allegation of plagiarism, as FMISeg was fully implemented and evaluated independently. We will add a footnote in Table 1 to indicate where these results come from in the revision.

R3[Q2]: Thanks for your careful review. It should be clarified that our paper already cites XNet [17] and CXR-BERT [19] in Sec. 2.1. FMISeg follows XNet in using the wavelet transform to obtain LF and HF images. The other two works (ICASSP 2021 and Remote Sens. 2025) use the Fourier transform and multi-scale convolutions for LF and HF components, and are not closely related to FMISeg. CXR-BERT is a widely adopted text encoder in recent LMIS works, including LanGuideSeg [8], MAdapter [12], and TGCAM [13]. The use of CXR-BERT is not a new idea proposed by TGCAM, and our contributions are clearly distinct from TGCAM's. Importantly, FMISeg differs significantly from existing methods. Beyond wavelet-based decomposition during image preprocessing, we introduce: 1) the FFBI module for bidirectional interaction between LF and HF features; and 2) the LFFI module for cross-modal filtering using linguistic priors. FMISeg is a task-specific dual-branch late-fusion architecture tailored to the LMIS task. We will highlight these distinctions in the revision.

R3[Q3]: SAM-2 and its medical variants (MedSAM-2 and BioSAM-2) are designed for point- or box-prompted segmentation, not language-guided segmentation. Our task setting and model design are fundamentally different, which makes direct comparison inappropriate.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


