Abstract
Cervical cancer remains a significant global health concern, emphasizing the need for effective diagnostic methods. Despite advancements in Vision Language Models, challenges persist in incorporating cytological knowledge, ensuring data relevance, and maintaining accuracy when aggregating visual information. Current methods often struggle to handle fine-grained morphological details and the complex relationships between images and textual knowledge. In this paper, we present a novel framework for cervical cell classification that combines attribute descriptors with cytological knowledge for enhanced morphology recognition. Our approach leverages the Vision Large Language Model to generate descriptions for each cervical image and pretrain image and text encoders, improving both image understanding and cytological context. We introduce Attribute Descriptors Extraction using LLMs and Retrieval-Augmented Generation to generate detailed descriptors that capture important cytological features while minimizing irrelevant information. Additionally, we propose Optimal Attribute Descriptors Matching to dynamically align textual descriptors with image features, enhancing prediction accuracy, interpretability, and cytological relevance. Experimental results demonstrate the superior performance and generalizability of our method with varying amounts of labeled data. The code is publicly available at https://github.com/feimanman/CervicalCellClassifier.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1001_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/feimanman/CervicalCellClassifier
Link to the Dataset(s)
HiCervix dataset: https://github.com/Scu-sen/HiCervix
BibTex
@InProceedings{FeiMan_Refining_MICCAI2025,
author = { Fei, Manman and Shen, Zhenrong and Liu, Mengjun and Song, Zhiyun and Sun, Yusong and Han, Xu and Liu, Zelin and Jiang, Haotian and Bai, Lu and Wang, Qian and Zhang, Lichi},
title = { { Refining Cervical Cell Classification with Cytological Knowledge and Optimal Attribute Descriptor Matching } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {521--531}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a pipeline for cervical cell image classification that integrates image descriptors with clinically grounded textual descriptors derived from the Bethesda System. The proposed method demonstrates improved performance with limited annotation compared to vision-language models such as CLIP, and enhances model interpretability by grounding predictions in expert-curated cytological descriptions.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The incorporation of TBS-derived textual descriptions ensures that the model leverages expert-validated morphological criteria specific to cervical cell diagnosis. This enhances the interpretability of the model’s predictions, aligning them more closely with clinical practice.
- The integration of the OD Solver improves performance in the ablation study, suggesting that it contributes to better alignment between image features and descriptors.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The introduction emphasizes interpretability as a major advantage of the proposed model over traditional CNNs. To illustrate this point, it would be good to quantitatively evaluate the accuracy of predicted descriptors.
- The paper does not compare the proposed model to vision-only baselines (e.g., ResNet-50, ViT) trained on 25% and 50% labeled data, which should be added to demonstrate the efficacy of textual supervision under the same constraint.
- The ablation study lacks detail on the replacement strategy for removed modules. For instance, when the OD Solver is removed, what alternative matching mechanism is used?
- The OD Solver was first proposed by Chen et al. 2024 for a video recognition task. This paper should be cited, especially since the OD Solver in Fig. 1 is adapted from the illustration in Fig. 3 of the Chen paper.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed model doesn’t show a substantial performance improvement over baseline models like ResNet, which doesn’t justify the added complexity of integrating text.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Comparisons to baseline under the same annotation constraint are provided in the rebuttal.
Review #2
- Please describe the contribution of the paper
The paper introduces a novel framework for cervical cell classification that enhances morphological recognition by integrating attribute descriptors with cytological knowledge. The approach leverages Vision Language Models to generate detailed descriptions for cervical images and pretrains image and text encoders. Key contributions include the introduction of Attribute Descriptors Extraction using LLMs and RAG to create precise, cytologically relevant descriptors.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strengths of the paper lie in its innovative use of Retrieval-Augmented Generation to generate fine-grained cytological descriptors, which effectively capture key morphological attributes while minimizing irrelevant information. Additionally, the proposed Optimal Attribute Descriptors Matching dynamically aligns the textual descriptors with image features, improving both prediction accuracy and interpretability. The experimental results demonstrate the superior performance and generalizability of the approach across different datasets, showing its robustness and adaptability with varying amounts of labeled data.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The main weaknesses of the paper lie in the use of the pre-trained TextEncoder. While the model employs CLIP to train a text feature extractor for cell features, there is some uncertainty about whether the extracted features are aligned in the feature space when directly applied to the Attribute Set and learnable tokens. It remains unclear whether this pretraining is effective and whether an additional adapter is needed for better alignment.
- Additionally, while Optimal Transport is a valid method, the use of CLIP’s cosine similarity for alignment in the earlier stage could potentially achieve similar results. The implementation feels somewhat disjointed, as there is a shift from CLIP alignment to OT alignment, lacking a consistent, unified approach.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
My recommendation is based on the method’s alignment with the paper’s motivation and the thoroughness of the experimental validation. While the approach, particularly the use of RAG for fine-grained feature extraction, is innovative, there are concerns regarding the alignment of the pre-trained TextEncoder and whether additional adjustments, such as adapters, are needed. The experiments show strong performance, but the shift from CLIP-based alignment to Optimal Transport feels somewhat inconsistent. These factors influenced my overall score.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
In this paper, the authors propose a novel framework for cervical cell classification that integrates structured attribute descriptors with clinically relevant textual knowledge for fine-grained morphology recognition. Experiments show that the proposed method achieves state-of-the-art results compared to baselines.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- I think it’s a great idea to use a VLLM to generate descriptions of input images and then use them to pretrain the CLIP encoders. This enhances the model’s comprehension of cervical cell characteristics and facilitates better generalization.
- The use of entropy-regularized optimal transport to find the optimal matching between the image features and the attribute descriptors is clever, and it helps improve prediction accuracy (see the sketch after this list).
- The ablation study looks good. It shows that pretraining with the VLLM, the attribute descriptors, and the optimal matching all contribute to the superior performance of the proposed method.
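The entropy-regularized optimal transport matching praised above can be illustrated with a short Sinkhorn iteration. This is a minimal sketch, assuming L2-normalized image and descriptor embeddings and uniform marginals; the function name and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not taken from the paper or its released code.

```python
import torch

def sinkhorn_matching(img_feats, attr_feats, eps=0.1, n_iters=50):
    """Entropy-regularized OT between image features and attribute descriptors.

    img_feats:  (M, D) L2-normalized image feature tokens
    attr_feats: (N, D) L2-normalized attribute descriptor embeddings
    Returns the transport plan T (M, N) and an OT-weighted similarity score.
    """
    cost = 1.0 - img_feats @ attr_feats.T           # cosine cost, shape (M, N)
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    mu = torch.full((img_feats.size(0),), 1.0 / img_feats.size(0))
    nu = torch.full((attr_feats.size(0),), 1.0 / attr_feats.size(0))
    u = torch.ones_like(mu)
    for _ in range(n_iters):                        # Sinkhorn fixed-point updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)         # plan diag(u) @ K @ diag(v)
    score = (T * (img_feats @ attr_feats.T)).sum()  # higher = better class match
    return T, score

# Toy usage: 4 image tokens vs. 6 descriptors of one class, 512-d embeddings
img = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
attr = torch.nn.functional.normalize(torch.randn(6, 512), dim=-1)
plan, match_score = sinkhorn_matching(img, attr)
```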
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It would be great if the authors could evaluate the proposed method on multiple datasets to showcase its effectiveness. Currently, the authors only use five categories from a single dataset, so the generalization ability of the proposed method is not clear.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Based on the strengths and weaknesses mentioned above, I recommend “Accept”.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I maintain my positive rating after reading the rebuttal and other reviews.
Author Feedback
Reviewer #1
- About the quantitative evaluation of the accuracy of predicted descriptors. Thank you for your suggestion. Currently, we do not have a dataset that matches descriptors with images, so we are unable to conduct a quantitative evaluation at this time. However, we do provide a visualization of the results in Figure 2 of the experiment section.
- The proposed model shows little improvement over baseline models, making the added text integration unjustified. Although we did not include direct comparisons with vision-only baselines (e.g., ResNet-50, ViT) under 25% and 50% labeled data settings in the main paper, our method achieves performance comparable to these baselines trained with 100% labeled data, demonstrating the advantage of leveraging textual supervision in low-label scenarios. To address this issue, we conducted experiments with ResNet-50 and ViT under the same 25% and 50% labeled data settings. The results show that ResNet-50 achieved ACC values of 63.24% and 65.47% respectively, while ViT achieved ACC values of 64.17% and 66.83%. In comparison, our method achieved ACC values of 69.30% and 71.42% with 25% and 50% labeled data, further demonstrating the efficacy of textual supervision under the same constraint.
- When the OD Solver is removed, what alternative matching mechanism is used? The matching process is simplified by not using the OD Solver. Instead, each category’s attribute descriptors, which represent the morphological features of that class, are aggregated into a single, unified descriptor. For each category, we then calculate the cosine similarity between the image’s feature embedding and this aggregate attribute descriptor (a short sketch of this baseline follows below).
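A minimal sketch of this ablation baseline, assuming mean pooling as the aggregation and L2-normalized embeddings; the function and variable names are hypothetical, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def classify_without_ot(img_feat, class_attr_feats):
    """Ablation baseline used when the OD Solver is removed: each class's
    descriptor embeddings are pooled into one unified descriptor, and the
    class score is the cosine similarity to the image embedding.

    img_feat:         (D,) image embedding
    class_attr_feats: list of (N_c, D) descriptor embeddings, one per class
    """
    scores = []
    for attrs in class_attr_feats:
        unified = F.normalize(attrs.mean(dim=0), dim=-1)  # aggregate descriptors
        scores.append(F.cosine_similarity(img_feat, unified, dim=0))
    return torch.stack(scores)  # (num_classes,); argmax gives the prediction
```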
- About the addition of references in the paper. Thank you for your reminder; we will add references to the paper.
Reviewer #3
- About the issue of the method’s generalization ability. Thank you for the suggestion. Due to page limitations, we could not include additional datasets in this paper; we will evaluate on more datasets in future research to study generalization.
Reviewer #2
- About the effectiveness of the pre-trained TextEncoder and whether an additional adapter is needed. We appreciate your concern regarding the alignment of features extracted by the pre-trained TextEncoder. First, we would like to emphasize that the pre-training is indeed effective. In the first stage, we perform pre-training to adapt CLIP from the natural image domain to the cervical cell pathology domain, which provides a coarse-grained alignment of image and text features. In the second stage, we introduce the Attribute Set and utilize the CoOp method to fine-tune CLIP for fine-grained classification, ensuring a more precise alignment. The effectiveness of this approach is supported by our ablation study in Table 2, where the combination of the pre-trained TextEncoder (via VLLM) and the subsequent OT alignment with learnable tokens leads to a significant performance improvement. This demonstrates that the alignment strategy employed in our model is both effective and sufficient.
- About the shift from CLIP-based alignment to Optimal Transport. The CLIP alignment in the earlier stage only provides a preliminary transfer from the natural image domain to the cervical pathology domain, making further fine-grained matching via OT alignment essential. We introduce Attribute Descriptors for each category because a single image cannot match all descriptors simultaneously; thus, we use OT to identify the best-matching descriptors, unlike CLIP, which computes similarity with all descriptors. To transition from CLIP to OT alignment efficiently in terms of model parameters and training cost, we employ CoOp with learnable tokens for fine-tuning (a sketch of this prompt-tuning step is given below). As shown in Table 2, rows two and four illustrate that adding Attribute Descriptors and OT alignment through prompt tuning after the earlier CLIP alignment (Pretraining with VLLM) improves the experimental results.
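A minimal sketch of CoOp-style prompt tuning with learnable tokens, as referenced in the response above; it assumes a frozen text encoder that accepts token embeddings directly, and the module interface is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt: a few learnable context vectors are prepended to
    frozen per-class token embeddings; only the context vectors are trained,
    while the CLIP image and text encoders remain frozen."""

    def __init__(self, n_ctx, dim, class_token_embeds):
        super().__init__()
        # Learnable context tokens shared across classes, shape (n_ctx, dim)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class/descriptor token embeddings, shape (C, n_tok, dim)
        self.register_buffer("cls_embeds", class_token_embeds)

    def forward(self):
        n_cls = self.cls_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # (C, n_ctx + n_tok, dim), fed to the frozen text encoder
        return torch.cat([ctx, self.cls_embeds], dim=1)

# Toy usage: 5 classes, 8 descriptor tokens of width 512, 4 context tokens
prompts = LearnablePrompt(4, 512, torch.randn(5, 8, 512))()
```

Only `self.ctx` receives gradients during fine-tuning, which keeps the trainable parameter count and training cost low, consistent with the efficiency argument in the rebuttal.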
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A