Abstract

Fatty liver disease (FLD) negatively affects over 30% of the global population and can ultimately lead to cirrhosis and death. Early detection and intervention on the severity of FLD help control its progression. However, facilities for assessing the severity of FLD are lacking in economically disadvantaged regions, highlighting an urgent need for a cost-effective and scalable screening method. Traditional Chinese Medicine (TCM) suggests a strong correlation between tongue characteristics and liver health, positioning tongue diagnosis as a non-invasive means for assessing FLD severity. Establishing an automated tongue diagnosis method holds promise for large-scale and rapid classification of FLD severity among rural populations. In this paper we present a Hard sample Mining-based Tongue Diagnosis Framework (HM-TDF) for multi-class classification of FLD severity. The HM-TDF identifies hard samples using a novel uncertainty estimation approach and addresses them through a multi-expert classifier. We introduce a Multi-source Feature Fusion Kolmogorov-Arnold Network (MFF-KAN) to model the relationship between tongue images plus basic physiological indicators and FLD severity. We propose a three-step training strategy to train this heterogeneous model. We construct and release a novel tongue diagnosis dataset for FLD severity classification, named Tongue-FLD, to advance research in automated tongue diagnosis. Experimental results on this dataset indicate that the proposed method surpasses existing automated tongue diagnosis methods in the classification of FLD severity. Moreover, MFF-KAN effectively visualizes the key pathways from input to output, providing strong interpretability. The dataset and code are available at https://github.com/MLDMXM2017/HM-TDF.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2233_paper.pdf

SharedIt Link: https://rdcu.be/eG4Di

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05182-0_26

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/MLDMXM2017/HM-TDF

Link to the Dataset(s)

Tongue-FLD dataset: https://github.com/MLDMXM2017/HM-TDF

BibTex

@InProceedings{CheTao_Hard_MICCAI2025,
        author = { Chen, Tao AND Gao, Jie AND Xu, Yong AND Qiu, Weihong AND Wu, Yijie AND Ye, Weimin AND Liu, Kunhong},
        title = { { Hard Sample Mining-based Tongue Diagnosis for Fatty Liver Disease Severity Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {260 -- 269}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes a hard sample mining framework (HM-TDF) and a feature fusion network (MFF-KAN) for classifying FLD severity from tongue images and physiological data. It also introduces Tongue-FLD, the largest public dataset for this task, and provides interpretability via network visualization.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Extending tongue-based diagnosis to FLD severity classification is an interesting and meaningful direction. Using uncertainty to guide the classifier is also a reasonable idea, and the release of a new dataset is a valuable contribution to the community.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1) The dataset is highly imbalanced (non-FLD vs. mild vs. moderate/severe = 7.17:2.94:1), but the potential impact of this imbalance on model training or evaluation is not discussed.
2) While it is mentioned that FLD severity was assessed by radiologists, it is unclear whether multiple annotators were involved, whether inter-rater agreement (e.g., Kappa) was measured, or whether any annotation quality control procedures were in place.
3) The reviewer is also unclear about the purpose of citation [15]. Is it referencing the assessment conducted by radiologists, or does it point to a prior study or dataset associated with the authors’ group? If it is the latter, it may compromise the fairness of the double-blind review process.
4) It is not stated whether all methods use the same input modalities (e.g., tongue images and physiological indicators), which may affect the fairness of the comparison.
5) No statistical significance testing is reported, which would strengthen the reliability of the performance claims.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes a novel application of tongue-based diagnosis for FLD severity and contributes a new dataset. While promising, the role of citation [15] raises double-blind concerns. The paper also lacks clarity on data imbalance, annotation quality, and input consistency. Given its potential impact, I recommend a weak accept pending a satisfactory rebuttal.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The paper presents a Hard sample Mining-based Tongue Diagnosis Framework (HM-TDF) for classifying the severity of fatty liver disease (FLD) using tongue images. It introduces a Multi-source Feature Fusion Kolmogorov-Arnold Network (MFF-KAN) and releases a novel dataset (Tongue-FLD) for this purpose. The proposed method outperforms existing approaches and provides clear interpretability through the visualization of key pathways within the network.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Novel Framework: The Hard sample Mining-based Tongue Diagnosis Framework (HM-TDF) is innovative as it identifies hard samples through uncertainty estimation, allowing for targeted classification, which enhances the model’s accuracy in multi-class FLD severity classification.
2. Multi-source Feature Fusion: The introduction of the Multi-source Feature Fusion Kolmogorov-Arnold Network (MFF-KAN) is a significant advancement, as it effectively models complex relationships between tongue images and physiological indicators, improving the interpretability and performance of the diagnosis.
3. Comprehensive Evaluation: The paper includes a robust evaluation against 11 competing algorithms, demonstrating superior performance metrics such as MAE and RMSE, which highlights the clinical feasibility and effectiveness of the proposed method in real-world applications.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Limited Dataset Diversity: The Tongue-FLD dataset consists of 5717 samples, which may not be representative of the broader population, particularly given the imbalance in FLD severity categories (7.17/2.94/1.00). This limitation could affect the generalizability of the model.
2. Complexity of Model Interpretation: While the MFF-KAN provides some interpretability, the overall complexity of the model with numerous learnable weights may hinder understanding and practical application in clinical settings.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

My recommendation is based on several key factors: novelty of contribution, significance of results, and clarity of writing.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

This paper introduces a new tongue diagnosis dataset for 3-class classification of fatty liver disease severity and proposes an automatic diagnosis system that utilizes tongue images and physiological indicators. The authors employ a Kolmogorov-Arnold Network (KAN)-based feature fusion classifier to predict the severity for easy samples and a KAN-based multi-expert classifier for hard samples, which are identified using a moment-of-inertia-based uncertainty estimation method. Additionally, a three-step training strategy is proposed to train the framework. The study demonstrates promising results compared to 11 baseline models on the collected dataset.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The research topic is interesting and clinically valuable, as tongue diagnosis has the potential to serve as a non-invasive and cost-effective method for assessing fatty liver disease severity.
2. Dataset Contribution. The authors plan to release the tongue diagnosis dataset, which will facilitate further advancements in automated tongue diagnosis research.
3. Novel Methodology. The proposed methods are innovative, with Kolmogorov-Arnold Networks (KAN) proving effective for both the feature fusion classifier (FFC) and multi-expert classifier (MEC), as demonstrated in the ablation studies. Additionally, KAN provides a degree of interpretability.
4. Comprehensive Experiments. The experimental results are thorough and demonstrate the effectiveness of the proposed FFC, MEC, and three-step training strategy (e.g., RR-Pretrain). The interpretation part is interesting.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The experiments rely solely on a single dataset, which restricts the ability to evaluate the algorithm’s generalization and reproducibility across diverse datasets.
2. Some of the interpretations appear to rely on subjective judgment, such as identifying tongue characteristics associated with key input nodes in the image feature vector. It is unclear why more objective methods, such as Grad-CAM, were not used.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. Full name should be provided for acronyms at their first appearance, such as IE, DE.
2. The hyperparameter of uncertainty threshold is not discussed in the paper.
3. There is a typo in the figure numbering; “Fig. 5” should be corrected to “Fig. 3.”
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a clinically valuable and innovative approach to fatty liver disease severity classification using tongue images and physiological indicators, with strong experimental results and the promise of releasing a new dataset to advance research in automated tongue diagnosis. The use of Kolmogorov-Arnold Networks adds novelty and interpretability.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We thank all reviewers for their constructive comments. Here are our responses:

Dataset Imbalance [R1–C1, R2–C1] We acknowledge the limited sample size and class imbalance in our dataset. The constrained sample size is mainly due to the high cost of collecting disease-annotated data, which were obtained from a cohort study. We are actively collecting new data from other cohorts to expand the dataset. To address data imbalance, data augmentation and a progressively balanced sampling strategy were applied during training, and a brief description has been added to the revised manuscript. In addition, all metrics except Accuracy are already macro-averaged to ensure balanced performance assessment across classes.

Model Interpretability [R1–C2, R3–C2] Regarding concerns about the complexity of model interpretation, we would like to clarify that MFF-KAN contains only 0.7M parameters, which is far fewer than competing models such as ResNet34 (21.6M). Furthermore, MFF-KAN allows interpretability to focus on the key pathways, thereby facilitating understanding and practical use. While the identification of tongue characteristics involved expert judgment, the key pathways were objectively determined based on significance scores. We appreciate the reviewer’s suggestion to incorporate objective interpretation, such as Grad-CAM. While our primary goal is to understand how specific tongue characteristics influence model predictions, interpreting the regions highlighted by Grad-CAM still requires expert knowledge. We recognize the importance of objective interpretation and will explore standardized methods for identifying tongue characteristics in future work.

Annotation Quality [R2–C2, C3] While direct inter-rater agreement analysis was not feasible with the current data, we implemented several measures to ensure the consistency and reliability of the annotation process. First, the severity of FLD was assessed by professional radiologists, each with at least five years of clinical experience. Second, all raters received standardized training prior to data collection and followed the same diagnostic guideline (citation [15]: Guidelines for the Diagnosis and Management of Nonalcoholic Fatty Liver Disease: Update 2010. Journal of Digestive Diseases, 2011). Since citation [15] is a widely adopted clinical guideline, we are confident that its inclusion does not compromise the integrity of the double-blind review process.

Input Modality [R2–C4] We would like to clarify that all comparison methods used the same input modalities: images and indicators. For models originally designed for unimodal input (M1, M4, M6, M7, and M8), the indicators were embedded using an MLP. This clarification has been incorporated into the revised manuscript.

Performance Evaluation [R2–C5, R3–C1] To enhance the reliability of our performance evaluation, we employed a wide range of metrics and reported results averaged over five-fold cross-validation. To further validate the robustness of our findings, we performed statistical significance testing. Based on the metrics reported in Table 1, the Friedman test yielded a Friedman statistic χ_F² = 40.667 and F_F = 8.2. The critical value at α = 0.05 with degrees of freedom (11, 55) was 2.0, which is much lower than 8.2. This indicates significant performance differences among the models at the 95% confidence level. To promote reproducibility, we will make both the dataset and code publicly available. We acknowledge that evaluation based on a single dataset is insufficient to assess the algorithm’s generalization capability. To further address this limitation, we are collecting data from a more diverse cohort.

Additional Comments [R3] We have provided the full names of all acronyms in the revised manuscript. The uncertainty threshold was set to 50% of the MOI uncertainty, assuming a uniform distribution of predicted class probabilities. We have also corrected errors in figure numbering.
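
The rebuttal's remark that all metrics except Accuracy are macro-averaged can be illustrated with a small sketch. It is not taken from the HM-TDF code: the function names, the toy label counts (72/29/10, chosen only to roughly mirror the 7.17:2.94:1 class ratio cited in the reviews), and the degenerate majority-class predictor are all assumptions made for illustration. It shows why plain accuracy can look acceptable on an imbalanced three-class problem while a macro-averaged metric exposes the neglected classes.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    classes = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / support[c] for c in classes) / len(classes)


# Hypothetical toy labels: 0 = non-FLD, 1 = mild, 2 = moderate/severe.
y_true = [0] * 72 + [1] * 29 + [2] * 10

# Hypothetical degenerate model that always predicts the majority class.
y_pred = [0] * len(y_true)

print(f"accuracy     = {accuracy(y_true, y_pred):.3f}")
print(f"macro recall = {macro_recall(y_true, y_pred):.3f}")
```

Under these assumptions the majority-class predictor is credited for the 72 non-FLD samples by accuracy, whereas macro recall averages one perfect class with two classes at zero, which is the behaviour that makes macro-averaged metrics a more balanced summary on data like this.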

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

The reviewers were unanimously positive on this paper. I urge the authors to take advantage of the thoughtful reviews to further improve and tune their work. I appreciated the roots in TCM with the ML methods.


Hard Sample Mining-based Tongue Diagnosis for Fatty Liver Disease Severity Classification

Author(s):