Abstract
Soft tissue sarcomas (STS) are a rare and heterogeneous group of malignant tumors that arise in soft tissues throughout the body. Accurate classification from whole slide images (WSIs) is essential for diagnosis and treatment planning. However, STS classification faces a significant challenge due to patient-specific biases, where WSIs from the same patient share confounding non-tumor-related features, such as anatomical site and demographic characteristics. These biases can lead models to learn spurious correlations, compromising their generalization. To address this issue, we propose a novel multiple instance learning framework that explicitly mitigates patient-specific biases in WSIs. Our method leverages supervised contrastive learning to extract patient-specific features and integrates a bias-mitigation strategy based on propensity score matching. Extensive experiments on two STS datasets demonstrate that our approach significantly improves classification performance. By mitigating patient-specific biases, our method improves the reliability and generalization of the model, contributing to more accurate and clinically reliable STS classification. To facilitate direct clinical application and support decision-making, the code, trained models, and testing pipeline will be publicly available at https://github.com/Lanman-Z/MPSF.
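For readers skimming the abstract, a minimal sketch of how the patient-level supervised contrastive objective it describes might look in PyTorch. This is a reconstruction from the abstract's and reviews' wording, not the authors' released code; all names (patient_supcon_loss, patient_ids, temperature) are illustrative assumptions.

# Hypothetical sketch of a patient-level supervised contrastive loss:
# positives are slide embeddings from the same patient, negatives from
# different patients, with both restricted to the same diagnosis so the
# loss isolates non-diagnostic (patient-specific) features.
import torch
import torch.nn.functional as F

def patient_supcon_loss(z, patient_ids, class_labels, temperature=0.07):
    # z: (N, D) slide-level embeddings; patient_ids, class_labels: (N,) int tensors.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    same_class = class_labels.unsqueeze(0) == class_labels.unsqueeze(1)
    same_patient = patient_ids.unsqueeze(0) == patient_ids.unsqueeze(1)
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = same_class & same_patient & ~eye   # positives: same patient, same diagnosis
    cand = same_class & ~eye                 # denominator restricted to the same diagnosis
    sim = sim.masked_fill(~cand, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    n_pos = pos.sum(dim=1)
    valid = n_pos > 0                        # anchors with at least one positive
    loss = -(log_prob.masked_fill(~pos, 0.0).sum(dim=1)[valid] / n_pos[valid]).mean()
    return loss

Restricting both positives and negatives to the same diagnosis is the key design choice the reviews highlight: with the label held constant, the only signal left to discriminate pairs is patient identity.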
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3339_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LinWei_Enhancing_MICCAI2025,
author = { Lin, Weiping and Zhu, Runchen and Hou, Wentai and Wang, Jiacheng and Lin, Yixuan and Chen, Rui and Ta, Na and Wang, Liansheng},
title = { { Enhancing Soft Tissue Sarcoma Classification by Mitigating Patient-Specific Bias in Whole Slide Images } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {162--171}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper addresses the issue of patient-specific bias in soft tissue sarcoma (STS) classification using whole slide images. The authors propose a model-agnostic MIL framework that uses supervised contrastive learning to extract patient-level biases and a prototype-based strategy to suppress these biases during classification. Experiments on two STS datasets show performance improvements across multiple MIL backbones.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Patient specific biases is an important concern, often overlooked in WSI-based MIL pipelines.
- The method integrates well with standard MIL methods, demonstrating broad applicability.
- The method achieves consistent improvements across multiple datasets and MIL algorithms.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Potential label leakage with contrastive learning: The contrastive learning framework pulls together WSIs from the same patient (which share the same label). This creates a perfect correlation between patient ID and subtype, which means the pretraining stage may encode label information.
- The paper claims that the method “supports clinical decision making”, enabling “direct application in clinical settings”. However, the method requires multiple WSIs per patient for the contrastive learning stage, which is unrealistic for clinical workflows (unless supported by a prior research study).
- The release of code and models is listed as a contribution (multiple times in the paper!!). This is a future promise rather than a scientific result and does not count as a contribution. Moreover, code release is now standard practice to ensure reproducibility.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces a promising framework that integrates well with existing MIL methods. However, there is a serious potential flaw in the contrastive pretraining stage.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper introduces a method to reduce the chance of predictive models learning spurious correlations due to biases in the data. This is achieved by learning patient-specific features and then injecting these into the data at training time to ensure these features can’t be used to solve the classification task. The authors apply this technique to the task of Soft Tissue Sarcoma Classification and show improvements in generalization performance compared to the baseline.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The idea for this method is interesting and to my knowledge, novel.
- There is a need for methods such as this, to improve learning of robust features.
- Results show a modest performance improvement.
- Paper is well written
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- For the training of the PSFE, the authors state that positive pairs are generated at the WSI level, implying that multiple WSIs per patient are needed for this method. Is this correct? Could it be possible to generate pairs at the tile level?
- The patient-specific features that are meant to de-bias the training are concatenated to the patch embedding. In this case, it seems possible that at training time the model could learn to completely ignore those extra dimensions, thus returning to the baseline problem. Is this something that can be countered?
- For the results tables, it would be good to also get a sense of the variation, not just the mean across the CV folds.
- The silhouette score analysis is not super convincing. While the scores are decreasing, visually the patients appear clustered. The low number of patients used (N=5) is an issue. (A sketch of the silhouette computation follows this list.)
- For the analysis of the attention regions, a more quantitative analysis should be done. For example, comparing distribution of attention scores across all test slides between diagnostic and non-diagnostic regions. The example shown is not very convincing.
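As a reference point for the silhouette discussion above, a minimal sketch of how per-patient clustering quality could be quantified with scikit-learn. Grouping slide embeddings by patient ID is an assumption based on the review's description of Fig. 3, and all names are illustrative.

# Minimal sketch: silhouette score of slide embeddings grouped by patient ID.
# A decreasing score after de-biasing (as the paper reports) would indicate
# that slides from the same patient are less tightly clustered, i.e. the
# representations are less patient-identifiable. Names are assumptions.
from sklearn.metrics import silhouette_score

def patient_silhouette(embeddings, patient_ids):
    # embeddings: (num_slides, D) array; patient_ids: (num_slides,) labels.
    return silhouette_score(embeddings, patient_ids, metric='cosine')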
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Interesting and relevant idea. There is room for improvement and additional results especially in a longer format paper. Despite some minor flaws, I consider it a valuable addition.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This study presents a method to learn debiased WSI representations. A contrastive objective sets up positive WSI pairs from the same patient and negative pairs from different patients. All pairs have the same diagnosis, thus forcing the model to learn biased features unrelated to diagnosis. The resulting slide representations are clustered into prototypes, and a linear combination of these (with weights inverse to their similarity to the current slide-level embedding) is concatenated to each patch-level embedding in a supervised objective. On a small sarcoma dataset, their method outperforms baseline models.
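To make this summarized pipeline concrete, a hypothetical reconstruction of the prototype step as the review describes it: cluster patient-specific slide embeddings into k prototypes, weight each prototype inversely to its similarity with the current slide embedding, and concatenate the mixture to every patch embedding. k=10 follows the authors' rebuttal; the k-means choice, cosine similarity, and all names are assumptions, not the authors' implementation.

# Hypothetical reconstruction of the prototype-based de-biasing step.
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(slide_embeddings, k=10, seed=0):
    # slide_embeddings: (num_slides, D) patient-specific features from the PSFE.
    return KMeans(n_clusters=k, random_state=seed).fit(slide_embeddings).cluster_centers_

def confounding_feature(slide_emb, prototypes, eps=1e-8):
    # Mix prototypes with weights inverse to cosine similarity, so that
    # prototypes dissimilar to the current slide contribute more.
    sims = prototypes @ slide_emb / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(slide_emb) + eps)
    inv = 1.0 - sims
    w = inv / inv.sum()
    return w @ prototypes  # (D,) confounder vector Z(i)

def augment_patches(patch_embeddings, z):
    # Concatenate the confounder to each patch embedding before the MIL head.
    z_tiled = np.tile(z, (patch_embeddings.shape[0], 1))
    return np.concatenate([patch_embeddings, z_tiled], axis=1)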
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Subtle but powerful novelty: Very interesting use of contrastive learning. Negative pairs are diagnostically the same but from different patients; positive pairs are diagnostically the same and from the same patient. This forces the model to focus on features unique to individual patients rather than on features predictive of diagnosis, since there is a push from the negative pairs and a pull from the positive pairs with respect to the diagnosis. Moreover, in order to discriminate negative pairs, the model must learn something from the images not related to diagnosis. This is quite subtle and clever.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Unclear motivation and unsubstantiated claims: The authors call out “patient-specific biases” in the introduction as one of the shortcomings of previous MIL methods but do not define what that means. Are these biases demographic, behavioral, reporting, recording, etc.? A simple Google search does not render a definition either. This is defined loosely as “occurrence site and demographic characteristics” later, but the authors do not provide any citations to support their original claim. This undermines the motivation for the study.
Contradictory methodological choice: It is unclear why the authors choose to concatenate slide-level debiased features extracted from their pre-trained aggregator to patch-level embeddings. It would make more sense to use the debiased features in a linear probe or clustering directly in a fully-supervised manner. Moreover, the method ends up using the patch-level features anyway, which are presumably biased.
Likely limited general applicability: The method relies on multiple slides per patient, which is standard in pathology practice but almost non-existent in publicly available datasets. Moreover, under more common circumstances, this method may end up deteriorating slide-level representations, since slides with different stains and different tissue types within the same case are very common in pathology, and these would end up forming positive pairs. The information pathologists glean from each slide is more often than not unique.
No statistical comparisons or measures of variance: The authors state in the introduction that there is significant improvement but do not perform any statistical tests to prove that. Moreover, they do not report standard deviation or confidence intervals across their five-fold cross validation. Readers must know an estimate of variance in order to clearly understand whether there is significant improvement.
Qualitative attention maps: The authors compare heatmaps using their method and a baseline model against a pathologist’s annotations to suggest that their model learns more diagnostically relevant regions. This is n=1 and is not a strong piece of evidence for what the authors claim. Alternatively, the authors could measure some statistic across multiple annotated images in order to bring more strength to their observation.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Unclear definitions: From the current definition of the dataset, positive pairs are formed from WSIs of the same patient and negative pairs from different patients. Pairs are filtered to have the same class.
Table 2 clarification: The authors do not mention what dimension the ablation study was performed with.
Unspecified critical parameter: The authors select k prototypes from clustered slide-level embeddings in order to create the confounding features Z(i) but do not specify k.
Future work: The proposed method does not resolve all biases in the data. Biases that would persist in the current setting include disease prevalence, scanner, staining, institutional, temporal, and treatment-induced biases, as well as batch effects. It would be interesting to mitigate these in future work; if this metadata were available, it would be relatively easy to adapt the method to reduce these biases as well.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a score of 5 for this paper due to its subtle but important conceptual innovation in using contrastive learning to isolate and mitigate patient-specific biases in WSI representations—an underexplored yet critical area in computational pathology. The idea of forming positive and negative pairs with the same diagnosis but different patient origins cleverly forces the model to disregard features correlated with diagnosis, offering a novel approach to representation learning. While the paper has weaknesses—including unclear motivation for some methodological choices, vague definitions, and lack of statistical rigor—its central idea is strong and promising. These limitations are fixable and do not detract from the paper’s potential contribution to the field, justifying acceptance with minor revisions.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal mostly assuaged my concerns. Concerns regarding motivation, methodological design, and the limited scale of qualitative results were fully addressed, and the authors even went so far as to address neutral comments (i.e., beyond just weaknesses). The rebuttal pushes back against reporting variance in test performance, claiming that TCGA is sufficiently large, which is ill-advised given that many studies on TCGA report variance in test performance. My concern regarding slide-level representation deterioration in multi-stain scenarios also was not addressed. However, given that the core idea elevated this paper to a high ranking in my stack and most of my concerns were addressed (even neutral comments), I believe the paper should be accepted.
Author Feedback
We apologize for some missing details and thank the reviewers for their valuable suggestions. We will revise accordingly in the final version if allowed. Four common questions are denoted CQ1–CQ4; R1-W1 (O1) refers to Reviewer 1’s first weakness (optional comment).
CQ1: We concatenate slide-level debiased features with patch embeddings because patient-specific biases are more prominent at the slide level. This fusion guides patch attention learning and enhances interpretability.
CQ2: The results show that PSFE works effectively, at least for STS. This concern is valid and inspiring. TCGA contains many patients with multiple WSIs, enabling the training of a pan-cancer PSFE. This could help alleviate the model’s limited general applicability and support broader use in cancers similar to STS.
CQ3: Results in Tables 1 and 2 are based on 5-fold cross-validation with std < 0.04. Paired t-tests confirm statistically significant improvements over the baseline (p < 0.05).
CQ4: In Fig. 4, we computed the average attention scores of patches inside and outside expert-annotated regions (in_score vs. out_score). Our method yields the largest “in_score – out_score” gap, indicating better focus on diagnostic regions.
R1-W1: In our paper, patient-specific biases refer to non-diagnostic features such as sampling site or demographic information. We will clarify this definition and add supporting citations.
R1-W2: Please first refer to CQ1. We will also explore your insightful advice.
R1-W3: Please refer to CQ2. R1-W4: Please refer to CQ3.
R1-W5: Due to limited expert annotations, we showed the visualization for one case. We are working to obtain more annotations. We also quantified this result and will include it (refer to CQ4).
R1-O1: In Eq. 1, our explanation that pairs must share the same pathological label came too late and was unclear. We will revise it based on your comments in “6. Strengths”.
R1-O2: d is the dimension of the patient-specific features from the PSFE.
R1-O3: We apologize for the oversight. We set k=10, as preliminary experiments with different k values showed that k=10 yielded the best performance.
R1-O4: Really excellent point. We will extend our work to address these biases in the future.
R2-W1: Yes, this is correct and easy to realize; CQ2 offers another feasible solution. Tile-level pairing may better capture nuisance factors such as staining variations, which are visible at the tile level. Your suggestion is insightful, and we will explore tile-level PSFE in future work for more consistent feature alignment.
R2-W2: During training, we observed non-zero, varying gradients for the neurons related to the patient-specific features (PSF). At test time, replacing the PSF with random values causes a performance drop. Incorporating these features improves performance and patch attention learning.
R2-W3: Please refer to CQ3.
R2-W4: In Fig. 3, before using N=5, we also tested N=30, and the silhouette scores still showed a consistent decreasing trend. However, with N=30, the large number of clusters made the plot visually cluttered, so we chose the 5 patients with the most slides for clearer visualization.
R2-W5: Please refer to CQ4.
R3-W1: In Eq. 1, we introduce a constraint that the two images of a pair (both positive and negative pairs) must share the same pathological label. This constraint appeared too late (last sentence of page 4). It ensures that pathological features are similar, making it infeasible to distinguish positive/negative pairs based on pathological features; Reviewer 1 also agreed with this explanation. Moreover, the test set is not involved in the training of the PSFE.
R3-W2: Prior STS classification methods often relied on pathologist-selected patches or covered limited subtypes. Our method operates directly on WSIs and covers more subtypes, offering greater clinical relevance. For concerns about requiring multiple slides, please see CQ2. The PSFE is trained once and is reusable; only a single slide is needed at inference.
R3-W3: We will revise it.
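The rebuttal’s CQ3 and CQ4 describe two simple statistics; a sketch of both follows, under assumed variable names and data layout (per-fold metric arrays; per-patch attention scores with a boolean in-region mask), neither taken from the authors’ code.

# Sketch of the two statistics cited in the rebuttal (CQ3/CQ4): a paired
# t-test over cross-validation folds, and the gap between mean attention
# inside vs. outside expert-annotated diagnostic regions. All names and
# data layouts are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel

def fold_significance(ours, baseline):
    # ours, baseline: per-fold metric arrays of equal length (e.g. 5 folds).
    t, p = ttest_rel(ours, baseline)
    return t, p  # p < 0.05 => statistically significant paired improvement

def attention_gap(attn_scores, in_region_mask):
    # attn_scores: (num_patches,) attention weights for one slide;
    # in_region_mask: boolean mask of patches inside the annotated region.
    in_score = attn_scores[in_region_mask].mean()
    out_score = attn_scores[~in_region_mask].mean()
    return in_score - out_score  # larger gap => better diagnostic focus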
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors have addressed most of the reviewers’ concerns in their rebuttal. I recommend acceptance
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The majority of reviewers vote for acceptance.