Abstract
Psoriasis (PsO) severity scoring is vital for clinical trials but is hindered by inter-rater variability and the burden of in-person clinical evaluation. Remote imaging using patient-captured mobile photos offers scalability but introduces challenges, such as variations in lighting, background, and device quality, that are often imperceptible to humans yet may impact model performance. These factors, coupled with inconsistencies in dermatologist annotations, reduce the reliability of automated severity scoring. We propose a framework that uses a gradient-based interpretability approach to automatically flag problematic training images that introduce biases and reinforce spurious correlations, degrading model generalization. By tracing the gradients of misclassified validation images, we detect training samples where model errors align with inconsistently rated examples or are affected by subtle, non-clinical artifacts. We apply this method to a ConvNeXT-based weakly supervised model designed to classify PsO severity from phone images. Removing 8.2% of flagged images improves model AUC-ROC by 5 percentage points (from 85% to 90%) on a held-out test set. Annotation accuracy is commonly ensured through multiple annotators and an adjudication process, which is expensive and time-consuming. Our method correctly detects training images with annotation inconsistencies, potentially eliminating the need for manual reviews. When applied to a subset of training images rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement by rank-ordering the training data and reviewing only the top 30%. This framework improves automated scoring for remote assessments, ensuring robustness and scalability despite variability in data collection. Our method handles inconsistencies in both image conditions and annotations, making it well suited to applications lacking the standardization of controlled clinical environments.
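The gradient-based flagging described in the abstract is in the spirit of TracIn-style influence scores: training images whose loss gradients align with those of misclassified validation images are ranked as candidates for review or removal. Below is a minimal single-checkpoint PyTorch sketch of that idea; it is not the authors' implementation, and the model interface, data handling, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def per_example_grad(model, image, label):
    """Flattened gradient of the loss w.r.t. trainable parameters for one image."""
    logits = model(image.unsqueeze(0))                     # add batch dimension
    loss = F.cross_entropy(logits, label.unsqueeze(0))
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_scores(model, val_image, val_label, train_examples, lr=1e-4):
    """TracIn-style influence of each training image on one misclassified
    validation image: large positive scores flag likely harmful samples."""
    g_val = per_example_grad(model, val_image, val_label)
    scores = []
    for img, lbl in train_examples:
        g_train = per_example_grad(model, img, lbl)
        scores.append(lr * torch.dot(g_train, g_val).item())
    return scores   # rank-order descending; inspect or drop the top-k
```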
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3660_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{PalBas_GRASPPsONet_MICCAI2025,
author = { Pal, Basudha and Kamran, Sharif Amit and Lutnick, Brendon and Lucas, Molly and Parmar, Chaitanya and Patel Shah, Asha and Apfel, David and Fakharzadeh, Steven and Miller, Lloyd and Cula, Gabriela and Standish, Kristopher},
title = { { GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15972},
month = {September},
pages = {233 -- 243}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors proposed an algorithm to remove data that hinders the generalization of the model.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors proposed an algorithm for removing data samples that may hinder model generalization. In particular, they demonstrated that eliminating potentially misleading samples in psoriasis severity classification can improve model performance. This approach contributes to both reducing the manual effort required for data quality assurance and enhancing overall model accuracy.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The interpretation of the results appears to be insufficient. For example, there is no reference to the tables within the main text, which makes it difficult for readers to understand how the quantitative findings support the authors’ claims. Additionally, in Figure 2, the labels 0, 1, and 2 are shown—presumably corresponding to mild (PASI: 0–5), moderate (PASI: 5–10), and severe (PASI >10) categories—but this is not explicitly explained in the figure caption or the text. Clarifying these elements would improve the clarity and accessibility of the results.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The topic is appropriate and the analysis is sound. However, some areas need to be supplemented.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have appropriately addressed my questions.
Review #2
- Please describe the contribution of the paper
Main contribution of this paper: the authors propose an approach to improve the predictive performance of an MIL classifier for psoriasis severity scoring. It uses data attribution to find the images in the training set that are most influential for misclassified examples in the validation set, then re-trains the model without those images.
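The MIL classifier the reviewer refers to is commonly built with attention pooling over per-image embeddings of a patient visit. The sketch below shows a minimal gated-attention MIL head in the style of Ilse et al. (2018); the embedding dimension, class count, and names are assumptions, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Gated-attention MIL pooling: aggregates per-image embeddings from one
    patient visit into a single bag representation for severity classification."""
    def __init__(self, dim=768, hidden=128, num_classes=3):
        super().__init__()
        self.attn_V = nn.Linear(dim, hidden)
        self.attn_U = nn.Linear(dim, hidden)
        self.attn_w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, embeddings):                 # (num_images, dim), e.g. 46 x 768
        a = self.attn_w(torch.tanh(self.attn_V(embeddings)) *
                        torch.sigmoid(self.attn_U(embeddings)))  # (num_images, 1)
        weights = torch.softmax(a, dim=0)          # attention over images in the bag
        bag = (weights * embeddings).sum(dim=0)    # (dim,)
        return self.classifier(bag), weights.squeeze(-1)
```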
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths:
- The framework of applying MIL to clinical images to classify severity seems appropriate for the problem in question.
- The motivation (learning a robust classifier end-to-end that ignores spurious features without the need for extensive pre-processing) is interesting. It would help to highlight in the paper what the actual problems with pre-processing steps like bounding boxes, segmentation, or background removal are. Are they too computationally expensive to run at inference time? Are they themselves subject to degradation across different hospitals, environments, etc.? A potential benefit of your framework that is not discussed as much is that it may provide interpretability that could guide iterative improvement of the data pipeline. For example, if images of a certain body region consistently increase the loss, perhaps the collection protocol for that region needs to change; or if images of patients with a certain skin tone increase the loss, maybe separate models are needed for different skin tones.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Weaknesses:
- It is not clear to me exactly how the optimal number of training images to remove is chosen. In section 2.1, you mention that "the final model is chosen based on the best validation AUC." However, in Table 1, you only show the test set reader 1 and reader 2 AUCs, and not the validation AUC. My concern is that if the validation AUC does not correspond to these, the user would have no way to pick the correct number of images to remove. So, to demonstrate that this model actually improves generalization performance (claim 3 of your key contributions), you should show the validation AUROC in that table as well.
- In section 3.1 you mention the skin tone distribution in your dataset; would it be possible to include information on the baseline and post-intervention performance in the different subgroups, e.g., FST I/II/III vs. FST IV/V/VI? It would be interesting to see, for example, whether the majority of misclassified patient visits in the val or test sets come from a particular skin tone group, PASI severity group, or male/female group.
- Can you clarify whether the train/val/test folds are split such that there is no overlap of patients across the folds? From section 2.1, it sounds as though the patient visits are unique between the folds, but it is less clear whether the patients are unique.
- I know space is very limited in these papers, but it would be great if you specified additional details about the pre-trained ConvNeXT/ViT. As it stands, all you mention is that the encoder is trained on ImageNet: are these pre-trained models from torchvision, or models you pretrained yourself?
- It would be interesting to know more about how the 46 images per patient are collected. Are these all clinical images collected on mobile devices at home, or clinical images collected by camera in clinic? How many images per region (e.g., head and neck, upper limbs, trunk, lower limbs)?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a nice paper! The reason I am leaning towards weak accept rather than a stronger score is (A) the lack of details on algorithmic hyperparameters (e.g., the number of images to be removed) and on training, combined with (B) the lack of open-source code; together these decrease my confidence in the reproducibility and rigor of the method.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper mainly integrates TracIn, which is a well-known approach for influence functions, into a multi-instance learning framework for PsOriasis severity classification.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Well-structured manuscript.
- The authors target a relatively unexplored domain, psoriasis severity assessment from uncontrolled mobile images, which can be clinically valuable.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The paper mainly integrates TracIn, a well-known influence-function approach, into a multi-instance learning framework. Both TracIn and MIL attention are existing methods, so from a purely methodological point of view this paper is more an application of prior work. The pipeline of training a baseline model, finding misclassified validation samples, ranking influential training samples, removing them, and retraining has been conceptually explored in other studies.
2. The authors did not compare their TracIn-based method with simpler heuristics. In my opinion, without such a comparison the benefit of the full gradient-tracing method may be overlooked.
3. The paper does not quantify time or GPU cost; for example, how quickly were the top-k images identified?
4. More insight seems necessary regarding Fitzpatrick skin type. Can the authors provide more information about potential bias?
5. Regarding generalization to other dermatological conditions, the authors should show more results to demonstrate the transferability of the model.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Lack of novelty
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their valuable feedback and recognition of the innovation, impact, and clinical relevance of our work. We have addressed their comments below, broken down by reviewer and section.
R1: 7: R1 observed (1) missing table references in the text and (2) the absence of PASI classification thresholds [mild (0–5): class 0, moderate (5–10): class 1, severe (>10): class 2] in the Fig. 2 caption. We will revise the final version to include both.
R2: 6.1: Mobile images taken by patients are highly diverse, so segmentation and background removal are challenging. Since preparing this manuscript, we have been experimenting with methods to incorporate more targeted information into our model.
6.2: GRASP-PsONet removed at most 14 of 46 training images per selected patient visit. We removed 3.18–7.61% of images for FST I–V, while FST VI showed a higher rate of 13.04%. As suggested, interpretability analysis can be explored further in the future to guide pipeline improvements.
7.1: Validation AUROC is used to decide how many images to remove. Removing the top 300 most "harmful" samples per misclassified patient visit yielded the highest validation AUROC of 89.2% with ConvNeXT, versus 81.4% (top 100), 82.6% (200), 78.7% (400), and 76.6% (500). This trend aligns with test performance in Table 1 and will be included.
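The sweep described in 7.1, retraining after dropping the top-k ranked images and keeping the k with the best validation AUROC, could be organized as in the sketch below. The `fit` and `predict_proba` callables and the data layout are assumptions for illustration, not the authors' code.

```python
from sklearn.metrics import roc_auc_score

def choose_num_removed(candidate_ks, ranked_ids, train_set, val_set, fit, predict_proba):
    """Sweep the number of flagged training images to drop and keep the value of k
    that maximizes validation AUROC. Examples are (id, image, label) tuples."""
    best_k, best_auc = None, -1.0
    for k in candidate_ks:                              # e.g. [100, 200, 300, 400, 500]
        drop = set(ranked_ids[:k])                      # most harmful first
        pruned = [ex for ex in train_set if ex[0] not in drop]
        model = fit(pruned)                             # user-supplied training routine
        y_true = [y for (_, _, y) in val_set]
        y_prob = predict_proba(model, val_set)          # shape: (n_val, n_classes)
        auc = roc_auc_score(y_true, y_prob, multi_class="ovr")
        if auc > best_auc:
            best_k, best_auc = k, auc
    return best_k, best_auc
```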
7.2: Performance analysis on subgroups such as skin tone shows that on the test set our framework achieved AUCs of 90.5% for FST I–III (6624 images) and 88.0% for FST IV–VI (1196 images), compared to baseline AUCs of 84.7% and 87.6%. Type V showed the lowest performance due to limited data, while Types I, IV, and VI exceeded 90%. We will include the subgroup performance improvements in the final version.
7.3: We confirm that the train, val, and test splits have no patient overlap.
7.4: ConvNeXT and ViT encoders were initialized with ImageNet pretrained weights via the timm library.
7.5: Each visit includes 46 mobile images taken by patients at home. The breakdown across body regions is: head & neck (12), trunk (10), lower extremities (13), and upper extremities (18); some images are shared between regions (e.g., head & neck and trunk).
12: Due to proprietary workflows and compliance constraints, we cannot release the full code immediately but are exploring a simplified version pending approval.
R3: 7.1: To address R3’s concern, we clarify that while TracIn and MIL are established individually, ours is the first integration of TracIn within an MIL framework for clinical disease severity classification. Unlike prior methods, we adapt TracIn to a weakly supervised clinical task to remove misleading samples without manual supervision or preprocessing.
7.2: Simpler heuristics like random removal or loss-based filtering do not capture the influence of individual training samples on predictions. GRASP-PsONet traces validation errors to training samples using gradient-based attribution, enabling selective image-level pruning within patient visits and improving model reliability and clinical relevance.
7.3: Identifying the top 500 influential samples using our framework takes under an hour at ~$8/hr (NVIDIA A100), while manual annotation takes an hour per visit at $450/hr for board-certified dermatologists, making our method more scalable and cost-efficient.
7.4: Our dataset is from a clinical trial with a low prevalence of dark skin tones. Thus, in our analysis we stratify by disease severity and Fitzpatrick skin tone using a nested two-step process and use weighted sampling. We are currently testing on another dataset with a higher prevalence of dark skin tones to further explore potential bias.
7.5: PsO severity assessment via remote imaging is central to dermatological drug trials. Mobile PsO images are susceptible to spurious correlations, motivating influence-aware training. While tailored to PsO, the method may generalize to similar diseases if retrained on disease-specific data.
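The weighted sampling mentioned in 7.4 could be set up along the following lines; this is a minimal PyTorch sketch with assumed grouping keys, not the authors' exact nested stratification procedure.

```python
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(strata):
    """strata: one (severity_class, fitzpatrick_group) tuple per training visit.
    Rare severity x skin-tone combinations are drawn more often during training."""
    counts = Counter(strata)
    weights = torch.tensor([1.0 / counts[s] for s in strata], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(strata), replacement=True)
```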
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Vox populi: all three reviewers voted in favor of acceptance. One reviewer was a clinician, whose opinion carries high value. Also, the author response was very good.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A