Abstract
Effective therapy decisions require models that predict the individual response to treatment. This is challenging since the progression of disease and response to treatment vary substantially across patients. Here, we propose to learn a representation of the early dynamics of treatment response from imaging data to predict pathological complete response (pCR) in breast cancer patients undergoing neoadjuvant chemotherapy (NACT). The longitudinal change in magnetic resonance imaging (MRI) data of the breast forms trajectories in the latent space, serving as a basis for prediction of successful response. The multi-task model represents appearance, fosters temporal continuity and accounts for the comparably high heterogeneity in the non-responder cohort. In experiments on the publicly available ISPY-2 dataset, a linear classifier in the latent trajectory space achieves a balanced accuracy of 0.761 using only pre-treatment data (T0), 0.811 using early response (T0+T1), and 0.861 using four imaging time points (T0 -> T3). The full code can be found here: https://github.com/cirmuw/temporal-representation-learning
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3197_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/cirmuw/temporal-representation-learning
Link to the Dataset(s)
N/A
BibTex
@InProceedings{JanIva_Temporal_MICCAI2025,
author = { Janíčková, Ivana and Tan, Yen Y. and Helbich, Thomas H. and Miloserdov, Konstantin and Bago-Horvath, Zsuzsanna and Heber, Ulrike and Langs, Georg},
title = { { Temporal Representation Learning of Phenotype Trajectories for pCR Prediction in Breast Cancer } },
booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15974},
month = {September},
pages = {611 -- 621}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors present a multi-task framework to learn and extract latent features from longitudinal MRI in order to predict pathological complete response (pCR) to neoadjuvant chemotherapy in breast cancer.
MLP-projected latent features are extracted at the bottleneck of a U-Net that is trained using three losses:
- Lrec: MSE image to image reconstruction loss
- Ltemp: triplet loss where anchor and positives are two views of the same image, negative is another time point of the same patient
- Lalign: ensuring that all responder patients are close in the latent feature space
An additional attention module is used to focus on tumoral regions.
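As a rough illustration of the three objectives listed above, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the Euclidean distance, the margin value, the centroid-based form of the alignment term, and the equal loss weights are all assumptions made for illustration only.

```python
import math

def mse(x, y):
    """L_rec (sketch): mean squared error between two flat vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """L_temp (sketch): anchor and positive are two views of one image,
    negative is another time point of the same patient. Margin is assumed."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def alignment_loss(embeddings, is_responder):
    """L_align (assumed form): mean squared distance of responder embeddings
    to their own centroid. Non-responders are left unconstrained."""
    resp = [z for z, r in zip(embeddings, is_responder) if r]
    if not resp:
        return 0.0
    dim = len(resp[0])
    centroid = [sum(z[d] for z in resp) / len(resp) for d in range(dim)]
    return sum(euclidean(z, centroid) ** 2 for z in resp) / len(resp)

def total_loss(l_rec, l_temp, l_align, w_rec=1.0, w_temp=1.0, w_align=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return w_rec * l_rec + w_temp * l_temp + w_align * l_align
```

In this sketch only responder embeddings contribute to the alignment term, which mirrors the review's description that responders are pulled together while non-responders are not.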
The method is evaluated on a subset of the ISPY-2 dataset with four time points and 585 patients. Axial 2D images are obtained from maximum intensity projection of dynamic contrast-enhanced MRI. The method is compared to TE-SSL, a method proposed at MICCAI 2024 for longitudinal imaging analysis in the context of Alzheimer's disease.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is generally well written. It provides convincing performance analyses and ablation studies demonstrating the importance of all four components of the approach (3 losses + attention). The quantitative evaluation highlights a clear performance advantage for the proposed approach. It could be an important way to leverage longitudinal data efficiently when compared to e.g. delta radiomics.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While generally well written, several important aspects need further clarification in order to assess the relevance of the approach as well as to ensure reproducibility.
A first major aspect is the lack of justification for why responders must be aligned or similar while non-responders must not be. While we could imagine that response leads to disappearance of the tumor, it remains far from evident that responders should be more similar to each other than non-responders are. This is especially problematic as the radiological expectations of pathological CR are not mentioned. In particular, is pCR expected to lead to complete disappearance of the lesions? If yes, are the models based on T0->T3 “cheating”, since it is enough to find that no lesion is present at T3?
Concerning the triplet loss, anchor and positives are two views of the same image, but what do these views correspond to?
How are axial 2D images selected? Is it manual? How does this affect the performance?
Is the MLP trained to predict pCR? What is the dimensionality of its input? Does it correspond to the bottleneck of the U-Net?
When inputting multiple time points to the logistic regression model, are the features concatenated? If yes, the risk of overfitting is large, as the number of features will largely exceed the number of patients.
How is the logistic regression model trained? Does it use the same split mentioned in Section 3 (Section ISPY-2 Dataset)?
How imbalanced are the classes between pCR and non-pCR?
How do irregular time gaps between MR images impact performance?
The abstract lacks details on the models and method used.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper proposes an interesting approach to use longitudinal imaging for response prediction, a significant lack of both justification of the approach and methodological details hinders the relevance and reproducibility of the proposed approach.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have partially clarified missing information in the paper and plan to add crucial information into the manuscript.
Review #2
- Please describe the contribution of the paper
This study investigates the use of temporal representation learning on longitudinal MRI data to predict pathologic complete response (pCR) to neoadjuvant chemotherapy (NACT) in breast cancer. Using a subset of 585 patients from the publicly available I-SPY2 dataset, the authors develop and evaluate a model that integrates imaging data across up to four timepoints. The work contributes to the development of models aimed at predicting treatment response over time, with a focus on clinical utility.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The use of a public dataset (I-SPY2) enables reproducibility and comparison with future approaches.
The authors indicate that code will be made available, which supports transparency and external validation.
The focus on early prediction of pCR during NACT is clinically relevant, as it could support timely treatment adjustments based on expected response.
The dataset is split into training (70%), validation (10%), and test (20%) cohorts stratified by pCR label, resulting presumably in a test cohort of 117 patients, which is appropriate.
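The split arithmetic above can be checked with a short sketch. The code below is a hypothetical stratified 70/10/20 split in plain Python; the function name, the seed, and the rounding rule are assumptions, and the label counts (193 pCR, 392 non-pCR) are only an approximation derived from the 33% pCR rate mentioned in the author feedback, not figures taken from the paper.

```python
import random

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists with similar label proportions.

    Each class is shuffled and cut separately, so the pCR rate is roughly
    preserved in every part. Test and validation sizes are rounded per class;
    the remainder goes to training.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(fractions[2] * len(idx))
        n_val = round(fractions[1] * len(idx))
        test += idx[:n_test]
        val += idx[n_test:n_test + n_val]
        train += idx[n_test + n_val:]
    return train, val, test

# Illustrative labels only: about one third pCR (1) among 585 patients.
labels = [1] * 193 + [0] * 392
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 410 58 117
```

Under these assumed counts the test cohort has 117 patients, matching the reviewer's estimate; the actual sizes depend on how the authors rounded each stratum.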
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Key information about the patient cohort is missing. While 585 patients from I-SPY2 are included, no details are provided regarding inclusion/exclusion criteria, the variety of NACT protocols used (I-SPY2 involves multiple “treatment arms”), cancer subtypes, patient demographics, or the prevalence of achieving pCR. This lack of detail limits the interpretability and reproducibility of the results.
The utility of using all four imaging timepoints for prediction is unclear. Predicting pCR at the final pre-surgical timepoint is of limited clinical value, as in most cases the absence of residual disease is readily apparent and one does not need AI to determine this.
The practice of bolding values in performance tables is uninformative in the absence of statistical testing to demonstrate significant differences. Statistical proof must be provided to support claims of superiority, also noting that a correction for multiple comparisons is required when comparing multiple models on the same data.
The area under the precision-recall curve cannot be interpreted without knowing the prevalence of pCR in the test cohort.
The ablation study appears to have been performed on the test set. This repurposes the test cohort as a validation set and compromises the independence of the final evaluation. As a result, the performance metrics reported in Table 2 (using the “best” model from the ablation study) are likely biased and may overestimate true model generalizability.
The use of standard deviations to report performance variation (e.g., for AUC) is less informative than 95% confidence intervals. Confidence intervals provide clearer insight into statistical uncertainty and whether the model reliably outperforms chance (i.e., whether the CI includes 0.5).
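To make the confidence-interval suggestion concrete, here is a minimal sketch of a percentile bootstrap for AUROC in plain Python. The percentile bootstrap is one common choice and is an assumption here; neither the review nor the paper specifies a procedure, and the function names, number of resamples, and seed are hypothetical.

```python
import random

def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample has a single class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Calling `bootstrap_ci(test_labels, test_scores)` on held-out predictions returns a lower and upper bound; if the lower bound stays above 0.5, the interval excludes chance-level performance, which is the check the reviewer describes.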
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Promising work but important details are missing and generalizability is questionable.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While the clinical utility is unclear given the multiple ‘treatment arms’ in I-SPY2 (the question that’s being answered in this paper is the prediction of pathological complete response to a mixture of therapies), it’s an interesting preliminary analysis.
Review #3
- Please describe the contribution of the paper
The main contribution of this paper is the development of a self-supervised representation learning method that leverages longitudinal contrast-enhanced MRI data to accurately predict pathological complete response in breast cancer patients undergoing neoadjuvant chemotherapy. This is crucial clinical information for patient care. The key innovations include the design of a latent temporal representation of the response. The paper introduces a triplet loss specifically designed for temporal representation learning, an alignment loss that accounts for the heterogeneity among non-responders, and the use of the MTAN attention module to highlight response-specific features.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper introduces a novel formulation specifically tailored to capture longitudinal changes in MRI data using self-supervised learning. This approach significantly outperforms previously reported methods on the ISPY-2 dataset.
The integration of the MTAN module to refine feature extraction for the task-specific components is insightful and effective, clearly improving the interpretability and clinical relevance of extracted features.
Results are strong and clearly surpass current state-of-the-art methods (e.g., Jing et al., 2024; Zhang et al., 2024).
The application of using only the T0 scan is extremely promising for clinical usage.
The manuscript is well-organized, logically structured, and clearly written. The authors explicitly state that they will make the source code available upon acceptance, and they clearly outline the model architecture and training protocols. This significantly enhances the reproducibility of their findings.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While the approach is promising, the choice of using a 3-channel MIP of 3D MRI data instead of full volumetric analysis could limit the representation capacity of the learned features. This decision was likely due to computational constraints but could benefit from further justification and exploration.
The generalization capacity of the proposed method is unclear since the evaluation is limited to a single dataset (ISPY-2) and a single data split. The authors should clarify how they plan to ensure generalizability, ideally by testing on external datasets or through cross-validation strategies.
Clinical translation potential, although suggested, is not thoroughly discussed in terms of practical feasibility and implications. What are the targeted metrics for potential clinical use?
Comparison with SOTA approaches is described, yet on different test datasets. Running the standard methods on the same dataset would greatly ease comparison of the results.
Jing et al., 2024 show large improvement when using clinical features (age, hormonal status, …), even outperforming image data only. Could those features be included in the linear classification?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Further discussions on the clinical applicability of your method, especially how your predictions could concretely impact treatment decisions, would enhance the paper.
It is encouraged to validate the method with external datasets, if feasible, or conduct additional robustness analyses.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper presents an innovative approach with substantial empirical validation, clearly demonstrating superior performance in predicting pCR compared to existing approaches. The method has promising implications for clinical applications, particularly in early response prediction. Issues around method robustness, dataset-specific results, and broader generalizability could still require careful investigation during the rebuttal phase.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Novel method, simple implementation. Important topic. This could benefit a large community.
Author Feedback
We thank all reviewers for their constructive and to-the-point feedback. We appreciate R1’s recognition of our method’s performance and ablation studies, R2 and R3’s emphasis on the clinical value of early prediction, and their suggestions for improvement. In the following we address the questions and concerns raised. We will clarify the manuscript accordingly, and the code will be made publicly available.
CLINICAL RELEVANCE (R1 and R2 ask about the clinical utility of prediction from all four imaging timepoints): pCR is not directly inferred from imaging, and [23] (MICCAI’24 paper) shows that prediction remains challenging even at late timepoints. Early prediction at T0 or T1 could steer early treatment, but even prediction at T3 can inform invasive surgery decisions: pCR occurs in a substantial number of patients, possibly rendering surgery avoidable, but is currently only confirmed after surgery [15,18]. We benchmark predictions at T0, T0+T1, and T0->T3 (Tables 1&2).
MODELLING ASSUMPTIONS (R1 requests clarification of our modelling assumptions and alignment strategy): We agree this point warrants further clarification and will expand on the rationale outlined in the paper. In the paper's empirical results, aligning only the pCR cohort outperforms aligning both cohorts (Table 1, L_Art*). A possible reason is that non-pCR cases are more heterogeneous, including partial responders, stable disease, and progressors, each with distinct imaging profiles under RECIST criteria [Kitajima et al., 2018], while pCR patients are more homogeneous.
EVALUATION (R2 asks about the evaluation procedure): As described in the Implementation section, all model selection was performed using only the validation set. The separate test set was held out and only used for final reporting. We will clarify this further.
STATISTICAL REPORTING (R2 asks for statistical evaluation): Results are reported as mean ± standard deviation across runs to reflect variability, consistent with related work. The performance gains, often exceeding 10–20 percentage points, are large relative to the standard deviations and consistent across all metrics: for T0 (Table 1), our model achieves 0.764 ± .01 AUROC, 0.565 ± .02 PRAUC, and 0.761 ± .01 balanced accuracy, outperforming the L_TESSL baseline [21, MICCAI’24 paper] (0.625 ± .01, 0.367 ± .02, 0.526 ± .01). On the T0+T1 task, we reach 0.802 AUC, outperforming the imaging-based result of 0.706 reported in [10], which used T0+T1+T2. We have performed paired t-tests with Bonferroni correction on these runs and will add them to the manuscript. The differences between our method and baselines were statistically significant (p < 0.005) across all primary metrics reported in the results.
GENERALIZABILITY (R2 and R3 raised concerns about generalizability): The publicly available ISPY2 dataset is a multi-centre, multi-scanner study, and we will clarify to what extent this supports generalizability of our findings.
REPRODUCIBILITY AND DATASET DETAILS (R2 asked about the selection and dataset description): The subset was filtered for patients with all four imaging timepoints and DCE-MRI with three contrast-uptake channels (see ‘ISPY2 Dataset’). While ISPY2 cohort statistics are publicly available, we will include subset-specific information (e.g., subtype, treatment arm, 33% pCR rate) in the final manuscript. Our full code will be made available upon publication so that results will be replicable with the public data.
DETAILED POINTS:
- The triplet loss uses two different randomly augmented views of one observed image following standard SSL [21].
- MIP images are generated from 3D volumes; each timepoint is encoded into a 480-D vector via a UNet+MLP projector that is frozen during evaluation.
- For multi-timepoint prediction, features are concatenated and fed to a linear classifier.
- ISPY2 has a fixed imaging schedule during treatment across all patients, so there is negligible variability in imaging
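To make the size of the concatenated classifier input concrete, the arithmetic can be sketched as below. The 480-D embedding and the concatenation come from the points above, and the 585 patients and 70% training share come from the reviews; the variable and function names are hypothetical, and the exact training-set size depends on how the split was rounded.

```python
EMBED_DIM = 480        # per-timepoint embedding size stated in the points above
N_PATIENTS = 585       # subset size reported in the reviews
TRAIN_FRACTION = 0.7   # training share reported in Review #2

def concat_dim(n_timepoints, embed_dim=EMBED_DIM):
    """Feature count when per-timepoint embeddings are concatenated."""
    return n_timepoints * embed_dim

# Approximate number of training patients (rounded down).
n_train = int(N_PATIENTS * TRAIN_FRACTION)

for name, n_timepoints in [("T0", 1), ("T0+T1", 2), ("T0->T3", 4)]:
    d = concat_dim(n_timepoints)
    print(f"{name}: {d} features, {d / n_train:.2f} features per training patient")
```

With four timepoints the linear classifier would see 1920 features for roughly 409 training patients, which is the regime Review #1 raises as an overfitting risk. Some form of regularization would typically be used in that setting, though the feedback does not say which, if any, was applied.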
Meta-Review
Meta-review #1
- Your recommendation
Provisional Reject
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The weaknesses clearly outweigh the strengths. The paper lacks critical clinical and methodological details, including cohort composition, treatment protocols, class imbalance, and justification for modeling assumptions, limiting both interpretability and reproducibility. Performance claims are not statistically validated (very important!), and potential data leakage introduced by performing ablation on the test set raises questions. Additionally, the utility of late-stage prediction, unclear modeling choices, and insufficient handling of temporal and sampling variability further weaken the study’s reliability and clinical relevance.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers have acknowledged the rebuttal clarifications and converged on acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A