Abstract
Learning from noisy ordinal labels is a key challenge in medical imaging. In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation that allows classifying the disease state. For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks. To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness. In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. Our approach learns an interpretable disease representation enabling strong few-shot performance on the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1277_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/berenslab/Learning-Disease-State
Link to the Dataset(s)
https://zenodo.org/records/10992295
BibTex
@InProceedings{SchGus_Learning_MICCAI2025,
author = { Schmidt, Gustav and Heidrich, Holger and Berens, Philipp and Müller, Sarah},
title = { { Learning Disease State from Noisy Ordinal Disease Progression Labels } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {283--293}
}
Reviews
Review #1
- Please describe the contribution of the paper
The presented paper introduces a deep learning framework that can learn generalizable image embeddings from ordinal disease progression labels. The proposed method consists of Siamese encoders that process two images of a time series as well as two loss functions. The first loss term compares the images’ latent embeddings to predict the relative change in disease stage – better, worse, or stable – between the two time points. The second term is designed to identify cases in which at least one of the images is ungradable. Additionally, the authors introduce a trainable temperature parameter that can be interpreted as uncertainty estimate.
The authors train and evaluate their method using the MARIO challenge dataset from last year’s MICCAI, which contains time series of OCT images of patients with neovascular age-related macular degeneration (nAMD). Afterwards, they evaluate the generalization capabilities of their framework using an in-house ophthalmological dataset. Using a limited number of labels to fine-tune their pre-trained model, they assess its ability to distinguish between active and inactive nAMD. The proposed method is shown to outperform a naive baseline that was trained via categorical cross-entropy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Relevant scientific problem: the paper researches the use of machine learning to process longitudinal medical image data with weak annotations. Moreover, as ordinal labels often reflect the way clinicians evaluate longitudinal data, this topic is of high scientific interest to the MICCAI community.
- Intuitive method: the Siamese network architecture, the design of the loss functions, and the uncertainty estimate are all clearly motivated, well designed, and make intuitive sense. Additionally, the method excellently suits the recent and unique MARIO dataset.
- Accompanying public code repository: in addition to clearly describing their method, the authors have already made their code publicly available.
- Well-written manuscript: the paper is clearly structured, well written, and nicely illustrated, making it easy to read and understand.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Unclear distinction from prior work: in the introduction the authors write that “while ordinal regression has been explored in machine learning and medical imaging, learning from noisy ordinal disease progression labels about the underlying disease state remains unexplored.” I do not understand the distinction the authors are trying to make here and believe the authors have to explain more clearly how their work differs from the provided references (2, 3, 6, 7, 10, 14, 18, 19), potentially by including a dedicated related-works section.
- Limited methodological novelty: while the proposed method is intuitive, it essentially reflects the formal problem setting of the first task of the MARIO challenge (reference 20). Moreover, many of its components have already been introduced and utilized in previous works. For example, parts of the loss are similar to Gao et al. (reference 6), Garg et al. (reference 7), or Tang et al. (reference 19). Similarly, Taha et al. (reference 4) and Rivail et al. (reference 15) have used Siamese networks to simultaneously process two images of the same time series as a pre-training task.
- Unconvincing generalization results: most patients with nAMD will experience recurring neovascularization that is repeatedly treated with antiangiogenic drugs. Typically, this leads to nAMD alternating between an active and an inactive state. I suspect that many of the worsening cases in the MARIO challenge dataset closely resemble cases of active nAMD in the used in-house dataset. Conversely, improving cases will most likely correspond to eyes with inactive nAMD. As a result, the pre-training and out-of-distribution tasks are semantically very similar. I do not believe that the conducted experiments can support any claims regarding the generalization capabilities of the proposed method.
- Lack of meaningful baselines: additionally, the authors only include a single baseline in their experiments, a naive classifier trained via categorical cross-entropy. In order to effectively evaluate the potential of ordinal learning as a weakly supervised training strategy, the authors must include other weakly and self-supervised baselines, such as contrastive learning or masked data modeling.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The proposed method assumes that matching pairs in reverse temporal order represents the inverse disease progression (“antisymmetric logit space equivariance”). However, after active neovascularization has been treated in eyes with wet AMD, structural damage to the retina often persists. I believe this clinical fact and its potential impact on the validity of the antisymmetric equivariance assumption should be discussed in more detail.
- The authors write that their “model performed similar to the naive classifier”. However, the naive classifier achieves a substantially higher F1 score (70% vs. 60%). I believe that there is either an error in Table 1 or that the above statement has to be reevaluated.
- In Table 1, the authors should spell out what the metric “Rk-correlation” corresponds to. Additionally, they should clarify that the metrics are reported as percentages.
- I suggest also including the confidence values in the examples presented in Figure 2. Conversely, the predicted disease progression magnitudes could be added to the examples shown in Figure 3.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although I found the presented work topical and interesting, I am concerned regarding its limited methodological novelty as well as the insufficient experiments that cannot support claims that the proposed method yields generalizable image representations. As I believe that addressing these issues is beyond the scope of a one-week-long rebuttal period, I am recommending to reject the paper in its current form.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
In their rebuttal the authors have focused on highlighting the novel aspects of their proposed method. Their explanations regarding their siamese network setup, loss re-weighting and how these relate to prior work have alleviated my concerns regarding the method’s limited algorithmic novelty. However, I still feel that the conducted experiments are not extensive enough to demonstrate the efficacy and generalization properties of the proposed method. In particular, my concerns regarding the semantic similarity of pre-training and evaluation datasets as well as the lack of meaningful baselines persist. As changing these issues is beyond the scope of the MICCAI rebuttal, I remain with my original recommendation to reject the paper in its current form.
Review #2
- Please describe the contribution of the paper
The paper proposes a method to learn continuous representations of disease state from ordinal labels. The authors propose to use a Siamese network to encode B-scans, to capture disease change in logit space as the difference between the two encodings, to convert the continuous logit representation to an ordinal one with a sigmoid, and to add a learnable scale factor to the sigmoid. The performance was tested on the open MARIO challenge data with cross-validation, and on private data in a few-shot classification task.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- It’s a well-written paper in the sense that the authors have picked a problem, found a solution, and provided validation with supporting data and analysis.
- The method translates well to other tasks, e.g., stable vs. active nAMD classification with few-shot learning.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper and the method are quite straightforward and would rather qualify for a workshop or challenge paper. The contributions are not clear, as the authors frame them as “we tailored our model” rather than “we propose a novel method”, without highlighting what has actually been introduced:
- Independent image encodings (Siamese setup) have long been known and are extensively discussed in, e.g., [1].
- Antisymmetric logit space: follows from properties of the sigmoid.
- Ordinal scale awareness: it’s a property of the method.
- Uncertainty-aware loss re-weighting: cites the review paper rather than the paper in which it was introduced.
- The private dataset isn’t described well, e.g., number of patients or demographic data.
- Comparison with a single naive baseline, for both tasks. Low performance on the MARIO data.
[1] Li, M.D., Chang, K., Bearce, B. et al. Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. npj Digit. Med. 3, 48 (2020). https://doi.org/10.1038/s41746-020-0255-1
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Despite the method being sound and the paper well organized, I had the impression that the paper lacks contribution, with vague statements in the abstract and the last paragraph of the introduction. I believe this paper is better suited for an ophthalmological workshop.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have clarified the contributions of the paper and promised to provide the dataset statistics in the final version. Given this, I believe the paper is slightly above the threshold.
Review #3
- Please describe the contribution of the paper
The authors provide a method to learn a representation of disease progression in late age-related macular degeneration (AMD) by classifying, for pairs of retinal Optical Coherence Tomography (OCT) images from one eye, whether the disease got better, stayed stable, or got worse. They use a Siamese network to learn a representation of the retina and model the differences directly in the latent logit space, which allows for a continuous grading and, furthermore, a classification into the three classes. By introducing a second head to classify “other” labels and a learnable uncertainty parameter that re-weights the loss, they were able to mitigate label noise.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Continuous and symmetric representation of disease progression in the logit space, instead of directly performing a classification task.
- Incorporating label noise as a learnable parameter
- Thorough analysis of the model, its limitations and limitations of the data.
- Transfer learning of the model on an independent related task (active vs inactive nAMD).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
For the sake of writing something: the classification performance is not as good as that of other methods using this MICCAI challenge data. However, this is also not the main aim of the method.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
A very nice and well written paper. The idea, method, results and discussion are clear and well structured. No major complaints or comments.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Disease progression modeling in AMD is a difficult task, in particular with noisy labels. The authors did not learn and optimize a black-box classifier with limited use in clinical practice. They developed a continuous representation of disease progression and, within the limited number of pages, did a thorough analysis of the model, its benefits, and its limitations. Whereas the use of a Siamese network is an obvious choice for encoding differences, the authors’ choice of using the logit space for a continuous disease progression representation is an interesting one. Furthermore, the special treatment of the “other” class and the learning of an uncertainty parameter are a good concept to mitigate label noise.
The learned representation is transferred to an independent dataset. The method is not bound to the retinal OCT domain and may be transferred to other medical image domains. The paper is well written, the method is described with enough detail, and the validation is extensive.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive feedback. They appreciated that our work addresses a clinically relevant scientific problem (R1, R2), that the model design choices are clearly motivated and well designed (R1, R2, R3), and that the manuscript is well written and organized (R1, R2, R3).
Nevertheless, R1 and R3 also voiced criticism, mainly because they perceived a lack of methodological novelty. We may not have explained our contributions clearly enough. In the following, we hope to clarify our contributions and will also include the mentioned points in the final version.
Siamese network: Our model differs from off-the-shelf Siamese networks by producing a scalar disease state logit and a scalar “other” logit per image. To classify disease progression, we subtract the disease state logits between paired images, encouraging a compact, interpretable disease representation per image – even though we train on image-pair labels. In contrast, standard Siamese networks (like our baseline model) and methods like [4,15] rely on high-dimensional embeddings and joint processing. For example, our baseline concatenates these embeddings and passes them through learnable layers but does not yield compact per-image outputs. Importantly, unlike Li et al. (R3), who require Euclidean distances to healthy anchors, our model learns a continuous disease state directly, without such constraints.
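For concreteness, a minimal PyTorch sketch of such a setup follows (the backbone, head sizes, input shape, and all names are our assumptions for illustration, not the authors’ exact configuration):

import torch
import torch.nn as nn
from torchvision.models import resnet18

class DiseaseStateEncoder(nn.Module):
    # One shared encoder producing two scalar logits per image:
    # a disease state logit and an "other"/ungradable logit.
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the trunk, drop the classifier
        self.backbone = backbone
        self.state_head = nn.Linear(512, 1)  # scalar disease state logit
        self.other_head = nn.Linear(512, 1)  # scalar "other" logit

    def forward(self, x):
        h = self.backbone(x)
        return self.state_head(h).squeeze(-1), self.other_head(h).squeeze(-1)

encoder = DiseaseStateEncoder()
x_t0 = torch.randn(4, 3, 224, 224)  # visit at time t (3-channel for simplicity)
x_t1 = torch.randn(4, 3, 224, 224)  # follow-up visit
s0, _ = encoder(x_t0)               # shared weights: the same encoder encodes both images
s1, _ = encoder(x_t1)
delta = s1 - s0                     # disease progression as a logit difference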
Antisymmetric logit space: We compute disease progression as the difference between two disease states. Therefore, unlike the baseline, we enforce an antisymmetric equivariant logit space – derivable from a property of the sigmoid function (R3). This assumes that using image pairs in both time directions helps the model learn progression more effectively. Even if this inductive bias is not strictly true, as, e.g., fluids in the retina may leave lasting traces (R1), we empirically found that the resulting model can be used to detect the presence of biomarkers like intra- and subretinal fluids.
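For reference, the sigmoid identity underlying this equivariance, in LaTeX notation (with s_1, s_2 the two disease state logits):

% Swapping the temporal order of a pair flips the sign of the logit
% difference; by this identity, the predicted probability of "better"
% turns into the predicted probability of "worse".
\sigma(-\Delta) = \frac{1}{1 + e^{\Delta}} = 1 - \sigma(\Delta),
\qquad \Delta = s_2 - s_1 .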
Ordinal scale: Although our model is trained as a binary classifier (“worse” = 0, “better” = 1) using cross-entropy loss, we introduce the “stable” class with a target label of 0.5 – representing zero difference between two disease states. This enforces an ordinal scale for the labels – placing “stable” exactly between the other labels. Unlike conventional ordinal regression methods [6,7,19] that discretize K labels into K–1 binary tasks, our approach preserves the continuity of the label space, reflecting intra- and inter-class relationships.
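A minimal sketch of such a loss, under our reading of this description (the class-to-target mapping follows the text above; all names are hypothetical):

import torch
import torch.nn.functional as F

# Binary cross-entropy on the logit difference, with "stable" encoded as the
# soft target 0.5; for that target, the loss is minimized at a zero difference.
TARGETS = {"worse": 0.0, "stable": 0.5, "better": 1.0}

def progression_loss(delta, labels):
    # delta: (B,) disease state logit differences; labels: list of class strings
    y = torch.tensor([TARGETS[l] for l in labels], dtype=delta.dtype)
    return F.binary_cross_entropy_with_logits(delta, y)

loss = progression_loss(torch.randn(4), ["worse", "stable", "better", "stable"])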
Loss re-weighting: We propose a straightforward uncertainty estimate – a learnable slope parameter of the sigmoid function used for loss re-weighting, a well-known subfield of learning with noisy labels, which is why we cited the review paper (R3). To the best of our knowledge, our approach to uncertainty estimation is novel in this context (R1).
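One plausible implementation of such a learnable slope, sketched under our assumptions (the paper’s exact parameterization may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledProgressionLoss(nn.Module):
    # A single learnable slope (temperature) on the sigmoid. A small slope
    # flattens the sigmoid and caps the penalty on pairs whose labels disagree
    # with the model, one way to absorb label noise via loss re-weighting.
    def __init__(self):
        super().__init__()
        self.log_slope = nn.Parameter(torch.zeros(1))  # slope = exp(.) > 0

    def forward(self, delta, y):
        return F.binary_cross_entropy_with_logits(self.log_slope.exp() * delta, y)

criterion = ScaledProgressionLoss()
loss = criterion(torch.randn(8), torch.full((8,), 0.5))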
Comparison to prior work: None of the cited references offers all of our method’s features, as they deal with other tasks, such as SSL [2], or are not noise-aware [3,6,10,14,18,19]. Further, none of the mentioned works, including [7], has an internal continuous representation on the image and/or pair level.
Next, we address R1’s claim that the OOD experiment is unsuitable to show the generalization of the model.
Generalization: While we agree with R1 that the OOD task is related to our primary task, the OOD dataset differs from MARIO – it comes from another clinic and includes single-image nAMD activity labels, not image-pair progression labels. We will give more details about the dataset in the final version (R3). Despite the task similarity, the baseline model trained on MARIO performs poorly. Only our model, which learns an internal disease state, performs clearly better – even in few-shot settings.
We thank R1 for mentioning further points. We are confident that we can address all of them in the final version.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The paper has received mixed reviews, but based on the authors’ feedback, as well as the reviews, I recommend rejection.
It is quite clear that the proposed method makes a lot of sense, and the paper is well written. Unfortunately, reviewers and the AC find the experimental validation rather unsatisfying.
Below are the main concerns that were raised in the reviews:
- Limited algorithmic novelty (see the comment of R1 and the reference to https://www.nature.com/articles/s41746-020-0255-1)
- It feels like noise modelling does not really add much. I personally do not find any statistically significant results.
- Validation on external dataset would be important. See R1’s comments about the progression of AMD. Moreover, this paper would require more baselines to make the evaluation meaningful.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A