Abstract
Accurate outcome prediction for head and neck cancer is critical but remains challenging due to domain shifts across multi-institutional imaging datasets. Existing domain generalization (DG) methods focus on visual features while overlooking clinical domain-invariant information. To address this gap, we propose MedPro-DG, a novel prompt learning framework that integrates CT imaging with clinical variables using domain-aware masked contrastive prompt learning. Our method effectively mitigates domain shifts by aligning cross-modal features with domain-invariant clinical semantics. Extensive experiments conducted across six medical centers demonstrate the superiority of MedPro-DG, which outperforms state-of-the-art DG methods by 1.35% in AUC and 4.06% in ACC on average. Ablation studies further reveal that our prompt learning can capture clinically domain-invariant features, highlighting their diagnostic relevance. This work pioneers domain-invariant vision-language fusion for medical domain generalization, providing a practical and effective solution for multi-center collaborative diagnosis.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1231_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WanRon_MedProDG_MICCAI2025,
author = { Wang, Rongfang and Chen, Jiasheng and Zhang, Xinlong and Wang, Jing and Liu, Hui and Zhou, Zhiguo and Wang, Kai},
title = { { MedPro-DG: Domain-Aware Masked Contrastive Prompt Learning of Institution Generalization for Outcome Prediction } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {414 -- 424}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes MedPro-DG, a vision-language framework for domain generalization (DG) in head and neck cancer outcome prediction.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors introduce two main components, an attention-augmented visual prompt and a domain-masked contrastive loss, which are proposed to learn domain-invariant, pathology-sensitive representations in a multi-modal setup. The model is trained and tested on data from six centers.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The dataset is highly imbalanced according to Table 1. I am deeply concerned about the evaluation of this paper. It only reports AUC and ACC, neither of which is reliable on a highly imbalanced dataset. The numbers in the table do not include any standard deviation or confidence interval; this might be acceptable in general, but for an imbalanced dataset it is an additional concern.
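To make this concern concrete, imbalance-aware metrics such as AUPRC and balanced accuracy can be computed alongside AUC with scikit-learn. The labels and scores below are synthetic, for illustration only, and are not the paper's results:

from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score)

# Toy binary outcomes with ~20% positives (illustrative, not real data)
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.6, 0.3, 0.7, 0.4]

print("AUC:  ", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))  # tracks minority class
print("BACC: ", balanced_accuracy_score(y_true, [int(s > 0.5) for s in y_score]))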
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
With a highly imbalanced dataset, the evaluation part of the paper is deeply concerning; I cannot really assess the contributions based on the numbers provided.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
MedPro‑DG is a vision‑language framework that fuses CT imaging with clinical text via attention‑augmented visual prompts and a domain‑masked contrastive loss.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The premise that clinical variables are more domain-invariant than imaging data is reasonable, given standardized medical terminologies (e.g., TNM staging).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Novelty of the proposed method is questionable. The components are incremental adaptations of existing prompt-learning and contrastive frameworks; without rigorous theoretical and empirical justification, the method seems to be another heuristic combination of existing work.
- The DMCL loss excludes same-class, same-domain samples from the denominator. This design emphasizes cross-domain alignment but neither penalizes nor leverages within-domain similarities, potentially missing opportunities to reinforce class consistency. An ablation study including these samples in the denominator could clarify whether this exclusion is optimal or a limitation. (A sketch of the described loss appears after this list.)
- Competing DG methods likely fine-tune the full backbone. Without a baseline that freezes its encoder to the same degree as the proposed method, it is unclear whether the gains arise from the prompt-learning architecture or simply from an easier, lower-capacity optimization. A fair ablation would compare "ERM + prompt learning with a frozen backbone" against "ERM with a fine-tuned backbone."
- Lack of significance testing (or even standard deviations).
- How are the clinical variables converted to a text prompt for CLIP?
- Eq. 10: index j is duplicated.
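To fix ideas, here is a minimal PyTorch sketch of a supervised-contrastive variant matching the description above: positives are same-class, cross-domain pairs, and same-class, same-domain pairs are dropped from the denominator. The function name, temperature, and batch handling are our assumptions, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def dmcl_loss(z, labels, domains, tau=0.1):
    """Domain-masked contrastive loss, as described in the review.

    z:       (N, D) embeddings; labels: (N,) class ids; domains: (N,) institution ids.
    Assumes the batch contains at least one cross-domain positive per class.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                  # (N, N) similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_domain = domains.unsqueeze(0) == domains.unsqueeze(1)

    pos = same_class & ~same_domain                        # same class, other domain
    # Denominator drops self-pairs and same-class, same-domain pairs.
    denom_mask = ~eye & ~(same_class & same_domain)

    sim_masked = sim.masked_fill(~denom_mask, float('-inf'))
    log_denom = torch.logsumexp(sim_masked, dim=1, keepdim=True)
    log_prob = sim - log_denom                             # log-softmax over allowed pairs

    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    n_pos = pos.sum(dim=1)
    loss = -pos_log_prob.sum(dim=1) / n_pos.clamp(min=1)
    return loss[n_pos > 0].mean()                          # skip anchors without positives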
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Novelty is questionable. Some components require clarifications.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Clarifications on clinical prompts, DMCL, and backbone design resolve most doubts, though novelty remains incremental.
Review #3
- Please describe the contribution of the paper
The paper introduces MedPro-DG, a domain generalization framework for multi-center outcome prediction in head and neck cancer. Its main contribution is the integration of CT imaging and clinical variables through two novel components: (1) Attention-Augmented Visual Prompt (AAVP), which injects features from the vision encoder into a learnable text prompt space, and (2) Domain-Masked Contrastive Loss (DMCL), which aligns cross-domain, same-class samples in the clinical embedding space. The method is evaluated across six institutions using a leave-one-domain-out protocol, showing consistent improvements over existing domain generalization baselines.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Multimodal integration with prompt learning: The paper introduces a novel framework (MedPro-DG) that effectively integrates CT imaging and clinical variables using frozen backbones and learnable prompts, which remains relatively unexplored in the medical domain. The design of AAVP, which injects vision-derived spatial features into the text encoder, is an interesting formulation that can help establish more meaningful connections between imaging and clinical information. (An illustrative sketch appears after this list.)
- Domain-masked contrastive learning (DMCL): The proposed DMCL is a variation of supervised contrastive learning that explicitly enforces domain-invariant alignment. It constructs positive pairs from samples with the same class label across different institutions, and negative pairs from samples with different class labels.
- Evaluation across six domains with lightweight tuning: The method is evaluated across six medical centers, each treated as a separate domain in a leave-one-domain-out setting. This provides strong evidence of the model's ability to generalize across real-world institutional variability. Additionally, the authors freeze both the image and text encoders, training only a lightweight set of learnable prompt vectors and a small FC layer in AAVP, which significantly reduces overfitting risk and improves efficiency.
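As one reading of the AAVP description (layer4 features from a frozen ResNet-50, GAP/GMP pooling per the rebuttal, and a small FC projecting into the prompt space), here is an illustrative PyTorch sketch. The dimensions, naming, and fusion step are assumptions, not the paper's exact module:

import torch
import torch.nn as nn

class AAVPSketch(nn.Module):
    """Illustrative attention-augmented visual prompt (not the authors' code)."""

    def __init__(self, feat_dim=2048, prompt_dim=512, n_ctx=4):
        super().__init__()
        # Learnable context (prompt) vectors, as in CoOp-style prompt tuning
        self.ctx = nn.Parameter(torch.randn(n_ctx, prompt_dim) * 0.02)
        # Small FC projecting pooled visual features into the prompt space
        self.proj = nn.Linear(2 * feat_dim, prompt_dim)

    def forward(self, layer4_feats):
        # layer4_feats: (B, C, H, W) from a frozen ResNet-50 backbone
        gap = layer4_feats.mean(dim=(2, 3))            # global average pooling
        gmp = layer4_feats.amax(dim=(2, 3))            # global max pooling
        vis = self.proj(torch.cat([gap, gmp], dim=1))  # (B, prompt_dim)
        # Append the visual token to the learnable context vectors
        b = layer4_feats.size(0)
        prompts = torch.cat(
            [self.ctx.unsqueeze(0).expand(b, -1, -1), vis.unsqueeze(1)],
            dim=1)                                     # (B, n_ctx + 1, prompt_dim)
        return prompts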
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the paper evaluates MedPro-DG against a variety of domain generalization methods, most are older (e.g., DANN, CORAL, GroupDRO). The authors did not include more recent and competitive approaches such as SWAD [1] and SelfReg [2], which have shown strong performance in DG settings and are relatively easy to incorporate. Including such baselines, or even referencing more recent methods, would help better substantiate the claimed improvements.
- It is unclear whether the baselines (e.g., ERM, VREx) also use both imaging and clinical (textual) inputs or rely only on image features; this should be clarified. Additionally, if the baselines use only a vision model, it is important to specify which encoder weights are used. The image encoder from CLIP cannot be used directly for classification, as it is trained for similarity with text embeddings rather than softmax-based prediction. Are ImageNet-pretrained weights used instead for those baselines? This detail is important to ensure fair and consistent comparisons.
- The authors use features from layer4 of ResNet-50 in the AAVP module but do not justify why this particular layer is chosen or whether other layers were considered. Additionally, I would be grateful if the authors provided more explanation or evidence of how AAVP provides spatial guidance, as mentioned in the paper.
- The paper uses a leave-one-domain-out evaluation setup, which is well suited for domain generalization. However, it would further strengthen the work if the authors clarified how hyperparameters (e.g., λ) and model checkpoint selection are handled during training. For example, is there a validation split within the source domains used for tuning? Clarifying this would help assess the robustness and reproducibility of the reported results. (A sketch of such a protocol appears after this list.)
- The ablation studies effectively isolate the contributions of AAVP and DMCL. To further strengthen the analysis, it would be valuable to include a variant that uses fixed (non-learnable) prompts. This would help clarify whether the benefit arises specifically from prompt tuning, independent of the other components, and provide deeper insight into the role of prompt learning in the overall performance.
- The paper mentions that "medical imaging domains often exhibit local texture variations". To make this claim more compelling, it would be helpful to include a citation or a concrete example.
[1] Cha, J. et al., "SWAD: Domain Generalization by Seeking Flat Minima", NeurIPS 2021.
[2] Kim, D. et al., "SelfReg: Self-Supervised Contrastive Regularization for Domain Generalization", ICCV 2021.
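For reference, a minimal sketch of a leave-one-domain-out protocol with a per-source-domain validation split (the 20% figure comes from the authors' rebuttal; the data structures and function name are illustrative):

import random

def lodo_splits(samples_by_domain, val_frac=0.2, seed=0):
    """Leave-one-domain-out splits with a per-source-domain validation split.

    samples_by_domain: dict mapping domain id -> list of samples.
    Yields (target_domain, train, val, test) tuples.
    """
    rng = random.Random(seed)
    for target in samples_by_domain:
        train, val = [], []
        for dom, samples in samples_by_domain.items():
            if dom == target:
                continue
            samples = samples[:]          # copy before shuffling
            rng.shuffle(samples)
            k = int(len(samples) * val_frac)
            val.extend(samples[:k])       # held out for checkpoint/hyperparameter selection
            train.extend(samples[k:])
        yield target, train, val, samples_by_domain[target]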
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes MedPro-DG, a novel domain generalization method for multi-center outcome prediction that combines imaging and clinical data using an attention-augmented visual prompt (AAVP) and a domain-masked contrastive loss (DMCL). The integration of spatially-informed vision features into the text encoder is original and promising. The evaluation across six institutions with a leave-one-domain-out setup is solid, and the design choice to freeze encoders adds practical value.
However, several issues limit a stronger recommendation: the baselines are somewhat outdated, some design choices (e.g., the layer selection in AAVP) lack justification, and model selection and ablation coverage could be improved. Despite these issues, the core idea is interesting.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal addresses some of my concerns. For Q1, the authors justify their current baseline set and state that stronger DG baselines such as SWAD and SelfReg will be incorporated. For Q2, they clarify that all baselines use ImageNet-pretrained ResNet-50 for the image branch, while MedPro-DG additionally leverages only the text encoder from the CLIP-RN50 checkpoint to handle clinical prompts, ensuring a fair comparison. Concerning the first part of Q3, they explain the choice of ResNet layer4; I recommend adding a brief subsection on this rationale in the final version. For Q4, they now outline the hyperparameter tuning and model-selection process, including a 20% validation split. They also commit to releasing code, data, and checkpoints. I therefore keep my initial positive score.
Author Feedback
Dear Reviewers and Area Chairs,

We appreciate all the reviewers for their valuable comments and suggestions. We are encouraged that they find our motivation reasonable (R2) and novel (R3), our idea interesting and meaningful (R3), and our presentation clear and well organized (R2, R3). We address the reviewers' comments below.

R1 Q1: We agree with the reviewer that our dataset is significantly imbalanced. We initially used AUC for its robustness under imbalance. Following the reviewer's suggestion, we will add AUPRC, which better reflects minority-class performance. Due to space limits, we will report AUPRC (69.82±4.83) only on the most imbalanced dataset (CHUM, with 10.71% LRR cases). Since heatmaps cannot show standard deviations, we will convert the existing results into a table for clarity. We will release the code, dataset, and checkpoints.

R2 Q1: We propose a language-based learnable clinical prompt to extract continuous semantics from EHRs. By leveraging complementary, domain-invariant clinical features, our work enhances conventional DG approaches that focus only on the image modality. Unlike typical prompt engineering, we use clinical variables instead of class labels, preserving general semantic power without label bias. We also design a model with learnable prompts and frozen backbones to guide attention toward domain-invariant features, improving generalization to unseen domains with minimal parameter tuning.

Q2: In the ablation study, we explored the impact of including same-class, same-domain samples in the denominator (AS4) versus excluding them (MedPro-DG). Exclusion aids domain generalization by preventing within-domain overfitting and promoting focus on inter-domain invariant features.

Q3: Our design targets domain-invariant features in clinical data, which are inherently stable across domains. Thus, we freeze the backbone and train only the learnable prompts with minimal tuning. As noted by Zhou et al. in CoOp and CoCoOp, freezing the backbone enables efficient downstream adaptation without performance loss.

Q4: Please see our response to R1 Q1.

Q5: We encode clinical variables as a simple concatenated sentence for CLIP, e.g., "57, Oropharynx, T2, N0, radiation, positive". This minimal, unstructured format avoids introducing assumptions or biases. (A sketch of this construction follows this feedback.)

Q6: Eq. 10 has been revised.

R3 Q1: The compared methods cover foundational DG approaches across data-, representation-, and strategy-based categories. ERM is critical because it often outperforms complex DG methods under distribution shifts and is the basis for many advanced techniques, e.g., Teterwak et al., "ERM++: An Improved Baseline for Domain Generalization" (WACV 2025). Our method is compatible with any ERM-based framework. We will compare against SWAD and SelfReg in future work.

Q2: All compared methods use only the image as input. We use CLIP's text encoder only to process clinical information. The ResNet-50-based models use ImageNet-pretrained weights; CLIP uses the 'RN50' weights.

Q3: In H&N cancer imaging, layer4 identifies tumors and their locations better than layer3; we tested layer3 but found layer4 more effective. Early max pooling reduces performance due to information loss. AAVP uses GAP and GMP for spatial guidance: GAP captures global features and suppresses noise, while GMP highlights key regions (see doi:10.1007/s00521-022-06953-8).

Q4: We set λ=1, and an ablation study on its effect is provided in Tab. 3. For training, 20% of each source domain is held out for validation, following [20]. We save a checkpoint every 40 iterations and select the best model on the validation set.

Q5: AS1 evaluates the model using only learnable prompts to isolate their effect. The papers cited in our response to R2 Q3 analyze the difference between fixed and learnable prompts in depth.

Q6: In [23] (Sec. 2.3), the authors discuss local texture variations across medical imaging domains.

Due to the rebuttal rules, we cannot add more experiments, results, or analyses. Nevertheless, we deeply appreciate all the comments regarding further interpretation and investigation.
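For illustration, a minimal sketch of the clinical-prompt construction described in R2 Q5. The record fields, their ordering, and the tokenizer usage are assumptions beyond the rebuttal's single example string:

# Serialize clinical variables into the unstructured comma-separated
# sentence described in the rebuttal. The example record is illustrative.
record = {
    "age": 57,
    "site": "Oropharynx",
    "t_stage": "T2",
    "n_stage": "N0",
    "treatment": "radiation",
    "hpv_status": "positive",
}

# Matches the rebuttal's example: "57, Oropharynx, T2, N0, radiation, positive"
fields = ["age", "site", "t_stage", "n_stage", "treatment", "hpv_status"]
prompt = ", ".join(str(record[k]) for k in fields)
print(prompt)

# The prompt would then be tokenized and embedded with CLIP's text encoder,
# e.g., using OpenAI's clip package:
#   import clip, torch
#   model, _ = clip.load("RN50")
#   tokens = clip.tokenize([prompt])
#   with torch.no_grad():
#       text_features = model.encode_text(tokens)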
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers have recommended accept. I think the authors have also provided a reasonable rebuttal to the other reviewer, mainly regarding data imbalance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A