Abstract
Current medical report generation (MRG) methods remain limited in cross-modal association, particularly when handling complex medical terminology across different modalities. In this work, we propose the Universal Medical Report Generation (UniMRG) framework to enhance Vision-Language foundation models (VLFMs) through coordinated data augmentation and architecture optimization. Specifically, we introduce Universal Semantics-Synergistic Multimodal Augmentation to enhance model adaptability to diverse medical scenarios while preserving critical diagnostic features. We further design a Medical Content Learner to capture both fine-grained pathological variations and specialized diagnostic contexts for robust cross-modal alignment. To achieve robust medical understanding against real-world variations, we develop a Dynamic Synergistic Evolution strategy, guided by a Large Language Model (LLM), that enables joint optimization of augmentation policies and architectural configurations. To address the existing gap in public VL datasets for skin diseases, we release a large-scale Skin-Path dataset, consisting of 277,761 patches covering 10 distinct skin diseases. Extensive experiments on PatchGastric22, IU-Xray, and Skin-Path demonstrate that UniMRG achieves state-of-the-art performance, surpassing Clinical-BERT by 2.6% in BLEU-4 and 3.9% in ROUGE-L on IU-Xray. The Skin-Path dataset is available at: https://unimrg.github.io/Skin-Path/.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0710_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
Skin-Path dataset: https://unimrg.github.io/Skin-Path/
BibTex
@InProceedings{XuHon_UniMRG_MICCAI2025,
author = { Xu, Hongyan and Sowmya, Arcot and Katz, Ian and Wang, Dadong},
title = { { UniMRG: Refining Medical Semantic Understanding Across Modalities via LLM-Orchestrated Synergistic Evolution } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {639 -- 649}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a new framework that leverages Vision-Language foundation models for the task of medical report generation. In particular, the authors propose an augmentation scheme that more accurately resembles medical variations, as well as a Medical Content Learner (MCL), an adapter/LoRA-inspired approach to adapt the vision-language model to the medical task. Finally, an LLM is used to optimize the augmentation policy and the design of the MCL in a Neural Architecture Search (NAS)-inspired setting. In addition, the authors make a new skin cancer dataset available.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Leveraging general vision language foundation models for medical report generation is a timely and relevant topic.
- The authors publish a new skin disease dataset, which has the potential to be beneficial for the community.
- Empirical results demonstrate that the proposed approach obtains promising performance.
- Leveraging LLMs in order to perform neural architecture search and finding augmentation policies is interesting.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Methodology: For the Universal Semantics-Synergistic Multimodal Augmentation in Sec. 2.1, it is unclear how the LLM is leveraged as an intelligent proxy over both modalities. While the reviewer understands that it can be used to propose augmentations for the text modality, proposing augmentations for the images independently of the actual images appears limiting. If the LLM is instead just given a list of augmentations (without the text or images) and asked to suggest combinations from this list, as appears to be the case here since P depends only on the sets of visual and textual strategies in Eq. (1), it is unclear why this is beneficial over directly sampling from a specified prior distribution.
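To make the reviewer's contrast concrete, below is a minimal Python sketch of the "direct sampling from a specified prior" baseline the reviewer has in mind. The strategy names and counts are hypothetical illustrations, not the sets used in the paper's Eq. (1).

```python
import random

# Hypothetical strategy lists; the paper's actual visual/textual sets are not reproduced here.
VISUAL = ["color_jitter", "horizontal_flip", "gaussian_blur", "random_crop"]
TEXTUAL = ["synonym_replacement", "back_translation", "paraphrase"]

def sample_policy(k_visual=2, k_textual=1, seed=None):
    """Draw a joint augmentation policy from fixed strategy lists under a
    uniform prior -- the baseline the reviewer contrasts with an LLM that is
    shown only the same lists (no images or reports)."""
    rng = random.Random(seed)
    return {
        "visual": rng.sample(VISUAL, k_visual),
        "textual": rng.sample(TEXTUAL, k_textual),
    }

print(sample_policy(seed=0))  # a dict with two visual and one textual strategy
```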
In Sec. 2.2, it is unclear why the neural architecture search is conducted using an LLM, and experiments using alternative approaches are missing. There also appears to be an inconsistency between Eq. (2) and Fig. 2: should there be an additional MHA in the second term of Eq. (2)?
Clarity and reproducibility: It is unclear what the “Manually Configured Data Augmentation” baseline refers to. Is it an optimized combination of the different strategies in the UMA? The manuscript also lacks a considerable amount of implementation detail, which limits reproducibility; for instance, the strength of the augmentations in the augmentation scheme and the definition of the search space for the DSE are not given.
Dataset and Empirical evaluation:
- Additional details are required to understand the experimental setup in Table 1, in particular how the baselines are constructed. The proposed vision-language model is compared to pure vision models (ConvNeXt, SWIN, etc.), yet it is the quality of the generated text that is ultimately evaluated.
- As mentioned above, alternative NAS approaches need to be considered to validate the choice of an LLM.
- For the dataset, there appears to be significant duplication in the text annotations (277,761 patches with only 194 unique image-level annotations), and the field of view of a single patch will typically not provide enough information to generate the “high-level” text annotation, which requires aggregating the image-level information.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Some additional minor comments:
- There appears to be a typo in the pseudo-code in Stage 2 with the equation number.
- In Table 2, you state that UniMRG is not pretrained, which can be slightly misleading given that the VLFM has indeed been pre-trained (outside the medical domain).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
In its present form, I am leaning towards rejecting the manuscript, but I am open to re-evaluating my position during a potential rebuttal if the authors can provide clarifications and address the doubts raised in the weaknesses on the methodology side.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I do appreciate the authors’ clarifications and the promise to share the code, which will clarify aspects such as the choice of prompts. However, while I consider this a borderline case, I am still leaning towards rejecting the paper, given that the proposed approach performs neural architecture search (NAS) and a comparison to existing NAS approaches is missing. If the paper were to be accepted, references to prior works that leverage LLM-based NAS approaches should be included and contrasted with the proposed approach.
Review #2
- Please describe the contribution of the paper
The authors proposed the Universal Medical Report Generation (UniMRG) framework to enhance Vision-Language foundation models (VLFMs) through coordinated data augmentation and architecture optimization. They also designed a Dynamic Synergistic Evolution method to explore the optimal model architecture and multimodal augmentation strategy for the MRG model. On top of that, they introduced the first VL dataset for skin cancer.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript is generally well-written, well-organized, and easy to follow.
- The methodology is clear and has a certain level of novelty (simulating diverse real-world medical variations while preserving critical diagnostic features from a data perspective, as well as aligning medical terminology with visual semantics).
- Experiments were conducted on three different clinical multimodal datasets, with sufficient ablation studies to prove the effectiveness of the proposed modules.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- A simple evaluation and comparison on the MIMIC dataset would be appreciated, since it is a widely adopted large dataset for the MRG task.
- Computational cost/efficiency was not mentioned, which is important.
- An ablation study using various text encoders would also be important, since different encoders can largely affect the final performance.
- Figure 1 is a bit messy; I suggest reducing the text content and magnifying the visual information.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Major contributions such as augmentation, content learning, and dynamic optimization are clearly outlined, demonstrating the novelty of the proposed approach. Solid experimental design and validation with sufficient comparison and ablation studies.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
All my concerns have been properly addressed; thus I am in favour of acceptance.
Review #3
- Please describe the contribution of the paper
The paper proposes UniMRG, a novel framework for medical report generation. Its main contributions include a semantics-synergistic multimodal augmentation strategy and a Medical Content Learner to enhance cross-modal alignment. In addition, it introduces a dynamic evolution approach guided by a large language model to jointly optimize data augmentation and model architecture. The authors also release a new large-scale dataset for skin disease report generation. Experiments on three benchmarks demonstrate superior performance over existing methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
(1) The proposed method leverages large language models to explore multimodal information, which is highly consistent with current research trends in vision-language integration.
(2) Based on the reported experimental results, the proposed approach demonstrates strong performance across multiple benchmarks, indicating its effectiveness.
(3) The newly introduced dataset is a valuable contribution and is expected to benefit future research on medical report generation and related tasks.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
A major weakness of the paper lies in the lack of clarity in the explanation of certain methods and their motivations, particularly in Section 2.2. For example, key components such as the functions LN and CA in Equations (2) and (3) are not explicitly defined. Moreover, the purpose and intuition behind the operations described in this subsection remain unclear, which hampers the reader’s understanding of the proposed mechanism.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although the authors’ descriptions of certain methods and their motivations are somewhat unclear, particularly in Section 2.2, I believe the proposed approach can still be rated with a weak accept based on the strong experimental results and the significant contribution of the released dataset. The effectiveness of the method is demonstrated through the reported performance, and the dataset has the potential to significantly advance future research in this area.
- Reviewer confidence
Not confident (1)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the AC and reviewers for the valuable feedback. We will address all suggestions in the final version by: (1) correcting typos and writing issues; (2) enhancing component definitions and implementation details, and including the code link with reproduction instructions after acceptance.
To R1:
i) MIMIC Evaluation: We experiment on PatchGastric22, IU-Xray, and Skin-Path to cover the pathology and radiology domains. Per conference policy, MIMIC evaluation cannot be included. We appreciate the suggestion and will include it in future work.
ii) Computational Efficiency: As noted in the ablation, our modules increase model size from 103.15M to 106.1M (+2.86%). Despite this, Tab. 3 shows substantial performance gains of 25.1%, 22.9%, and 21.9% over the baseline across datasets.
iii) Text Encoder: In an early study, we compared OPT and FlanT5. FlanT5 produced more coherent reports and achieved better results, likely due to its instruction tuning and bidirectional attention. We adopted FlanT5 and will summarize this in the final version.
iv) Fig. 1 Layout: We will redesign Fig. 1 to improve clarity.
To R2:
i) Augmentations from LLM: As shown in Algorithm 1, the LLM iteratively generates and refines multimodal augmentation strategies based on validation feedback, acting as a task-aware proxy. Since the performance signal stems from the iterative search over image-text pairs, this feedback loop is closely tied to the actual image content, which helps the LLM adjust the visual transformations to increase the diversity of visual styles while maintaining semantic relevance. We will revise Eq. (1) and clarify that P is a dynamic, performance-driven function. Tab. 4 confirms UMA’s gains over traditional augmentation (e.g., Mixup, CutMix) and MCDA.
ii) LLM as Efficient NAS Proxy: Traditional NAS requires evaluating many architectures, making it costly as the search space grows. Our method uses an LLM proxy with expert knowledge to perform a supernet-free search. Based on historical performance and configurations, the LLM efficiently finds the optimal architecture in only 10 iterations. It outperformed manual, random, and AutoSlim-based baselines while significantly reducing search cost. We will fix the Eq. 2–Fig. 2 inconsistency.
iii) MCDA Setting: MCDA uses manually designed augmentations (e.g., flip, jitter) without LLM guidance and serves as a static baseline.
iv) Implementation: For UMA, image augmentations include (but are not limited to) ColorJitter (0.2/0.2/0.2/0.1), HorizontalFlip (p=0.5), and GaussianBlur (kernel=3). Text augmentations include Synonym Replacement (30% of nouns), Back-Translation, and T5-based Contextual Paraphrasing. DSE search space: FD/GC depth in [1, 3], heads in {2, 4, 8}, hidden dim in {128, 256, 512} (see the illustrative sketch below).
v) Baseline Construction in Tab. 1: Vision baselines (e.g., ConvNeXt, SWIN) are re-implemented with official code and the same settings. To ensure fairness, we use the same transformer-based report generation head across all models.
vi) Patch-level Annotation: Following PatchGastric22, whole slide images (WSIs) are annotated at slide level, with all patches from the same slide inheriting the corresponding report. Per-patch annotation is infeasible at this scale (277k+ patches) and would introduce redundancy. Our framework supports patch aggregation via multi-patch input and extended visual tokens.
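The concrete values in items ii) and iv) above can be read as a small configuration-plus-search skeleton. The sketch below is a hypothetical illustration under stated assumptions: mapping the quoted augmentation parameters onto torchvision transforms, expressing the DSE search space as a plain dictionary, and the llm_propose/evaluate placeholders are mine, not the authors' released implementation.

```python
import random
from itertools import product
from torchvision import transforms

# Image-side UMA augmentations with the parameter values quoted in the rebuttal
# (the torchvision mapping is an assumption, not the authors' code).
uma_image_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),
])

# DSE search space as stated: FD/GC depth in [1, 3], heads in {2, 4, 8},
# hidden dim in {128, 256, 512}, i.e. 3 * 3 * 3 * 3 = 81 candidate configurations.
dse_space = {
    "fd_depth": [1, 2, 3],
    "gc_depth": [1, 2, 3],
    "num_heads": [2, 4, 8],
    "hidden_dim": [128, 256, 512],
}
all_configs = [dict(zip(dse_space, vals)) for vals in product(*dse_space.values())]

def llm_propose(history, candidates):
    """Placeholder for the LLM proxy: given (config, score) history, return the
    next configuration to evaluate. A real call would prompt an LLM with the
    history; random choice is used here only so the skeleton runs."""
    tried = [cfg for cfg, _ in history]
    untried = [c for c in candidates if c not in tried]
    return random.choice(untried)

def evaluate(config):
    """Placeholder for training the candidate and measuring a validation metric."""
    return random.random()

history = []
for _ in range(10):  # the rebuttal reports convergence within 10 iterations
    cfg = llm_propose(history, all_configs)
    history.append((cfg, evaluate(cfg)))
best_cfg, best_score = max(history, key=lambda item: item[1])
```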
To R3: Motivation of DSE: We will revise Sec. 2.2 to clarify that our goal is to enhance cross-modal medical semantic understanding. Prior methods treat augmentation and architecture design separately, resulting in suboptimal, static alignment. Our DSE strategy unifies both in a joint search space, where an LLM guides synergistic optimization based on validation feedback and domain knowledge, boosting MRG performance. The FD-Learner captures fine details; the GC-Learner integrates global context. LN (Layer Normalization) stabilizes activations; CA (Cross-Attention) fuses cross-modal, multi-level features.
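For readers following the LN/CA definitions given in the response above, here is a minimal PyTorch sketch of one layer-normalized cross-attention fusion step. The dimensions, query/key assignment, and residual wiring are illustrative assumptions, not the paper's MCL architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Schematic LN + CA block: LayerNorm-stabilized cross-attention in which
    text tokens query visual tokens to fuse cross-modal features."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)    # LN: stabilize query activations
        self.norm_kv = nn.LayerNorm(dim)   # LN: stabilize key/value activations
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, visual_feats):
        # CA: text queries attend over visual keys/values.
        q = self.norm_q(text_feats)
        kv = self.norm_kv(visual_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        return text_feats + fused          # residual connection

# Example: batch of 2, 16 text tokens and 49 visual tokens, hidden dim 256.
out = CrossModalFusion()(torch.randn(2, 16, 256), torch.randn(2, 49, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```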
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A