Abstract

Millions of melanocytic skin lesions are examined by pathologists each year, the majority of which concern common nevi (i.e., ordinary moles). While most of these lesions can be diagnosed in seconds, writing the corresponding pathology report is much more time-consuming. Automating part of the report writing could, therefore, alleviate the increasing workload of pathologists. In this work, we develop a vision-language model specifically for the pathology domain of cutaneous melanocytic lesions. The model follows the Contrastive Captioner framework and was trained and evaluated using a melanocytic lesion dataset of 42,512 H&E-stained whole slide images and 19,645 corresponding pathology reports. Our results show that the quality scores of model-generated reports were on par with those of pathologist-written reports for common nevi, as assessed by an expert pathologist in a reader study. While report generation proved to be more difficult for rare melanocytic lesion subtypes, the cross-modal retrieval performance for these cases was considerably better.
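
To make the framework concrete, the following is a minimal sketch of a CoCa-style training objective: a contrastive loss aligning paired WSI and report embeddings, combined with a captioning loss for report generation. It is an illustration only, assuming precomputed embeddings and tokenized reports; the function name, tensor shapes, and loss weights are placeholders, not the authors' implementation.

    # Minimal sketch of a CoCa-style objective (contrastive + captioning).
    # All names and weights are illustrative, not the authors' code.
    import torch
    import torch.nn.functional as F

    def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
                  temperature=0.07, lambda_con=1.0, lambda_cap=2.0):
        """image_emb, text_emb: (B, D) paired embeddings;
        caption_logits: (B, T, V); caption_targets: (B, T) report token ids."""
        # Contrastive loss: matched WSI/report pairs lie on the diagonal.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        con = (F.cross_entropy(logits, labels)
               + F.cross_entropy(logits.t(), labels)) / 2
        # Captioning loss: next-token prediction over the report tokens.
        cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                              caption_targets.reshape(-1))
        return lambda_con * con + lambda_cap * cap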

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2599_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/SanderMoon/MOSAIC

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LucRub_Pathology_MICCAI2025,
        author = { Lucassen, Ruben T. and Moonemans, Sander P. J. and van de Luijtgaarden, Tijn and Breimer, Gerben E. and Blokx, Willeke A. M. and Veta, Mitko},
        title = { { Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {510--520}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. Interesting and Relevant Topic: The paper addresses the significant challenge of automating pathology report generation for cutaneous melanocytic lesions, a task that is both time-consuming and repetitive for pathologists. The potential to alleviate the workload of pathologists by automating part of the report writing process is highly relevant and impactful.

    2. Dataset: The proposed dataset comprises 42,512 H&E-stained WSIs and corresponding reports. The dataset will promote the development of automatic report generation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed dataset is very large and may promote the development of vision-language models in the pathology field.
    2. Applying human-level evaluation makes the results more convincing.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of baselines. There are existing works that achieve slide-level report generation, such as MIGen and HistGen, but they are not used for comparison. In addition, a comparison of different retrieval methods is also missing.
    2. Limited generalizability: The model’s performance on rare melanocytic lesion subtypes is significantly lower than on common nevi. This suggests that the model may struggle with more complex and less frequent cases, limiting its generalizability. The paper does not provide a detailed analysis of why the performance drops for these subtypes.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The lack of experiments and baselines.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concern is addressed.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a vision-language model for generating pathology reports from whole slide images of cutaneous melanocytic lesions. The vision-language model was evaluated by having a pathologist assess different aspects of pathology reports written by real pathologists and generated by the model. The study concluded that the vision-language model failed to perform to the standard of a pathologist on most other lesion subtypes but matched that performance on common nevi, showing potential for future use.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem identified in the paper, that pathologists spend a large proportion of their time writing pathology reports for each biopsied benign nevus, is highly relevant.
    • The study uses a large dataset of historic WSIs and paired pathology reports, with 42,512 WSIs.
    • The design of the vision-language model was well explained and justified along with the methods used to train the model and the parameters used.
    • The experiments used for this study have been well designed for the specific use case, sourcing pathologists to review the outcomes of the model to compare them with historic reports, judging them on specific types of error.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is not clear where the novelty of this paper lies, or whether it is merely a slight extension of previous frameworks in order to generate longer pathology reports.
    • As the data used in this study is private, it limits the ability of the rest of the MICCAI community to reproduce or build on this work, despite the code being made available.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a strong study with a detailed evaluation of the proposed method, although it is not clear what the specific novelty of this work is, and all experiments were conducted on private data, limiting reproducibility.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents a CoCa-based vision-language model for generating reports on cutaneous melanocytic lesions, including both diagnoses and detailed descriptions of visual features. Evaluation was conducted through a reader study, including the identification of four types of errors and a subjective score assessing report accuracy and practical usability. Additional evaluation was performed using WSI-to-report and report-to-WSI retrieval tasks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Clear description of materials, methods, evaluations, and interpretation of results.
    2. Incorporation of the latest advancements in digital pathology into model development.
    3. Evaluation of different fine-tuning strategies for BioGPT, offering useful training guidance for future applications.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Further explanation of the four error types, including definitions and examples, would improve understanding.
    2. No comparison is made to existing models, such as PRISM.
    3. The reader study involved only one participant.
    4. There is no discussion of methods to address the imbalanced training dataset for better performance on uncommon lesions.
    5. Training time and computational resources are missing.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Insightful and successful application, incorporating the latest developments, with a clear method description and evaluations.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Insightful and successful application, incorporating the latest developments, with a clear method description and evaluations.




Author Feedback

We thank the reviewers for providing detailed feedback on our paper.

(R1) An important part of the novelty of our paper lies in the application. We adapted a state-of-the-art vision-language model framework, which in prior work was only trained and evaluated on short reports (of one or two sentences) related to the diagnosis, to train and evaluate it on much longer pathology reports that include both the diagnosis and descriptions of visual characteristics. Another noteworthy difference is that the prior study used a pan-cancer dataset, where cases from different organs are visually much more dissimilar than in our narrower domain of melanocytic lesions. As a technical contribution, we also compared regular finetuning, LoRA finetuning, and freezing the language encoder, and found that regular finetuning in particular degraded the performance.
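
As an illustration of these three strategies, the minimal sketch below freezes or LoRA-adapts a BioGPT decoder using Hugging Face Transformers and PEFT. This is not the authors' code; the checkpoint choice and the target module names ("q_proj", "v_proj") are assumptions about BioGPT's attention layers.

    # Hedged sketch of the three language-decoder training strategies:
    # regular (full) finetuning, LoRA finetuning, or freezing the decoder.
    from transformers import BioGptForCausalLM
    from peft import LoraConfig, get_peft_model

    model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

    strategy = "lora"  # one of: "full", "lora", "frozen"
    if strategy == "frozen":
        for p in model.parameters():
            p.requires_grad = False  # train only the vision/fusion components
    elif strategy == "lora":
        config = LoraConfig(r=8, lora_alpha=16,
                            target_modules=["q_proj", "v_proj"])
        model = get_peft_model(model, config)  # small trainable adapters only
    # strategy == "full": all decoder weights stay trainable
    # (the rebuttal reports this degraded performance).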

We agree with R2 that adding definitions of the error types would improve the paper: (1) Factual errors are inaccuracies that are verifiable based on the WSIs (e.g., a report describes mitoses while none are visible); (2) Hallucinations are statements that are unverifiable based on the WSIs (e.g., stating that the patient is a 50-year-old male); (3) Repeated phrases are sentences or subsentences containing information that was already mentioned earlier in the report; (4) Omissions of information are missing descriptions of clinically relevant characteristics. Additionally, the final model was trained on 16 NVIDIA RTX 6000 GPUs (24 GB each) for a duration of ~2 days.

We share R2’s opinion that the reader study would benefit from more participants, which we mentioned as a limitation in the discussion section: “While the reader study showed clear patterns, recruiting multiple readers and selecting a larger set of cases would be important for a more comprehensive analysis.” (R2, R3) We decided not to compare our model against other public models trained on different datasets, because it is often unclear how many melanocytic lesion cases, if any at all, were included for training. Moreover, these other models were likely developed using pathology reports with different reporting standards and formatting, which is consequently also reflected in the generated reports. This may put the other models at an unfair disadvantage and does not serve the goal of our paper. Instead, we focused our evaluation on the performance differences between common nevi and more rare subtypes for pathology report generation and cross-modal retrieval using a single, state-of-the-art vision-language modeling framework. However, for future work, we agree that it would be interesting to compare different vision-language model architectures (such as the original PRISM, MIGen, and HistGen), trained on the same dataset, to investigate how architectural differences affect the performance.
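
For concreteness, the cross-modal retrieval evaluation discussed here can be sketched as ranking reports for each WSI by cosine similarity between their embeddings and measuring top-k accuracy. The sketch below is illustrative (hypothetical function and array names), not the paper's evaluation code.

    # Illustrative WSI-to-report retrieval evaluation (recall@k),
    # assuming row i of each array is a matched WSI/report pair.
    import numpy as np

    def retrieval_at_k(wsi_emb, report_emb, k=5):
        wsi = wsi_emb / np.linalg.norm(wsi_emb, axis=1, keepdims=True)
        rep = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
        sims = wsi @ rep.T  # (N, N) cosine similarity matrix
        ranks = np.argsort(-sims, axis=1)  # best-matching reports first
        hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
        return hits.mean()

    # Example with random embeddings (expected recall@5 of roughly k/N):
    rng = np.random.default_rng(0)
    print(retrieval_at_k(rng.normal(size=(100, 64)), rng.normal(size=(100, 64))))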

In response to the concerns raised by R3, we think that the report generation performance is lower for rare melanocytic lesions because these lesions are more difficult to describe, as reflected by the length of the reports (“For common nevi, the reports contained an average of 77 words across 7 sentences. In contrast, the reports for all other lesion subtypes contained, on average, 125 words across 12 sentences.”, page 3), and because, due to the class imbalance, they contribute less to reaching a low loss during training (“Most of these lesions (81.8%) were benign common nevi.”, page 3). As highlighted by R2, it would be good to mention in the discussion potential solutions, such as oversampling strategies or loss weights for rare diagnostic classes, to potentially improve the performance. Because the reports for these subtypes are longer and more complex, they are also more distinctive, which we hypothesize is the reason for the better retrieval performance on rare melanocytic lesions. Note that retrieval in clinical practice is mostly relevant for the rare subtypes.
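
One way to implement the oversampling strategy mentioned above is an inverse-frequency weighted sampler, sketched below in PyTorch under the assumption of integer subtype labels; the label values are placeholders, not the paper's dataset.

    # Hedged sketch of rebalancing rare lesion subtypes by oversampling.
    import torch
    from torch.utils.data import WeightedRandomSampler

    labels = torch.tensor([0, 0, 0, 0, 1, 2])  # 0 = common nevus; 1, 2 = rare subtypes
    class_counts = torch.bincount(labels).float()
    weights = (1.0 / class_counts)[labels]  # inverse-frequency weight per sample
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    # Pass `sampler=sampler` to the DataLoader so rare subtypes
    # appear more often per epoch.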

We hope that this clarifies the remaining ambiguities and provides more insight into our motivation for certain decisions.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers vote for paper acceptance.


