Abstract

Remote medical care has become commonplace with the establishment of patient portals, the maturation of web technologies, and the proliferation of personal devices. However, while on-demand care provides convenience and expands patient access, it may also increase the workload of healthcare providers. Drafting candidate responses may help speed up physician workflows for answering electronic messages. One specialty that may benefit from the latest multi-modal vision-language foundation models is dermatology. However, no existing dataset incorporates dermatological health queries along with user-generated images. In this work, we contribute a new dataset, DermaVQA (https://osf.io/72rp3/), for the task of dermatology question answering, and we benchmark the performance of state-of-the-art multi-modal models on multilingual response generation using relevant multi-reference metrics. The dataset and corresponding code are available on our project’s GitHub repository (https://github.com/velvinnn/DermaVQA).
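
To make "multi-reference" concrete: each user query in DermaVQA can have more than one gold response, so a generated answer is scored against all references at once. The sketch below is illustrative only (hypothetical example strings; sacrebleu shown as one representative metric, not necessarily the paper's full suite):

```python
# Minimal sketch of multi-reference scoring with sacrebleu; the metric
# choice and example strings are illustrative, not the paper's exact setup.
import sacrebleu

# One generated response per query (hypothetical data).
hypotheses = [
    "This looks like contact dermatitis; a mild topical steroid may help.",
]

# Two gold references per query: references[i][j] is the i-th reference
# for the j-th hypothesis.
references = [
    ["Likely irritant contact dermatitis; try an over-the-counter hydrocortisone cream."],
    ["This appears to be contact dermatitis; avoid the trigger and use a mild steroid."],
]

# BLEU credits the hypothesis against the closest-matching n-grams
# across all supplied references.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"Multi-reference BLEU: {bleu.score:.2f}")
```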

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2444_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/velvinnn/DermaVQA

Link to the Dataset(s)

https://osf.io/72rp3/

BibTex

@InProceedings{Yim_DermaVQA_MICCAI2024,
        author = { Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Ben Abacha, Asma and Yetisgen, Meliha and Xia, Fei},
        title = { { DermaVQA: A Multilingual Visual Question Answering Dataset for Dermatology } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces DermaVQA, a new dataset for dermatology question answering using multimodal inputs. The dataset aims to facilitate the development of AI-assisted tools to help dermatologists streamline their workflows when answering electronic messages in remote medical care settings. The authors benchmark SOTA models on multilingual response generation using the DermaVQA dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The dataset is valuable for the dermatology AI community; as the first VQA dataset in this field, it could have real impact.
    2. The data curation method is interesting and may inspire other researchers to collect datasets in a similar way.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper’s structure is unusual for a MICCAI paper, with an overly long related work section (>2 pages) and a brief half-page introduction. Some sections in the related work part are unnecessary or overly wordy.
    2. The evaluation is limited, lacking open-source LVMs as baselines, such as LLaVA-Med, Med-Flamingo, MiniGPT-4, BLIP-2, LLaVA, and OpenFlamingo.
    3. Details on data curation, statistics, and methodology are insufficient.
    4. Experiment analysis is shallow, providing few insights.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The proposed dataset’s statistics should be included in Table 2 for comparison.
    2. The dataset’s quality is uncertain, as the paper lacks many details.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the dataset may be valuable, the overall poor writing makes it difficult to assess its true quality. The lack of open-source baseline evaluations further reduces confidence in the dataset’s value.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The rebuttal partially addresses my concerns, leading me to raise my score. I anticipate that the authors can enhance the paper’s quality by considering my earlier suggestions. Also, open-sourcing the dataset alongside detailed documentation is essential; without these, the dataset paper should not be accepted. Moreover, if the paper is accepted, it should include a discussion of future work based on this dataset, which is essential to make the paper more impactful.



Review #2

  • Please describe the contribution of the paper

    The paper makes a significant contribution to the field by introducing the DermaVQA dataset, addressing a critical need in the intersection of AI and dermatology.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduce “DermaVQA,” a novel dataset tailored for dermatology visual question answering (VQA), marking a significant contribution to the field of medical AI and dermatology. This dataset addresses a gap in multi-modal (text and image) datasets in dermatology, fostering advancements in AI-driven diagnostics and patient care.

    2. The work’s focus on both English and Chinese languages, coupled with its integration of textual and visual data, stands out. This approach not only broadens the potential user base but also mirrors real-world scenarios where patients and healthcare providers communicate in diverse languages and use images to convey medical issues.

    3. The dataset’s detailed characteristics, including annotations for diagnosis, treatment advice, and demographic information (age, sex), enhance its utility and realism. Such comprehensive details enable more targeted and nuanced AI model training, potentially leading to higher accuracy and relevance in practical applications.

    4. The authors’ methodology in dataset creation, from data collection on platforms like IIYI and Reddit to the careful filtering and gold standard response curation, is thorough and well-documented. This meticulous approach ensures high-quality data and sets a precedent for future dataset development in similar domains.

    5. The benchmarking of the dataset against current leading multi-modal models provides valuable insights into the dataset’s challenges and the current state of AI capabilities in dermatology. It lays a solid foundation for future work and improvements in the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the baseline models tested provide a starting point, the relatively modest performance scores across the board suggest that there is significant room for improvement. Future work could focus on developing more sophisticated models or algorithms specifically tailored to the unique challenges of dermatological VQA.

    2. Although the dataset covers a wide range of dermatological conditions and query types, the extent to which these findings can be generalized to other medical specialties or health queries remains to be seen. Further research could explore the applicability of the proposed methodology to broader medical contexts.

    3. The paper primarily focuses on the dataset creation and preliminary model testing. An area for future exploration could be the real-world applicability of these models, including integration into telemedicine platforms and assessment of their impact on healthcare delivery and patient outcomes.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Pros and cons: see the strengths and weaknesses listed above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see above

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel dermatology dataset comprising images accompanied by corresponding queries from patients, along with doctors’ responses and comments. The authors have annotated the texts into various classes based on their nature, such as questions, advice, diagnoses, etc. Additionally, the paper demonstrates the performance of multiple models across various metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This dataset holds great promise for advancing dermatology research. The paper is clear and provides detailed explanations of the data curation and annotation process. Additionally, the results across multiple models and metrics are noteworthy.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper does not have many weaknesses; however, despite the presented results, it is difficult to analyze the quality of the dataset without more examples in the paper. Additionally, there may be legal and ethical considerations related to the publication of data from the selected websites.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Providing more examples of the data in the paper would be highly beneficial. However, overall, the paper is already very well-written and useful for the community.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is in line with the conference and provides a very interesting dataset, so I would suggest accepting it.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I remain convinced by the work, and the authors responded to the comments in the rebuttal in an interesting way. I still suggest accepting the paper, as the work presented is pertinent to MICCAI and the dataset is very valuable.




Author Feedback

We thank the reviewers for their valuable feedback, comments, and questions. We summarize several major opinions below:

  1. Additional baseline systems and model experiments (Reviewers 3, 4)

Currently, we experiment with three different models: GPT-4, Gemini Pro Vision, and a fine-tuned LLaVA. For additional context, this dataset was used in a shared task, and the majority of participating models and complex systems did not outperform our three LVM-based baselines. The main intention of this paper is to provide dataset construction details and several baseline starting points (with reasonably high-performing models) to which shared-task papers and other systems can refer and compare. Although there are more models we could test, we believe the current systems are reasonably strong general baselines for reference, and that the creation of more complex systems is better left for future work.
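
For illustration, a zero-shot LVM baseline can be run by sending each post's question and image to a multimodal chat model. The sketch below uses the OpenAI Python client; the model name, prompt wording, and the answer_derm_query helper are hypothetical and do not reflect our exact prompting configuration:

```python
# Hypothetical sketch of a zero-shot multimodal baseline (not our exact
# prompting setup): send a patient's question plus photo to a chat model.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_derm_query(question: str, image_path: str) -> str:
    """Draft a candidate dermatology reply for one query-image pair."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Patient question: {question}\n"
                         "Draft a concise dermatologist-style response."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```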

If accepted, we will use the additional space allocation to describe future directions of development, such as employing other foundation models (e.g., BiomedCLIP, Florence) fine-tuned on a set of synthetically produced dermatology VQA data. Furthermore, adding image segmentation and normalization, as well as intermediate image-characteristic extraction, may provide additional performance gains.

  2. Quality of the dataset (Reviewers 4, 5)

The dataset was curated from two online sites, as described in Section 3. On the first site (IIYI), responses are from platform-enrolled doctors. For the second subset (Reddit), the responses were written by certified US dermatologists. Additionally, medical annotators were employed to ensure that images included relevant content, were of sufficient quality, and did not contain inappropriate content (e.g., genitalia or user-drawn markings). For the IIYI dataset, medical annotators ensured that posts had medically relevant advice.

  3. Details on data curation, statistics, and methodology are insufficient (Reviewer 4)

In our manuscript, we provided high-level information on the data-creation methodology and statistics. In our dataset release, we will provide further details related to the data curation, including annotation guideline instructions.

  4. Lengthy related work section (Reviewer 4)

Some of the related-work discussion is relevant to the comparison of final system scores in the results. We will move this discussion to the results section.

  5. Experiment analysis is shallow, providing few insights (Reviewer 4)

If accepted, we will use the additional page to add commentary on how certain techniques may be employed to overcome some of the challenges of this problem. For example, additional future directions include experimenting with other foundation models, such as BiomedCLIP, and fine-tuning on a set of synthetically produced dermatology VQA data. Furthermore, image segmentation and normalization, as well as intermediate image-characteristic extraction, are potential avenues for future work.

  6. Further research could explore the applicability of the proposed methodology to broader medical contexts (Reviewer 3)

Other medical VQA specialties, such as radiology, ophthalmology, and pathology, have established datasets with narrower label sets. For example, previous work’s labels were largely 1-2 words with little variation. Part of the contribution of this work is the creation of a dataset for a new specialty whose expected QA output is completely different. However, we agree that this would be an interesting direction for future work.

  7. An area for future exploration could be the real-world applicability of these models, including integration into telemedicine platforms and assessment of their impact on healthcare delivery and patient outcomes (Reviewer 3)

We agree this would be an excellent area of interest for future work.

  8. Information about legal/ethical requirements (Reviewer 5)

If accepted, we will use the additional space to provide more context for these issues.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, the reviewers reached a unanimous agreement. Considering the value of the dataset, the paper should be accepted. However, some concerns raised by the reviewers should be addressed, including issues related to writing and evaluation.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    After the rebuttal, the reviewers reached a unanimous agreement. Considering the value of the dataset, the paper should be accepted. However, some concerns raised by the reviewers should be addressed, including issues related to writing and evaluation.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


