Abstract

Ensuring equitable access to medical communication is crucial for deaf and hard-of-hearing individuals, especially in clinical settings where effective patient-doctor interaction is essential. In this work, we present a novel radar-based imaging framework for Sign Language recognition (with a focus on the Italian Sign Language, LIS), specifically designed for medical communication. Our method leverages 60 GHz mm-wave radar to capture motion features while ensuring anonymity by avoiding the use of personally identifiable visual data. Our approach performs sign language classification through a two-stage pipeline: first, a residual autoencoder processes Range Doppler Maps (RDM) and moving-target indications (MTI), compressing them into compact latent representations; then, a Transformer-based classifier learns temporal dependencies to recognize signs across varying durations. By relying on radar-derived motion imaging, our method not only preserves privacy but also establishes radar as a viable tool for analyzing human motion in medical applications beyond sign language, including neurological disorders and other movement-related conditions. We carried out experiments on a new large-scale dataset containing 126 LIS signs - 100 medical terms and 26 alphabet letters. Our method achieves 93.6% accuracy, 87.9% sensitivity, 99.3% specificity, and an 87.7% F1 score, surpassing existing approaches, including an RGB-based baseline. These results underscore the potential of radar imaging for real-time human motion monitoring, paving the way for scalable, privacy-compliant solutions in both sign language recognition and broader clinical applications. The code is available at https://github.com/IngRaffaeleMineo/SignRadarClassification_MICCAI2025 and the dataset will be released publicly.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3040_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/3040_supp.zip

Link to the Code Repository

https://github.com/IngRaffaeleMineo/SignRadarClassification_MICCAI2025

Link to the Dataset(s)

MultiMeDaLIS: Send request to raffaele.mineo@unict.it

BibTex

@InProceedings{MinRaf_RadarBased_MICCAI2025,
        author = { Mineo, Raffaele and Caligiore, Gaia and Proietto Salanitri, Federica and Kavasidis, Isaak and Polikovsky, Senya and Fontana, Sabina and Ragonese, Egidio and Spampinato, Concetto and Palazzo, Simone},
        title = { { Radar-Based Imaging for Sign Language Recognition in Medical Communication } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {542 -- 552}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper leverages mmWave radar to sense sign language, specifically Italian Sign Language (LIS). The contribution comes from the protection of privacy and the proposed end-to-end architecture for processing RDM and RDM-MTI frames.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The dataset used in this paper is for Italian Sign Language, which differs from the more commonly used American Sign Language. The dataset also contains numerous isolated signs, with appropriate signal processing (RDM/RDM-MTI) to enhance the relevant features.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The method used in this work is simply a CNN-based autoencoder with a transformer classifier. I did not see any novelty in this architecture.
    2. There is only one figure in this paper, and it contains too many processes and too much information, resulting in a vague description of both the signal preprocessing and the model structure.
    3. In the last paragraph of the Introduction, the authors state: “address the limitation by aligning Radar data with depth and RGB sensor”. Since only the radar signal appears to be used in this paper, what is the meaning of aligning radar data with other modalities?
    4. The paper lacks necessary illustrations of (a) the collected dataset; (b) the model architecture; (c) the variation of the radar signals (RDM and RDM-MTI) across different signs; (d) a fine-grained analysis of the experimental results, e.g., a confusion matrix for different hand gestures.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    One of your reference papers might serve as a template for improving your work, in terms of both experimental design and academic writing/figures. [9] Debnath, B., Ebu, I.A., Biswas, S., Gurbuz, A.C., Ball, J.E.: FMCW radar range profile and micro-Doppler signature fusion for improved traffic signaling motion classification. In: 2024 IEEE Radar Conference (RadarConf24). pp. 1–6. IEEE (2024)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments and writing are poorly organized. Only one figure is included in this paper, which is not suitable for acceptance at MICCAI. The deep learning architecture is dated and offers no novelty.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a radar-based imaging framework for medical communication in Italian Sign Language.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper focuses on a rarely explored domain: sign language recognition using a radar-based solution. The context makes sense in that the medical setting requires particular attention to privacy preservation and accessibility.
    2. The proposed method achieves good performance on the targeted task.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Indirect clinical impact. The proposed method/approach focuses on the LIS-capturing side and does not have a substantial impact on medical problems. The paper would be more suitable for a conference in the sign language community.
    2. Data were presented for only one subject, which does not provide sufficient experimental soundness.
    3. Lack of technical novelty. The computational framework relies on a very simple video-processing method and does not consider more recent state-of-the-art approaches.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic is interesting but does not fit well with MICCAI’s general objectives. There has been limited consideration of clinical integration or impact. Also, the paper lacks evaluation on more than one subject and uses a relatively old computational method.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I recommended a weak reject due to the paper’s limited suitability, indirect clinical impact, and mediocre technical novelty. Although I am not an expert in the LIS/RDM domain, I found R1 and R2 to be confident, and we shared similar concerns. I would like to maintain the rating. Regarding the rebuttal: “Radar SLR is unlike RGB video: RDMs encode distance × velocity and need spectral preprocessing” does not adequately justify the lack of a powerful backbone model. While LIS-specific data might be rare, transfer learning is always possible.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel radar-based framework for Italian Sign Language (LIS) recognition, designed to support medical communication in clinical settings for Deaf and hard-of-hearing individuals. Leveraging 60 GHz mm-wave radar, the system captures motion features in a privacy-preserving way by avoiding visual or personally identifiable data. The proposed pipeline consists of a residual autoencoder for feature compression from Range Doppler Maps (RDMs) and moving-target indications (MTIs), followed by a Transformer-based classifier that models temporal dependencies. The authors conduct extensive experiments on a newly collected LIS dataset consisting of 126 signs (100 medical terms and 26 alphabet letters), achieving strong results across multiple metrics and outperforming both radar and RGB-based baselines. The authors also commit to releasing the code and dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel privacy-preserving modality: The use of mm-wave radar in sign language recognition is compelling, particularly in clinical environments where privacy and anonymity are critical concerns. This is an innovative departure from traditional camera-based approaches, especially given the reported performance.
    • Thorough experimentation and contribution of a new dataset: The collection of a large-scale LIS dataset (126 signs) and the comprehensive experimental evaluation, including comparisons against baselines and multiple metrics, strengthen the overall contribution.
    • Clarity: The paper is also well-structured and clearly written.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Limited subject variability: While the dataset is substantial in terms of sign vocabulary and frames, all signs are performed by a single subject. This significantly limits the system’s ability to generalize across different individuals, a crucial aspect in real-world deployment. This limitation is not sufficiently discussed in the manuscript.
    • Architectural inconsistency: There is an unexplained discrepancy in the reported size of the latent embeddings from the autoencoder, which are said to vary from 256 to 1024 before being passed to the Transformer. It is unclear whether this is a typo, a dynamic setting, or a result of sign duration variability. This point requires clarification, as it affects the interpretability and reproducibility of the model architecture.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    This is a promising and well-presented study with a strong technical contribution. A few points of improvement and clarification are suggested:

    • the reported latent vector sizes (256 to 1024) passed to the Transformer should be explained more clearly. If variable-length sequences are used, this should be explicitly stated; if not, the discrepancy may simply be a typo.
    • the manuscript would benefit from a discussion on the implications of training and evaluating solely on data from a single subject. This limitation should be acknowledged and framed as a direction for future work, possibly for a journal extension.
    • since the method is aimed at real-time, privacy-compliant use in medical settings, a short mention of inference time or deployment considerations would further strengthen the practical relevance.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work presents a creative and well-executed solution for privacy-preserving sign language recognition using radar imaging. The technical approach is sound, and the performance gains are impressive. However, generalizability remains a concern due to the use of a single subject, and there is a need to clarify model design details.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors clearly explained and solved the reported issues. The weak accept can therefore be considered as an accept.




Author Feedback

We thank the reviewers for their suggestions.

R1

  1. Any novelty: Although deceptively simple, this is the first application of a CNN autoencoder followed by a Transformer to mmWave radar data for SLR. RDM sequences encode sparse Doppler motion signatures with non-smooth temporal transitions, framing the problem as 2D signal processing over time rather than RGB video understanding. We show that a lightweight autoencoder can compress radar frames while preserving discriminative features, enabling a Transformer to outperform ResNet-18, AlexNet, and end-to-end Transformers on RDMs by 5–15% accuracy, using up to 4x less GPU memory and 3x faster training (p4.2). This efficiency–accuracy trade-off, detailed in p3.3, allows real-time inference on CPUs or edge GPUs (relevant for clinical deployments). The final manuscript will emphasize this novelty for radar signals.
  2. Insufficient visuals: We used a single high-level schematic (Fig1) to avoid clutter under tight page limits. It shows the full pipeline, while all layer details, preprocessing steps, and hyperparameters are described in the text. A brief supplementary video demonstrates data capture and preprocessing for full clarity.
  3. No multimodality: Data were acquired multimodally (radar, RGBd, depth); we detailed all preprocessing for completeness. This paper uses only radar, but mentioning radar–RGBd synchronization aids future multimodal extensions. We can drop the depth/RGB reference if preferred.
  4. Lack (a) dataset, (b) architecture, (c) RDM vs RDM-MTI; (d) analysis: (a, c) We will add two panels showing an RGB frame, its RDM, and its MTI map for one medical sign and one letter. (b) Fig1 gives a high-level autoencoder schematic; full layer and hyperparameter details are in the text and code. The Transformer follows Vaswani et al., documented in p3.2 and the repo. (d) Instead of a 126×126 matrix, we’ll include a brief summary of top confusion pairs (e.g., pain–ache, B-P) and a per-class F1 inset. Most confusions occur between similar glosses (ARM–HAND (10%), COUGH–COLD (8%), MOUTH–THROAT (9%), EMERGENCY_ROOM–HOSPITAL (7%), ME–YOU (6%), MORNING–DAY (5%), FOREHEAD–NECK (4%), YES–NO (3%)) pointing to future gains from multimodal cues or targeted augmentation.
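The two-stage pipeline defended in point 1 can be sketched in terms of tensor shapes. Below is a minimal NumPy illustration; the frame resolution, sequence length, and the random linear map standing in for the autoencoder's encoder are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sequence of T Range-Doppler frames (hypothetical 64x64 resolution).
T, H, W = 30, 64, 64
rdm_sequence = rng.standard_normal((T, H, W))

# Stage 1: per-frame compression into a compact latent vector.
# A random linear map stands in for the residual autoencoder's encoder.
latent_dim = 256
encoder = rng.standard_normal((H * W, latent_dim)) / np.sqrt(H * W)
latents = rdm_sequence.reshape(T, H * W) @ encoder  # shape (T, latent_dim)

# Stage 2: the Transformer classifier consumes the latent sequence
# and models temporal dependencies across the sign's duration.
assert latents.shape == (T, latent_dim)
print(latents.shape)  # (30, 256)
```

The point of the decomposition is that the sequence model never sees raw H×W frames, which is where the reported memory and training-time savings come from.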

R2

  1. Limited subject variability: We acknowledge this limitation and will add it to Future Work (p4), outlining plans for cross-subject validation, greater signer diversity, and external testing. We have already collected data from nine more LIS signers and three foreign SL datasets (ASL, Turkish, Arabic) for our journal follow-up.
  2. Unclear embedding size: We apologize for the confusion. Our autoencoder’s bottleneck is 256 channels × 2×2 spatial = 1024 total units, which we project down to a 256-dimensional frame embedding before the Transformer input. We will correct the manuscript to explain this clearly.
  3. Real-time applicability: We will add Tab. 4 in p4.3 to report end-to-end inference on an NVIDIA A100 (7 ms per frame = 142 fps) and on an Intel Core i5 (10th gen) (81 ms per frame = 12 fps), demonstrating the real-time feasibility of our model on edge devices.
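The shape arithmetic in point 2 and the throughput figures in point 3 can be checked directly. This is a sketch under the shapes stated in the rebuttal; the zero-valued projection matrix is a placeholder, not the trained layer:

```python
import numpy as np

# Point 2: bottleneck of 256 channels over a 2x2 spatial grid = 1024 units.
channels, h, w = 256, 2, 2
bottleneck = np.zeros((channels, h, w))
flat = bottleneck.reshape(-1)
assert flat.size == 1024

# A linear projection maps the 1024 flattened units down to the
# 256-dimensional per-frame embedding fed to the Transformer.
projection = np.zeros((1024, 256))  # placeholder weights
frame_embedding = flat @ projection
assert frame_embedding.shape == (256,)

# Point 3: milliseconds per frame -> frames per second.
print(1000 // 7)   # 142 fps on the A100 (7 ms/frame)
print(1000 // 81)  # 12 fps on the CPU (81 ms/frame)
```

So the 1024 and 256 figures are not contradictory: 1024 is the flattened bottleneck size, 256 the projected embedding dimension.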

R5

  1. Indirect clinical impact: Our primary goal is to enable automated front-desk translation in ER/triage, where signers cannot always rely on interpreters. We will clarify in the Conclusion that our long-term vision is an avatar receptionist translating LIS to speech/text in real time, reducing wait times and improving patient comfort.
  2. Limited subject variability: See R2 first point.
  3. Simple video-processing methods: Radar SLR is unlike RGB video: RDMs encode distance × velocity and need spectral preprocessing. In our tests, 3D CNNs (ResNet3D, I3D) on raw RDMs achieve ≤40% accuracy; prior radar methods reach only 71.9–81.0% (Jhaung, Debnath, Arab), and RGB/RGB-D baselines top out at 88.4% (De Coster) and 84.1% (Vahdani). Our radar-specific architecture, by contrast, achieves 93.6%, surpassing all SOTA. We will include a brief comparison in p4.

We look forward to a favorable decision, thanks.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a privacy-preserving radar-based framework for sign language recognition in medical communication. While the idea is novel and the performance on LIS data is promising, the method lacks technical innovation and broader clinical integration. The model is built on standard components without strong justification for design choices. The dataset includes only one signer, limiting generalizability. Clinical impact is indirect, and evaluation does not reflect real-world variability. Despite a detailed rebuttal, concerns regarding subject diversity, architectural clarity, and the paper’s fit within MICCAI remain. Therefore, I do not recommend acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Rejected because two of the reviewers recommended rejection, while the third was on the verge of acceptance, with many remarks and harsh criticism.


