Abstract

We present a novel dual-stream deep learning architecture, AcouSem-AFNet, for automated tuberculosis (TB) detection using acoustic analysis of respiratory sounds. The proposed architecture utilizes two complementary pathways to extract distinct semantic and acoustic characteristics essential for identifying TB-related respiratory patterns. Specifically, the semantic stream employs a Whisper encoder to model structured patterns in respiratory events, while the acoustic stream leverages WavLM to capture detailed temporal dynamics characteristic of TB cough sounds. These distinct features are fused through a specialized backbone with squeeze-excitation mechanisms and residual connections, designed explicitly to maintain discriminative capabilities and mitigate overfitting challenges typical of limited medical datasets. Evaluated on the CODA-TB challenge dataset, our approach achieves state-of-the-art performance with an accuracy of 78.10% and an AUC of 0.79, demonstrating improvements of 3% in AUC and 2% in accuracy over leading baseline methods. Our framework enables rapid, non-invasive TB screening, particularly beneficial for resource-limited settings, demonstrating the feasibility of deep learning-based acoustic analysis as a scalable, preliminary diagnostic tool to enhance global TB screening accessibility. The code and models are publicly available at https://github.com/IAB-IITJ/AcouSem-AFNet.
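
To make the fusion idea in the abstract concrete, the following is a minimal sketch of a generic squeeze-excitation gate with a residual connection applied to two concatenated feature streams. It is not the authors' backbone: the function name `squeeze_excite_fuse`, the weight shapes, and the assumption that both streams have the same number of frames are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def squeeze_excite_fuse(semantic, acoustic, w1, w2):
    """Fuse two [frames][channels] feature maps with an SE gate and a residual path.

    semantic, acoustic: lists of frames (assumed to have equal length).
    w1: [hidden][channels] bottleneck weights; w2: [channels][hidden] weights.
    All names and shapes here are illustrative assumptions.
    """
    # Concatenate the two streams along the channel axis, frame by frame.
    fused = [s + a for s, a in zip(semantic, acoustic)]
    n_frames, n_channels = len(fused), len(fused[0])

    # Squeeze: average each channel over time to get one summary per channel.
    squeezed = [sum(frame[c] for frame in fused) / n_frames
                for c in range(n_channels)]

    # Excite: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]

    # Scale each channel by its gate and add the input back (residual).
    return [[x + g * x for x, g in zip(frame, gates)] for frame in fused]
```

Because each gate lies strictly between 0 and 1, every output value is the input scaled by a factor between 1 and 2, so the residual path keeps the original features while the gate re-weights channels.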

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5335_paper.pdf

SharedIt Link: https://rdcu.be/eHwLG

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04927-8_44

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/IAB-IITJ/AcouSem-AFNet

Link to the Dataset(s)

Cough Diagnostic Algorithm for Tuberculosis (CODA TB) challenge dataset: https://www.synapse.org/Synapse:syn31472953/wiki/619711

BibTex

@InProceedings{AkhYas_NonInvasive_MICCAI2025,
        author = { Akhter, Yasmeena AND Ranjan, Rishabh AND Dutta, Bikash AND Vatsa, Mayank AND Singh, Richa},
        title = { { Non-Invasive TB Detection using Acoustic and Semantic Features from Cough Sounds } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15960},
        month = {September},
        page = {460 -- 470}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper addresses non-invasive TB detection, which is an interesting topic. The authors propose a novel dual-stream deep learning architecture, AcouSem-AFNet, for automated tuberculosis (TB) detection through acoustic analysis of respiratory sounds. The paper is well written and well organized. However, the current version raises several concerns; addressing them would improve the quality of the paper.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1 The research directions in this paper are popular and of interest to readers and can provide new ideas.

2 The paper is written in a logical and easy to understand manner.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1 In the abstract, the authors should further describe the background and challenges for readers outside this field, instead of devoting so much space to modeling details.

2 Figure 1 can be further improved: the amount of information it currently conveys is very limited, and more examples and illustrations could be given. Figure 2 also needs to be improved; it is currently impossible to see a clear model structure.

3 The experimental component is inadequate, especially the in-depth analysis of the results and observed phenomena.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The experiment is inadequate and the presentation needs to be improved.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors’ reply resolved my doubts to some extent, and I think the score can be raised to weak accept.

Review #2

Please describe the contribution of the paper

The paper proposes AcouSem-AFNet, a dual-stream model combining semantic and acoustic features for TB detection from respiratory sounds. It achieves state-of-the-art results on the CODA-TB dataset and shows improved accuracy and AUC over baseline models.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper proposes a dual-stream model for cough-sound-based TB testing and achieves better AUC and accuracy than baseline models.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The authors assume that TB-related cough sounds contain distinct acoustic biomarkers, but do not provide sufficient evidence or discussion about the variability and potential overlap with other respiratory diseases. Distinguishing TB only from COVID-19 is not realistic, since a great many diseases and illnesses share cough as a symptom.
2. While the work emphasizes real-world applicability in resource-limited settings, no details are provided regarding robustness to environmental noise or hardware heterogeneity. In addition, the criticism of traditional TB screening methods is somewhat simplified, and the limitations of prior work are not quantitatively substantiated.
3. The novelty of the model architecture remains unclear, as it primarily combines existing modules.
4. Are the two streams (Whisper & WavLM) simply connected in parallel, or do they interact? How do the authors ensure that the streams extract complementary rather than redundant information? Are there any ablation experiments to illustrate the effectiveness of the two-stream design?
5. The authors used the training part of the challenge dataset and then divided it (70/10/20). Could this cause patient-level information leakage (different cough samples from the same patient entering both training and testing)? Is the division based on patients?
6. While the proposed model shows modest improvements in AUC and accuracy, the claim of clinical significance may be overstated without supporting evidence, such as reduced false negatives or earlier detection in real-world scenarios. Moreover, no statistical significance testing, such as p-values or confidence intervals, is provided. The statement that the results “conclusively demonstrate” the superiority of the approach is too strong given that evaluations are limited to a single dataset without external validation. A more detailed ablation study or subgroup analysis would help clarify the actual contribution of each component.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I recommend a weak reject due to concerns about the limited novelty, lack of evidence supporting the distinctiveness of TB-related cough features, absence of robustness analysis for real-world deployment, and insufficient experimental validation such as ablation studies and statistical significance testing.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The main contribution is the development of AcouSem-AFNet, a novel dual-stream deep learning architecture designed for automated tuberculosis (TB) detection through cough sound analysis. This innovative model integrates semantic features, extracted using a Whisper encoder to capture structured respiratory patterns, and acoustic features, derived from WavLM to preserve fine-grained temporal characteristics specific to TB. These features are fused through a specialized backbone featuring squeeze-excitation mechanisms and residual connections, enhancing discriminative power while mitigating overfitting on limited medical datasets. Evaluated on the CODA-TB challenge dataset, AcouSem-AFNet achieves state-of-the-art performance with 78.10% accuracy and an AUC of 0.79, outperforming existing methods by a significant margin (3% AUC improvement and 2% accuracy gain over the best baselines). This framework offers a rapid, non-invasive, and scalable solution for TB screening, particularly valuable in resource-constrained settings, addressing critical gaps in global TB detection.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper exhibits several major strengths that distinguish it as a significant contribution to automated tuberculosis (TB) detection. Its primary strength lies in the novel formulation of the AcouSem-AFNet architecture, a dual-stream deep learning model that uniquely integrates semantic and acoustic features from cough sounds. The semantic stream, powered by a Whisper-large encoder, captures structured patterns in respiratory events, while the acoustic stream, utilizing WavLM, preserves fine-grained temporal characteristics specific to TB-related sounds. This parallel processing of complementary features is innovative because it addresses the limitations of single-stream models that often fail to capture the full spectrum of TB acoustic signatures. By fusing these streams through a specialized backbone with squeeze-excitation mechanisms and residual connections, the model maintains discriminative power while preventing overfitting, a critical challenge in medical datasets with limited samples. This architectural novelty is interesting as it sets a new standard for audio-based diagnostic systems, offering a robust framework adaptable to other respiratory diseases. Another strength is the strong evaluation on the CODA-TB challenge dataset, which includes 9,772 cough recordings from 1,105 patients across seven countries. Achieving state-of-the-art performance with 78.10% accuracy and 0.79 AUC, the model outperforms baselines by 3% in AUC and 2% in accuracy. This rigorous evaluation, supported by detailed comparisons with eight existing architectures like RawNet3 and SpecRNet, demonstrates the model’s superior ability to detect TB-specific acoustic patterns. The inclusion of ROC curve analyses and performance metrics further enhances the evaluation’s credibility, making it a compelling case for the model’s reliability. The paper also demonstrates clinical feasibility, a crucial strength for real-world impact. 
By leveraging cough sounds, a non-invasive, specimen-free biomarker, the approach addresses diagnostic barriers like limited healthcare access and stigma-induced underreporting. The 3% AUC improvement translates to reduced false negatives, potentially enabling earlier TB intervention, which is particularly valuable in resource-constrained settings. This clinical relevance is underscored by the model’s design for rapid, low-cost screening, aligning with global health needs for scalable TB detection solutions.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The paper presents a compelling approach, but it has several notable weaknesses that warrant discussion. One significant weakness is the limited discussion of model interpretability, which is critical for clinical adoption. While the AcouSem-AFNet model achieves impressive performance, the paper does not adequately explain how the semantic and acoustic features correlate with TB-specific pathophysiology. For instance, it lacks detailed analysis of which cough sound characteristics (e.g., pitch, timbre, or temporal patterns) are most discriminative, making it challenging for clinicians to trust or interpret the model’s decisions. This gap reduces the practical utility in medical settings, where interpretability is often as important as accuracy. Another weakness is the absence of validation across diverse demographic populations beyond the CODA-TB dataset’s seven countries. Although the dataset is comprehensive, the paper does not address performance variations across age groups, genders, or comorbidities, which are critical for ensuring robustness in real-world scenarios. For example, TB presentation may differ in elderly patients or those with HIV, yet the paper provides no subgroup analysis to confirm generalizability. This limitation undermines the claim of scalability, as untested demographic factors could affect model reliability. The paper also falls short in exploring multi-modal integration, despite acknowledging it as a future direction. Prior work, such as Nathavitharana et al. (2019, cited in the paper as reference [5]), emphasized the value of combining clinical metadata with diagnostic tests to improve TB triage accuracy. By focusing solely on cough acoustics, the paper misses an opportunity to enhance performance using readily available metadata like patient symptoms or medical history, which could have strengthened clinical relevance. This reliance on a single modality is a notable oversight given the precedent set by earlier studies. 
Additionally, the evaluation, while strong, lacks a thorough analysis of failure cases. The paper reports a 78.10% accuracy and 0.79 AUC but does not discuss scenarios where the model misclassifies TB-positive or TB-negative cases. Understanding these errors—potentially through confusion matrices or qualitative analysis of misclassified samples—would provide deeper insights into limitations and guide future improvements. Without this, the evaluation feels somewhat superficial despite its quantitative rigor. Finally, the computational complexity of the dual-stream architecture is not addressed, which could hinder deployment in resource-constrained settings—the very environments the paper targets. The use of Whisper-large and WavLM encoders, combined with AFMS-Res2MP blocks, suggests significant computational demands, yet the paper provides no details on inference time, memory requirements, or feasibility on low-cost devices. This omission contrasts with the stated goal of accessibility and scalability, as resource-limited clinics may struggle to implement such a model without optimization. These weaknesses collectively temper the paper’s impact, highlighting areas for refinement to fully realize its potential.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

My recommendation of “Weak Accept” (could be accepted, dependent on rebuttal) is based on a balanced assessment of its significant strengths and notable weaknesses, with the final decision hinging on the authors’ ability to address critical gaps in their rebuttal. The major factors influencing this score are outlined below, reflecting the paper’s contributions, limitations, and potential for improvement.
The paper’s primary strength is its novel AcouSem-AFNet architecture, which introduces a dual-stream deep learning model integrating semantic (Whisper-large encoder) and acoustic (WavLM) features for TB detection from cough sounds. This innovative design effectively captures complementary respiratory patterns, achieving state-of-the-art performance with 78.10% accuracy and 0.79 AUC on the CODA-TB dataset. The 3% AUC and 2% accuracy improvement over baselines like SpecRNet and RawNet3 demonstrate a meaningful advance in acoustic-based diagnostics. This architectural novelty, combined with a robust evaluation across a diverse dataset from seven countries, strongly supports the paper’s scientific merit. The clinical feasibility of non-invasive, low-cost TB screening further enhances its relevance, addressing a critical global health challenge in resource-constrained settings.
However, several weaknesses temper this enthusiasm, influencing the “weak” designation. The lack of model interpretability is a significant concern, as the paper does not clarify how specific cough sound features correlate with TB, limiting its clinical trustworthiness. This is particularly problematic for medical applications where understanding decision-making is essential. Additionally, the absence of subgroup analysis across demographics (e.g., age, gender, comorbidities) raises questions about generalizability, despite the dataset’s breadth. The paper’s failure to explore multi-modal integration, despite prior work like Nathavitharana et al. (2019) highlighting its value, misses an opportunity to boost performance with clinical metadata. The evaluation, while strong, lacks depth in analyzing failure cases, which would provide a more comprehensive understanding of limitations. Finally, the computational complexity of the model is unaddressed, potentially undermining its feasibility in low-resource settings, a key claim of the paper.
The “Weak Accept” recommendation reflects the paper’s strong contributions but acknowledges these gaps as barriers to immediate acceptance. The major factors driving this score are the innovative methodology and solid performance, weighed against the need for better interpretability, demographic validation, and practical deployment details. A compelling rebuttal addressing these concerns, particularly by providing interpretability insights, subgroup analyses, or plans for computational optimization, could justify acceptance. Conversely, failure to resolve these issues might warrant rejection, as they impact the paper’s real-world applicability. Thus, the recommendation hinges on the authors’ ability to strengthen their case through clarifications and additional evidence.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

MR: Thank you for your valuable feedback & for recognizing our work’s potential in global TB diagnostics. Our approach, aligned with WHO priorities, offers an automatic, non-invasive, & scalable diagnostic aid using cough sounds, addressing the limitations of traditional methods like chest X-rays in resource-limited settings. We will include ± std dev in our results (Accuracy: 77.58%±0.0051, AUC: 0.7156±0.0034) to clarify their significance. We sincerely thank all reviewers (R1, R2, R3) for their constructive feedback & recognition of our contributions. Reviewers highlighted our work’s strengths: R1 noted our architecture “uniquely integrates semantic & acoustic features,” establishing “a new standard for audio-based diagnostic systems.” R3 acknowledged our model achieves “state-of-the-art results” with notable accuracy & AUC improvements. R2 commended our paper as “popular,” “interesting,” & “easy to understand.”

Interpretability (R1): We agree that interpretability is vital for clinical trust. Gradient SHAP analysis reveals our model’s predictions align closely with clinically established TB pathophysiology, notably airway inflammation & mucus production (Chung et al., Lancet, 2008). Feature attribution peaks within the initial (0.15–0.25 sec) & mid-cough phases (0.25–0.35 sec), especially in the 5–10 kHz range, corresponding directly to known acoustic biomarkers of TB-related respiratory changes. These insights will be included in our revision.

Demographic Generalizability (R1,R3): We assessed demographic robustness across gender & age. Accuracy is balanced across genders (Male: 76.01%, Female: 80.14%). Age-wise, accuracy peaks in middle-aged groups (35-51: 80.4%, 52-68: 80.7%), but notably drops in elderly (69.7%).
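
A subgroup breakdown of this kind can be computed directly from per-sample predictions. The sketch below is illustrative only: the function name `accuracy_by_group` and the record layout (group label, true label, predicted label) are assumptions, and the group labels themselves (for example age bins) would have to be defined by the analyst.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for records of (group, y_true, y_pred)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    # Only groups that occurred at least once appear, so no division by zero.
    return {group: correct[group] / total[group] for group in total}
```

Reporting the number of samples per group alongside each accuracy helps readers judge how much weight a gap between subgroups can bear.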

Experimental Depth, Ablation, Significance (R2,R3): Responding to reviewers, additional experiments were conducted:

Ablation confirms our dual-stream approach (78.10%) surpasses single-stream Whisper or WavLM models (71–73%).

McNemar’s test comparing AcouSem-AFNet vs. SpecRNet yields p=0.0359, confirming statistically significant improvement.

Confusion matrix analysis shows high specificity (85.2%) but moderate sensitivity (67.5%), guiding future improvements to TB-positive sensitivity, particularly in older populations.
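
For readers who want to reproduce this style of analysis, here is a minimal sketch of an exact two-sided McNemar test and of sensitivity and specificity from confusion-matrix counts. The rebuttal does not say which McNemar variant (exact or chi-square) was used, so the exact binomial form is an assumption, as are the function names and the toy counts in the usage lines.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pair counts.

    b: samples model A classified correctly and model B incorrectly.
    c: samples model B classified correctly and model A incorrectly.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # One-sided binomial tail under H0 (p = 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Toy counts, purely hypothetical and not taken from the paper.
p_value = mcnemar_exact(b=10, c=0)
sens, spec = sensitivity_specificity(tp=3, fn=1, tn=4, fp=1)
```

McNemar's test uses only the samples on which the two models disagree, which is why it suits a paired comparison on a shared test set.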

Data Leakage & Reproducibility (R3): We confirm strict patient-level splits (70:10:20), eliminating leakage risks. Additionally, we commit to publicly releasing our code post-acceptance to facilitate reproducibility.
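
A patient-level split assigns every recording from a given patient to the same partition. The sketch below shows one way to do this with the standard library; it is not the authors' code, and the function name `patient_level_split`, the seed, and the rule that rounding leftovers go to the test partition are assumptions.

```python
import random


def patient_level_split(patient_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Return a 'train'/'val'/'test' label for each recording.

    patient_ids: one patient identifier per recording.
    The split is drawn over unique patients, so no patient can appear
    in more than one partition.
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    n_train = int(ratios[0] * len(unique))
    n_val = int(ratios[1] * len(unique))
    split_of = {}
    for i, pid in enumerate(unique):
        if i < n_train:
            split_of[pid] = "train"
        elif i < n_train + n_val:
            split_of[pid] = "val"
        else:
            split_of[pid] = "test"  # remaining patients, including rounding leftovers
    return [split_of[pid] for pid in patient_ids]
```

Because the ratios apply to patients rather than recordings, the recording-level proportions can deviate from 70:10:20 when patients contribute different numbers of coughs.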

Computational Feasibility (R1,R3): Addressing deployment concerns, our model utilizes frozen Whisper/WavLM encoders, with only 109M/211M parameters trainable. This reduces computational demands significantly (inference ~0.55 sec/100 samples). Our model is also deployable via lightweight frameworks (TensorFlow Lite, ONNX Runtime), confirming suitability for low-resource clinics.

Robustness to Noise & Hardware (R3): The CODA-TB dataset includes diverse countries/settings, demonstrating initial robustness. Yet, systematic noise & hardware variation analyses are important. Future research will explicitly investigate these via synthetic noise augmentation & cross-device performance evaluations.

Presentation & Figures (R2): Abstract revision will succinctly clarify TB diagnostic challenges & their relevance to broader audiences. Figs 1 & 2 will be updated to add more clarity.

Multi-modal Integration (R1,R3): We acknowledge reviewers’ suggestions on multi-modal integration. Our architecture’s modular design readily allows integration of clinical metadata, which previous studies (Nathavitharana et al.,2019) demonstrate as valuable for TB diagnostics. This important future extension will be explicitly outlined in the revised manuscript. In summary, reviewer insights have greatly improved the clarity, depth, & clinical relevance of our manuscript. Our revisions comprehensively address all concerns, substantially strengthening our paper’s scientific & clinical impact.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
There were mixed reviews, all open to further conversation in a rebuttal phase. I urge the authors to carefully incorporate the thoughtful comments of the reviewers. A couple of extra comments:
1. This is an enormously important health care issue. See e.g. John Green’s new book “Everything Is Tuberculosis”
2. This approach (vs e.g. x-ray), if it ever worked, would substantially expand diagnostic options for much of the world that is most at risk for TB.
3. What constitutes clinically-relevant performance (e.g. consulting WHO or various MoH guidelines), and in what way do these results move in that direction?
4. Is it possible to provide +/- std dev for comparative results? Without these, it is very hard to assess whether any differences in results are meaningful.
5. The paper appears well-grounded in both the medical and the ML literature.
6. References formatting: Capitals etc can be preserved by using braces, eg {GHTC}, in the .bib
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

This paper proposes AcouSem-AFNet, a dual-stream architecture integrating semantic and acoustic features for non-invasive tuberculosis detection from cough sounds. The approach is novel, clinically relevant, and achieves state-of-the-art performance on the diverse CODA-TB dataset. While reviews were mixed—one weak reject, one weak accept, and one accept—all acknowledged the potential impact and originality of the method. The rebuttal addresses concerns comprehensively, including new interpretability analyses, subgroup evaluations across age and gender, ablation and significance testing, and computational feasibility for deployment in low-resource settings. The authors also confirmed strict data splits and committed to public code release.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The paper receives mixed reviews. R3 raises valid points, such as limited novelty and a lack of evidence supporting the distinctiveness of TB-related cough features. Apart from this, the paper presents a reasonably designed method for an understudied clinical problem. I lean towards acceptance.

Non-Invasive TB Detection using Acoustic and Semantic Features from Cough Sounds

Author(s):