SPIN2026: No bad apple!

P07, Session 1 (Monday 12 January 2026, 15:00-17:30)
Automated question and answering based speech-in-noise test using large language models

Mohsen Fatehifar, Kevin Munro, Michael Stone
University of Manchester, UK

David Wong
Leeds Institute of Health Sciences, University of Leeds, UK

Tim Cootes
University of Manchester, UK

Josef Schlittenlacher
Department of Speech, Hearing and Phonetic Sciences, University College London, UK

The aim of this study was to develop and evaluate a self-administered, conversational speech-in-noise (SIN) test using machine learning, and to compare its validity (detecting hearing loss) and reliability (consistency across two runs) with those of a standard clinical test. Two tests were used. The first was the new Conversational Hearing Assessment Test (CHAT), designed to evaluate hearing by measuring participants’ ability to understand and respond to simple questions. In total, 500 statements and 500 corresponding questions were generated using ChatGPT and converted to speech with a text-to-speech (TTS) model. Each statement–question pair was mixed with background noise and presented to the participants, who were then asked to answer the question. Their spoken responses were transcribed with automatic speech recognition (ASR), and ChatGPT determined whether the answers were correct. The second test was the Adaptive Sentence List (ASL) test, a clinically established test in which sentences were mixed with noise and participants repeated each presented sentence aloud; the researcher then scored whether it had been repeated correctly. Both tests measured speech reception thresholds (SRTs) using an adaptive up–down procedure. Reliability was assessed with Bland–Altman analyses, and the limits of agreement between two runs of the same test were reported. Validity was evaluated using the area under the ROC curve (AUC) and the Youden index to classify participants as normal hearing or hearing impaired. Forty native speakers participated: 20 with normal hearing and 20 with hearing impairment.
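As a rough illustration of how an adaptive up–down procedure converges on a speech reception threshold, the sketch below simulates a simple 1-up/1-down staircase. The starting SNR, step size, trial count, and stopping rule are illustrative assumptions, not the parameters used in the study.

```python
# Minimal sketch of a 1-up/1-down adaptive staircase for estimating a
# speech reception threshold (SRT). All parameters are illustrative
# assumptions, not the study's actual settings.
def run_staircase(respond_correctly, start_snr=0.0, step_db=2.0, n_trials=20):
    """respond_correctly(snr) -> bool: did the listener answer correctly?

    Returns an SRT estimate in dB SNR (the mean SNR at reversals,
    which a 1-up/1-down rule tracks at ~50% correct).
    """
    snr = start_snr
    reversal_snrs = []
    last_direction = 0  # +1 = SNR went up, -1 = SNR went down
    for _ in range(n_trials):
        correct = respond_correctly(snr)
        # Make the task harder after a correct answer, easier after an error.
        direction = -1 if correct else +1
        if last_direction and direction != last_direction:
            reversal_snrs.append(snr)  # track direction changes (reversals)
        last_direction = direction
        snr += direction * step_db
    return sum(reversal_snrs) / len(reversal_snrs) if reversal_snrs else snr
```

With a deterministic simulated listener whose true threshold is, say, -5 dB SNR, the track oscillates around that value and the reversal mean lands close to it.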

For reliability, two runs of the ASL test had limits of agreement of ±3.1 dB, whereas CHAT had limits of agreement of ±3.6 dB. In terms of validity, the ASL test reached an AUC of 0.93 and a Youden index of 0.71, while CHAT had an AUC of 0.97 and a Youden index of 0.89. These findings indicate that CHAT achieved validity and reliability comparable to the ASL test, demonstrating its potential as an ecologically valid, accurate and reliable screening tool for hearing loss. Moreover, because CHAT operates without the need for a human supervisor, it can be adapted for online delivery, allowing participants to complete the test remotely in their own homes.
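For context, the two outcome measures reported above are straightforward to compute: Bland–Altman 95% limits of agreement are the mean difference between paired runs plus or minus 1.96 standard deviations of the differences, and the Youden index is sensitivity + specificity - 1 at a chosen cut-off. The sketch below uses made-up numbers and is not the study's analysis code.

```python
from statistics import mean, stdev

def limits_of_agreement(run1, run2):
    """Bland-Altman 95% limits of agreement between paired measurements."""
    diffs = [a - b for a, b in zip(run1, run2)]
    bias = mean(diffs)                # systematic offset between the two runs
    half_width = 1.96 * stdev(diffs)  # 95% limits: bias +/- 1.96 SD of diffs
    return bias - half_width, bias + half_width

def youden_index(sensitivity, specificity):
    """J = sensitivity + specificity - 1 (1 = perfect test, 0 = chance)."""
    return sensitivity + specificity - 1.0

# Hypothetical SRTs (dB SNR) from two runs of the same test on 5 listeners
run1 = [-2.0, 1.5, -4.0, 3.0, 0.5]
run2 = [-1.0, 2.5, -4.5, 2.0, 1.5]
low, high = limits_of_agreement(run1, run2)
```

A narrower interval between `low` and `high` indicates better test–retest agreement; a Youden index closer to 1 indicates better separation of normal-hearing and hearing-impaired listeners.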
