SPIN2026: No bad apple!

P57 — Session 1 (Monday 12 January 2026, 15:00-17:30)
Examining the benefit of spatial, voice, and visual cues for speech recognition in a competing-talkers paradigm

Hartmut Meister, Denise Gradtke
Faculty of Medicine and University Hospital Cologne, Department of Otorhinolaryngology, Head and Neck Surgery, University of Cologne, Germany

Pascale Sandmann
Department of Otolaryngology, Head and Neck Surgery, Carl-von-Ossietzky University of Oldenburg, Germany

Khaled Abdel-Latif
Faculty of Medicine and University Hospital Cologne, Department of Otorhinolaryngology, Head and Neck Surgery, University of Cologne, Germany

In situations with competing talkers, three main mechanisms are beneficial for separating the different auditory streams and extracting the speech of the intended talker: spatial cues based on interaural time and level differences help determine the position of the talkers, voice cues help distinguish between talkers, and visual cues enable identification of the talker and the extraction of speech features over and above those conveyed by the auditory channel. However, individuals with hearing loss, and particularly cochlear implant (CI) recipients, may be limited in their use of the auditory cues, but may have enhanced abilities to exploit visual speech cues.

This ongoing study aims to investigate the relative contribution of these cues in groups of typical-hearing (TH) listeners and CI users in terms of speech recognition and cognitive load. To this end, a female target talker uttering matrix sentences (name-verb-number-adjective-object) is masked with two competing talkers, whose voices are either the same as or different from the target's, and who are either spatially separated from the target by ±30° or colocated with it. In addition, these conditions are presented with and without additional visual speech cues, provided by a virtual character that enables the visualization of arbitrary speech materials. The target sentence always begins with the name “Stephen”, and the target-to-masker ratio is adaptively adjusted to achieve 75% correct word recognition. During stimulus presentation, gaze is tracked and pupil dilation is measured as a proxy for cognitive load.
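The abstract does not state which adaptive rule is used to converge on 75% correct word recognition. One common choice for non-standard target percentages is a weighted up-down staircase, in which the up and down step sizes are set in the ratio p : (1 − p); the sketch below is a minimal illustration of that idea, with step sizes and starting target-to-masker ratio (TMR) chosen purely for demonstration.

```python
# Hedged sketch of a weighted up-down staircase targeting 75% correct.
# The specific rule, step sizes, and starting TMR are illustrative
# assumptions, not the procedure used in the study described above.

def update_tmr(tmr, correct, step_down=1.0):
    """Return the next target-to-masker ratio (dB).

    For convergence at proportion p correct, the up step must equal
    step_down * p / (1 - p); with p = 0.75 the up step is 3x the down step.
    """
    step_up = step_down * 0.75 / 0.25  # = 3 * step_down for a 75% target
    return tmr - step_down if correct else tmr + step_up

# Toy run over a hypothetical sequence of trial outcomes.
tmr = 0.0  # starting TMR in dB (assumption)
track = [tmr]
for correct in [True, True, False, True, False, False, True]:
    tmr = update_tmr(tmr, correct)
    track.append(tmr)
```

Because the up step is three times the down step, the track drifts downward only while the listener scores above 75% correct, so it settles around the TMR giving 75% word recognition.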

It is hypothesized that the two study groups will exhibit different patterns: whereas the TH listeners primarily use auditory cues, CI recipients are expected to rely more heavily on visual cues and to derive particularly limited benefit from voice cues. The present contribution outlines the study rationale and design and presents preliminary results for both TH listeners and CI recipients.

Funding: Supported by Deutsche Forschungsgemeinschaft (DFG), reference ME 2751/4-1

Last modified 2025-11-21 16:50:42