In Herzogenrath, Hans W. Gierlich and other acoustic engineers at HEAD acoustics are researching communication technologies for mobile and home offices.
During an international video conference involving Angela Merkel in April, it became clear that the Chancellor is not immune to dealing with sound and image interference. “Can you hear me now?”, she asked when she encountered the initial audio problems. Later, an Arab politician was frozen on the screen for a long period of time, and Melinda Gates from the Gates Foundation couldn’t be heard at all. For Hans W. Gierlich, Managing Director at HEAD acoustics in Herzogenrath, these are problems that could probably be solved relatively easily. The acoustics engineer deals with topics such as speech and audio quality and works for many global corporations in the telecommunications sector.
BY GUIDO M. HARTMANN
Mr. Gierlich, what is usually the problem when it comes to video or phone conferences?
HANS W. GIERLICH: Anyone who has tried having an audio or video conference in recent days or weeks knows the problems. Jerky or distorted voices, or the familiar echo effect where I hear myself twice, and my colleague sounds like a robot. The space you’re in also plays a role here. Am I sitting in a reverberant room, in a packed study, or is street noise roaring in through a partially open window? Furthermore, we are faced with technical limitations that hinder natural interaction and communication.
These can include things like transmission delays. This makes it difficult to interrupt a participant, with the result that queries or interposed questions are often technically impossible – similar to the way walkie-talkie radios used to be.
The ideal solution would be signal processing and transmission with extremely low latency – that’s the time it takes for the speech signal to travel from the mouth of the talker to the ear of the receiver. This means that the time for the transfer between the talker and listener should be less than 150 milliseconds. Anything that takes longer makes human communication more difficult.
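To make the 150-millisecond figure concrete, here is a minimal sketch of a mouth-to-ear delay budget. Only the 150 ms target comes from the interview. The individual stages and their values are hypothetical, chosen purely for illustration, and are not measurements of any real device or platform.

```python
# One-way mouth-to-ear target named in the interview.
BUDGET_MS = 150

# Hypothetical delay contributions in milliseconds (illustrative values only).
delays_ms = {
    "sender audio capture and processing": 30,
    "speech codec framing": 20,
    "network transport": 50,
    "jitter buffer": 40,
    "receiver processing and playback": 20,
}

total_ms = sum(delays_ms.values())
headroom_ms = BUDGET_MS - total_ms      # negative means the budget is exceeded
within_budget = total_ms <= BUDGET_MS

for stage, ms in delays_ms.items():
    print(f"{stage:40s} {ms:4d} ms")
print(f"total {total_ms} ms, headroom {headroom_ms} ms, within budget: {within_budget}")
```

With these assumed numbers, every stage looks modest on its own, yet the sum already exceeds the target. This is why the whole chain has to be considered together.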
In addition to the quality of the individual components, the interaction between the devices as well as the network and the conference bridge are decisive. Technology to optimize all these elements adequately is available. In principle, it does not matter whether you’re using loudspeakers on a PC, headsets, smartphones, conference telephones or speakers with voice assistants. Or whether you’re using Skype, Microsoft Teams, Zoom or one of the other platforms for the video conference. Or, for that matter, whether you’re connected via Bluetooth, Wi-Fi or something else.
Many existing standards set minimum requirements that device manufacturers typically adhere to. But we are now seeing that meeting only these basic standards is insufficient, and the results are certainly not satisfying.
Tests that mimic these scenarios are difficult to perform under real conditions during development. But laboratory-based solutions are available. Highly developed test procedures and simulations for laboratory use are available at our facilities in Herzogenrath, at our subsidiaries in many countries, and also at our customers’ facilities. This applies to all conceivable types of equipment: cordless or wired headsets, PCs, tablets, all types of telephone, cell phones, conference systems, modems and interfaces such as Bluetooth or Wi-Fi. This means that typical conference scenarios with all their influences, as well as background noise and reverberation, can be simulated in the laboratory using state-of-the-art technology and then used in the optimization of products.
The solution to these problems could lie in better, more accurate measurement of devices such as mobile phones, Bluetooth devices and others. This is all feasible, but it does of course involve certain additional costs. While the testing effort is admittedly multiplied, the additional cost to the consumer in the case of mass-produced products such as headsets, cell phones or PCs would probably be less than one percent of the price of the device. However, the companies would have to be prepared to include these additional costs in the budget for the development and production process. Perhaps this will become an issue at a time when audio and video conferencing is increasingly being relied upon. We have the experts for this in-house at our company, but our customers can use this measurement technology directly in their own laboratories as well.
Speech Quality in Conferencing Scenarios
Understanding All Optimization Requirements
Newsletter The Audio Voice 283, Week 25
BY DR.-ING. HANS W. GIERLICH
On the one hand, we benefit from being able to communicate anytime and anywhere, using all kinds of communication platforms that let us interact in virtual conferences without being physically present. On the other hand, we experience significant variations in communication quality, especially in virtual meetings. The question is: What is communication quality, and which factors distinguish good communication quality from poor communication quality?
When defining communication quality, you will find a variety of aspects of speech quality that need to be considered from the user’s point of view: listening speech quality and associated artifacts, such as interrupted and distorted voice; the delay in transporting the speech signal from one end to the other, which hampers the conversational flow; echo and howling effects, especially in multi-party conferences; the inability to interrupt conversational partners; and more.
Additionally, the user’s environment and the user’s behavior play an important role. Adverse conditions, noise, and reverberation may impair the speech quality on all sides of a virtual meeting. User movements, non-optimum positioning of devices, and their effects on communication quality need to be understood. Perceptually, latency, speech intelligibility and listening effort, loudness, speech sound quality, talking effort and echo performance, double-talk performance, and localization performance are all relevant parameters deserving technical investigation.
These parameters are perceptually relevant and can be evaluated subjectively. However, for (objective) system test and product optimization, procedures are required to help engineers in a lab-type environment. The instrumental procedures used nowadays to qualify and optimize products are based on perceptual investigations and can be applied with a good degree of confidence by engineers in the lab.
Technically, the conversational quality is determined by the quality of every device used in a connection. Transparency and compatibility are key when devices are interconnected. The quality of the communication terminal, the network quality, and the quality of the conference bridge are of equal importance.
The minimum requirements that are found in existing standards and typically applied by device manufacturers are insufficient and generally do not cover this complex scenario. For optimization, a variety of additional technical parameters can be measured and used.
The basic optimization process starts with the endpoints of the connection, the terminals – regardless of which type of terminal is used: handset or headset (wired and cordless), hands-free, conference system, smart speaker, PC, tablet or other smart devices.
In this context, it is also necessary to take into account the user behavior as well as the user’s environment. Nobody can perform such tests in real-life situations, but fortunately there are laboratory-based solutions available.
Background noise simulation is important to validate the performance of noise cancellation and voice enhancement strategies for a variety of background noises found in daily life scenarios. ETSI TS 103 224 specifies the recording and setup procedures as well as pre-recorded background noise for those simulations. The HEAD acoustics simulation system 3PASS allows this simulation for all types of terminals with acoustical interfaces.
An identical setup can be used for the simulation of reverberation, which is of special importance in the sending direction. Reverberation may impair speech communication as well as speech recognition. A simulation method for lab-based testing with reverberation, according to ETSI TS 103 557, is used for that type of simulation.
User behavior can also be tested in different ways for different devices. A motorized handset positioner is used for testing positional robustness with all types of handheld devices. The efficiency of noise cancellation can be evaluated and optimized by simulating various user positions. The goal is to find a good balance between loudness, speech sound quality and the amount of background noise reduction in various positions and for various noise and room conditions. In a similar way, the speech sound quality and the listening effort can be optimized on the receiving side.
Testing the robustness of echo cancellation requires a variable echo path during the tests. An almost soundless rotating reflector can be used in the nearfield. The conversational performance of devices is determined by the delay introduced and the seamless behavior in all types of double-talk situations with and without background noise. During double-talk, no artifacts should be observed, whether from switching or from echo, and performance should not decrease with background noise. The signal processing quality in a terminal mostly determines all these parameters and can be optimized in the lab. A very good overview of testing and performance requirements can be found in the ETSI ES 202 737 to ES 202 740 standard series and in ETSI TS 102 925. Testing of headphones follows the same principle. Various positions typical for users are chosen for the tests as described above.
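Why a variable echo path stresses an echo canceller can be illustrated with a small simulation. The sketch below is not the processing of any particular device and not a HEAD acoustics test procedure. It uses a textbook NLMS adaptive filter as a stand-in canceller, changes the echo path digitally rather than with a physical reflector, and reports echo return loss enhancement (ERLE), a commonly used echo metric. The echo paths, filter length, step size, and signal are all hypothetical values chosen for illustration.

```python
import math
import random

def simulate_mic(far_end, path_a, path_b, switch_at):
    """Echo picked up by the microphone; the echo path changes at sample switch_at."""
    mic = []
    for n in range(len(far_end)):
        path = path_a if n < switch_at else path_b
        echo = sum(h * far_end[n - k] for k, h in enumerate(path) if n - k >= 0)
        mic.append(echo)
    return mic

def nlms_residual(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Textbook NLMS adaptive filter; returns the residual echo after cancellation."""
    w = [0.0] * taps        # current estimate of the echo path
    x_buf = [0.0] * taps    # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        estimate = sum(wi * xi for wi, xi in zip(w, x_buf))
        e = d - estimate
        norm = sum(xi * xi for xi in x_buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
        residual.append(e)
    return residual

def erle_db(mic, residual, start, stop):
    """Echo return loss enhancement over samples [start, stop), in dB."""
    p_mic = sum(v * v for v in mic[start:stop])
    p_res = sum(v * v for v in residual[start:stop]) + 1e-12
    return 10.0 * math.log10(p_mic / p_res)

rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]  # noise-like far-end signal
path_a = [0.5, 0.3, -0.2, 0.1]    # hypothetical echo path before the change
path_b = [-0.4, 0.2, 0.3, -0.1]   # hypothetical echo path after the change
mic = simulate_mic(far_end, path_a, path_b, switch_at=2000)
residual = nlms_residual(far_end, mic)

erle_converged = erle_db(mic, residual, 1500, 2000)
erle_after_change = erle_db(mic, residual, 2000, 2020)
erle_reconverged = erle_db(mic, residual, 3500, 4000)
print(f"before change: {erle_converged:.1f} dB")
print(f"just after change: {erle_after_change:.1f} dB")
print(f"after re-adaptation: {erle_reconverged:.1f} dB")
```

In this idealized, noise-free setting the canceller suppresses the echo almost completely once adapted. It loses that suppression at the moment the path changes and has to re-adapt. A test with a fixed echo path would never expose that transient.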
In speakerphone mode (e.g., with hands-free, laptop, tablet and handheld devices) and with conference devices, the impact of user behavior and the acoustical environment is even more critical. The acoustical coupling between loudspeaker(s) and microphone(s) is strong, leading to potential echo problems. User movements affect echo canceling performance more than in handheld devices. The distance between the talker and the microphone is much greater, and as a consequence, room noise and room reverberation have a much stronger impact on the performance than in handheld mode.
However, testing strategies similar to those for handheld mode can be used: simulating reverberation and background noise (ETSI TS 103 224, ETSI TS 103 557), simulating user movement with a rotating reflector, and rotating the device under test on a turntable. The metrics used for qualification are the same as in handheld mode, and the parameters determining the quality from the user’s point of view are almost the same.
Speech coding performance in conjunction with network performance is essential for all types of endpoints including modems. Since most conferencing solutions are over-the-top (OTT) solutions, no bandwidth guarantee is given and channel quality may vary depending on load conditions and routing. Lab-based testing is performed using different types of network jitter and loss simulations applied to the speech codec. The parameters tested are listening speech quality and jitter buffer handling. With jitter and packet loss, it is key to limit the increase of the jitter buffer size while simultaneously keeping the speech quality high. This can be achieved by optimizing the jitter buffer handling and the packet loss concealment, which today is typically part of the speech codec used.
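The trade-off between jitter buffer size and speech quality can be sketched with a toy simulation. The code below is a simplified illustration under stated assumptions, not a lab test procedure: it assumes a fixed (non-adaptive) jitter buffer, uniformly distributed network jitter, and random network loss. All function names and numeric values are hypothetical.

```python
import random

def simulate_network_delays(n_packets, base_ms=40.0, jitter_ms=30.0,
                            loss_prob=0.02, seed=1):
    """Per-packet one-way delay in ms, or None for a packet lost in the network."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_packets):
        if rng.random() < loss_prob:
            delays.append(None)
        else:
            delays.append(base_ms + rng.uniform(0.0, jitter_ms))
    return delays

def missed_ratio(delays, buffer_ms, base_ms=40.0):
    """Fraction of packets that are lost or arrive after their playout deadline.

    With a fixed jitter buffer, each packet must arrive within
    base_ms + buffer_ms of being sent, otherwise it has to be concealed.
    """
    deadline_ms = base_ms + buffer_ms
    missed = sum(1 for d in delays if d is None or d > deadline_ms)
    return missed / len(delays)

delays = simulate_network_delays(5000)
results = {buf: missed_ratio(delays, buf) for buf in (0, 10, 20, 40)}
for buf, ratio in results.items():
    print(f"jitter buffer {buf:2d} ms -> {ratio:.1%} of packets need concealment")
```

A larger buffer leaves fewer packets for the concealment algorithm to cover, but every millisecond of buffer adds directly to the end-to-end delay. This is why the buffer handling and the concealment have to be optimized together.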
All the procedures and setups described above have been key to HEAD acoustics’ development work over the last few decades. Complete sets of automated test procedures and simulations are available for laboratory use, for all types of devices and configurations. In addition, setups for the optimization of the complete transmission chain are available. There is great potential for quality improvement if such advanced optimization strategies are used.
HEAD acoustics provides complete and automated test systems to bring the user’s environment to the lab. The most advanced tests and simulation methods are available and can be used for device and system optimization. When successfully applied, they can deliver a much better user experience for all systems than is often found today.
ETSI TS 103 224: Speech and Multimedia Transmission Quality (STQ): A sound field reproduction method for terminal testing including a background noise database, ETSI 08/2019
ETSI TS 103 557: Speech and Multimedia Transmission Quality (STQ): Methods for reproducing reverberation for communication device measurements, ETSI 08/2019
Gierlich, H.W.; Reimes, J.: Objective Listening Effort Evaluation, audioXpress 04/2020
ETSI ES 202 737 – ETSI ES 202 740 series: Speech and Multimedia Transmission Quality (STQ): Transmission requirements for narrowband and wideband VoIP terminals (handset, headset & hands-free) from a QoS perspective as perceived by the user, ETSI 03/2020
ETSI TS 102 925: Transmission requirements for Super-Wideband/Fullband hands-free and conferencing terminals from a QoS perspective as perceived by the user, ETSI 10/2018
audioXpress November 2020
BY DR.-ING. HANS W. GIERLICH
With most company employees working from home and the trend toward decentralized organizations and remote collaborations, webinar and conferencing services are the new way of communication. Once these communication solutions are adopted, we all have the same problems: The speech transmission quality varies. Sometimes it is so bad that good communication is not possible. But the reasons for this are complex. This article addresses how to evaluate and determine strategies for these challenges.