3-D Audio Primer
This document presents an introduction to the general concepts and performance of three-dimensional audio technology. Several audio technology categories are defined with the purpose of creating a common understanding of "better-than-stereo" audio playback methods.
Since the late 1970s, several audio technologies have been developed to advance the state of the art in audio reproduction beyond stereo. Most of them focus on increasing the dimensionality of sound playback beyond the one-dimensional stereo sound field created by conventional playback on a left/right speaker pair. Furthermore, the advent of digital audio signal processing has enabled interactive audio experiences: similar to live music, sounds are created on-the-fly based on user input (for example in video games), rather than being based on playback of a pre-recorded soundtrack (as in movies). A3D from Aureal is a digital audio technology that has been developed to provide maximum performance in both areas of dimensionality and interactivity.
A3D technology is based on the principles of binaural human hearing. Binaural means that we hear using two ears. From the two signals that our ears perceive, we can extract enough information to tell where a sound is located in the three-dimensional space around us. The functioning of the human hearing system has been studied extensively over the last two decades by psycho-acoustic researchers around the world, and their findings are what today’s A3D audio systems are based on. To put it in simpler terms: since we can hear three-dimensionally in the real world using just two ears, it must be possible to achieve the same effect from just two speakers or a set of headphones. On this basic assumption, 3D audio products have been successfully built.
This document starts by explaining how different forms of audio processing compare against each other ("What is and What isn’t 3D Audio"). It then focuses on the concepts of acoustics and human hearing that A3D is based on, and details the digital audio building blocks that make up an A3D system.
2. WHAT IS AND WHAT ISN'T 3-D AUDIO
As mentioned in the introduction, there are two key pieces to a 3D audio system: 3D positioning and interactivity. A full-featured 3D audio system provides the ability:
Certain technologies, namely stereo extension and surround sound, offer some aspects of 3D positioning or interactivity. They are discussed here to explain what applications they are geared towards, and why they are not considered to be part of a new category of technologies, called Positional 3D Audio. This new category combines full 3D positioning and interactivity to offer a new kind of audio listening experience. A3D is the industry leading positional 3D audio technology. A comparison chart of different audio playback methods is included to help differentiate the features of each technology.
Extended stereo technologies and products process an existing stereo (two-channel) soundtrack to add spaciousness and to make it appear to originate from outside the left/right speaker locations. These products are particularly useful for restoring stereo performance on low-end PC multimedia sound systems, which typically have low-quality speakers placed very close together.
Extended stereo effects can be achieved via various, fairly straightforward methods. Additionally, their performance is often evaluated based on subjective criteria such as listening tests. For those reasons it is somewhat difficult to compare products in this area. Some of the differentiators include:
Although sometimes marketed under the name "3D Sound" or "3D stereo", extended stereo technologies are not considered to be 3D audio technologies, because they only offer passive spreading of an existing soundtrack, and not interactive 3D positioning of individual sounds.
2.2 Surround Sound
Surround sound covers technologies and products that create a larger-than-stereo sound stage by playing back multi-channel Dolby® or MPEG surround sound soundtracks on multi-speaker setups. Surround sound is based on using audio compression technology (for example Dolby ProLogic® or Digital AC-3®) to encode and deliver a multi-channel soundtrack, and audio decompression technology to decode the soundtrack for delivery on a surround sound 5-speaker setup. Additionally, virtual surround sound systems use 3D audio technology to create the illusion of five speakers from a regular pair of stereo speakers, enabling a surround sound listening experience without the need for a five-speaker setup. Aureal's A3D Surround is a Virtual Surround technology.
Because they are pre-recorded, surround sound soundtracks are most suitable for movies. They are non-interactive, and therefore not particularly useful in interactive software such as video games and websites.
Because of their limitations when it comes to interactivity, surround sound systems are not considered for the interactive 3D audio category. Ways to evaluate the performance of a surround sound system:
2.3 Positional 3D Audio (A3D Interactive)
Positional 3D audio (a.k.a. interactive 3D audio) allows for interactive, on-the-fly positioning of sounds anywhere in the three-dimensional space surrounding a listener. Support for such technologies can be incorporated into software titles such as video games to create a natural, immersive, and interactive audio environment that closely approximates a real-life listening experience. This category can be described as the audio equivalent of 3D graphics. Aureal’s A3D Interactive is a positional 3D audio technology.
3D audio technologies create a more life-like listening experience by replicating the 3D audio cues that the ears hear in the real world. The following two sections, "The Basics of Acoustics" and "The Basics of Human Hearing", explain what those listening cues are and how they can be reproduced.
For maximum flexibility and usability, a 3D audio algorithm should support all possible audio playback environments: headphones, stereo speakers and multi-speaker (surround or quad) arrays. In the case of stereo speakers or headphones, more demands are placed on the algorithm and fewer on the end-user, because stereo setups are the most common and the easiest to set up. Multi-speaker arrays require less complex 3D audio rendering algorithms, but put more demands on the end-user’s playback setup (cost and setup complexity of extra amplifiers and speakers). In both cases, the desired 3D effects are controlled by software applications which position 3D sound sources and listeners via an API (Application Programming Interface) such as Microsoft’s DirectSound3D API for the Windows® platform, or the VRML 2.0 standard.
Ways to evaluate the performance of a 3D interactive sound system:
Table: A Comparison of Audio Playback Methods
2.4 Headphone Versus Stereo Speaker Playback Devices
In terms of 3D sound processing, these two playback media offer different challenges and advantages. Headphones have the advantage of always being in a known position with respect to the listener’s ears. This means that two separate audio signals (left and right) are guaranteed to go directly into the two ears of a listener. With speakers, this is only the case if the listener is sitting in the ideal listening position (the sweet spot) and processing methods are employed to ensure that the left ear does not receive any audio content from the right speaker, and vice versa (cross-talk cancellation).
3. THE BASICS OF ACOUSTICS
Human beings extract a lot of information about their environment using their ears. In order to understand what information can be retrieved from sound, and how exactly it is done, we need to look at how sounds are perceived in the real world. To do so, it is useful to break the acoustics of a real world environment into three components: the sound source, the acoustic environment, and the listener:
Figure 1 - Typical soundfield with a source, environment and listener.
4. THE BASICS OF HUMAN HEARING
As explained above, people can be considered sound receiving objects in an environment. We have an auditory sensing system consisting of two ears and a brain. Additionally, very low frequency sounds can be sensed through the human body. The brain uses a number of cues that are embedded in the two sound signals it receives from the two ears to learn about the sounds and their environment. Most people are unaware that the effects described in the following sections greatly impact our continuous perception of reality, every day of our lives. On the other hand, some people, for example non-sighted people, are very much aware of these effects, because they rely heavily on their ears for querying and navigating their surroundings.
The two primary localization cues are called interaural intensity difference (IID) and interaural time difference (ITD). IID refers to the fact that a sound is louder at the ear that it is closer to, because the sound’s intensity at that ear will be higher than the intensity at the other ear, which is not only further away, but usually receives a signal that has been shadowed by the listener’s head (see fig. 2). ITD means that a sound will arrive earlier at one ear than the other (unless it is located at exactly the same distance from each ear - for example directly in front). If it arrives at the left ear first, the brain knows that the sound is somewhere to the left (see fig. 3).
Figure 2 - Illustration of IID.
Figure 3 - Illustration of ITD.
The combination of these two cues allows the brain to narrow the position of an individual sound source to somewhere on a cone centered on the line drawn between the listener’s ears (see fig. 4).
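The ITD cue, and the reason it leaves a cone of possible positions rather than a single one, can be illustrated with a few lines of arithmetic. The following is a minimal sketch under assumptions that are not part of this primer: the ears are treated as two points with no head between them, the source is far away, the ears are 15 centimeters apart, and sound travels at 330 meters per second. The function name `itd_seconds` is made up for this illustration.

```python
import math

EAR_SPACING_M = 0.15      # assumed distance between the two ears
SPEED_OF_SOUND = 330.0    # assumed speed of sound in meters per second

def itd_seconds(azimuth_deg):
    """Interaural time difference for a distant source (simplified model).

    azimuth_deg: 0 = straight ahead, 90 = directly right, 180 = behind.
    A positive result means the sound reaches the right ear first.
    """
    # Extra distance the wavefront travels to reach the far ear.
    extra_path = EAR_SPACING_M * math.sin(math.radians(azimuth_deg))
    return extra_path / SPEED_OF_SOUND

front_right = itd_seconds(30)    # source ahead and to the right
back_right = itd_seconds(150)    # source behind and to the right
print(f"30 degrees:  {front_right * 1e6:.0f} microseconds")
print(f"150 degrees: {back_right * 1e6:.0f} microseconds")
```

In this simplified model a source 30 degrees to the front right and one 150 degrees to the back right produce the same ITD (roughly 227 microseconds), so the time difference alone cannot separate them. That is the ambiguity the cone describes.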
Figure 4 - ITD Cone.
4.2 The Outer Ear Structure - Pinna
Before a sound wave gets to the ear drum, it passes through the outer ear structure, called the pinna. The pinna accentuates or suppresses mid- and high-frequency energy (see fig. 5) of a sound wave to various degrees, depending on the angle at which the sound wave hits the pinna (see fig. 6). This means that the two pinnae act as variable filters that affect every sound that passes through them. The brain can work out the location of a sound in space because the signal it receives has been filtered in a way that is unique to the sound source’s position relative to the listener.
Figure 5 - Spectrum differences between original and pinna.
Figure 6 - Pinnae frequency modulation sound source and pinna reception at varying elevations.
The pinnae are the key to accurately localizing sounds in space. However, since the outer ear and its folds are on the scale of a few centimeters, only sound waves with wavelengths in the centimeter range or smaller can be affected by the pinna. In addition, the two ears are about 15 centimeters apart, so even IID and ITD cues are greatly reduced for wavelengths bigger than that. For example, a 3.3 kHz sound signal oscillates 3300 times per second, while sound travels at about 330 meters per second. The wavelength is therefore about 330/3300 = 0.1 meters, or 10 centimeters. This means that a sound at 3300 Hz lies in the area where primary cues are still noticeable, but pinna cues start to be diminished. In general, the higher the frequency of a sound, the shorter its wavelength, and the better it can be localized. This phenomenon can be verified by placing two speakers, a sub-woofer and a high-frequency tweeter, in a room and playing music through them. With closed eyes you will be able to tell immediately where the tweeter is located; the sub-woofer, however, will sound like it is "coming from everywhere".
4.3 Propagation Effects, Range Cues, and Reflections
Many things happen to a sound as it travels through an environment before it is received by a listener. All of these effects allow us to learn more about what we are hearing and what kind of environment we are in:
Figure 7 - Source attenuation and absorption due to range (listener-source distance).
Figure 8 - Direct path, first and second order reflections in a typical room.
5. HOW A3D WORKS
A 3D audio system aims to digitally reproduce a realistic sound field. To achieve the desired effect a system needs to be able to re-create portions or all of the listening cues discussed in the previous chapter: IID, ITD, outer ear effects, and so on. A typical first step to building such a system is to capture the listening cues by analyzing what happens to a single sound as it arrives at a listener from different angles. Once captured, the cues are synthesized in a computer simulation for verification.
The majority of 3D audio technologies are at some level based on the concept of HRTFs, or Head-Related Transfer Functions. An HRTF can be thought of as a set of two audio filters (one for each ear) that contains all the listening cues that are applied to a sound as it travels from the sound’s origin (its source, or position in space), through the environment, and arrives at the listener’s ear drums. The filters change depending on the direction from which the sound arrives at the listener. The level of HRTF complexity necessary to create the illusion of realistic 3D hearing is subject to considerable discussion and varies greatly across technologies.
The most common method of measuring the HRTF of an individual is to place tiny probe microphones inside a listener’s left and right ear canals, place a speaker at a known location relative to the listener, play a known signal through that speaker, and record the microphone signals. Comparing the recorded signals with the original signal yields the impulse response for that speaker position, which is a single filter in the HRTF set (see fig. 9). After moving the speaker to a new location, the process is repeated until an entire, spherical map of filter sets has been built up.
Figure 9 - Combining speaker output and microphone input to compute impulse response.
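The comparison of played and recorded signals can be thought of as a deconvolution: the recording is the test signal convolved with the unknown impulse response, and the task is to undo that convolution. The sketch below is only an illustration under idealized assumptions that are not stated in this primer: the recording is noise-free and the test signal starts with a non-zero sample. Real measurement setups use more robust methods. The function names and sample values are invented for the example.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def estimate_impulse_response(played, recorded, length):
    """Recover `length` filter taps h such that recorded = played * h.

    Solves the convolution equations one output sample at a time. This
    needs played[0] != 0 and a noise-free recording (both assumptions).
    """
    if played[0] == 0:
        raise ValueError("test signal must start with a non-zero sample")
    h = []
    for n in range(length):
        acc = recorded[n]
        # Subtract the contribution of the taps that are already known.
        for k in range(1, min(n, len(played) - 1) + 1):
            acc -= played[k] * h[n - k]
        h.append(acc / played[0])
    return h

# Simulate one measurement: a made-up "ear" response filters the test signal.
played = [1.0, 0.5, -0.25, 0.125]
true_response = [0.0, 0.8, 0.3, -0.1]
recorded = convolve(played, true_response)

estimate = estimate_impulse_response(played, recorded, len(true_response))
print(estimate)
```

Repeating this for the left and right microphone signals would give the filter pair for one speaker position.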
HRTF synthesis
Once an HRTF set has been measured, real-time DSP (digital signal processing) software and algorithms are designed. This software has to be able to pick out the critical (psycho-acoustically relevant) features of a filter and apply them in real-time to an incoming audio signal to spatialize it. The system works correctly if a listener cannot tell the difference between a sound played over the speaker used in the analysis process above (with the speaker in a specific position) and the same sound played back by a computer and filtered by the HRTF impulse response corresponding to the original speaker location (see fig. 10).
Figure 10 - Applying the impulse response synthetically to create the illusion of a virtual speaker.
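At its simplest, applying an HRTF to a sound means filtering the same mono signal twice, once with the left-ear impulse response and once with the right-ear impulse response. The sketch below shows only that basic step. The two impulse responses are toy values invented for illustration (a source on the listener's left, so the right ear hears the sound later and quieter); they are not measured HRTF data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono signal into a (left, right) pair of ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy impulse responses (assumed, not measured): the right ear receives the
# sound two samples later and at half the level of the left ear.
HRIR_LEFT = [1.0, 0.0, 0.0]
HRIR_RIGHT = [0.0, 0.0, 0.5]

click = [1.0, 0.0, 0.0, 0.0]
left, right = spatialize(click, HRIR_LEFT, HRIR_RIGHT)
print("left: ", left)
print("right:", right)
```

Even with these crude filters the output carries both primary cues: the right channel is delayed (ITD) and attenuated (IID) relative to the left. A measured HRTF pair would add the direction-dependent spectral shaping of the pinnae.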
HRTFs can be used with great effectiveness in all audio playback configurations: headphones, stereo speakers, or multi-speaker arrays. On headphones, HRTF output is sent directly to the user’s ears. On stereo or multi-speaker setups, an additional audio processing step called cross-talk cancellation is employed to ensure proper signal separation between left and right ears.
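The idea behind cross-talk cancellation can be sketched for a single frequency. Assume, purely for illustration, a symmetric setup in which each ear hears its own speaker directly and the opposite speaker through a leakage path that is attenuated and slightly delayed. The gain and delay values below are made-up numbers, and this textbook-style inversion is not a description of how A3D implements the step. The canceller pre-mixes the two speaker feeds so that the leakage terms cancel at the ears.

```python
import cmath
import math

def crosstalk_path(freq_hz, gain=0.7, delay_s=0.0002):
    """Assumed transfer function from a speaker to the opposite ear,
    relative to the same-side path: attenuated and slightly delayed."""
    return gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)

def cancel_crosstalk(left_ear, right_ear, freq_hz):
    """Speaker feeds (one frequency bin) that deliver the wanted ear signals.

    Inverts the 2x2 system  ear_L = spk_L + c * spk_R,
                            ear_R = spk_R + c * spk_L.
    """
    c = crosstalk_path(freq_hz)
    det = 1 - c * c          # non-zero because the leakage gain is below 1
    left_speaker = (left_ear - c * right_ear) / det
    right_speaker = (right_ear - c * left_ear) / det
    return left_speaker, right_speaker

def ears_from_speakers(left_speaker, right_speaker, freq_hz):
    """What the ears receive: each ear hears its own speaker plus leakage."""
    c = crosstalk_path(freq_hz)
    return left_speaker + c * right_speaker, right_speaker + c * left_speaker

# Ask for a signal in the left ear only and check what actually arrives.
for freq in (200.0, 1000.0, 5000.0):
    spk_l, spk_r = cancel_crosstalk(1 + 0j, 0j, freq)
    ear_l, ear_r = ears_from_speakers(spk_l, spk_r, freq)
    print(f"{freq:6.0f} Hz  left ear {abs(ear_l):.3f}  right ear {abs(ear_r):.3f}")
```

The inversion only holds while the leakage path matches the assumed model, which is why the effect depends on the listener staying in the sweet spot.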
5.2 Aureal Wavetracing (A3D)
Once HRTFs have been captured and can be rendered, a sound can be made to appear from any 3D location. To compute and render the additional effects that the 3D environment can have on a sound, A3D employs proprietary Wavetracing algorithms. Among other features, the addition of Wavetracing technology distinguishes A3D 2.0 systems from A3D systems. Developed over many years in conjunction with clients such as NASA, Matsushita and Disney, Aureal’s Wavetracing technology parses the geometry description of a 3D space to trace sound waves in real-time as they are reflected and occluded by passive acoustic objects in the 3D environment. With Wavetracing, sounds can not only be heard as emanating from a position in 3D space, but also as they reflect off walls, leak through doors from the next room, get occluded as they disappear around a corner, or suddenly appear overhead as you step into the open from a room. Reflections are rendered as individually imaged early reflections and as reverb late field reflections. Acoustic space geometries and wall surface materials are specified via the A3D 2.0 API (Application Programming Interface). The result is the final step towards true audio rendering realism: the combination of 3D positioning, room and environment acoustics and proper signal presentation to the user’s ears.
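Wavetracing itself is proprietary and its internals are not described here. To give a feel for what tracing a single early reflection involves, the sketch below uses the generic textbook image-source method instead: the source is mirrored in a wall, and the straight line from that mirror image to the listener has the same length as the bounced path. The 2D room layout, the 330 meters per second speed of sound, and the function name are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 330.0  # meters per second, assumed

def first_order_reflection(source, listener, wall_x):
    """Path lengths for the direct sound and one bounce off the wall x = wall_x.

    source, listener: (x, y) positions in meters, both on the same side of
    the wall. Returns (direct_m, reflected_m).
    """
    # Mirror the source in the wall; the image lies behind the wall.
    image = (2 * wall_x - source[0], source[1])
    direct = math.hypot(listener[0] - source[0], listener[1] - source[1])
    reflected = math.hypot(listener[0] - image[0], listener[1] - image[1])
    return direct, reflected

direct, reflected = first_order_reflection((0.0, 0.0), (0.0, 8.0), wall_x=3.0)
extra_delay_ms = (reflected - direct) / SPEED_OF_SOUND * 1000
print(f"direct path:    {direct:.1f} m")
print(f"reflected path: {reflected:.1f} m")
print(f"reflection arrives {extra_delay_ms:.1f} ms after the direct sound")
```

In this layout the direct path is 8 meters and the reflected path is 10 meters, so the reflection arrives about 6 milliseconds after the direct sound. A renderer would also position the reflection at the direction of the image source and attenuate it according to the wall material.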
5.3 The A3D API
The A3D API (Application Programming Interface) delivers A3D into the hands of the software content developer. It allows games, 3D Internet browsers, and other 3D software applications to harness the full power of A3D. The API allows the application developer to do the following:
Audio-Visual Synergy
The eyes and ears often perceive an event at the same time. Seeing a door close, and hearing a shutting sound, are interpreted as one event if they happen synchronously. If we see a door shut without a sound, or we see a door shut in front of us, and hear a shutting sound to the left, we get alarmed and confused. In another scenario, we might hear a voice in front of us, and see a hallway with a corner; the combination of audio and visual cues allows us to figure out that a person might be standing around the corner. Together, synchronized 3D audio and 3D visual cues provide a very strong immersion experience. Both 3D audio and 3D graphics systems can be greatly enhanced by such synchronization.
Head Movement and Audio
Audio cues change dramatically when a listener tilts or rotates his or her head. For example, quickly turning the head 90 degrees to look to the side is the equivalent of a sound traveling from the listener’s side to the front in a split second. We often use head motion to track sounds or to search for them. The ears alert the brain about an event outside of the area that the eyes are currently focused on, and we automatically turn to redirect our attention. Additionally, we use head motion to resolve ambiguities: a faint, low sound could be either in front of or behind us, so we quickly and subconsciously turn our head a small fraction to the left; if the sound is now off to the right, it is in the front, otherwise it is in the back. One of the reasons why interactive audio is more realistic than pre-recorded audio (soundtracks) is the fact that the listener’s head motion can be properly simulated in an interactive system (using inputs from a joystick, mouse, or head-tracking system).
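In an interactive system, simulating head motion comes down to recomputing each source's direction relative to the listener's head whenever the head orientation changes. The following minimal sketch handles rotation about the vertical axis only. The angle convention (degrees clockwise from a fixed reference direction, seen from above) and the function name are assumptions made for this illustration.

```python
def relative_azimuth(source_bearing_deg, head_yaw_deg):
    """Direction of a source relative to the listener's nose, in degrees.

    Both inputs are measured clockwise from the same fixed reference
    direction when seen from above. The result lies in [-180, 180):
    0 = straight ahead, positive = to the right, negative = to the left.
    """
    return (source_bearing_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# A sound at the listener's right side moves to the front when the head
# turns 90 degrees to the right.
print(relative_azimuth(90.0, 0.0), "->", relative_azimuth(90.0, 90.0))

# Front/back disambiguation: turn the head 10 degrees to the left.
print("source in front:", relative_azimuth(0.0, -10.0))
print("source behind:  ", relative_azimuth(180.0, -10.0))
```

After the small turn to the left, the source in front appears to the right (positive angle) while the source behind appears to the left (negative angle), which matches the disambiguation described above.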
5.4 The Vortex A3D Silicon Engines
Aureal has developed a line of PCI-bus based digital audio chips called Vortex. These chips, among many other features, contain silicon implementations of A3D algorithms, including HRTF and Wavetracing rendering engines. Vortex is a no-compromise PCI audio chip architecture. It takes true advantage of the PCI bus by streaming dozens of audio sources to on-board audio processing engines: A3D, DirectSound, Wavetable synthesis, legacy audio, multi-channel mixers, sample rate converters, etc. Vortex delivers highest quality A3D capabilities for sound cards and PC motherboards at maximum price/performance points.