Recently, I was tasked with picking an AAC audio codec library for one of our products. There were several libraries I had to evaluate, and I needed some quantitative metrics for doing the comparison. I’m not what professionals call an “expert listener”, so I had to do the best with what I had. While creating my test plan, I noticed that more people seemed interested in how I was doing testing rather than the actual results. So I decided to share my approach to audio codec testing.
Note: This is intended to be a pragmatic guide for engineers evaluating codecs. It is not a comprehensive treatment of the subject. The goal is to give readers a solid overview and some practical ideas.
Get Familiar with Psychoacoustics
Psychoacoustics is the study of how humans perceive sound. As you might expect, we humans don’t process sound in a perfect, linear fashion. The physical shape of the ear, the transfer function of the Basilar membrane, and the psychological interpretation of the data all affect how we perceive sound (and by extension, how “good” an audio codec sounds to us).
I highly recommend you start by reading this excerpt from Surround Sound: Psychoacoustics Part 1, by Tomlinson Holman (he created THX for Lucasfilm).
Understand the Codec
Make sure you understand the codec you are testing; not necessarily the implementation, but what tools (i.e. methods) the codec uses for compression. Many codecs have different “profiles”, which describe what subset of available tools are used (e.g. AAC). You should also have some idea how each compression tool works and any short-comings it has. This will help guide you in selecting reference audio samples and knowing what artifacts to listen for.
For an introduction to modern audio compression, read Audio Coding: An Introduction to Data Compression Part 1, and Part 2 (discusses MP3 and AAC). I actually suggest buying the book “Introduction to Data Compression”, by Khalid Sayood.
Understand the API
Make sure you actually read the codec documentation and look at any available code samples. This step is more about due diligence than anything, as I haven’t seen a codec API we couldn’t work with, but you need to do it. It will also help you scope the work required to get a working encoder/decoder for future steps (if you’re lucky, the sample code can be used).
Choose the Reference Audio Samples
An effective test requires multiple audio samples with different characteristics. There are many types of artifacts a codec can introduce, and your choice of audio samples will dictate how easy they are to detect. It’s also important to pick samples that reflect the actual types of sound the codec will have to deal with. For example, if the final system will primarily be encoding speech, then you should choose more speech-oriented references as opposed to music samples.
Some characteristics you might consider:
- Transients (snare drum): Sensitive to pre-echo and noise “smearing”.
- Tonal structure (clarinet, saxophone): Sensitive to noise and “roughness”.
- Natural speech (male and female voices of various languages): Sensitive to distortion and smearing of “attacks”.
- Complex sound (bagpipes): Stresses the codec.
- High bandwidth (bagpipes): Sensitive to loss of high frequencies and program-modulated high-frequency noise.
It is also possible to use synthetic sounds and sweeps, but this is only recommended for the automated objective tests below.
As a basic guideline, you need 10-25 second “raw” samples recorded at the highest sample rate your system needs to work with. It is vital that the samples you choose have never been compressed with a lossy codec (MP3, AAC, etc)… that would severely limit the quality of your test. For sample rate and size, I suggest 48kHz 16-bit PCM, but a lower rate/size makes sense if the final system is limited in this area. It also makes sense to use a sample rate of 44.1kHz, since many quality audio samples can be ripped losslessly from CD. Just keep in mind that the objective PEAQ test mentioned below requires 48kHz 16-bit PCM, so up-sampling may be required.
The audio samples can be stored in whatever container format you want (raw, WAV, etc) as long as your codec test application can unpack it. This is important to keep in mind… you don’t want to accidentally run the WAV header through the codec (yes, I’ve done this). The container format is more of a practical issue, but it’s worth mentioning.
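To illustrate the header pitfall, here is a minimal sketch using Python’s standard `wave` module to unpack a WAV file so only raw PCM reaches the codec. A tiny 48kHz 16-bit mono WAV is synthesized in memory purely for illustration; in practice you would open your reference file instead.

```python
# Sketch: unpack a WAV container so only raw PCM is fed to the codec,
# never the 44-byte header. A small test WAV is built in memory here
# so the example is self-contained.
import io
import wave

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                     # mono
    w.setsampwidth(2)                     # 16-bit samples
    w.setframerate(48000)                 # 48 kHz
    w.writeframes(b"\x00\x00" * 480)      # 10 ms of silence

buf.seek(0)
with wave.open(buf, "rb") as w:
    assert w.getsampwidth() == 2 and w.getframerate() == 48000
    pcm = w.readframes(w.getnframes())    # raw PCM only; header already parsed

print(len(pcm))                           # 480 frames x 2 bytes = 960 bytes
```

The same pattern works for any container: let a parser strip the framing, and hand the codec nothing but sample data.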
Generate Various Test Samples and Observe CPU Load
This step is pretty straightforward: wrap the codec in an application and encode the reference audio samples at different bit rates. You should choose bit rates that represent the full spectrum of bit rates that will be used in the final system. While you’re encoding, track the codec’s CPU usage and how many cores it’s using. You may even want to do a separate test running many encodes in parallel (this works nicely if the CPU usage is too low to measure accurately). Make sure to consider application overhead and disk I/O when making measurements.
After encoding, you need to decode back to raw PCM. Clearly label your files so you know what bit rate each one was encoded with. These decoded test samples are what we will be comparing to the original reference samples.
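The encode loop can be sketched as a small harness like the one below. The `encode()` function is a stub standing in for your real codec call (a library binding or a subprocess); the bit-rate list and naming scheme are placeholders you would adapt to your system.

```python
# Sketch of an encode-timing harness. encode() is a stub for the real
# codec under test; swap in your library call or subprocess invocation.
import time

BITRATES = [64_000, 96_000, 128_000, 192_000]   # cover the full range used in the final system

def encode(pcm: bytes, bitrate: int) -> bytes:
    # Placeholder: replace with the actual codec. The fake "compression"
    # here just keeps the harness runnable.
    return pcm[: max(1, len(pcm) * bitrate // (48000 * 16))]

def run_suite(samples: dict[str, bytes]) -> dict[tuple[str, int], float]:
    timings = {}
    for name, pcm in samples.items():
        for br in BITRATES:
            t0 = time.perf_counter()
            encoded = encode(pcm, br)
            timings[(name, br)] = time.perf_counter() - t0
            # Label outputs clearly, e.g. f"{name}_{br // 1000}kbps.aac",
            # so you know the bit rate of every decoded file later.
    return timings

timings = run_suite({"snare": b"\x00" * 96000})
print(len(timings))    # one timing per (sample, bitrate) pair
```

Wall-clock timing like this only captures the codec plus harness overhead; for the CPU-load numbers themselves, use your platform’s process monitor while the suite runs.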
Do a Subjective Test
How you conduct your subjective testing will depend on several factors, such as time constraints, cost, and the required test precision. At the low end, you could simply listen to the test samples in a pair of headphones and judge the quality yourself. For a high precision test, you could do a full ITU BS.1116 test using “expert listeners” in a controlled environment. While these examples represent the extremes, there are many permutations that can give you the desired quality of results.
The most common subjective test is called a “double-blind triple-stimulus with hidden reference” test. The listener hears three samples (commonly labeled A, B, and C) for a period of 10 to 25 seconds. A is always the original reference sample. The next two samples, B and C, are randomly assigned either the test sample from the codec or the original reference sample played again (called the “hidden reference”). The listener must then rate the difference between B and A, and C and A, not knowing which one is the test sample. The grading scale is:
- 5.0 Imperceptible
- 4.0 Perceptible, but not annoying
- 3.0 Slightly annoying
- 2.0 Annoying
- 1.0 Very annoying
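The random assignment of B and C described above is easy to get wrong by hand, so it’s worth scripting. Here is a minimal sketch; the file names are placeholders, and the listener should only ever see the labels.

```python
# Sketch: double-blind assignment for one triple-stimulus trial.
# A is always the reference; B and C get the coded sample and the
# hidden reference in random order.
import random

def make_trial(reference: str, coded: str, rng: random.Random) -> dict:
    b, c = rng.sample([reference, coded], k=2)   # random order, no repeats
    return {"A": reference, "B": b, "C": c}

rng = random.Random()
trial = make_trial("ref.wav", "test_128k.wav", rng)
assert trial["A"] == "ref.wav"
assert sorted([trial["B"], trial["C"]]) == ["ref.wav", "test_128k.wav"]
```

Keep the mapping of labels to files out of the listener’s sight until all grades are collected, then use it to score the trials.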
Ideally, you would conduct several tests and average the results together. If you do the listening test yourself, your results will be limited to your listening skills and understanding of audio codec artifacts. Here’s a summary of factors that affect the quality of your results:
- The quality of the listener.
- The choice of audio samples.
- The number and duration of the tests.
- The testing environment, including speaker/headphone quality, room design, and listener placement.
- The quality of randomization of sample order to remove any correlation between samples.
- Proper statistical analysis of the combined test results.
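For the statistical analysis step, even a simple mean with a confidence interval is far better than eyeballing the raw grades. Below is a minimal sketch using the standard library, assuming grades on the 1.0-5.0 impairment scale above; the normal approximation is an assumption that only holds up with a reasonably large panel.

```python
# Sketch: pool listener grades and report mean with a 95% confidence
# interval. Uses a normal approximation; prefer a t-distribution for
# small listening panels.
import math
import statistics

def summarize(scores: list[float]) -> tuple[float, float]:
    mean = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half

mean, half = summarize([4.5, 4.0, 4.5, 5.0, 4.0, 4.5])
print(f"{mean:.2f} +/- {half:.2f}")
```

Overlapping confidence intervals between two codecs are a hint that your panel is too small (or the codecs really are that close) rather than proof of a winner.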
A proper subjective test is both expensive and time consuming. It’s important to find the right balance for your particular needs.
Do an Objective Test
Evaluating a codec objectively requires testing methods that correlate well to actual human perception. You can’t simply measure the distortion introduced by the codec using traditional measurements like Signal-to-Noise ratio (S/N) and Total-Harmonic-Distortion (THD), because they don’t correlate well to perceived audio quality. Some distortion is imperceptible to the human ear, and codecs take advantage of this to increase the compression ratio.
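To make the point concrete, here is what a plain S/N measurement looks like. It is shown only to demonstrate the kind of number that does *not* rank codecs well: two codecs can score similarly here yet sound very different, because S/N weights all error equally while the ear does not.

```python
# Sketch: a traditional SNR measurement between reference and decoded
# PCM. Useful as a sanity check, but a poor proxy for perceived quality.
import math

def snr_db(reference: list[float], decoded: list[float]) -> float:
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    return 10 * math.log10(signal / noise)

# 10 ms of a 440 Hz tone at 48 kHz, with a tiny offset as fake "distortion"
ref = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
dec = [x + 0.001 for x in ref]
print(round(snr_db(ref, dec), 1))
```

A perceptual model would instead ask whether that error energy falls above or below the masking threshold at each frequency, which is exactly what PEAQ does.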
Fortunately, the ITU has standardized an objective audio test called PEAQ (BS.1387). The acronym stands for Perceptual Evaluation of Audio Quality. PEAQ uses software to model the entire human auditory system (including blood flow noise in the inner ear) to generate a set of metrics that are used to give a final “quality” score. The original reference signal is compared to a signal run through the codec, and the result is a real number between 0.0 and -4.0. The result is interpreted on the following scale:
- 0.0 = Imperceptible
- -1.0 = Perceptible but not annoying
- -2.0 = Slightly annoying
- -3.0 = Annoying
- -4.0 = Very annoying
Obviously, values closer to zero are better.
The test was developed by a group of audio experts similar to the one that developed BS.1116 (mentioned above), and its results have been validated against a long list of subjective tests performed by expert listeners.
There are several free and commercial software packages available for doing PEAQ tests. The best free package I’ve found is AFsp from the McGill Telecommunications and Signal Processing Lab. There’s also peaqb, but there are reports that it gives incorrect results. AFsp worked great in my tests and included some helpful tools like CompAudio and InfoAudio.
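Automating the PEAQ runs is worthwhile once you have many bit rates and samples. The sketch below wraps a PEAQ tool invocation and parses the ODG out of its output; the output-line format shown is an assumption based on typical AFsp `PQevalAudio` runs, so check it against your installed version. The subprocess call is left commented out and exercised here against a canned example string so the sketch is self-contained.

```python
# Sketch: run a PEAQ tool and extract the Objective Difference Grade.
# The "Objective Difference Grade:" line format and the -0.873 value are
# assumptions/examples, not real measurements.
import re
# import subprocess
# out = subprocess.run(["PQevalAudio", "ref.wav", "test_128k.wav"],
#                      capture_output=True, text=True).stdout

out = "Objective Difference Grade: -0.873"   # canned example output
match = re.search(r"Objective Difference Grade:\s*(-?\d+\.\d+)", out)
odg = float(match.group(1))
assert -4.0 <= odg <= 0.0                    # closer to zero is better
print(odg)
```

Scripted this way, you can produce an ODG-versus-bitrate table per reference sample and compare it side by side with your subjective grades.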
Hopefully this post has given you a good starting point and some practical ideas for testing audio codecs. My goal was to provide a pragmatic approach with different options depending on what your actual evaluation needs are. This is in no way a comprehensive treatment of the subject; only an overview. I highly suggest reading some of the books I referenced if you’d like a deeper treatment of the subject. Either way, I hope you found this post helpful.