Recently, I was tasked with picking an AAC audio codec library for one of our products. There were several libraries I had to evaluate, and I needed some quantitative metrics for doing the comparison. I’m not what professionals call an “expert listener”, so I had to do the best with what I had. While creating my test plan, I noticed that more people seemed interested in how I was doing testing rather than the actual results. So I decided to share my approach to audio codec testing.
Note: This is intended to be a pragmatic guide for engineers evaluating codecs. It is not a comprehensive treatment of the subject. The goal is to give readers a solid overview and some practical ideas.
Get Familiar with Psychoacoustics
Psychoacoustics is the study of how humans perceive sound. As you might expect, we humans don’t process sound in a perfect, linear fashion. The physical shape of the ear, the transfer function of the basilar membrane, and the psychological interpretation of the data all affect how we perceive sound (and by extension, how “good” an audio codec sounds to us).
I highly recommend you start by reading this excerpt from Surround Sound: Psychoacoustics Part 1, by Tomlinson Holman (he created THX for Lucasfilm).
Understand the Codec
Make sure you understand the codec you are testing; not necessarily the implementation, but what tools (i.e. methods) the codec uses for compression. Many codecs have different “profiles”, which describe what subset of the available tools are used (e.g. AAC). You should also have some idea of how each compression tool works and any shortcomings it has. This will help guide you in selecting reference audio samples and knowing what artifacts to listen for.
For an introduction to modern audio compression, read Audio Coding: An Introduction to Data Compression Part 1, and Part 2 (discusses MP3 and AAC). I actually suggest buying the book “Introduction to Data Compression”, by Khalid Sayood.
Understand the API
Make sure you actually read the codec documentation and look at any available code samples. This step has more to do with due diligence than anything, as I haven’t seen a codec API we couldn’t work with, but you need to do this. It will also help you scope the work required to get a working encoder/decoder for future steps (if you’re lucky, the sample code can be used).
Choose the Reference Audio Samples
An effective test requires multiple audio samples with different characteristics. There are many types of artifacts a codec can introduce, and your choice of audio samples will dictate how easy they are to detect. It’s also important to pick samples that reflect the actual types of sound the codec will have to deal with. For example, if the final system will primarily be encoding speech, then you should choose more speech-oriented references as opposed to music samples.
Some characteristics you might consider:
- Transients (snare drum): Sensitive to pre-echo and noise “smearing”.
- Tonal structure (clarinet, saxophone): Sensitive to noise and “roughness”.
- Natural speech (male and female voices of various languages): Sensitive to distortion and smearing of “attacks”.
- Complex sound (bagpipes): Stresses the codec.
- High bandwidth (bagpipes): Sensitive to loss of high frequencies and program-modulated high-frequency noise.
It is also possible to use synthetic sounds and sweeps, but this is only recommended for the automated objective tests below.
As a basic guideline, you need 10-25 second “raw” samples recorded at the highest sample rate your system needs to work with. It is vital that the samples you choose have never been compressed with a lossy codec (mp3, AAC, etc)… that would severely limit the quality of your test. For sample rate and size, I suggest 48kHz 16-bit PCM, but a lower rate/size makes sense if the final system is limited in this area. It also makes sense to use a sample rate of 44.1kHz, since many quality audio samples can be ripped losslessly from CD. Just keep in mind that the objective PEAQ test mentioned below requires 48kHz 16-bit PCM, so up-sampling may be required.
The audio samples can be stored in whatever container format you want (raw, WAV, etc) as long as your codec test application can unpack it. This is important to keep in mind… you don’t want to accidentally run the WAV header through the codec (yes, I’ve done this). The container format is more of a practical issue, but it was worth mentioning.
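To illustrate the container point, here’s a minimal sketch of locating the raw PCM inside a WAV file so the header never reaches the codec. The helper name `find_pcm_data` is made up for this example, and it assumes a well-formed little-endian RIFF/WAVE file; it walks the chunk list rather than assuming the canonical 44-byte header, since real files often carry extra chunks.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the file offset of the first PCM byte in
 * a WAV file, and the PCM length in bytes, or -1 on a malformed file.
 * Assumes a little-endian host (WAV fields are little-endian). */
long find_pcm_data(FILE* f, uint32_t* data_len)
{
    char id[4];
    uint32_t size;

    /* RIFF header: "RIFF" <size> "WAVE" */
    if (fread(id, 1, 4, f) != 4 || memcmp(id, "RIFF", 4) != 0) return -1;
    if (fread(&size, 4, 1, f) != 1) return -1;
    if (fread(id, 1, 4, f) != 4 || memcmp(id, "WAVE", 4) != 0) return -1;

    /* Walk the chunk list until we hit the "data" chunk. */
    while (fread(id, 1, 4, f) == 4 && fread(&size, 4, 1, f) == 1)
    {
        if (memcmp(id, "data", 4) == 0)
        {
            *data_len = size;
            return ftell(f);   /* offset of the first PCM byte */
        }
        fseek(f, (long)((size + 1) & ~1u), SEEK_CUR);  /* chunks are word-aligned */
    }
    return -1;
}
```

Everything from the returned offset onward (for `data_len` bytes) is safe to feed to the codec.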
Generate Various Test Samples and Observe CPU Load
This step is pretty straightforward: wrap the codec in an application and encode the reference audio samples at different bit rates. You should choose bit rates that represent the full spectrum of bit rates that will be used in the final system. While you’re encoding, track the CPU usage of the codec and how many cores it’s using. You may even want to do a separate test running many encodes in parallel (this works nicely if the CPU usage is too low to measure accurately). Make sure to consider application overhead and disk I/O when making measurements.
After encoding, you need to decode back to raw PCM. Clearly label your files so you know what bit rate each one was encoded with. These decoded test samples are what we will be comparing to the original reference samples.
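One low-tech way to keep the labeling honest is to bake the bit rate into every output filename. This is only a sketch; the naming scheme and the helper name `label_outputs` are examples, not anything your codec toolchain prescribes.

```c
#include <stdio.h>

/* Sketch: build clearly labeled output names for one reference sample
 * across several target bit rates, so every decoded file can later be
 * traced back to the bit rate that produced it. */
void label_outputs(char names[][64], const char* ref, const int* kbps, int n)
{
    for (int i = 0; i < n; i++)
        snprintf(names[i], 64, "%s_%dkbps.aac", ref, kbps[i]);
}
```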
Do a Subjective Test
How you conduct your subjective testing will depend on several factors, such as time constraints, cost, and the required test precision. At the low end, you could simply listen to the test samples in a pair of headphones and judge the quality yourself. For a high precision test, you could do a full ITU BS.1116 test using “expert listeners” in a controlled environment. While these examples represent the extremes, there are many permutations that can give you the desired quality of results.
The most common subjective test is called a “double-blind triple-stimulus with hidden reference” test. The listener hears three samples (commonly labeled A, B, and C) for a period of 10 to 25 seconds. A is always the original reference sample. The next two samples, B and C, are randomly assigned either the test sample from the codec or the original reference sample played again (called the “hidden reference”). The listener must then rate the difference between B and A, and C and A, not knowing which one is the test sample. The grading scale is:
- 5.0 Imperceptible
- 4.0 Perceptible, but not annoying
- 3.0 Slightly annoying
- 2.0 Annoying
- 1.0 Very annoying
Ideally, you would conduct several tests and average the results together. If you do the listening test yourself, your results will be limited to your listening skills and understanding of audio codec artifacts. Here’s a summary of factors that affect the quality of your results:
- The quality of the listener.
- The choice of audio samples.
- The number and duration of the tests.
- The testing environment, including speaker/headphone quality, room design, and listener placement.
- The quality of randomization of sample order to remove any correlation between samples.
- Proper statistical analysis of the combined test results.
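For the randomization item above, an unbiased Fisher-Yates shuffle is a simple way to order sample presentation. This sketch uses the C standard library’s `rand()` (seed it once with `srand()` at startup); `rand()` is adequate here since nothing security-sensitive depends on it.

```c
#include <stdlib.h>

/* Sketch: shuffle the presentation order of n test samples in place
 * using Fisher-Yates, which gives every permutation equal probability
 * (up to rand()'s quality). A separate coin flip per trial can decide
 * whether B or C carries the codec output. */
void shuffle(int* order, int n)
{
    for (int i = n - 1; i > 0; i--)
    {
        int j = rand() % (i + 1);   /* uniform index in [0, i] */
        int t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
}
```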
A proper subjective test is both expensive and time consuming. It’s important to find the right balance for your particular needs.
Do an Objective Test
Evaluating a codec objectively requires testing methods that correlate well to actual human perception. You can’t simply measure the distortion introduced by the codec using traditional measurements like Signal-to-Noise ratio (S/N) and Total-Harmonic-Distortion (THD), because they don’t correlate well to perceived audio quality. Some distortion is imperceptible to the human ear, and codecs take advantage of this to increase the compression ratio.
Fortunately, the ITU has standardized an objective audio test called PEAQ (BS.1387). The acronym stands for Perceptual Evaluation of Audio Quality. PEAQ uses software to model the entire human auditory system (including blood flow noise in the inner ear) to generate a set of metrics that are used to give a final “quality” score. The original reference signal is compared to a signal run through the codec, and the result is a real number between 0.0 and -4.0. The result is interpreted on the following scale:
- 0.0 = Imperceptible
- -1.0 = Perceptible but not annoying
- -2.0 = Slightly annoying
- -3.0 = Annoying
- -4.0 = Very annoying
Obviously, values closer to zero are better.
The test was developed by a group of audio experts similar to the one that developed BS.1116 (mentioned above), and its results have been validated against a long list of subjective tests done using expert listeners.
There are several free and commercial software packages available for doing PEAQ tests. The best free package I’ve found is AFsp from the McGill Telecommunications and Signal Processing Lab. There’s also peaqb, but there are reports that it gives incorrect results. AFsp worked great in my tests and included some helpful tools like CompAudio and InfoAudio.
Hopefully this post has given you a good starting point and some practical ideas for testing audio codecs. My goal was to provide a pragmatic approach with different options depending on what your actual evaluation needs are. This is in no way a comprehensive treatment of the subject; only an overview. I highly suggest reading some of the books I referenced if you’d like a deeper treatment of the subject. Either way, I hope you found this post helpful.
If you’ve ever done embedded development in C/C++, you are probably familiar with bitfields. They are a handy way to reference individual bits in things like hardware registers. The problem is that bitfields can lead to performance problems and race conditions if not used properly. I hope to highlight some of the issues you should consider when using them.
First, let’s assume you need to check various fields in a hardware register with the following layout:
You could define the following bitfield to represent this register:
```c
1: struct HwReg
2: {
3:     unsigned int Base   : 16;
4:     unsigned int Offset : 8;
5:     unsigned int Rsvd   : 5;
6:     unsigned int Flag   : 1;
7:     unsigned int Type   : 2;
8: };
```
The total size of this data type is sizeof(unsigned int), with each line defining a different region (field) within that type (the syntax can be confusing at first). The following code uses the HwReg bitfield to access a memory-mapped register:
```c
1: struct HwReg* pReg = (struct HwReg*)0x80001005;
2:
3: if (pReg->Flag && pReg->Type == TYPE_1)
4: {
5:     void* address = (void*)(pReg->Base + pReg->Offset);
6: }
```
Line 1 defines a pointer to the physical hardware register as type HwReg. We can now use this pointer to easily access the register fields. If this isn’t clear, you can read more about how bitfields work in any good C reference.
The compiler doesn’t know how to optimize bitfield accesses (especially because pointers to memory-mapped hardware registers are almost always declared ‘volatile’). This means that every access to a member of the bitfield will require a read of the physical hardware register. This can be orders of magnitude slower than accessing main memory. In the code example above, the hardware register will be read four times: once for each field access.
The way to remedy this is to cache a copy of the register value and then operate on that. Consider the following code:
```c
1: unsigned int* pFullReg = (unsigned int*)0x80001005;
2: unsigned int temp = *pFullReg;
3: struct HwReg* pReg = (struct HwReg*)&temp;
4:
5: if (pReg->Flag && pReg->Type == TYPE_1)
6: {
7:     void* address = (void*)(pReg->Base + pReg->Offset);
8: }
```
Line 1 defines a pointer to the physical hardware register. Line 2 performs the actual read into a local variable (the slowest part). This local copy is now in main memory and the CPU cache. Line 3 casts the cached value to the bitfield for easy access. Finally, all accesses to the register fields are to the cached value, which can be read very quickly from the L1 cache.
Another advantage to this approach is when the hardware requires locking before the register can be accessed. By caching the value, you can keep all the locking code localized to a single area of the function. Without caching, you would hold the lock for a longer period of time (possibly forcing other operations to block) and have to make sure to release the lock on every return path (more difficult with exceptions).
NOTE: Remember you are only working with a copy of the register value. If you update a value in the bitfield, you must still copy the updated value back to the register.
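Putting that note into code, a sketch of the full read-modify-write cycle might look like this. The helper name `set_flag_and_type` is made up for the example, and the HwReg definition from earlier is repeated so the sketch is self-contained; the field layout within the word is implementation-defined, which is fine here because we read and write fields through the same bitfield type.

```c
/* Same bitfield as defined earlier in the post. */
struct HwReg
{
    unsigned int Base   : 16;
    unsigned int Offset : 8;
    unsigned int Rsvd   : 5;
    unsigned int Flag   : 1;
    unsigned int Type   : 2;
};

/* Sketch: read the register once, update fields on the local copy,
 * then write the whole value back in a single store. */
void set_flag_and_type(volatile unsigned int* pFullReg, unsigned int type)
{
    unsigned int temp = *pFullReg;            /* one read of the register */
    struct HwReg* pReg = (struct HwReg*)&temp;

    pReg->Flag = 1;                           /* cheap: operates on the copy */
    pReg->Type = type;

    *pFullReg = temp;                         /* one write back to hardware */
}
```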
As stated above, each access to a field value generates its own read/write operation. Even if the CPU architecture guarantees that an individual operation is atomic, updating multiple fields is not. Thus, in a multi-threaded application you must lock the entire block of code that operates on the bitfield. I again suggest caching the value, as you then only need to lock the actual read/write of the entire register.
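Here is a sketch of that locking pattern using POSIX threads. The function names and the lock itself are hypothetical; in a real driver the lock might instead be a spinlock, an interrupt mask, or a hardware semaphore, but the shape is the same: only the single read and the single write of the register hold the lock, not the field manipulation in between.

```c
#include <pthread.h>

pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sketch: atomically snapshot the whole register. All the bitfield
 * work then happens lock-free on the returned copy. */
unsigned int read_reg_locked(volatile unsigned int* pFullReg)
{
    pthread_mutex_lock(&reg_lock);
    unsigned int temp = *pFullReg;   /* the only read that needs the lock */
    pthread_mutex_unlock(&reg_lock);
    return temp;
}

/* Sketch: atomically write the updated copy back to the register. */
void write_reg_locked(volatile unsigned int* pFullReg, unsigned int value)
{
    pthread_mutex_lock(&reg_lock);
    *pFullReg = value;
    pthread_mutex_unlock(&reg_lock);
}
```

Note this serializes whole-register reads and writes against each other; a read-modify-write that must be atomic as a unit would need to hold the lock across both calls.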
Bitfields are a nice language construct that can help make it easier to write clean code (as opposed to using macros and bitmasks). Unfortunately, it’s all too easy to shoot yourself in the foot with bitfields if you don’t understand the pitfalls. As always, use caution when writing performance-critical code and make sure you understand how to use the available language constructs.