(Disclaimer: Since I do not speak German, most of the background information and raw data in this article is taken from a very informative posting by Robert Michel to the vorbis mailing list. The analysis, however, is mine, so don't complain to Robert about it.)
During the last week of August 2002, the German IT magazine c't performed an online test of various audio compression codecs at both 64 kbps and 128 kbps. Each participant was given 7 audio files of a sample: the uncompressed original (WAV) and six versions that had been compressed and decompressed with the following codecs:

- Ogg Vorbis
- MP3Pro
- WMA
- AAC
- RealAudio
- MP3
c't aggregated the results into the percentage of participants who gave each codec a particular ranking, where 1 is best and 7 is worst (for example, "41% of the participants gave WAV a ranking of 1"). The following tables show these percentages:
64 kbps test (3295 participants): percentage of participants giving each ranking (1 = best, 7 = worst)

| Codec | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| WAV | 41 | 21 | 15 | 10 | 7 | 5 | 1 |
| Ogg Vorbis | 25 | 21 | 17 | 13 | 11 | 10 | 3 |
| MP3Pro | 11 | 23 | 26 | 18 | 15 | 6 | 1 |
| WMA | 10 | 17 | 17 | 22 | 23 | 10 | 1 |
| AAC | 7 | 12 | 17 | 26 | 22 | 14 | 2 |
| RealAudio | 4 | 5 | 8 | 10 | 20 | 51 | 2 |
| MP3 | 1 | 1 | 1 | 1 | 2 | 4 | 90 |

128 kbps test (2785 participants): percentage of participants giving each ranking (1 = best, 7 = worst)

| Codec | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| WAV | 21 | 17 | 15 | 13 | 13 | 11 | 10 |
| Ogg Vorbis | 21 | 16 | 15 | 13 | 13 | 12 | 10 |
| MP3Pro | 11 | 13 | 16 | 14 | 15 | 15 | 16 |
| WMA | 13 | 14 | 15 | 14 | 16 | 17 | 11 |
| AAC | 11 | 11 | 12 | 12 | 14 | 14 | 26 |
| RealAudio | 12 | 15 | 14 | 18 | 14 | 14 | 13 |
| MP3 | 11 | 14 | 15 | 15 | 16 | 16 | 14 |
Analyzing these results presents some interesting challenges. Because of the way c't reported the data, the rankings given by any particular individual are not available. Given the subjective nature of audio quality, an individual's relative rankings would be more useful than the absolute ranking values themselves. That is, it is more meaningful to know that a participant ranked MP3 above RealAudio than to know that he or she gave MP3 a ranking of 5 and RealAudio a ranking of 7.
However, this information cannot be extracted from the preceding tables, so we must make do with some inaccurate assumptions. For the purposes of analysis, we will treat the data as the results of an experiment in which the "ranking" of each codec was measured N times, where N is the number of participants in the test. Each measurement is considered to be independent (clearly false), and the rankings are considered to be equally spaced across the "quality spectrum" (also false). The question we will ask is: "What is the expected ranking a random listener will give a particular codec?" We will interpret this expectation value as the "goodness" of the codec.
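The article does not write the calculation out. Assuming the standard definitions, with $p_{c,r}$ the fraction of participants who gave codec $c$ rank $r$, the expected ranking $\mu_c$ and its spread $\sigma_c$ (one standard deviation of the ranking distribution) would be:

$$
\mu_c = \sum_{r=1}^{7} r\,p_{c,r},
\qquad
\sigma_c = \sqrt{\sum_{r=1}^{7} r^{2}\,p_{c,r} - \mu_c^{2}}
$$

As a worked example, for WAV in the 64 kbps test:

$$
\mu_{\mathrm{WAV}} = \frac{1\cdot 41 + 2\cdot 21 + 3\cdot 15 + 4\cdot 10 + 5\cdot 7 + 6\cdot 5 + 7\cdot 1}{100} = \frac{240}{100} = 2.40,
\qquad
\sigma_{\mathrm{WAV}} = \sqrt{\frac{824}{100} - 2.40^{2}} = \sqrt{2.48} \approx 1.57
$$

This rounds to the 2.4 +/- 1.6 reported for WAV below.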
The expected rankings and their uncertainties (given as +/- 1 standard deviation) are reported below (LOWER IS BETTER!):
64 kbps test (3295 participants)

| Codec | Expected ranking |
|---|---|
| WAV | 2.4 +/- 1.6 |
| Ogg Vorbis | 3.1 +/- 1.8 |
| MP3Pro | 3.3 +/- 1.4 |
| WMA | 3.7 +/- 1.5 |
| AAC | 3.9 +/- 1.5 |
| RealAudio | 5.0 +/- 1.4 |
| MP3 | 6.7 +/- 1.0 |

128 kbps test (2785 participants)

| Codec | Expected ranking |
|---|---|
| WAV | 3.5 +/- 2.0 |
| Ogg Vorbis | 3.6 +/- 2.0 |
| MP3Pro | 4.2 +/- 2.0 |
| WMA | 4.0 +/- 1.9 |
| AAC | 4.5 +/- 2.1 |
| RealAudio | 4.0 +/- 1.9 |
| MP3 | 4.2 +/- 1.9 |
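As a check on the arithmetic, the following Python sketch recomputes both tables from the c't percentages. It assumes the mean and population standard deviation of each row's ranking distribution (the article does not spell out its formulas) and normalizes by the row total. Its output agrees with the tabulated values to within about 0.1; small differences are expected because the published percentages are themselves rounded (the 128 kbps MP3 row sums to 101%).

```python
import math

# Percentage of participants giving each codec rankings 1..7, copied from the
# c't tables above (1 = best, 7 = worst). Rows may not sum to exactly 100
# because the published percentages are rounded.
RESULTS = {
    "64 kbps": {
        "WAV":        [41, 21, 15, 10, 7, 5, 1],
        "Ogg Vorbis": [25, 21, 17, 13, 11, 10, 3],
        "MP3Pro":     [11, 23, 26, 18, 15, 6, 1],
        "WMA":        [10, 17, 17, 22, 23, 10, 1],
        "AAC":        [7, 12, 17, 26, 22, 14, 2],
        "RealAudio":  [4, 5, 8, 10, 20, 51, 2],
        "MP3":        [1, 1, 1, 1, 2, 4, 90],
    },
    "128 kbps": {
        "WAV":        [21, 17, 15, 13, 13, 11, 10],
        "Ogg Vorbis": [21, 16, 15, 13, 13, 12, 10],
        "MP3Pro":     [11, 13, 16, 14, 15, 15, 16],
        "WMA":        [13, 14, 15, 14, 16, 17, 11],
        "AAC":        [11, 11, 12, 12, 14, 14, 26],
        "RealAudio":  [12, 15, 14, 18, 14, 14, 13],
        "MP3":        [11, 14, 15, 15, 16, 16, 14],
    },
}


def expected_ranking(percentages):
    """Return (mean, standard deviation) of a ranking distribution.

    Each percentage is treated as the relative frequency of rank r = 1..7,
    normalized by the row total so a row summing to 101 is not biased.
    """
    total = sum(percentages)
    ranks = range(1, len(percentages) + 1)
    mean = sum(r * p for r, p in zip(ranks, percentages)) / total
    mean_sq = sum(r * r * p for r, p in zip(ranks, percentages)) / total
    return mean, math.sqrt(mean_sq - mean * mean)


if __name__ == "__main__":
    for test_name, codecs in RESULTS.items():
        print(test_name)
        for codec, pcts in codecs.items():
            mean, sd = expected_ranking(pcts)
            print(f"  {codec:<11} {mean:.1f} +/- {sd:.1f}")
```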
In general, the results show a very large uncertainty in the expected ranking. Despite the large number of participants, their rankings were extremely variable and showed few clear trends. This shows that nearly all of the codecs do a reasonable job of encoding for untrained listeners (as most of the participants were). When only 41% of the listeners ranked the original first against the 64 kbps samples, and only 21% did so against the 128 kbps samples, it is clear that the audience is not very discerning. However, that also makes this a valuable test for gauging how the codecs will sound to a general audience.
In the 64 kbps test, MP3 was definitely strained. We can say with a reasonable degree of certainty that it was worse than WAV and the "high-end" codecs (Ogg Vorbis, MP3Pro, WMA, and AAC). It is also likely that RealAudio is better than MP3, and it may be worse than the high-end codecs. This test does not demonstrate any significant difference between WAV, Ogg Vorbis, MP3Pro, WMA, or AAC at 64 kbps.
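The text does not state exactly how "a reasonable degree of certainty" is judged. One crude criterion, assumed here rather than taken from the article, is to call two codecs separated when their mean +/- 1 standard deviation intervals from the table do not overlap. The Python sketch below applies it to the published 64 kbps values. Note that this standard deviation is the spread of individual listeners' rankings, so the check is a rough reading of the table and not a formal significance test.

```python
from itertools import combinations

# Published 64 kbps expected rankings: codec -> (mean, 1-sigma spread).
# Lower is better.
PUBLISHED_64 = {
    "WAV": (2.4, 1.6),
    "Ogg Vorbis": (3.1, 1.8),
    "MP3Pro": (3.3, 1.4),
    "WMA": (3.7, 1.5),
    "AAC": (3.9, 1.5),
    "RealAudio": (5.0, 1.4),
    "MP3": (6.7, 1.0),
}


def separated(a, b):
    """True if the intervals [mean - sd, mean + sd] of a and b do not overlap.

    This is an assumed, deliberately crude criterion, not a statistical test.
    """
    (mean_a, sd_a), (mean_b, sd_b) = a, b
    return abs(mean_a - mean_b) > sd_a + sd_b


def separated_pairs(table):
    """All codec pairs (in table order) whose 1-sigma intervals are disjoint."""
    return [
        (x, y)
        for x, y in combinations(table, 2)
        if separated(table[x], table[y])
    ]


if __name__ == "__main__":
    for x, y in separated_pairs(PUBLISHED_64):
        print(f"{x} vs {y}: separated")
```

It flags exactly the five pairs of MP3 with WAV, Ogg Vorbis, MP3Pro, WMA, and AAC, and leaves RealAudio overlapping every other codec, which matches the paragraph above. With the 128 kbps values, where every spread is at least 1.9 and no two means differ by more than 1.0, no pair is separated.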
At 128 kbps, the codecs cannot be ranked at all. This bitrate appears to be where all of the codecs are indistinguishable from the original.
In summary, the only really conclusive result of this test is that the big loser at 64 kbps is MP3. (Unfortunately, we already knew that.) It also shows that Ogg Vorbis cannot be declared the winner at either bitrate, even though a casual inspection of the numbers might suggest it. The average user cannot really tell the difference between any of the high-end codecs, so vendors should use other criteria when selecting a codec for applications targeting those types of users.
More meaningful results could be obtained if the individual rankings were available. The relative rankings (the percentage of users ranking MP3 better than WMA, and so on) could be tabulated and analyzed in the same way as above. Additionally, a test with a more discerning audience could be simulated by removing all of the participants who did not give the original WAV a ranking of 1 and repeating the analysis.
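If per-participant data were available, both proposed analyses are short to express. The Python sketch below assumes a hypothetical record format (one dictionary per listener mapping codec name to the rank that listener gave it); c't did not publish data in this form, and the toy responses are invented purely to show usage.

```python
from typing import Dict, List

# Hypothetical per-participant record: codec name -> rank that one listener
# gave it (1 = best). This layout is an assumption made for illustration.
Response = Dict[str, int]


def fraction_preferring(responses: List[Response], a: str, b: str) -> float:
    """Fraction of participants who ranked codec `a` better (lower) than `b`.

    Each listener ranks every codec once, so two codecs never share a rank
    and a strict comparison is enough.
    """
    if not responses:
        raise ValueError("need at least one response")
    wins = sum(1 for r in responses if r[a] < r[b])
    return wins / len(responses)


def discerning(responses: List[Response], reference: str = "WAV") -> List[Response]:
    """Keep only listeners who ranked the uncompressed reference first."""
    return [r for r in responses if r[reference] == 1]


if __name__ == "__main__":
    # Made-up toy data with three codecs, NOT the c't results.
    toy = [
        {"WAV": 1, "WMA": 2, "MP3": 3},
        {"MP3": 1, "WMA": 2, "WAV": 3},
        {"WAV": 1, "MP3": 2, "WMA": 3},
    ]
    print(f"{fraction_preferring(toy, 'WMA', 'MP3'):.2f}")  # 0.33
    print(f"{fraction_preferring(discerning(toy), 'WMA', 'MP3'):.2f}")  # 0.50
```

Computing `fraction_preferring` for every ordered pair of codecs would give the table of relative rankings, and running it on `discerning(...)` instead of the full set would simulate the more discerning audience.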
Anyone from c't willing to provide the original data is encouraged to contact me at the email address given above. :)