For better voice quality,
Alice can use low-latency anonymity networks such as Tor [3] and JAP [4].
Something tells me these researchers don't know what they're talking about.
You could use Tor or JAP to route encrypted voice streams, although Tor cannot route UDP. The authors are not claiming that using a low-latency anonymity network improves voice quality over not using an anonymizer at all, only that when it comes to anonymity networks routing VoIP, anything other than a low-latency network will totally destroy the voice quality (to the point of being impossible to use, I imagine). A lot of encrypted voice services provide their own 'anonymity networks' in the form of what are essentially high-bandwidth VPNs; these solutions take less of a toll on voice quality than networks such as Tor or even JAP, but they are also more limited in the amount of anonymity they can provide. There has actually been a fairly substantial amount of research on anonymizing and deanonymizing VoIP streams, although to be fair none of it seems particularly unique to VoIP (judging from the little I have read regarding this particular aspect of anonymity).
Encrypted VoIP works by encrypting each UDP packet on its own. I have sniffed such traffic and was able to graph the data volume over time. I then used that volume curve to synthesize a sound.
The result was a bit like listening to Charlie Brown's parents, "wa wa waaa wa wawawawa".
I could not make out words, but I could hear the pauses between the words and sentences, and I could tell that one side talked faster and the other slower. Granted, this was about four years ago; perhaps they have something better now.
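For the curious, here is a minimal sketch of that kind of traffic "sonification", assuming the per-packet timestamps and sizes have already been extracted from a capture (the 50 ms binning interval, 440 Hz tone, and 8 kHz sample rate are arbitrary choices of mine, not what I used back then):

```python
# Sonify encrypted VoIP traffic volume: bin packet sizes over time, then
# write a WAV tone whose loudness tracks the bytes-per-bin curve.
import math
import struct
import wave

SAMPLE_RATE = 8000   # output audio sample rate, Hz (assumed)
BIN_SECONDS = 0.05   # group packets into 50 ms volume bins (assumed)

def volume_bins(packets, bin_seconds=BIN_SECONDS):
    """packets: list of (timestamp_seconds, size_bytes). Returns bytes per bin."""
    if not packets:
        return []
    start, end = packets[0][0], packets[-1][0]
    bins = [0] * (int((end - start) / bin_seconds) + 1)
    for ts, size in packets:
        bins[int((ts - start) / bin_seconds)] += size
    return bins

def synthesize(bins, out, tone_hz=440.0):
    """Write a mono 16-bit WAV whose amplitude follows the traffic volume."""
    peak = max(bins) or 1
    samples_per_bin = int(SAMPLE_RATE * BIN_SECONDS)
    frames = bytearray()
    for i, vol in enumerate(bins):
        amp = vol / peak  # busier bins -> louder tone
        for j in range(samples_per_bin):
            t = (i * samples_per_bin + j) / SAMPLE_RATE
            frames += struct.pack("<h", int(amp * 32767 * math.sin(2 * math.pi * tone_hz * t)))
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(bytes(frames))
```

Listening to the result, silences in the conversation show up as quiet stretches, which is exactly why the word and sentence pauses were audible.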
This can be avoided by sending dummy traffic to maintain a constant volume; however, this dramatically increases the required bandwidth.
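A constant-rate padding scheme along those lines could look like the following sketch. The 160-byte packet size, 1-byte flag, and 2-byte length header are my own illustrative choices; the padding would be applied before encryption, so an observer sees only identical-sized ciphertext packets at a fixed rate whether anyone is speaking or not:

```python
# Constant-rate cover traffic: once per send interval, emit a fixed-size
# packet — a padded real voice frame if one is queued, otherwise a dummy.
import os
from collections import deque

PACKET_SIZE = 160     # bytes on the wire, every interval (assumed)
FLAG_REAL = b"\x01"
FLAG_DUMMY = b"\x00"

def next_packet(frame_queue):
    """Called once per send interval; always returns PACKET_SIZE bytes."""
    if frame_queue:
        frame = frame_queue.popleft()[: PACKET_SIZE - 3]
        header = FLAG_REAL + len(frame).to_bytes(2, "big")
        # Random padding so padded frames look like dummies after encryption.
        return header + frame + os.urandom(PACKET_SIZE - 3 - len(frame))
    return FLAG_DUMMY + os.urandom(PACKET_SIZE - 1)

def extract_frame(packet):
    """Receiver side: strip padding, discard dummies (returns None)."""
    if packet[:1] != FLAG_REAL:
        return None
    length = int.from_bytes(packet[1:3], "big")
    return packet[3 : 3 + length]
```

The bandwidth cost is plain from the sketch: the link carries PACKET_SIZE bytes per interval continuously, even during the (frequent) silences that the unpadded stream would have left empty.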
Encrypting a real-time protocol has special challenges.
If you had a trained classifier and used it to analyze the pauses (which can be determined via interpacket timing), you could quite possibly determine the language being spoken, and possibly even pick out certain phrases or words. The paper I previously linked to shows that analysis of these pauses is enough to identify a previously fingerprinted speaker with high probability, and even to pick out previously fingerprinted words and phrases of a given speaker. I would hypothesize that even without a previously created fingerprint of a particular individual's encrypted VoIP speech, a classifier could glean information about encrypted VoIP streams from interpacket timing characteristics. The speed of speech by itself should be useful for determining, or at least narrowing down, the language being spoken: different languages are spoken at different average rates, so being able to measure the rate of speech should by itself help identify the spoken language with better than random chance.
Of course, before you can learn anything from the pauses you will need a substantial database of speech samples in various languages. You would measure the average interpacket timing for each language, and then, after obtaining a sample of an encrypted VoIP stream, use the classifier to see which of the previously fingerprinted languages the sample most closely matches. I strongly suspect that this technique would have better than random chance probability of correctly identifying the spoken language.
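To make the idea concrete, here is a hypothetical sketch of such a classifier: summarize each stream's interpacket gaps as crude pause statistics, average the labeled samples into one profile per language, and match a new stream with a nearest-centroid rule. The pause threshold and the two features are invented for illustration, not taken from any published system:

```python
# Nearest-centroid language guesser over interpacket-timing pause features.
import math

PAUSE_THRESHOLD = 0.2  # gap length (seconds) we count as a "pause" (assumed)

def pause_features(gaps):
    """gaps: interpacket gaps in seconds -> (pauses per second, mean pause)."""
    pauses = [g for g in gaps if g > PAUSE_THRESHOLD]
    rate = len(pauses) / max(sum(gaps), 1e-9)
    mean_len = sum(pauses) / len(pauses) if pauses else 0.0
    return (rate, mean_len)

def centroid(labeled_streams):
    """Average many labeled streams' features into one language profile."""
    feats = [pause_features(g) for g in labeled_streams]
    return tuple(sum(f[i] for f in feats) / len(feats) for i in range(2))

def classify(gaps, profiles):
    """profiles: {language: centroid}. Returns the nearest language."""
    f = pause_features(gaps)
    return min(profiles, key=lambda lang: math.dist(f, profiles[lang]))
```

Even with features this crude, two languages with sufficiently different pause rates should separate with better than random chance, which is all the hypothesis above requires.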