Following my earlier post on "Opus SDP negotiation" in the series "For The Practical Man", I'm presenting today a related topic: Opus audio codec when transcoding is involved.
Most of the providers of PSTN connectivity require the simplest possible VoIP codec: G.711 (which comes in two flavours, u-law and a-law).
G.711 is a sort of PCM encoding at 8000 samples per second: 8000 times per second an audio sample is encoded with 8 bit. Sometimes Comfort Noise can be used, reducing the bitrate when silence is detected, but otherwise the typical working principle is a continuous flow of digitally-encoded packets of voice. u-law and a-law just use a different way to encode the data. (If you're curious about what does silence look like in G.711, I wrote a post about it some time ago).
G.729 is another widely adopted codec, but I'll leave it for another day.
With 8000 samples per second and 8 bit dedicated for each sample G.711 requires a net bit rate of 64 Kbps. "Net" because each packet will travel over the IP network with the IP/UDP overhead (40 Bytes, accounting for an additional 16 Kbps at 20 msec packetization). And this in each direction.
Except in cases where Comfort Noise is used, G.711 is not able to "recover" latency, so every little delay during the network transport will either sum up with the overall delay, or cause the packet to be dropped, should not arrive in time for the receiving jitter buffer to correctly handle it.
I hope I'm passing the message that G.711 is expensive and inefficient.
Today though G.711 is a sort of common denominator between communicating parties. It's the "last resort", if you will. In fact it's been chosen by the standardization committee working on WebRTC, together with the Opus codec, for audio. WebRTC requires support for these two audio codecs (see RFC 7875, chapter 3).
Opus is a much better solution. It's suitable for VoIP (low bitrate, small sample rates) but also for music (very high bitrate, large sample rates). It's flexible enough to adapt the bitrate during a call. When Variable Bitrate is enabled, Opus tries to optimize the bitrate given some network conditions (e.g. packet loss). If there's silence, the Opus encoder keeps sending packets at the packet rate required, but makes them smaller. As a drastic measure, Opus can be instructed to perform DTX (Discontinuous Transmission): silence is not sent at all.
Furthermore, Opus has an error correction mechanism: FEC (Forward Error Correction). FEC is a technique by which a packet of encoded audio doesn't simply contain the encoded audio at the desired quality for the current packet, but also a low bitrate encoding of the previous packet. This means that if packet N is lost, it's possible for the decoder to generate a lower quality approximation thanks to the FEC information received at packet N+1. Of course there's a price to pay: FEC information consumes bandwidth without improving quality! It's a redundancy system which advantages apply only in case of packet loss.
The other caveat is that given FEC works only on contiguous frames, if two consecutive packets are lost, then the decoder doesn't have any FEC information to reconstruct one of them. That data is lost forever.
What comes handy in this cases is a different technique: PLC (Packet Loss Conciliation). PLC is able to reconstruct missing packets by interpolation, and so works also when more than one packet is lost.
What has an impact in error recovering is also the size of the packets. Considering the overhead required to transmit a packet of voice over an IP network, "grouping" longer intervals of sound will allow a smaller bandwidth usage. A stream with packets representing 40 ms will require a smaller bandwidth than the same audio with packets representing 20 msec. The drawback here is that it's more "traumatic" to lose a bigger packet, so a compromise must be found.
Now, coming back to G.711 and Opus, there's one case where they are not just mutually exclusive, but work together. This happens when you would like to use Opus given its advantages, but you must use G.711 for interoperability reasons. Enter "transcoding". One entity, like a mobile app dealing with 3G (or worse), uses Opus, but the call has to be routed to a "traditional" GSM or landline phone, and the provider of this routing mandates G.711. It can be assumed that the network link between you, the service provider of the mobile app, and the PSTN provider, has properties that allow G.711 streams to be handled. Bandwidth efficiency won't be great but hey, this is life.
Transcoding means that one stream has to be decoded from one codec (e.g. from Opus) and then re-encoded with the other codec (e.g. G.711).
This operation is computationally demanding. And if you compare it with a scenario where the encoded packets just flow directly without the need to be touched, you can get how transcoding is not desirable. It probably reduces any system's capacity of at least an order of magnitude.
But as I wrote earlier, sometimes it's just a constraint to accept.
In the field of Open Source VoIP there are some well known applications able to perform audio transcoding, and I focus here on FreeSWITCH. FreeSWITCH can be configured to accept incoming calls with either Opus or G.711, and transcode the streams one into the other depending on the needs. For example, mobile app to FreeSWITCH will use Opus, FreeSWITCH to PSTN provider will use G.711. And the other way around.
Opus can work with sample rates up to 48000 Hz, which means 48000 audio samples each second, and a bitrate up to 510000 bps. The audio quality can be so good that Opus can be used to encode music. When dealing with VoIP though the key characteristics are not audio quality per se, but the compromise between available bandwidth, network conditions, and voice intelligibility.
When transcoding though from "high quality" (potentially Opus) to low quality (G.711), and viceversa, the advantages of higher sample rates are somehow lost. I have this analogy in mind, that works for my brain: it's like connecting two pipes (as in physical, plumbing pipes), with very different diameter. When water comes flowing from the narrow pipe, there's no advantage in making it flow through a wider pipe: the flow is limited upstream. Similarly, the wider pipe won't be able to transfer all the water to the narrower, and water leaks will appear. If you think this analogy is not quite right because what counts in plumbing pipes is also the speed of water, be kind and ignore the whole thing :-)
All this to say that it's possible to make bandwidth usage more efficient, without decreasing quality, by using Opus at 8000 samples/second, instead of the potential 48000. Furthermore, it's possible to limit the average bitrate, knowing that "quality can't be worse". Surely a compromise must be found, but the main principle is that since one side is "low quality" it's useless to try and "create quality" on the other side.
All this reasoning has been reflected in the work done inside the Libon project by Dragos. Recently we've tried to put down all this info in some sort of structured and (at least in our intentions) comprehensive document (FreeSWITCH and the Opus audio codec).
This document describes the usage of Opus inside FreeSWITCH under various points of view: configuration, installation, debugging, development. What we wanted to achieve was also the sharing of a common terminology to ease the info sharing and the discussions around this topic.
If you've made it to this point reading the article it means you're really interested in this topic: congratulations. Please take some time to read that document too and feel free to send over any feedback or question you may have, thank you. This is all just a learning process.