
Opus/G.711 Transcoding For The Practical Man

Following my earlier post on "Opus SDP negotiation" in the series "For The Practical Man", today I'm presenting a related topic: the Opus audio codec when transcoding is involved.

Most of the providers of PSTN connectivity require the simplest possible VoIP codec: G.711 (which comes in two flavours, u-law and a-law).

G.711 is a sort of PCM encoding at 8000 samples per second: 8000 times per second an audio sample is encoded with 8 bits. Sometimes Comfort Noise can be used, reducing the bitrate when silence is detected, but otherwise the typical working principle is a continuous flow of digitally-encoded packets of voice. u-law and a-law just use a different way to encode the data. (If you're curious about what silence looks like in G.711, I wrote a post about it some time ago).
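
To make the "8 bits per sample, 8000 times per second" idea concrete, here is a minimal sketch of a u-law encoder in Python. It follows the widely circulated bias-and-segment approach for turning a signed 16-bit PCM sample into one byte; the function names are my own choice for illustration, and a-law (which uses a different mapping) is not covered.

```python
BIAS = 0x84   # 132, added to the magnitude before locating the segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample into one G.711 u-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS

    # Locate the segment (exponent): highest set bit, scanning bit 14 down to bit 7.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1

    # Keep the 4 bits that follow the leading bit of the segment.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F

    # u-law bytes are sent with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_ulaw(samples):
    """Encode an iterable of 16-bit PCM samples: one output byte per sample."""
    return bytes(linear_to_ulaw(s) for s in samples)


one_second_of_silence = encode_ulaw([0] * 8000)
print(len(one_second_of_silence) * 8)   # 64000 bits for one second of audio
print(hex(one_second_of_silence[0]))    # 0xff, the u-law byte for a zero sample
```

One byte out for every sample in: at 8000 samples per second that is exactly the 64 Kbps discussed below, whatever the content of the audio.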

G.729 is another widely adopted codec, but I'll leave it for another day.

With 8000 samples per second and 8 bits dedicated to each sample, G.711 requires a net bit rate of 64 Kbps. "Net" because each packet will travel over the IP network with the IP/UDP/RTP header overhead (40 bytes, accounting for an additional 16 Kbps at 20 msec packetization). And this in each direction.
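
The arithmetic above can be sketched in a few lines of Python. I'm assuming here that the 40 bytes are the IPv4 (20), UDP (8) and RTP (12) headers, and I'm ignoring anything below IP (Ethernet framing, for example); the function name is just for illustration.

```python
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
HEADER_BYTES = 20 + 8 + 12  # assumed: IPv4 + UDP + RTP headers


def g711_bandwidth(ptime_ms=20):
    """Return net, overhead and gross bit rates (bps) for one direction."""
    net_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
    packets_per_second = 1000 // ptime_ms
    payload_bytes = net_bps // 8 // packets_per_second
    overhead_bps = HEADER_BYTES * 8 * packets_per_second
    return {
        "packets_per_second": packets_per_second,
        "payload_bytes": payload_bytes,
        "net_bps": net_bps,
        "overhead_bps": overhead_bps,
        "gross_bps": net_bps + overhead_bps,
    }


print(g711_bandwidth(20))
```

At 20 msec this gives 50 packets per second of 160 payload bytes each, 64000 bps net, 16000 bps of headers, and 80000 bps in total per direction.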

Except in cases where Comfort Noise is used, G.711 is not able to "recover" latency, so every little delay during network transport will either add to the overall delay or cause the packet to be dropped, should it not arrive in time for the receiving jitter buffer to handle it correctly.

I hope I'm passing the message that G.711 is expensive and inefficient.

Today though G.711 is a sort of common denominator between communicating parties. It's the "last resort", if you will. In fact it's been chosen by the standardization committee working on WebRTC, together with the Opus codec, for audio. WebRTC requires support for these two audio codecs (see RFC 7874, section 3).

Opus is a much better solution. It's suitable for VoIP (low bitrate, small sample rates) but also for music (very high bitrate, large sample rates). It's flexible enough to adapt the bitrate during a call. When Variable Bitrate is enabled, Opus tries to optimize the bitrate given some network conditions (e.g. packet loss). If there's silence, the Opus encoder keeps sending packets at the packet rate required, but makes them smaller. As a drastic measure, Opus can be instructed to perform DTX (Discontinuous Transmission): silence is not sent at all.

Furthermore, Opus has an error correction mechanism: FEC (Forward Error Correction). FEC is a technique by which a packet of encoded audio doesn't simply contain the encoded audio at the desired quality for the current packet, but also a low bitrate encoding of the previous packet. This means that if packet N is lost, it's possible for the decoder to generate a lower quality approximation thanks to the FEC information received in packet N+1. Of course there's a price to pay: FEC information consumes bandwidth without improving quality! It's a redundancy system whose advantages apply only in case of packet loss.
The other caveat is that, since FEC only covers the immediately preceding frame, if two consecutive packets are lost the decoder doesn't have any FEC information to reconstruct the first of them. That data is lost forever.

What comes in handy in these cases is a different technique: PLC (Packet Loss Concealment). PLC reconstructs missing audio by estimating it from the audio that did arrive, and so it also works when more than one packet is lost.
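
The interplay between FEC and PLC can be illustrated with a toy model in Python. This is only a sketch of the rule described above (packet N is recoverable via FEC only if packet N+1 arrived); it doesn't touch any real Opus API, and the function name and packet numbers are made up for the example.

```python
def classify_losses(total_packets, lost):
    """Split lost packet numbers into FEC-recoverable and PLC-only.

    Packet N can be rebuilt from FEC only if packet N+1 was received,
    because N+1 carries the low bitrate copy of N.
    """
    lost = set(lost)
    fec_recoverable = []
    plc_only = []
    for n in sorted(lost):
        next_packet = n + 1
        if next_packet < total_packets and next_packet not in lost:
            fec_recoverable.append(n)
        else:
            plc_only.append(n)
    return fec_recoverable, plc_only


# 10 packets numbered 0..9; packet 3 is lost alone, 6 and 7 are lost together.
print(classify_losses(10, [3, 6, 7]))   # ([3, 7], [6])
```

Packet 3 is covered by the FEC data in packet 4 and packet 7 by the data in packet 8, while packet 6 has no surviving copy: that gap is left to PLC.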

The size of the packets also has an impact on error recovery. Considering the overhead required to transmit a packet of voice over an IP network, "grouping" longer intervals of sound into each packet allows a smaller bandwidth usage. A stream with packets representing 40 msec will require less bandwidth than the same audio with packets representing 20 msec. The drawback here is that it's more "traumatic" to lose a bigger packet, so a compromise must be found.
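
A quick way to see the effect of packet size is to add the per-packet header cost to a fixed codec bitrate. In this sketch the 40 bytes of headers (IPv4 + UDP + RTP) and the 24 kbps codec bitrate are assumptions picked for illustration, not measurements.

```python
HEADER_BYTES = 40  # assumed: IPv4 (20) + UDP (8) + RTP (12)


def stream_bandwidth_bps(codec_bitrate_bps, ptime_ms):
    """Codec bitrate plus header overhead for a given packetization time."""
    packets_per_second = 1000 / ptime_ms
    overhead_bps = HEADER_BYTES * 8 * packets_per_second
    return codec_bitrate_bps + overhead_bps


# Hypothetical 24 kbps stream at three packetization times.
for ptime in (20, 40, 60):
    total = stream_bandwidth_bps(24000, ptime)
    print(f"{ptime} ms packets: {total / 1000:.1f} kbps on the wire")
```

With these numbers, going from 20 msec to 40 msec packets brings the stream from 40 kbps down to 32 kbps, at the cost of losing twice as much audio whenever a packet is dropped.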

Now, coming back to G.711 and Opus, there's one case where they are not just mutually exclusive, but work together. This happens when you would like to use Opus given its advantages, but you must use G.711 for interoperability reasons. Enter "transcoding". One entity, like a mobile app dealing with 3G (or worse), uses Opus, but the call has to be routed to a "traditional" GSM or landline phone, and the provider of this routing mandates G.711. It can be assumed that the network link between you, the service provider of the mobile app, and the PSTN provider, has properties that allow G.711 streams to be handled. Bandwidth efficiency won't be great but hey, this is life.

Transcoding means that one stream has to be decoded from one codec (e.g. from Opus) and then re-encoded with the other codec (e.g. G.711).
This operation is computationally demanding. And if you compare it with a scenario where the encoded packets just flow directly without the need to be touched, you can see why transcoding is not desirable. It probably reduces any system's capacity by at least an order of magnitude.
But as I wrote earlier, sometimes it's just a constraint to accept.

In the field of Open Source VoIP there are some well known applications able to perform audio transcoding, and I focus here on FreeSWITCH. FreeSWITCH can be configured to accept incoming calls with either Opus or G.711, and transcode the streams one into the other depending on the needs. For example, mobile app to FreeSWITCH will use Opus, FreeSWITCH to PSTN provider will use G.711. And the other way around.

Opus can work with sample rates up to 48000 Hz, which means 48000 audio samples each second, and a bitrate up to 510000 bps. The audio quality can be so good that Opus can be used to encode music. When dealing with VoIP though the key characteristics are not audio quality per se, but the compromise between available bandwidth, network conditions, and voice intelligibility.

When transcoding though from "high quality" (potentially Opus) to low quality (G.711), and vice versa, the advantages of higher sample rates are somehow lost. I have this analogy in mind, that works for my brain: it's like connecting two pipes (as in physical, plumbing pipes) with very different diameters. When water comes flowing from the narrow pipe, there's no advantage in making it flow through a wider pipe: the flow is limited upstream. Similarly, the wider pipe won't be able to transfer all the water to the narrower one, and water leaks will appear. If you think this analogy is not quite right because what counts in plumbing pipes is also the speed of water, be kind and ignore the whole thing :-)

All this to say that it's possible to make bandwidth usage more efficient, without decreasing quality, by using Opus at 8000 samples/second, instead of the potential 48000. Furthermore, it's possible to limit the average bitrate, knowing that "quality can't be worse". Surely a compromise must be found, but the main principle is that since one side is "low quality" it's useless to try and "create quality" on the other side.
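
On the signalling side, these limits are typically expressed through the Opus SDP parameters defined in RFC 7587 (maxplaybackrate, sprop-maxcapturerate, maxaveragebitrate, useinbandfec, usedtx). The Python sketch below just assembles the two SDP lines; the payload type 111 and the 14000 bps cap are hypothetical values for illustration, not recommendations.

```python
def opus_sdp_lines(payload_type, max_rate_hz=8000, max_average_bitrate=None,
                   use_inband_fec=True, use_dtx=False):
    """Build rtpmap and fmtp lines for a narrowband-limited Opus stream."""
    params = [
        f"maxplaybackrate={max_rate_hz}",
        f"sprop-maxcapturerate={max_rate_hz}",
    ]
    if max_average_bitrate is not None:
        params.append(f"maxaveragebitrate={max_average_bitrate}")
    params.append(f"useinbandfec={int(use_inband_fec)}")
    params.append(f"usedtx={int(use_dtx)}")
    return [
        # The rtpmap stays opus/48000/2 whatever the actual audio bandwidth.
        f"a=rtpmap:{payload_type} opus/48000/2",
        f"a=fmtp:{payload_type} " + "; ".join(params),
    ]


# Hypothetical payload type and bitrate cap.
for line in opus_sdp_lines(111, max_average_bitrate=14000):
    print(line)
```

Note that the rtpmap line always advertises 48000 Hz and 2 channels: the narrower sample rate is only hinted at through the fmtp parameters, and the remote side is free to ignore the hint.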

All this reasoning has been reflected in the work done inside the Libon project by Dragos. Recently we've tried to put down all this info in some sort of structured and (at least in our intentions) comprehensive document (FreeSWITCH and the Opus audio codec).

This document describes the usage of Opus inside FreeSWITCH from various points of view: configuration, installation, debugging, development. We also wanted to establish a common terminology, to make sharing information and discussing this topic easier.


If you've made it to this point reading the article it means you're really interested in this topic: congratulations. Please take some time to read that document too and feel free to send over any feedback or question you may have, thank you. This is all just a learning process. 
