Sunday, 7 May 2017

Monitoring FreeSWITCH with Homer - adding non-SIP events with hepipe.js

FreeSWITCH (from now on FS) provides a very powerful tool to interact with it: the Event Socket (ESL), made available via the mod_event_socket module (https://freeswitch.org/confluence/display/FREESWITCH/mod_event_socket).

ESL is a TCP socket that applications can connect to in order to perform two types of action:
1. Send commands.
2. Subscribe to events.

Applications subscribing to events will receive the corresponding notifications through the same TCP connection.
The simplicity of both protocol and transport has made it possible to write client libraries in many languages.
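To give an idea of the protocol, this is roughly what a manual inbound ESL session looks like (e.g. via telnet on port 8021; the event names and the '...' placeholders here are just illustrative):

$ telnet 127.0.0.1 8021
Content-Type: auth/request

auth ClueCon

Content-Type: command/reply
Reply-Text: +OK accepted

event plain CHANNEL_CREATE CHANNEL_ANSWER CHANNEL_HANGUP

Content-Type: command/reply
Reply-Text: +OK event listener enabled plain

Content-Type: text/event-plain
Content-Length: ...

Event-Name: CHANNEL_CREATE
...

Commands are terminated by an empty line, and events arrive on the same connection as text/event-plain (or JSON/XML, if requested) bodies.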

Events from FS can serve multiple purposes. In this article I'm interested in monitoring and event correlation.

Homer (http://sipcapture.org/) is a widely used, open source tool to monitor RTC infrastructures. It has a multitude of features, but the core is the ability to collect SIP signalling and other events from RTC applications, and perform a form of correlation. In particular, it's able to correlate the SIP signalling involved in a call with other events like RTCP reports or log lines associated with the same call.

While FS, through the sofia module, has native support for transmitting SIP signalling to Homer, other events can be acquired by collecting them from the ESL, filtering them, and sending them to Homer with the proper formatting.

This is what hepipe.js (https://github.com/sipcapture/hepipe.js) does. hepipe.js is a simple nodejs application that is able to:
- connect to FS via ESL
- subscribe to specific event categories
- format the events into HEP messages and send them to Homer (HEP is the binary protocol used to transmit data to Homer).

hepipe.js is easy to use:
- Clone it
- Run 'sudo npm install' to install the required dependencies
- Set configuration
- Run it ('sudo node hepipe.js', or 'sudo nodejs hepipe.js')

The configuration is organized in "modules"; for this example you'll have to configure at least the esl module and the hep module.
Create a config.js file in the same folder as hepipe.js with something like:

var config = {
  hep_config: {
    debug: true,
    HEP_SERVER: '10.0.0.17',
    HEP_PORT: 9060
  },
  esl_config: {
    debug: true,
    ESL_SERVER: '127.0.0.1',
    ESL_PORT: 8021,
    ESL_PASS: 'ClueCon',
    HEP_PASS: 'multipass',
    HEP_ID: 2222,
    report_call_events: true,
    report_rtcp_events: true,
    report_qos_events: true
  }
};

module.exports = config;

This will configure the hep module to send data to a Homer instance listening on UDP, IP address 10.0.0.17, port 9060, and the esl module to connect to FS's ESL on localhost, via TCP port 8021, using the default password. See also the other configuration examples in the examples/ folder.

Please note that the ESL has two levels of authorization: a password and, optionally, an ACL. Check conf/autoload_configs/event_socket.conf.xml in the FS configuration folder to ensure the ACL in use, if any, is compatible with the source IP address hepipe.js connects from. A sketch of the relevant parameters is shown below.
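The relevant parameters in that file look roughly like this (these are the stock defaults; treat the listen address and the ACL value as examples to adapt):

<configuration name="event_socket.conf" description="Socket Client">
  <settings>
    <param name="listen-ip" value="127.0.0.1"/>
    <param name="listen-port" value="8021"/>
    <param name="password" value="ClueCon"/>
    <!-- uncomment to restrict which source IPs may connect -->
    <!-- <param name="apply-inbound-acl" value="loopback.auto"/> -->
  </settings>
</configuration>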

Once config.js is ready, launch hepipe.js and watch the events being sent to Homer.
Note that you can filter out event types by setting to false some of these:
    report_call_events: true,
    report_rtcp_events: true,
    report_qos_events: true

Assuming FS is configured to send SIP signalling to the same Homer instance, you'll be able to see the events captured by hepipe.js associated with the corresponding SIP call flows.

As an example, log lines created by FS and sent to Homer are then presented together with the SIP signalling of the related call in the Homer UI.



Enjoy!






Sunday, 15 January 2017

Analysing Opus media from network traces

VoIP/RTC platforms typically have many elements processing audio. When an issue is reported it's important to be able to narrow down the investigation, to save time and resources.

A typical scenario is bad or missing audio perceived on the client side. As I've done previously (here for Opus and here for SILK) I'd like to share some practical strategies to extract audio from a pcap trace (to verify the audio received/sent was "correct") and to "re-play" the call inside a test bed (to verify that the audio was good but also carried correctly by the RTP stream). Of course a lot can be inferred from indirect data, for example the summary of RTCP reports showing the number of packets exchanged, packets lost, the latency. But sometimes those metrics are perfect while the issue is still there.

Focusing in this case on Opus audio, and starting from a pcap file with the network traces for a call under investigation, let's see how to decode the Opus frames carried by the RTP packets into an audible WAV file.

You don't even need to have captured the signalling: it's sufficient to have the UDP packets carrying the RTP. If the signalling is not visible to Wireshark it may not recognize that the UDP packets carry RTP, but you can give it a hint by right-clicking on a frame, choosing "Decode As..." and selecting "RTP".

It's typically easy to find the relevant RTP stream in Wireshark ("Telephony -> RTP -> RTP Streams"), select it, and prepare a filter. Then you can Export the packets belonging to that stream into a dedicated pcap file ("File --> Export Specified Packets...").

I've then modified opusrtp in a fork of opus-tools so that it can extract the payload from a given pcap, creating an Opus file, e.g.:

./opusrtp --extract trace.pcap

This will output a rtpdump.opus file, which can be converted into a WAV file directly with opusdec, still part of opus-tools:

./opusdec --rate 8000 rtpdump.opus audio.wav

You can listen to the wav file and verify whether at least the carried RTP payload was valid.

The network trace with the RTP can also be used to re-play the call, injecting the same RTP as in the call under investigation. With the help of sipp you can set up a rudimentary but very powerful test bed. Use the standard UAS scenario (e.g. in uas.xml), but add a part right after the ACK is received, to play the RTP from the exported pcap (see the sketch below).

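A sketch of that part, based on sipp's play_pcap_audio action (the file name matches the rtp_opus.pcap used below; adjust the path to wherever you exported the stream):

<!-- placed right after the existing <recv request="ACK"> -->
<nop>
  <action>
    <exec play_pcap_audio="rtp_opus.pcap"/>
  </action>
</nop>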
If you launch sipp with a command like:

sipp -sf uas.xml -i MEDIA_IP_ADDRESS

you'll be able to call sipp. It will answer the call, as the scenario mandates, and will play the RTP contained in rtp_opus.pcap. The stream SSRC, timestamps, even Marker bits will be preserved. This will give you quite an accurate simulation of the stream received by the client in the original call.

It should be straightforward to obtain all these components. For opus-tools, on a debian-based machine, you can just:

sudo apt-get install libogg-dev libpcap-dev
git clone https://github.com/giavac/opus-tools.git
cd opus-tools
./autogen.sh
./configure
make

For sipp:
sudo apt-get install sip-tester

I hope this will save the reader some time in future investigations.

UPDATE: The fork of opus-tools was merged into the original repo, so you don't need my repo.

UPDATE 2: This only works if the Opus payload in the RTP is not encrypted. Also it may need a patch when the extension header for volume indication is used (e.g. 'urn:ietf:params:rtp-hdrext:ssrc-audio-level', see RFC 6464). Don't forget that at the moment the payload type is hardcoded to 120: you may need to rebuild opusrtp with the payload type your trace uses, e.g. 96 (it should be easy to pass it as a command line argument; something for a quiet moment).



Friday, 13 January 2017

VoIP calls encoded with SILK: from RTP to WAV, update

Three and a half years ago (which really sounds like a lot of time!) I was working with a VoIP infrastructure using SILK. As often happens to server-side developers/integrators, you have to prove whether the audio provided by a client, or to a client, is correctly encoded :-)

Wireshark is able to decode, and play, G.711 streams, but not SILK (or Opus - more on this later). So I thought of having my own tool handy, to generate a WAV file for a PCAP with RTP carrying SILK frames.

The first part requires extracting the SILK payload and writing it down into a bitstream file. Then you have to decode the audio using the SILK SDK decoder, to get a raw audio file. From there to a WAV file it is very easy.

As I tried to describe in this previous post, I had to reverse engineer the test files contained in the SDK, to see what a SILK file looked like.

Since the SILK payload size is not constant, all that was needed was to insert 2 bytes with the length of the following SILK frame. At the beginning of the file you have to add a header containing "#!SILK_V3", and voilà.
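A minimal sketch of just that framing, assuming the SILK payloads have already been extracted from the RTP packets (the dummy payload here is only for illustration, and the pcap parsing is left out):

/* silk_framing_sketch.c - write a SILK SDK bitstream file */
#include <stdint.h>
#include <stdio.h>

static void write_silk_frame(FILE *out, const uint8_t *payload, uint16_t len)
{
    /* 2 bytes (little-endian) with the length of the following SILK frame */
    uint8_t size[2] = { (uint8_t)(len & 0xff), (uint8_t)(len >> 8) };
    fwrite(size, 1, sizeof(size), out);
    fwrite(payload, 1, len, out);
}

int main(void)
{
    FILE *out = fopen("silk.bit", "wb");
    if (!out)
        return 1;

    /* Header expected by the SILK SDK decoder (no trailing newline) */
    fwrite("#!SILK_V3", 1, 9, out);

    /* In the real tool each frame is the SILK payload of one RTP packet */
    uint8_t dummy_payload[] = { 0x00, 0x01, 0x02 };
    write_silk_frame(out, dummy_payload, sizeof(dummy_payload));

    fclose(out);
    return 0;
}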

This is accomplished by silk_rtp_to_bitstream.c (from https://github.com/giavac/silk_rtp_to_bitstream), a small program based on libpcap that extracts the SILK payload from a PCAP and writes it properly into a bitstream file.

Build the binary with:

gcc silk_rtp_to_bitstream.c -lpcap -o silk_rtp_to_bitstream

(you'll need libpcap-dev installed)

Create the bitstream with:

./silk_rtp_to_bitstream input.pcap silk.bit

Now you can decode, using the SILK SDK, from bitstream into raw audio with:

$SILK_SDK/decoder silk.bit silk.raw

Raw audio to WAV can be done with sox:

sox -V -t raw -b 16 -e signed-integer -r 24000 silk.raw silk.wav

This works fine with single channel SILK at 8000 Hz.


More to come: an update on how to accomplish the same but for Opus.


Monday, 19 December 2016

Thinking About Thinking

I've just finished reading a surprising book, and I'd like to share some notes with you. I was initially looking for something to improve the reasoning and logical flow during conversations. I found "Thinking, Fast and Slow", which is not really about that, but I kept going for some pages and then couldn't stop.

I'm not using any kind of referral promotion, so if any reader of this post buys the book I won't get a cent. And I'm no expert in psychology, sociology or economics.

It's just that "Thinking, Fast and Slow" has been a great reading experience. There are some insights on the way people think and decide that are worth analysing.

First of all the author presents two separate ways of "thinking": the first is intuitive and involuntary. It derives from millennia of evolution, where as animals we needed to decide in a fraction of a second whether a situation was a danger or an opportunity. It is effortless and always active.

The second system is voluntary thinking. It requires effort and can't be active all day long. It is strongly influenced by the first system: we may believe we are evaluating a problem coldly and rationally, but the intuitive system has already tagged it one way or another, and energy needs to be spent to fight that "first impression".

This model explains why we often understand something without really thinking about it, or have a very different point of view on something after some proper analysis. The book is dense with information, and I'll just mention some psychological effects that particularly surprised me.

Loss aversion: People tend to weigh losses more than gains. It seems the ratio is 2:1, i.e. losses weigh double the gains. For example, to accept the risk of losing 100 euros most people would need an equal probability of gaining 200 euros. This is probably why people tend to leave things as they are and defer changes: only rarely do the gains of a new situation seem double the losses of leaving the current one.

Sunk-cost fallacy: Once somebody has invested (money, time) in something, they tend to evaluate its status more favourably than reality warrants. People also refrain from abandoning a project or investment if a large cost has already been sunk, even when the forecast is not positive.

Competition neglect (and Planning fallacy): Once involved in an activity, our perception of competing factors tends to fade away. Similarly, when planning a task or project, we tend to be way more optimistic than we should be, given our own experience. I guess I've seen this a few times in software development...

Endowment effect (and Affect heuristic): People associate a higher value with things that they own. This is perhaps why we are willing to buy something at a given price, but would refuse to sell it for the same or an only slightly higher price. Once we own it, there is an emotional value attached to it.

The book shows that humans are not very good at thinking in terms of probability, even when they are aware of the actual probability of an event: they'll consider something more or less probable depending on various psychological factors.

We also often believe that some events are a clear indication of a measurable reality, like the score of a football match or the performance of a company or a stockbroker, and we tend to ignore statistical fluctuations and the related "regression to the mean".


All these elements, and many more, are in contrast with the ideal economic mind, where values, durations, even probabilities of occurrence are always correctly estimated. People are less rational than they like to believe, and being aware of the psychological factors impacting rationality may be useful in our lives.

Wednesday, 30 November 2016

Docker networking and a tricky behaviour

There's something about debugging: the more experience you have finding the root causes of bugs, the higher the hope and confidence that you'll squash the one currently under the microscope.

You see, I didn't mention "fixing" a bug, because I think finding the root cause of a bug has value per se; fixing it is a separate adventure.

Anyway this week I was entirely bamboozled by this: a Docker container was re-deployed with a different networking configuration. In particular, with Compose, I was dedicating a network for that container (let's say 172.18.0.0/24) separate from the default Docker network (e.g. 172.17.0.0/24).

In docker-compose.yml:

networks:
  apps:
    driver: bridge
    ipam:
      driver: default
      config:
      - subnet: 172.18.0.0/24
        gateway: 172.18.0.1


In the "service" definition I just added this:

services:
  THESERVICE:
     ..
      networks:
        - apps

No need to be more specific with names because Compose prefixes the network name with the project name, so in the Docker network list you'll see something like MYPROJECT_apps.

This change worked perfectly on two previous deployments - in fact they were deployments on intermediate stages, in preparation for a deployment on a third stage. Everything went smoothly there, no big deal, with Compose updating the networking as the new configuration was brought up.

This particular container exposes a UDP port (say 8888) and it's expected to receive UDP packets from the VLAN via the host interface. Packets arrive at the host's UDP 8888 and are forwarded to the container's UDP 8888, where they are processed. Very simple.
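For reference, such a mapping would look something like this in the compose file (the service name placeholder and the port are the ones used in this post; the syntax is the standard compose port mapping):

services:
  THESERVICE:
    ports:
      - "8888:8888/udp"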

But when I did the deployment on the third environment, reasonably similar to the first two, things just didn't work. tcpdump was showing that the UDP packets were still received by the host but not forwarded to the container.

The forwarding rules - inspected with iptables - were correct. The container had the expected IP address. ip route was correct. Where were those packets going, then?

I ssh'd into one of the machines that were sending UDP traffic to the receiving host, opened a netcat connection to that host on port 8888, and sent some data. Those packets were correctly received by the container! So some traffic with similar characteristics was received, some wasn't.

Perhaps an issue with data length? Nope. I tried sending big UDP packets and they were correctly received too. It was a weak lead anyway, because the source traffic was of various lengths while not a single packet was forwarded.

Errors logged somewhere? None.

iptables stats were confirming the arrival of the packets, and that they were not routed to the container's virtual interface.

At this point I was enough out of ideas to start tapping the shoulder of some colleague. "I'm observing an interesting behaviour!". "Have you tried...": "Yes". "How about... ": "Checked." "WTF?": "Exactly".

It was indeed a WTF moment. Surely there were some differences in the list of virtual interfaces on the problematic host, but none seemed to have an impact.

We looked deeper into the network traces. And there, glowing in their "light green on black" magnificence, were the ICMP Destination Unreachable responses from the receiving host to the producing hosts: a confirmation that packets were being sent towards an unreachable destination. Why? The forwarding rules were not just right, but also easily testable. It was just the "official" traffic that wasn't forwarded...

Then a colleague, one of the brightest I've ever had, started talking about tracked connections (/proc/net/nf_conntrack). We could see that our netcat connections were tracked, and so were the ones from the other hosts producing data. But there was an important detail. The connections from the "official" producing hosts were associated with the old container IP address (e.g. 172.17.0.2 instead of the new one, 172.18.0.2)!

The usage of conntrack is visible in iptables' output, in particular in the FORWARD chain:

-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Why didn't we observe this behaviour in the other environments? Simple: there the traffic was more sporadic, and so the tracked connections had the time to expire. In the failing environment the continuous flow of UDP packets kept refreshing the tracked connections, even though the forwarding was actually failing.

Only at that point did I have the right scenario to describe, and I could find this Docker issue. From there:
"When you restart the container, the container's IP has changed, and so has the DNAT rule, which will route to the new address. But the old connection's state in conntrack is not cleared. So when a packet arrives, it will not go through the NAT table again, because it is not 'the first' packet. So the solution is clearing the conntrack, [...]"
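In practice, once the new container is up, the stale entries can be removed with the conntrack tool (package "conntrack" on Debian/Ubuntu); the port below is the one from this post:

sudo apt-get install conntrack
sudo conntrack -D -p udp --orig-port-dst 8888

A more drastic alternative is flushing the whole table with 'sudo conntrack -F', at the cost of dropping every tracked connection on the host.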

It was a fun day :) I hope this may save some hours of head scratching to somebody in the future.


Saturday, 29 October 2016

Opus/G.711 Transcoding For The Practical Man

Following my earlier post on "Opus SDP negotiation" in the series "For The Practical Man", I'm presenting today a related topic: Opus audio codec when transcoding is involved.

Most of the providers of PSTN connectivity require the simplest possible VoIP codec: G.711 (which comes in two flavours, u-law and a-law).

G.711 is a sort of PCM encoding at 8000 samples per second: 8000 times per second an audio sample is encoded with 8 bits. Sometimes Comfort Noise can be used, reducing the bitrate when silence is detected, but otherwise the typical working principle is a continuous flow of digitally-encoded packets of voice. u-law and a-law just use a different way to encode the data. (If you're curious about what silence looks like in G.711, I wrote a post about it some time ago).

G.729 is another widely adopted codec, but I'll leave it for another day.

With 8000 samples per second and 8 bits per sample, G.711 requires a net bit rate of 64 Kbps. "Net" because each packet will travel over the IP network with the IP/UDP/RTP overhead (40 Bytes, accounting for an additional 16 Kbps at 20 msec packetization). And this in each direction.

Except in cases where Comfort Noise is used, G.711 is not able to "recover" latency, so every little delay during the network transport will either add to the overall delay, or cause the packet to be dropped, should it not arrive in time for the receiving jitter buffer to handle it correctly.

I hope I'm passing the message that G.711 is expensive and inefficient.

Today though G.711 is a sort of common denominator between communicating parties. It's the "last resort", if you will. In fact it's been chosen by the standardization committee working on WebRTC, together with the Opus codec, for audio: WebRTC requires support for these two audio codecs (see RFC 7874, section 3).

Opus is a much better solution. It's suitable for VoIP (low bitrate, low sample rates) but also for music (very high bitrate, high sample rates). It's flexible enough to adapt the bitrate during a call. When Variable Bitrate is enabled, Opus tries to optimize the bitrate given the network conditions (e.g. packet loss). If there's silence, the Opus encoder keeps sending packets at the required packet rate, but makes them smaller. As a drastic measure, Opus can be instructed to perform DTX (Discontinuous Transmission): silence is not sent at all.

Furthermore, Opus has an error correction mechanism: FEC (Forward Error Correction). FEC is a technique by which a packet of encoded audio doesn't simply contain the encoded audio at the desired quality for the current packet, but also a low bitrate encoding of the previous packet. This means that if packet N is lost, it's possible for the decoder to generate a lower quality approximation thanks to the FEC information received in packet N+1. Of course there's a price to pay: FEC information consumes bandwidth without improving quality! It's a redundancy system whose advantages apply only in case of packet loss.
The other caveat is that, since FEC only covers the immediately preceding frame, if two consecutive packets are lost the decoder doesn't have any FEC information to reconstruct one of them: that data is lost forever.
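To make the mechanism concrete, this is roughly how FEC recovery is driven with the libopus decoding API (a sketch, not taken from any specific project; the last argument of opus_decode() is the decode_fec flag):

/* Sketch of FEC recovery, assuming 20 msec frames at 48 kHz (frame_size = 960). */
#include <opus/opus.h>

void on_packet(OpusDecoder *dec, const unsigned char *data, int len,
               opus_int16 *pcm, int frame_size, int previous_was_lost)
{
    if (previous_was_lost) {
        /* Recover an approximation of the lost frame from the FEC
           information embedded in the current packet (decode_fec = 1) */
        opus_decode(dec, data, len, pcm, frame_size, 1);
        /* ... hand the recovered frame to the playout buffer ... */
    }
    /* Normal decode of the current packet (decode_fec = 0) */
    opus_decode(dec, data, len, pcm, frame_size, 0);
}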

What comes in handy in these cases is a different technique: PLC (Packet Loss Concealment). PLC reconstructs missing packets by interpolation, and so works even when more than one packet is lost.

Packet size also has an impact. Considering the overhead required to transmit a packet of voice over an IP network, "grouping" longer intervals of sound allows a smaller bandwidth usage: a stream with packets representing 40 msec will require less bandwidth than the same audio with packets representing 20 msec. The drawback is that it's more "traumatic" to lose a bigger packet, so a compromise must be found.
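As a quick worked example, with the 40-byte IP/UDP/RTP overhead mentioned earlier: at 20 msec packetization that's 50 packets/s * 40 Bytes * 8 = 16 Kbps of overhead per direction, while at 40 msec it halves to 25 packets/s * 40 Bytes * 8 = 8 Kbps.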

Now, coming back to G.711 and Opus, there's one case where they are not just mutually exclusive, but work together. This happens when you would like to use Opus given its advantages, but you must use G.711 for interoperability reasons. Enter "transcoding". One entity, like a mobile app dealing with 3G (or worse), uses Opus, but the call has to be routed to a "traditional" GSM or landline phone, and the provider of this routing mandates G.711. It can be assumed that the network link between you, the service provider of the mobile app, and the PSTN provider, has properties that allow G.711 streams to be handled. Bandwidth efficiency won't be great but hey, this is life.

Transcoding means that one stream has to be decoded from one codec (e.g. from Opus) and then re-encoded with the other codec (e.g. G.711).
This operation is computationally demanding. And if you compare it with a scenario where the encoded packets just flow through without the need to be touched, you can see why transcoding is not desirable: it probably reduces any system's capacity by at least an order of magnitude.
But as I wrote earlier, sometimes it's just a constraint to accept.

In the field of Open Source VoIP there are some well known applications able to perform audio transcoding, and I focus here on FreeSWITCH. FreeSWITCH can be configured to accept incoming calls with either Opus or G.711, and transcode the streams one into the other depending on the needs. For example, mobile app to FreeSWITCH will use Opus, FreeSWITCH to PSTN provider will use G.711. And the other way around.

Opus can work with sample rates up to 48000 Hz, which means 48000 audio samples each second, and a bitrate up to 510000 bps. The audio quality can be so good that Opus can be used to encode music. When dealing with VoIP though the key characteristics are not audio quality per se, but the compromise between available bandwidth, network conditions, and voice intelligibility.

When transcoding from "high quality" (potentially Opus) to low quality (G.711), though, and vice versa, the advantages of higher sample rates are somehow lost. I have this analogy in mind, that works for my brain: it's like connecting two pipes (as in physical, plumbing pipes) with very different diameters. When water comes flowing from the narrow pipe, there's no advantage in making it flow through a wider pipe: the flow is limited upstream. Similarly, the wider pipe won't be able to transfer all the water to the narrower one, and water leaks will appear. If you think this analogy is not quite right because what counts in plumbing pipes is also the speed of water, be kind and ignore the whole thing :-)

All this to say that it's possible to make bandwidth usage more efficient, without decreasing quality, by using Opus at 8000 samples/second, instead of the potential 48000. Furthermore, it's possible to limit the average bitrate, knowing that "quality can't be worse". Surely a compromise must be found, but the main principle is that since one side is "low quality" it's useless to try and "create quality" on the other side.
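As a rough illustration with FreeSWITCH's mod_opus, this kind of cap can be expressed in opus.conf.xml; the parameter names below are the ones I recall from that file (double-check them against the document mentioned below), and the values simply reflect the "don't try to create quality" principle:

<configuration name="opus.conf" description="Opus codec configuration">
  <settings>
    <param name="use-vbr" value="1"/>
    <param name="maxplaybackrate" value="8000"/>
    <param name="maxaveragebitrate" value="20000"/>
  </settings>
</configuration>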

All this reasoning has been reflected in the work done inside the Libon project by Dragos. Recently we've tried to put down all this info in some sort of structured and (at least in our intentions) comprehensive document (FreeSWITCH and the Opus audio codec).

This document describes the usage of Opus inside FreeSWITCH from various points of view: configuration, installation, debugging, development. We also wanted to establish a common terminology to ease information sharing and discussion around this topic.


If you've made it to this point reading the article it means you're really interested in this topic: congratulations. Please take some time to read that document too and feel free to send over any feedback or question you may have, thank you. This is all just a learning process. 

Thursday, 29 September 2016

Opus negotiation for the practical man

Opus [0] is a versatile audio codec, with a variable sample rate and bitrate, suitable for both music and speech. It is defined in RFC 6716 [1] and required by WebRTC [2].

Opus can operate at various sample rates, from 8 KHz to 48 KHz, and at variable bitrates, from 6 kbit/sec to 510 kbit/sec.

The RTP payload format defined for Opus in RFC 7587 [3] explains the use of media type parameters in SDP, and this article aims to analyze them and show in particular how "asymmetric streams" can be achieved.

This is an example of SDP defining an Opus offer or answer:

       m=audio 54312 RTP/AVP 101
       a=rtpmap:101 opus/48000/2
       a=fmtp:101 maxplaybackrate=16000; sprop-maxcapturerate=16000;
       maxaveragebitrate=20000; stereo=1; useinbandfec=1; usedtx=0
       a=ptime:40
       a=maxptime:40 


Let's clarify one thing immediately, about rtpmap.

rtpmap


As specified in RFC 7587 Ch. 7, the media subtype portion of rtpmap must always be 'opus/48000/2' (48000 samples/sec, 2 channels), regardless of the actual sample rate used. So you can happily leave this configuration element out of your thoughts, even if you want to use a narrowband version of Opus.

e.g.:

a=rtpmap:96 opus/48000/2

Another less than intuitive aspect to clarify is how RTP timestamps are managed, given that the RTP stream can represent audio with variable sample rates.

RTP timestamp


From RFC 7587, Ch. 4.1:

   Opus supports 5 different audio bandwidths, which can be adjusted
   during a stream.  The RTP timestamp is incremented with a 48000 Hz
   clock rate for all modes of Opus and all sampling rates.  The unit
   for the timestamp is samples per single (mono) channel.  The RTP
   timestamp corresponds to the sample time of the first encoded sample
   in the encoded frame.  For data encoded with sampling rates other
   than 48000 Hz, the sampling rate has to be adjusted to 48000 Hz.

This can be interpreted in this way: "The timestamp must always be set as if the sample rate is 48000 Hz."

Default case: the encoder is set at 48 KHz. A 20 msec frame contains 960 (48000 samples/sec * 20 msec) samples.
When the encoder is set at 8 KHz, instead, a 20 msec frame contains 160 (8000 samples/sec * 20 msec) samples. The timestamp increment in the RTP packet must be adapted, so that it is normalised to 48 KHz, by multiplying the number of samples by 6 (48000/8000).

In both cases, then, a 20 msec frame will have an RTP representation of 960 "time ticks".

Now we start looking at the parameters that help the two parties in setting their encoders and decoders.

maxplaybackrate


From RFC 7587, Ch. 6.1:

     maxplaybackrate:  a hint about the maximum output sampling rate that
      the receiver is capable of rendering in Hz.  The decoder MUST be
      capable of decoding any audio bandwidth, but, due to hardware
      limitations, only signals up to the specified sampling rate can be
      played back.  Sending signals with higher audio bandwidth results
      in higher than necessary network usage and encoding complexity, so
      an encoder SHOULD NOT encode frequencies above the audio bandwidth
      specified by maxplaybackrate.  This parameter can take any value
      between 8000 and 48000, although commonly the value will match one
      of the Opus bandwidths (Table 1).  By default, the receiver is
      assumed to have no limitations, i.e., 48000.

This optional parameter is telling the encoder on the other side: "Since I won't be able to play at rates higher than `maxplaybackrate` you can save resources and bandwidth by limiting the encoding rate to this value."

A practical case is transcoding from Opus to G.711, where the final playback rate will in any case be 8000 Hz.

sprop-maxcapturerate


The mirror-image (and also optional) parameter is sprop-maxcapturerate, defined in RFC 7587 Ch. 6.1:

     sprop-maxcapturerate:  a hint about the maximum input sampling rate
      that the sender is likely to produce.  This is not a guarantee
      that the sender will never send any higher bandwidth (e.g., it
      could send a prerecorded prompt that uses a higher bandwidth), but
      it indicates to the receiver that frequencies above this maximum
      can safely be discarded.  This parameter is useful to avoid
      wasting receiver resources by operating the audio processing
      pipeline (e.g., echo cancellation) at a higher rate than
      necessary.  This parameter can take any value between 8000 and
      48000, although commonly the value will match one of the Opus
      bandwidths (Table 1).  By default, the sender is assumed to have
      no limitations, i.e., 48000.

This parameter is telling the decoder on the other side: "Since I won't be able to produce audio at rates higher than `sprop-maxcapturerate` you can save resources by limiting the decoding rate to this value."

A practical example is transcoding from G.711 to Opus, with the source always limited to a capture rate of 8000 samples/sec.

maxaveragebitrate


An additional element, maxaveragebitrate, refers to the maximum average bitrate that the receiver will be able to manage. This is a hint that it's not worth it for the remote encoder to use higher bitrates, and that it can instead save resources.

From RFC 7587, Ch. 6.1:

     maxaveragebitrate:  specifies the maximum average receive bitrate of
      a session in bits per second (bit/s).  The actual value of the
      bitrate can vary, as it is dependent on the characteristics of the
      media in a packet.  Note that the maximum average bitrate MAY be
      modified dynamically during a session.  Any positive integer is
      allowed, but values outside the range 6000 to 510000 SHOULD be
      ignored.

This parameter is telling the remote encoder: "Since my decoder can't handle bitrates higher than maxaveragebitrate, you can save computation power and bandwidth by limiting your encoder bitrate to this value."

A practical example could be a mobile client that wants to ensure the download bandwidth is not saturated. Note that this value refers only to the initial negotiation (SDP offer/answer), while the parties can negotiate different values during an active call.

Asymmetric negotiation


Given the interpretations above, it also seems possible to negotiate asymmetrical streams: the two entities involved can encode and decode at different rates when appropriate.

In particular, if we imagine an entity with local parameters:

maxplaybackrate=Da; sprop-maxcapturerate=Ea; maxaveragebitrate=Fa

and remote parameters:

maxplaybackrate=Db; sprop-maxcapturerate=Eb; maxaveragebitrate=Fb

then this entity can set the decoder at a sample rate of min(Da, Eb) and the encoder at a sample rate of min(Ea, Db) and bitrate at Fb.

Similarly and intuitively, the other entity involved can set the decoder at a sample rate of min(Db, Ea) and the encoder at a sample rate of min(Eb, Da) and bitrate Fa.

All these values are optional, as mentioned above, so there are various permutations possible here. In particular, when maxaveragebitrate is not provided it's assumed to be the maximum (510000 bps).
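As a sketch of this interpretation (not a real SDP parser; the example values are arbitrary and missing parameters are simply replaced by their defaults), an endpoint could derive its settings like this:

#include <stdio.h>

static int min_int(int x, int y) { return x < y ? x : y; }

int main(void)
{
    /* Local (a) parameters: what we can play back, capture, and receive */
    int maxplaybackrate_a = 16000, sprop_maxcapturerate_a = 16000, maxaveragebitrate_a = 20000;
    /* Remote (b) parameters from the peer's SDP; absent values default
       to 48000 Hz / 510000 bit/s as discussed above */
    int maxplaybackrate_b = 48000, sprop_maxcapturerate_b = 8000, maxaveragebitrate_b = 510000;

    int decoder_rate    = min_int(maxplaybackrate_a, sprop_maxcapturerate_b);  /* min(Da, Eb) */
    int encoder_rate    = min_int(sprop_maxcapturerate_a, maxplaybackrate_b);  /* min(Ea, Db) */
    int encoder_bitrate = maxaveragebitrate_b;                                 /* Fb */

    printf("decode at %d Hz; encode at %d Hz, capped at %d bit/s\n",
           decoder_rate, encoder_rate, encoder_bitrate);
    printf("the remote encoder, in turn, is capped at %d bit/s by our Fa\n",
           maxaveragebitrate_a);
    return 0;
}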

I hope this can clarify some subtleties, or at least open a table for discussion and eventually lead to a better understanding of the topic.

References


[0] https://www.opus-codec.org/
[1] http://tools.ietf.org/html/rfc6716
[2] https://tools.ietf.org/html/draft-ietf-rtcweb-audio-10
[3] https://tools.ietf.org/html/rfc7587