Jitter Buffer

Abstract	Opus Codec
Authors	Walter Fan
Status	WIP
Updated	2024-08-21

Overview 

How to implement an audio jitter buffer

Understand the concept of jitter: Jitter refers to the variation in the arrival time of audio packets due to network congestion, packet loss, or other factors. To handle jitter, you need to buffer incoming audio packets and play them out at a steady pace.
Design the buffer structure: Decide on the data structure for your buffer, which will store the incoming audio packets. Typically, a circular buffer is used for this purpose. The buffer should be large enough to accommodate variations in packet arrival times.
Set the buffer size: Determine the appropriate buffer size based on your requirements. A larger buffer can handle more significant variations in packet arrival times but may introduce additional latency.
Receive and store incoming packets: As audio packets arrive, store them in the buffer according to their sequence number or timestamp. Make sure to handle out-of-order packets correctly by reordering them in the buffer.
Estimate the playout time: Use the timestamps or sequence numbers of the received packets to estimate the playout time for each packet. This estimation is based on the average inter-arrival time of the packets.
Schedule packet playout: Start a playout timer based on the estimated playout time for the next packet in the buffer. When the timer expires, retrieve the packet from the buffer and play it out.
Handle late or missing packets: If a packet arrives late or is missing, you may need to decide how to handle the situation. One approach is to interpolate or extrapolate the missing audio based on adjacent packets. Alternatively, you can introduce silence or concealment techniques to minimize the impact of missing packets.
Adapt buffer size dynamically: Monitor the network conditions and adjust the buffer size dynamically if needed. This can help optimize the trade-off between latency and resilience to jitter.
Continuously update the buffer: Regularly update the buffer by removing played-out packets and adding new incoming packets. Maintain the correct order of packets and ensure that the buffer does not overflow or underflow.
Implement error resilience mechanisms: To handle severe jitter or packet loss, you can consider implementing additional error resilience mechanisms such as forward error correction (FEC) or retransmission.
Test and optimize: Thoroughly test your audio jitter buffer implementation under various network conditions to ensure its reliability and performance. Measure and fine-tune parameters such as buffer size, playout timing, and error resilience mechanisms.
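
The receive, reorder, and missing-packet steps above can be sketched as follows. This is a minimal illustration, not a production design: the class name `ReorderBuffer` and the helper `seq_newer` are hypothetical, and the only protocol fact assumed is that RTP sequence numbers are 16 bits wide and wrap around.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits and wrap around


def seq_newer(a: int, b: int) -> bool:
    """Return True if sequence number a is newer than b, allowing for wraparound."""
    return a != b and ((a - b) % SEQ_MOD) < SEQ_MOD // 2


class ReorderBuffer:
    """Hypothetical helper: stores packets by sequence number, releases them in order."""

    def __init__(self, first_seq: int):
        self.next_seq = first_seq  # next sequence number due for playout
        self.packets = {}          # seq -> payload

    def push(self, seq: int, payload: bytes) -> bool:
        """Store a packet; return False if it is late or a duplicate."""
        if seq != self.next_seq and not seq_newer(seq, self.next_seq):
            return False           # its playout slot has already passed
        if seq in self.packets:
            return False           # duplicate
        self.packets[seq] = payload
        return True

    def pop(self):
        """Return the payload due for playout, or None if it is missing."""
        payload = self.packets.pop(self.next_seq, None)
        self.next_seq = (self.next_seq + 1) % SEQ_MOD
        return payload
```

A `None` from `pop()` marks a gap that the caller would hand to a concealment step. A dictionary is used here instead of a circular buffer only to keep the sketch short.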

Remember that developing an audio jitter buffer can be challenging, especially when considering real-time constraints and the need for low latency. It is often beneficial to refer to existing libraries or frameworks that provide audio streaming capabilities and jitter buffer implementations, such as the WebRTC library for web-based applications or specialized audio streaming frameworks.

Jitter 

Jitter is the variation in packet transit delay caused by queuing, contention, and serialization effects along the network path.

In general, higher levels of jitter are more likely on slow or heavily congested links.

ITU-T G.114 recommends keeping one-way delay below 150 ms for an acceptable conversation experience. Delays between 150 ms and 400 ms are still acceptable, but users notice the impact on audio quality. Latency above 600 ms is unacceptable.

Audio Quality 

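Continuity of speech

Sound is a continuously varying signal. If a short stretch of silence (mute) is suddenly inserted, the listener hears it as a pop or noise.

When packets are lost, the receiver generates artificial audio to fill the gap, for example by muting or by repeating the last packet.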
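
The two gap-filling strategies mentioned above (mute, repeat last packet) can be illustrated with a small sketch. The function name `conceal` and its parameters are hypothetical, and zero bytes are assumed to represent silence, which holds for signed linear PCM but not for every codec payload.

```python
def conceal(frames, frame_bytes, strategy="repeat"):
    """Fill lost frames (marked as None) with silence or a copy of the last good frame.

    Assumption: payloads are signed linear PCM, so zero bytes mean silence.
    """
    silence = bytes(frame_bytes)   # frame_bytes zero-valued bytes
    last_good = silence            # nothing to repeat before the first good frame
    out = []
    for frame in frames:
        if frame is None:
            out.append(last_good if strategy == "repeat" else silence)
        else:
            out.append(frame)
            last_good = frame
    return out
```

Repeating the last frame avoids the abrupt drop to silence for a single lost packet, but it tends to sound robotic over longer bursts of loss, which is why real decoders usually fade the repeated audio out.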

Latency in conversation

A one-way latency of up to 200 ms is considered acceptable

Voice overlap becomes a concern when the one-way latency is more than 200 ms

Jitter classification

Type A – constant jitter. This is a roughly constant level of packet to packet delay variation.
Type B – transient jitter. This is characterized by a substantial incremental delay that may be incurred by a single packet.
Type C – short term delay variation. This is characterized by an increase in delay that persists for some number of packets, and may be accompanied by an increase in packet to packet delay variation. Type C jitter is commonly associated with congestion and route changes.

Jitter buffer Overview 

The network delivers RTP packets asynchronously, with variable delays.

To be able to play the audio stream with reasonable quality, the receiving endpoint needs to turn the variable delays into constant delays.

This can be done by using a jitter buffer.

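For the receiver, the first priority in improving voice quality is to reduce packet loss and to ensure, as far as possible, that the played-out voice is continuous and in its original order.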

Jitter is defined as a variation in the delay of received packets.
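
One common way to turn this definition into a number is the interarrival jitter estimator from RFC 3550: for each packet, compare its transit time with that of the previous packet and smooth the absolute difference with a gain of 1/16. The sketch below assumes arrival time and RTP timestamp are expressed in the same units; the class name `JitterEstimator` is hypothetical.

```python
class JitterEstimator:
    """Running interarrival jitter estimate in the style of RFC 3550."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def update(self, arrival, rtp_timestamp):
        """Feed one packet; both arguments must use the same time units."""
        transit = arrival - rtp_timestamp  # includes an unknown constant clock offset
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)     # the offset cancels in the difference
            self.jitter += (d - self.jitter) / 16.0  # J = J + (|D| - J) / 16
        self.prev_transit = transit
        return self.jitter
```

Because only differences of transit times are used, the sender and receiver clocks do not need to be synchronized.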

A jitter buffer introduces a small delay so that it can collect a certain number of packets, rearrange them into the proper order, and space them evenly before sending them on for decompression.

The (fixed) jitter buffer implementation is quite simple.

For example:

create a buffer to hold 100ms of audio (jitter buffer max size = 100ms)
place incoming audio frames to the buffer
start the playout when the buffer has at least 40ms data (delay = 40ms)
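
The three steps above can be sketched as follows. The 100 ms capacity and 40 ms start threshold come from the example; the 20 ms frame duration, the class name `FixedJitterBuffer`, and the choice to rebuffer after an underrun are assumptions made for illustration. Frames are assumed to arrive already in order.

```python
from collections import deque


class FixedJitterBuffer:
    """Hypothetical fixed jitter buffer: holds up to max_ms, starts playout at start_ms."""

    def __init__(self, frame_ms=20, max_ms=100, start_ms=40):
        self.max_frames = max_ms // frame_ms      # 5 frames with the defaults
        self.start_frames = start_ms // frame_ms  # 2 frames with the defaults
        self.frames = deque()
        self.playing = False

    def put(self, frame) -> bool:
        """Add an incoming frame; return False if the buffer is full and it is dropped."""
        if len(self.frames) >= self.max_frames:
            return False
        self.frames.append(frame)
        if len(self.frames) >= self.start_frames:
            self.playing = True
        return True

    def get(self):
        """Called once per frame interval by the playout clock."""
        if not self.playing:
            return None              # still prebuffering
        if not self.frames:
            self.playing = False     # underrun: refill to start_ms before resuming
            return None
        return self.frames.popleft()
```

With the defaults, the first call to `get()` after a single `put()` returns `None`, and playout begins only once two frames (40 ms) are queued.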

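How long should the jitter buffer be?

If the latency/delay is set small, audio is played out promptly, but packets are more likely to be lost, which degrades audio quality.
If the latency/delay is set large, packet loss becomes less likely, but excessive delay gets in the way of conversation.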

At the sending side, packets are sent in a continuous stream with the packets spaced evenly apart. Due to network congestion, improper queuing, or configuration errors, this steady stream can become lumpy, or the delay between each packet can vary instead of remaining constant.

When a router receives a Real-time Transport Protocol (RTP) audio stream for Voice over IP (VoIP), it must compensate for the jitter that is encountered. The mechanism that handles this function is the playout delay buffer. The playout delay buffer must buffer these packets and then play them out in a steady stream to the digital signal processors (DSPs) to be converted back to an analog audio stream. The playout delay buffer is also referred to as the jitter buffer.

If the jitter is so large that it causes packets to be received out of the range of this buffer, the out-of-range packets are discarded and dropouts are heard in the audio. For losses as small as one packet, the DSP interpolates what it thinks the audio should be and no problem is audible. When jitter exceeds what the DSP can do to make up for the missing packets, audio problems are heard.

Jitter Buffer Purpose 

Absorb packet arrival variability to a decoder interface
- Buffers sets of data
- Compromises between buffering delay and concealment
Enable stream synchronization
- Delays a stream to match the playout of another: lip sync is an example
Delay the first packet decoding
Re-order the arrived packets
Determine packet loss

Jitter Buffer types 

Fixed Jitter Buffer 

Latency is fixed
Delta/jitter statistics are not needed, since the buffer does not use them.
Easy to implement.
Voice quality suffers when jitter changes significantly.

Adaptive Jitter Buffer 

Traits:

Latency adapts dynamically to real-time changes in jitter
Key point: time scaling (adjust playout time without affecting voice quality)
- Scaling on voice packets
- Scaling on non-voice packets
Not easy to implement.
With a good time-scaling algorithm, voice quality can be preserved even when jitter changes significantly.
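
As a rough illustration of the "dynamic latency" trait, the sketch below derives a target playout delay from recently observed packet delays. It is one possible heuristic, not a standard algorithm: the class name `AdaptiveDelay`, the sliding window, the 95th percentile, and the 20 ms to 200 ms clamp are all assumptions. It only chooses the target delay; moving the actual playout point toward that target is the job of time scaling, which is not shown.

```python
from collections import deque


class AdaptiveDelay:
    """Hypothetical heuristic: target delay = spread of recent packet delays, clamped."""

    def __init__(self, window=50, min_ms=20, max_ms=200, percentile=0.95):
        self.delays = deque(maxlen=window)  # most recent per-packet delays in ms
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.percentile = percentile

    def update(self, delay_ms):
        """Feed one packet delay (arrival minus send time, ms); return target delay in ms."""
        self.delays.append(delay_ms)
        ordered = sorted(self.delays)
        idx = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        spread = ordered[idx] - ordered[0]  # subtracting the minimum removes clock offset
        return max(self.min_ms, min(self.max_ms, spread))
```

On a stable network the target stays at the minimum, and it grows as delay spikes enter the window, which reflects the latency versus resilience trade-off described above.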