Linux Source Code Walkthrough (23): Introduction to Network Communication, Congestion Control with the BBR Algorithm

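  1. From the early days of networking until roughly ten years ago, TCP congestion control relied on classic algorithms such as Reno, NewReno, BIC, and CUBIC. These algorithms ran for decades on low-bandwidth wired networks. As bandwidth grew and wireless communication became widespread, they gradually stopped fitting the new environment: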

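  • Wireless links such as cellular and Wi-Fi can corrupt or lose packets on the air interface because of channel contention and similar causes, even though the network is not congested at all. Treating these plain transmission errors as a congestion signal means that "not congested" is misjudged as "congested".
  • Network devices now have larger buffers and can hold more packets. By the time a buffer fills up, congestion has already set in, but no packet has been dropped yet (or the sender has not yet timed out). An algorithm that uses loss as its congestion signal therefore concludes the network is not congested, so congestion is detected late. This is exactly the opposite of the previous case.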

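  The root cause is that traditional congestion control algorithms equate packet loss or corruption with network congestion. This assumption has several drawbacks: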

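  • Throughput follows a sawtooth pattern: the sender ramps up, cuts its window (by half in Reno) when it hits the threshold or sees a loss, grows cwnd gradually again, cuts it again, and so on. The resulting oscillation keeps bandwidth utilization and throughput low most of the time.
  • High end-to-end latency: the buffers of intermediate devices fill up and packets queue for forwarding, but since nothing has been dropped yet the sender cannot tell that the path is congested.
  • Aggressive behavior: the algorithm grabs bandwidth from flows that use other algorithms, which degrades the network as a whole and leaves bandwidth unevenly shared.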

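   2. Given all these drawbacks, what does BBR do differently? Fundamentally, congestion still arises because a sender injects a large number of packets in a short time, exceeding the buffer or forwarding capacity of routers and other intermediate devices. BBR therefore controls congestion as follows: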

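    (1) The sender's rate should not exceed the bandwidth of the bottleneck link, so that packets do not queue for long and cause congestion.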

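  • Reno and CUBIC send in bursts: 4 packets at once, 8 packets at once, and so on. A burst can fill a router's buffer instantly and exceed the bottleneck bandwidth, so packets have to be spaced out to avoid momentarily exceeding BtlBw. How is the spacing computed?
  • The pacing parameter pacing_gain takes values such as 1, 1.25, and 0.75, and pacing_rate = pacing_gain * BtlBw. The interval between packets is packet.size / pacing_rate, so next_send_time = now() + packet.size / pacing_rate.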
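
The spacing rule above can be sketched in C roughly as follows. This is only an illustration, not kernel code: the function name `pacing_interval_us`, the choice of bytes per second and microseconds as units, and passing pacing_gain as a fraction are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: microseconds to wait before sending the next packet.
 * btl_bw_bytes_per_sec is the estimated bottleneck bandwidth;
 * gain_num / gain_den is pacing_gain as a fraction (5/4 for 1.25). */
static uint64_t pacing_interval_us(uint32_t packet_size,
                                   uint64_t btl_bw_bytes_per_sec,
                                   uint32_t gain_num, uint32_t gain_den)
{
    assert(btl_bw_bytes_per_sec > 0 && gain_num > 0 && gain_den > 0);

    /* pacing_rate = pacing_gain * BtlBw */
    uint64_t pacing_rate = btl_bw_bytes_per_sec * gain_num / gain_den;
    assert(pacing_rate > 0);

    /* interval = packet.size / pacing_rate, expressed in microseconds */
    return (uint64_t)packet_size * 1000000ULL / pacing_rate;
}
```

For a 1500-byte packet on a 10 Mbit/s bottleneck (1,250,000 bytes per second), this gives 1200 microseconds at gain 1, 960 at gain 1.25, and 1600 at gain 0.75.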

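    (2) BDP = RTT * BtlBw. The amount of sent-but-unacknowledged data in flight (inflight) should not exceed the BDP. In other words, the total amount of data in flight over the round trip should not exceed RTT * BtlBw.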
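
As a rough sketch of the BDP rule, assuming bandwidth in bytes per second and RTT in microseconds (the helper names `bdp_bytes` and `may_send` are made up for this example and do not appear in the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: BDP in bytes from BtlBw (bytes/s) and RTT (us). */
static uint64_t bdp_bytes(uint64_t btl_bw_bytes_per_sec, uint32_t rtt_us)
{
    return btl_bw_bytes_per_sec * rtt_us / 1000000ULL;
}

/* Returns 1 if the sender may put more data on the wire, i.e. inflight is
 * still below cwnd_gain * BDP (cwnd_gain given as gain_num / gain_den). */
static int may_send(uint64_t inflight_bytes, uint64_t bdp,
                    uint32_t gain_num, uint32_t gain_den)
{
    assert(gain_den > 0);
    return inflight_bytes < bdp * gain_num / gain_den;
}
```

With a 10 Mbit/s bottleneck and a 40 ms RTT, the BDP is 50,000 bytes: once that much data is unacknowledged, sending more only builds a queue.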

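     3. BBR's congestion control needs two variables: RTT (also called RTprop, the round-trip propagation time) and BtlBw (the bottleneck bandwidth), that is, the propagation delay and the bandwidth of the bottleneck link. How are these two values measured accurately?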

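   (1) Definition of RTT: the time from when the sender transmits data until it receives the ACK, i.e. the total time a packet takes to go out and come back. It is best measured during the application-limited phase, as follows: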

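  • During the handshake: no bulk data has been sent yet, so in theory the path carries little traffic, and the time between sending the SYN and receiving the SYN+ACK can be used as the RTT.
  • After the handshake: with an interactive application, neither side sends much data and the bottleneck link is not saturated, so the time between a data packet and its ACK can also be used as the RTT.
  • If both sides send at full speed and the bottleneck link is saturated, how can RTT be measured? The only option is to set aside, every so often (for example every 10 seconds to a few minutes), about 2% of the time (200 ms to a few seconds here) during which the sender deliberately lowers its rate, so that the flow drops back into the application-limited regime and RTT can be measured. (This is also why BBR is relatively fair and does not maliciously hog the network's bandwidth.)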

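  (2) Measuring the bottleneck bandwidth BtlBw: sample the delivery rate repeatedly during the bandwidth-limited phase and take the recent maximum as BtlBw. Concretely: after the connection is established, keep increasing the amount of data in flight; when the delivery rate fails to grow by 25% over three consecutive RTTs, the flow is considered bandwidth-limited. The measurement window should be at least 6 RTTs, ideally around 10 RTTs.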

  (3) Quite a few concepts have come up so far. The figure below shows how they relate:

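  • At first, while the sender's inflight is still below the BDP, the link has spare capacity and the flow is application-limited (put plainly, the application is not supplying enough data). RTT stays constant and the delivery rate keeps rising.
  • Once inflight reaches the BDP but stays below BDP + BtlBufSize, the link is full but the bottleneck device's buffer is not yet full. The flow is bandwidth-limited (put plainly, there is not enough bandwidth). If the sender keeps speeding up, RTT increases, and the delivery rate cannot rise any further because the link has no spare capacity.
  • If the sender keeps sending at full blast and inflight exceeds BDP + BtlBufSize, both the link and the router's buffer are full and the router can only drop packets. This phase is called buffer-limited (put plainly, there is not enough buffer).
  • Congestion control, then, means keeping inflight from exceeding the BDP. BBR controls congestion through the two variables RTT and BtlBw, not through the crude traditional approach of halving the window whenever a packet is lost.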
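
The three regimes described above can be written down as a small classifier. This is just a sketch to make the boundaries explicit; the enum and the function `classify_region` are illustrative names, and in practice a sender cannot observe BtlBufSize directly.

```c
#include <assert.h>
#include <stdint.h>

enum region { APP_LIMITED, BANDWIDTH_LIMITED, BUFFER_LIMITED };

/* Classify the operating point from inflight, BDP and the bottleneck
 * buffer size (all in bytes). */
static enum region classify_region(uint64_t inflight, uint64_t bdp,
                                   uint64_t btl_buf_size)
{
    assert(bdp + btl_buf_size >= bdp);  /* guard against overflow */

    if (inflight < bdp)
        return APP_LIMITED;        /* spare capacity: RTT flat, rate rising */
    if (inflight <= bdp + btl_buf_size)
        return BANDWIDTH_LIMITED;  /* queue builds: RTT rising, rate flat */
    return BUFFER_LIMITED;         /* buffer overflows: packets are dropped */
}
```

Loss-based algorithms operate at the boundary between the second and third regimes, while BBR tries to stay at the boundary between the first and second.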

      

   4. Everything above is the background and principle behind BBR. How is it put into practice? There are four phases: Startup, Drain, Probe_BW, and Probe_RTT.

      

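    (1) Startup: as the name says, this is the initial phase. Because the connection has just begun, the two sides have exchanged little data and throughput is low. To fill the link as quickly as possible, Startup performs a binary search for BtlBw: while the delivery rate keeps increasing, it uses a gain of 2/ln2 (about 2.89) to double the sending rate every round trip, so the link fills up quickly. When the delivery rate fails to grow by 25% over three consecutive RTTs, the flow has reached the bandwidth-limited state. At that point BtlBw can be measured, and with it the BDP (RTT * BtlBw).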
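
The "three rounds without 25% growth" test can be sketched as below. The structure is loosely modeled on what the kernel does in `bbr_check_full_bw_reached`, but the type and function names here (`struct full_pipe`, `full_pipe_update`) are invented for the example, and the real code works on fixed-point values.

```c
#include <assert.h>
#include <stdint.h>

struct full_pipe {
    uint64_t full_bw;   /* bandwidth at the last round that grew >= 25% */
    int      stall_cnt; /* consecutive rounds without 25% growth */
};

/* Call once per round trip with the current max bandwidth estimate.
 * Returns 1 once three consecutive rounds have shown < 25% growth. */
static int full_pipe_update(struct full_pipe *fp, uint64_t max_bw)
{
    assert(fp);

    if (fp->stall_cnt >= 3)
        return 1;                         /* decision already made */

    if (max_bw >= fp->full_bw * 5 / 4) {  /* still growing by >= 25% */
        fp->full_bw = max_bw;
        fp->stall_cnt = 0;
        return 0;
    }
    return ++fp->stall_cnt >= 3;
}
```

Any round with enough growth resets the counter, so a single noisy sample does not end Startup early.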

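 (2) Drain: after Startup has flooded the path, a queue has built up at the bottleneck. The sender now lowers its sending rate until inflight falls back to the BDP, which drains the queue and avoids congestion.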

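   (3) Probe_BW: after the draining in the second phase, inflight is basically stable. This is the steady state of the whole BBR algorithm. As the name suggests, this phase probes for bandwidth.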
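
Probing is done by cycling pacing_gain through the values mentioned earlier. The sketch below uses the eight-phase cycle described in the BBR paper (5/4, 3/4, then six phases at 1); the array and function names are illustrative, and gains are stored as percentages to keep the arithmetic in integers.

```c
#include <assert.h>

#define CYCLE_LEN 8

/* pacing_gain for each phase of the cycle, in percent */
static const int gain_cycle_pct[CYCLE_LEN] = {
    125,                          /* probe: send faster than BtlBw */
    75,                           /* drain the queue the probe created */
    100, 100, 100, 100, 100, 100  /* cruise at the estimated bandwidth */
};

/* Gain to use in a given phase; phases wrap around the cycle. */
static int probe_bw_gain_pct(unsigned int phase)
{
    return gain_cycle_pct[phase % CYCLE_LEN];
}

/* Average gain over one full cycle. */
static int cycle_average_gain_pct(void)
{
    int sum = 0;
    for (int i = 0; i < CYCLE_LEN; i++)
        sum += gain_cycle_pct[i];
    assert(sum == 100 * CYCLE_LEN);  /* probe and drain cancel out */
    return sum / CYCLE_LEN;
}
```

Because the 1.25 and 0.75 phases cancel, the average pacing rate over a cycle equals the estimated bottleneck bandwidth.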

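(4) Probe_RTT: routes may change during a transfer, so the previously measured RTT may no longer apply, and the flow has to leave Probe_BW from time to time and enter Probe_RTT. The method is simple: spend about 2% of the time at a reduced rate, back in the application-limited regime, and probe RTT there. Because different flows re-measure RTT in this way, they end up sharing bandwidth evenly, which is relatively fair.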

   5. Each endpoint is either sending data or receiving data. Google's paper provides pseudocode that spells out what to do in each case.

     (1) What to do when an ACK arrives:

function onAck(packet)
  rtt = now - packet.sendtime                      // ACK arrival time minus the send time stored with the packet gives the RTT
  update_min_filter(RTpropFilter, rtt)             // update the RTprop (minimum RTT) estimate

  delivered      += packet.size
  delivered_time =  now
  // compute the delivery rate actually achieved
  delivery_rate  =  (delivered - packet.delivered) / (delivered_time - packet.delivered_time)

  if (delivery_rate > BtlBwFilter.current_max      // the sample exceeds the current bottleneck bandwidth estimate, or
     || !packet.app_limited)                       // the packet was not app-limited (a low app-limited sample says nothing about BtlBw)
     update_max_filter(BtlBwFilter, delivery_rate) // update the BtlBw estimate

  if (app_limited_until > 0)                       // data that was in flight when the application went idle is still unacked
     app_limited_until = app_limited_until - packet.size

  In short: every ACK updates the RTT estimate, but only some ACKs update BtlBw.

    (2) What to do when sending a packet:

function send(packet)
  bdp = BtlBwFilter.current_max * RTpropFilter.current_min  // compute the BDP
  if (inflight >= cwnd_gain * bdp)                          // data in flight has reached the allowed maximum
     return                                                 // return and wait for the next ACK or a retransmission timeout

  // reaching this point means inflight < cwnd_gain * bdp, i.e. data in flight is below the allowed maximum
  if (now >= next_send_time)
     packet = nextPacketToSend()
     if (!packet)                      // the application has nothing to send
        app_limited_until = inflight   // remember how much data was in flight when the application went idle
        return

     packet.app_limited = (app_limited_until > 0)  // packets sent while app_limited_until > 0 are marked app-limited
     packet.sendtime = now
     packet.delivered = delivered
     packet.delivered_time = delivered_time
     ship(packet)
     // time of the next transmission; this pacing is what limits the sending rate
     next_send_time = now + packet.size / (pacing_gain * BtlBwFilter.current_max)
  // arm a timer so that send() runs again when next_send_time is reached
  timerCallbackAt(send, next_send_time)

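  In short: first check whether inflight has reached cwnd_gain * BDP. If it has, return immediately and end send(); if not, send data and re-arm the timer for the next transmission.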

   6. Next, let's look at the BBR source code provided by Google, in the file net/ipv4/tcp_bbr.c (I am using the Linux 4.9 source).

  (1) An overview of all of BBR's key functions. Remember the structure used to register a congestion control algorithm? This is the one for BBR:

static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
    .flags        = TCP_CONG_NON_RESTRICTED,
    .name        = "bbr",
    .owner        = THIS_MODULE,
    .init        = bbr_init,
    .cong_control    = bbr_main,
    .sndbuf_expand    = bbr_sndbuf_expand,
    .undo_cwnd    = bbr_undo_cwnd,
    .cwnd_event    = bbr_cwnd_event,
    .ssthresh    = bbr_ssthresh,
    .tso_segs_goal    = bbr_tso_segs_goal,
    .get_info    = bbr_get_info,
    .set_state    = bbr_set_state,
};

        (2) Since .cong_control points to bbr_main, that is clearly the entry point of the congestion control algorithm:

static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u32 bw;
    
    bbr_update_model(sk, rs);

    bw = bbr_bw(sk);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
    bbr_set_tso_segs_goal(sk);
    bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}

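  Judging by the names of the functions it calls, one updates the model (which includes the bandwidth estimate), one sets pacing_rate (controlling the sending rate in order to control congestion), and one sets the congestion window. Peeling back the layers of calls reveals a few particularly important functions: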

     (3) bbr_update_bw: estimates the bandwidth

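/* Estimate the bandwidth based on how fast packets are delivered.
    1. Advance the round-trip (RTT round) counter
    2. Compute bandwidth = delivered packets * BW_UNIT / sampling interval
    3. Feed the new bandwidth sample into the windowed max-bandwidth filter
*/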
static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;

    bbr->round_start = 0;
    if (rs->delivered < 0 || rs->interval_us <= 0)
        return; /* Not a valid observation */

    /* See if we've reached the next RTT */
    if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
        bbr->next_rtt_delivered = tp->delivered;
        bbr->rtt_cnt++;
        bbr->round_start = 1;
        bbr->packet_conservation = 0;
    }
    bbr_lt_bw_sampling(sk, rs);

    /* Divide delivered by the interval to find a (lower bound) bottleneck
     * bandwidth sample. Delivered is in packets and interval_us in uS and
     * ratio will be <<1 for most connections. So delivered is first scaled.
     Compute the bandwidth sample.
     */
    bw = (u64)rs->delivered * BW_UNIT;
    do_div(bw, rs->interval_us);

    /* If this sample is application-limited, it is likely to have a very
     * low delivered count that represents application behavior rather than
     * the available network rate. Such a sample could drag down estimated
     * bw, causing needless slow-down. Thus, to continue to send at the
     * last measured network rate, we filter out app-limited samples unless
     * they describe the path bw at least as well as our bw model.
     *
     * So the goal during app-limited phase is to proceed with the best
     * network rate no matter how long. We automatically leave this
     * phase when app writes faster than the network can deliver :)
     */
    if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
        /* Incorporate new sample into our max bw filter. */
        minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
    }
}
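
The reason delivered is multiplied by BW_UNIT before the division is easier to see with numbers. The following user-space sketch mirrors the arithmetic only; BW_SCALE is 24 in tcp_bbr.c, while the function name `bw_sample` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SCALE 24
#define BW_UNIT  (1ULL << BW_SCALE)

/* Bandwidth sample in units of (packets per microsecond) << BW_SCALE.
 * Scaling first keeps precision, since packets/us is usually well below 1. */
static uint64_t bw_sample(uint32_t delivered_pkts, uint32_t interval_us)
{
    assert(interval_us > 0);
    return (uint64_t)delivered_pkts * BW_UNIT / interval_us;
}
```

For example, 100 packets delivered in 10,000 microseconds is 0.01 packets per microsecond. Plain integer division would give 0, whereas the scaled sample is 167772, which is 0.01 * 2^24 rounded down.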

  (4) bbr_set_pacing_rate: controls how fast packets are sent by setting pacing_rate:

/* Pace using current bw estimate and a gain factor. In order to help drive the
 * network toward lower queues while maintaining high utilization and low
 * latency, the average pacing rate aims to be slightly (~1%) lower than the
 * estimated bandwidth. This is an important aspect of the design. In this
 * implementation this slightly lower pacing rate is achieved implicitly by not
 * including link-layer headers in the packet size used for the pacing rate.
   
 */
static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    struct bbr *bbr = inet_csk_ca(sk);
    u64 rate = bw;

    rate = bbr_rate_bytes_per_sec(sk, rate, gain);
    rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    if (bbr->mode != BBR_STARTUP || rate > sk->sk_pacing_rate)
        sk->sk_pacing_rate = rate;
}
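
bbr_set_pacing_rate relies on bbr_rate_bytes_per_sec to turn the scaled bandwidth (packets per microsecond, shifted by BW_SCALE) and the gain (a fixed-point value where BBR_UNIT means 1.0) into bytes per second. A user-space sketch of that conversion, as I understand it from the 4.9 source, looks like this; the packet size is passed in directly here instead of being derived from the socket, which is a simplification for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SCALE     24
#define BBR_SCALE    8
#define BBR_UNIT     (1 << BBR_SCALE)   /* gain of 1.0 */
#define USEC_PER_SEC 1000000ULL

/* Convert a scaled bandwidth and a fixed-point gain to bytes per second.
 * pkt_bytes stands in for the per-packet size the kernel derives itself. */
static uint64_t rate_bytes_per_sec(uint64_t bw, uint32_t pkt_bytes, int gain)
{
    assert(gain > 0);

    uint64_t rate = bw;
    rate *= pkt_bytes;       /* packets -> bytes */
    rate *= (uint64_t)gain;  /* apply pacing_gain ... */
    rate >>= BBR_SCALE;      /* ... and remove its fixed-point scale */
    rate *= USEC_PER_SEC;    /* per microsecond -> per second */
    return rate >> BW_SCALE; /* remove the bandwidth scale */
}
```

With a bandwidth of exactly one 1000-byte packet per microsecond, a gain of BBR_UNIT yields 1,000,000,000 bytes per second, and a gain of BBR_UNIT * 5 / 4 yields 1,250,000,000.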

  (5) bbr_update_min_rtt: updates the minimum RTT

/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
 * periodically drain the bottleneck queue, to converge to measure the true
 * min_rtt (unloaded propagation delay). This allows the flows to keep queues
 * small (reducing queuing delay and packet loss) and achieve fairness among
 * BBR flows.
 *
 * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
 * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
 * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
 * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
 * re-enter the previous mode. BBR uses 200ms to approximately bound the
 * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
 *
 * Note that flows need only pay 2% if they are busy sending over the last 10
 * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
 * natural silences or low-rate periods within 10 seconds where the rate is low
 * enough for long enough to drain its queue in the bottleneck. We pick up
 * these min RTT measurements opportunistically with our min_rtt filter. :-)
 */
static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    bool filter_expired;

    /* Track min RTT seen in the min_rtt_win_sec filter window: */
    filter_expired = after(tcp_time_stamp,
                   bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
    if (rs->rtt_us >= 0 &&
        (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
        bbr->min_rtt_us = rs->rtt_us;
        bbr->min_rtt_stamp = tcp_time_stamp;
    }

    if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
        !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
        bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
        bbr->pacing_gain = BBR_UNIT;
        bbr->cwnd_gain = BBR_UNIT;
        bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
        bbr->probe_rtt_done_stamp = 0;
    }

    if (bbr->mode == BBR_PROBE_RTT) { /* currently in PROBE_RTT mode */
        /* Ignore low rate samples during this mode. */
        tp->app_limited =
            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
        /* Maintain min packets in flight for max(200 ms, 1 round). */
        if (!bbr->probe_rtt_done_stamp &&
            tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
            bbr->probe_rtt_done_stamp = tcp_time_stamp +
                msecs_to_jiffies(bbr_probe_rtt_mode_ms);
            bbr->probe_rtt_round_done = 0;
            bbr->next_rtt_delivered = tp->delivered;
        } else if (bbr->probe_rtt_done_stamp) {
            if (bbr->round_start)
                bbr->probe_rtt_round_done = 1;
            if (bbr->probe_rtt_round_done &&
                after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
                bbr->min_rtt_stamp = tcp_time_stamp;
                bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
                bbr_reset_mode(sk);
            }
        }
    }
    bbr->idle_restart = 0;
}
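
Stripped of the mode handling, the first part of bbr_update_min_rtt is a minimum filter with a 10-second window. A simplified sketch follows; it uses millisecond timestamps instead of jiffies, and `struct min_rtt_filter` and `min_rtt_update` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_RTT_WIN_MS 10000  /* 10-second window, as in the kernel comment */

struct min_rtt_filter {
    uint32_t min_rtt_us;  /* smallest RTT seen in the current window */
    uint64_t stamp_ms;    /* time at which min_rtt_us was recorded */
};

/* Feed one RTT sample. Returns 1 if the old minimum had expired, which is
 * the condition under which BBR would enter PROBE_RTT. */
static int min_rtt_update(struct min_rtt_filter *f, uint32_t rtt_us,
                          uint64_t now_ms)
{
    assert(f);

    int expired = now_ms > f->stamp_ms + MIN_RTT_WIN_MS;
    if (rtt_us <= f->min_rtt_us || expired) {
        f->min_rtt_us = rtt_us;  /* new minimum, or forced refresh */
        f->stamp_ms = now_ms;
    }
    return expired;
}
```

A smaller sample always replaces the minimum and restarts the window, which is why flows with natural quiet periods rarely have to pay the PROBE_RTT cost.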

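  7. Summary: BBR no longer uses packet loss as its signal, and it no longer maintains the congestion window with AIMD (additive increase, multiplicative decrease). Instead it separately samples and estimates the maximum bandwidth and the minimum delay (the topology of the path is a black box to both sender and receiver and cannot be tracked in real time, so continuous sampling is the only option) and uses the product of the two as the send window. BBR also introduces a pacing rate to limit the sending rate, which works together with cwnd to soften bursts.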

 

 

 

References:

1. https://www.cnblogs.com/HadesBlog/p/13347418.html  Google BBR source code analysis

2. https://www.bilibili.com/video/BV1iq4y1H7Zf/?spm_id_from=333.788.recommend_more_video.-1  The BBR congestion control algorithm

3. http://arthurchiao.art/blog/bbr-paper-zh/  Google paper: congestion-based (rather than loss-based) congestion control

Posted by 第七子007