TCP Pacing function

https://www.programmersought.com/article/56241261068/

 

The TCP pacing function controls the rate at which TCP sends data.


Initialization of Pacing

In the TCP protocol initialization function tcp_sk_init, two pacing-related parameters are assigned: sysctl_tcp_pacing_ss_ratio and sysctl_tcp_pacing_ca_ratio, both scaling factors for the pacing rate. The former applies during the slow start phase and defaults to 200, i.e. the rate is scaled to 200% of the base rate; the latter applies during the congestion avoidance phase and defaults to 120, scaling the rate to 120%.

static int __net_init tcp_sk_init(struct net *net)
{
    net->ipv4.sysctl_tcp_pacing_ss_ratio = 200;
    net->ipv4.sysctl_tcp_pacing_ca_ratio = 120;
}
 
$ cat /proc/sys/net/ipv4/tcp_pacing_ss_ratio
200
$ cat /proc/sys/net/ipv4/tcp_pacing_ca_ratio
120
In addition, in the TCP timer initialization function tcp_init_xmit_timers, the kernel initializes a high-resolution pacing timer whose expiry handler is set to tcp_pace_kick.

void tcp_init_xmit_timers(struct sock *sk)
{
    inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer, &tcp_keepalive_timer);
    hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
    tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
}
Finally, in the socket initialization function sock_init_data, three pacing-related parameters are initialized: the maximum rate sk_max_pacing_rate and the current rate sk_pacing_rate are both set to the largest unsigned integer value (i.e. unlimited), and sk_pacing_shift is set to 10.

void sock_init_data(struct socket *sock, struct sock *sk)
{
    sk->sk_max_pacing_rate = ~0U;
    sk->sk_pacing_rate = ~0U;
    sk->sk_pacing_shift = 10;
}
Enabling the pacing function
The TCP congestion control algorithm BBR requires the support of the pacing function. In its initialization function bbr_init, the pacing status sk_pacing_status is set to SK_PACING_NEEDED to enable pacing. In addition, the user can set the maximum pacing rate through the SO_MAX_PACING_RATE option of the setsockopt system call, which implicitly enables the pacing function for the socket.

static void bbr_init(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
 
    bbr->has_seen_rtt = 0;
    bbr_init_pacing_rate_from_rtt(sk);
    
    cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
}
int sock_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
{
    switch (optname) {
    case SO_MAX_PACING_RATE:
        if (val != ~0U)
            cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
        sk->sk_max_pacing_rate = val;
        sk->sk_pacing_rate = min(sk->sk_pacing_rate, sk->sk_max_pacing_rate);
        break;
    }
}
Internal pacing in TCP

The fair-queue scheduler (sch_fq) in the kernel's traffic control subsystem can pace packets well, but the current system may not have selected the sch_fq algorithm for its network interface. For the case where the traffic control layer does not provide pacing, the function tcp_needs_internal_pacing checks whether pacing must be performed inside the TCP subsystem itself.

static bool tcp_needs_internal_pacing(const struct sock *sk)
{
    return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
}
The function tcp_internal_pacing implements TCP's own pacing, provided that pacing has been requested and the current rate is neither zero nor the maximum unsigned integer value. It calculates, in nanoseconds, the time required to send the skb at the current rate, and starts the pacing timer with that duration as its timeout. In reality, however, transmitting the data may not take that long.

static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
{
    u64 len_ns;
    u32 rate;
 
    if (!tcp_needs_internal_pacing(sk))
        return;
    rate = sk->sk_pacing_rate;
    if (!rate || rate == ~0U)
        return;
 
    /* Should account for header sizes as sch_fq does, but lets make things simple. */
    len_ns = (u64)skb->len * NSEC_PER_SEC;
    do_div(len_ns, rate);
    hrtimer_start(&tcp_sk(sk)->pacing_timer, ktime_add_ns(ktime_get(), len_ns), HRTIMER_MODE_ABS_PINNED);
}
The pacing function tcp_internal_pacing above is called in the TCP transmit function tcp_transmit_skb, provided that the packet being transmitted carries data and is not a pure SYN- or ACK-type control message. Since the skb has not yet had the network-layer and link-layer headers added at this point, tcp_internal_pacing accounts only for the TCP header and the payload when calculating the transmission duration. The sch_fq algorithm of the traffic control subsystem differs here, as it sees the final, complete packet length.

static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, gfp_t gfp_mask)
{
    if (skb->len != tcp_header_size) {
        tcp_event_data_sent(tp, sk);
        tp->data_segs_out += tcp_skb_pcount(skb);
        tcp_internal_pacing(sk, skb);
    }
}
Pacing check

The pacing check combines the tcp_needs_internal_pacing function above, which tests whether TCP pacing is enabled, with a second condition: whether the pacing timer is running, tested by hrtimer_active. When both conditions are true, pacing is in progress and transmission of further packets is suspended.

static bool tcp_pacing_check(const struct sock *sk)
{
    return tcp_needs_internal_pacing(sk) && hrtimer_active(&tcp_sk(sk)->pacing_timer);
}
See the TCP send queue processing function tcp_write_xmit and the retransmission queue processing function tcp_xmit_retransmit_queue below. When tcp_pacing_check returns true, indicating that pacing is already in progress, the sending loop is exited.

static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp)
{
    max_segs = tcp_tso_segs(sk, mss_now);
    while ((skb = tcp_send_head(sk))) {
 
        if (tcp_pacing_check(sk))
            break;
}
void tcp_xmit_retransmit_queue(struct sock *sk)
{
    rtx_head = tcp_rtx_queue_head(sk);
    skb = tp->retransmit_skb_hint ?: rtx_head;
    max_segs = tcp_tso_segs(sk, tcp_current_mss(sk));
    skb_rbtree_walk_from(skb) {
 
        if (tcp_pacing_check(sk))
            break;
}
Pacing processing

With the tcp_pacing_check function and the tcp_internal_pacing function working together, pacing of TCP packets is already implemented: once the pacing timer expires, hrtimer_active returns false and transmission resumes, so in principle the expiry handler tcp_pace_kick would have nothing to do.

In the actual implementation, however, tcp_pace_kick must handle sockets that were put into the throttled state by the TSQ (TCP Small Queues) mechanism. See the TSQ description at https://blog.csdn.net/sinat_20184565/article/details/89341370.

static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb, unsigned int factor)
{
    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> sk->sk_pacing_shift);
    limit = min_t(u32, limit, sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
    limit <<= factor;
}
If, at the time of the TSQ check, the pending transmit data exceeds the limit derived from the current pacing rate sk_pacing_rate, TSQ defers transmission and puts the socket into the throttled state. The tcp_pace_kick function therefore processes the TSQ-throttled socket, queuing it to the TSQ tasklet for processing when necessary.

enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
{
    struct tcp_sock *tp = container_of(timer, struct tcp_sock, pacing_timer);
    struct sock *sk = (struct sock *)tp;
    unsigned long nval, oval;
 
    for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
        struct tsq_tasklet *tsq;
        bool empty;
 
        if (oval & TSQF_QUEUED)
            break;
 
        nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
        nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
        if (nval != oval)
            continue;
 
        if (!refcount_inc_not_zero(&sk->sk_wmem_alloc))
            break;
        /* queue this socket to tasklet queue */
        tsq = this_cpu_ptr(&tsq_tasklet);
        empty = list_empty(&tsq->head);
        list_add(&tp->tsq_node, &tsq->head);
        if (empty)
            tasklet_schedule(&tsq->tasklet);
        break;
    }
    return HRTIMER_NORESTART;
}
Pacing rate
The base function for rate updates is tcp_update_pacing_rate. As shown below, the pacing rate is computed from three variables: the cached MSS value mss_cache multiplied by the congestion window, divided by the smoothed round-trip time srtt, i.e. the maximum amount of data that can be sent per srtt. The congestion window used is the larger of the send congestion window snd_cwnd and the number of packets in flight packets_out. For a socket in the slow start phase, the resulting rate is scaled to 200% by default (sysctl_tcp_pacing_ss_ratio); for a socket in the congestion avoidance phase, it is scaled to 120% by default (sysctl_tcp_pacing_ca_ratio). The final pacing rate cannot exceed the defined maximum sk_max_pacing_rate.

static void tcp_update_pacing_rate(struct sock *sk)
{
    /* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
    rate = (u64)tp->mss_cache * ((USEC_PER_SEC / 100) << 3);
 
    /* current rate is (cwnd * mss) / srtt
     * In Slow Start [1], set sk_pacing_rate to 200 % the current rate.
     * In Congestion Avoidance phase, set it to 120 % the current rate.
     *
     * [1] : Normal Slow Start condition is (tp->snd_cwnd < tp->snd_ssthresh)
     *   If snd_cwnd >= (tp->snd_ssthresh / 2), we are approaching end of slow start and should slow down.
     */
    if (tp->snd_cwnd < tp->snd_ssthresh / 2)
        rate *= sock_net(sk)->ipv4.sysctl_tcp_pacing_ss_ratio;
    else
        rate *= sock_net(sk)->ipv4.sysctl_tcp_pacing_ca_ratio;
 
    rate *= max(tp->snd_cwnd, tp->packets_out);
 
    if (likely(tp->srtt_us))
        do_div(rate, tp->srtt_us);
 
    /* WRITE_ONCE() is needed because sch_fq fetches sk_pacing_rate
     * without any lock. We want to make sure compiler wont store intermediate values in this location.
     */
    WRITE_ONCE(sk->sk_pacing_rate, min_t(u64, rate, sk->sk_max_pacing_rate));
}
There are two entry points for pacing rate updates. After the TCP server receives the ACK that completes the client's three-way handshake, it initializes the pacing rate, provided that the congestion control algorithm currently in use does not implement the cong_control callback. At present, only the BBR algorithm implements this callback.

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
    switch (sk->sk_state) {
    case TCP_SYN_RECV:
        if (!inet_csk(sk)->icsk_ca_ops->cong_control)
            tcp_update_pacing_rate(sk);
}
static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
    .flags      = TCP_CONG_NON_RESTRICTED,
    .name       = "bbr",
    .cong_control   = bbr_main,
};
If the BBR algorithm is used, the pacing rate is not initialized in the tcp_rcv_state_process function; BBR instead sets the pacing rate in its cong_control callback (bbr_main), invoked from the tcp_cong_control function. If cong_control is set, tcp_cong_control returns after executing it; only for congestion algorithms other than BBR (where cong_control is a null pointer) does execution continue and reach the pacing rate update function. tcp_cong_control is called at the end of processing an ACK acknowledgement message.

static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked, int flag, const struct rate_sample *rs)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
 
    if (icsk->icsk_ca_ops->cong_control) {
        icsk->icsk_ca_ops->cong_control(sk, rs);
        return;
    }
    if (tcp_in_cwnd_reduction(sk)) {  
        tcp_cwnd_reduction(sk, acked_sacked, flag);   /* Reduce cwnd if state mandates */
    } else if (tcp_may_raise_cwnd(sk, flag)) {
        tcp_cong_avoid(sk, ack, acked_sacked);        /* Advance cwnd if state allows */
    }
    tcp_update_pacing_rate(sk);
}
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
    tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
    tcp_xmit_recovery(sk, rexmit);
    return 1;
}
BBR adjusts the pacing rate
In the congestion control algorithm BBR, the cong_control callback is initialized to a pointer to the bbr_main function, which calls the bbr_set_pacing_rate function to update the pacing rate.

static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
    struct bbr *bbr = inet_csk_ca(sk);
 
    bbr_update_model(sk, rs);
    bw = bbr_bw(sk);
    bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
}
The core pacing rate calculation in BBR is the function bbr_rate_bytes_per_sec below. The rate parameter is the current bandwidth estimated by the BBR algorithm; it is multiplied by the MTU value and by a gain factor. This parallels the pacing rate calculation in the earlier function tcp_update_pacing_rate, with the differences that the packet length used here is the MTU (excluding the link-layer header length), the congestion window is replaced by the bandwidth estimate, and the percentage scaling factors (sysctl_tcp_pacing_ss_ratio/sysctl_tcp_pacing_ca_ratio) are replaced by BBR's computed gain value.

The function bbr_bw_to_pacing_rate ensures that the pacing value does not exceed the maximum limit value sk_max_pacing_rate.

static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
{
    rate *= tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);
    rate *= gain;
    rate >>= BBR_SCALE;
    rate *= USEC_PER_SEC;
    return rate >> BW_SCALE;
}
/* Convert a BBR bw and gain factor to a pacing rate in bytes per second. */
static u32 bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    u64 rate = bw;
 
    rate = bbr_rate_bytes_per_sec(sk, rate, gain);
    rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    return rate;
}  
In the main callback function bbr_main, bbr_set_pacing_rate calls the above bbr_bw_to_pacing_rate to obtain the pacing rate. Normally the has_seen_rtt flag has already been set and bbr_init_pacing_rate_from_rtt has already been called in the initialization function bbr_init; otherwise it is called here once an RTT sample exists. If the bandwidth is fully utilized (bbr_full_bw_reached), or the calculated pacing rate is greater than the currently used rate sk_pacing_rate, the current rate is updated.

To maintain high network utilization and low latency while keeping queues short, the pacing rate is aimed slightly (about 1%) below the estimated bandwidth. The path MTU used in the calculation does not include the link-layer header length.

static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u32 rate = bbr_bw_to_pacing_rate(sk, bw, gain);
 
    if (unlikely(!bbr->has_seen_rtt && tp->srtt_us))
        bbr_init_pacing_rate_from_rtt(sk);
    if (bbr_full_bw_reached(sk) || rate > sk->sk_pacing_rate)
        sk->sk_pacing_rate = rate;
}
The function bbr_init_pacing_rate_from_rtt has already been called during bbr_init. The bandwidth value is the send congestion window multiplied by BW_UNIT and divided by rtt_us; the pacing rate is then calculated by the bbr_bw_to_pacing_rate function above, with the gain value bbr_high_gain. If the smoothed round-trip time is zero, rtt_us uses the default USEC_PER_MSEC (1000 microseconds); otherwise rtt_us is srtt_us shifted right by 3, since srtt_us is stored in units of 1/8 microsecond.
 

#define BW_SCALE 24
#define BW_UNIT (1 << BW_SCALE)
#define BBR_SCALE 8                /* scaling factor for fractions in BBR (e.g. gains) */
#define BBR_UNIT (1 << BBR_SCALE)
static const int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;
 
/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */
static void bbr_init_pacing_rate_from_rtt(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct bbr *bbr = inet_csk_ca(sk);
    u64 bw;
    u32 rtt_us;
 
    if (tp->srtt_us) {      /* any RTT sample yet? */
        rtt_us = max(tp->srtt_us >> 3, 1U);
        bbr->has_seen_rtt = 1;
    } else {             /* no RTT sample yet */
        rtt_us = USEC_PER_MSEC;  /* use nominal default RTT */
    }
    bw = (u64)tp->snd_cwnd * BW_UNIT;
    do_div(bw, rtt_us);
    sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);
}
 

Kernel version 4.15

 


 

Posted 2022-02-07 23:11 by 张同光