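Google's Improvement to the TCP Fast Retransmit Algorithm
Summary: this article analyzes Google's improvement to the TCP fast retransmit algorithm, namely TCP early retransmit.
Kernel version: 3.6
Author: zhangskd @ csdn blog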
patch series:
(1) tcp: early retransmit: tcp_enter_recovery()
(2) tcp: early retransmit
(3) tcp: early retransmit: delayed fast retransmit
These three patches are included in kernel 3.5 and later.
Patch descriptions
The committer, Yuchung Cheng, describes the three patches as follows:
(1)tcp: early retransmit: tcp_enter_recovery()
This is a preparation patch that refactors the code to enter recovery into a new function tcp_enter_recovery(). It's needed
to implement the delayed fast retransmit in ER.
(2)tcp: early retransmit
This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery
instead of timeout.
While the algorithm is simple, small but frequent network reordering makes this feature dangerous:
the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation
suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web
server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate
Reduction for TCP", IMC 2011.
Note that Linux has a similar feature called THIN_DUPACK. The difference are THIN_DUPACK do not mitigate
reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be
happy to merge THIN_DUPACK feature with ER if people think it's a good idea.
ER is enabled by sysctl_tcp_early_retrans:
0: Disables ER
1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4
2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4.
Note: mode 2 is implemented in the third part of this patch series.
(3)tcp: early retransmit: delayed fast retransmit
Implementing the advanced early retransmit (sysctl_tcp_early_retrans == 2).
Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay.
If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform
fast retransmit.
Patch implementation
The core code is shown below; see the corresponding patches for the complete code.
(1)tcp: early retransmit: tcp_enter_recovery()
@net/ipv4/tcp_input.c
static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int mib_idx;

    if (tcp_is_reno(tp))
        mib_idx = LINUX_MIB_TCPRENORECOVERY;
    else
        mib_idx = LINUX_MIB_TCPSACKRECOVERY;

    NET_INC_STATS_BH(sock_net(sk), mib_idx);

    tp->high_seq = tp->snd_nxt;          /* Save the exit point of the Recovery state */
    tp->prior_ssthresh = 0;
    tp->undo_marker = tp->snd_una;       /* Used to decide whether the congestion adjustment can be undone */
    tp->undo_retrans = tp->retrans_out;  /* Used to decide whether the congestion adjustment can be undone */

    if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
        if (!ece_ack)
            tp->prior_ssthresh = tcp_current_ssthresh(sk);  /* Keep the old threshold unless there was an explicit congestion notification */
        tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);  /* Compute the new threshold with the congestion control algorithm */
        TCP_ECN_queue_cwr(tp);
    }

    tp->bytes_acked = 0;
    tp->snd_cwnd_cnt = 0;
    tp->prior_cwnd = tp->snd_cwnd;
    tp->prr_delivered = 0;
    tp->prr_out = 0;
    tcp_set_ca_state(sk, TCP_CA_Recovery);  /* Mark the connection as being in the Recovery state */
}
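We know that tcp_enter_cwr() is called to set things up before entering the CWR state, and that tcp_enter_loss()
is called when entering the Loss state. What about entering the Recovery state?
In earlier versions, the setup performed on entering Recovery was written inline in tcp_fastretrans_alert(); there was no dedicated function.
Because other functions now need the same logic, the code has been factored out into tcp_enter_recovery().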
(2)tcp: early retransmit
This patch implements enabling and disabling TCP early retransmit, plus the behavior for tcp_early_retrans == 1.
@include/linux/tcp.h
struct tcp_sock {
    ...
    u8 do_early_retrans:1;  /* Enable RFC5827 early-retransmit (whether TCP early retransmit is in use) */
    ...
};

/* TCP early-retransmit (ER) is similar to but more conservative than the thin-dupack feature.
 * Enable ER only if thin-dupack is disabled.
 */
static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
{
    tp->do_early_retrans = sysctl_tcp_early_retrans &&
                           !sysctl_tcp_thin_dupack &&
                           sysctl_tcp_reordering == 3;
}

static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
{
    tp->do_early_retrans = 0;
}
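Q: What conditions must hold for TCP early retransmit to be enabled?
A: All of the following must hold:
1) tcp_early_retrans is 1 or 2 (the kernel default is 2)
2) tcp_thin_dupack is 0 (the kernel default is 0)
3) tcp_reordering is 3 (the kernel default is 3)
So TCP early retransmit is enabled by default.
Q: When is TCP early retransmit disabled?
A: Either of the following disables it:
1) TCP thin dupack is enabled dynamically on the socket (in do_tcp_setsockopt())
2) Reordering is detected (disabled in tcp_update_reordering())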
@net/ipv4/tcp_input.c
static int tcp_time_to_recover(struct sock *sk)
{
    ...
    /* Trick#6: TCP early retransmit, per RFC5827. To avoid spurious retransmissions
     * due to small network reorderings, we implement Mitigation A.3 in the RFC and
     * delay the retransmission for a short interval if appropriate.
     */
    if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
        (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
        !tcp_may_send_now(sk))
        return 1;
    ...
}
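tcp_time_to_recover() decides whether early retransmit is triggered. All of the following must hold:
1) tp->do_early_retrans is set: TCP early retransmit is enabled.
2) tp->retrans_out == 0: no retransmitted segment is still unacknowledged.
3) tp->sacked_out > 0: at least one segment has been SACKed.
4) tp->packets_out < 4: fewer than 4 segments have been sent and not yet acknowledged.
5) tp->packets_out == tp->sacked_out + 1: exactly one outstanding segment has not been SACKed.
6) tcp_may_send_now() is false: no new data can be sent right now.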
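The six conditions can be restated as a standalone predicate, which makes it easy to check individual cases by hand. This is only a sketch: struct er_state and er_should_trigger() are hypothetical names introduced here, and the struct stands in for the handful of struct tcp_sock fields that the kernel check reads.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the struct tcp_sock fields
 * read by the check; this is not the kernel structure. */
struct er_state {
    bool do_early_retrans;     /* ER is enabled on this connection */
    unsigned int packets_out;  /* segments sent and not yet acknowledged */
    unsigned int sacked_out;   /* segments SACKed by the receiver */
    unsigned int retrans_out;  /* retransmitted segments still unacknowledged */
    bool may_send_now;         /* stand-in for the result of tcp_may_send_now() */
};

/* Returns true only when all six conditions listed above hold. */
bool er_should_trigger(const struct er_state *s)
{
    return s->do_early_retrans &&
           s->retrans_out == 0 &&
           s->sacked_out > 0 &&
           s->packets_out < 4 &&
           s->packets_out == s->sacked_out + 1 &&
           !s->may_send_now;
}
```

For example, with 3 segments outstanding and 2 of them SACKed, the predicate is true; with only 1 of the 3 SACKed it is false, because more than one segment is still unaccounted for.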
(3)tcp: early retransmit: delayed fast retransmit
This patch implements the behavior for tcp_early_retrans == 2: the fast retransmit is delayed by a quarter of the RTT,
mainly to reduce the harm ER can do when packets are reordered.
@include/linux/tcp.h
struct tcp_sock {
    ...
    u8 do_early_retrans:1,       /* Enable RFC5827 early-retransmit (whether TCP early retransmit is in use) */
       early_retrans_delayed:1,  /* Delayed ER timer installed (whether the fast-retransmit delay timer is running) */
    ...
};
@net/ipv4/tcp_input.c
static int tcp_time_to_recover(struct sock *sk, int flag)
{
    ...
    /* Trick#6: TCP early retransmit, per RFC5827. To avoid spurious retransmissions
     * due to small network reorderings, we implement Mitigation A.3 in the RFC and
     * delay the retransmission for a short interval if appropriate.
     */
    if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
        (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
        !tcp_may_send_now(sk))
        return !tcp_pause_early_retransmit(sk, flag);  /* Should we be conservative and delay ER by RTT/4? */
    ...
}

/* Decide whether to delay ER. In the following cases there is no delay and
 * the fast retransmit happens immediately:
 * 1) tcp_early_retrans == 1: delay is not used.
 * 2) This ACK carries the ECE flag, i.e. an explicit congestion notification was received.
 * 3) There is no RTT sample, so the delay cannot be computed.
 * 4) The RTO timer would fire first, so installing the delayed-ER timer is pointless.
 */
static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned long delay;

    /* Delay early retransmit and entering fast recovery for max(RTT/4, 2msec) unless ack has ECE mark,
     * no RTT samples available, or RTO is scheduled to fire first.
     */
    if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
        return false;

    /* delay = max(RTT/4, 2ms) */
    delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));

    /* The RTO timer would fire first */
    if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
        return false;

    /* Borrow the RTO timer and use it temporarily as the delayed-ER timer */
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);

    /* Early retransmit has been delayed and the timer is running */
    tp->early_retrans_delayed = 1;
    return true;
}

/* Restart timer after forward progress on connection.
 * RFC2988 recommends to restart timer to now + rto.
 * Previously this function only re-armed the RTO timer when a new ACK arrived.
 * Because the delayed-ER timer and the RTO timer now share one timer, it also
 * cancels the delayed-ER timer and restores the RTO timer. The delayed-ER timer
 * is cancelled because:
 * 1) It has fired, the retransmission can proceed, and it is no longer needed.
 * 2) While it was running, a new ACK arrived, new data was sent, or a packet was
 *    retransmitted, so there is no longer a reason to use TCP early retransmit.
 */
void tcp_rearm_rto(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (!tp->packets_out) {  /* No unacknowledged packets in the network */
        inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
    } else {
        u32 rto = inet_csk(sk)->icsk_rto;

        /* Offset the time elapsed after installing regular RTO.
         * When restoring the RTO timer, subtract the time already elapsed,
         * including the time taken by the delayed-ER timer.
         */
        if (tp->early_retrans_delayed) {
            struct sk_buff *skb = tcp_write_queue_head(sk);
            const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;  /* When the RTO timer should fire */
            s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);      /* Time still left to go */

            /* delta may not be positive if the socket is locked when the delayed ER timer fires and
             * is rescheduled.
             */
            if (delta > 0)
                rto = delta;
        }
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);  /* Restore the RTO timer */
    }
    tp->early_retrans_delayed = 0;  /* Clear the delayed-ER flag */
}

/* The TCP retransmit timer. */
void tcp_retransmit_timer(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);

    if (tp->early_retrans_delayed) {  /* The timer fired in its role as the delayed-ER timer */
        /* The delay has elapsed, so early retransmit can proceed */
        tcp_resume_early_retransmit(sk);
        return;
    }
    ...
}

/* This function is called when the delayed ER timer fires. TCP enters fast recovery
 * and performs fast-retransmit.
 */
void tcp_resume_early_retransmit(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    /* Cancel the delayed-ER timer and restore the RTO timer */
    tcp_rearm_rto(sk);

    /* Stop if ER is disabled after the delayed ER timer is scheduled */
    if (!tp->do_early_retrans)
        return;

    tcp_enter_recovery(sk, false);  /* Setup performed on entering Recovery */
    tcp_update_scoreboard(sk, 1);   /* Mark the first unacknowledged packet in the send queue as lost */
    tcp_xmit_retransmit_queue(sk);  /* Retransmit the first unacknowledged packet in the send queue */
}
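The expression tp->srtt >> 5 is worth unpacking: in this kernel version tp->srtt holds the smoothed RTT multiplied by 8, so shifting right by 5 yields (srtt / 8) / 4, i.e. RTT/4. The sketch below reproduces the delay computation and the pause decision in user space. It works in milliseconds instead of jiffies, and the names er_delay_ms() and er_should_pause() are hypothetical, introduced only for illustration.

```c
#include <stdbool.h>

/* Delay before the early retransmit: max(RTT/4, 2 ms).
 * srtt_x8_ms is the smoothed RTT multiplied by 8, mirroring how tp->srtt
 * is stored; using milliseconds instead of jiffies is an assumption. */
unsigned long er_delay_ms(unsigned long srtt_x8_ms)
{
    unsigned long quarter_rtt = srtt_x8_ms >> 5;  /* (srtt / 8) / 4 */

    return quarter_rtt > 2 ? quarter_rtt : 2;
}

/* Returns true when ER should be delayed rather than performed at once.
 * ms_until_rto is the time left before the regular RTO timer would fire. */
bool er_should_pause(int sysctl_mode, bool ack_has_ece,
                     unsigned long srtt_x8_ms, unsigned long ms_until_rto)
{
    /* Mode below 2, ECE on the ACK, or no RTT sample: retransmit now. */
    if (sysctl_mode < 2 || ack_has_ece || srtt_x8_ms == 0)
        return false;

    /* Delay only if the RTO would fire later than the delayed-ER timer. */
    return ms_until_rto > er_delay_ms(srtt_x8_ms);
}
```

With a smoothed RTT of 100 ms the stored value is 800, and 800 >> 5 gives a delay of 25 ms. For very short RTTs the 2 ms floor applies.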
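While the delayed-ER timer is running, receiving a new ACK or sending a new data packet removes the reason
to use TCP early retransmit, so the ER timer must be cancelled.
1) A new ACK is received
@tcp_ack()
    if (tp->early_retrans_delayed)  /* The delayed-ER timer is running */
        tcp_rearm_rto(sk);          /* Cancel the delayed-ER timer and restore the RTO timer */
2) A new data packet is sent
@tcp_event_new_data_sent()
    /* No unacknowledged packets were in the network before, or the delayed-ER timer is in use */
    if (!prior_packets || tp->early_retrans_delayed)
        tcp_rearm_rto(sk);          /* Serves both purposes */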
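When the borrowed timer is handed back, tcp_rearm_rto() does not restart a full RTO; it programs only the time that remains until the original RTO deadline of the head-of-queue segment. A minimal sketch of that arithmetic follows, assuming 32-bit wrapping timestamps as in the kernel code; restored_rto() is a hypothetical name and the units are arbitrary ticks.

```c
#include <stdint.h>

/* Timeout to program when the borrowed timer is returned to the RTO.
 * head_sent_at: timestamp at which the head-of-queue segment was sent
 * rto:          the regular retransmission timeout
 * now:          the current timestamp
 * All values are 32-bit counters that may wrap around. */
uint32_t restored_rto(uint32_t head_sent_at, uint32_t rto, uint32_t now)
{
    /* A signed difference keeps the result correct across counter wrap. */
    int32_t delta = (int32_t)(head_sent_at + rto - now);

    /* If the deadline has already passed, fall back to the full RTO. */
    return delta > 0 ? (uint32_t)delta : rto;
}
```

For a segment sent at tick 1000 with an RTO of 200, restoring the timer at tick 1050 programs 150 ticks rather than 200, so the time spent waiting on the delayed-ER timer is not added on top of the RTO.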
TCP thin dupack
(1) Options
tcp_thin_dupack — BOOLEAN
Enable dynamic triggering of retransmissions after one dupACK for thin streams.
If set, a check is performed upon reception of a dupACK to determine if the stream is thin ( less than 4
packets in flight). As long as the stream is found to be thin, data is retransmitted on the first received
dupACK. This improves retransmission latency for non-aggressive thin streams, often found to be
time-dependent.
Default: 0
tcp_thin_linear_timeouts — BOOLEAN
Enable dynamic triggering of linear timeouts for thin streams.
If set, a check is performed upon retransmission by timeout to determine if the stream is thin (less than 4
packets in flight). As long as the stream is found to be thin, up to 6 linear timeouts may be performed
before exponential backoff mode is initiated. This improves retransmission latency for non-aggressive thin
streams, often found to be time-dependent.
Default: 0
Both options are disabled by default.
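The effect of tcp_thin_linear_timeouts on the timeout schedule can be sketched as follows. This is a simplified model and not the kernel code: next_rto_ms() is a hypothetical helper, a stream counts as thin purely on "fewer than 4 packets in flight" as in the description above, and the 120-second cap is assumed to match the kernel's maximum RTO.

```c
#include <stdbool.h>

#define THIN_LINEAR_RETRIES 6        /* "up to 6 linear timeouts" */
#define MAX_RTO_MS          120000UL /* assumed cap of 120 seconds */

/* Timeout to use after a retransmission timer has fired.
 * cur_rto_ms:   the timeout that just expired
 * base_rto_ms:  the timeout derived from the current RTT estimate
 * retransmits:  number of timeout retransmissions so far
 * in_flight:    packets currently in flight */
unsigned long next_rto_ms(unsigned long cur_rto_ms, unsigned long base_rto_ms,
                          unsigned int retransmits, unsigned int in_flight,
                          bool thin_linear_enabled)
{
    bool thin = in_flight < 4;
    unsigned long rto;

    if (thin_linear_enabled && thin && retransmits <= THIN_LINEAR_RETRIES)
        rto = base_rto_ms;      /* linear: keep the timeout unchanged */
    else
        rto = cur_rto_ms << 1;  /* exponential backoff: double it */

    return rto < MAX_RTO_MS ? rto : MAX_RTO_MS;
}
```

With a 200 ms base timeout, a thin stream keeps retrying every 200 ms for the first six timeouts and only then starts doubling, whereas a regular stream doubles from the first timeout.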
(2) How it works
A wide range of Internet-based services that use reliable transport protocols display what we call thin-stream
properties. This means that the application sends data with such a low rate that the retransmission mechanisms
of the transport protocol are not fully effective. In time-dependent scenarios (like online games, control systems,
stock trading etc.) where the user experience depends on the data delivery latency, packet loss can be devastating
for the service quality. Extreme latencies are caused by TCP's dependency on the arrival of new data from the
application to trigger retransmissions effectively through fast retransmit instead of waiting for long timeouts.
After analysing a large number of time-dependent interactive applications, we have seen that they often produce
thin streams and also stay with this traffic pattern throughout its entire lifespan. The combination of time-dependency
and the fact that the streams provoke high latencies when using TCP is unfortunate.
In order to reduce application-layer latency when packets are lost, a set of mechanisms has been made, which
address these latency issues for thin streams. In short, if the kernel detects a thin stream, the retransmission
mechanisms are modified in the following manner:
1) If the stream is thin, fast retransmit on the first dupACK.
2) If the stream is thin, do not apply exponential backoff.
These enhancements are applied only if the stream is detected as thin. This is accomplished by defining a
threshold for the number of packets in flight. If there are fewer than 4 packets in flight, fast retransmissions
cannot be triggered, and the stream is prone to experience high retransmission latencies.
TCP early retransmit
(1) Options
tcp_early_retrans— INTEGER
Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold for triggering fast retransmit when
the amount of outstanding data is small and when no previously unsent data can be transmitted (such
that limited transmit could be used).
Possible values:
0 disables ER
1 enables ER
2 enables ER but delays fast recovery and fast retransmit by a fourth of RTT. This mitigates connection
falsely recovers when network has a small degree of reordering (less than 3 packets).
Default: 2
tcp_early_retrans is enabled by default, with a value of 2.
(2) How it works
The average Google HTTP response is only 7.5KB or about 5-6 segments.
Early retransmit (ER) is designed to overcome the well known limitation with fast retransmit:
if a loss occurs too close to the end of a stream, there will not be enough dupacks to trigger a fast
retransmission. ER lowers the dupthresh to 1 or 2 when the outstanding data drops to 2 or 3
segments respectively.
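The limitation and the fix can be checked with a few lines of arithmetic. The sketch below assumes exactly one outstanding segment is lost and every other outstanding segment produces one dupack; er_dupthresh() and fast_retransmit_possible() are hypothetical helper names introduced for illustration.

```c
#include <stdbool.h>

#define STANDARD_DUPTHRESH 3

/* Duplicate-ACK threshold under ER: outstanding - 1 when 2 or 3 segments
 * are outstanding, the standard threshold otherwise. */
unsigned int er_dupthresh(unsigned int outstanding)
{
    if (outstanding >= 2 && outstanding < 4)
        return outstanding - 1;
    return STANDARD_DUPTHRESH;
}

/* Can fast retransmit be triggered at all, assuming exactly one of the
 * outstanding segments is lost and each of the others yields one dupack? */
bool fast_retransmit_possible(unsigned int outstanding, bool er_enabled)
{
    unsigned int max_dupacks = outstanding > 0 ? outstanding - 1 : 0;
    unsigned int thresh = er_enabled ? er_dupthresh(outstanding)
                                     : STANDARD_DUPTHRESH;

    return max_dupacks >= thresh;
}
```

With 3 segments outstanding, at most 2 dupacks can arrive, so the standard threshold of 3 is never reached and the sender has to wait for a timeout; with ER the threshold drops to 2 and fast retransmit becomes possible.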
Clearly, any reordering can falsely trigger early retransmit. If this happens near the end of one HTTP
response, the sender will falsely enter fast recovery which lowers the cwnd and slows the next HTTP
response over the same connection. To make ER more robust in the presence of reordering,
RFC 5827 describes three mitigation algorithms:
1. Disabling early retransmit if the connection has detected past reordering.
2. Adding a small delay to early retransmit so it might be canceled if the missing segment arrives
slightly late.
3. Throttling the total early retransmission rate.
(3) Test results
ER with both mitigations can reduce 34% of the timeouts in Disorder state with 6% of early retransmits
identified as spurious retransmissions via DSACKs.
ER with both mitigations reduces latencies up to 8.5% and is most effective for short transactions.
However, the overall latency reduction by ER is significantly limited in Web servers, since the number of
timeouts in the Disorder state is very small compared to timeouts occurring in the Open state.