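Implementation of the TCP Congestion State Machine (Part 1)
Content: This article analyzes tcp_fastretrans_alert(), the main function in the implementation of the TCP congestion state machine. Later articles will analyze its important parts in more detail.
Kernel version: 2.6.37
Author: zhangskd @ csdn
Principles
First, some background on the concepts involved.
Congestion states: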
(1) Open: Normal state, no dubious events, fast path.
(2) Disorder: In all respects it is Open, but requires a bit more attention.
It is entered when we see some SACKs or dupacks. It is split off from Open
mainly to move some processing from the fast path to the slow one.
(3) CWR: cwnd was reduced due to some Congestion Notification event.
It can be ECN, ICMP source quench, local device congestion.
(4) Recovery: cwnd was reduced, we are fast-retransmitting.
(5) Loss: cwnd was reduced due to RTO timeout or SACK reneging.
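The five states can be written down as a small enum. The sketch below follows the ordering the kernel uses (Open, Disorder, CWR, Recovery, Loss, numbered 0 to 4), which matters because the main function later tests icsk_ca_state < TCP_CA_CWR. The helpers ca_state_name() and cwnd_already_reduced() are hypothetical names introduced only for illustration; they are not kernel functions.

```c
#include <stdbool.h>

/* Congestion states, ordered from least to most severe. */
enum tcp_ca_state {
    TCP_CA_Open = 0,
    TCP_CA_Disorder = 1,
    TCP_CA_CWR = 2,
    TCP_CA_Recovery = 3,
    TCP_CA_Loss = 4
};

/* Hypothetical helper: printable name of a state. */
static const char *ca_state_name(enum tcp_ca_state s)
{
    switch (s) {
    case TCP_CA_Open:     return "Open";
    case TCP_CA_Disorder: return "Disorder";
    case TCP_CA_CWR:      return "CWR";
    case TCP_CA_Recovery: return "Recovery";
    case TCP_CA_Loss:     return "Loss";
    }
    return "Unknown";
}

/* Hypothetical helper: per the state descriptions above, cwnd has
 * already been reduced in CWR, Recovery and Loss, but not in Open
 * or Disorder. The ordering of the enum makes this one comparison. */
static bool cwnd_already_reduced(enum tcp_ca_state s)
{
    return s >= TCP_CA_CWR;
}
```

Because the states are ordered, a check such as "state is Open or Disorder" reduces to a single comparison against TCP_CA_CWR.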
tcp_fastretrans_alert() is entered:
(1) on each incoming ACK, if state is not Open
(2) when the arrived ACK is unusual, namely:
    SACK
    Duplicate ACK
    ECN ECE
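The two entry conditions can be expressed as one predicate over the ACK's flag bits. This is a minimal user-space sketch, not the kernel's code: the function name ack_is_dubious() and the boolean state_is_open parameter are assumptions made for illustration, and the flag values are the ones listed in the flag section of this article.

```c
#include <stdbool.h>

#define FLAG_DATA        0x01 /* Incoming frame contained data. */
#define FLAG_WIN_UPDATE  0x02 /* Incoming ACK was a window update. */
#define FLAG_DATA_ACKED  0x04 /* This ACK acknowledged new data. */
#define FLAG_SYN_ACKED   0x10 /* This ACK acknowledged SYN. */
#define FLAG_DATA_SACKED 0x20 /* New SACK. */
#define FLAG_ECE         0x40 /* ECE in this ACK. */
#define FLAG_ACKED       (FLAG_DATA_ACKED | FLAG_SYN_ACKED)
#define FLAG_NOT_DUP     (FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED)

/* Hypothetical helper: should this ACK go through the state machine? */
static bool ack_is_dubious(bool state_is_open, int flag)
{
    if (!state_is_open)
        return true;                 /* (1) any ACK outside Open */
    if (!(flag & FLAG_NOT_DUP))
        return true;                 /* (2) duplicate ACK */
    if (flag & (FLAG_DATA_SACKED | FLAG_ECE))
        return true;                 /* (2) new SACK or ECN ECE */
    return false;                    /* ordinary ACK in Open: fast path */
}
```

An ACK that only acknowledges new data while the connection is Open returns false and stays on the fast path; every other case is routed to the state machine.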
Counting packets in flight is pretty simple.
(1) in_flight = packets_out - left_out + retrans_out
packets_out is SND.NXT - SND.UNA counted in packets.
retrans_out is the number of retransmitted segments.
left_out is the number of segments that have left the network but are not ACKed yet.
(2) left_out = sacked_out + lost_out
sacked_out: Packets which arrived at the receiver out of order and hence were not ACKed. With SACK this
number is simply the amount of SACKed data. Even without SACKs it is easy to give a pretty reliable
estimate of this number by counting duplicate ACKs.
(3) lost_out: Packets lost by the network. TCP has no explicit loss notification feedback from the network
(for now). It means that this number can only be guessed. Actually, it is the heuristics used to predict
loss that distinguish the different algorithms.
For example, after an RTO, when the whole queue is considered lost, lost_out = packets_out and
in_flight = retrans_out.
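The two formulas above are easy to check with concrete numbers. The sketch below is a user-space illustration only: struct pkt_counts and the function names left_out() and in_flight() are assumptions chosen to match the terms in the text, not the kernel's own definitions.

```c
#include <assert.h>

/* Hypothetical container for the four counters discussed above. */
struct pkt_counts {
    unsigned packets_out; /* SND.NXT - SND.UNA, in packets */
    unsigned sacked_out;  /* SACKed (or inferred from dupacks) */
    unsigned lost_out;    /* guessed lost */
    unsigned retrans_out; /* retransmitted, not yet ACKed */
};

/* left_out = sacked_out + lost_out */
static unsigned left_out(const struct pkt_counts *c)
{
    return c->sacked_out + c->lost_out;
}

/* in_flight = packets_out - left_out + retrans_out */
static unsigned in_flight(const struct pkt_counts *c)
{
    /* More segments cannot have left the network than were sent. */
    assert(left_out(c) <= c->packets_out);
    return c->packets_out - left_out(c) + c->retrans_out;
}
```

For example, with packets_out = 10, sacked_out = 3, lost_out = 2 and retrans_out = 1, left_out is 5 and in_flight is 6. In the post-RTO case, where lost_out equals packets_out and nothing is SACKed, in_flight collapses to retrans_out, as the text states.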
Essentially, we now have two algorithms for counting lost packets.
1) FACK: This is the simplest heuristic. As soon as we decide that something is lost, we decide that
all packets not SACKed up to the most forward SACK are lost. I.e.
lost_out = fackets_out - sacked_out and left_out = fackets_out
It is an absolutely correct estimate if the network does not reorder packets. And it loses any connection
to reality when reordering takes place. We use FACK by default until reordering is suspected on
the path to this destination.
2) NewReno: When Recovery is entered, we assume that one segment is lost (classic Reno). While
we are in Recovery and a partial ACK arrives, we assume that one more packet is lost (NewReno).
These heuristics are the same in NewReno and SACK.
Imagine, that's all! Forget about all this shamanism about CWND inflation deflation etc. CWND
is real congestion window, never inflated, changes only according to classic VJ rules.
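The difference between the two loss estimators is easiest to see side by side. The following is a rough sketch under stated assumptions, not the kernel's implementation: both function names are hypothetical, and the NewReno function simply counts how many losses have been assumed in total so far (one on entering Recovery plus one per partial ACK), which is a simplification of how lost_out is actually maintained.

```c
/* FACK: every segment below the most forward SACK that has not
 * been SACKed is declared lost. */
static unsigned fack_lost_out(unsigned fackets_out, unsigned sacked_out)
{
    return fackets_out - sacked_out;
}

/* NewReno (simplified): one loss assumed when Recovery is entered,
 * one more for each partial ACK seen while in Recovery. */
static unsigned newreno_losses_assumed(unsigned partial_acks)
{
    return 1 + partial_acks;
}
```

Suppose the sender has eight segments outstanding and the receiver has SACKed the third, fifth and sixth. The most forward SACK covers the sixth segment, so fackets_out is 6 and sacked_out is 3, and FACK immediately marks three segments (the first, second and fourth) as lost. NewReno in the same situation starts by assuming a single loss and only raises its estimate as partial ACKs arrive.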
Really tricky (and requiring careful tuning) part of algorithm is hidden in functions
tcp_time_to_recover() and tcp_xmit_retransmit_queue().
tcp_time_to_recover()
It determines the moment when we should reduce cwnd and, hence, slow down forward
transmission. In fact, it determines the moment when we decide that hole is caused by loss,
rather than by a reorder.
tcp_xmit_retransmit_queue()
It decides what we should retransmit to fill holes, caused by lost packets.
undo heuristics
And the most logically complicated part of algorithm is undo heuristics. We detect false
retransmits due to both too early fast retransmit (reordering) and underestimated RTO,
analyzing timestamps and D-SACKs. When we detect that some segments were retransmitted
by mistake and CWND reduction was wrong, we undo window reduction and abort recovery
phase. This logic is hidden inside several functions named tcp_try_undo_<something>.
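Main function
The TCP congestion state machine is implemented mainly in tcp_fastretrans_alert(), which is called from tcp_ack().
The function is divided into several phases:
A. FLAG_ECE: an ACK carrying the ECE flag was received.
B. Reneging SACKs: the ACK points at data that had already been SACKed. In that case, timeout handling is entered and the function returns.
C. State is not Open: loss has been detected, and the lost packets must be marked so that we know which ones to retransmit.
D. Check for inconsistency (left_out > packets_out).
E. How each state is exited, which happens when snd_una >= high_seq.
F. How each state is processed and entered.
The rest of this article analyzes these phases in detail.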
/* Process an event, which can update packets-in-flight not trivially.
 * Main goal of this function is to calculate new estimate for left_out,
 * taking into account both packets sitting in receiver's buffer and
 * packets lost by network.
 *
 * Besides that it does CWND reduction, when packet loss is detected
 * and changes state of machine.
 *
 * It does not decide what to send, it is made in function
 * tcp_xmit_retransmit_queue().
 */

/* This function is called:
 * (1) on each incoming ACK, if state is not Open
 * (2) when the arrived ACK is unusual, namely:
 *     SACK
 *     Duplicate ACK
 *     ECN ECE
 */
static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);

    /* Is this a duplicate ACK? */
    int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));

    /* tcp_fackets_out() returns the number of segments up to the most
     * forward SACK (holes included); if it exceeds reordering, loss is
     * assumed. */
    int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
                                (tcp_fackets_out(tp) > tp->reordering));
    int fast_rexmit = 0, mib_idx;

    /* If packets_out is 0, there cannot be any sacked_out. */
    if (WARN_ON(!tp->packets_out && tp->sacked_out))
        tp->sacked_out = 0;

    /* The FACK count depends on at least one SACKed segment. */
    if (WARN_ON(!tp->sacked_out && tp->fackets_out))
        tp->fackets_out = 0;

    /* Now state machine starts.
     * A. ECE, hence prohibit cwnd undoing, the reduction is required.
     *    Undoing the congestion window is disallowed and the window
     *    starts to be reduced.
     */
    if (flag & FLAG_ECE)
        tp->prior_ssthresh = 0;

    /* B. In all the states check for reneging SACKs.
     *    Check for a bogus SACK, i.e. whether the ACK acknowledges data
     *    that had already been SACKed.
     */
    if (tcp_check_sack_reneging(sk, flag))
        return;

    /* C. Process data loss notification, provided it is valid.
     *    (It is not clear to the author why so many conditions are needed.)
     *    We are not in Open and loss has been detected, so the lost
     *    packets must be marked.
     */
    if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOST) &&
        before(tp->snd_una, tp->high_seq) &&
        icsk->icsk_ca_state != TCP_CA_Open &&
        tp->fackets_out > tp->reordering) {
        tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
    }

    /* D. Check consistency of the current state.
     *    Verify that left_out <= packets_out.
     */
    tcp_verify_left_out(tp);

    /* E. Check state exit conditions. State can be terminated
     *    when high_seq is ACKed.
     */
    if (icsk->icsk_ca_state == TCP_CA_Open) {
        /* In Open there can be no retransmitted, still unACKed segments. */
        WARN_ON(tp->retrans_out != 0);
        /* Clear the send time of the first segment retransmitted in the
         * previous retransmission phase. */
        tp->retrans_stamp = 0;
    } else if (!before(tp->snd_una, tp->high_seq)) { /* high_seq was ACKed */
        switch (icsk->icsk_ca_state) {
        case TCP_CA_Loss:
            icsk->icsk_retransmits = 0; /* reset the RTO retransmit count */
            /* Whether or not the undo succeeds, we return to Open,
             * unless SACK is not in use. */
            if (tcp_try_undo_recovery(sk))
                return;
            break;

        case TCP_CA_CWR:
            /* CWR is to be held until something *above* high_seq is ACKed,
             * for the CWR bit to reach the receiver.
             * snd_una must be beyond high_seq before CWR is left.
             */
            if (tp->snd_una != tp->high_seq) {
                tcp_complete_cwr(sk);
                tcp_set_ca_state(sk, TCP_CA_Open);
            }
            break;

        case TCP_CA_Disorder:
            tcp_try_undo_dsack(sk);
            /* For SACK case do not Open to allow to undo
             * catching for all duplicate ACKs. */
            if (!tp->undo_marker || tcp_is_reno(tp) ||
                tp->snd_una != tp->high_seq) {
                tp->undo_marker = 0;
                tcp_set_ca_state(sk, TCP_CA_Open);
            }
            break;

        case TCP_CA_Recovery:
            if (tcp_is_reno(tp))
                tcp_reset_reno_sack(tp); /* zero sacked_out */
            if (tcp_try_undo_recovery(sk))
                return;
            tcp_complete_cwr(sk);
            break;
        }
    }

    /* F. Process state. */
    switch (icsk->icsk_ca_state) {
    case TCP_CA_Recovery:
        if (!(flag & FLAG_SND_UNA_ADVANCED)) {
            if (tcp_is_reno(tp) && is_dupack)
                tcp_add_reno_sack(sk); /* bump sacked_out, check for reordering */
        } else
            do_lost = tcp_try_undo_partial(sk, pkts_acked);
        break;

    case TCP_CA_Loss:
        /* A partial ACK arrived: reset the RTO retransmit count. */
        if (flag & FLAG_DATA_ACKED)
            icsk->icsk_retransmits = 0;
        if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
            tcp_reset_reno_sack(tp); /* zero sacked_out */
        if (!tcp_try_undo_loss(sk)) { /* try to undo and return to Open */
            /* Undo is not possible: keep retransmitting the packets
             * marked as lost. */
            tcp_moderate_cwnd(tp);
            tcp_xmit_retransmit_queue(sk); /* to be examined later */
            return;
        }
        if (icsk->icsk_ca_state != TCP_CA_Open)
            return;
        /* Loss is undone; fall through to process in Open state. */

    default:
        if (tcp_is_reno(tp)) {
            if (flag & FLAG_SND_UNA_ADVANCED)
                tcp_reset_reno_sack(tp);
            if (is_dupack)
                tcp_add_reno_sack(sk);
        }

        if (icsk->icsk_ca_state == TCP_CA_Disorder)
            tcp_try_undo_dsack(sk); /* D-SACKs covered every retransmitted segment */

        /* Decide whether to enter Recovery. */
        if (!tcp_time_to_recover(sk)) {
            /* This call decides whether to enter Open, Disorder or CWR. */
            tcp_try_to_open(sk, flag);
            return;
        }

        /* MTU probe failure: don't reduce cwnd */
        /* The MTU probing part is omitted here. */
        ......

        /* Otherwise enter Recovery state */
        if (tcp_is_reno(tp))
            mib_idx = LINUX_MIB_TCPRENORECOVERY;
        else
            mib_idx = LINUX_MIB_TCPSACKRECOVERY;
        NET_INC_STATS_BH(sock_net(sk), mib_idx);

        /* Before entering Recovery, save the data needed for a later undo. */
        tp->high_seq = tp->snd_nxt; /* used to decide when to exit */
        tp->prior_ssthresh = 0;
        tp->undo_marker = tp->snd_una;
        tp->undo_retrans = tp->retrans_out;

        if (icsk->icsk_ca_state < TCP_CA_CWR) {
            if (!(flag & FLAG_ECE))
                tp->prior_ssthresh = tcp_current_ssthresh(sk); /* save the old threshold */
            tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); /* update the threshold */
            TCP_ECN_queue_cwr(tp);
        }

        tp->bytes_acked = 0;
        tp->snd_cwnd_cnt = 0;
        tcp_set_ca_state(sk, TCP_CA_Recovery); /* enter Recovery */
        fast_rexmit = 1; /* fast retransmit flag */
    }

    /* Update the scoreboard: mark lost and timed-out packets and
     * increase lost_out. */
    if (do_lost || (tcp_is_fack(tp) && tcp_head_timeout(sk)))
        tcp_update_scoreboard(sk, fast_rexmit);

    /* Reduce snd_cwnd. */
    tcp_cwnd_down(sk, flag);
    tcp_xmit_retransmit_queue(sk);
}
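Phase E hinges on the test !before(tp->snd_una, tp->high_seq), which must stay correct when 32-bit sequence numbers wrap around. The sketch below shows the usual signed-difference technique in user space. The names seq_before() and recovery_point_acked() are hypothetical stand-ins, and the cast from an out-of-range unsigned value to int32_t is implementation-defined in C, although it gives the two's-complement result on mainstream compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 is earlier than seq2 in 32-bit sequence space.
 * The unsigned subtraction wraps, and the sign of the difference
 * tells which side of seq2 the value seq1 lies on. */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* Hypothetical helper for the phase E exit test: the state may be
 * left once snd_una has reached or passed high_seq. */
static bool recovery_point_acked(uint32_t snd_una, uint32_t high_seq)
{
    return !seq_before(snd_una, high_seq);
}
```

A plain snd_una >= high_seq comparison would give the wrong answer when high_seq sits just below 2^32 and snd_una has already wrapped to a small value; the signed difference handles that case.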
The flag bits
#define FLAG_DATA               0x01   /* Incoming frame contained data. */
#define FLAG_WIN_UPDATE         0x02   /* Incoming ACK was a window update. */
#define FLAG_SND_UNA_ADVANCED   0x400  /* snd_una was changed (!= FLAG_DATA_ACKED) */
#define FLAG_DATA_SACKED        0x20   /* New SACK. */
#define FLAG_ECE                0x40   /* ECE in this ACK */
#define FLAG_SACK_RENEGING      0x2000 /* snd_una advanced to a sacked seq */
#define FLAG_DATA_LOST          0x80   /* SACK detected data lossage. */
#define FLAG_DATA_ACKED         0x04   /* This ACK acknowledged new data. */
#define FLAG_SYN_ACKED          0x10   /* This ACK acknowledged SYN. */
#define FLAG_ACKED              (FLAG_DATA_ACKED | FLAG_SYN_ACKED)
#define FLAG_NOT_DUP            (FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED) /* defines a non-duplicate ACK */
#define FLAG_FORWARD_PROGRESS   (FLAG_ACKED | FLAG_DATA_SACKED)
#define FLAG_ANY_PROGRESS       (FLAG_FORWARD_PROGRESS | FLAG_SND_UNA_ADVANCED)
#define FLAG_DSACKING_ACK       0x800  /* SACK blocks contained D-SACK info */

struct tcp_sock {
    ...
    u32 retrans_out;   /* number of retransmitted segments not yet ACKed */
    u32 retrans_stamp; /* send time of the first segment of the previous
                        * retransmission phase; used to decide whether the
                        * congestion adjustment can be undone */
    struct sk_buff *highest_sack; /* highest skb with SACK received
                                   * (validity guaranteed only if sacked_out > 0) */
    ...
};

struct inet_connection_sock {
    ...
    __u8 icsk_retransmits; /* number of RTO retransmissions */
    ...
};
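To see how these bits drive the first two decisions in tcp_fastretrans_alert(), the expressions for is_dupack and do_lost can be lifted into standalone functions. This is a user-space sketch for experimenting with flag combinations: the function names and the plain integer parameters are assumptions, while the expressions themselves are copied from the listing above.

```c
#include <stdbool.h>

#define FLAG_DATA             0x01
#define FLAG_WIN_UPDATE       0x02
#define FLAG_DATA_ACKED       0x04
#define FLAG_SYN_ACKED        0x10
#define FLAG_DATA_SACKED      0x20
#define FLAG_SND_UNA_ADVANCED 0x400
#define FLAG_ACKED            (FLAG_DATA_ACKED | FLAG_SYN_ACKED)
#define FLAG_NOT_DUP          (FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED)

/* A duplicate ACK carries no data, no window update, acknowledges
 * nothing new and does not move snd_una. */
static bool flag_is_dupack(int flag)
{
    return !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
}

/* Loss marking is triggered by a duplicate ACK, or by a new SACK
 * when the forward-most SACK is more than `reordering` segments out. */
static bool flag_do_lost(int flag, unsigned fackets_out, unsigned reordering)
{
    return flag_is_dupack(flag) ||
           ((flag & FLAG_DATA_SACKED) && fackets_out > reordering);
}
```

Note that an ACK carrying only a new SACK block (FLAG_DATA_SACKED alone) still counts as a duplicate ACK here, because FLAG_DATA_SACKED is not part of FLAG_NOT_DUP.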
Whether SACK / Reno / FACK is enabled
/* These functions determine how the current flow behaves in respect of SACK
 * handling. SACK is negotiated with the peer, and therefore it can vary between
 * different flows.
 *
 * tcp_is_sack - SACK enabled
 * tcp_is_reno - No SACK
 * tcp_is_fack - FACK enabled, implies SACK enabled
 */
static inline int tcp_is_sack(const struct tcp_sock *tp)
{
    return tp->rx_opt.sack_ok; /* SACK seen on SYN packet */
}

static inline int tcp_is_reno(const struct tcp_sock *tp)
{
    return !tcp_is_sack(tp);
}

static inline int tcp_is_fack(const struct tcp_sock *tp)
{
    return tp->rx_opt.sack_ok & 2;
}

static inline void tcp_enable_fack(struct tcp_sock *tp)
{
    tp->rx_opt.sack_ok |= 2;
}

static inline int tcp_fackets_out(const struct tcp_sock *tp)
{
    return tcp_is_reno(tp) ? tp->sacked_out + 1 : tp->fackets_out;
}
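(1) If FACK is enabled, then fackets_out = left_out, that is:
fackets_out = sacked_out + lost_out
so: lost_out = fackets_out - sacked_out
This is a fairly aggressive loss estimate, namely FACK.
(2) If FACK is not enabled, only one packet is assumed to be lost, so left_out = sacked_out + 1.
This is the more conservative approach, and it runs into trouble when a large number of packets are lost.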
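The two cases can be checked with a few numbers. The sketch below mirrors the logic of tcp_fackets_out() in user space; struct sack_view and all three function names are hypothetical, the struct holds only the fields needed here, and the reordering threshold of 3 used in the note afterwards is an assumed value for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few tcp_sock fields used here. */
struct sack_view {
    unsigned sack_ok;     /* nonzero: SACK in use; bit value 2: FACK enabled */
    unsigned sacked_out;  /* SACKed segments, or dupack count for Reno */
    unsigned fackets_out; /* segments up to the most forward SACK */
};

static bool view_is_reno(const struct sack_view *v)
{
    return v->sack_ok == 0;
}

/* Same selection as tcp_fackets_out(): Reno has no SACK scoreboard,
 * so it assumes exactly one hole in front of the dupacks. */
static unsigned view_fackets_out(const struct sack_view *v)
{
    return view_is_reno(v) ? v->sacked_out + 1 : v->fackets_out;
}

/* The comparison used when computing do_lost in the main function. */
static bool hole_exceeds_reordering(const struct sack_view *v,
                                    unsigned reordering)
{
    return view_fackets_out(v) > reordering;
}
```

With reordering set to 3, a Reno flow crosses the threshold on its third duplicate ACK, since 3 + 1 > 3, which matches the classic three-dupack rule. A FACK flow with only two SACKed segments can cross it earlier if the most forward SACK is already six segments out.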