QCN

The differences between the operating environments of the Internet and switched Ethernet are as follows:

(i) No per-packet acks in Ethernet. This has several consequences for congestion control mechanisms: (a) packet transmission is not self-clocked as in the Internet; (b) path delays (round trip times) are not knowable, and congestion must be signaled by switches directly to sources. The last point makes it difficult to know path congestion; one only knows about node congestion.

(ii) Packets may not be dropped. As mentioned, Ethernet links may be paused and packets may not be dropped. A significant side-effect of this is that congestion spreading can occur, causing spurious secondary bottlenecks.

(iii) No packet sequence numbers. L2 packets do not have sequence numbers from which RTTs, or the length of the “control loop” in terms of number of packets in flight, may be inferred.

(iv) Sources start at the line rate. Unlike the slow-start mechanism in TCP, L2 sources may start transmission at the full line rate of 10 Gbps. This is because L2 sources are implemented in hardware, and installing rate limiters is the only way to have a source send at less than the line rate. But since rate limiters are typically few in number, it is preferable to install them only when a source gets a congestion message from a switch.

(v) Very shallow buffers. Ethernet switch buffers are typically 100s of KBytes deep, as opposed to Internet router buffers which are 100s of MBytes deep. Even though in terms of bandwidth-delay product the difference is about right (Ethernet RTTs are a few 100 μsecs, as opposed to Internet RTTs which are a few 100 msecs), the transfer of a single file of, say, 1 MByte length can overwhelm an Ethernet buffer. This is especially true when L2 sources come on at the line rate.

(vi) Small number-of-sources regime is typical. In the Internet literature on congestion control, one usually studies the system when the number of sources is large, which is typical in the Internet. However, in Ethernet (especially in Data Centers), it is the small number of sources that is typical. This imposes serious constraints on the stability of congestion control loops, see below.

(vii) Multipathing. Forwarding in Ethernet is done on spanning trees. While this avoids loops, it is both fragile (there is only one path on a tree between any pair of nodes) and leads to an underutilization of network capacity. For these reasons, equal cost multipathing (ECMP) is sometimes implemented in Ethernet. In this scenario there is more than one path for packets to go from an L2 source to an L2 destination. However, congestion levels on the different paths may be vastly different!

Performance requirements

(i) Stable. This means buffer occupancy processes should not fluctuate, causing overflows and underflows. Such episodes either lead to dropped packets or to link underutilization. This is particularly important when trying to control a small number of high bandwidth sources with a shallow buffer, whose depth is a fraction of the bandwidth-delay product. For example, we would like to operate switch buffers at 30 KByte occupancy when a single 10 Gbps source is traversing it and the overall RTT is 500 μsecs. That is, we aim to keep the buffer occupancy at less than 6% of the bandwidth-delay product!
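The "less than 6%" figure above can be checked with a quick back-of-the-envelope calculation using the numbers given in the text (10 Gbps, 500 μsec RTT, 30 KByte target occupancy):

```python
# Quick check of the stability example: a 30 KByte target occupancy
# against the bandwidth-delay product of a single 10 Gbps source.
line_rate_bps = 10e9               # 10 Gbps source
rtt_s = 500e-6                     # 500 microsecond round-trip time
bdp_bytes = line_rate_bps * rtt_s / 8   # bandwidth-delay product: ~625 KBytes
target_occupancy_bytes = 30e3      # desired buffer operating point

fraction = target_occupancy_bytes / bdp_bytes   # ~0.048, i.e. under 6% of BDP
print(bdp_bytes, fraction)
```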

(ii) Responsive. Ethernet link bandwidth on a priority can vary with time due to traffic fluctuation in other priorities, the appearance of bottlenecks due to pause, the arrival of new sources, etc. These variations can be extreme: from 10 Gbps to 0.5 Gbps and back up again. The algorithm needs to rapidly adapt source rates to these variations.

(iii) Fair. When multiple flows share a link, they should obtain nearly the same share of the link’s bandwidth.

(iv) Simple to implement. The algorithm will be implemented entirely in hardware. Therefore, it should be very simple. A corollary of this requirement is that complicated calculations of rates, control loop gains and other “variables” should be avoided.

QCN

• Note: The QCN algorithm we have developed has Internet relatives, notably BIC-TCP at the source and the REM/PI controllers at switches.

IEEE 802.1 Data Center Bridging standards: Enhancements to Ethernet

• Reliable delivery (802.1Qbb): Link-level flow control (PAUSE) prevents congestion drops

• Ethernet congestion management (802.1Qau): Prevents congestion spreading due to PAUSE

The QCN (Quantized Congestion Notification) algorithm has been developed to provide congestion control at the Ethernet layer, or at L2. It has been developed for the IEEE 802.1Qau standard, which is a part of the IEEE Data Center Bridging Task Group’s efforts. A related effort is the Priority Flow Control project, IEEE 802.1Qbb, for enabling hop-by-hop, per-priority pausing of traffic at congested links. Thus, when the buffer at a congested link fills up, it issues a PAUSE message to upstream buffers, an action which ensures packets do not get dropped due to congestion. A consequence of link-level pausing is the phenomenon of “congestion spreading:” the domino effect of buffer congestion propagating upstream causing secondary bottlenecks. Secondary bottlenecks are highly undesirable as they affect sources whose packets do not pass through the primary bottleneck. An L2 congestion control scheme allows a primary bottleneck to directly reduce the rates of those sources whose packets pass through it, thereby preventing (or reducing the instances of) secondary bottlenecks. The L2 congestion control algorithm is expected to operate well regardless of whether link-level pause exists or not (i.e. packets may be dropped).

The algorithm is composed of two parts:

1. Switch or Congestion Point (CP):

This is the mechanism by which a switch buffer attached to an oversubscribed link (i.e. a link claimed by more flows than anticipated) samples incoming packets and generates a feedback message addressed to the source of the sampled packet. The feedback message contains information about the extent of congestion at the CP.

The goal of the CP is to maintain the buffer occupancy at a desired operating point, Qeq. The CP computes a congestion measure Fb (defined below) and, with a probability depending on the severity of congestion, randomly samples an incoming packet and sends the value of Fb in a feedback message to the source of the sampled packet. The value of Fb is quantized to 6 bits. Let Q denote the instantaneous queue-size and Qold denote the queue-size when the last feedback message was generated. Let

Q: the current queue size; Qeq: the desired operating point for the queue size; Qold: the queue size when the last feedback message was generated.

Fb jointly captures the queue-size excess and the rate excess. From the value of Fb a probability is derived, with which subsequently received packets are sampled and congestion feedback is sent back to their sources.

Qoff = Q − Qeq

Qδ = Q − Qold

Fb = −(Qoff + w·Qδ)

The interpretation is that Fb captures a combination of queue-size excess (Qoff) and rate excess (Qδ). Indeed, Qδ = Q − Qold is the derivative of the queue-size and equals input rate less output rate. Thus, when Fb is negative, either the buffers or the link or both are oversubscribed. When Fb < 0, Fig. 2 shows the probability with which a congestion message is reflected back to the source as a function of |Fb|. The feedback message contains the value of Fb, quantized to 6 bits. When Fb ≥ 0, there is no congestion and no feedback messages are sent.

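The CP logic can be sketched as follows. The values of Qeq, the weight w, the quantization scaling, and the linear 1%–10% sampling-probability ramp are illustrative assumptions (the actual curve is given by Fig. 2), not values stated in the text:

```python
import random

Q_EQ = 26_000       # desired operating point in bytes (assumed value)
W = 2.0             # weight on the rate-excess term (assumed)
FB_MAX = 63         # Fb is quantized to 6 bits
P_MIN, P_MAX = 0.01, 0.10   # sampling probability ramp (assumed linear)

q_old = 0           # queue size when the last feedback message was generated

def on_packet_arrival(q_now):
    """Called by the CP per arriving packet; returns the quantized |Fb|
    to reflect to the source, or None if no feedback is generated."""
    global q_old
    q_off = q_now - Q_EQ          # queue-size excess
    q_delta = q_now - q_old       # rate excess (derivative of queue size)
    fb = -(q_off + W * q_delta)
    if fb >= 0:                   # no congestion: no feedback is sent
        return None
    # 6-bit quantization of |Fb| (scaling here is an assumption)
    fb_q = min(int(abs(fb) * FB_MAX / (8 * Q_EQ)), FB_MAX)
    # more severe congestion -> higher sampling probability
    p = P_MIN + (P_MAX - P_MIN) * fb_q / FB_MAX
    if random.random() < p:
        q_old = q_now             # remember queue size at feedback generation
        return fb_q
    return None
```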

2. Rate Limiter or Reaction Point (RP):

This is the mechanism by which a rate limiter (RL) associated with a source decreases its sending rate based on feedback received from the CP, and increases its rate voluntarily to recover lost bandwidth and probe for extra available bandwidth.

Since the RP is not given positive rate-increase signals by the network, it needs a mechanism for increasing its sending rate on its own. Due to the absence of acks in Ethernet, the increases of rate need to be clocked internally at the RP. Before proceeding to explain the RP algorithm, we will need the following terminology:

Current Rate (CR): The transmission rate of the RL at any time.

Target Rate (TR): The sending rate of the RL just before the arrival of the last feedback message.

Byte Counter: A counter at the RP for counting the number of bytes transmitted by the RL. It times rate increases by the RL. See below.

Timer: A clock at the RP which is also used for timing rate increases at the RL. The main purpose of the timer is to allow the RL to increase rapidly when its sending rate is very low and a lot of bandwidth becomes available. See below.

We now explain the RP algorithm assuming that only the byte counter is available. Later, we will briefly explain how the timer is integrated into the RP algorithm. Fig. 3 shows the basic RP behavior.

Rate decreases:

This occurs only when a feedback message is received, in which case CR and TR are updated as follows:

TR ← CR    (2)

CR ← CR(1 − Gd·|Fb|)    (3)

where the constant Gd is chosen so that Gd·|Fbmax| = 1/2, i.e. the sending rate can decrease by at most 50%.
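A minimal sketch of the decrease rule, assuming the 6-bit maximum Fbmax = 63 (so Gd = 1/126):

```python
FB_MAX = 63
GD = 1 / (2 * FB_MAX)   # chosen so that GD * FB_MAX = 1/2

def rate_decrease(cr, tr, fb):
    """Apply equations (2) and (3) on receipt of a feedback message
    carrying the quantized congestion measure fb (1..63)."""
    tr = cr                       # (2): remember the rate before the decrease
    cr = cr * (1 - GD * fb)       # (3): multiplicative decrease, at most 50%
    return cr, tr
```

For example, a 10 Gbps sender receiving the maximum feedback fb = 63 drops to 5 Gbps, with TR remembering the pre-decrease 10 Gbps.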

Rate increases:

Rate increases occur in two phases: Fast Recovery and Active Increase.

Fast Recovery (FR): The byte counter is reset every time a rate decrease is applied, and the RP enters the FR state. FR consists of 5 cycles, each cycle equal to 150 KBytes of data transmission by the RL. The cycles are counted by the byte counter. At the end of each cycle, TR remains unchanged while CR is updated as follows:

CR ← (CR + TR)/2    (4)

The cycle duration of 150 KBytes is chosen to correspond to the transmission of 100 packets, each 1500 Bytes long. The idea is that if the RL has transmitted 100 packets without receiving a feedback message then, given that the minimum sampling probability at the CP is 1%, it may infer that the CP is uncongested. Therefore it increases its rate as above, recovering some of the bandwidth it lost at the previous rate decrease episode. Thus, the goal of the RP in FR is to rapidly recover the rate it lost at the last rate decrease episode.

Active Increase (AI):

After 5 cycles of FR have completed, the RP enters the Active Increase (AI) phase, where it probes for extra bandwidth on the path. During AI, the byte counter counts out cycles of 50 packets (this can be set to 100 packets for less frequent probing). At the end of each cycle, the RL updates TR and CR as follows:

TR ← TR + RAI    (5)

CR ← (CR + TR)/2    (6)

where RAI is a constant, chosen to be 5 Mbps in the baseline implementation.
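The byte-counter-driven RP behavior can be sketched as below. This is a simplified model: the timer is omitted, feedback handling assumes Fbmax = 63, and the AI cycle length assumes 50 packets of 1500 Bytes each (75 KBytes); these choices are illustrative, not prescribed by the text:

```python
BC_FR_CYCLE = 150_000   # bytes per Fast Recovery cycle (100 x 1500 B packets)
BC_AI_CYCLE = 75_000    # bytes per Active Increase cycle (assumed: 50 x 1500 B)
FR_CYCLES = 5           # FR lasts 5 byte-counter cycles
R_AI = 5e6              # 5 Mbps active-increase step (baseline value)
GD = 1 / 126            # so that GD * Fbmax = 1/2 with Fbmax = 63

class ReactionPoint:
    def __init__(self, cr, tr):
        self.cr, self.tr = cr, tr   # Current Rate, Target Rate (bps)
        self.cycle = 0              # byte-counter cycles since last decrease
        self.bytes_in_cycle = 0

    def on_feedback(self, fb):
        """Rate decrease: equations (2)-(3), then re-enter Fast Recovery."""
        self.tr = self.cr                 # (2)
        self.cr *= (1 - GD * fb)          # (3)
        self.cycle = 0                    # reset the byte counter
        self.bytes_in_cycle = 0

    def on_bytes_sent(self, nbytes):
        """Byte counter: clock out FR/AI cycles as bytes are transmitted."""
        self.bytes_in_cycle += nbytes
        while True:
            limit = BC_FR_CYCLE if self.cycle < FR_CYCLES else BC_AI_CYCLE
            if self.bytes_in_cycle < limit:
                break
            self.bytes_in_cycle -= limit
            self.cycle += 1
            if self.cycle > FR_CYCLES:    # Active Increase: probe beyond TR
                self.tr += R_AI
            self.cr = (self.cr + self.tr) / 2   # (4): move halfway toward TR
```

For example, an RP halved from 10 Gbps to 5 Gbps recovers to 7.5 Gbps after its first 150 KByte FR cycle, then 8.75 Gbps after the second, converging geometrically back toward TR.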
