tcp bbr v2
4.2. Algorithm Organization
The BBR algorithm is an event-driven algorithm that executes steps upon the following events: connection initialization, each ACK, the transmission of each quantum, and loss detection events. All of the sub-steps referenced below are described later in this document.
4.2.1. Initialization
Upon transport connection initialization, BBR executes its initialization steps:
BBROnInit():
  init_windowed_max_filter(filter=BBR.MaxBwFilter, value=0, time=0)
  BBR.min_rtt = SRTT ? SRTT : Inf
  BBR.min_rtt_stamp = Now()
  BBR.probe_rtt_done_stamp = 0
  BBR.probe_rtt_round_done = false
  BBR.prior_cwnd = 0
  BBR.idle_restart = false
  BBR.extra_acked_interval_start = Now()
  BBR.extra_acked_delivered = 0
  BBRResetCongestionSignals()
  BBRResetLowerBounds()
  BBRInitRoundCounting()
  BBRInitFullPipe()
  BBRInitPacingRate()
  BBREnterStartup()
4.2.2. Per-Transmit Steps
When transmitting, BBR merely needs to check for the case where the flow is restarting from idle:
BBROnTransmit():
BBRHandleRestartFromIdle()
4.2.3. Per-ACK Steps
On every ACK, the BBR algorithm executes the following BBRUpdateOnACK() steps in order to update its network path model, update its state machine, and adjust its control parameters to adapt to the updated model:
BBRUpdateOnACK():
BBRUpdateModelAndState()
BBRUpdateControlParameters()
BBRUpdateModelAndState():
BBRUpdateLatestDeliverySignals()
BBRUpdateCongestionSignals()
BBRUpdateACKAggregation()
BBRCheckStartupDone()
BBRCheckDrain()
BBRUpdateProbeBWCyclePhase()
BBRUpdateMinRTT()
BBRCheckProbeRTT()
BBRAdvanceLatestDeliverySignals()
BBRBoundBWForModel()
BBRUpdateControlParameters():
BBRSetPacingRate()
BBRSetSendQuantum()
BBRSetCwnd()
4.2.4. Per-Loss Steps
On every packet loss event, where some sequence range "packet" is marked lost, the BBR algorithm executes the following BBRUpdateOnLoss() steps in order to update its network path model:
BBRUpdateOnLoss(packet):
BBRHandleLostPacket(packet)
4.3. State Machine Operation
4.3.1. Startup
4.3.1.1. Startup Dynamics
When a BBR flow starts up, it performs its first (and most rapid) sequential probe/drain process in the Startup and Drain states. Network link bandwidths currently span a range of at least 11 orders of magnitude, from a few bps to 200 Gbps. To quickly learn BBR.max_bw, given this huge range to explore, BBR's Startup state does an exponential search of the rate space, doubling the sending rate each round. This finds BBR.max_bw in O(log_2(BDP)) round trips.
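As a rough illustration of the O(log_2(BDP)) claim, the following Python sketch counts how many doubling rounds it takes to get from an initial sending rate to a target bottleneck rate. The function name and the example rates are illustrative assumptions, not part of the BBR specification.

```python
def rounds_to_reach(initial_bps: float, target_bps: float) -> int:
    """Count the rounds of rate doubling needed to reach target_bps."""
    if initial_bps <= 0 or target_bps <= 0:
        raise ValueError("rates must be positive")
    rounds = 0
    rate = initial_bps
    while rate < target_bps:
        rate *= 2  # Startup doubles the sending rate each round
        rounds += 1
    return rounds


# Hypothetical example: starting near 100 kbps and probing toward 200 Gbps.
startup_rounds = rounds_to_reach(100e3, 200e9)
```

Even across a rate range of more than six orders of magnitude, the search in this example needs only a couple of dozen rounds.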
To achieve this rapid probing in the smoothest possible fashion, in Startup BBR uses the minimum gain values that will allow the sending rate to double each round: in Startup BBR sets BBR.pacing_gain to BBRStartupPacingGain (2.77) [BBRStartupPacingGain] and BBR.cwnd_gain to BBRStartupCwndGain (2).
When initializing a connection, or upon any later entry into Startup mode, BBR executes the following BBREnterStartup() steps:
BBREnterStartup():
  BBR.state = Startup
  BBR.pacing_gain = BBRStartupPacingGain
  BBR.cwnd_gain = BBRStartupCwndGain
As BBR grows its sending rate rapidly, it obtains higher delivery rate samples, BBR.max_bw increases, and the pacing rate and cwnd both adapt by smoothly growing in proportion. Once the pipe is full, a queue typically forms, but the cwnd_gain bounds any queue to (cwnd_gain - 1) * estimated_BDP, which with BBRStartupCwndGain = 2 is approximately (2 - 1) * estimated_BDP = 1 * estimated_BDP. The immediately following Drain state is designed to quickly drain that queue.
During Startup, BBR estimates whether the pipe is full using two estimators. The first looks for a plateau in the BBR.max_bw estimate. The second looks for packet loss. The following subsections discuss these estimators.
BBRCheckStartupDone():
  BBRCheckStartupFullBandwidth()
  BBRCheckStartupHighLoss()
  if (BBR.state == Startup and BBR.filled_pipe)
    BBREnterDrain()
4.3.1.2. Exiting Startup Based on Bandwidth Plateau
During Startup, BBR estimates whether the pipe is full by looking for a plateau in the BBR.max_bw estimate. The output of this "full pipe" estimator is tracked in BBR.filled_pipe, a boolean that records whether BBR estimates that it has ever fully utilized its available bandwidth ("filled the pipe"). If BBR notices that there are several (three) rounds where attempts to double the delivery rate actually result in little increase (less than 25 percent), then it estimates that it has reached BBR.max_bw, sets BBR.filled_pipe to true, exits Startup and enters Drain.
Upon connection initialization the full pipe estimator runs:
BBRInitFullPipe():
  BBR.filled_pipe = false
  BBR.full_bw = 0
  BBR.full_bw_count = 0
Once per round trip, upon an ACK that acknowledges new data, and when the delivery rate sample is not application-limited (see [draft-cheng-iccrg-delivery-rate-estimation]), BBR runs the "full pipe" estimator, if needed:
BBRCheckStartupFullBandwidth():
  if BBR.filled_pipe or !BBR.round_start or rs.is_app_limited
    return  /* no need to check for a full pipe now */
  if (BBR.max_bw >= BBR.full_bw * 1.25)  /* still growing? */
    BBR.full_bw = BBR.max_bw    /* record new baseline level */
    BBR.full_bw_count = 0
    return
  BBR.full_bw_count++  /* another round w/o much growth */
  if (BBR.full_bw_count >= 3)
    BBR.filled_pipe = true
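The plateau detector can be exercised with a minimal Python sketch. The class and helper names below are hypothetical, and the per-round BBR.max_bw values are made-up inputs; the logic mirrors the pseudocode above under the assumption that every sample arrives at a round start and is not application-limited.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FullPipeEstimator:
    filled_pipe: bool = False
    full_bw: float = 0.0
    full_bw_count: int = 0

    def check(self, max_bw: float, round_start: bool,
              is_app_limited: bool) -> None:
        if self.filled_pipe or not round_start or is_app_limited:
            return  # no need to check for a full pipe now
        if max_bw >= self.full_bw * 1.25:  # still growing by 25% or more
            self.full_bw = max_bw          # record new baseline level
            self.full_bw_count = 0
            return
        self.full_bw_count += 1            # another round without much growth
        if self.full_bw_count >= 3:
            self.filled_pipe = True


def rounds_until_full(max_bw_per_round: Iterable[float]) -> Optional[int]:
    """Return the 1-based round at which the pipe is judged full, if any."""
    estimator = FullPipeEstimator()
    for round_number, max_bw in enumerate(max_bw_per_round, start=1):
        estimator.check(max_bw, round_start=True, is_app_limited=False)
        if estimator.filled_pipe:
            return round_number
    return None


# Bandwidth doubles, then plateaus at 100 units from round 5 onward.
exit_round = rounds_until_full([10, 20, 40, 80, 100, 100, 100, 100])
```

In this trace the baseline is last raised in round 5, so the three rounds without 25% growth are rounds 6 through 8.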
BBR waits three rounds to have solid evidence that the sender is not detecting a delivery-rate plateau that was temporarily imposed by the receive window. Allowing three rounds provides time for the receiver's receive-window auto-tuning to open up the receive window and for the BBR sender to realize that BBR.max_bw should be higher: in the first round the receive-window auto-tuning algorithm grows the receive window; in the second round the sender fills the higher receive window; in the third round the sender gets higher delivery-rate samples. This three-round threshold was validated by YouTube experimental data.
4.3.1.3. Exiting Startup Based on Packet Loss
A second method BBR uses for estimating whether the bottleneck is full is to look for sustained packet loss, specifically a case where the following criteria are all met:
- The connection has been in fast recovery for at least one full round trip.
- The loss rate over the time scale of a single full round trip exceeds BBRLossThresh (2%).
- There are at least BBRStartupFullLossCnt=3 discontiguous sequence ranges lost in that round trip.
If these criteria are all met, then BBRCheckStartupHighLoss() sets BBR.filled_pipe = true and exits Startup and enters Drain.
The algorithm waits until all three criteria are met to filter out noise from burst losses, and to try to ensure the bottleneck is fully utilized on a sustained basis, and the full bottleneck bandwidth has been measured, before attempting to drain the level of in-flight data to the estimated BDP.
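The three criteria can be combined as in the following Python sketch. The function signature is hypothetical, and computing the loss rate as packets lost divided by packets sent over the round is an assumption made for illustration; the text above does not define the exact denominator.

```python
BBR_LOSS_THRESH = 0.02         # BBRLossThresh (2%)
BBR_STARTUP_FULL_LOSS_CNT = 3  # BBRStartupFullLossCnt


def startup_high_loss(rounds_in_recovery: int,
                      lost_packets: int,
                      sent_packets: int,
                      loss_ranges: int) -> bool:
    """Return True only if all three Startup loss-exit criteria hold."""
    if sent_packets <= 0:
        return False
    # Criterion 1: in fast recovery for at least one full round trip.
    in_recovery_full_round = rounds_in_recovery >= 1
    # Criterion 2: loss rate over the round exceeds the threshold
    # (assumed here to be lost / sent).
    loss_rate = lost_packets / sent_packets
    # Criterion 3: enough discontiguous lost sequence ranges.
    enough_ranges = loss_ranges >= BBR_STARTUP_FULL_LOSS_CNT
    return (in_recovery_full_round
            and loss_rate > BBR_LOSS_THRESH
            and enough_ranges)
```

A single burst that loses many packets in one contiguous range fails the third criterion, which is how the check filters out burst-loss noise.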
4.3.2. Drain
Upon exiting Startup, BBR enters its Drain state. In Drain, BBR aims to quickly drain any queue created in Startup by switching to a pacing_gain well below 1.0, until any estimated queue has been drained. It uses a pacing_gain that is the inverse of the value used during Startup, chosen to try to drain the queue in one round [BBRDrainPacingGain]:
BBREnterDrain():
  BBR.state = Drain
  BBR.pacing_gain = 1/BBRStartupCwndGain  /* pace slowly */
  BBR.cwnd_gain = BBRStartupCwndGain      /* maintain cwnd */
In Drain, when the amount of data in flight is less than or equal to the estimated BDP, meaning BBR estimates that the queue has been fully drained, then BBR exits Drain and enters ProbeBW. To implement this, upon every ACK BBR executes:
BBRCheckDrain():
  if (BBR.state == Drain and packets_in_flight <= BBRInflight(1.0))
    BBREnterProbeBW()  /* BBR estimates the queue was drained */
4.3.3. ProbeBW
Long-lived BBR flows tend to spend the vast majority of their time in the ProbeBW states. In the ProbeBW states, a BBR flow sequentially accelerates, decelerates, and cruises, to measure the network path, improve its operating point (increase throughput and reduce queue pressure), and converge toward a more fair allocation of bottleneck bandwidth. To do this, the flow sequentially cycles through all three tactics: trying to send faster than, slower than, and at the same rate as the network delivery process. To achieve this, a BBR flow in ProbeBW mode cycles through the four ProbeBW states - DOWN, CRUISE, REFILL, and UP - described below in turn.
4.3.3.1. ProbeBW_DOWN
In the ProbeBW_DOWN phase of the cycle, a BBR flow pursues the deceleration tactic, to try to send slower than the network is delivering data, to reduce the amount of data in flight, with all of the standard motivations for the deceleration tactic (discussed in "State Machine Tactics", above). It does this by switching to a BBR.pacing_gain of 0.9, sending at 90% of BBR.bw. The pacing_gain value of 0.9 is derived based on the ProbeBW_UP pacing gain of 1.25, as the minimum pacing_gain value that allows bandwidth-based convergence to approximate fairness.
Exit conditions: The flow exits this phase and enters CRUISE when the flow estimates that both of the following conditions have been met:
- There is free headroom: If inflight_hi is set, then BBR remains in DOWN at least until the volume of in-flight data is less than or equal to BBRHeadroom*BBR.inflight_hi. The goal of this constraint is to ensure that in cases where loss signals suggest an upper limit on the volume of in-flight data, then the flow attempts to leave some free headroom in the path (e.g. free space in the bottleneck buffer or free time slots in the bottleneck link) that can be used by cross traffic (both for volume-based convergence of bandwidth shares and for burst tolerance).
- The volume of in-flight data is less than or equal to BBR.BDP, i.e. the flow estimates that it has drained any queue at the bottleneck.
4.3.3.2. ProbeBW_CRUISE
In the ProbeBW_CRUISE phase of the cycle, a BBR flow pursues the "cruising" tactic (discussed in "State Machine Tactics", above), attempting to send at the same rate the network is delivering data. It tries to match the sending rate to the flow's current available bandwidth, to try to achieve high utilization of the available bandwidth without increasing queue pressure. It does this by switching to a pacing_gain of 1.0, sending at 100% of BBR.bw. Notably, while in this state it responds to concrete congestion signals (loss) by reducing BBR.bw_lo and BBR.inflight_lo, because these signals suggest that the available bandwidth and deliverable volume of in-flight data have likely reduced, and the flow needs to change to adapt, slowing down to match the latest delivery process.
Exit conditions: The connection adaptively holds this state until it decides that it is time to probe for bandwidth, at which time it enters ProbeBW_REFILL (see "Time Scale for Bandwidth Probing", below).
4.3.3.3. ProbeBW_REFILL
The goal of the ProbeBW_REFILL state is to "refill the pipe", to try to fully utilize the network bottleneck without creating any significant queue pressure.
To do this, BBR first resets the short-term model parameters bw_lo and inflight_lo, setting both to "Infinity". This is the key moment in the BBR time scale strategy (see "Time Scale Strategy", above) where the flow pivots, discarding its conservative short-term bw_lo and inflight_lo parameters and beginning to robustly probe the bottleneck's long-term available bandwidth. During this time bw_hi and inflight_hi, if set, constrain the connection.
During ProbeBW_REFILL BBR uses a BBR.pacing_gain of 1.0, to send at a rate that matches the current estimated available bandwidth, for one packet-timed round trip. The goal is to fully utilize the bottleneck link before transitioning into ProbeBW_UP and significantly increasing the chances of causing a loss signal. The motivating insight is that, as soon as a flow starts acceleration, sending faster than the available bandwidth, it will start building a queue at the bottleneck. And if the buffer is shallow enough, then the flow can cause loss signals very shortly after the first accelerating packets arrive at the bottleneck. If the flow were to neglect to fill the pipe before it causes this loss signal, then these very quick signals of excess queue could cause the flow's estimate of the path's capacity (i.e. inflight_hi) to be a significant underestimate. In particular, if the flow were to transition directly from ProbeBW_CRUISE to ProbeBW_UP, the volume of in-flight data (at the time the first accelerating packets were sent) may often be still very close to the volume of in-flight data maintained in CRUISE, which may be only BBRHeadroom*inflight_hi.
Exit conditions: The flow exits ProbeBW_REFILL after one packet-timed round trip, and enters UP. This is because after one full round trip of sending in ProbeBW_REFILL the flow (if not application-limited) has had an opportunity to place as many packets in flight as its BBR.bw estimate permits. And correspondingly, at this point the flow starts to see bandwidth samples reflecting its ProbeBW_REFILL behavior, which may be putting too much data in flight.
4.3.3.4. ProbeBW_UP
After ProbeBW_REFILL refills the pipe, ProbeBW_UP probes for possible increases in available bandwidth by using a BBR.pacing_gain of 1.25, sending faster than the current estimated available bandwidth.
If the flow has not set BBR.inflight_hi or BBR.bw_hi, it tries to raise the volume of in-flight data to at least BBR.pacing_gain * BBR.bdp = 1.25 * BBR.bdp; note that this may take more than BBR.min_rtt if BBR.min_rtt is small (e.g. on a LAN).
If the flow has set BBR.inflight_hi or BBR.bw_hi, it moves to an operating point based on those limits and then gradually increases the upper volume bound (BBR.inflight_hi) and rate bound (BBR.bw_hi) using the following approach:
- bw_hi: The flow raises bw_hi to the latest measured bandwidth sample if the latest measured bandwidth sample is above bw_hi and the loss rate for the sample is not above the BBRLossThresh.
- inflight_hi: The flow raises inflight_hi in ProbeBW_UP in a manner that is slow and cautious at first, but increasingly rapid and bold over time. The initial caution is motivated by the fact that a given BBR flow may be sharing a shallow buffer with thousands of other flows, so that the buffer space available to the flow may be quite tight - even just a single packet. The increasingly rapid growth over time is motivated by the fact that in a high-speed WAN the increase in available bandwidth (and thus the estimated BDP) may require the flow to grow the volume of its inflight data by up to O(1,000,000); even a quite typical BDP like 10Gbps * 100ms is 82,563 packets. BBR takes an approach where the additive increase to BBR.inflight_hi exponentially doubles each round trip; in each successive round trip, inflight_hi grows by 1, 2, 4, 8, 16, etc, with the increases spread uniformly across the entire round trip. This helps allow BBR to utilize a larger BDP in O(log(BDP)) round trips, meeting the design goal for scalable utilization of newly-available bandwidth.
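The figures in the inflight_hi discussion can be checked with a short Python sketch. The helper names are hypothetical, and the 1514-byte packet size is an assumption taken from the packet size used in the T_bbr derivation later in this document.

```python
PACKET_BYTES = 1514  # assumed packet size, in bytes


def bdp_packets(bandwidth_bps: float, rtt_sec: float,
                packet_bytes: int = PACKET_BYTES) -> int:
    """Bandwidth-delay product, rounded to whole packets."""
    return round(bandwidth_bps * rtt_sec / 8 / packet_bytes)


def rounds_to_grow(packets_needed: int) -> int:
    """Rounds needed when per-round increments are 1, 2, 4, 8, ..."""
    grown = 0
    rounds = 0
    while grown < packets_needed:
        grown += 1 << rounds  # increment doubles each round trip
        rounds += 1
    return rounds


wan_bdp = bdp_packets(10e9, 0.100)       # 10 Gbps * 100 ms
rounds_needed = rounds_to_grow(wan_bdp)  # rounds to add a full BDP
```

Because the cumulative growth after k rounds is 2^k - 1 packets, adding a BDP's worth of packets takes a number of rounds that is logarithmic in the BDP.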
Exit conditions: The BBR flow ends ProbeBW_UP bandwidth probing and transitions to ProbeBW_DOWN to try to drain the bottleneck queue when any of the following conditions are met:
4.3.3.5. Time Scale for Bandwidth Probing
Choosing the time scale for probing bandwidth is tied to the question of how to coexist with legacy Reno/CUBIC flows, since probing for bandwidth runs a significant risk of causing packet loss, and causing packet loss can significantly limit the throughput of such legacy Reno/CUBIC flows.
4.3.3.5.1. Bandwidth Probing and Coexistence with Reno/CUBIC
BBR has an explicit strategy for coexistence with Reno/CUBIC: to try to behave in a manner so that Reno/CUBIC flows coexisting with BBR can continue to work well in the primary contexts where they do today:
- Intra-datacenter/LAN traffic: we want Reno/CUBIC to be able to perform well in 100M through 40G enterprise and datacenter Ethernet.
- Public Internet last mile traffic: we want Reno/CUBIC to be able to support up to 25Mbps (for 4K Video) at an RTT of 30ms, typical parameters for common CDNs for large video services.
The challenge in meeting these goals is that Reno/CUBIC need long periods of no loss to utilize large BDPs. The good news is that in the environments where Reno/CUBIC work well today (mentioned above), the BDPs are small, roughly 100 packets or less.
4.3.3.5.2. A Dual-Time-Scale Approach for Coexistence
The BBR strategy has several aspects:
- The highest priority is to estimate the bandwidth available to the BBR flow in question.
- Secondarily, a given BBR flow adapts (within bounds) the frequency at which it probes bandwidth and knowingly risks packet loss, to allow Reno/CUBIC to reach a bandwidth at least as high as that given BBR flow.
To adapt the frequency of bandwidth probing, BBR considers two time scales: a BBR-native time scale, T_bbr, measured in wall-clock time, and a bounded Reno-conscious time scale, T_reno, measured in round trips.
This dual-time-scale approach is similar to that used by CUBIC, which has a CUBIC-native time scale given by a cubic curve, and a "Reno emulation" module that estimates what cwnd would give the flow Reno-equivalent throughput. At any given moment, CUBIC chooses the cwnd implied by the more aggressive strategy.
We randomize both the T_bbr and T_reno parameters, for better mixing and fairness convergence.
4.3.3.5.3. Design Considerations for Choosing Constant Parameters
We design the maximum wall-clock bounds of BBR-native inter-bandwidth-probe wall clock time, T_bbr, to be:
- Higher than 2 sec, to try to avoid causing loss for a long enough time to allow a Reno flow with RTT=30ms to get 25Mbps (4K video) throughput. For this workload, given the Reno sawtooth that raises cwnd from roughly BDP to 2*BDP, one MSS per round trip, the inter-bandwidth-probe time must be at least the BDP in packets times the RTT: BDP * RTT = (25Mbps * 0.030 sec / (1514 bytes)) * 0.030 sec = 1.9 secs
- Lower than 3 sec to ensure flows can start probing in a reasonable amount of time to discover unutilized bw on human-scale interactive time-scales (e.g. perhaps traffic from a competing web page download is now complete).
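The lower bound on T_bbr can be reproduced with a small Python calculation. The function name is hypothetical; the sketch assumes the Reno flow adds one 1514-byte packet to cwnd per round trip, so growing from BDP to 2*BDP takes one round trip per packet of BDP.

```python
def reno_regrowth_time_sec(rate_bps: float, rtt_sec: float,
                           packet_bytes: int = 1514) -> float:
    """Time for Reno to grow cwnd from BDP to 2*BDP, one packet per RTT."""
    bdp_pkts = rate_bps * rtt_sec / 8 / packet_bytes  # BDP in packets
    return bdp_pkts * rtt_sec  # one round trip per added packet


# 25 Mbps at a 30 ms RTT, the 4K video case discussed above.
t_min = reno_regrowth_time_sec(25e6, 0.030)
```

For this workload the BDP is about 62 packets, so the regrowth takes about 62 round trips of 30 ms each, a little under 2 seconds.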
The maximum round-trip bounds of the Reno-coexistence time scale, T_reno, are chosen to be 62-63 with the following considerations in mind:
- Choosing a value smaller than roughly 60 would imply that when BBR flows coexisted with Reno/CUBIC flows (e.g. Netflix Reno flows) on public Internet broadband links, the Reno/CUBIC flows would not be able to achieve enough bandwidth to show 4K video.
- Choosing a value larger than roughly 65 would prevent BBR from reaching its goal of tolerating 1% loss per round trip. Given that the steady-state (non-bandwidth-probing) BBR response to a round trip with X% packet loss is to reduce the sending rate by X% (see the "Updating the Model Upon Packet Loss" section), the BBR sending rate after N rounds of packet loss at a rate loss_rate is reduced to (1 - loss_rate)^N of its original value. This means that a flow that encounters 1% loss in each of 65 round trips of ProbeBW_CRUISE, and then doubles its cwnd (back to BBR.inflight_hi) in ProbeBW_REFILL and ProbeBW_UP, will be able to restore and reprobe its original sending rate, since: BBR.max_bw * (1 - loss_rate)^N * 2 = BBR.max_bw * (1 - .01)^65 * 2 ~= 1.04 * BBR.max_bw. That is, the flow will be able to fully respond to packet loss signals in ProbeBW_CRUISE while also fully re-measuring its maximum achievable throughput in ProbeBW_UP.
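The arithmetic behind the upper bound of roughly 65 rounds can be checked with the following Python sketch. The function names are hypothetical, and the regrowth factor of 2 encodes the assumption stated above that REFILL and UP together double the cwnd.

```python
def rate_after_loss_rounds(loss_rate: float, rounds: int) -> float:
    """Fraction of the original rate left after multiplicative cuts."""
    return (1.0 - loss_rate) ** rounds


def reprobe_ratio(loss_rate: float, rounds: int,
                  regrowth: float = 2.0) -> float:
    """Rate reachable after regrowth, relative to the original rate."""
    return rate_after_loss_rounds(loss_rate, rounds) * regrowth


ratio_65 = reprobe_ratio(0.01, 65)  # about 1.04: original rate recoverable
ratio_70 = reprobe_ratio(0.01, 70)  # below 1.0: doubling no longer suffices
```

At 1% loss per round, a doubling still recovers the original rate after 65 rounds but not after 70, which is why waiting much longer than 65 rounds between probes conflicts with the loss tolerance goal.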
The resulting behavior is that for BBR flows with small BDPs, the bandwidth probing will be on roughly the same time scale as Reno/CUBIC; flows with large BDPs will intentionally probe more rapidly/frequently than Reno/CUBIC would (roughly every 62 round trips for low-RTT flows, or 2-3 secs for high-RTT flows).
The considerations above for timing bandwidth probing can be implemented as follows:
/* Is it time to transition from DOWN or CRUISE to REFILL? */
BBRCheckTimeToProbeBW():
  if (BBRHasElapsedInPhase(BBR.bw_probe_wait) ||
      BBRIsRenoCoexistenceProbeTime())
    BBRStartProbeBW_REFILL()
    return true
  return false

/* Randomized decision about how long to wait until
 * probing for bandwidth, using round count and wall clock.
 */
BBRPickProbeWait():
  /* Decide random round-trip bound for wait: */
  BBR.rounds_since_bw_probe =
    random_int_between(0, 1); /* 0 or 1 */
  /* Decide the random wall clock bound for wait: */
  BBR.bw_probe_wait =
    2sec + random_float_between(0.0, 1.0) /* 0..1 sec */

BBRIsRenoCoexistenceProbeTime():
  reno_rounds = BBRTargetInflight()
  rounds = min(reno_rounds, 63)
  return BBR.rounds_since_bw_probe >= rounds

/* How much data do we want in flight?
 * Our estimated BDP, unless congestion cut cwnd. */
BBRTargetInflight():
  return min(BBR.bdp, cwnd)
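The randomized wait and the Reno-coexistence check can be sketched in Python as follows. The function names and the choice to pass state as plain arguments are illustrative assumptions; units are packets for inflight quantities and seconds for the wall-clock wait.

```python
import random
from typing import Tuple


def pick_probe_wait(rng: random.Random) -> Tuple[int, float]:
    """Return (initial round count, wall-clock wait in seconds)."""
    rounds_since_bw_probe = rng.randint(0, 1)        # 0 or 1
    bw_probe_wait_sec = 2.0 + rng.uniform(0.0, 1.0)  # 2..3 sec
    return rounds_since_bw_probe, bw_probe_wait_sec


def is_reno_coexistence_probe_time(rounds_since_bw_probe: int,
                                   bdp_packets: int,
                                   cwnd_packets: int) -> bool:
    """True once enough rounds have passed for a Reno flow to regrow."""
    target_inflight = min(bdp_packets, cwnd_packets)
    rounds = min(target_inflight, 63)  # cap the Reno time scale
    return rounds_since_bw_probe >= rounds


start_rounds, wait_sec = pick_probe_wait(random.Random(1))
```

Starting the round counter at a random 0 or 1 against a fixed cap of 63 is what yields an effective wait of 62 or 63 rounds for large-BDP flows.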
4.3.3.6. ProbeBW Algorithm Details
BBR's ProbeBW algorithm operates as follows.
Upon entering ProbeBW, BBR executes:
BBREnterProbeBW():
BBRStartProbeBW_DOWN()
The core logic for entering each state:
BBRStartProbeBW_DOWN():
  BBRResetCongestionSignals()
  BBR.probe_up_cnt = Infinity /* not growing inflight_hi */
  BBRPickProbeWait()
  BBR.cycle_stamp = Now()  /* start wall clock */
  BBR.ack_phase  = ACKS_PROBE_STOPPING
  BBRStartRound()
  BBR.state = ProbeBW_DOWN

BBRStartProbeBW_CRUISE():
  BBR.state = ProbeBW_CRUISE

BBRStartProbeBW_REFILL():
  BBRResetLowerBounds()
  BBR.bw_probe_up_rounds = 0
  BBR.bw_probe_up_acks = 0
  BBR.ack_phase = ACKS_REFILLING
  BBRStartRound()
  BBR.state = ProbeBW_REFILL

BBRStartProbeBW_UP():
  BBR.ack_phase = ACKS_PROBE_STARTING
  BBRStartRound()
  BBR.cycle_stamp = Now() /* start wall clock */
  BBR.state = ProbeBW_UP
  BBRRaiseInflightHiSlope()
BBR executes the following BBRUpdateProbeBWCyclePhase() logic on each ACK that ACKs or SACKs new data, to advance the ProbeBW state machine:
/* The core state machine logic for ProbeBW: */
BBRUpdateProbeBWCyclePhase():
  if (!BBR.filled_pipe)
    return  /* only handling steady-state behavior here */
  BBRAdaptUpperBounds()
  if (!IsInAProbeBWState())
    return /* only handling ProbeBW states here: */

  switch (state)

  ProbeBW_DOWN:
    if (BBRCheckTimeToProbeBW())
      return /* already decided state transition */
    if (BBRCheckTimeToCruise())
      BBRStartProbeBW_CRUISE()

  ProbeBW_CRUISE:
    if (BBRCheckTimeToProbeBW())
      return /* already decided state transition */

  ProbeBW_REFILL:
    /* After one round of REFILL, start UP */
    if (BBR.round_start)
      BBR.bw_probe_samples = 1
      BBRStartProbeBW_UP()

  ProbeBW_UP:
    if (BBRHasElapsedInPhase(BBR.min_rtt) and
        inflight > BBRInflight(BBR.max_bw, 1.25))
      BBRStartProbeBW_DOWN()
The ancillary logic to implement the ProbeBW state machine:
IsInAProbeBWState():
  state = BBR.state
  return (state == ProbeBW_DOWN or
          state == ProbeBW_CRUISE or
          state == ProbeBW_REFILL or
          state == ProbeBW_UP)

/* Time to transition from DOWN to CRUISE? */
BBRCheckTimeToCruise():
  if (inflight > BBRInflightWithHeadroom())
    return false /* not enough headroom */
  if (inflight <= BBRInflight(BBR.max_bw, 1.0))
    return true  /* inflight <= estimated BDP */
  return false   /* inflight still above estimated BDP */

BBRHasElapsedInPhase(interval):
  return Now() > BBR.cycle_stamp + interval

/* Return a volume of data that tries to leave free
 * headroom in the bottleneck buffer or link for
 * other flows, for fairness convergence and lower
 * RTTs and loss */
BBRInflightWithHeadroom():
  if (BBR.inflight_hi == Infinity)
    return Infinity
  headroom = max(1, BBRHeadroom * BBR.inflight_hi)
  return max(BBR.inflight_hi - headroom,
             BBRMinPipeCwnd)

/* Raise inflight_hi slope if appropriate. */
BBRRaiseInflightHiSlope():
  growth_this_round = 1MSS << BBR.bw_probe_up_rounds
  BBR.bw_probe_up_rounds = min(BBR.bw_probe_up_rounds + 1, 30)
  BBR.probe_up_cnt = max(cwnd / growth_this_round, 1)

/* Increase inflight_hi if appropriate. */
BBRProbeInflightHiUpward():
  if (!is_cwnd_limited or cwnd < BBR.inflight_hi)
    return  /* not fully using inflight_hi, so don't grow it */
  BBR.bw_probe_up_acks += rs.newly_acked
  if (BBR.bw_probe_up_acks >= BBR.probe_up_cnt)
    delta = BBR.bw_probe_up_acks / BBR.probe_up_cnt
    BBR.bw_probe_up_acks -= delta * BBR.probe_up_cnt
    BBR.inflight_hi += delta
  if (BBR.round_start)
    BBRRaiseInflightHiSlope()

/* Track ACK state and update BBR.max_bw window and
 * BBR.inflight_hi and BBR.bw_hi. */
BBRAdaptUpperBounds():
  if (BBR.ack_phase == ACKS_PROBE_STARTING and BBR.round_start)
    /* starting to get bw probing samples */
    BBR.ack_phase = ACKS_PROBE_FEEDBACK
  if (BBR.ack_phase == ACKS_PROBE_STOPPING and BBR.round_start)
    /* end of samples from bw probing phase */
    if (IsInAProbeBWState() and !rs.is_app_limited)
      BBRAdvanceMaxBwFilter()

  if (!CheckInflightTooHigh())
    /* Loss rate is safe. Adjust upper bounds upward. */
    if (BBR.inflight_hi == Infinity or BBR.bw_hi == Infinity)
      return /* no upper bounds to raise */
    if (rs.tx_in_flight > BBR.inflight_hi)
      BBR.inflight_hi = rs.tx_in_flight
    if (rs.delivery_rate > BBR.bw_hi)
      BBR.bw_hi = rs.delivery_rate
    if (BBR.state == ProbeBW_UP)
      BBRProbeInflightHiUpward()
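The headroom calculation and the DOWN-to-CRUISE decision can be exercised with the following Python sketch. The function names are hypothetical. BBRHeadroom and BBRMinPipeCwnd are defined elsewhere in the specification, so the values 0.15 and 4 used in the example are illustrative assumptions only.

```python
import math


def inflight_with_headroom(inflight_hi: float, headroom_frac: float,
                           min_pipe_cwnd: float) -> float:
    """Volume of in-flight data that leaves headroom below inflight_hi."""
    if math.isinf(inflight_hi):
        return math.inf  # no upper bound has been set yet
    headroom = max(1.0, headroom_frac * inflight_hi)
    return max(inflight_hi - headroom, min_pipe_cwnd)


def time_to_cruise(inflight: float, inflight_hi: float, bdp: float,
                   headroom_frac: float, min_pipe_cwnd: float) -> bool:
    """True when both DOWN exit conditions hold."""
    limit = inflight_with_headroom(inflight_hi, headroom_frac,
                                   min_pipe_cwnd)
    if inflight > limit:
        return False        # not enough headroom yet
    return inflight <= bdp  # queue estimated to be drained


# Assumed example values: inflight_hi of 100 packets, 15% headroom.
limit_pkts = inflight_with_headroom(100.0, 0.15, 4.0)
```

With these assumed values the flow must bring in-flight data down to 85 packets, and also to at most the estimated BDP, before it starts cruising.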
4.3.4. ProbeRTT
4.3.4.1. ProbeRTT Overview
To help probe for BBR.min_rtt, on an as-needed basis BBR flows enter the ProbeRTT state to try to cooperate to periodically drain the bottleneck queue - and thus improve their BBR.min_rtt estimate of the unloaded two-way propagation delay.
A critical point is that before BBR raises its BBR.min_rtt estimate (which would in turn raise its maximum permissible cwnd), it first enters ProbeRTT to try to make a concerted and coordinated effort to drain the bottleneck queue and make a robust BBR.min_rtt measurement. This allows the BBR.min_rtt estimates of ensembles of BBR flows to converge avoiding feedback loops of ever-increasing queues and RTT samples.
The ProbeRTT state works in concert with BBR.min_rtt estimation. Up to once every ProbeRTTInterval = 5 seconds, the flow enters ProbeRTT, decelerating by setting its cwnd_gain to BBRProbeRTTCwndGain = 0.5 to reduce its volume of inflight data to half of its estimated BDP, to try to allow the flow to measure the unloaded two-way propagation delay.
There are two main motivations for making the MinRTTFilterLen roughly twice the ProbeRTTInterval. First, this ensures that during a ProbeRTT episode the flow will "remember" the BBR.min_rtt value it measured during the previous ProbeRTT episode, providing a robust bdp estimate for the cwnd = 0.5*bdp calculation, increasing the likelihood of fully draining the bottleneck queue. Second, this allows the flow's BBR.min_rtt filter window to generally include RTT samples from two ProbeRTT episodes, providing a more robust estimate.
The algorithm for ProbeRTT is as follows:
Entry conditions: In any state other than ProbeRTT itself, if the BBR.probe_rtt_min_delay estimate has not been updated (i.e., by getting a lower RTT measurement) for more than ProbeRTTInterval = 5 seconds, then BBR enters ProbeRTT and reduces the BBR.cwnd_gain to BBRProbeRTTCwndGain = 0.5.
Exit conditions: After maintaining the volume of in-flight data at BBRProbeRTTCwndGain*BBR.bdp for at least ProbeRTTDuration (200 ms) and at least one round trip, BBR leaves ProbeRTT and transitions to ProbeBW if it estimates the pipe was filled already, or Startup otherwise.
4.3.4.2. ProbeRTT Design Rationale
BBR is designed to have ProbeRTT sacrifice no more than roughly 2% of a flow's available bandwidth. It is also designed to spend the vast majority of its time (at least roughly 96 percent) in ProbeBW and the rest in ProbeRTT, based on a set of tradeoffs. ProbeRTT lasts long enough (at least ProbeRTTDuration = 200 ms) to allow flows with different RTTs to have overlapping ProbeRTT states, while still being short enough to bound the throughput penalty of ProbeRTT's cwnd capping to roughly 2%, with the average throughput targeted at:
throughput = (200ms*0.5*BBR.bw + (5s - 200ms)*BBR.bw) / 5s = (.1s + 4.8s)/5s * BBR.bw = 0.98 * BBR.bw
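The throughput target above can be recomputed with a few lines of Python. The function name is hypothetical, and the sketch assumes the flow sends at 0.5 * BBR.bw for the whole ProbeRTTDuration and at BBR.bw for the rest of each ProbeRTTInterval.

```python
def probe_rtt_throughput_fraction(interval_sec: float = 5.0,
                                  duration_sec: float = 0.2,
                                  cwnd_gain: float = 0.5) -> float:
    """Average throughput as a fraction of BBR.bw over one interval."""
    reduced = duration_sec * cwnd_gain    # time in ProbeRTT, at half rate
    normal = interval_sec - duration_sec  # remaining time, at full rate
    return (reduced + normal) / interval_sec


fraction = probe_rtt_throughput_fraction()  # about 0.98
time_in_probe_rtt = 0.2 / 5.0               # about 4% of wall-clock time
```

The 4% of time spent in ProbeRTT costs only about 2% of throughput because the flow keeps sending at roughly half rate during the episode.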
As discussed above, BBR's BBR.min_rtt filter window, MinRTTFilterLen, and time interval between ProbeRTT states, ProbeRTTInterval, work in concert. BBR uses a MinRTTFilterLen equal to or longer than ProbeRTTInterval to allow the filter window to include at least one ProbeRTT.
To allow coordination with other BBR flows, each flow MUST use the standard ProbeRTTInterval of 5 secs.
A ProbeRTTInterval of 5 secs is short enough to allow quick convergence if traffic levels or routes change, but long enough so that interactive applications (e.g., Web, remote procedure calls, video chunks) often have natural silences or low-rate periods within the window where the flow's rate is low enough for long enough to drain its queue in the bottleneck. Then the BBR.probe_rtt_min_delay filter opportunistically picks up these measurements, and the BBR.probe_rtt_min_delay estimate refreshes without requiring ProbeRTT. This way, flows typically need only pay the 2 percent throughput penalty if there are multiple bulk flows busy sending over the entire ProbeRTTInterval window.
As an optimization, when restarting from idle and finding that the BBR.probe_rtt_min_delay has expired, BBR does not enter ProbeRTT; the idleness is deemed a sufficient attempt to coordinate to drain the queue.
4.3.4.3. Calculating the rs.rtt RTT Sample
Upon transmitting each packet, BBR (or the associated transport protocol) stores in per-packet data the wall-clock scheduled transmission time of the packet in packet.departure_time (see the "Pacing Rate: BBR.pacing_rate" section for how this is calculated).
For every ACK that newly acknowledges some data (whether cumulatively or selectively), the sender's BBR implementation (or the associated transport protocol implementation) attempts to calculate an RTT sample. The sender MUST consider any potential retransmission ambiguities that can arise in some transport protocols. If some of the acknowledged data was not retransmitted, or some of the data was retransmitted but the sender can still unambiguously determine the RTT of the data (e.g. if the transport supports [RFC7323] TCP timestamps or an equivalent mechanism), then the sender calculates an RTT sample, rs.rtt, as follows:
rs.rtt = Now() - packet.departure_time
4.3.4.4. ProbeRTT Logic
On every ACK BBR executes BBRUpdateMinRTT() to update its ProbeRTT scheduling state (BBR.probe_rtt_min_delay and BBR.probe_rtt_min_stamp) and its BBR.min_rtt estimate:
BBRUpdateMinRTT():
  BBR.probe_rtt_expired =
    Now() > BBR.probe_rtt_min_stamp + ProbeRTTInterval
  if (rs.rtt >= 0 and
      (rs.rtt < BBR.probe_rtt_min_delay or
       BBR.probe_rtt_expired))
    BBR.probe_rtt_min_delay = rs.rtt
    BBR.probe_rtt_min_stamp = Now()

  min_rtt_expired =
    Now() > BBR.min_rtt_stamp + MinRTTFilterLen
  if (BBR.probe_rtt_min_delay < BBR.min_rtt or
      min_rtt_expired)
    BBR.min_rtt       = BBR.probe_rtt_min_delay
    BBR.min_rtt_stamp = BBR.probe_rtt_min_stamp
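The interaction between the two filters is easier to follow with a small Python sketch driven by explicit timestamps. The class name is hypothetical, and the MinRTTFilterLen value of 10 seconds is an assumption based on the statement above that it is roughly twice the ProbeRTTInterval.

```python
import math
from dataclasses import dataclass

PROBE_RTT_INTERVAL = 5.0   # ProbeRTTInterval, in seconds
MIN_RTT_FILTER_LEN = 10.0  # MinRTTFilterLen, in seconds (assumed value)


@dataclass
class MinRttState:
    probe_rtt_min_delay: float = math.inf
    probe_rtt_min_stamp: float = 0.0
    min_rtt: float = math.inf
    min_rtt_stamp: float = 0.0
    probe_rtt_expired: bool = False

    def update(self, now: float, rtt: float) -> None:
        """Apply one RTT sample taken at time `now` (seconds)."""
        self.probe_rtt_expired = (
            now > self.probe_rtt_min_stamp + PROBE_RTT_INTERVAL)
        if rtt >= 0 and (rtt < self.probe_rtt_min_delay
                         or self.probe_rtt_expired):
            self.probe_rtt_min_delay = rtt
            self.probe_rtt_min_stamp = now
        min_rtt_expired = now > self.min_rtt_stamp + MIN_RTT_FILTER_LEN
        if self.probe_rtt_min_delay < self.min_rtt or min_rtt_expired:
            self.min_rtt = self.probe_rtt_min_delay
            self.min_rtt_stamp = self.probe_rtt_min_stamp


state = MinRttState()
state.update(now=1.0, rtt=0.040)  # new minimum: both filters take 40 ms
state.update(now=7.0, rtt=0.060)  # short filter expired, takes 60 ms;
                                  # min_rtt still remembers 40 ms
```

In this trace the short-term filter refreshes after 5 seconds while BBR.min_rtt keeps the older, lower value until its own longer window expires.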
Here BBR.probe_rtt_expired is a boolean recording whether the BBR.probe_rtt_min_delay has expired and is due for a refresh, via either an application idle period or a transition into ProbeRTT state.
On every ACK BBR executes BBRCheckProbeRTT() to handle the steps related to the ProbeRTT state as follows:
BBRCheckProbeRTT():
  if (BBR.state != ProbeRTT and
      BBR.probe_rtt_expired and
      not BBR.idle_restart)
    BBREnterProbeRTT()
    BBRSaveCwnd()
    BBR.probe_rtt_done_stamp = 0
    BBR.ack_phase = ACKS_PROBE_STOPPING
    BBRStartRound()
  if (BBR.state == ProbeRTT)
    BBRHandleProbeRTT()
  if (rs.delivered > 0)
    BBR.idle_restart = false

BBREnterProbeRTT():
  BBR.state = ProbeRTT
  BBR.pacing_gain = 1
  BBR.cwnd_gain = BBRProbeRTTCwndGain  /* 0.5 */

BBRHandleProbeRTT():
  /* Ignore low rate samples during ProbeRTT: */
  MarkConnectionAppLimited()
  if (BBR.probe_rtt_done_stamp == 0 and
      packets_in_flight <= BBRProbeRTTCwnd())
    /* Wait for at least ProbeRTTDuration to elapse: */
    BBR.probe_rtt_done_stamp =
      Now() + ProbeRTTDuration
    /* Wait for at least one round to elapse: */
    BBR.probe_rtt_round_done = false
    BBRStartRound()
  else if (BBR.probe_rtt_done_stamp != 0)
    if (BBR.round_start)
      BBR.probe_rtt_round_done = true
    if (BBR.probe_rtt_round_done)
      BBRCheckProbeRTTDone()

BBRCheckProbeRTTDone():
  if (BBR.probe_rtt_done_stamp != 0 and
      Now() > BBR.probe_rtt_done_stamp)
    /* schedule next ProbeRTT: */
    BBR.probe_rtt_min_stamp = Now()
    BBRRestoreCwnd()
    BBRExitProbeRTT()

MarkConnectionAppLimited():
  C.app_limited =
    (C.delivered + packets_in_flight) ? : 1
4.3.4.5. Exiting ProbeRTT
When exiting ProbeRTT, BBR transitions to ProbeBW if it estimates the pipe was filled already, or Startup otherwise.
When transitioning out of ProbeRTT, BBR calls BBRResetLowerBounds() to reset the lower bounds, since any congestion encountered in ProbeRTT may have pulled the short-term model far below the capacity of the path.
But the algorithm is cautious in timing the next bandwidth probe: raising inflight after ProbeRTT may cause loss, so the algorithm resets the bandwidth-probing clock by starting the cycle at ProbeBW_DOWN(). Then, as an optimization, since the connection is exiting ProbeRTT, we know that inflight is already below the estimated BDP, so the connection can proceed immediately to ProbeBW_CRUISE.
To summarize, the logic for exiting ProbeRTT is as follows:
BBRExitProbeRTT():
  BBRResetLowerBounds()
  if (BBR.filled_pipe)
    BBRStartProbeBW_DOWN()
    BBRStartProbeBW_CRUISE()
  else
    BBREnterStartup()
4.4. Restarting From Idle
4.4.1. Setting Pacing Rate in ProbeBW
When restarting from idle in ProbeBW states, BBR leaves its cwnd as-is and paces packets at exactly BBR.bw, aiming to return as quickly as possible to its target operating point of rate balance and a full pipe. Specifically, if the flow's BBR.state is ProbeBW, and the flow is application-limited, and there are no packets in flight currently, then at the moment the flow sends one or more packets BBR sets BBR.pacing_rate to exactly BBR.bw. The precise steps, which BBR takes in BBRHandleRestartFromIdle() before sending a packet for a flow, are given in the "Logic" subsection below.
The "Restarting Idle Connections" section of [RFC5681] suggests restarting from idle by slow-starting from the initial window. However, this approach was assuming a congestion control algorithm that had no estimate of the bottleneck bandwidth and no pacing, and thus resorted to relying on slow-starting driven by an ACK clock. The long (log_2(BDP)*RTT) delays required to reach full utilization with that "slow start after idle" approach caused many large deployments to disable this mechanism, resulting in a "BDP-scale line-rate burst" approach instead. Instead of these two approaches, BBR restarts by pacing at BBR.bw, typically achieving approximate rate balance and a full pipe after only one BBR.min_rtt has elapsed.
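To make the comparison concrete, the following sketch estimates how long a window-doubling restart would take to reach the BDP, next to the single BBR.min_rtt that a paced restart typically needs. It is a back-of-the-envelope model under stated assumptions: the function names are hypothetical, the initial window of 10 packets and the 1500-byte packet size are illustrative, and the round count is measured from the initial window, i.e. log_2(BDP / initial window).

```python
import math


def slow_start_rounds(bdp_packets: int, initial_window: int = 10) -> int:
    """Round trips of window doubling needed to grow from initial_window to the BDP."""
    if bdp_packets <= initial_window:
        return 0
    return math.ceil(math.log2(bdp_packets / initial_window))


def restart_delay_estimates(bw_bps: float, min_rtt_s: float, packet_bytes: int = 1500):
    """Return (bdp_packets, slow_start_delay_s, paced_restart_delay_s)."""
    bdp_packets = math.ceil(bw_bps * min_rtt_s / (8 * packet_bytes))
    slow_start_delay = slow_start_rounds(bdp_packets) * min_rtt_s
    paced_restart_delay = min_rtt_s  # pacing at BBR.bw refills the pipe in about one RTT
    return bdp_packets, slow_start_delay, paced_restart_delay
```

For a path with 100 Mbps of bandwidth and a 40 ms round-trip time, this model gives a BDP of 334 packets and six doubling rounds (about 240 ms) for slow start after idle, versus about 40 ms for a restart paced at BBR.bw.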
4.4.2. Checking for ProbeRTT Completion
As an optimization, when restarting from idle BBR checks to see if the connection is in ProbeRTT and has met the exit conditions for ProbeRTT. If a connection goes idle during ProbeRTT then often it will have met those exit conditions by the time it restarts, so that the connection can restore the cwnd to its full value before it starts transmitting a new flight of data.
4.4.3. Logic
The BBR algorithm takes the following steps in BBRHandleRestartFromIdle() before sending a packet for a flow:
BBRHandleRestartFromIdle():
  if (packets_in_flight == 0 and C.app_limited)
    BBR.idle_restart = true
    BBR.extra_acked_interval_start = Now()
    if (IsInAProbeBWState())
      BBRSetPacingRateWithGain(1)
    else if (BBR.state == ProbeRTT)
      BBRCheckProbeRTTDone()
4.5. Updating Network Path Model Parameters
BBR is a model-based congestion control algorithm: it is based on an explicit model of the network path over which a transport flow travels. The following is a summary of each parameter, including its meaning and how the algorithm calculates and uses its value. The parameters fall into three groups:
- core state machine parameters
- parameters to model the data rate
- parameters to model the volume of in-flight data
4.5.1. BBR.round_count: Tracking Packet-Timed Round Trips
Several aspects of the BBR algorithm depend on counting the progress of "packet-timed" round trips, which start at the transmission of some segment, and then end at the acknowledgement of that segment. BBR.round_count is a count of the number of these "packet-timed" round trips elapsed so far. BBR uses this virtual BBR.round_count because it is more robust than using wall clock time. In particular, arbitrary intervals of wall clock time can elapse due to application idleness, variations in RTTs, or timer delays for retransmission timeouts, causing wall-clock-timed model parameter estimates to "time out" or to be "forgotten" too quickly to provide robustness.
BBR counts packet-timed round trips by recording state about a sentinel packet, and waiting for an ACK of any data packet that was sent after that sentinel packet, using the following pseudocode:
Upon connection initialization:
BBRInitRoundCounting():
  BBR.next_round_delivered = 0
  BBR.round_start = false
  BBR.round_count = 0
Upon sending each packet, the rate estimation algorithm [draft-cheng-iccrg-delivery-rate-estimation] records the amount of data thus far acknowledged as delivered:
packet.delivered = C.delivered
Upon receiving an ACK for a given data packet, the rate estimation algorithm [draft-cheng-iccrg-delivery-rate-estimation] updates the amount of data thus far acknowledged as delivered:
C.delivered += packet.size
Upon receiving an ACK for a given data packet, the BBR algorithm first executes BBRUpdateRound() to check whether a round trip has elapsed and, if so, increments the count of such round trips elapsed.
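The pseudocode for the round-trip check itself is not reproduced in this section, so the following Python sketch is an assumed reconstruction from the state initialized in BBRInitRoundCounting() and the packet.delivered / C.delivered bookkeeping described above. The class and method names are hypothetical; the rule it illustrates is that a round ends when an ACK arrives for a packet that was sent after the previous round started.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    size: int
    delivered: int = 0  # snapshot of C.delivered taken when the packet was sent


class RoundCounter:
    """Hypothetical holder for C.delivered and the BBR round-counting state."""

    def __init__(self) -> None:
        self.delivered = 0             # C.delivered
        self.next_round_delivered = 0  # BBR.next_round_delivered
        self.round_start = False       # BBR.round_start
        self.round_count = 0           # BBR.round_count

    def on_send(self, size: int) -> Packet:
        # packet.delivered = C.delivered
        return Packet(size=size, delivered=self.delivered)

    def on_ack(self, packet: Packet) -> None:
        # C.delivered += packet.size
        self.delivered += packet.size
        # Assumed round check: the ACKed packet was sent at or after the
        # point where the current round started.
        if packet.delivered >= self.next_round_delivered:
            self.next_round_delivered = self.delivered
            self.round_count += 1
            self.round_start = True
        else:
            self.round_start = False
```

With three packets sent back to back, only the first ACK starts a new round; the count advances again only when a packet sent after that ACK is itself acknowledged.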
4.5.2. BBR.max_bw: Estimated Maximum Bandwidth
BBR.max_bw is BBR's estimate of the maximum bottleneck bandwidth available to data transmissions for the transport flow. At any time, a transport connection's data transmissions experience some slowest link or bottleneck. The bottleneck's delivery rate determines the connection's maximum data-delivery rate. BBR tries to closely match its sending rate to this bottleneck delivery rate to help seek "rate balance", where the flow's packet arrival rate at the bottleneck equals the departure rate. The bottleneck rate varies over the life of a connection, so BBR continually estimates BBR.max_bw using recent signals.
4.5.2.1. Delivery Rate Samples for Estimating BBR.max_bw
Since calculating delivery rate samples is subtle, and the samples are useful independent of congestion control, the approach BBR uses for measuring each single delivery rate sample is specified in a separate Internet Draft [draft-cheng-iccrg-delivery-rate-estimation].
4.5.2.2. BBR.max_bw Max Filter
Delivery rate samples are often below the typical bottleneck bandwidth available to the flow, due to "noise" introduced by random variation in physical transmission processes (e.g. radio link layer noise) or queues along the network path. To filter these effects BBR uses a max filter: BBR estimates BBR.max_bw using the windowed maximum recent delivery rate sample seen by the connection over recent history.
The BBR.max_bw max filter window covers a time period extending over the past two ProbeBW cycles. The BBR.max_bw max filter window length is driven by trade-offs among several considerations:
- It is long enough to cover at least one entire ProbeBW cycle (see the "ProbeBW" section). This ensures that the window contains at least some delivery rate samples that are the result of data transmitted with a super-unity pacing_gain (a pacing_gain larger than 1.0). Such super-unity delivery rate samples are instrumental in revealing the path's underlying available bandwidth even when there is noise from delivery rate shortfalls due to aggregation delays, queuing delays from variable cross-traffic, lossy link layers with uncorrected losses, or short-term buffer exhaustion (e.g., brief coincident bursts in a shallow buffer).
- It aims to be long enough to cover short-term fluctuations in the network's delivery rate due to the aforementioned sources of noise. In particular, the delivery rate for radio link layers (e.g., wifi and cellular technologies) can be highly variable, and the filter window needs to be long enough to remember "good" delivery rate samples in order to be robust to such variations.
- It aims to be short enough to respond in a timely manner to sustained reductions in the bandwidth available to a flow, whether this is because other flows are using a larger share of the bottleneck, or the bottleneck link service rate has reduced due to layer 1 or layer 2 changes, policy changes, or routing changes. In any of these cases, existing BBR flows traversing the bottleneck should, in a timely manner, reduce their BBR.max_bw estimates and thus pacing rate and in-flight data, in order to match the sending behavior to the new available bandwidth.
4.5.2.3. BBR.max_bw and Application-limited Delivery Rate Samples
Transmissions can be application-limited, meaning the transmission rate is limited by the application rather than the congestion control algorithm. This is quite common because of request/response traffic. When there is a transmission opportunity but no data to send, the delivery rate sampler marks the corresponding bandwidth sample(s) as application-limited [draft-cheng-iccrg-delivery-rate-estimation]. The BBR.max_bw estimator carefully decides which samples to include in the bandwidth model to ensure that BBR.max_bw reflects network limits, not application limits. By default, the estimator discards application-limited samples, since by definition they reflect application limits. However, the estimator does use application-limited samples if the measured delivery rate happens to be larger than the current BBR.max_bw estimate, since this indicates the current BBR.max_bw estimate is too low.
4.5.2.4. Updating the BBR.max_bw Max Filter
For every ACK that acknowledges some data packets as delivered, BBR invokes BBRUpdateMaxBw() to update the BBR.max_bw estimator as follows (here rs.delivery_rate is the delivery rate sample obtained from the ACK that is being processed, as specified in [draft-cheng-iccrg-delivery-rate-estimation]):
BBRUpdateMaxBw():
  BBRUpdateRound()
  if (rs.delivery_rate >= BBR.max_bw || !rs.is_app_limited)
    BBR.max_bw = update_windowed_max_filter(
                   filter=BBR.MaxBwFilter,
                   value=rs.delivery_rate,
                   time=BBR.cycle_count,
                   window_length=MaxBwFilterLen)
4.5.2.5. Tracking Time for the BBR.max_bw Max Filter
BBR tracks time for the BBR.max_bw filter window using a virtual (non-wall-clock) time tracked by counting the cyclical progression through ProbeBW cycles. Each time through the ProbeBW cycle, one round trip after exiting ProbeBW_UP (the point at which the flow has its best chance to measure the highest throughput of the cycle), BBR increments BBR.cycle_count, the virtual time used by the BBR.max_bw filter window. Note that BBR.cycle_count only needs to be tracked with a single bit, since the BBR.max_bw filter only needs to track samples from two time slots: the previous ProbeBW cycle and the current ProbeBW cycle:
BBRAdvanceMaxBwFilter():
BBR.cycle_count++
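One possible way to realize such a two-slot filter is sketched below in Python. This is an assumed implementation, not the definition of update_windowed_max_filter(): the class and method names are hypothetical, and the slot layout (one maximum per cycle, indexed by the low bit of the cycle count) is just one design that satisfies the "previous cycle plus current cycle" window described above. It also applies the application-limited rule from BBRUpdateMaxBw().

```python
class TwoSlotMaxFilter:
    """Windowed max over the current and previous ProbeBW cycle (assumed layout)."""

    def __init__(self) -> None:
        self.slots = [0.0, 0.0]  # per-cycle maxima, indexed by cycle_count % 2
        self.cycle_count = 0

    def advance(self) -> None:
        """Start a new cycle; the slot being reused held the oldest cycle."""
        self.cycle_count += 1
        self.slots[self.cycle_count % 2] = 0.0

    def update(self, delivery_rate: float, is_app_limited: bool) -> float:
        """Offer one delivery rate sample and return the filtered max_bw."""
        max_bw = max(self.slots)
        # App-limited samples only count if they raise the estimate.
        if delivery_rate >= max_bw or not is_app_limited:
            i = self.cycle_count % 2
            self.slots[i] = max(self.slots[i], delivery_rate)
        return max(self.slots)
```

Because only the parity of the cycle count selects a slot, a single bit of cycle state is sufficient in this layout, which matches the observation in the text.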
4.5.3. BBR.min_rtt: Estimated Minimum Round-Trip Time
BBR.min_rtt is BBR's estimate of the round-trip propagation delay of the path over which a transport connection is sending. The path's round-trip propagation delay determines the minimum amount of time over which the connection must be willing to sustain transmissions at the BBR.bw rate, and thus the minimum amount of data needed in-flight, for the connection to reach full utilization (a "Full Pipe"). The round-trip propagation delay can vary over the life of a connection, so BBR continually estimates BBR.min_rtt using recent round-trip delay samples.
4.5.3.1. Round-Trip Time Samples for Estimating BBR.min_rtt
For every data packet a connection sends, BBR calculates an RTT sample that measures the time interval from sending a data packet until that packet is acknowledged.
For the most part, the same considerations and mechanisms that apply to RTT estimation for the purposes of retransmission timeout calculations [RFC6298] apply to BBR RTT samples. Namely, BBR does not use RTT samples based on the transmission time of retransmitted packets, since these are ambiguous, and thus unreliable. Also, BBR calculates RTT samples using both cumulative and selective acknowledgments (if the transport supports [RFC2018] SACK options or an equivalent mechanism), or transport-layer timestamps (if the transport supports [RFC7323] TCP timestamps or an equivalent mechanism).
The only divergence from RTT estimation for retransmission timeouts is in the case where a given acknowledgment ACKs more than one data packet. In order to be conservative and schedule long timeouts to avoid spurious retransmissions, the maximum among such potential RTT samples is typically used for computing retransmission timeouts; i.e., SRTT is typically calculated using the data packet with the earliest transmission time. By contrast, in order for BBR to try to reach the minimum amount of data in flight to fill the pipe, BBR uses the minimum among such potential RTT samples; i.e., BBR calculates the RTT using the data packet with the latest transmission time.
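The difference between the two conventions can be shown with a small sketch. The function name and the input shape (a list of send times with a retransmission flag) are hypothetical and chosen only for illustration; times are in milliseconds.

```python
def rtt_samples_for_ack(ack_time_ms, acked_packets):
    """Candidate RTT samples for one ACK covering several packets.

    acked_packets: iterable of (send_time_ms, was_retransmitted) tuples.
    Retransmitted packets are skipped because their RTT is ambiguous.
    Returns None when no unambiguous sample is available.
    """
    candidates = [ack_time_ms - sent
                  for sent, retransmitted in acked_packets
                  if not retransmitted]
    if not candidates:
        return None
    return {
        "rto_style": max(candidates),  # earliest-sent packet, conservative
        "bbr_style": min(candidates),  # latest-sent packet, used by BBR
    }
```

For an ACK arriving at t = 150 ms that covers packets sent at 100 ms and 110 ms, the retransmission-timeout convention yields 50 ms while BBR uses 40 ms.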
4.5.3.2. BBR.min_rtt Min Filter
RTT samples tend to be above the round-trip propagation delay of the path, due to "noise" introduced by random variation in physical transmission processes (e.g. radio link layer noise), queues along the network path, the receiver's delayed ack strategy, ack aggregation, etc. Thus to filter out these effects BBR uses a min filter: BBR estimates BBR.min_rtt using the minimum recent RTT sample seen by the connection over the past MinRTTFilterLen seconds. (Many of the same network effects that can decrease delivery rate measurements can increase RTT samples, which is why BBR's min-filtering approach for RTTs is the complement of its max-filtering approach for delivery rates.)
The length of the BBR.min_rtt min filter window is MinRTTFilterLen = 10 secs. This is driven by trade-offs among several considerations:
- The MinRTTFilterLen is longer than ProbeRTTInterval, so that it covers an entire ProbeRTT cycle (see the "ProbeRTT" section below). This helps ensure that the window can contain RTT samples that are the result of data transmitted with inflight below the estimated BDP of the flow. Such RTT samples are important for helping to reveal the path's underlying two-way propagation delay even when the aforementioned "noise" effects can often obscure it.
- The MinRTTFilterLen aims to be long enough to avoid needing to cut in-flight and throughput often. Measuring two-way propagation delay requires in-flight to be at or below BDP, which risks some amount of underutilization, so BBR uses a filter window long enough that such underutilization events can be rare.
- The MinRTTFilterLen aims to be long enough that many applications have a "natural" moment of silence or low utilization that can cut in-flight below BDP and naturally serve to refresh the BBR.min_rtt, without requiring BBR to force an artificial cut in in-flight. This applies to many popular applications, including Web, RPC, chunked audio or video traffic.
- The MinRTTFilterLen aims to be short enough to respond in a timely manner to real increases in the two-way propagation delay of the path, e.g. due to route changes, which are expected to typically happen on longer time scales.
A BBR implementation MAY use a generic windowed min filter to track BBR.min_rtt. However, a significant savings in space and improvement in freshness can be achieved by integrating the BBR.min_rtt estimation into the ProbeRTT state machine, so this document discusses that approach in the ProbeRTT section.
4.5.4. BBR.offload_budget
BBR.offload_budget is the estimate of the minimum volume of data necessary to achieve full throughput using sender (TSO/GSO) and receiver (LRO, GRO) host offload mechanisms, computed as follows:
BBRUpdateOffloadBudget():
  BBR.offload_budget = 3 * BBR.send_quantum
4.5.5. BBR.extra_acked
BBR.extra_acked is a volume of data that is the estimate of the recent degree of aggregation in the network path. For each ACK, the algorithm computes a sample of the estimated extra ACKed data beyond the amount of data that the sender expected to be ACKed over the timescale of a round-trip, given the BBR.bw. Then it computes BBR.extra_acked as the windowed maximum sample over the last BBRExtraAckedFilterLen=10 packet-timed round-trips. If the ACK rate falls below the expected bandwidth, then the algorithm estimates an aggregation episode has terminated, and resets the sampling interval to start from the current time.
The BBR.extra_acked thus reflects the recently-measured magnitude of data and ACK aggregation effects such as batching and slotting at shared-medium L2 hops (wifi, cellular, DOCSIS), as well as end-host offload mechanisms (TSO, GSO, LRO, GRO), and end host or middlebox ACK decimation/thinning.
BBR augments its cwnd by BBR.extra_acked to allow the connection to keep sending during inter-ACK silences, to an extent that matches the recently measured degree of aggregation.
More precisely, this is computed as:
BBRUpdateACKAggregation():
  /* Find excess ACKed beyond expected amount over this interval */
  interval = (Now() - BBR.extra_acked_interval_start)
  expected_delivered = BBR.bw * interval
  /* Reset interval if ACK rate is below expected rate: */
  if (BBR.extra_acked_delivered <= expected_delivered)
    BBR.extra_acked_delivered = 0
    BBR.extra_acked_interval_start = Now()
    expected_delivered = 0
  BBR.extra_acked_delivered += rs.newly_acked
  extra = BBR.extra_acked_delivered - expected_delivered
  extra = min(extra, cwnd)
  BBR.extra_acked =
    update_windowed_max_filter(
      filter=BBR.ExtraACKedFilter,
      value=extra,
      time=BBR.round_count,
      window_length=BBRExtraAckedFilterLen)
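The following Python sketch walks through the same steps with explicit inputs, which can help when checking the behaviour by hand. The class name and argument names are hypothetical, and the per-round dictionary used for the windowed maximum is an assumed stand-in for update_windowed_max_filter(), whose internals are not specified here. Any consistent units work (for example bytes and milliseconds, with bw in bytes per millisecond).

```python
class AckAggregationEstimator:
    """Illustrative version of BBRUpdateACKAggregation() with explicit inputs."""

    def __init__(self, now, filter_len=10):
        self.interval_start = now     # BBR.extra_acked_interval_start
        self.delivered = 0            # BBR.extra_acked_delivered
        self.filter_len = filter_len  # BBRExtraAckedFilterLen, in rounds
        self.samples = {}             # round_count -> max extra seen in that round

    def on_ack(self, now, newly_acked, bw, cwnd, round_count):
        """Process one ACK and return the updated extra_acked estimate."""
        expected = bw * (now - self.interval_start)
        # ACK rate at or below the expected rate: the aggregation episode ended.
        if self.delivered <= expected:
            self.delivered = 0
            self.interval_start = now
            expected = 0
        self.delivered += newly_acked
        extra = min(self.delivered - expected, cwnd)
        # Assumed windowed max: keep one maximum per round, drop old rounds.
        self.samples[round_count] = max(self.samples.get(round_count, 0), extra)
        oldest = round_count - self.filter_len + 1
        self.samples = {r: v for r, v in self.samples.items() if r >= oldest}
        return max(self.samples.values())
```

A burst of ACKs arriving faster than BBR.bw raises the estimate, a quiet period resets the sampling interval, and the peak ages out once it is more than BBRExtraAckedFilterLen rounds old.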
4.5.6. Updating the Model Upon Packet Loss
In every state, BBR responds to (filtered) congestion signals, including loss. The response to those congestion signals depends on the flow's current state, since the information that the flow can infer depends on what the flow was doing when the flow experienced the signal.
4.5.6.1. Probing for Bandwidth In Startup
In Startup, if the congestion signals meet the Startup exit criteria, the flow exits Startup and enters Drain.
4.5.6.2. Probing for Bandwidth In ProbeBW
BBR searches for the maximum volume of data that can be sensibly placed in-flight in the network. A key precondition is that the flow is actually trying robustly to find that operating point. To implement this, when a flow is in ProbeBW, and an ACK covers data sent in one of the accelerating phases (REFILL or UP), and the ACK indicates that the loss rate over the past round trip exceeds the queue pressure objective, and the flow is not application limited, and has not yet responded to congestion signals from the most recent REFILL or UP phase, then the flow estimates that the volume of data it allowed in flight exceeded what matches the current delivery process on the path, and reduces BBR.inflight_hi:
/* Do loss signals suggest inflight is too high?
 * If so, react. */
CheckInflightTooHigh():
  if (IsInflightTooHigh(rs))
    if (BBR.bw_probe_samples)
      BBRHandleInflightTooHigh()
    return true  /* inflight too high */
  else
    return false /* inflight not too high */

IsInflightTooHigh():
  return (rs.lost > rs.tx_in_flight * BBRLossThresh)

BBRHandleInflightTooHigh():
  BBR.bw_probe_samples = 0;  /* only react once per bw probe */
  if (!rs.is_app_limited)
    BBR.inflight_hi = max(rs.tx_in_flight,
                          BBRTargetInflight() * BBRBeta)
  if (BBR.state == ProbeBW_UP)
    BBRStartProbeBW_DOWN()
Here rs.tx_in_flight is the amount of data that was estimated to be in flight when the most recently ACKed packet was sent. And the BBRBeta (0.7x) bound is to try to ensure that BBR does not react more dramatically than CUBIC's 0.7x multiplicative decrease factor.
Some loss detection algorithms, including algorithms like RACK [RFC8985] that delay loss marking while waiting for potential reordering to resolve, may mark packets as lost long after the loss itself happened. In such cases, the tx_in_flight for the delivered sequence range that allowed the loss to be detected may be considerably smaller than the tx_in_flight of the lost packet itself. In such cases using the former tx_in_flight rather than the latter can cause BBR.inflight_hi to be significantly underestimated. To avoid such issues, BBR processes each loss detection event to more precisely estimate the volume of in-flight data at which loss rates cross BBRLossThresh, noting that this may have happened mid-way through some packet. To estimate this value, we can solve for "lost_prefix" in the following equation, where inflight_prev represents the volume of in-flight data preceding this packet, lost_prev represents the data lost among that previous in-flight data:
lost / inflight >= BBRLossThresh
(lost_prev + lost_prefix) / (inflight_prev + lost_prefix) >= BBRLossThresh
/* solving for lost_prefix we arrive at: */
lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) / (1 - BBRLossThresh)
BBRHandleLostPacket(packet):
  if (!BBR.bw_probe_samples)
    return /* not a packet sent while probing bandwidth */
  rs.tx_in_flight = packet.tx_in_flight /* inflight at transmit */
  rs.lost = C.lost - packet.lost /* data lost since transmit */
  rs.is_app_limited = packet.is_app_limited;
  if (IsInflightTooHigh(rs))
    rs.tx_in_flight = BBRInflightHiFromLostPacket(rs, packet)
    BBRHandleInflightTooHigh(rs)

/* At what prefix of packet did losses exceed BBRLossThresh? */
BBRInflightHiFromLostPacket(rs, packet):
  size = packet.size
  /* What was in flight before this packet? */
  inflight_prev = rs.tx_in_flight - size
  /* What was lost before this packet? */
  lost_prev = rs.lost - size
  lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) /
                (1 - BBRLossThresh)
  /* At what inflight value did losses cross BBRLossThresh? */
  inflight = inflight_prev + lost_prefix
  return inflight
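A short numeric check of the lost_prefix formula may help. The sketch below assumes BBRLossThresh = 2%, a value that is not stated in this section, and uses hypothetical function names; all quantities are in the same unit (packets here).

```python
BBR_LOSS_THRESH = 0.02  # ASSUMED value for BBRLossThresh (2%)


def is_inflight_too_high(lost, tx_in_flight, thresh=BBR_LOSS_THRESH):
    """True when the loss rate over the sample exceeds the threshold."""
    return lost > tx_in_flight * thresh


def inflight_hi_from_lost_packet(tx_in_flight, lost, packet_size,
                                 thresh=BBR_LOSS_THRESH):
    """Inflight level at which losses crossed the threshold, possibly mid-packet."""
    inflight_prev = tx_in_flight - packet_size  # in flight before this packet
    lost_prev = lost - packet_size              # lost before this packet
    lost_prefix = (thresh * inflight_prev - lost_prev) / (1 - thresh)
    return inflight_prev + lost_prefix
```

For example, with tx_in_flight = 500, lost = 15 and a lost aggregate of size 10, the values are inflight_prev = 490 and lost_prev = 5, so lost_prefix is (9.8 - 5) / 0.98, roughly 4.9, and the estimated crossing point is roughly 494.9. At that point the loss rate is (5 + 4.9) / 494.9, which is the 2% threshold.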
When not explicitly accelerating to probe for bandwidth (Drain, ProbeRTT, ProbeBW_DOWN, ProbeBW_CRUISE), BBR responds to loss by slowing down to some extent. This is because loss suggests that the available bandwidth and safe volume of in-flight data may have decreased recently, and the flow needs to adapt, slowing down toward the latest delivery process. BBR flows implement this response by reducing the short-term model parameters, BBR.bw_lo and BBR.inflight_lo.
When encountering packet loss when the flow is not probing for bandwidth, the strategy is to gradually adapt to the current measured delivery process (the rate and volume of data that is delivered through the network path over the last round trip). This applies generally: whether in fast recovery, RTO recovery, TLP recovery; whether application-limited or not.
There are two key parameters the algorithm tracks, to measure the current delivery process:
BBR.bw_latest: a 1-round-trip max of delivered bandwidth (rs.delivery_rate).
BBR.inflight_latest: a 1-round-trip max of delivered volume of data (rs.delivered).
Upon the ACK at the end of each round that encountered a newly-marked loss, the flow updates its model (bw_lo and inflight_lo) as follows:
bw_lo = max( bw_latest, BBRBeta * bw_lo )
inflight_lo = max( inflight_latest, BBRBeta * inflight_lo )
This logic can be represented as follows:
/* Near start of ACK processing: */
BBRUpdateLatestDeliverySignals():
  BBR.loss_round_start = 0
  BBR.bw_latest = max(BBR.bw_latest, rs.delivery_rate)
  BBR.inflight_latest = max(BBR.inflight_latest, rs.delivered)
  if (rs.prior_delivered >= BBR.loss_round_delivered)
    BBR.loss_round_delivered = C.delivered
    BBR.loss_round_start = 1

/* Near end of ACK processing: */
BBRAdvanceLatestDeliverySignals():
  if (BBR.loss_round_start)
    BBR.bw_latest = rs.delivery_rate
    BBR.inflight_latest = rs.delivered

BBRResetCongestionSignals():
  BBR.loss_in_round = 0
  BBR.bw_latest = 0
  BBR.inflight_latest = 0

/* Update congestion state on every ACK */
BBRUpdateCongestionSignals():
  BBRUpdateMaxBw()
  if (rs.losses > 0)
    BBR.loss_in_round = 1
  if (!BBR.loss_round_start)
    return /* wait until end of round trip */
  BBRAdaptLowerBoundsFromCongestion()
  BBR.loss_in_round = 0

/* Once per round-trip respond to congestion */
BBRAdaptLowerBoundsFromCongestion():
  if (BBRIsProbingBW())
    return
  if (BBR.loss_in_round)
    BBRInitLowerBounds()
    BBRLossLowerBounds()

/* Handle the first congestion episode in this cycle */
BBRInitLowerBounds():
  if (BBR.bw_lo == Infinity)
    BBR.bw_lo = BBR.max_bw
  if (BBR.inflight_lo == Infinity)
    BBR.inflight_lo = cwnd

/* Adjust model once per round based on loss */
BBRLossLowerBounds():
  BBR.bw_lo = max(BBR.bw_latest,
                  BBRBeta * BBR.bw_lo)
  BBR.inflight_lo = max(BBR.inflight_latest,
                        BBRBeta * BBR.inflight_lo)

BBRResetLowerBounds():
  BBR.bw_lo = Infinity
  BBR.inflight_lo = Infinity

BBRBoundBWForModel():
  BBR.bw = min(BBR.max_bw, BBR.bw_lo, BBR.bw_hi)
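As a worked example of the once-per-round adjustment, the sketch below combines BBRInitLowerBounds() and BBRLossLowerBounds() into a single pure function. The function name and its argument list are hypothetical, and BBRBeta = 0.7 is taken from the discussion of CUBIC's decrease factor earlier in this section.

```python
import math

BBR_BETA = 0.7  # BBRBeta, per the 0.7x bound described in the text


def adapt_lower_bounds(bw_lo, inflight_lo, bw_latest, inflight_latest,
                       max_bw, cwnd):
    """Return updated (bw_lo, inflight_lo) after a round that saw loss."""
    # First congestion episode in this cycle: start from the long-term model.
    if math.isinf(bw_lo):
        bw_lo = max_bw
    if math.isinf(inflight_lo):
        inflight_lo = cwnd
    # Cut by at most BBR_BETA per round, but never below what was just delivered.
    bw_lo = max(bw_latest, BBR_BETA * bw_lo)
    inflight_lo = max(inflight_latest, BBR_BETA * inflight_lo)
    return bw_lo, inflight_lo
```

Starting from unset (infinite) bounds with max_bw = 100 and cwnd = 200, a lossy round that delivered bw_latest = 60 and inflight_latest = 50 yields bounds of about 70 and 140. A second lossy round with bw_latest = 65 then yields bw_lo = 65, because the latest delivery rate is above 0.7 * 70 = 49.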
4.6. Updating Control Parameters
BBR uses three distinct but interrelated control parameters: pacing rate, send quantum, and congestion window (cwnd).
4.6.1. Summary of Control Behavior in the State Machine
The following table summarizes how BBR modulates the control parameters in each state. In the table below, the semantics of the columns are as follows:
- State: the state in the BBR state machine, as depicted in the "State Transition Diagram" section above.
- Tactic: The tactic chosen from the "State Machine Tactics" subsection above: "accel" refers to acceleration, "decel" to deceleration, and "cruise" to cruising.
- Pacing Gain: the value used for BBR.pacing_gain in the given state.
- Cwnd Gain: the value used for BBR.cwnd_gain in the given state.
- Rate Cap: the rate values applied as bounds on the BBR.max_bw value applied to compute BBR.bw.
- Volume Cap: the volume values applied as bounds on the BBR.max_inflight value to compute cwnd.
The control behavior can be summarized as follows. Upon processing each ACK, BBR uses the values in the table below to compute BBR.bw in BBRBoundBWForModel(), and the cwnd in BBRBoundCwndForModel():
+-----------------+--------+--------+------+--------+------------------+
| State           | Tactic | Pacing | Cwnd | Rate   | Volume           |
|                 |        | Gain   | Gain | Cap    | Cap              |
+-----------------+--------+--------+------+--------+------------------+
| Startup         | accel  | 2.77   | 2    |        |                  |
|                 |        |        |      |        |                  |
+-----------------+--------+--------+------+--------+------------------+
| Drain           | decel  | 0.5    | 2    | bw_hi, | inflight_hi,     |
|                 |        |        |      | bw_lo  | inflight_lo      |
+-----------------+--------+--------+------+--------+------------------+
| ProbeBW_DOWN    | decel  | 0.9    | 2    | bw_hi, | inflight_hi,     |
|                 |        |        |      | bw_lo  | inflight_lo      |
+-----------------+--------+--------+------+--------+------------------+
| ProbeBW_CRUISE  | cruise | 1.0    | 2    | bw_hi, | 0.85*inflight_hi |
|                 |        |        |      | bw_lo  | inflight_lo      |
+-----------------+--------+--------+------+--------+------------------+
| ProbeBW_REFILL  | accel  | 1.0    | 2    | bw_hi  | inflight_hi      |
|                 |        |        |      |        |                  |
+-----------------+--------+--------+------+--------+------------------+
| ProbeBW_UP      | accel  | 1.25   | 2    | bw_hi  | inflight_hi      |
|                 |        |        |      |        |                  |
+-----------------+--------+--------+------+--------+------------------+
| ProbeRTT        | decel  | 1.0    | 0.5  | bw_hi, | 0.85*inflight_hi |
|                 |        |        |      | bw_lo  | inflight_lo      |
+-----------------+--------+--------+------+--------+------------------+
4.6.2. Pacing Rate: BBR.pacing_rate
To help match the packet-arrival rate to the bottleneck bandwidth available to the flow, BBR paces data packets. Pacing enforces a maximum rate at which BBR schedules quanta of packets for transmission.
The sending host implements pacing by maintaining inter-quantum spacing at the time each packet is scheduled for departure, calculating the next departure time for a packet for a given flow (BBR.next_departure_time) as a function of the most recent packet size and the current pacing rate, as follows:
BBR.next_departure_time = max(Now(), BBR.next_departure_time)
packet.departure_time = BBR.next_departure_time
pacing_delay = packet.size / BBR.pacing_rate
BBR.next_departure_time = BBR.next_departure_time + pacing_delay
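The departure-time rule can be exercised with a small Python sketch. The Pacer class and its schedule() method are hypothetical names; the current time is passed in explicitly instead of calling Now(), and the pacing rate is in bytes per second.

```python
class Pacer:
    """Illustrative pacing clock following the next_departure_time rule."""

    def __init__(self, pacing_rate: float) -> None:
        self.pacing_rate = pacing_rate   # bytes per second
        self.next_departure_time = 0.0   # seconds

    def schedule(self, now: float, packet_size: int) -> float:
        """Return the departure time for a packet and advance the clock."""
        # Never schedule in the past: idle time is not banked as credit.
        self.next_departure_time = max(now, self.next_departure_time)
        departure_time = self.next_departure_time
        pacing_delay = packet_size / self.pacing_rate
        self.next_departure_time += pacing_delay
        return departure_time
```

At 1,000,000 bytes per second, three 1000-byte packets queued at t = 0 depart at 0, 1 and 2 ms. A packet queued after an idle period departs immediately, since taking the max with the current time prevents unused sending time from accumulating into a burst.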
To adapt to the bottleneck, in general BBR sets the pacing rate to be proportional to bw, with a dynamic gain, or scaling factor of proportionality, called pacing_gain.
When a BBR flow starts it has no bw estimate (bw is 0). So in this case it sets an initial pacing rate based on the transport sender implementation's initial congestion window ("InitialCwnd", e.g. from [RFC6928]), the initial SRTT (smoothed round-trip time) after the first non-zero RTT sample, and the initial pacing_gain:
BBRInitPacingRate():
  nominal_bandwidth = InitialCwnd / (SRTT ? SRTT : 1ms)
  BBR.pacing_rate = BBRStartupPacingGain * nominal_bandwidth
After initialization, on each data ACK BBR updates its pacing rate to be proportional to bw, as long as it estimates that it has filled the pipe (BBR.filled_pipe is true; see the "Startup" section for details), or doing so increases the pacing rate. Limiting the pacing rate updates in this way helps the connection probe robustly for bandwidth until it estimates it has reached its full available bandwidth ("filled the pipe"). In particular, this prevents the pacing rate from being reduced when the connection has only seen application-limited bandwidth samples. BBR updates the pacing rate on each ACK by executing the BBRSetPacingRate() step as follows:
BBRSetPacingRateWithGain(pacing_gain):
  rate = pacing_gain * bw * (100 - BBRPacingMarginPercent) / 100
  if (BBR.filled_pipe || rate > BBR.pacing_rate)
    BBR.pacing_rate = rate

BBRSetPacingRate():
  BBRSetPacingRateWithGain(BBR.pacing_gain)
To help drive the network toward lower queues and low latency while maintaining high utilization, the BBRPacingMarginPercent constant of 1 aims to cause BBR to pace at 1% below the bw, on average.
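The effect of the filled_pipe guard and the 1% margin can be seen in the following sketch, written as a pure function for easy checking. The function name and argument order are hypothetical; it returns the new pacing rate instead of mutating BBR state.

```python
BBR_PACING_MARGIN_PERCENT = 1  # BBRPacingMarginPercent


def set_pacing_rate_with_gain(pacing_rate, bw, pacing_gain, filled_pipe):
    """Return the pacing rate after one BBRSetPacingRateWithGain() step."""
    rate = pacing_gain * bw * (100 - BBR_PACING_MARGIN_PERCENT) / 100
    # Before the pipe is filled, the rate may only go up.
    if filled_pipe or rate > pacing_rate:
        return rate
    return pacing_rate
```

With a current pacing rate of 1000 and bw = 500 at gain 1.0, the candidate rate is 495: it is ignored while the pipe is not yet filled and adopted once it is. With bw = 2000 at gain 1.25, the candidate of 2475 is adopted in either case because it raises the rate.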
4.6.3. Send Quantum: BBR.send_quantum
In order to amortize per-packet overheads involved in the sending process (host CPU, NIC processing, and interrupt processing delays), high-performance transport sender implementations (e.g., Linux TCP) often schedule an aggregate containing multiple packets (multiple SMSS) worth of data as a single quantum (using TSO, GSO, or other offload mechanisms). The BBR congestion control algorithm makes this control decision explicitly, dynamically calculating a quantum control parameter that specifies the maximum size of these transmission aggregates. This decision is based on a trade-off:
- A smaller quantum is preferred at lower data rates because it results in shorter packet bursts, shorter queues, lower queueing delays, and lower rates of packet loss.
- A bigger quantum can be required at higher data rates because it results in lower CPU overheads at the sending and receiving hosts, which can ship larger amounts of data with a single trip through the networking stack.
On each ACK, BBR runs BBRSetSendQuantum() to update BBR.send_quantum as follows:
BBRSetSendQuantum():
  if (BBR.pacing_rate < 1.2 Mbps)
    floor = 1 * SMSS
  else
    floor = 2 * SMSS
  BBR.send_quantum = min(BBR.pacing_rate * 1ms, 64KBytes)
  BBR.send_quantum = max(BBR.send_quantum, floor)
A BBR implementation MAY use alternate approaches to select a BBR.send_quantum, as appropriate for the CPU overheads anticipated for senders and receivers, and buffering considerations anticipated in the network path. However, for the sake of the network and other users, a BBR implementation SHOULD attempt to use the smallest feasible quanta.
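To see how the default BBRSetSendQuantum() rule scales with rate, the sketch below evaluates it with explicit units. The function name is hypothetical, the pacing rate is taken in bits per second, the SMSS of 1460 bytes is an illustrative assumption, and 64KBytes is interpreted as 65536 bytes.

```python
def compute_send_quantum(pacing_rate_bps: float, smss: int = 1460) -> int:
    """Send quantum in bytes for a pacing rate given in bits per second."""
    floor = smss if pacing_rate_bps < 1_200_000 else 2 * smss
    one_ms_of_data = pacing_rate_bps / 8 / 1000  # bytes sent in 1 ms
    quantum = min(one_ms_of_data, 64 * 1024)
    return int(max(quantum, floor))
```

Under these assumptions the quantum is one SMSS at 1 Mbps, two SMSS at 10 Mbps (where 1 ms of data is only 1250 bytes), 12500 bytes at 100 Mbps, and capped at 65536 bytes at 10 Gbps.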
4.6.4. Congestion Window
The congestion window, or cwnd, controls the maximum volume of data BBR allows in flight in the network at any time. It is the maximum volume of in-flight data that the algorithm estimates is appropriate for matching the current network path delivery process, given all available signals in the model, at any time scale. BBR adapts the cwnd based on its model of the network path and the state machine's decisions about how to probe that path.
By default, BBR grows its cwnd to meet its BBR.max_inflight, which models what's required for achieving full throughput, and as such is scaled to adapt to the estimated BDP computed from its path model. But BBR's selection of cwnd is designed to explicitly trade off among competing considerations that dynamically adapt to various conditions. So in loss recovery BBR more conservatively adjusts its sending behavior based on more recent delivery samples, and if BBR needs to re-probe the current BBR.min_rtt of the path then it cuts its cwnd accordingly. The following sections describe the various considerations that impact cwnd.
4.6.4.1. Initial cwnd
BBR generally uses measurements to build a model of the network path and then adapts control decisions to the path based on that model. As such, the selection of the initial cwnd is considered to be outside the scope of the BBR algorithm, since at initialization there are no measurements yet upon which BBR can operate. Thus, at initialization, BBR uses the transport sender implementation's initial congestion window (e.g. from [RFC6928] for TCP).
4.6.4.2. Computing BBR.max_inflight
BBR.max_inflight is the upper bound on the volume of data BBR allows in flight. This bound is always in place, and dominates when all other considerations have been satisfied: the flow is not in loss recovery, does not need to probe BBR.min_rtt, and has accumulated confidence in its model parameters by receiving enough ACKs to gradually grow the current cwnd to meet the BBR.max_inflight.
On each ACK, BBR calculates the BBR.max_inflight in BBRUpdateMaxInflight() as follows:
BBRBDPMultiple(gain):
  if (BBR.min_rtt == Inf)
    return InitialCwnd /* no valid RTT samples yet */
  BBR.bdp = BBR.bw * BBR.min_rtt
  return gain * BBR.bdp

BBRQuantizationBudget(inflight):
  BBRUpdateOffloadBudget()
  inflight = max(inflight, BBR.offload_budget)
  inflight = max(inflight, BBRMinPipeCwnd)
  if (BBR.state == ProbeBW && BBR.cycle_idx == ProbeBW_UP)
    inflight += 2
  return inflight

BBRInflight(gain):
  inflight = BBRBDPMultiple(gain)
  return BBRQuantizationBudget(inflight)

BBRUpdateMaxInflight():
  BBRUpdateAggregationBudget()
  inflight = BBRBDPMultiple(BBR.cwnd_gain)
  inflight += BBR.extra_acked
  BBR.max_inflight = BBRQuantizationBudget(inflight)
The BBR.bdp term (the estimated BDP) tries to allow enough packets in flight to fully utilize the estimated BDP of the path, by allowing the flow to send at BBR.bw for a duration of BBR.min_rtt. Scaling up the BDP by BBR.cwnd_gain bounds in-flight data to a small multiple of the BDP, to handle common network and receiver behavior, such as delayed, stretched, or aggregated ACKs [A15]. The quantization budget applied by BBRQuantizationBudget() allows enough quanta in flight on the sending and receiving hosts to reach high throughput even in environments using offload mechanisms.
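To make the arithmetic concrete, here is a minimal Python sketch of the BBR.max_inflight computation. It works in bytes and makes several assumptions that are not stated in the text: SMSS is 1460 bytes, InitialCwnd is 10 * SMSS, the "+ 2" added while probing up means two segments, and the offload and aggregation budgets are supplied as plain inputs rather than computed. The PathModel class and the function names are hypothetical.

```python
import math
from dataclasses import dataclass

SMSS = 1460                 # assumed segment size, in bytes
MIN_PIPE_CWND = 4 * SMSS    # BBRMinPipeCwnd: 4 packets
INITIAL_CWND = 10 * SMSS    # assumed initial congestion window


@dataclass
class PathModel:
    bw: float                    # estimated bandwidth, bytes per second
    min_rtt: float               # seconds; math.inf when no RTT sample exists
    cwnd_gain: float
    extra_acked: float = 0       # aggregation budget, taken as an input here
    offload_budget: float = 0    # offload budget, taken as an input here
    probing_up: bool = False     # True in the ProbeBW_UP phase


def bdp_multiple(m: PathModel, gain: float) -> float:
    if math.isinf(m.min_rtt):
        return INITIAL_CWND      # no valid RTT samples yet
    return gain * m.bw * m.min_rtt


def quantization_budget(m: PathModel, inflight: float) -> float:
    inflight = max(inflight, m.offload_budget, MIN_PIPE_CWND)
    if m.probing_up:
        inflight += 2 * SMSS     # assumption: "+ 2" means two segments
    return inflight


def max_inflight(m: PathModel) -> float:
    inflight = bdp_multiple(m, m.cwnd_gain) + m.extra_acked
    return quantization_budget(m, inflight)
```

For example, with a bandwidth of 1,600,000 bytes per second, a min_rtt of 62.5 ms and a cwnd_gain of 2, the BDP is 100,000 bytes and the bound is 200,000 bytes plus whatever aggregation allowance is supplied.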
4.6.4.3. Minimum cwnd for Pipelining
For BBR.max_inflight, BBR imposes a floor of BBRMinPipeCwnd (4 packets, i.e. 4 * SMSS). This floor helps ensure that even at very low BDPs, and with a transport like TCP where a receiver may ACK only every alternate SMSS of data, there are enough packets in flight to maintain full pipelining. In particular BBR tries to allow at least 2 data packets in flight and ACKs for at least 2 data packets on the path from receiver to sender.
4.6.4.4. Modulating cwnd in Loss Recovery
BBR interprets loss as a hint that there may be recent changes in path behavior that are not yet fully reflected in its model of the path, and thus it needs to be more conservative.
Upon a retransmission timeout (RTO), BBR conservatively reduces cwnd to a value that will allow 1 SMSS to be transmitted. Then BBR gradually increases cwnd using the normal approach outlined below in "Core cwnd Adjustment Mechanism".
When a BBR sender detects packet loss but there are still packets in flight, on the first round of the loss-repair process BBR temporarily reduces the cwnd to match the current delivery rate as ACKs arrive. On second and later rounds of loss repair, it ensures the sending rate never exceeds twice the current delivery rate as ACKs arrive.
When BBR exits loss recovery it restores the cwnd to the "last known good" value that cwnd held before entering recovery. This applies equally whether the flow exits loss recovery because it finishes repairing all losses or because it executes an "undo" event after inferring that a loss recovery event was spurious.
There are several ways to implement this high-level design for updating cwnd in loss recovery. One is as follows:
Upon retransmission timeout (RTO):
BBROnEnterRTO():
  BBR.prior_cwnd = BBRSaveCwnd()
  cwnd = packets_in_flight + 1
Upon entering Fast Recovery, set cwnd to the number of packets still in flight (allowing at least one for a fast retransmit):
BBROnEnterFastRecovery():
  BBR.prior_cwnd = BBRSaveCwnd()
  cwnd = packets_in_flight + max(rs.newly_acked, 1)
  BBR.packet_conservation = true
Upon every ACK in Fast Recovery, run the following BBRModulateCwndForRecovery() steps, which help ensure packet conservation on the first round of recovery, and sending at no more than twice the current delivery rate on later rounds of recovery (given that "rs.newly_acked" packets were newly marked ACKed or SACKed and "rs.newly_lost" were newly marked lost):
BBRModulateCwndForRecovery():
  if (rs.newly_lost > 0)
    cwnd = max(cwnd - rs.newly_lost, 1)
  if (BBR.packet_conservation)
    cwnd = max(cwnd, packets_in_flight + rs.newly_acked)
After one round-trip in Fast Recovery:
BBR.packet_conservation = false
Upon exiting loss recovery (RTO recovery or Fast Recovery), either by repairing all losses or undoing recovery, BBR restores the best-known cwnd value it had upon entering loss recovery:
BBR.packet_conservation = false
BBRRestoreCwnd()
Note that exiting loss recovery happens during ACK processing, and at the end of ACK processing BBRBoundCwndForModel() will bound the cwnd based on the current model parameters. Thus the cwnd and pacing rate after loss recovery will generally be smaller than the values entering loss recovery.
The BBRSaveCwnd() and BBRRestoreCwnd() helpers remember and restore the last-known good cwnd (the latest cwnd unmodulated by loss recovery or ProbeRTT), and are defined as follows:
BBRSaveCwnd():
  if (!InLossRecovery() and BBR.state != ProbeRTT)
    return cwnd
  else
    return max(BBR.prior_cwnd, cwnd)

BBRRestoreCwnd():
  cwnd = max(cwnd, BBR.prior_cwnd)
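The Fast Recovery steps above can be traced with a small Python sketch. It counts in packets and folds the save, modulate, and restore helpers into one hypothetical RecoveryState class. The in_recovery and in_probe_rtt flags are assumptions standing in for InLossRecovery() and the BBR.state check.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    cwnd: int
    prior_cwnd: int = 0
    packet_conservation: bool = False
    in_recovery: bool = False     # stands in for InLossRecovery()
    in_probe_rtt: bool = False    # stands in for BBR.state == ProbeRTT

    def save_cwnd(self) -> int:
        # Remember the latest cwnd not modulated by recovery or ProbeRTT.
        if not self.in_recovery and not self.in_probe_rtt:
            return self.cwnd
        return max(self.prior_cwnd, self.cwnd)

    def enter_fast_recovery(self, packets_in_flight: int, newly_acked: int) -> None:
        self.prior_cwnd = self.save_cwnd()
        self.cwnd = packets_in_flight + max(newly_acked, 1)
        self.packet_conservation = True
        self.in_recovery = True

    def on_ack_in_recovery(self, packets_in_flight: int,
                           newly_acked: int, newly_lost: int) -> None:
        if newly_lost > 0:
            self.cwnd = max(self.cwnd - newly_lost, 1)
        if self.packet_conservation:
            # Packet conservation: allow one packet out per packet delivered.
            self.cwnd = max(self.cwnd, packets_in_flight + newly_acked)

    def exit_recovery(self) -> None:
        self.packet_conservation = False
        self.in_recovery = False
        self.cwnd = max(self.cwnd, self.prior_cwnd)   # restore last-known good cwnd
```

For instance, a flow with cwnd 40 that enters Fast Recovery with 30 packets in flight and 2 newly ACKed drops to cwnd 32, follows deliveries during the first round, and returns to 40 on exit (before any model-based bound is applied).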
4.6.4.5. Modulating cwnd in ProbeRTT
If BBR decides it needs to enter the ProbeRTT state (see the "ProbeRTT" section), its goal is to quickly reduce the volume of in-flight data and drain the bottleneck queue, thereby allowing measurement of BBR.min_rtt. To implement this mode, BBR bounds the cwnd to the value returned by BBRProbeRTTCwnd(), which is BBRProbeRTTCwndGain times the estimated BDP but never less than BBRMinPipeCwnd, the minimal value that allows pipelining (see the "Minimum cwnd for Pipelining" section, above):
BBRProbeRTTCwnd():
  probe_rtt_cwnd = BBRBDPMultiple(BBRProbeRTTCwndGain)
  probe_rtt_cwnd = max(probe_rtt_cwnd, BBRMinPipeCwnd)
  return probe_rtt_cwnd

BBRBoundCwndForProbeRTT():
  if (BBR.state == ProbeRTT)
    cwnd = min(cwnd, BBRProbeRTTCwnd())
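As a rough numerical illustration of the ProbeRTT bound, the following Python sketch works in bytes. It assumes a BBRProbeRTTCwndGain of 0.5 and an SMSS of 1460 bytes, neither of which is given in this section, and takes the estimated BDP as a direct input. The function names are hypothetical.

```python
SMSS = 1460                   # assumed segment size, in bytes
MIN_PIPE_CWND = 4 * SMSS      # BBRMinPipeCwnd: 4 packets
PROBE_RTT_CWND_GAIN = 0.5     # assumed value of BBRProbeRTTCwndGain


def probe_rtt_cwnd(bdp: float) -> float:
    """cwnd target while in ProbeRTT, for an estimated BDP in bytes."""
    return max(PROBE_RTT_CWND_GAIN * bdp, MIN_PIPE_CWND)


def bound_cwnd_for_probe_rtt(cwnd: float, bdp: float, in_probe_rtt: bool) -> float:
    # The bound only applies while the flow is in the ProbeRTT state.
    return min(cwnd, probe_rtt_cwnd(bdp)) if in_probe_rtt else cwnd
```

With a 100,000-byte BDP the cwnd is held to 50,000 bytes during ProbeRTT. With a 4,000-byte BDP the four-packet floor (5,840 bytes) applies instead.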
4.6.4.6. Core cwnd Adjustment Mechanism
The network path and traffic traveling over it can make sudden dramatic changes. To adapt to these changes smoothly and robustly, and reduce packet losses in such cases, BBR uses a conservative strategy. When cwnd is above the BBR.max_inflight derived from BBR's path model, BBR cuts the cwnd immediately to the BBR.max_inflight. When cwnd is below BBR.max_inflight, BBR raises the cwnd gradually and cautiously, increasing cwnd by no more than the amount of data acknowledged (cumulatively or selectively) upon each ACK.
Specifically, on each ACK that acknowledges "rs.newly_acked" packets as newly ACKed or SACKed, BBR runs the following BBRSetCwnd() steps to update cwnd:
BBRSetCwnd():
  BBRUpdateMaxInflight()
  BBRModulateCwndForRecovery()
  if (!BBR.packet_conservation) {
    if (BBR.filled_pipe)
      cwnd = min(cwnd + rs.newly_acked, BBR.max_inflight)
    else if (cwnd < BBR.max_inflight || C.delivered < InitialCwnd)
      cwnd = cwnd + rs.newly_acked
    cwnd = max(cwnd, BBRMinPipeCwnd)
  }
  BBRBoundCwndForProbeRTT()
  BBRBoundCwndForModel()
There are several considerations embodied in the logic above. If BBR has measured enough samples to achieve confidence that it has filled the pipe (see the description of BBR.filled_pipe in the "Startup" section above), then it increases its cwnd based on the number of packets delivered, while bounding its cwnd to be no larger than the BBR.max_inflight adapted to the estimated BDP. Otherwise, if the cwnd is below the BBR.max_inflight, or the sender has marked so little data delivered (less than InitialCwnd) that it does not yet judge its BBR.max_bw estimate and BBR.max_inflight as useful, then it increases cwnd without bounding it to be below BBR.max_inflight. Finally, BBR imposes a floor of BBRMinPipeCwnd in order to allow pipelining even with small BDPs (see the "Minimum cwnd for Pipelining" section, above).
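The growth rule in the middle of BBRSetCwnd() can be exercised in isolation with a short Python sketch. It counts in packets, leaves out the recovery, ProbeRTT, and model bounds, and assumes an InitialCwnd of 10 packets. The function name grow_cwnd and its parameters are hypothetical.

```python
def grow_cwnd(cwnd: int, newly_acked: int, max_inflight: int,
              filled_pipe: bool, delivered: int,
              initial_cwnd: int = 10, min_pipe_cwnd: int = 4) -> int:
    """One ACK's worth of cwnd adjustment, outside packet conservation."""
    if filled_pipe:
        # Pipe is full: grow by the data ACKed, but never beyond max_inflight.
        # This also cuts cwnd at once when it is already above max_inflight.
        cwnd = min(cwnd + newly_acked, max_inflight)
    elif cwnd < max_inflight or delivered < initial_cwnd:
        # Model not yet trusted: grow without bounding to max_inflight.
        cwnd = cwnd + newly_acked
    # Keep enough packets in flight for pipelining.
    return max(cwnd, min_pipe_cwnd)
```

For example, with filled_pipe set and max_inflight at 52 packets, a cwnd of 50 grows to 52 on an ACK covering 4 packets, and a cwnd of 60 is cut straight to 52.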
4.6.4.7. Bounding cwnd Based on Recent Congestion
Finally, BBR bounds the cwnd based on recent congestion, as outlined in the "Volume Cap" column of the table in the "Summary of Control Behavior in the State Machine" section:
BBRBoundCwndForModel():
  cap = Infinity
  if (IsInAProbeBWState() and BBR.state != ProbeBW_CRUISE)
    cap = BBR.inflight_hi
  else if (BBR.state == ProbeRTT or BBR.state == ProbeBW_CRUISE)
    cap = BBRInflightWithHeadroom()
  /* apply inflight_lo (possibly infinite): */
  cap = min(cap, BBR.inflight_lo)
  cap = max(cap, BBRMinPipeCwnd)
  cwnd = min(cwnd, cap)
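The cap selection above can be sketched in Python as follows. The sketch counts in packets and represents the state as a string. The names "ProbeBW_DOWN" and "ProbeBW_REFILL" are assumed phase names not shown in this section, and the result of BBRInflightWithHeadroom() is passed in as a plain number because that helper is defined elsewhere.

```python
import math

# Assumed set of ProbeBW phases other than ProbeBW_CRUISE.
PROBE_BW_NON_CRUISE = ("ProbeBW_DOWN", "ProbeBW_REFILL", "ProbeBW_UP")


def bound_cwnd_for_model(cwnd: float, state: str, inflight_hi: float,
                         inflight_lo: float, inflight_with_headroom: float,
                         min_pipe_cwnd: float = 4) -> float:
    cap = math.inf
    if state in PROBE_BW_NON_CRUISE:
        cap = inflight_hi
    elif state in ("ProbeRTT", "ProbeBW_CRUISE"):
        cap = inflight_with_headroom
    cap = min(cap, inflight_lo)      # inflight_lo may be infinite
    cap = max(cap, min_pipe_cwnd)    # never cap below the pipelining floor
    return min(cwnd, cap)
```

In states that match neither branch (for example Startup, assuming inflight_lo is infinite), the cap stays infinite and cwnd passes through unchanged.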
References:
Measuring bottleneck bandwidth and round-trip propagation time: https://queue.acm.org/detail.cfm?id=3022184
BBRv2: https://datatracker.ietf.org/doc/html/draft-cardwell-iccrg-bbr-congestion-control-02