Linux kernel TCP smoothed-RTT estimation
https://strugglingcoder.info/index.php/linux-kernel-tcp-smoothed-rtt-estimation/
Posted: February 18th, 2018 | Author: hiren | Filed under: Linux, networking, tcp | Tags: linux, networking, rtt, srtt, tcp | Comments Off on Linux kernel TCP smoothed-RTT estimation
Recently I decided to look under the hood to see how exactly srtt is calculated in Linux. Actual (Exponentially Weighted Moving Average) srtt calculation is a rather straight-forward part but what goes in as input to that calculation under various scenarios is interesting and very important in getting correct rtt estimate.
Also useful to note the difference between Linux and FreeBSD in this regard. Linux doesn’t trust tcp packet Timestamps option provided value whenever possible as middle-boxes can meddle with it.
Basic algorithm is:
For non-retransmitted packets, use saved packet send timestamp and ack arrival time.
For retransmitted packets, use timestamp option and if that’s not enabled, rtt is not calculated for such packets.
Let’s look at the code. I am using net-next.
When a TCP sender sends packets, it has to wait for acks for those packets before throwing them away. It stores them in a queue called ‘retransmission queue’.
When sent packets get acked, tcp_clean_rtx_queue() gets called to clear those packets from the retransmission queue.
A few useful variables in that function are:
seq_rtt_us – uses first packet from ackd range
ca_rtt_us – uses last packet from ackd range (mainly used for congestion control)
sack_rtt_us – uses sacked ack
tcp_mstamp is a tcp_sock member which represents timestamp of most recent packet received/sent. It gets updated by tcp_mstamp_refresh().
For a clean ack (not sack), seq_rtt_us = ca_rtt_us (as there is no range)
If such a clean is also for a non-retransmitted packet,
[sourcecode language=”c”]seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);[/sourcecode]
and for a sack which is again for a non-retransmitted packet,
[sourcecode language=”c”]sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, sack->first_sackt);[/sourcecode]
Code that updates sack→first_sackt is in tcp_sacktag_one() where it gets populated when the sack is for a non-retransmitted packet.
tcp_stamp_us_delta() gets the difference with timestamp that the stack maintains.
Now tcp_ack_update_rtt() gets called which starts out with:
[sourcecode language=”c”]
/* Prefer RTT measured from ACK’s timing to TS-ECR. This is because
* broken middle-boxes or peers may corrupt TS-ECR fields. But
* Karn’s algorithm forbids taking RTT if some retransmitted data
* is acked (RFC6298).
*/
if (seq_rtt_us < 0)
seq_rtt_us = sack_rtt_us;
[/sourcecode]
For acks acking retransmitted packets, seq_rtt_us would be -ve.
But if there is a SACK timestamp from a non-retransmitted packet, it would use that as it carries valid and useful timestamps.
Then it takes TS-opt provided timestamps only if seq_rtt_us is -ve.
[sourcecode language=”c”]
if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
flag & FLAG_ACKED) {
u32 delta = tcp_time_stamp(tp) – tp->rx_opt.rcv_tsecr;
u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
seq_rtt_us = ca_rtt_us = delta_us;
}
[/sourcecode]
By this point, there is seq_rtt_us that can be fed into tcp_rtt_estimator() that’d generate smoothed-RTT (which is more or less based on SIGCOMM 88 paper by Van Jacobson).
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· 无需6万激活码!GitHub神秘组织3小时极速复刻Manus,手把手教你使用OpenManus搭建本
· Manus爆火,是硬核还是营销?
· 终于写完轮子一部分:tcp代理 了,记录一下
· 别再用vector<bool>了!Google高级工程师:这可能是STL最大的设计失误
· 单元测试从入门到精通