throughput latency
https://www.agner.org/optimize/instruction_tables.pdf
Integer multiply is at least 3c latency on all recent x86 CPUs (and higher on some older CPUs). On many CPUs it's fully pipelined, so throughput is 1 per clock, but you can only achieve that if you have three independent multiplies in flight. (FP multiply on Haswell is 5c latency, 0.5c throughput, so you need 10 in flight to saturate throughput.) Division (div and idiv) is even worse: it's microcoded, and much higher latency than add or shr, and not even fully pipelined on any CPU. All of this is straight from Agner Fog's instruction tables, so it's a good thing you linked that.
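The "three in flight" and "10 in flight" figures above are just latency divided by the number of cycles between issues. Here is a minimal sketch of that arithmetic; the function name independent_ops_needed is made up for illustration, and it assumes a fully pipelined unit with nothing else limiting issue.

```python
import math

def independent_ops_needed(latency_cycles, cycles_per_issue):
    """Independent operations that must be in flight to keep a pipelined unit busy.

    Hypothetical helper: assumes a fully pipelined unit and no other bottleneck.
    """
    return math.ceil(latency_cycles / cycles_per_issue)

# Integer multiply figures from the text: 3c latency, one issue per cycle.
print(independent_ops_needed(3, 1))
# Haswell FP multiply figures from the text: 5c latency, one issue per 0.5 cycle.
print(independent_ops_needed(5, 0.5))
```

With fewer independent operations than this, the code is bound by latency rather than by throughput.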
Latency is the number of processor clocks it takes for an instruction to have its data available for use by another instruction. Therefore, an instruction which has a latency of 6 clocks will have its data available for another instruction that many clocks after it starts its execution.
Throughput, in the sense instruction tables usually list it (strictly speaking, reciprocal throughput), is the number of processor clocks an instruction occupies its execution unit before the next instruction of the same kind can start. An instruction with a throughput of 2 clocks ties up its execution unit for that many cycles, which prevents any other instruction needing that execution unit from starting. Only after the instruction is done with the execution unit can the next instruction enter.
Latency = time from the start of the instruction until the result is available. If your division has a latency of 26 cycles, and you calculate (((x / a) / b) / c), then the result of the division by a is available after 26 cycles. That's when the division by b can start, with the result available after 52 cycles, and the result of dividing by c is available after 78 cycles.
The throughput is one division every six cycles, which means you can start another division every six cycles. So if you want to calculate x/a, y/a, z/a, u/a, v/a, the five divisions can start at cycles 0, 6, 12, 18, 24, and the results are available at cycles 26, 32, 38, 44, and 50.
As an exercise, figure out how long it takes to evaluate (((x / a) / b) / c), (((y / a) / b) / c), (((z / a) / b) / c), (((u / a) / b) / c) and (((v / a) / b) / c).
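If you want to check your reasoning, the model used in this answer can be sketched in a few lines. This is only a toy under stated assumptions: a single divider that accepts a new division every 6 cycles, a 26-cycle latency, and an oldest-ready-first scheduler. The name finish_times and its parameters are invented for illustration; a real out-of-order core has more going on.

```python
def finish_times(chains, depth, latency=26, interval=6):
    """Cycle at which each dependency chain's final result is available.

    Toy model (assumption): one divider, a new division may start every
    `interval` cycles, and each result is ready `latency` cycles after start.
    """
    ready = [0] * chains          # cycle when each chain's next input is ready
    remaining = [depth] * chains  # divisions still to issue per chain
    unit_free = 0                 # cycle when the divider accepts a new division
    while any(remaining):
        # Issue the pending division whose input became ready earliest.
        pending = [c for c in range(chains) if remaining[c]]
        i = min(pending, key=lambda c: ready[c])
        start = max(unit_free, ready[i])
        unit_free = start + interval
        ready[i] = start + latency
        remaining[i] -= 1
    return ready

# One chain of three dependent divisions: (((x / a) / b) / c)
print(finish_times(1, 3))   # [78]
# Five independent divisions: x/a, y/a, z/a, u/a, v/a
print(finish_times(5, 1))   # [26, 32, 38, 44, 50]
```

Both outputs match the numbers worked out above. Calling finish_times(5, 3) gives this model's answer to the exercise, so try it by hand first.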
I was reading an article on an alternative method of modulo reduction and I couldn't understand the following excerpt (those in bold):
"A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles."