throughput latency
https://www.agner.org/optimize/instruction_tables.pdf
Integer multiply is at least 3c latency on all recent x86 CPUs (and higher on some older CPUs). On many CPUs it's fully pipelined, so throughput is 1 per clock, but you can only achieve that if you have three independent multiplies in flight. (FP multiply on Haswell is 5c latency, 0.5c throughput, so you need 10 in flight to saturate throughput.) Division (div and idiv) is even worse: it's microcoded, much higher latency than add or shr, and not even fully pipelined on any CPU. All of this is straight from Agner Fog's instruction tables, so it's a good thing you linked that.
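To make the "three in flight" and "10 in flight" figures concrete, here is a minimal sketch of the arithmetic: the number of independent operations needed is the latency divided by the cycles between issues, rounded up. The function name ops_in_flight is made up for illustration, and the model assumes an ideally pipelined unit with nothing else competing for it.

```python
import math

def ops_in_flight(latency_cycles, cycles_per_issue):
    """Independent operations needed to keep one kind of instruction
    issuing at its peak rate (simplified model, an assumption)."""
    return math.ceil(latency_cycles / cycles_per_issue)

# Integer multiply: 3c latency, one issued per clock.
print(ops_in_flight(3, 1))    # 3
# Haswell FP multiply: 5c latency, one issued every 0.5 clocks.
print(ops_in_flight(5, 0.5))  # 10
```

With fewer independent operations than that, the code is limited by latency rather than by throughput.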
Latency is the number of processor clocks it takes for an instruction to have its data available for use by another instruction. Therefore, an instruction which has a latency of 6 clocks will have its data available for another instruction that many clocks after it starts its execution.
Throughput, when it is quoted in clocks like this (Agner Fog's tables call it reciprocal throughput), is the number of processor clocks an instruction occupies its execution unit before another instruction of the same kind can start. An instruction with a throughput of 2 clocks ties up its execution unit for that many cycles, which prevents any other instruction needing that unit from starting. Only after the instruction is done with the execution unit can the next instruction enter.
Latency = time from the start of the instruction until the result is available. If your division has a latency of 26 cycles, and you calculate (((x / a) / b) / c), then the result of the division by a is available after 26 cycles. That's when the division by b can start, with the result available after 52 cycles, and the result of dividing by c is available after 78 cycles.
The throughput is one division every six cycles, which means you can start another division every six cycles. So if you want to calculate x/a, y/a, z/a, u/a, v/a, the five divisions can start at cycles 0, 6, 12, 18, 24, and the results are available at cycles 26, 32, 38, 44, and 50.
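The two cases can be sketched in a few lines, using the 26-cycle latency and six-cycle throughput from the quoted article. The function names are made up for illustration, and the model is deliberately simple: a single divider and no other instructions competing for it.

```python
LATENCY = 26      # cycles from start of a division until its result is ready
ISSUE_EVERY = 6   # a new division can start every six cycles

def dependent_chain_finish(n):
    """Each division needs the previous result, so latencies add up."""
    return n * LATENCY

def independent_finish_times(n):
    """Division i starts at cycle i * ISSUE_EVERY and finishes LATENCY later."""
    return [i * ISSUE_EVERY + LATENCY for i in range(n)]

print(dependent_chain_finish(3))    # 78, the (((x / a) / b) / c) case
print(independent_finish_times(5))  # [26, 32, 38, 44, 50], x/a through v/a
```

The dependent chain is bound by latency, while the independent divisions are bound by throughput and only pay the latency once at the end.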
As an exercise, figure out how long it takes to evaluate (((x / a) / b) / c), (((y / a) / b) / c), (((z / a) / b) / c), (((u / a) / b) / c) and (((v / a) / b) / c), using the same latency and throughput.
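If you want to check your answer, a small simulation along these lines should do. It is only a sketch under stated assumptions: one divider that accepts a new division every six cycles, a 26-cycle latency, and a scheduler that always issues the division whose input became ready first. The function simulate and its parameters are hypothetical names, and a real out-of-order CPU has more going on than this.

```python
def simulate(num_chains, chain_length, latency=26, issue_every=6):
    """Return the cycle at which the last result becomes available."""
    ready = [0] * num_chains                 # when each chain's next input is ready
    remaining = [chain_length] * num_chains  # divisions left in each chain
    unit_free = 0                            # next cycle the divider can accept work
    finish = 0
    while any(remaining):
        # Pick the unfinished chain whose input is ready earliest.
        pending = [i for i in range(num_chains) if remaining[i]]
        c = min(pending, key=lambda i: ready[i])
        start = max(unit_free, ready[c])     # wait for both the divider and the input
        unit_free = start + issue_every
        ready[c] = start + latency
        remaining[c] -= 1
        finish = max(finish, ready[c])
    return finish

print(simulate(1, 3))  # 78, one chain of three dependent divisions
print(simulate(5, 1))  # 50, five independent divisions
print(simulate(5, 3))  # the exercise: five chains of three divisions each
```

The first two calls reproduce the numbers worked out above, so the third should be a fair check on the exercise.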
I was reading an article on an alternative method of modulo reduction and I couldn't understand the following excerpt (the parts in bold):
"A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles."