How does a single thread run on multiple cores?
I am trying to understand, at a high-level, how single threads run across multiple cores. Below is my best understanding. I do not believe it is correct though.
Based on my reading of Hyper-threading, it seems the OS organizes the instructions of all threads in such a way that they are not waiting on each other. Then the front-end of the CPU further organizes those instructions by distributing one thread to each core, and distributes independent instructions from each thread among any open cycles.
So if there is only a single thread, then the OS will not do any optimization. However, the front-end of the CPU will distribute independent instruction sets among each core.
According to https://stackoverflow.com/a/15936270, a specific programming language may create more or fewer threads, but that is irrelevant to deciding what to do with those threads. The OS and CPU handle this, so it happens regardless of the programming language used.
Just to clarify, I am asking about a single thread run across multiple cores, not about running multiple threads on a single core.
What is wrong with my summary? Where and how are a thread's instructions split up among multiple cores? Does the programming language matter? I know this is a broad subject; I am hoping for a high-level understanding of it.
multithreading hardware cpu multi-core
asked Jun 1 '17 at 14:09
A set of instructions for a single software thread may run on many cores, but not at once. – Kroltan Jun 1 '17 at 17:51
You're mixing software threads (which involve the OS scheduler) and hardware threads or HyperThreading (a CPU feature that makes one core behave like two). – ugoren Jun 1 '17 at 21:44
I have 20 drivers and 4 trucks. How is it possible that one driver can deliver packages with two trucks? How is it possible that one truck can have multiple drivers? The answer to both questions is the same. Take turns. – Eric Lippert Jun 9 '17 at 18:21
3 Answers
The operating system offers time slices of CPU to threads that are eligible to run.
If there is only one core, then the operating system schedules the most eligible thread to run on that core for a time slice. After a time slice is completed, or when the running thread blocks on IO, or when the processor is interrupted by external events, the operating system reevaluates what thread to run next (and it could choose the same thread again or a different one).
Eligibility to run is based on variations of fairness, priority, and readiness, and by this method various threads get time slices, some more than others.
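For a concrete, Linux-flavored illustration of influencing that eligibility from user space (a minimal sketch, not something most programs need to do): a process can lower its own priority, or voluntarily give up the rest of its current time slice.

/* Minimal Linux/glibc sketch: nice() lowers this process's priority so the
   scheduler favors other threads; sched_yield() gives up the rest of the
   current time slice so another ready thread can run now. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    nice(10);                      /* be less eligible than default-priority threads */
    for (int i = 0; i < 5; i++) {
        printf("did a little work, yielding the rest of this slice\n");
        sched_yield();             /* invite the scheduler to pick someone else */
    }
    return 0;
}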
If there are multiple cores, N, then the operating system schedules the most eligible N threads to run on the cores.
Processor Affinity is an efficiency consideration. Each time a CPU runs a different thread than before, it tends to slow down a bit because its cache is warm for the previous thread, but cold to the new one. Thus, running the same thread on the same processor over numerous time slices is an efficiency advantage.
However, the operating system is free to offer one thread time-slices on different CPUs, and it could rotate through all the CPUs on different time slices. It cannot, however, as @gnasher729 says, run one thread on multiple CPUs simultaneously.
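On Linux, for example, a program that wants that affinity advantage can pin itself to a particular core with the non-portable sched_setaffinity call; this is a minimal sketch, with most error handling omitted:

/* Minimal Linux sketch: restrict the calling thread to core 0 so the OS
   keeps giving it time slices there (and its caches stay warm). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                    /* allow only logical CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {   /* 0 = calling thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned; now running on CPU %d\n", sched_getcpu());
    return 0;
}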
Hyperthreading is a method in hardware by which a single enhanced CPU core can support execution of two or more different threads simultaneously. (Such a CPU can offer additional threads at lower cost in silicon real-estate than additional full cores.) This enhanced CPU core needs to support additional state for the other threads, such as CPU register values, and also has coordination state & behavior that enables sharing of functional units within that CPU without conflating the threads.
While hyperthreading is technically challenging from a hardware perspective, from the programmer's perspective the execution model is merely that of additional CPU cores rather than anything more complex. So the operating system sees additional CPU cores, though there are some new processor-affinity issues since several hyperthreaded threads share one CPU core's cache architecture.
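You can see this from software: the OS simply reports more logical processors. A minimal sketch (Linux/glibc assumed); on a hyperthreaded quad-core it would typically print 8:

/* Minimal sketch: hyperthreading just shows up as extra logical CPUs. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs visible to the OS: %ld\n", n);
    return 0;
}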
We might naively think that two threads running on a hyperthreaded core would each run half as fast as they would with their own full core. But this is not necessarily the case, since a single thread's execution is full of slack cycles, and some amount of them can be used by the other hyperthreaded thread. Further, even during non-slack cycles, one thread may be using different functional units than the other, so simultaneous execution can occur. The enhanced CPU for hyperthreading may have a few more of certain heavily used functional units specifically to support that.
answered Jun 1 '17 at 15:00
"Thus, running the same thread on the same processor over numerous time slices is an efficiency advantage." Wouldn't it have to be contiguous time slices? Otherwise, the caches would be wiped by other threads, no? +1 for a nice explanation. – jpmc26 Jun 1 '17 at 21:53
@Luaan: HT is often good, but the situation isn't as simple as you describe. The front-end issue bandwidth (4 uops per clock on Intel, 6 on Ryzen) is equally shared between threads (unless one is stalled). If that's the bottleneck, then like I said HT won't help at all. It's not uncommon for Skylake to come close to that in a well-tuned loop, if there's a mix of loads, ALU and stores... Transistors are cheap (and can't all be switching at once or the CPU would melt), so modern x86 CPUs have more execution ports than the front-end can feed (with many execution units being replicated... – Peter Cordes Jun 2 '17 at 11:45
... on multiple ports)... This might seem like a waste, but often a loop will only use one kind of ALU execution unit at once, so having duplicates of everything means that whatever kind of code is running, there are multiple ports for its instructions. So the reason you cited for benefiting from HT is not that common, since most code has some loads and/or stores taking up front-end bandwidth, and what's left is often not enough to saturate the execution units. – Peter Cordes Jun 2 '17 at 11:45
@Luaan: Also, in Intel CPUs the integer and FP/vector execution units share the same execution ports. For example, the FP FMA/mul/add units are on ports 0/1. But the integer multiplier is also on port1, and simple integer ops can run on any of the 4 execution ports (diagram in my answer). A second thread using up issue bandwidth will slow them both down even if they don't compete for execution units, but often there's a net throughput gain if they don't compete too badly for cache. Even well-tuned high-throughput code like x264/x265 (video encoders) benefit about 15% on Skylake from HT. – Peter Cordes Jun 2 '17 at 11:49
@luaan In addition to what Peter said, your claim that "That was the original reasoning behind HT" is incorrect. The original reasoning behind HT was that the NetBurst microarchitecture had lengthened the pipeline to such an extreme extent (for the purposes of driving up the clock speed) that branch mispredictions and other pipeline bubbles absolutely killed performance. HT was one of Intel's solutions to minimizing the amount of time that this big expensive chip's execution units sat idle because of bubbles in the pipeline: code from other threads could be inserted and run in those holes. – Cody Gray Jun 2 '17 at 13:43
There is no such thing as a single thread running on multiple cores simultaneously.
It doesn't mean, however, that instructions from one thread cannot be executed in parallel. There are mechanisms called instruction pipelining and out-of-order execution that allow it. Each core has a lot of redundant resources that are not utilized by simple instructions, so multiple such instructions can be run together (as long as the next one doesn't depend on the previous result).
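As a rough source-level sketch of what that looks like (an illustration, not a benchmark): summing an array with two independent accumulators gives the core two dependency chains that it can keep in flight at the same time, even though this is still a single thread.

/* Sketch: two independent dependency chains inside one thread.
   sum0 and sum1 never depend on each other, so an out-of-order core can
   overlap their additions instead of waiting for each one in turn. */
double sum_two_chains(const double *a, int n) {   /* n assumed even */
    double sum0 = 0.0, sum1 = 0.0;
    for (int i = 0; i < n; i += 2) {
        sum0 += a[i];        /* chain 1 */
        sum1 += a[i + 1];    /* chain 2 */
    }
    return sum0 + sum1;
}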
Hyper-threading is a kind of extreme variant of this idea, in which one core not only executes instructions from one thread in parallel, but also mixes instructions from two different threads to optimize resource usage even further.
Related Wikipedia entries: Instruction pipelining, out-of-order execution.
answered Jun 1 '17 at 21:26
They cannot run simultaneously, but they can run in parallel? Are these not the same thing? – Evorlor Jun 1 '17 at 23:42
@Evorlor The key thing here is the difference between a core and an execution unit. A single thread can only run on one core, but a processor can use dynamic analysis to work out which instructions being executed by a core do not depend on each other and execute these on different execution units simultaneously. One core may have several execution units. – user1937198 Jun 1 '17 at 23:55
@Evorlor: An out-of-order CPU can find and exploit the instruction-level parallelism within the instruction stream of a single thread. e.g. often the instructions that update a loop counter are independent from some of the other work a loop does. Or in an a[i] = b[i] + c[i] loop, each iteration is independent, so loads, adds, and stores from different iterations can be in flight at once. It has to preserve the illusion that the instructions executed in program order, but for example a store that misses in cache doesn't delay the thread (until it runs out of space in the store buffer). – Peter Cordes Jun 2 '17 at 0:09
@user1937198: The phrase "dynamic analysis" would better suit a JIT compiler. Out-of-order CPUs don't really analyze; it's more like a greedy algorithm that runs whatever instructions have been decoded&issued and have their inputs ready. (The out-of-order reordering window is limited by a few microarchitectural resources, for example Intel Sandybridge has a ReOrder Buffer size of 168 uops. See also measuring the ROB size experimentally). All implemented with hardware state-machines to handle 4 uops per clock. – Peter Cordes Jun 2 '17 at 0:17
@Luaan yeah, it was an interesting idea, but AOT compilers still aren't smart enough to fully exploit it. Also, Linus Torvalds (and others) have argued that exposing that much of the internals of the pipeline is a big constraint on future designs. e.g. you can't really increase the pipeline width without changing the ISA. Or you build a CPU that tracks dependencies the usual way, and maybe issues two VLIW groups in parallel, but then you've lost the CPU-complexity benefit of EPIC but still have the downsides (lost issue bandwidth when the compiler can't fill a word). – Peter Cordes Jun 2 '17 at 12:05
summary: Finding and exploiting the (instruction-level) parallelism in a single-threaded program is done purely in hardware, by the CPU core it's running on. And only over a window of a couple hundred instructions, not large-scale reordering.
Single-threaded programs get no benefit from multi-core CPUs, except that other things can run on the other cores instead of taking time away from the single-threaded task.
the OS organizes the instructions of all threads in such a way that they are not waiting on each other.
The OS does NOT look inside the instruction streams of threads. It only schedules threads to cores.
Actually, each core runs the OS's scheduler function when it needs to figure out what to do next. Scheduling is a distributed algorithm. To better understand multi-core machines, think of each core as running the kernel separately. Just like a multi-threaded program, the kernel is written so that its code on one core can safely interact with its code on other cores to update shared data structures (like the list of threads that are ready to run).
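You can watch those scheduling decisions from a single thread: over successive time slices it may land on different cores, but only one at a time. A minimal Linux sketch (sched_getcpu is non-portable):

/* Minimal Linux sketch: one thread reporting which core the scheduler
   placed it on. The number may change from print to print, but the thread
   never occupies more than one core at any instant. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 10; i++) {
        printf("time slice %d: on CPU %d\n", i, sched_getcpu());
        usleep(100 * 1000);        /* sleep ~100 ms; we'll likely be rescheduled */
    }
    return 0;
}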
Anyway, the OS is involved in helping multi-threaded processes exploit thread-level parallelism which must be explicitly exposed by manually writing a multi-threaded program. (Or by an auto-parallelizing compiler with OpenMP or something).
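For example, with OpenMP the programmer marks the parallelism explicitly; the compiler and runtime create the threads, and the OS then spreads those threads across cores. A minimal sketch (compile with something like gcc -fopenmp):

/* Minimal OpenMP sketch: the #pragma explicitly exposes thread-level
   parallelism; the runtime creates worker threads and the OS schedules
   them onto different cores. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += i * 0.5;
    printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}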
Then the front-end of the CPU further organizes those instructions by distributing one thread to each core, and distributes independent instructions from each thread among any open cycles.
A CPU core is only running one stream of instructions, if it isn't halted (asleep until the next interrupt, e.g. a timer interrupt). Often that's a thread, but it could also be a kernel interrupt handler, or miscellaneous kernel code if the kernel decided to do something other than just return to the previous thread after handling an interrupt or system call.
With HyperThreading or other SMT designs, a physical CPU core acts like multiple "logical" cores. The only difference from an OS perspective between a quad-core-with-hyperthreading (4c8t) CPU and a plain 8-core machine (8c8t) is that an HT-aware OS will try to schedule threads to separate physical cores so they don't compete with each other. An OS that didn't know about hyperthreading would just see 8 cores (unless you disable HT in the BIOS, then it would only detect 4).
The term "front-end" refers to the part of a CPU core that fetches machine code, decodes the instructions, and issues them into the out-of-order part of the core. Each core has its own front-end, and it's part of the core as a whole. Instructions it fetches are what the CPU is currently running.
Inside the out-of-order part of the core, instructions (or uops) are dispatched to execution ports when their input operands are ready and there's a free execution port. This doesn't have to happen in program order, so this is how an OOO CPU can exploit the instruction-level parallelism within a single thread.
If you replace "core" with "execution unit" in your idea, you're close to correct. Yes, the CPU does distribute independent instructions/uops to execution units in parallel. (But there's a terminology mix-up, since you said "front-end" when really it's the CPU's instruction-scheduler aka Reservation Station that picks instructions ready to execute).
Out-of-order execution can only find ILP on a very local level, only up to a couple hundred instructions, not between two independent loops (unless they're short).
For example, the asm equivalent of this
int i = 0, j = 0;
do {
    i++;
    j++;
} while(42);
will run about as fast on Intel Haswell as the same loop incrementing only one counter. i++ only depends on the previous value of i, while j++ only depends on the previous value of j, so the two dependency chains can run in parallel without breaking the illusion of everything being executed in program order.
On x86, the loop would look something like this:
.loop:
    inc eax        ; i++
    inc edx        ; j++
    jmp .loop      ; infinite loop, like while(42)
Haswell has 4 integer execution ports, and all of them have adder units, so it can sustain a throughput of up to 4 inc instructions per clock if they're all independent. (With latency=1, you only need 4 registers to max out the throughput by keeping 4 inc instructions in flight. Contrast this with vector-FP MUL or FMA: latency=5, throughput=0.5, so you need 10 vector accumulators to keep 10 FMAs in flight to max out the throughput. And each vector can be 256b, holding 8 single-precision floats.)
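As a rough source-level sketch of the multiple-accumulator idea (scalar C for clarity; real high-throughput code would use vector intrinsics and more accumulators, and n is assumed to be a multiple of 4):

/* Sketch: unrolling a dot product with several independent accumulators so
   more multiply-adds are in flight at once, hiding floating-point latency. */
double dot4(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}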
The taken branch is also a bottleneck: a loop always takes at least one whole clock per iteration, because taken-branch throughput is limited to 1 per clock. I could put one more instruction inside the loop without reducing performance, unless it also reads/writes eax or edx, in which case it would lengthen that dependency chain. Putting 2 more instructions in the loop (or one complex multi-uop instruction) would create a bottleneck on the front-end, since it can only issue 4 uops per clock into the out-of-order core. (See this SO Q&A for some details on what happens for loops that aren't a multiple of 4 uops: the loop buffer and uop cache make things interesting.)
In more complex cases, finding the parallelism requires looking at a larger window of instructions. (e.g. maybe there's a sequence of 10 instructions that all depend on each other, then some independent ones).
The Re-Order Buffer capacity is one of the factors that limits the out-of-order window size. On Intel Haswell, it's 192 uops. (And you can even measure it experimentally, along with register-renaming capacity (register-file size).) Low-power CPU cores like ARM have much smaller ROB sizes, if they do out-of-order execution at all.
Also note that CPUs need to be pipelined as well as out-of-order, so a core has to fetch & decode instructions well ahead of the ones being executed, preferably with enough throughput to refill buffers after any missed fetch cycles. Branches are tricky, because we don't know where to even fetch from if we don't know which way a branch went. This is why branch prediction is so important. (And why modern CPUs use speculative execution: they guess which way a branch will go and start fetching/decoding/executing that instruction stream. When a misprediction is detected, they roll back to the last known-good state and execute from there.)
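As a small illustration of why prediction matters (a sketch with crude clock()-based timing; the measured effect, and whether the compiler turns the branch into a branchless select, depends on compiler flags and the CPU):

/* Sketch: the same counting loop over predictable vs. unpredictable data.
   With random data the branch direction can't be guessed, so speculatively
   started work gets thrown away on every mispredict. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

static long count_big(const int *v) {
    long hits = 0;
    for (int i = 0; i < N; i++)
        if (v[i] > RAND_MAX / 2)    /* the branch the predictor must guess */
            hits++;
    return hits;
}

int main(void) {
    int *ramp = malloc(N * sizeof *ramp);   /* monotonically increasing: predictable */
    int *rnd  = malloc(N * sizeof *rnd);    /* random: unpredictable */
    if (!ramp || !rnd) return 1;
    for (int i = 0; i < N; i++) {
        ramp[i] = (int)((long long)i * RAND_MAX / N);
        rnd[i]  = rand();
    }
    clock_t t0 = clock();
    long a = count_big(ramp);
    clock_t t1 = clock();
    long b = count_big(rnd);
    clock_t t2 = clock();
    printf("predictable: %ld hits in %ld ticks\n", a, (long)(t1 - t0));
    printf("random:      %ld hits in %ld ticks\n", b, (long)(t2 - t1));
    free(ramp);
    free(rnd);
    return 0;
}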
If you want to read more about CPU internals, there are some links in the Stackoverflow x86 tag wiki, including to Agner Fog's microarch guide, and to David Kanter's detailed writeups with diagrams of Intel and AMD CPUs. From his Intel Haswell microarchitecture writeup, this is the final diagram of the whole pipeline of a Haswell core (not the whole chip).
This is a block diagram of a single CPU core. A quad-core CPU has 4 of these on a chip, each with their own L1/L2 caches (sharing an L3 cache, memory controllers, and PCIe connections to the system devices).
I know this is overwhelmingly complicated. Kanter's article also shows parts of this to talk about the frontend separately from the execution units or the caches, for example.
answered Jun 2 '17 at 1:18
"Finding and exploiting the (instruction-level) parallelism in a single-threaded program is done purely in hardware" Note that this only applies to conventional ISAs, not VLIWs in which ILP is determined completely by the compiler or programmer, or cooperatively between hardware and software. – Hadi BraisMay 25 at 18:51