How It Works: Maximizing Max Degree Of Parallelism (MAXDOP)
I was working on an index build issue for an 80 CPU system and kept seeing that only 64 CPUs were getting used. I had carefully studied sys.dm_os_spinlock_stats and sys.dm_os_wait_stats along with performance counters, memory usage pattern, and I/O activities. In fact, I had an 80 CPU, 2TB RAM, 4TB SSD system so I was convinced the SQL Server was CPU bound and adding more CPUs for the index build could be a benefit.
Note: I must caution you that adding more CPUs can lead to reduced performance because a bottleneck such as memory or I/O can become a larger problem.
I like to think of it like pumping gas at my favorite filling station. The pump at the storage tank can only move so much liquid through a finite pipe size. Adding more filling outlets for patrons to use does not mean the overall flow of gas (gallons/sec) increases. In fact, I like filling when no one else is filling because I maximize my flow and reduce my overall time at the pump.
TEST any MAXDOP setting well as you might find less goes faster.
After looking at various SQL Server Books Online references and then stepping though the code I realized our documentation is not as accurate as it could be. I hope I this post can reduce some of the confusion.
There are plenty of references for tuning MAXDOP to allow queries to run at their best while reducing the overhead of the parallelism. You have all seen the references for capping MAXDOP at 8, or number of schedulers of the NUMA node if smaller, or …. The fact is, these are all great and recommended best practices.
- http://support.microsoft.com/kb/2806535
- http://blogs.msdn.com/b/psssql/archive/2013/08/28/i-want-to-know-why-my-query-is-not-parallelized.aspx
- http://technet.microsoft.com/en-us/library/ms181007(v=SQL.105).aspx
- http://blogs.msdn.com/b/psssql/archive/2011/02/03/resource-governor-and-maxdop-parallel-plan-generation.aspx
- http://blogs.msdn.com/b/psssql/archive/2008/02/13/how-it-works-sql-server-per-query-degree-of-parallelism-worker-count-s.aspx
This post is not intended to contradict any of the current recommendations. This blog is solely focused on a specific maintenance target. The reset of my system has to be idle so I can safely consume the majority of schedulers. It is probably the middle of the night, users are sleeping and you want to schedule a job that can take full advantage of the overall system. I have reviewed various performance points and I believe using a high level of parallelism could allow my index build to complete quickly.
Warning: Index fragmentation with increased levels or parallelism: http://blogs.msdn.com/b/psssql/archive/2012/09/05/how-it-works-online-index-rebuild-can-cause-increased-fragmentation.aspx
There are several stages to determining the degree of parallelism (MAXDOP) a query can utilize.
Stage 1 – Compile
During complication SQL Server considers the hints, sp_configure and resource workgroup settings to see if a parallel plan should even be considered. Only if the query operations allow parallel execution:
If hint is present and > 1 then build a parallel plan
else if no hint or hint (MAXDOP = 0)
if sp_configure setting is 1 but workload group > 1 then build a parallel plan
else if sp_configure setting is 0 or > 1 then build parallel plan
Stage 2 – Query Execution
When the query begins execution the runtime, degree of parallelism is determined. This involves many factors, already outlined in SQL Server Books Online: http://technet.microsoft.com/en-US/library/ms178065(v=SQL.105).aspx
Before SQL Server looks at the idle workers and other factors it determines the target for the degree of parallelism.
if sp_configure or query hint forcing serial plan use (1)
else if resource workgroup set
if query hint present use min(hint, resource workgroup)
else use resource workgroup
If still 0 after the detailed calculations it is set to 64 (default max for SQL Server as documented in Books Online.) This fooled me some because on the 80 CPU system it has 2 Windows scheduler groups x 40 CPUs. I might have expected a 40 CPU cap to avoid crossing over a Windows scheduler group. This is not the case, SQL Server hard codes the 64 CPU target when the runtime target of MAXDOP is still 0 (default.)
The MAXDOP target is now adjusted for:
-
Actual CPU count (affinity settings from sp_configure and the resource pool).
-
Certain query types (index build for example) look at the partitions
-
Other query type limitations that may exist
Now SQL Server takes a look at the available workers (free workers for query execution.) You can loosely calculate the free worker count on a scheduler using (Free workers = Current_workers_count – current_tasks_count) from sys.dm_os_schedulers
Once the target is calculated the actual is determined by looking at the available resources to support a parallel execution. This involves determining the node(s) and CPUs with available workers.
Older versions of SQL Server used a polling mechanism every ~1 second to determine the node with the most free workers to target. This meant you could encounter race conditions from multiple queries, both going parallel on the same node when going parallel on separate nodes would have resulted in better CPU usage.
Newer builds of SQL Server actively track the free workers. This significantly reduces the possibility of assigning parallel queries to the same set of schedulers.
Trace flag 2466 - Force older version logic to determine number of available resources.
The worker location information is then used to target an appropriate set of CPUs to assign the parallel task to.
In general the placement decisions are:
-
SMP (FPlaceThreadsOneNodeSystem): If a single node system treat as SMP and only enqueue to a single node. SOFT NUMA and affinity may cause SQL to treat as SMP
-
CONNECTION (FPlaceThreadsOneNodeSystem): If trace flag 2479 is enabled force all parallel decisions to be limited to the node the connection is associated with. This may be helpful when using SOFT Numa or connection, node affinity.
-
FULL (FPlaceThreadsAllNodes): If MAXDOP target is equal to all schedulers enqueue work to all schedulers
-
LEAST (FPlaceThreadsWithinLeastLoadedNode): If target MAXDOP target is less than a single node can provide and if trace flag 2467 is enabled attempt to locate least loaded node
-
SPREAD (FPlaceThreadsMultipleNodes): The load is spread across any available node
Using XEvents you can monitor the MAXDOP decision logic. For example:
-
XeSqlPkg::calculate_dop_begin
-
XeSqlPkg::calculate_dop
Back to trying to get my index build to use all 80 CPUs. I can do several things:
1. Use MAXDOP=80 query hint
2. Set sp_configure ‘max degree of parallelism’, 80 -- Warning this applies to any query
3. Create resource pool/workload group and set MAXDOP=80 and assign only the index build connection to it using a resource governor classifier.
Testing Results
Here are the MAXDOP results on my 80 CPU system at different setting levels.
Query Hint | sp_configure | Workgroup | RUNTIME |
0 | 0 | 80 | 80 |
0 | 0 | 0 | 64 |
1 | 0 | 80 | 1 |
2 | 0 | 80 | 2 |
0 | 80 | 0 | 80 |
0 | 1 | 0 | 1 |
0 | 2 | 0 | 2 |
80 | 2 | 0 | 80 |
80 | 10 | 2 | 2 |
You can monitor the number of parallel workers by querying:
sys.dm_os_tasks
Note: Some configuration changes may require a flush of procedure cache (dbcc freeproccache) or a disconnect/connect pairing to take affect.