[No0000182] Parallel Programming with .NET: Partitioning in PLINQ

Every PLINQ query that can be parallelized starts with the same step: partitioning.  Some queries may even need to repartition in the middle.  Partitioning is a fairly simple concept at a high level: PLINQ takes a lock on the input data source, breaks it into multiple pieces, and then distributes these to the available processing cores on the machine.  Each core then processes its data as appropriate, and PLINQ either merges, aggregates, or executes a function over the results from the partitions.

Example query: (from x in D.AsParallel() where p(x) select x*x*x).Sum();
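To make the example query concrete, here is a minimal runnable sketch. The class name `ExampleQuery`, the predicate `P` (standing in for `p(x)`), and the `long[]` data source (standing in for `D`) are assumptions chosen for illustration; only `AsParallel()` and `Sum()` come from the query above.

```csharp
using System;
using System.Linq;

static class ExampleQuery
{
    // Hypothetical predicate standing in for p(x): keep even values only.
    static bool P(long x) => x % 2 == 0;

    // The example query, run in parallel over a hypothetical source D.
    public static long SumOfCubes(long[] d) =>
        (from x in d.AsParallel() where P(x) select x * x * x).Sum();

    // The same query run sequentially, handy for checking the parallel result.
    public static long SumOfCubesSequential(long[] d) =>
        (from x in d where P(x) select x * x * x).Sum();
}
```

Because the source here is an array, it is indexable, which matters for the partitioning choices described below.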

 


Here’s a simple way to look at it.  On a 4-core machine, take 4 million elements, divide them into 4 partitions of 1 million elements each, and give each of the 4 cores a million elements to process.  This is an efficient algorithm for some straightforward types of queries, assuming that the data and its processing are uniform, that all of the cores are equally effective, that nothing else is using the cores, and that we can access any element directly rather than reaching element N only after element N-1 (i.e. indexing rather than iterating).  If you counted those assumptions, you can see that PLINQ has a lot of factors to take into account when processing a query.  It’s not as simple as it appears, and some factors did not even show up as assumptions in that example.  How is the data being ordered?  Is there contention within the query itself?  Do we know how much data is in the query?  On and on, the considerations continue.  Meanwhile, we’re trying to design a multi-purpose system that can handle any of an infinite number of query shapes that you can throw at it.  (You might wonder how we test all of that.  Ask us about SLUG someday.)

Needless to say, we need to look over the query and the data source to make these decisions.  And we want to make them fast, because the more time we spend deciding, the more time we have wasted that could have been spent processing the data.  There are a lot of things to balance here as we try to decide what is best to do.

Based on many factors, we have 4 primary algorithms that we use for partitioning alone.  They’re worth getting to know, because we’ll talk more about them and tweaks that we make to them in future technology discussions.

1. Range Partitioning – This is a pretty common partitioning scheme, similar to the one that I described in the example above.  It is amenable to many query shapes, though it only works with indexable data sources such as lists and arrays (i.e. IList<T> and T[]).  If you give PLINQ something typed as IEnumerable or IEnumerable<T>, PLINQ will query for the IList<T> interface and, if it’s found, will use that interface implementation with range partitioning. The benefit of these data sources is that we know the exact length and can access any of the elements directly.  In the majority of cases, there are large performance benefits to using this type of partitioning.
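The idea behind range partitioning can be sketched as plain index arithmetic. This is an illustration only, not PLINQ's actual implementation; `RangePartitionSketch`, `SplitRanges`, and `CanRangePartition` are hypothetical names, and giving the leftover elements to the first few ranges is an assumption.

```csharp
using System;
using System.Collections.Generic;

static class RangePartitionSketch
{
    // Splits the index space [0, count) into partitionCount contiguous ranges.
    // Assumption: the first (count % partitionCount) ranges get one extra element.
    public static List<(int Start, int End)> SplitRanges(int count, int partitionCount)
    {
        var ranges = new List<(int Start, int End)>(partitionCount);
        int baseSize = count / partitionCount;
        int remainder = count % partitionCount;
        int start = 0;
        for (int i = 0; i < partitionCount; i++)
        {
            int size = baseSize + (i < remainder ? 1 : 0);
            ranges.Add((start, start + size)); // End is exclusive
            start += size;
        }
        return ranges;
    }

    // Mirrors the check described above: a source typed as IEnumerable<T>
    // qualifies for range partitioning only if it also implements IList<T>.
    public static bool CanRangePartition<T>(IEnumerable<T> source) => source is IList<T>;
}
```

For the 4-core example, `SplitRanges(4_000_000, 4)` yields four ranges of one million indices each. Each worker can then read its own range by index without coordinating with the others.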


2. Chunk Partitioning – This is a general-purpose partitioning scheme that works for any data source, and is the main partitioning type for non-indexable data sources. In this scheme, worker threads request data, and it is served up to each thread in chunks. IEnumerable and IEnumerable<T> do not have a fixed Count property (there is a LINQ extension method for this, but that is not the same, since it may have to enumerate the whole source), so there’s no way to know when or if the data source will enumerate completely.  It could be 3 elements, it could be 3 million elements, it could be infinite.  A single system needs to take all of these possibilities into account and factor in different delegate sizes, uneven delegate sizes, selectivity, etc. The chunk partitioning algorithm is quite general, and PLINQ’s algorithm had to be tuned for good performance on a wide range of queries.  We’ve experimented with many different growth patterns for the chunk size and currently use a plan in which it doubles after a certain number of requests. This is subject to change as we tune for performance, so don’t depend on it. Another important benefit is that chunk partitioning balances the load among cores, because the task on each core dynamically requests more work as it needs it. This ensures that all cores are utilized throughout the query and can cross the finish line at about the same time, rather than finishing one after another in a ragged sequence.
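A rough sketch of the chunking idea follows, assuming a single shared enumerator guarded by a lock. It is not PLINQ's actual algorithm: the class `ChunkSource<T>`, the `requestsPerSize` parameter, the starting chunk size of 1, and the `MaxChunkSize` cap are all hypothetical values picked for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hands out chunks from a shared enumerator; one lock makes it thread-safe.
sealed class ChunkSource<T>
{
    private const int MaxChunkSize = 512;    // hypothetical cap on growth
    private readonly IEnumerator<T> _source;
    private readonly object _gate = new object();
    private readonly int _requestsPerSize;   // hypothetical tuning knob
    private int _chunkSize = 1;
    private int _requestsAtThisSize;

    public ChunkSource(IEnumerable<T> source, int requestsPerSize = 3)
    {
        _source = source.GetEnumerator();
        _requestsPerSize = requestsPerSize;
    }

    // Returns the next chunk, or an empty list once the source is exhausted.
    public List<T> TakeChunk()
    {
        lock (_gate)
        {
            var chunk = new List<T>(_chunkSize);
            while (chunk.Count < _chunkSize && _source.MoveNext())
                chunk.Add(_source.Current);

            // Double the chunk size after a fixed number of requests.
            if (++_requestsAtThisSize == _requestsPerSize)
            {
                _requestsAtThisSize = 0;
                _chunkSize = Math.Min(_chunkSize * 2, MaxChunkSize);
            }
            return chunk;
        }
    }
}
```

Each worker would call `TakeChunk()` in a loop until it gets an empty chunk. A worker that finishes its chunk early simply asks for the next one, which is where the load balancing comes from. Small early chunks keep short sources cheap, and larger later chunks reduce how often workers contend for the lock.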


3. Striped Partitioning – This scheme is used for SkipWhile and TakeWhile and is optimized for processing items at the head of a data source (which obviously suits the needs of SkipWhile and TakeWhile). In striped partitioning, each of the n worker threads is allocated a small number of items (sometimes 1) from each block of n items.  The set of items belonging to a single thread is called a ‘stripe’, hence the name.  A useful feature of this scheme is that there is no inter-thread synchronization required as each worker thread can determine its data via simple arithmetic. This is really a special case of range partitioning and only works on arrays and types that implement IList<T>.
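The "simple arithmetic" can be sketched as follows. This is an illustrative guess at the index calculation, not PLINQ's code; `StripeSketch`, `StripeIndices`, and the `stripeWidth` parameter (the "small number of items" each worker takes per block) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class StripeSketch
{
    // Yields the indices owned by one worker. Every block holds
    // workerCount * stripeWidth items, and each worker takes
    // stripeWidth consecutive items from every block.
    public static IEnumerable<int> StripeIndices(
        int count, int workerCount, int workerIndex, int stripeWidth = 1)
    {
        int blockSize = workerCount * stripeWidth;
        for (int blockStart = 0; blockStart < count; blockStart += blockSize)
        {
            int first = blockStart + workerIndex * stripeWidth;
            for (int i = first; i < first + stripeWidth && i < count; i++)
                yield return i;
        }
    }
}
```

With 4 workers and a stripe width of 1 over 10 items, worker 0 owns indices 0, 4, 8 and worker 1 owns 1, 5, 9. Every worker touches items near the head of the source early, and no worker needs to ask another which indices are taken.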


4. Hash Partitioning – Hash partitioning is a special type of partitioning that is used by the query operators that must compare data elements (these operators are: Join, GroupJoin, GroupBy, Distinct, Except, Union, Intersect).  When hash-partitioning occurs (which is just prior to any of the operators mentioned), all of the data is processed and channeled to threads such that items with identical hash-codes will be handled by the same thread.  This hash-partitioning work is costly, but it means that all the comparison work can then be performed without further synchronization.    Hash partitioning assigns every element to an output partition based on a hash computed from each element’s key. This can be an efficient way of building a hash-table on the fly concurrently, and can be used to accomplish partitioning and hashing for the hash join algorithm. The benefit is that PLINQ can now use the same hash partitioning scheme for the data source used for probing; this way all possible matches end up in the same partition, meaning less shared data and smaller hash table sizes (each partition has its own hash table). There’s a lot going on with hash-partitioning, so it’s not as speedy as the other types, especially when ordering is involved in the query.  As a result the query operators that rely upon it have additional overheads compared to simpler operators.
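The core assignment step can be sketched like this. It is a sequential simplification for illustration, whereas the scheme described above does this work concurrently; `HashPartitionSketch` and `Partition` are hypothetical names, and using the default equality comparer with a modulo on the masked hash code is an assumption.

```csharp
using System;
using System.Collections.Generic;

static class HashPartitionSketch
{
    // Assigns every element to an output partition based on the hash of its key,
    // so elements with equal keys always land in the same partition.
    public static List<T>[] Partition<T, TKey>(
        IEnumerable<T> source, Func<T, TKey> keySelector, int partitionCount)
    {
        var partitions = new List<T>[partitionCount];
        for (int i = 0; i < partitionCount; i++)
            partitions[i] = new List<T>();

        var comparer = EqualityComparer<TKey>.Default;
        foreach (T item in source)
        {
            TKey key = keySelector(item);
            int hash = key == null ? 0 : comparer.GetHashCode(key);
            // Mask the sign bit so the partition index is never negative.
            int index = (hash & int.MaxValue) % partitionCount;
            partitions[index].Add(item);
        }
        return partitions;
    }
}
```

If both sides of a join are partitioned with the same key hash and the same partition count, matching keys end up at the same partition index on both sides. Each partition can then build and probe its own small hash table without sharing data with the others.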


Hopefully that gives you a little better understanding of what goes on under the covers.  As we write more about the system, including performance tips and tricks, this should serve as a basis for understanding how things work.  If you can structure your applications to use specific types of partitioning, you may see greater performance gains from the more efficient partitioning algorithms.


 

posted @ 2018-09-29 13:55  CharyGao