Data Decomposition Models in Parallel Programming

How do you divide data structures into contiguous subregions, or "chunks," of data?

  • An array is the most common candidate. You can divide arrays along one or more of their dimensions.
  • Other structures that use an array as a component also qualify (e.g., a graph implemented as an adjacency matrix).
  • List structures can be added to the set of decomposable data structures, but only if there is an easy way to identify and access sublists of discrete elements.
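As a concrete illustration of dividing an array along one dimension, here is a minimal Python sketch that splits the rows of a 2D array into contiguous chunks; the function name and sizes are invented for this example:

```python
# A minimal sketch of dividing a 2D array into row-wise chunks.
# The chunk count and array shape are arbitrary illustrative values.

def chunk_rows(n_rows, n_chunks):
    """Split n_rows into n_chunks contiguous (start, stop) row ranges,
    handing any remainder out one extra row at a time to the first chunks."""
    base, extra = divmod(n_rows, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

# Divide a matrix of 10 rows into 4 row-wise chunks.
print(chunk_rows(10, 4))   # four contiguous ranges covering rows 0..9
```

Dividing along the other dimension, or along both at once, works the same way on column ranges.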

The decomposition into chunks implies the division of computation into tasks that operate on the elements of each chunk, and those tasks are assigned to threads. The update computations may require data from neighboring chunks. If so, we will have to share data between tasks. Accessing or retrieving essential data from neighboring chunks will require coordination between threads.

Load balance is an important factor to take into consideration, especially when using chunks of variable sizes. If the data structure has a nice, regular organization and all the computations on that structure always take the same amount of execution time, you can simply decompose the structure into chunks with the same number of elements in some logical and efficient way. If your data isn't organized in a regular pattern, or the amount of computation is different or unpredictable for each element in the structure, decomposing the structure into tasks that take roughly the same amount of execution time is a much less straightforward affair. Perhaps you should consider a dynamic scheduling of chunks to threads in this case.
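To illustrate the dynamic scheduling just mentioned, here is a minimal sketch in which worker threads pull variable-sized chunks from a shared queue instead of receiving a fixed assignment up front; the work function and chunk sizes are illustrative assumptions:

```python
# A sketch of dynamic scheduling: chunks of unpredictable cost are pulled
# from a shared queue by worker threads as each thread becomes free.

import threading, queue

def process(chunk):
    return sum(x * x for x in chunk)   # stand-in for a variable-cost update

def run_dynamic(chunks, n_threads=4):
    work, results = queue.Queue(), []
    lock = threading.Lock()
    for c in chunks:
        work.put(c)

    def worker():
        while True:
            try:
                c = work.get_nowait()   # grab the next available chunk
            except queue.Empty:
                return                  # no chunks left; this thread is done
            r = process(c)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(results)

# Deliberately uneven chunk sizes to mimic unpredictable per-chunk cost.
chunks = [list(range(n)) for n in (5, 50, 2, 30)]
print(run_dynamic(chunks))
```

The longer chunks do not stall the schedule: whichever thread finishes its current chunk first simply takes the next one from the queue.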

How should you divide the data into chunks?

The answer depends mainly on the type of data structure you are decomposing.

Data decompositions have an additional dimension that you must consider when dividing data structures into chunks: the shape of the chunk.

The shape of a chunk determines what the neighboring chunks are and how any exchange of data will be handled during the course of the chunk's computations. Suppose data must be exchanged across the border of each chunk (the term exchange refers to the retrieval of data that is not contained within a given chunk for the purpose of using that data in the update of elements that are in the local chunk). Two guidelines then apply:

  • Reducing the size of the overall border reduces the amount of exchange data required for updating local data elements;
  • Reducing the total number of chunks that share a border with a given chunk makes the exchange operation less complicated to code and execute.

Large granularity can actually be a detriment with regard to the shape of a chunk. The more data elements there are within a chunk, the more elements there may be that require an exchange of neighboring data, and the more overhead there may be to perform that exchange. When deciding how to divide large data structures that will necessitate data exchanges, a good rule of thumb is to try to maximize the volume-to-surface ratio: keep as many of the chunk's elements in its interior (the volume) as possible relative to its border (the surface).
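The rule of thumb can be made concrete with a small sketch comparing two chunk shapes of equal size in a 2D grid; the sizes are illustrative:

```python
# A quick comparison of chunk shapes for a 2D grid, illustrating the
# volume-to-surface rule of thumb. Both shapes hold 64 elements, but the
# square block has far fewer border elements to exchange.

def border_len(rows, cols):
    """Number of border elements (the 'surface') of a rows x cols chunk."""
    if rows == 1 or cols == 1:
        return rows * cols          # a strip is all border
    return 2 * (rows + cols) - 4    # perimeter, corners counted once

strip = border_len(1, 64)   # 1x64 strip: every element is on the border
block = border_len(8, 8)    # 8x8 block: only the perimeter is
print(strip, block)
```

Same granularity, different shape: the square block exchanges data for 28 elements instead of 64, so its volume-to-surface ratio is better.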

Irregular shapes may be necessary due to the irregular organization of the data. You need to be more vigilant with chunks of irregular shapes to ensure that a good load balance can be maintained, as well as a high enough granularity within the chunk to lessen the impact of unavoidable overheads.

You may need to revise your decomposition strategy after considering how the granularity and shape of the chunks affect the exchange of data between tasks. The division of data structures into chunks influences the need to access data that resides in another chunk. The next section develops ideas about accessing neighboring data that you should consider when deciding how to best decompose a data structure into chunks.

How can you ensure that the tasks for each chunk have access to all data required for updates?

This situation arises when some data that is required by a given chunk is held within a neighboring chunk.

Two methods deal with this situation:

  • Copy the data from the nearby chunk into structures local to the task (thread).
  • Access the data as needed from the nearby chunk.

Method 1: Copy the data

The most obvious disadvantage for copying the necessary data not held in the assigned chunk is that each task will require extra local memory in order to hold the copy of data. However, once the data has been copied, there will be no further contention or synchronization needed between the tasks to access the copies. Copying the data is best used if the data is available before the update operation and won't change while being copied or during the update computations. This will likely mean some initial coordination between tasks to ensure that all copying has been done before tasks start updating.

The extra local memory resources that are allocated to hold copied data are often known as ghost cells. These cells are images of the structure and contents of data assigned to neighboring chunks.

If the update computation of an individual element required the data from the two elements on either side of it in the same row, the whole column from the neighboring chunk bordering the split would need to be accessible. Copying these data elements into ghost cells would allow the task to access that data without interfering with the updates of the neighboring chunk.
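Here is a minimal sketch of this column-copy scenario, assuming a small 2D grid split into two column-wise chunks; the grid contents are invented for illustration:

```python
# A sketch of ghost cells for a column-split 2D grid. Each chunk gains one
# extra ("ghost") column on its interior side, filled by copying the
# bordering column from the neighboring chunk before updates begin.

grid = [[r * 10 + c for c in range(6)] for r in range(3)]   # 3x6 grid

# Split into two 3x3 chunks (columns 0-2 and 3-5).
left  = [row[0:3] for row in grid]
right = [row[3:6] for row in grid]

# Ghost cells: left copies right's first column; right copies left's last.
left_with_ghost  = [row + [right[r][0]]  for r, row in enumerate(left)]
right_with_ghost = [[left[r][-1]] + row for r, row in enumerate(right)]

print(left_with_ghost[0])    # row 0 of the left chunk plus its ghost cell
```

Once the ghost columns are filled, each task reads only its own (extended) chunk during the update, so no further synchronization on those values is needed.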

Another factor to consider when thinking about copying the required data is how many times copying will be necessary. This all depends on the nature of the update computation. Are repeated copies required for, say, an iterative algorithm that refines its solution over multiple updates? Or is the data copy only needed once at the outset of the computation? The more times the copy has to be carried out, the greater the overhead burden will be for the update
computations. And then there is the matter of the amount of data that needs to be copied. Too many copy operations or too much data per copy might be an indicator that simply accessing the required data directly from a neighboring chunk would be the better solution.

Method 2: Access the data as needed

Accessing data as needed takes full advantage of shared memory communication between threads and the logic of the original serial algorithm. You also have the advantage of being able to delay any coordination between threads until the data is needed. The downside is that you must be able to guarantee that the correct data will be available at the right time. Data elements that are required but located within a neighboring chunk may be in the process of receiving
updates concurrently with the rest of the elements in the neighboring chunk. If the local chunk requires the "old" values of nonlocal data elements, how can your code ensure that those values are not the "new" values? To answer this question or to know whether we must even deal with such a situation, we must look at the possible interactions between the exchange of data from neighboring chunks and the update operation of local chunk elements.

If all data is available at the beginning of tasks and that data will not change during the update computation, the solution will be easier to program and more likely to execute efficiently. You can either copy relatively small amounts of data into ghost cells or access the unchanging data through shared memory. In order to perform the copy of nonlocal data, add a data gathering (exchange) phase before the start of the update computations. Try to minimize the execution time of the data-gathering phase, since this is pure overhead that was not part of the original serial code.
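A sketch of such a gather-then-update structure, using a barrier so that no task begins its update until every task has filled its ghost cells; the chunk data and the single ghost value per chunk are illustrative simplifications:

```python
# A sketch of a data-gathering (exchange) phase separated from the update
# phase by a barrier: no thread starts updating until every thread has
# finished copying its ghost data.

import threading

chunks  = [[1, 2, 3], [4, 5, 6]]
ghosts  = [None, None]                 # one ghost value per chunk
results = [None, None]
barrier = threading.Barrier(2)

def task(i):
    # Phase 1: gather - copy the needed border value from the neighbor.
    neighbor = 1 - i
    ghosts[i] = chunks[neighbor][0] if i == 0 else chunks[neighbor][-1]
    barrier.wait()                     # all copies done before any update
    # Phase 2: update - safe to use local data plus the ghost copy.
    results[i] = sum(chunks[i]) + ghosts[i]

threads = [threading.Thread(target=task, args=(i,)) for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(results)
```

The barrier wait is exactly the "initial coordination between tasks" described above for the copying method: it is pure overhead, so the gather phase should be kept as short as possible.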

If nonlocal data will be accessed (or copied) during update computations, you will need to add code to ensure that the correct data will be found. Mixing exchange and update computations can complicate the logic of your application, especially to ensure correct data is retrieved. However, the serial application likely had this requirement, too, and the solution to the need for accessing correct data concurrently should simply follow the serial algorithm as much as possible.

How are the data chunks (and tasks) assigned to threads?

As with task decomposition, the tasks that are associated with the data chunks can be assigned to threads statically or dynamically. Static scheduling is simplest since the coordination needed for any exchange operations will be determined at the outset of the computations. Static scheduling is most appropriate when the amount of computations within tasks is uniform and predictable. Dynamic scheduling may be necessary to achieve a good load balance due to variability in the computation needed per chunk. This will require (many) more tasks than threads, but it also complicates the exchange operation and how you coordinate the exchange with neighboring chunks and their update schedules.
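For contrast with the dynamic case, a static schedule can be as simple as a fixed round-robin map from chunks to threads, decided before any computation starts; this sketch is illustrative:

```python
# A minimal sketch of static assignment: chunk i is bound to thread
# i % n_threads before execution begins, and the mapping never changes,
# so each task knows its exchange partners at the outset.

def static_assignment(n_chunks, n_threads):
    """Map each chunk index to a thread index, round-robin."""
    return {c: c % n_threads for c in range(n_chunks)}

print(static_assignment(8, 3))   # which thread owns each of 8 chunks
```

Because the owner of every chunk is known in advance, any exchange with a neighboring chunk can be planned at the outset; with a dynamic schedule, a neighbor's chunk may not even have been picked up by a thread yet.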

Being the sharp reader that you are, you have no doubt noticed that in most of the discussion over the last four pages or so I have used the term "task" rather than "thread." I did this on purpose. The tasks, defined by how the data structures are decomposed, identify what interaction is needed with other tasks regardless of which thread is assigned to execute what task. Additionally, if you are using a dynamic schedule of tasks, the number of tasks will
outnumber the total number of threads. In such a case, it will not be possible to run all tasks in parallel. You may then come up against the situation where some task needs data from another task that has not yet been assigned to execute on a thread. This raises the complexity of your concurrent design to a whole other level, and I'm going to leave it to you to avoid such a situation.

posted on 2010-09-11 10:44 by 胡是
