Graph Processing System Survey
Abstract
As a fundamental data structure, a graph is an abstraction of real-world objects and their relationships. Many scientific problems can be modeled as graph problems, so analyzing graph data is important; it has wide applications in Semantic Web analysis, social networks, biogenetic analysis, and information retrieval. With the development of information technologies such as the mobile Internet and the IoT, the scale of graph data keeps growing. Distributed graph processing systems that can efficiently analyze and compute large-scale graph data are therefore of great importance. In this survey, after investigating several existing distributed graph processing systems, we first introduce the background and significance of graph processing systems, then describe how these systems operate, and finally list the current problems and challenges of graph processing systems and discuss future research directions.
Introduction
In the era of big data, much of that data takes the form of large-scale graphs or networks: social networks, infectious-disease transmission routes, and the impact of traffic accidents on road networks, for example. Many non-graph-structured datasets are also converted into graph models before processing and analysis. Graphs keep getting larger, with some now having billions of vertices and hundreds of billions of edges. Analyzing this graph data is of great practical importance, but the huge scale and complex structure of real graph data, coupled with the high computational complexity of graph algorithms themselves, have pushed large-graph analysis beyond the storage and computing capacity of a single machine, posing a huge challenge to graph analysis. Because one machine can no longer store all the data to be computed, a distributed computing environment is needed. MapReduce, a simple distributed computing framework that processes large amounts of data through user-defined Map and Reduce functions, was once expected to meet the needs of large-scale graph computing. However, the MapReduce framework has design flaws for this workload: each job must repeatedly read from and write to the distributed file system, intermediate results cannot be cached, and execution results cannot be shared between tasks. Moreover, most graph algorithms require many iterations to complete, and each iteration corresponds to a separate MapReduce job, so MapReduce cannot execute graph algorithms efficiently.
Therefore, new computation frameworks have emerged for large graphs. Current general-purpose graph processing systems are mainly vertex-centric, message-passing, bulk-parallel engines, such as Pregel, GPS, and PowerGraph. These systems are described in detail in the next section.
Graph Processing Systems
BSP
Before introducing specific graph processing systems, we introduce the BSP (Bulk Synchronous Parallel) model, the parallel computing model underlying most graph processing systems today.
BSP is a parallel computing model proposed by Leslie Valiant of Harvard University and Bill McColl of Oxford University. Unlike MapReduce, which handles iterative computation poorly, BSP is particularly well suited to iterative operations on data. A BSP machine consists of a large number of processors interconnected through a network, each with fast local memory and its own computation threads. A BSP computation consists of a series of global supersteps (a superstep is a single iteration of the computation), and each superstep has three main components, listed below and sketched in code after the list.
- Local computation. Each processor performs its own computation task, reading only values stored in local memory; the computation tasks of different processors are asynchronous and independent.
- Communication. The processors exchange data with each other via Put and Get operations initiated by one side.
- Barrier synchronization. When a processor reaches the barrier, it waits until all other processors have completed their computation; each synchronization marks the end of one superstep and the beginning of the next.
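To make the three phases concrete, here is a minimal Python simulation of a BSP superstep loop. It is an illustrative sketch only: the `local_compute` callback, the mailbox dictionaries, and the halt-when-quiet rule are assumptions of this sketch, not part of any particular system.

```python
def bsp_run(processors, local_compute, max_supersteps=30):
    # inbox[p] holds the messages delivered to processor p at the barrier.
    inbox = {p: [] for p in processors}
    for superstep in range(max_supersteps):
        outbox = {p: [] for p in processors}
        any_messages = False
        # Phase 1: local computation, independent per processor.
        for p in processors:
            # local_compute returns a list of (destination, message) pairs.
            for dest, msg in local_compute(p, superstep, inbox[p]):
                outbox[dest].append(msg)
                any_messages = True
        # Phase 2: communication; messages become visible next superstep.
        inbox = outbox
        # Phase 3: barrier synchronization is implicit here, since no
        # processor starts superstep i+1 until all finish superstep i.
        if not any_messages:   # simulation assumption: stop when quiet
            break
```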
Pregel
Since most graph processing systems today are improved implementations of Pregel's functionality, we introduce the Pregel model next.
To handle iterative computation on large graphs efficiently, Google proposed Pregel, a vertex-centric distributed graph computation framework inspired by the BSP model, in 2010.
Pregel is a parallel graph processing system implemented on the BSP model. To solve the problem of distributed computation on large graphs, Pregel provides a scalable, fault-tolerant platform with a very flexible set of APIs for describing a wide variety of graph computations. As a framework for distributed graph computation, Pregel is mainly used for graph traversal, shortest paths, PageRank, and similar workloads.
Pregel runs as follows. To start the computation, the vertices of the graph are distributed among the compute nodes. The computation consists of iterations called supersteps. In each superstep, analogous to the map() and reduce() functions in the MapReduce framework, a user-specified vertex.compute() function is applied to each vertex in parallel. Inside vertex.compute(), the vertex updates its state (possibly based on incoming messages), sends messages to other vertices for use in the next iteration, and sets a flag indicating whether it is ready to stop. At the end of each superstep, all compute nodes synchronize before the next superstep starts. The iteration stops when all vertices have voted to halt. This model suits graph computation better than Hadoop because it is inherently iterative and the graph can be kept in memory throughout the computation.
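To illustrate the programming model, here is a self-contained Python simulation of a Pregel-style PageRank, in the spirit of the example in the Pregel paper. The single-process superstep loop stands in for the distributed runtime; the 0.85 damping factor, the 30-superstep cutoff, and the toy graph are illustrative assumptions.

```python
def pregel_pagerank(out_edges, supersteps=30, damping=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps):
        outbox = {v: [] for v in out_edges}
        # vertex.compute(), conceptually run on every vertex in parallel.
        for v in out_edges:
            if step >= 1:
                # Refresh this vertex's rank from incoming contributions.
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            for w in out_edges[v]:
                # Message delivered at the start of superstep step + 1.
                outbox[w].append(rank[v] / len(out_edges[v]))
        inbox = outbox  # barrier: all nodes synchronize before the next step
    return rank

# Example: on a 3-vertex cycle, every vertex converges to rank 1/3.
print(pregel_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```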
Pregel offers workable solutions for graph partitioning, communication management, synchronization control, and fault-tolerant recovery, and its architecture eliminates the repeated loading of graph data during computation. Although the Pregel framework has a simple computational model and good scalability, it has three limitations.
- Slow convergence, resulting in many iterations. The vertex-based, fine-grained, synchronous computation model limits how fast the overall task converges.
- High message volume, leading to high communication cost. In a distributed graph processing system the graph must be partitioned for parallelism, and each partition is assigned to a different compute node. Since vertices in different partitions are connected to each other, the denser these cross-partition connections, the larger the volume of messages that must cross the network.
- Uneven load and the bucket (straggler) effect. Unreasonable data partitioning, inconsistent convergence rates across partitions during iteration, and differences in computing power between nodes all cause load imbalance across nodes.
GPS
GPS adopts Pregel's distributed message-passing model, which is based on bulk synchronous processing. The input to GPS is a directed graph in which each vertex holds a user-defined value and a flag indicating whether the vertex is active. The computation proceeds in superstep iterations and terminates when all vertices are inactive.
In superstep i, each active vertex u performs the following actions in parallel:
- reads the messages sent to u in superstep i-1;
- modifies its value;
- sends messages to other vertices in the graph, and optionally becomes inactive.
A message sent from vertex u to vertex v in superstep i is available to v in superstep i + 1. The behavior of each vertex is encapsulated in a function vertex.compute(), which is executed exactly once per superstep.
GPS is an open-source distributed message-passing system for large-scale graph computation. It is designed to be scalable, fault tolerant, and easy to program through simple user-provided functions. Compared to the Pregel system, GPS has three new features:
- GPS has an extended API that makes global computations easier to express and more efficient. Whereas the Pregel API can express only "vertex-centric" algorithms, the GPS API can also efficiently implement algorithms that combine one or more vertex-centric computations with global computations. GPS extends Pregel's API with an additional function, master.compute(). Implementing graph computation purely in vertex.compute() works well for certain algorithms, such as computing PageRank or finding shortest paths, which are entirely "vertex-centric". However, some algorithms combine vertex-centric (parallel) and global (sequential) computation. Consider, for example, the following k-means-like graph clustering algorithm, which consists of the four components listed below (a runnable sketch of this division of labor follows the feature list).
(a) Selecting k random vertices as "cluster centers", a global computation over the entire graph.
(b) Assigning each vertex to a cluster center, a vertex-centric computation.
(c) Evaluating the quality of the clustering by counting the number of edges crossing between clusters, a vertex-centric computation.
(d) Deciding whether to stop (if the clustering is good enough) or return to (a), a global computation.
The global steps could be run by designating a special "master" vertex and implementing the global computation inside vertex.compute(). However, this approach has two problems.
- The master vertex performs each global computation in a superstep during which all other vertices are idle, wasting resources.
- The vertex.compute() code becomes harder to understand, because it mixes parts written for all vertices with parts written only for the special master vertex.
To integrate global computations easily and efficiently, GPS therefore extends Pregel's API with the additional function master.compute().
- GPS has a dynamic repartitioning scheme: based on the message-passing pattern, it reassigns vertices to different workers during the computation. Dynamically reassigning some vertices to other workers while the graph computation runs can reduce the number of messages sent over the network.
- GPS has LALP (Large Adjacency-List Partitioning), an optimization that partitions the adjacency lists of high-degree vertices across the compute nodes to improve performance.
Many real-world graphs have skewed degree distributions, in which the adjacency lists of a few vertices contain a large fraction of all the edges in the graph. For such graphs, LALP can significantly reduce network traffic and runtime. GPS takes a parameter τ when this optimization is enabled: if a vertex u has more than τ neighbors, GPS divides the adjacency list of u into \(N_1(u), N_2(u), \dots, N_k(u)\) and sends each \(N_j(u)\) to worker \(W_j\) during the initial partitioning of the graph among the workers. During execution, when u sends a message to all its neighbors, GPS intercepts it and sends a single message to each worker \(W_j\), which then delivers it to every vertex in \(N_j(u)\).
In many graph algorithms, such as PageRank and connected components, every vertex sends the same message to all its neighbors. For example, if a high-degree vertex v on compute node i has 1000 neighbors on compute node j, v sends the same message 1000 times between nodes i and j. GPS's LALP optimization stores the partitioned adjacency lists of high-degree vertices on the compute nodes where the neighbors reside, reducing those 1000 messages to 1 (see the second sketch below).
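Below is a compact, runnable Python sketch of the division of labor that master.compute() enables for the k-means example: steps (a) and (d) run sequentially in the master phase, step (b) runs per vertex, and step (c) is folded into the driver for brevity. The function names, the BFS distance metric, and the edge-based quality measure are assumptions of this sketch, not GPS's actual API.

```python
import random
from collections import deque

def bfs_distance(graph, src, dst):
    # Illustrative distance metric: unweighted shortest-path hops.
    dist, frontier = {src: 0}, deque([src])
    while frontier:
        v = frontier.popleft()
        if v == dst:
            return dist[v]
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                frontier.append(w)
    return float("inf")

def master_compute(state, graph, k, good_enough):
    # Global, sequential phase: steps (a) and (d).
    if state["superstep"] > 0 and state["quality"] >= good_enough:
        state["halt"] = True                                # step (d): stop
    else:
        state["centers"] = random.sample(sorted(graph), k)  # step (a): (re)pick

def vertex_compute(v, state, graph):
    # Parallel, vertex-centric phase: step (b).
    state["assignment"][v] = min(
        state["centers"], key=lambda c: bfs_distance(graph, v, c))

def kmeans(graph, k=2, good_enough=0.8, max_supersteps=10):
    state = {"superstep": 0, "assignment": {}, "quality": 0.0, "halt": False}
    while not state["halt"] and state["superstep"] < max_supersteps:
        master_compute(state, graph, k, good_enough)   # sequential in GPS
        if state["halt"]:
            break
        for v in graph:                                # parallel in GPS
            vertex_compute(v, state, graph)
        # Step (c), vertex-centric in GPS, folded into the driver here:
        # fraction of (directed) edges whose endpoints share a cluster.
        a = state["assignment"]
        total = sum(len(graph[u]) for u in graph) or 1
        state["quality"] = sum(
            a[u] == a[w] for u in graph for w in graph[u]) / total
        state["superstep"] += 1
    return state["assignment"]

# Example: two triangles joined by one edge split into two clusters.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(kmeans(g))
```

And a second sketch, of the LALP fan-out reduction described above: when a high-degree vertex messages all its neighbors, one network message per worker replaces one message per remote neighbor. The `Worker` class here is a toy stand-in, not part of GPS.

```python
class Worker:
    # Toy stand-in for a GPS worker; deliver() fans one message out locally.
    def __init__(self):
        self.inbox = []
    def deliver(self, msg, neighbors):
        self.inbox.extend((v, msg) for v in neighbors)

def send_to_all_neighbors(u, msg, adj_slices, workers):
    # adj_slices[u]: worker id -> the slice N_j(u) of u's adjacency list
    # that the worker hosts. One message per worker, not one per neighbor.
    for wid, neighbors in adj_slices[u].items():
        workers[wid].deliver(msg, neighbors)
```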
PowerGraph
PowerGraph is designed to process very large natural graphs, the kind generated by real-world platforms and applications. Existing distributed graph processing platforms handle them inefficiently, because natural graphs contain vertices of very high degree and existing partitioning strategies are of low quality. PowerGraph is designed to solve these two problems, achieving an order-of-magnitude performance improvement when processing natural graphs.
PowerGraph's strategy to solve these problems is twofold.
First, it proposes the GAS (Gather, Apply, Scatter) computation model to parallelize the processing of high-degree vertices. The GAS decomposition proceeds as follows. Gather: collect neighbor information, first on the same machine and then by aggregating the partial results collected from different hosts, yielding the final combined value. Apply: apply the gathered value to the center vertex. Scatter: update the neighboring vertices and edges, and trigger the neighbors for the next iteration round. A sketch follows.
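Here is a brief Python sketch of the GAS decomposition, using PageRank (the running example in the PowerGraph paper) to show the role of each phase. The signatures are simplified stand-ins for PowerGraph's actual C++ interface, and the convergence tolerance is an assumption.

```python
DAMPING = 0.85

def gather(rank, out_degree, in_neighbors):
    # Gather: collect neighbor contributions. In PowerGraph this runs in
    # parallel on each mirror of the vertex, and the partial sums are
    # combined at the master; a single sum models that here.
    return sum(rank[u] / out_degree[u] for u in in_neighbors)

def apply(acc, num_vertices):
    # Apply: fold the gathered value into the center vertex's value.
    return (1 - DAMPING) / num_vertices + DAMPING * acc

def scatter(old_rank, new_rank, tol=1e-4):
    # Scatter: update neighbors/edges; returning True re-activates the
    # neighbors for the next iteration round.
    return abs(new_rank - old_rank) > tol
```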
Second, it uses a vertex-cut strategy to balance load across the cluster, which is very effective for partitioning graphs with many dense vertices. PowerGraph proposes a balanced graph partitioning scheme that ensures load balance while reducing the communication volume of the computation. The communication overhead is linear in the number of machines spanned by each vertex, and the vertex cut minimizes that number. Instead of edge cuts, which require synchronizing a large number of edges, PowerGraph uses vertex cuts, which only synchronize the state of individual vertices.
The PowerGraph paper proposes three edge-placement strategies: random placement, coordinated greedy placement, and oblivious greedy placement. The first places edges on machines at random according to the vertex cut. The greedy strategies address a weakness of random cuts: although the subgraphs they produce are roughly balanced, their internal connectivity is poor. PowerGraph therefore proposes a heuristic greedy algorithm whose basic rule is: when a new edge arrives, if one of its endpoints already has a replica on some machine, the edge is assigned to that machine. Greedy vertex cuts minimize the number of machines each vertex spans, and in practice greedy placement performs better than random placement.
The authors give two implementations of the greedy strategy. The first is coordinated placement, which maintains a global vertex-placement history table; the table is queried before each greedy placement decision and updated during execution. This coordinated vertex cut is slow but produces high-quality cuts. The second is the oblivious greedy strategy, an approximate greedy strategy that needs no global coordination: instead of a shared global table, each machine keeps its own local table, and no inter-machine communication is required. This strategy is fast, but the cut quality is lower. A sketch of the greedy placement rule follows.
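Below is a minimal Python sketch of the greedy placement rule just described: each arriving edge goes to a machine that already hosts a replica of one of its endpoints, with a least-loaded fallback. The load-based tie-breaking is an assumption of this sketch; the full PowerGraph heuristic distinguishes more cases.

```python
def greedy_place(edges, num_machines):
    replicas = {}                   # vertex -> machines holding a replica
    load = [0] * num_machines       # number of edges placed per machine
    assignment = {}
    for (u, v) in edges:
        a = replicas.setdefault(u, set())
        b = replicas.setdefault(v, set())
        # Prefer machines both endpoints already touch, then either
        # endpoint's machines, then any machine; break ties by load.
        candidates = (a & b) or a or b or set(range(num_machines))
        m = min(candidates, key=lambda i: load[i])
        assignment[(u, v)] = m
        load[m] += 1
        a.add(m)
        b.add(m)
    return assignment

# Example: the star around vertex 0 stays on one machine; the unrelated
# edge (4, 5) lands on the less-loaded machine.
print(greedy_place([(0, 1), (0, 2), (0, 3), (4, 5)], num_machines=2))
```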
PowerGraph offers both synchronous and asynchronous computation.
The synchronous execution process is:
- Exchange-message phase: the master receives messages from its mirrors.
- Receive-message phase: the master receives the messages sent in the previous Scatter round and those forwarded by mirrors; if a message activates the vertex, the master notifies its mirrors of the activation and synchronizes the vertex_program to them.
- Gather phase: master and mirrors run gather in parallel (multi-threaded) over their local graphs, and each mirror sends its partial gather result to the master.
- Apply phase: the master executes apply and synchronizes the result to its mirrors.
- Scatter phase: master and mirrors update the edge data based on the new vertex data and notify neighboring vertices via signals.
The asynchronous execution process is:
- At the start of each round, the master obtains a message from the global scheduler; it then acquires a lock and enters the Locking state, and notifies its mirrors to acquire their locks and enter the Locking state as well.
- Master and mirrors perform the Gather operation separately; each mirror reports its gather result to the master, which completes the aggregation.
- After the master finishes Apply, it synchronizes the result to its mirrors.
- Master and mirrors execute Scatter independently, then release their locks and enter the None state, waiting for a new task.
- A mirror may receive another locking request from the master while it is still in the Scattering state; in that case, the mirror does not release the lock after finishing Scatter and proceeds directly to the next round of tasks.
PowerSwitch
PowerSwitch is a hybrid execution system that divides execution into a series of time intervals and periodically evaluates the potential benefit of switching modes for the next interval. It monitors the throughput of the mode in use and predicts the throughput of the other mode; in each interval it schedules and updates vertices in either synchronous or asynchronous mode.
The motivation is that the authors' performance tests of different graph algorithms showed that different phases of a single graph algorithm suit different execution modes: some phases suit synchronous mode (for example, when the number of active vertices is high), and others suit asynchronous mode. Existing systems, however, use only one execution mode throughout.
Synchronous mode means there is synchronization control between two adjacent iterations: the next iteration can begin only after all tasks have completed the current one (a Barrier sits between adjacent iterations), and a message sent to the k-th iteration becomes visible to vertices only in the k-th iteration. In the synchronous implementation, two bitmaps identify the active vertices of the current and the next iteration, respectively. A bitmap entry is updated when a message activates a vertex. At the end of the iteration, once all messages have been processed, the two bitmaps in each Worker are flipped (a sketch follows).
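A tiny sketch of this double-bitmap bookkeeping; the bytearray representation and the function names are assumptions of this illustration.

```python
n = 8                      # number of vertices (toy size)
current = bytearray(n)     # 1 = active in the current superstep
nxt = bytearray(n)         # 1 = activated for the next superstep

def on_message(dst):
    nxt[dst] = 1           # an incoming message activates its destination

def barrier_flip():
    # At the barrier all messages have been processed: the "next" bitmap
    # becomes the "current" one, and a fresh "next" bitmap is started.
    global current, nxt
    current, nxt = nxt, bytearray(n)
```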
The main advantage of synchronous mode is batched message sending, which greatly improves network utilization. Because messages are sent in batches, synchronous mode suits algorithms with high message traffic (I/O-bound) and lightweight per-vertex computation.
The disadvantages of synchronous mode are:
- In most graph algorithms, convergence under synchronous iteration is asymmetric: most vertices converge quickly after a few iterations, but some converge very slowly and require many more rounds of iterative computation.
- The synchronous model does not work for some graph algorithms. Graph coloring, for instance, aims to assign different colors to adjacent vertices using as few colors as possible; in a greedy implementation where all vertices pick the minimum available color simultaneously, adjacent vertices see the same previous colors and keep picking the same color back and forth, never converging.
Asynchronous mode means there is no synchronization control between adjacent iterations: each task iterates independently, without waiting for the others. When a message arrives, the destination vertex can immediately start computing and broadcast its new result, without waiting for all vertices to receive all their message data. In asynchronous mode, a global priority queue (e.g., FIFO) schedules the active vertices, and each Worker thread keeps a local pending queue of stalled active vertices that are waiting for responses from neighboring vertices (a sketch follows).
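A sketch of this scheduling structure; the queue layout and the vertex-program signature are assumptions of this illustration.

```python
import queue
import threading

active = queue.Queue()      # global FIFO scheduler of active vertices
stop = threading.Event()    # set by the engine to shut workers down

def worker_loop(vertex_program, graph):
    # Each worker drains the global queue; a vertex is processed as soon
    # as it is fetched, with no barrier, and any vertices it activates
    # become visible to all workers immediately.
    while not stop.is_set():
        try:
            v = active.get(timeout=0.1)
        except queue.Empty:
            continue
        for u in vertex_program(v, graph):  # returns newly activated vertices
            active.put(u)
        active.task_done()
```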
The advantages of asynchronous mode are:
- It can accelerate the convergence of the program, and it suits CPU-bound algorithms well.
- Some graph processing algorithms (such as the graph coloring example above) only work in asynchronous mode.
The disadvantages of asynchronous mode are:
- Vertex lock contention overhead. For adjacent vertices, asynchronous mode avoids data races by serializing the Gather, Apply, and Scatter phases, which means a vertex program may be locked and must wait for a neighboring vertex's phase to finish executing.
- Messages are sent one at a time, so the CPU constantly encapsulates TCP/IP packets, wasting CPU cycles.
PowerSwitch switches modes as follows. When switching from synchronous to asynchronous mode, all active vertices in the next-iteration bitmap of the synchronous engine are imported into the global queue of the asynchronous mode. When switching from asynchronous to synchronous mode, PowerSwitch first stops all worker threads from fetching new vertices from the global queue, waits until the computations on the active vertices already being processed have completed, and then imports all active vertices remaining in the global queue into the current bitmap of the synchronous engine (a sketch follows).
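The two state migrations can be sketched as follows; the bitmap/queue data-structure choices are assumptions of this illustration.

```python
def sync_to_async(next_bitmap, global_queue):
    # Sync -> async: every vertex marked active for the next superstep
    # becomes an entry in the asynchronous engine's global queue.
    for v, is_active in enumerate(next_bitmap):
        if is_active:
            global_queue.put(v)

def async_to_sync(global_queue, current_bitmap):
    # Async -> sync: workers have already been stopped and in-flight
    # vertex computations drained; move the remaining queue entries into
    # the synchronous engine's current bitmap.
    while not global_queue.empty():
        current_bitmap[global_queue.get()] = 1
```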
GraphHP
GraphHP (Graph Hybrid Processing) is a hybrid-model execution platform designed to optimize the inherent waiting and communication costs of synchronous iteration. It reduces synchronization waiting and communication costs by adding a series of pseudo-supersteps, based on local iteration, inside each global iteration.
While existing platforms (e.g., Pregel, Giraph, Hama) achieve high scalability, the high-frequency synchronization and communication between hosts severely limits the efficiency of parallel computation. To address this critical problem, GraphHP retains the vertex-centric BSP programming interface while significantly reducing the synchronization and communication load. By building a hybrid execution model within and across graph partitions, GraphHP implements pseudo-superstep iterative computation, decoupling intra-partition computation from distributed synchronization and communication. This hybrid execution model reduces synchronization and communication load without heavyweight scheduling algorithms or graph-centric serial algorithms.
GraphHP's hybrid computation model builds on classic BSP and improves performance with an asynchronous message-communication mechanism. The model consists of a series of global iterations, each with three phases, "computation-communication-synchronization", where the computation is split into global computation and local computation. The local computation consists of a series of consecutive internal iterations, which may run asynchronously. When the local computation finishes, the messages produced for boundary vertices during the global and local computation phases are transmitted over the network to other compute nodes; the hybrid execution engine then performs global communication and synchronization, and the next global iteration begins, until the algorithm terminates. Thus local computation requires no direct communication with other partitions, while boundary computation requires remote communication across partitions (a sketch follows).
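A sketch of one global iteration under this hybrid model; the message-list representation and the convergence-based exit from the pseudo-superstep loop are assumptions of this illustration.

```python
def global_iteration(partition, inbox, compute):
    # One global iteration: messages delivered at the last barrier seed a
    # series of local pseudo-supersteps; only messages that cross the
    # partition boundary are deferred to the global communication phase.
    boundary_out = []
    local_msgs = inbox                     # (vertex, message) pairs
    while local_msgs:                      # pseudo-superstep loop
        next_local = []
        for v, msg in local_msgs:
            for dst, out in compute(v, msg):
                if dst in partition:       # internal: handled locally
                    next_local.append((dst, out))
                else:                      # boundary: wait for the barrier
                    boundary_out.append((dst, out))
        local_msgs = next_local
    return boundary_out                    # sent over the network afterwards
```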
Figure 1(b) shows the hybrid execution model in the abstract. Like the standard BSP model, the hybrid model performs the same initialization iteration. In the first iteration (iteration 0), all vertices are active; the user assigns initial values and sends messages to neighbors. From iteration 1 onward, the global phase and the local phase are invoked repeatedly. In the global phase, each active boundary vertex takes the messages sent to it in the previous superstep as input and executes compute(), ensuring that every boundary vertex computes with the latest messages from its neighbors.
Classification
We classify the five systems surveyed according to the following three optimization objectives.
- Speeding up algorithm convergence and reducing the number of iterations. Existing research mainly moves from fine-grained, vertex-level computation to coarser-grained computation over paths and subgraphs, and from the synchronous computation mode to asynchronous and hybrid modes.
- Reducing the number of messages to lower network load. Researchers have proposed vertex-cut partitioning strategies and matching computation models, shared-memory-based communication, message merging techniques, and receiver-side scatter techniques.
- Eliminating the bucket effect and achieving load balance. Researchers have proposed various optimization schemes involving graph partitioning strategies, dynamic load migration, and scheduling models.
Name | Computational Granularity | Task Scheduling | Communication Method | Push/Pull | Graph Partitioning
---|---|---|---|---|---
Pregel | Vertex | Synchronous | Message-based | Push | Edge cut
GPS | Vertex | Synchronous | Message-based | Push | Edge cut
PowerGraph | Vertex | Synchronous/Asynchronous | Memory-based | Pull | Vertex cut
PowerSwitch | Vertex | Synchronous/Asynchronous | Memory-based | Pull | Vertex cut
GraphHP | Subgraph | Hybrid | Message-based | Push | Edge cut
Issues and Challenges
Having introduced the graph processing systems above, we now review the challenges that remain.
- Unstructured data. Graph data is usually irregular and unstructured: the degrees (numbers of incident edges) of the vertices vary widely. Take the power-law graphs common in practice as an example: vertex degrees follow a power-law distribution, with only a small number of high-degree vertices and a majority of low-degree ones. This makes data partitioning difficult and reduces parallelism in graph computation, since processing different parts of the graph takes unequal time and memory, leading to load imbalance and high data-transfer overhead.
- Poor data locality. In graph computation, accesses to a small number of vertices may be spread throughout the memory space, and such accesses are random and unpredictable. Because associated data cannot be guaranteed to be stored contiguously and read sequentially, the poor locality of graph data makes it hard to exploit CPU caches the way traditional data structures do, complicating the optimization of graph systems.
- High ratio of data transfer to computation. Compared with the computation itself, graph computing tasks move a lot of data, and the input/output data transfer incurs memory-access latency that becomes the performance bottleneck. This is one way graph computation differs from traditional compute-intensive problems.
- Complex data dependencies. Vertices and edges have varied connectivity relationships, and graph tasks depend heavily on them, so some vertices and edges may be accessed frequently during a run while others are accessed rarely or never. These complex dependencies make it impossible to accurately predict the access paths and access counts of a task, and the resulting computational structure limits parallelism.
Trends and Future Directions
- Support for new hardware. With the development of computer technology, hardware such as GPUs, multi-core processors, and solid-state drives has greatly changed the computing environment. Old computation modes, communication methods, and storage layouts can no longer exploit the performance of this new hardware, so building efficient, unified distributed graph processing systems for the new environment will be an important research direction.
- Avoiding repeated loading of data. Currently, most distributed graph processing systems start a new job for each batch task, which involves data loading, computation, and output of results. When multiple batch jobs are submitted at the same time, the system loads the same data repeatedly, which severely limits the number of jobs processed per unit time. In future research, designing a reasonable task scheduling strategy that reduces data loads while sharing intermediate results among jobs would greatly improve system throughput.
- Transactional queries. Graph processing systems currently split into two kinds: batch systems and transactional systems. This separation greatly inconveniences the management of graph data. Integrating the two into a system that supports both transactional queries and batch processing would greatly facilitate the use of graph data.