Lambda architecture and Kappa architecture

https://blog.csdn.net/hjw199089/article/details/84713095

 

Lambda architecture and kappa architecture.

From

Lambda Architecture

Lambda architecture was originally proposed by the creator of Apache Storm, Nathan Marz. In his book, Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), he proposed a pipeline architecture that aims to reduce the com‐ plexity seen in real-time analytics pipelines by constraining any incremental compu‐ tation to only a small portion of this architecture.
In lambda architecture, there are two paths for data to flow in the pipeline
• A “hot” path where latency-sensitive data (e.g., the results need to be ready in
seconds or less) flows for rapid consumption by analytics clients
• A “cold” path where all data goes and is processed in batches that can tolerate greater latencies (e.g., the results can take minutes or even hours) until results are ready
在这里插入图片描述
When data flows into the “cold” path, this data is immutable. Any changes to the value of particular datum are reflected by a new, timestamped datum being stored in the system alongside any previous values. This approach enables the system to re- compute the then-current value of a particular datum for any point in time across the history of the data collected. Because the “cold” path can tolerate a greater latency until the results are ready, the computation can afford to run across large data sets,and the types of calculation performed can be time-intensive. The objective of the “cold” path can be summarized as: take the time you need, but make the results extremely accurate.

When data flows into the “hot” path, this data is mutable and can be updated in place. In addition, the hot path places a latency constraint on the data (as the results are typ‐ ically desired in near–real time). The impact of this latency constraint is that the types of calculations that can be performed are limited to those that can happen quickly enough. This might mean switching from an algorithm that provides perfect accuracy to one that provides an approximation. An example of this involves counting the number of distinct items in a data set (e.g., the number of visitors to your website): you can either count each individual datum (which can be very high latency if the volume is high) or you can approximate the count using algorithms like HyperLogLog. The objective of the hot path can be summarized as: trade off some amount of accuracy in the results in order to ensure that the data is ready as quickly as possible.

lambda architecture captures all data entering the pipeline into immutable storage, labeled “Master Data” in the diagram. is data is processed by the batch layer and output to a serving layer in the form of batch views. Latency-sensitive calcula‐ tions are applied on the input data by the speed layer and exposed as real-time views. Analytics clients can consume the data from either the speed layer views or the serving layer views depending on the time frame of the data required. In some implementations, the serving layer can host both the real-time views and the batch views.

The hot and cold paths ultimately converge at the analytics client application. The cli‐ ent must choose the path from which it acquires the result. It can choose to use the less accurate but most up-to-date result from the hot path, or it can use the less timely but more accurate result from the cold path. An important component of this deci‐ sion relates to the window of time for which only the hot path has a result, as the cold path has not yet computed the result. Looking at this another way, the hot path has results for only a small window of time, and its results will ultimately be updated by the more accurate cold path in time. This has the effect of minimizing the volume of data that components of the hot path have to deal with.

Kappa Architecture

Kappa architecture surfaced in response to a desire to simplify the lambda architecture dramatically by making a single change: eliminate the cold path and make all processing happen in a near–real-time streaming mode (Figure 1-3). Recomputation on the data can still occur when needed; it is in effect streamed through the kappa pipeline again. The kappa architecture was proposed by Jay Kreps based on his expe‐ riences at LinkedIn, and particularly his frustrations in dealing with the problem of “code sharing” in lambda architectures—that is, keeping in sync the logic that does the computation in the hot path with the logic that is doing the same calculation in the cold path.
在这里插入图片描述
In the kappa architecture, analytics clients get their data only from the speed layer, as all computation happens upon streaming data. Input events can be mirrored to long-term storage to enable recomputation on historical data should the need arise.

Kappa architecture centers on a unified log (think of it as a highly scalable queue), which ingests all data (which are considered events in this architecture). There is a single deployment of this log in the architecture, whereby each event datum collected is immutable, the events are ordered, and the current state of an event is changed only by a new event being appended.the unified log itself is designed to be distributed and fault tolerant, suitable to its place at that heart of the analytics topology. All processing of events is performed on the input streams and persisted as a real-time view (just as in the hot path of the lambda architecture). To support the human-fault-tolerant aspects, the data ingested from the unified log is typically persisted to a scalable, fault-tolerant persistent stor‐ age so that it can be recomputed even if the data has “aged out” of the unified log.

Moving Beyond Lambda Kappa Architectures with Apache Kudu

normal real-time data flow
在这里插入图片描述

solution with kudu
在这里插入图片描述

posted on 2020-04-05 21:37  大大的橙子  阅读(395)  评论(0编辑  收藏  举报

导航