The Log

Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

What every software engineer should know about real-time data's unifying abstraction

I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems. This has been an interesting experience: we built, deployed, and run to this day a distributed graph database, a distributed search backend, a Hadoop installation, and a first and second generation key-value store.

One of the most useful things I learned in all this was that many of the things we were building had a very simple concept at their heart: the log. Sometimes called write-ahead logs or commit logs or transaction logs, logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.

You can't fully understand databases, NoSQL stores, key value stores, replication, paxos, hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them. I'd like to change that. In this post, I'll walk you through everything you need to know about logs, including what a log is and how to use logs for data integration, real time processing, and system building.

Part One: What Is a Log?

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. It looks like this:

Records are appended to the end of the log, and reads proceed left-to-right. Each entry is assigned a unique sequential log entry number.

The ordering of records defines a notion of "time" since entries to the left are defined to be older than entries to the right. The log entry number can be thought of as the "timestamp" of the entry. Describing this ordering as a notion of time seems a bit odd at first, but it has the convenient property that it is decoupled from any particular physical clock. This property will turn out to be essential as we get to distributed systems.
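
To make the abstraction concrete, here is a minimal sketch of such a log in Python: an append-only sequence where each record gets a monotonically increasing entry number that serves as its logical timestamp. This is toy illustration code, not a real storage engine.

```python
class Log:
    """An append-only, totally-ordered sequence of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end; return its unique entry number."""
        self._records.append(record)
        return len(self._records) - 1  # acts as the entry's "timestamp"

    def read(self, from_offset=0):
        """Read records in order, starting from a given entry number."""
        return self._records[from_offset:]


log = Log()
log.append({"key": "a", "value": 1})            # entry 0
offset = log.append({"key": "b", "value": 2})   # entry 1
assert log.read(offset) == [{"key": "b", "value": 2}]
```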

The contents and format of the records aren't important for the purposes of this discussion. Also, we can't just keep adding records to the log as we'll eventually run out of space. I'll come back to this in a bit.

So, a log is not all that different from a file or a table. A file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time.

At this point you might be wondering why it is worth talking about something so simple? How is an append-only sequence of records in any way related to data systems? The answer is that logs have a specific purpose: they record what happened and when. For distributed data systems this is, in many ways, the very heart of the problem.

But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerative form of the log concept I am describing. The biggest difference is that text logs are meant to be primarily for humans to read and the "journal" or "data logs" I'm describing are built for programmatic access.

(Actually, if you think about it, the idea of humans reading through logs on individual machines is something of an anachronism. This approach quickly becomes an unmanageable strategy when many services and servers are involved and the purpose of logs quickly becomes an input to queries and graphs to understand behavior across many machines—something for which english text in files is not nearly as appropriate as the kind of structured log described here.)

Logs in databases

I don't know where the log concept originated—probably it is one of those things like binary search that is too simple for the inventor to realize it was an invention. It is present as early as IBM's System R. The usage in databases has to do with keeping in sync the variety of data structures and indexes in the presence of crashes. To make this atomic and durable, a database uses a log to write out information about the records it will be modifying, before applying the changes to all the various data structures it maintains. The log is the record of what happened, and each table or index is a projection of this history into some useful data structure or index. Since the log is immediately persisted it is used as the authoritative source in restoring all other persistent structures in the event of a crash.

Over time the usage of the log grew from an implementation detail of ACID to a method for replicating data between databases. It turns out that the sequence of changes that happened on the database is exactly what is needed to keep a remote replica database in sync. Oracle, MySQL, and PostgreSQL include log shipping protocols to transmit portions of the log to replica databases which act as slaves. Oracle has productized the log as a general data subscription mechanism for non-Oracle data subscribers with its XStreams and GoldenGate, and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine readable log has largely been confined to database internals. The use of logs as a mechanism for data subscription seems to have arisen almost by chance. But this very abstraction is ideal for supporting all kinds of messaging, data flow, and real-time data processing.

Logs in distributed systems

The two problems a log solves—ordering changes and distributing data—are even more important in distributed data systems. Agreeing upon an ordering for updates (or agreeing to disagree and coping with the side-effects) is among the core design problems for these systems.

The log-centric approach to distributed systems arises from a simple observation that I will call the State Machine Replication Principle:

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

This may seem a bit obtuse, so let's dive in and understand what it means.

Deterministic means that the processing isn't timing dependent and doesn't let any other "out of band" input influence its results. For example a program whose output is influenced by the particular order of execution of threads or by a call to gettimeofday or some other non-repeatable thing is generally best considered as non-deterministic.

The state of the process is whatever data remains on the machine, either in memory or on disk, at the end of the processing.

The bit about getting the same input in the same order should ring a bell—that is where the log comes in. This is a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output.

The application to distributed computing is pretty obvious. You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log to feed these processes input. The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.
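
Here is a small sketch of the principle, assuming a log represented as a plain Python list: two replicas apply the same deterministic transition function to the same entries in the same order, and necessarily end in the same state.

```python
def apply_entry(state, entry):
    """A deterministic transition: no clocks, randomness, or thread timing."""
    key, delta = entry
    state[key] = state.get(key, 0) + delta
    return state


log = [("x", 1), ("y", 5), ("x", 2)]  # the shared, ordered input

replica_a, replica_b = {}, {}
for entry in log:
    apply_entry(replica_a, entry)
for entry in log:
    apply_entry(replica_b, entry)

# Same initial state + same inputs in the same order = same final state.
assert replica_a == replica_b == {"x": 3, "y": 5}
```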

When you understand it, there is nothing complicated or deep about this principle: it more or less amounts to saying "deterministic processing is deterministic". Nonetheless, I think it is one of the more general tools for distributed systems design.

One of the beautiful things about this approach is that the time stamps that index the log now act as the clock for the state of the replicas—you can describe each replica by a single number, the timestamp for the maximum log entry it has processed. This timestamp combined with the log uniquely captures the entire state of the replica.

There are a multitude of ways of applying this principle in systems depending on what is put in the log. For example, we can log the incoming requests to a service, or the state changes the service undergoes in response to request, or the transformation commands it executes. Theoretically, we could even log a series of machine instructions for each replica to execute or the method name and arguments to invoke on each replica. As long as two processes process these inputs in the same way, the processes will remain consistent across replicas.

Different groups of people seem to describe the uses of logs differently. Database people generally differentiate between physical and logical logging. Physical logging means logging the contents of each row that is changed. Logical logging means logging not the changed rows but the SQL commands that lead to the row changes (the insert, update, and delete statements).
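
As a hedged illustration of that distinction (the table, row, and statement here are invented for the example), the same row change could be logged either way:

```python
# Physical logging: record the contents of the changed row itself.
physical_entry = {"table": "accounts", "row": {"id": 7, "balance": 125}}

# Logical logging: record the command that caused the change.
logical_entry = "UPDATE accounts SET balance = 125 WHERE id = 7"
```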

The distributed systems literature commonly distinguishes two broad approaches to processing and replication. The "state machine model" usually refers to an active-active model where we keep a log of the incoming requests and each replica processes each request. A slight modification of this, called the "primary-backup model", is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests. The other replicas apply in order the state changes the leader makes so that they will be in sync and ready to take over as leader should the leader fail.

To understand the difference between these two approaches, let's look at a toy problem. Consider a replicated "arithmetic service" which maintains a single number as its state (initialized to zero) and applies additions and multiplications to this value. The active-active approach might log out the transformations to apply, say "+1", "*2", etc. Each replica would apply these transformations and hence go through the same set of values. The "active-passive" approach would have a single master execute the transformations and log out the result, say "1", "3", "6", etc. This example also makes it clear why ordering is key for ensuring consistency between replicas: reordering an addition and multiplication will yield a different result.
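
Here is a sketch of that arithmetic service under both models (using my own operation sequence, not the article's figures): the active-active log carries the transformations themselves, while the primary-backup log carries only the resulting states.

```python
def apply_ops(ops, state=0):
    """Deterministically apply additions and multiplications in order."""
    for op, arg in ops:
        state = state + arg if op == "+" else state * arg
    return state


# "State machine model" (active-active): log the transformations themselves.
transform_log = [("+", 1), ("*", 2), ("+", 4)]
# Every replica applies the same ops in the same order and agrees on 6.
assert apply_ops(transform_log) == 6

# "Primary-backup model" (active-passive): the leader executes the ops and
# logs only the resulting states; followers copy results, not operations.
result_log, state = [], 0
for op in transform_log:
    state = apply_ops([op], state)
    result_log.append(state)
assert result_log == [1, 2, 6]

# Ordering is what keeps replicas consistent: reorder and the answer changes.
assert apply_ops([("*", 2), ("+", 1), ("+", 4)]) != 6
```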

The distributed log can be seen as the data structure which models the problem of consensus. A log, after all, represents a series of decisions on the "next" value to append. You have to squint a little to see a log in the Paxos family of algorithms, though log-building is their most common practical application. With Paxos, this is usually done using an extension of the protocol called "multi-paxos", which models the log as a series of consensus problems, one for each slot in the log. The log is much more prominent in other protocols such as ZAB, RAFT, and Viewstamped Replication, which directly model the problem of maintaining a distributed, consistent log.

My suspicion is that our view of this is a little bit biased by the path of history, perhaps due to the few decades in which the theory of distributed computing outpaced its practical application. In reality, the consensus problem is a bit too simple. Computer systems rarely need to decide a single value, they almost always handle a sequence of requests. So a log, rather than a simple single-value register, is the more natural abstraction.

Furthermore, the focus on the algorithms obscures the underlying log abstraction systems need. I suspect we will end up focusing more on the log as a commoditized building block irrespective of its implementation in the same way we often talk about a hash table without bothering to get in the details of whether we mean the murmur hash with linear probing or some other variant. The log will become something of a commoditized interface, with many algorithms and implementations competing to provide the best guarantees and optimal performance.

Changelog 101: Tables and Events are Dual

Let's come back to databases for a bit. There is a fascinating duality between a log of changes and a table. The log is similar to the list of all credits and debits and bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables. (And yes, table can mean keyed data store for the non-relational folks.)

This process works in reverse too: if you have a table taking updates, you can record these changes and publish a "changelog" of all the updates to the state of the table. This changelog is exactly what you need to support near-real-time replicas. So in this sense you can see tables and events as dual: tables support data at rest and logs capture change. The magic of the log is that if it is a complete log of changes, it holds not only the contents of the final version of the table, but also allows recreating all other versions that might have existed. It is, effectively, a sort of backup of every previous state of the table.
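
A sketch of the duality, assuming a changelog of (key, value) updates: folding the full log yields the current table, and folding a prefix of it reconstructs any earlier version.

```python
changelog = [
    ("alice", 100),  # updates arrive in order
    ("bob", 50),
    ("alice", 75),   # alice's value changes again
]


def materialize(log, upto=None):
    """Fold a changelog into a table of the latest value per key."""
    table = {}
    for key, value in log[:upto]:
        table[key] = value
    return table


assert materialize(changelog) == {"alice": 75, "bob": 50}
# The complete log also holds every previous version of the table:
assert materialize(changelog, upto=2) == {"alice": 100, "bob": 50}
```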

This might remind you of source code version control. There is a close relationship between source control and databases. Version control solves a very similar problem to what distributed data systems have to solve—managing distributed, concurrent changes in state. A version control system usually models the sequence of patches, which is in effect a log. You interact directly with a checked out "snapshot" of the current code which is analogous to the table. You will note that in version control systems, as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.

Some people have seen some of these ideas recently from Datomic, a company selling a log-centric database. This presentation gives a great overview of how they have applied the idea in their system. These ideas are not unique to this system, of course, as they have been a part of the distributed systems and database literature for well over a decade.

This may all seem a little theoretical. Do not despair! We'll get to practical stuff pretty quickly.

What's next

In the remainder of this article I will try to give a flavor of what a log is good for that goes beyond the internals of distributed computing or abstract distributed computing models. This includes:

  1. Data Integration—Making all of an organization's data easily available in all its storage and processing systems.
  2. Real-time data processing—Computing derived data streams.
  3. Distributed system design—How practical systems can be simplified with a log-centric design.

These uses all revolve around the idea of a log as a stand-alone service.

In each case, the usefulness of the log comes from the simple function that the log provides: producing a persistent, re-playable record of history. Surprisingly, at the core of these problems is the ability to have many machines play back history at their own rate in a deterministic manner.

Part Two: Data Integration

Let me first say what I mean by "data integration" and why I think it's important, then we'll see how it relates back to logs.

Data integration is making all the data an organization has available in all its services and systems.

This phrase "data integration" isn't all that common, but I don't know a better one. The more recognizable term ETL usually covers only a limited part of data integration—populating a relational data warehouse. But much of what I am describing can be thought of as ETL generalized to cover real-time systems and processing flows.

You don't hear much about data integration in all the breathless interest and hype around the idea of big data, but nonetheless, I believe this mundane problem of "making the data available" is one of the more valuable things an organization can focus on.

Effective use of data follows a kind of Maslow's hierarchy of needs. The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts). This data needs to be modeled in a uniform way to make it easy to read and process. Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc.

It's worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater. Once data and processing are available, one can move concern on to more refined problems of good data models and consistent well understood semantics. Finally, concentration can shift to more sophisticated processing—better visualization, reporting, and algorithmic processing and prediction.

In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques. This is completely backwards.

So the question is, how can we build reliable data flow throughout all the data systems in an organization?

Data Integration: Two complications

Two trends make data integration harder.

The event data firehose

The first trend is the rise of event data. Event data records things that happen rather than things that are. In web systems, this means user activity logging, but also the machine-level events and statistics required to reliably operate and monitor a data center's worth of machines. People tend to call this "log data" since it is often written to application logs, but that confuses form with function. This data is at the heart of the modern web: Google's fortune, after all, is generated by a relevance pipeline built on clicks and impressions—that is, events.

And this stuff isn't limited to web companies, it's just that web companies are already fully digital, so they are easier to instrument. Financial data has long been event-centric. RFID adds this kind of tracking to physical objects. I think this trend will continue with the digitization of traditional businesses and activities.

This type of event data records what happened, and tends to be several orders of magnitude larger than traditional database uses. This presents significant challenges for processing.

The explosion of specialized data systems

The second trend comes from the explosion of specialized data systems that have become popular and often freely available in the last five years. Specialized systems exist for OLAP, search, simple online storage, batch processing, graph analysis, and so on.

The combination of more data of more varieties and a desire to get this data into more systems leads to a huge data integration problem.

Log-structured data flow

The log is the natural data structure for handling data flow between systems. The recipe is very simple:

Take all the organization's data and put it into a central log for real-time subscription.

Each logical data source can be modeled as its own log. A data source could be an application that logs out events (say clicks or page views), or a database table that accepts modifications. Each subscribing system reads from this log as quickly as it can, applies each new record to its own store, and advances its position in the log. Subscribers could be any kind of data system—a cache, Hadoop, another database in another site, a search system, etc.
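
A sketch of such a subscriber, reusing the toy `Log` class from Part One; the `KeyValueStore` here is a stand-in for whatever system is consuming (a cache, an index, Hadoop, and so on).

```python
class KeyValueStore:
    """A stand-in destination system that applies records to local state."""

    def __init__(self):
        self.data = {}

    def apply(self, record):
        self.data[record["key"]] = record["value"]


class Subscriber:
    def __init__(self, log, store):
        self.log = log
        self.store = store
        self.position = 0  # the "point in time" this system has read up to

    def poll(self):
        """Consume new records at our own pace and advance our position."""
        for record in self.log.read(self.position):
            self.store.apply(record)
            self.position += 1  # advance only after a successful apply
```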

For example, the log concept gives a logical clock for each change against which all subscribers can be measured. This makes reasoning about the state of the different subscriber systems with respect to one another far simpler, as each has a "point in time" they have read up to.

To make this more concrete, consider a simple case where there is a database and a collection of caching servers. The log provides a way to synchronize the updates to all these systems and reason about the point of time of each of these systems. Let's say we write a record with log entry X and then need to do a read from the cache. If we want to guarantee we don't see stale data, we just need to ensure we don't read from any cache which has not replicated up to X.
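
A minimal sketch of that guarantee, assuming each cache exposes the log position it has replicated up to (the attribute names are illustrative):

```python
def read_fresh(key, write_position, caches):
    """Serve a read only from a cache that has caught up to our write."""
    for cache in caches:
        if cache.position >= write_position:  # replicated up to entry X
            return cache.get(key)
    raise RuntimeError("no cache has replicated up to the write yet")
```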

The log also acts as a buffer that makes data production asynchronous from data consumption. This is important for a lot of reasons, but particularly when there are multiple subscribers that may consume at different rates. This means a subscribing system can crash or go down for maintenance and catch up when it comes back: the subscriber consumes at a pace it controls. A batch system such as Hadoop or a data warehouse may consume only hourly or daily, whereas a real-time query system may need to be up-to-the-second. Neither the originating data source nor the log has knowledge of the various data destination systems, so consumer systems can be added and removed with no change in the pipeline.

"Each working data pipeline is designed like a log; each broken data pipeline is broken in its own way."—Count Leo Tolstoy (translation by the author) ”每个工作数据管道的设计都像一个日志; 每个损坏的数据管道都以自己的方式被破坏。“ー列夫 · 托尔斯泰伯爵(作者译)

Of particular importance: the destination system only knows about the log and not any details of the system of origin. The consumer system need not concern itself with whether the data came from an RDBMS, a new-fangled key-value store, or was generated without a real-time query system of any kind. This seems like a minor point, but is in fact critical.

I use the term "log" here instead of "messaging system" or "pub sub" because it is a lot more specific about semantics and a much closer description of what you need in a practical implementation to support data replication. I have found that "publish subscribe" doesn't imply much more than indirect addressing of messages—if you compare any two messaging systems promising publish-subscribe, you find that they guarantee very different things, and most models are not useful in this domain. You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics. In distributed systems, this model of communication sometimes goes by the (somewhat terrible) name of atomic broadcast.

It's worth emphasizing that the log is still just the infrastructure. That isn't the end of the story of mastering data flow: the rest of the story is around metadata, schemas, compatibility, and all the details of handling data structure and evolution. But until there is a reliable, general way of handling the mechanics of data flow, the semantic details are secondary.

At LinkedIn

I got to watch this data integration problem emerge in fast-forward as LinkedIn moved from a centralized relational database to a collection of distributed systems.

These days our major data systems include:

Each of these is a specialized distributed system that provides advanced functionality in its area of specialty.

This idea of using logs for data flow has been floating around LinkedIn since even before I got here. One of the earliest pieces of infrastructure we developed was a service called databus that provided a log caching abstraction on top of our early Oracle tables to scale subscription to database changes so we could feed our social graph and search indexes.

I'll give a little bit of the history to provide context. My own involvement in this started around 2008 after we had shipped our key-value store. My next project was to try to get a working Hadoop setup going, and move some of our recommendation processes there. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms. So began a long slog.

We originally planned to just scrape the data out of our existing Oracle data warehouse. The first discovery was that getting data out of Oracle quickly is something of a dark art. Worse, the data warehouse processing was not appropriate for the production batch processing we planned for Hadoop—much of the processing was non-reversible and specific to the reporting being done. We ended up avoiding the data warehouse and going directly to source databases and log files. Finally, we implemented another pipeline to load data into our key-value store for serving results.

This mundane data copying ended up being one of the dominant items for the original development. Worse, any time there was a problem in any of the pipelines, the Hadoop system was largely useless—running fancy algorithms on bad data just produces more bad data.

Although we had built things in a fairly generic way, each new data source required custom configuration to set up. It also proved to be the source of a huge number of errors and failures. The site features we had implemented on Hadoop became popular and we found ourselves with a long list of interested engineers. Each user had a list of systems they wanted integration with and a long list of new data feeds they wanted.

ETL in Ancient Greece. Not much has changed.

A few things slowly became clear to me.

First, the pipelines we had built, though a bit of a mess, were actually extremely valuable. Just the process of making data available in a new processing system (Hadoop) unlocked a lot of possibilities. New computation was possible on the data that would have been hard to do before. Many new products and analysis just came from putting together multiple pieces of data that had previously been locked up in specialized systems.

Second, it was clear that reliable data loads would require deep support from the data pipeline. If we captured all the structure we needed, we could make Hadoop data loads fully automatic, so that no manual effort was expended adding new data sources or handling schema changes—data would just magically appear in HDFS and Hive tables would automatically be generated for new data sources with the appropriate columns.

Third, we still had very low data coverage. That is, if you looked at the overall percentage of the data LinkedIn had that was available in Hadoop, it was still very incomplete. And getting to completion was not going to be easy given the amount of effort required to operationalize each new data source.

The way we had been proceeding, building out custom data loads for each data source and destination, was clearly infeasible. We had dozens of data systems and data repositories. Connecting all of these would have led to building custom piping between each pair of systems something like this:

Note that data often flows in both directions, as many systems (databases, Hadoop) are both sources and destinations for data transfer. This meant we would end up building two pipelines per system: one to get data in and one to get data out.

This clearly would take an army of people to build and would never be operable. As we approached full connectivity we would end up with something like O(N²) pipelines.

Instead, we needed something generic like this:

As much as possible, we needed to isolate each consumer from the source of the data. They should ideally integrate with just a single data repository that would give them access to everything.

The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of each consumer of data.

This experience led me to focus on building Kafka to combine what we had seen in messaging systems with the log concept popular in databases and distributed system internals. We wanted something to act as a central pipeline first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, etc.

For a long time, Kafka was a little unique (some would say odd) as an infrastructure product—neither a database nor a log file collection system nor a traditional messaging system. But recently Amazon has offered a service that is very very similar to Kafka called Kinesis. The similarity goes right down to the way partitioning is handled, data is retained, and the fairly odd split in the Kafka API between high- and low-level consumers. I was pretty happy about this. A sign you've created a good infrastructure abstraction is that AWS offers it as a service! Their vision for this seems to be exactly similar to what I am describing: it is the piping that connects all their distributed systems—DynamoDB, RedShift, S3, etc.—as well as the basis for distributed stream processing using EC2.

Relationship to ETL and the Data Warehouse

Let's talk data warehousing for a bit. The data warehouse is meant to be a repository of the clean, integrated data structured to support analysis. This is a great idea. For those not in the know, the data warehousing methodology involves periodically extracting data from source databases, munging it into some kind of understandable form, and loading it into a central data warehouse. Having this central location that contains a clean copy of all your data is a hugely valuable asset for data-intensive analysis and processing. At a high level, this methodology doesn't change too much whether you use a traditional data warehouse like Oracle or Teradata or Hadoop, though you might switch up the order of loading and munging.

A data warehouse containing clean, integrated data is a phenomenal asset, but the mechanics of getting this are a bit out of date.

The key problem for a data-centric organization is coupling the clean integrated data to the data warehouse. A data warehouse is a piece of batch query infrastructure which is well suited to many kinds of reporting and ad hoc analysis, particularly when the queries involve simple counting, aggregation, and filtering. But having a batch system be the only repository of clean complete data means the data is unavailable for systems requiring a real-time feed—real-time processing, search indexing, monitoring systems, etc.

In my view, ETL is really two things. First, it is an extraction and data cleanup process—essentially liberating data locked up in a variety of systems in the organization and removing any system-specific non-sense. Secondly, that data is restructured for data warehousing queries (i.e. made to fit the type system of a relational DB, forced into a star or snowflake schema, perhaps broken up into a high performance column format, etc). Conflating these two things is a problem. The clean, integrated repository of data should be available in real-time as well for low-latency processing as well as indexing in other real-time storage systems.

I think this has the added benefit of making data warehousing ETL much more organizationally scalable. The classic problem of the data warehouse team is that they are responsible for collecting and cleaning all the data generated by every other team in the organization. The incentives are not aligned: data producers are often not very aware of the use of the data in the data warehouse and end up creating data that is hard to extract or requires heavy, hard to scale transformation to get into usable form. Of course, the central team never quite manages to scale to match the pace of the rest of the organization, so data coverage is always spotty, data flow is fragile, and changes are slow.

A better approach is to have a central pipeline, the log, with a well defined API for adding data. The responsibility of integrating with this pipeline and providing a clean, well-structured data feed lies with the producer of this data feed. This means that as part of their system design and implementation they must consider the problem of getting data out and into a well structured form for delivery to the central pipeline. The addition of new storage systems is of no consequence to the data warehouse team as they have a central point of integration. The data warehouse team handles only the simpler problem of loading structured feeds of data from the central log and carrying out transformation specific to their system.

This point about organizational scalability becomes particularly important when one considers adopting additional data systems beyond a traditional data warehouse. Say, for example, that one wishes to provide search capabilities over the complete data set of the organization. Or, say that one wants to provide sub-second monitoring of data streams with real-time trend graphs and alerting. In either of these cases, the infrastructure of the traditional data warehouse or even a Hadoop cluster is going to be inappropriate. Worse, the ETL processing pipeline built to support database loads is likely of no use for feeding these other systems, making bootstrapping these pieces of infrastructure as large an undertaking as adopting a data warehouse. This likely isn't feasible and probably helps explain why most organizations do not have these capabilities easily available for all their data. By contrast, if the organization had built out feeds of uniform, well-structured data, getting any new system full access to all data requires only a single bit of integration plumbing to attach to the pipeline.

This architecture also raises a set of different options for where a particular cleanup or transformation can reside:

  1. It can be done by the data producer prior to adding the data to the company wide log.
  2. It can be done as a real-time transformation on the log (which in turn produces a new, transformed log)
  3. It can be done as part of the load process into some destination data system

The best model is to have cleanup done prior to publishing the data to the log by the publisher of the data. This means ensuring the data is in a canonical form and doesn't retain any hold-overs from the particular code that produced it or the storage system in which it may have been maintained. These details are best handled by the team that creates the data since they know the most about their own data. Any logic applied in this stage should be lossless and reversible.

Any kind of value-added transformation that can be done in real-time should be done as post-processing on the raw log feed produced. This would include things like sessionization of event data, or the addition of other derived fields that are of general interest. The original log is still available, but this real-time processing produces a derived log containing augmented data.

Finally, only aggregation that is specific to the destination system should be performed as part of the loading process. This might include transforming data into a particular star or snowflake schema for analysis and reporting in a data warehouse. Because this stage, which most naturally maps to the traditional ETL process, is now done on a far cleaner and more uniform set of streams, it should be much simplified.

Log Files and Events

Let's talk a little bit about a side benefit of this architecture: it enables decoupled, event-driven systems.

The typical approach to activity data in the web industry is to log it out to text files where it can be scraped into a data warehouse or into Hadoop for aggregation and querying. The problem with this is the same as the problem with all batch ETL: it couples the data flow to the data warehouse's capabilities and processing schedule.

At LinkedIn, we have built our event data handling in a log-centric fashion. We are using Kafka as the central, multi-subscriber event log. We have defined several hundred event types, each capturing the unique attributes about a particular type of action. This covers everything from page views, ad impressions, and searches, to service invocations and application exceptions.

To understand the advantages of this, imagine a simple event—showing a job posting on the job page. The job page should contain only the logic required to display the job. However, in a fairly dynamic site, this could easily become larded up with additional logic unrelated to showing the job. For example let's say we need to integrate the following systems:

  1. We need to send this data to Hadoop and data warehouse for offline processing purposes
  2. We need to count the view to ensure that the viewer is not attempting some kind of content scraping
  3. We need to aggregate this view for display in the Job poster's analytics page
  4. We need to record the view to ensure we properly impression cap any job recommendations for that user (we don't want to show the same thing over and over)
  5. Our recommendation system may need to record the view to correctly track the popularity of that job
  6. Etc

Pretty soon, the simple act of displaying a job has become quite complex. And as we add other places where jobs are displayed—mobile applications, and so on—this logic must be carried over and the complexity increases. Worse, the systems that we need to interface with are now somewhat intertwined—the person working on displaying jobs needs to know about many other systems and features and make sure they are integrated properly. This is just a toy version of the problem, any real application would be more, not less, complex.

The "event-driven" style provides an approach to simplifying this. The job display page now just shows a job and records the fact that a job was shown along with the relevant attributes of the job, the viewer, and any other useful facts about the display of the job. Each of the other interested systems—the recommendation system, the security system, the job poster analytics system, and the data warehouse—all just subscribe to the feed and do their processing. The display code need not be aware of these other systems, and needn't be changed if a new data consumer is added.

Building a Scalable Log

Of course, separating publishers from subscribers is nothing new. But if you want to keep a commit log that acts as a multi-subscriber real-time journal of everything happening on a consumer-scale website, scalability will be a primary challenge. Using a log as a universal integration mechanism is never going to be more than an elegant fantasy if we can't build a log that is fast, cheap, and scalable enough to make this practical at scale.

Systems people typically think of a distributed log as a slow, heavy-weight abstraction (and usually associate it only with the kind of "metadata" uses for which Zookeeper might be appropriate). But with a thoughtful implementation focused on journaling large data streams, this need not be true. At LinkedIn we are currently running over 60 billion unique message writes through Kafka per day (several hundred billion if you count the writes from mirroring between datacenters).

We used a few tricks in Kafka to support this kind of scale:

  1. Partitioning the log
  2. Optimizing throughput by batching reads and writes
  3. Avoiding needless data copies

In order to allow horizontal scaling we chop up our log into partitions:

Each partition is a totally ordered log, but there is no global ordering between partitions (other than perhaps some wall-clock time you might include in your messages). The assignment of the messages to a particular partition is controllable by the writer, with most users choosing to partition by some kind of key (e.g. user id). Partitioning allows log appends to occur without co-ordination between shards and allows the throughput of the system to scale linearly with the Kafka cluster size.
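
A sketch of key-based partition assignment (a generic hash here, not Kafka's actual partitioner): all records with the same key land in the same partition, so their relative order is preserved.

```python
import hashlib


def partition_for(key, num_partitions):
    """Deterministically map a record key to one of N partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


# Every event for one user goes to one partition, keeping that user's
# events totally ordered; different users may interleave arbitrarily.
p = partition_for("user-42", 16)
assert p == partition_for("user-42", 16) and 0 <= p < 16
```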

Each partition is replicated across a configurable number of replicas, each of which has an identical copy of the partition's log. At any time, a single one of them will act as the leader; if the leader fails, one of the replicas will take over as leader.

Lack of a global order across partitions is a limitation, but we have not found it to be a major one. Indeed, interaction with the log typically comes from hundreds or thousands of distinct processes so it is not meaningful to talk about a total order over their behavior. Instead, the guarantees that we provide are that each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent.

A log, like a filesystem, is easy to optimize for linear read and write patterns. The log can group small reads and writes together into larger, high-throughput operations. Kafka pursues this optimization aggressively. Batching occurs from client to server when sending data, in writes to disk, in replication between servers, in data transfer to consumers, and in acknowledging committed data.
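
Here is a sketch of the batching idea on the write path (the `append_batch` method is a hypothetical bulk append, not a real Kafka API): many small writes are grouped into one large, linear operation.

```python
class BatchingWriter:
    """Accumulate records and flush them as one large sequential write."""

    def __init__(self, log, batch_size=100):
        self.log = log
        self.batch = []
        self.batch_size = batch_size

    def write(self, record):
        self.batch.append(record)
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.log.append_batch(self.batch)  # hypothetical bulk append
            self.batch = []
```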

Finally, Kafka uses a simple binary format that is maintained between in-memory log, on-disk log, and in network data transfers. This allows us to make use of numerous optimizations including zero-copy data transfer.

The cumulative effect of these optimizations is that you can usually write and read data at the rate supported by the disk or network, even while maintaining data sets that vastly exceed memory.

This write-up isn't meant to be primarily about Kafka so I won't go into further details. You can read a more detailed overview of LinkedIn's approach here and a thorough overview of Kafka's design here.

Part Three: Logs & Real-time Stream Processing

So far, I have only described what amounts to a fancy method of copying data from place-to-place. But shlepping bytes between storage systems is not the end of the story. It turns out that "log" is another word for "stream" and logs are at the heart of stream processing.

But, wait, what exactly is stream processing?

If you are a fan of late 90s and early 2000s database literature or semi-successful data infrastructure products, you likely associate stream processing with efforts to build a SQL engine or "boxes and arrows" interface for event driven processing.

If you follow the explosion of open source data systems, you likely associate stream processing with some of the systems in this space—for example, Storm, Akka, S4, and Samza. But most people see these as a kind of asynchronous message processing system not that different from a cluster-aware RPC layer (and in fact some things in this space are exactly that).

Both these views are a little limited. Stream processing has nothing to do with SQL. Nor is it limited to real-time processing. There is no inherent reason you can't process the stream of data from yesterday or a month ago using a variety of different languages to express the computation.

I see stream processing as something much broader: infrastructure for continuous data processing. I think the computational model can be as general as MapReduce or other distributed processing frameworks, but with the ability to produce low-latency results.

The real driver for the processing model is the method of data collection. Data which is collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.

The US census provides a good example of batch data collection. The census periodically kicks off and does a brute force discovery and enumeration of US citizens by having people walk door-to-door. This made a lot of sense in 1790 when the census first began. Data collection at the time was inherently batch oriented: it involved riding around on horseback and writing down records on paper, then transporting this batch of records to a central location where humans added up all the counts. These days, when you describe the census process, one immediately wonders why we don't keep a journal of births and deaths and produce population counts either continuously or at whatever granularity is needed.

This is an extreme example, but many data transfer processes still depend on taking periodic dumps and bulk transfer and integration. The only natural way to process a bulk dump is with a batch process. But as these processes are replaced with continuous feeds, one naturally starts to move towards continuous processing to smooth out the processing resources needed and reduce latency.

LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process—events happen in real-time, as Jack Bauer would tell us. When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.

Production "batch" processing jobs that run daily are often effectively mimicking a kind of continuous computation with a window size of one day. The underlying data is, of course, always changing. These were actually so common at LinkedIn (and the mechanics of making them work in Hadoop so tricky) that we implemented a whole framework for managing incremental Hadoop workflows.

Seen in this light, it is easy to have a different view of stream processing: it is just processing that includes a notion of time in the underlying data being processed, and that does not require a static snapshot of the data, so it can produce output at a user-controlled frequency instead of waiting for the "end" of the data set to be reached. In this sense, stream processing is a generalization of batch processing, and, given the prevalence of real-time data, a very important generalization.
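
One way to see this generalization: a daily batch count and a continuous count are the same computation run with different window sizes and output frequencies. A small sketch, with made-up record shapes, that parameterizes the window:

```python
from collections import defaultdict

# Sketch: the same counting computation, parameterized by window size.
# window_s = 86400 mimics a daily batch job; window_s = 60 gives
# minute-granularity "continuous" output. Records are (timestamp, key) pairs.

def windowed_counts(records, window_s):
    counts = defaultdict(int)
    current_window = None
    for ts, key in records:          # records arrive in log (time) order
        window = int(ts // window_s)
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)  # emit the closed window
            counts.clear()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)

events = [(1.0, "a"), (2.5, "b"), (61.2, "a"), (125.0, "a")]
for window, counts in windowed_counts(events, window_s=60):
    print(window, counts)
```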

So why has stream processing traditionally been seen as a niche application? I think the biggest reason is that a lack of real-time data collection made continuous processing something of an academic concern.

I think the lack of real-time data collection is likely what doomed the commercial stream-processing systems. Their customers were still doing file-oriented, daily batch processing for ETL and data integration. Companies building stream processing systems focused on providing processing engines to attach to real-time data streams, but it turned out that at the time very few people actually had real-time data streams. Actually, very early in my career at LinkedIn, a company tried to sell us a very cool stream processing system, but since all our data was collected in hourly files at that time, the best application we could come up with was to pipe the hourly files into the stream system at the end of the hour! They noted that this was a fairly common problem. The exception actually proves the rule here: finance, the one domain where stream processing has met with some success, was exactly the area where real-time data streams were already the norm and processing had become the bottleneck.

Even in the presence of a healthy batch processing ecosystem, I think the actual applicability of stream processing as an infrastructure style is quite broad. I think it covers the gap in infrastructure between real-time request/response services and offline batch processing. For modern internet companies, I think around 25% of their code falls into this category.

It turns out that the log solves some of the most critical technical problems in stream processing, which I'll describe, but the biggest problem that it solves is just making data available in real-time multi-subscriber data feeds. For those interested in more details, we have open sourced Samza, a stream processing system explicitly built on many of these ideas. We describe a lot of these applications in more detail in the documentation here.

Data flow graphs

img

The most interesting aspect of stream processing has nothing to do with the internals of a stream processing system, but instead has to do with how it extends our idea of what a data feed is from the earlier data integration discussion. We discussed primarily feeds or logs of primary data: the events and rows of data produced in the execution of various applications. But stream processing allows us to also include feeds computed off other feeds. These derived feeds look no different to consumers than the feeds of primary data from which they are computed. These derived feeds can encapsulate arbitrary complexity.

Let's dive into this a bit. A stream processing job, for our purposes, will be anything that reads from logs and writes output to logs or other systems. The logs they use for input and output join these processes into a graph of processing stages. Indeed, using a centralized log in this fashion, you can view all the organization's data capture, transformation, and flow as just a series of logs and processes that write to them.
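
A stream processing job, in this view, needs nothing more exotic than a loop that consumes from one log and appends to another; chaining such jobs by log name is what produces the graph of processing stages. A toy sketch with in-memory logs and hypothetical log names:

```python
# Sketch: a stream processing job is just "read from a log, write to a log".
# Wiring jobs together by log name yields a graph of processing stages.
# The log names here ("page-views", "view-counts") are made up.

logs = {"page-views": [], "view-counts": []}   # name -> append-only list

def append(log_name, record):
    logs[log_name].append(record)

def job(in_log, out_log, transform, cursor=0):
    # Consume the input log from our cursor onward, in log order.
    while cursor < len(logs[in_log]):
        for out in transform(logs[in_log][cursor]):  # zero or more outputs
            append(out_log, out)
        cursor += 1
    return cursor  # a real job would persist this offset and keep polling

append("page-views", {"user": "alice", "page": "/home"})
append("page-views", {"user": "bob", "page": "/home"})
job("page-views", "view-counts", lambda r: [{"page": r["page"], "n": 1}])
# A downstream job can consume "view-counts" exactly as if it were a
# primary feed; derived feeds look no different to consumers.
```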

A stream processor need not have a fancy framework at all: it can be any process or set of processes that read and write from logs, but additional infrastructure and support can be provided for helping manage processing code.

The purpose of the log in the integration is two-fold.

First, it makes each dataset multi-subscriber and ordered. Recall our "state replication" principle to remember the importance of order. To make this more concrete, consider a stream of updates from a database—if we re-order two updates to the same record in our processing we may produce the wrong final output. This order is more permanent than what is provided by something like TCP as it is not limited to a single point-to-point link and survives beyond process failures and reconnections.

Second, the log provides buffering to the processes. This is very fundamental. If processing proceeds in an unsynchronized fashion it is likely to happen that an upstream data producing job will produce data more quickly than another downstream job can consume it. When this occurs processing must block, buffer or drop data. Dropping data is likely not an option; blocking may cause the entire processing graph to grind to a halt. The log acts as a very, very large buffer that allows process to be restarted or fail without slowing down other parts of the processing graph. This isolation is particularly important when extending this data flow to a larger organization, where processing is happening by jobs made by many different teams. We cannot have one faulty job cause back-pressure that stops the entire processing flow.
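
Both properties fall out of the consumer-offset model: each subscriber tracks only its own position in the shared log, so a slow or restarted job simply lags at its offset without back-pressuring the producer or other consumers. A toy sketch:

```python
# Sketch: the log as a giant buffer with independent subscribers. Each
# consumer tracks only its own offset, so a slow consumer lags without
# slowing the producer or other consumers, and a restarted consumer
# resumes from its last committed offset.

log = [f"event-{i}" for i in range(10)]   # the producer keeps appending

offsets = {"fast-job": 0, "slow-job": 0}  # per-subscriber positions

def poll(consumer, max_records):
    start = offsets[consumer]
    batch = log[start:start + max_records]
    offsets[consumer] = start + len(batch)  # commit the new position
    return batch

poll("fast-job", 10)   # fast job is fully caught up
poll("slow-job", 2)    # slow job lags at offset 2; nothing blocks
assert offsets == {"fast-job": 10, "slow-job": 2}
```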

Both Storm and Samza are built in this fashion and can use Kafka or other similar systems as their log.

Stateful Real-Time Processing

Some real-time stream processing is just stateless record-at-a-time transformation, but many of the uses are more sophisticated counts, aggregations, or joins over windows in the stream. One might, for example, want to enrich an event stream (say a stream of clicks) with information about the user doing the click—in effect joining the click stream to the user account database. Invariably, this kind of processing ends up requiring some kind of state to be maintained by the processor: for example, when computing a count, you have the count so far to maintain. How can this kind of state be maintained correctly if the processors themselves can fail?

The simplest alternative would be to keep state in memory. However, if the process crashed it would lose its intermediate state. If state is only maintained over a window, the process could just fall back to the point in the log where the window began. However, if one is doing a count over an hour, this may not be feasible.

An alternative is to simply store all state in a remote storage system and join over the network to that store. The problem with this is that there is no locality of data and lots of network round-trips.

How can we support something like a "table" that is partitioned up with our processing?

Well, recall the discussion of the duality of tables and logs. This gives us exactly the tool to convert streams to tables co-located with our processing, as well as a mechanism for handling fault tolerance for these tables.

A stream processor can keep its state in a local "table" or "index": a bdb, leveldb, or even something more unusual such as a Lucene or fastbit index. The contents of this store are fed from its input streams (perhaps after first applying arbitrary transformations). It can journal out a changelog for the local index it keeps, allowing it to restore its state in the event of a crash and restart. This provides a generic mechanism for keeping co-partitioned state, in arbitrary index types, local to the incoming stream data.

When the process fails, it restores its index from the changelog. The changelog is, in effect, the transformation of the local state into a sort of incremental, record-at-a-time backup.
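
A minimal sketch of this pattern, with in-memory stand-ins for the local store and the changelog (a real job might use leveldb locally and a log topic as the changelog): every update is journaled before being applied, and replaying the journal rebuilds the state after a crash.

```python
# Sketch of local state + changelog: every update to the local "table" is
# journaled to a changelog, and replaying the changelog restores the state
# after a crash. Both are in-memory stand-ins here.

changelog = []   # append-only log of (key, value) updates

class LocalStore:
    def __init__(self):
        self.table = {}

    def put(self, key, value):
        changelog.append((key, value))  # journal first...
        self._apply(key, value)         # ...then apply locally

    def _apply(self, key, value):
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

def restore():
    store = LocalStore()
    for key, value in changelog:   # replay without re-journaling
        store._apply(key, value)
    return store

store = LocalStore()
store.put("clicks:alice", 3)
store.put("clicks:alice", 4)
recovered = restore()              # simulate a crash and restart
assert recovered.get("clicks:alice") == 4
```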

This approach to state management has the elegant property that the state of the processors is also maintained as a log. We can think of this log just like we would the log of changes to a database table. In fact, the processors have something very like a co-partitioned table maintained along with them. Since this state is itself a log, other processors can subscribe to it. This can actually be quite useful in cases when the goal of the processing is to update a final state and this state is the natural output of the processing.

When combined with the logs coming out of databases for data integration purposes, the power of the log/table duality becomes clear. A change log may be extracted from a database and indexed in different forms by various stream processors to join against event streams.

We give more detail on this style of managing stateful processing in Samza and a lot more practical examples here.

Log Compaction

img

Of course, we can't hope to keep a complete log for all state changes for all time. Unless one wants to use infinite space, somehow the log must be cleaned up. I'll talk a little about the implementation of this in Kafka to make it more concrete. In Kafka, cleanup has two options depending on whether the data contains keyed updates or event data. For event data, Kafka supports just retaining a window of data. Usually, this is configured to a few days, but the window can be defined in terms of time or space. For keyed data, though, a nice property of the complete log is that you can replay it to recreate the state of the source system (potentially recreating it in another system).

However, retaining the complete log will use more and more space as time goes by, and the replay will take longer and longer. Hence, in Kafka, we support a different type of retention. Instead of simply throwing away the old log, we remove obsolete records—i.e. records whose primary key has a more recent update. By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones. We call this feature log compaction.
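
A sketch of the compaction idea itself: scan the log, keep only the most recent record for each key, and preserve the relative order of the survivors. This illustrates the principle, not Kafka's actual compactor.

```python
# Sketch of log compaction: retain only the most recent record per key.
# The compacted log is shorter but can still be replayed to reproduce the
# latest state of every key (though not earlier intermediate states).

def compact(log):
    latest = {}  # key -> offset of its last update
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset
    return [record for offset, record in enumerate(log)
            if latest[record[0]] == offset]

log = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("a", 3)]
assert compact(log) == [("b", 1), ("c", 1), ("a", 3)]

# Replaying either the full or the compacted log yields the same final state:
state = {}
for key, value in compact(log):
    state[key] = value
assert state == {"a": 3, "b": 1, "c": 1}
```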

Part Four: System Building

The final topic I want to discuss is the role of the log in data system design for online data systems.

There is an analogy here between the role a log serves for data flow inside a distributed database and the role it serves for data integration in a larger organization. In both cases, it is responsible for data flow, consistency, and recovery. What, after all, is an organization, if not a very complicated distributed data system?

Unbundling?

So maybe if you squint a bit, you can see the whole of your organization's systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view the stream processing systems like Storm or Samza as just a very well-developed trigger and view materialization mechanism. Classical database people, I have noticed, like this view very much because it finally explains to them what on earth people are doing with all these different data systems—they are just different index types!

There is undeniably now an explosion of types of data systems, but in reality, this complexity has always existed. Even in the heyday of the relational database, organizations had lots and lots of relational databases! So perhaps real integration hasn't existed since the mainframe when all the data really was in one place. There are many motivations for segregating data into multiple systems: scale, geography, security, and performance isolation are the most common. But these issues can be addressed by a good system: it is possible for an organization to have a single Hadoop cluster, for example, that contains all the data and serves a large and diverse constituency.

So there is already one possible simplification in the handling of data that has become possible in the move to distributed systems: coalescing lots of little instances of each system into a few big clusters. Many systems aren't good enough to allow this yet: they don't have security, or can't guarantee performance isolation, or just don't scale well enough. But each of these problems is solvable.

My take is that the explosion of different systems is caused by the difficulty of building distributed data systems. By cutting back to a single query type or use case each system is able to bring its scope down into the set of things that are feasible to build. But running all these systems yields too much complexity.

I see three possible directions this could follow in the future.

The first possibility is a continuation of the status quo: the separation of systems remains more or less as it is for a good deal longer. This could happen either because the difficulty of distribution is too hard to overcome or because this specialization allows new levels of convenience and power for each system. As long as this remains true, the data integration problem will remain one of the most centrally important things for the successful use of data. In this case, an external log that integrates data will be very important.

The second possibility is that there could be a re-consolidation in which a single system with enough generality starts to merge all the different functions back into a single uber-system. This uber-system could look like the relational database superficially, but its use in an organization would be far different, as you would need only one big one instead of umpteen little ones. In this world, there is no real data integration problem except what is solved inside this system. I think the practical difficulties of building such a system make this unlikely.

There is another possible outcome, though, which I actually find appealing as an engineer. One interesting facet of the new generation of data systems is that they are virtually all open source. Open source allows another possibility: data infrastructure could be unbundled into a collection of services and application-facing system APIs. You already see this happening to a certain extent in the Java stack:

If you stack these things in a pile and squint a bit, it starts to look a bit like a Lego version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems. This is clearly not a story relevant to end-users, who presumably care more about the API than about how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve. If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.

The place of the log in system architecture

A system that assumes an external log is present allows the individual systems to relinquish a lot of their own complexity and rely on the shared log. Here are the things I think a log can do:

  • Handle data consistency (whether eventual or immediate) by sequencing concurrent updates to nodes
  • Provide data replication between nodes
  • Provide "commit" semantics to the writer (i.e. acknowledging only when your write is guaranteed not to be lost)
  • Provide the external data subscription feed from the system
  • Provide the capability to restore failed replicas that lost their data or bootstrap new replicas
  • Handle rebalancing of data between nodes.

This is actually a substantial portion of what a distributed data system does. In fact, the majority of what is left over is related to the final client-facing query API and indexing strategy. This is exactly the part that should vary from system to system: for example, a full-text search query may need to query all partitions whereas a query by primary key may only need to query a single node responsible for that key's data.

Here is how this works. The system is divided into two logical pieces: the log and the serving layer. The log captures the state changes in sequential order. The serving nodes store whatever index is required to serve queries (for example, a key-value store might have something like a btree or sstable, a search system would have an inverted index). Writes may go directly to the log, or they may be proxied by the serving layer. Writing to the log yields a logical timestamp (say, the index in the log). If the system is partitioned, and I assume it is, then the log and the serving nodes will have the same number of partitions, though they may have very different numbers of machines.

img

The serving nodes subscribe to the log and apply writes to their local index as quickly as possible, in the order the log has stored them.

The client can get read-your-writes semantics from any node by providing the timestamp of a write as part of its query: a serving node receiving such a query will compare the desired timestamp to its own index point and, if necessary, delay the request until it has indexed up to at least that time, to avoid serving stale data.
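
A sketch of the read-your-writes mechanism with a toy log and serving node (names are illustrative): a write returns its log offset, and a read that carries an offset is held until the node has applied the log at least that far.

```python
# Sketch: log + serving layer with read-your-writes. A write returns its
# logical timestamp (log offset); a serving node delays any read carrying
# an offset it has not yet applied.

log = []   # the shared, ordered log

def write(key, value) -> int:
    log.append((key, value))
    return len(log) - 1   # logical timestamp of this write

class ServingNode:
    def __init__(self):
        self.index = {}           # local materialized view
        self.applied_upto = -1    # highest log offset applied so far

    def apply_next(self):
        # Subscribe to the log: apply entries strictly in log order.
        if self.applied_upto + 1 < len(log):
            self.applied_upto += 1
            key, value = log[self.applied_upto]
            self.index[key] = value

    def read(self, key, min_offset: int):
        while self.applied_upto < min_offset:
            self.apply_next()     # a real node would wait for replication
        return self.index.get(key)

node = ServingNode()
ts = write("user-1", "v2")
assert node.read("user-1", min_offset=ts) == "v2"  # never serves stale data
```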

The serving nodes may or may not need to have any notion of "mastership" or "leader election". For many simple use cases, the serving nodes can be completely without leaders, since the log is the source of truth.

One of the trickier things a distributed system must do is handle restoring failed nodes or moving partitions from node to node. A typical approach would have the log retain only a fixed window of data and combine this with a snapshot of the data stored in the partition. It is equally possible for the log to retain a complete copy of data and garbage collect the log itself. This moves a significant amount of complexity out of the serving layer, which is system-specific, and into the log, which can be general purpose.

By having this log system, you get a fully developed subscription API for the contents of the data store, which feeds ETL into other systems. In fact, many systems can share the same log while providing different indexes, like this:

img

Note how such a log-centric system is itself immediately a provider of data streams for processing and loading in other systems. Likewise, a stream processor can consume multiple input streams and then serve them via another system that indexes that output.

I find this view of systems as factored into a log and query API to be very revealing, as it lets you separate the query characteristics from the availability and consistency aspects of the system. I actually think this is even a useful way to mentally factor a system that isn't built this way, to better understand it.

It's worth noting that although Kafka and BookKeeper are consistent logs, this is not a requirement. You could just as easily factor a Dynamo-like database into an eventually consistent AP log and a key-value serving layer. Such a log is a bit tricky to work with, as it will redeliver old messages and depends on the subscriber to handle this (much like Dynamo itself).

The idea of having a separate copy of data in the log (especially if it is a complete copy) strikes many people as wasteful. In reality, though, there are a few factors that make this less of an issue. First, the log can be a particularly efficient storage mechanism. We store over 75TB per datacenter on our production Kafka servers. Meanwhile, many serving systems require much more memory to serve data efficiently (text search, for example, is often all in memory). The serving system may also use optimized hardware. For example, most of our live data systems either serve out of memory or else use SSDs. In contrast, the log system does only linear reads and writes, so it is quite happy using large multi-TB hard drives. Finally, as in the picture above, in the case where the data is served by multiple systems, the cost of the log is amortized over multiple indexes. This combination makes the expense of an external log pretty minimal.

This is exactly the pattern that LinkedIn has used to build out many of its own real-time query systems. These systems feed off a database (using Databus as a log abstraction or a dedicated log from Kafka) and provide a particular partitioning, indexing, and query capability on top of that data stream. This is the way we have implemented our search, social graph, and OLAP query systems. In fact, it is quite common to have a single data feed (whether a live feed or a derived feed coming from Hadoop) replicated into multiple serving systems for live serving. This has proven to be an enormous simplifying assumption. None of these systems need to have an externally accessible write API at all; Kafka and databases are used as the system of record, and changes flow to the appropriate query systems through that log. Writes are handled locally by the nodes hosting a particular partition. These nodes blindly transcribe the feed provided by the log to their own store. A failed node can be restored by replaying the upstream log.

The degree to which these systems rely on the log varies. A fully reliant system could make use of the log for data partitioning, node restore, rebalancing, and all aspects of consistency and data propagation. In this setup, the serving tier is nothing less than a sort of "cache" structured to enable a particular type of processing, with writes going directly to the log.

The End

If you made it this far you know most of what I know about logs.

Here are a few interesting references you may want to check out.

Everyone seems to use different terms for the same things, so it is a bit of a puzzle to connect the database literature to the distributed systems stuff to the various enterprise software camps to the open source world. Nonetheless, here are a few pointers in the general direction.

Academic papers, systems, talks, and blogs:

  • A good overview of state machine and primary-backup replication.

  • PacificA is a generic framework for implementing log-based distributed storage systems at Microsoft.

  • Spanner: Not everyone loves logical time for their logs. Google's new database tries to use physical time and models the uncertainty of clock drift directly by treating the timestamp as a range.

  • Datomic: Deconstructing the database is a great presentation by Rich Hickey, the creator of Clojure, on his startup's database product.

  • A Survey of Rollback-Recovery Protocols in Message-Passing Systems. I found this to be a very helpful introduction to fault tolerance and the practical application of logs to recovery outside databases.

  • Reactive Manifesto: I'm actually not quite sure what is meant by reactive programming, but I think it means the same thing as "event driven". This link doesn't have much info, but this class by Martin Odersky (of Scala fame) looks fascinating.

  • Paxos!

    • The original paper is here. Leslie Lamport has an interesting history of how the algorithm was created in the 1980s but not published until 1998 because the reviewers didn't like the Greek parable in the paper and he didn't want to change it.
    • Even once the original paper was published it wasn't well understood. Lamport tries again, and this time even includes a few of the "uninteresting details" of how to put it to use with these new-fangled automatic computers. It is still not widely understood.
    • Fred Schneider and Butler Lampson each give a more detailed overview of applying Paxos in real systems.
    • A few Google engineers summarize their experience implementing Paxos in Chubby.
    • I actually found all the Paxos papers pretty painful to understand but dutifully struggled through. But you don't need to, because this video by John Ousterhout (of log-structured filesystem fame!) will make it all very simple. Somehow these consensus algorithms are much better presented by drawing them as the communication rounds unfold, rather than in a static presentation in a paper. Ironically, this video was created in an attempt to show that Paxos was hard to understand.
    • Using Paxos to Build a Scalable Consistent Data Store: this is a cool paper on using a log to build a data store, by Jun Rao, one of the earliest engineers on Kafka.
  • Paxos has competitors! Actually each of these maps a lot more closely to the implementation of a log and is probably more suitable for practical implementation:

    • Viewstamped Replication by Barbara Liskov is an early algorithm to directly model log replication.
    • Zab is the algorithm used by Zookeeper.
    • RAFT is an attempt at a more understandable consensus algorithm. The video presentation, also by John Ousterhout, is great too.
  • You can see the role of the log in action in different real distributed databases:

    • PNUTS is a system which attempts to apply the log-centric design of traditional distributed databases at large scale.
    • HBase and Bigtable both give another example of logs in modern databases.
    • LinkedIn's own distributed database Espresso, like PNUTS, uses a log for replication, but takes a slightly different approach, using the underlying table itself as the source of the log.
  • If you find yourself comparison shopping for a replication algorithm, this paper may help you out.

  • Replication: Theory and Practice is a great book that collects a bunch of summary papers on replication in distributed systems. Many of the chapters are online (e.g. 1, 4, 5, 6, 7, 8).

  • Stream processing. This is a bit too broad to summarize, but here are a few things I liked.

Enterprise software has all the same problems but with different names, a smaller scale, and XML. Ha ha, just kidding. Kind of.

  • Event Sourcing: As far as I can tell this is basically the enterprise software engineer's way of saying "state machine replication". It's interesting that the same idea would be invented again in such a different context. Event sourcing seems to focus on smaller, in-memory use cases. This approach to application development seems to combine the "stream processing" that occurs on the log of events with the application. Since this becomes pretty non-trivial when the processing is large enough to require data partitioning for scale, I focus on stream processing as a separate infrastructure primitive.
  • Change Data Capture: There is a small industry around getting data out of databases, and this is the most log-friendly style of data extraction.
  • Enterprise Application Integration seems to be about solving the data integration problem when what you have is a collection of off-the-shelf enterprise software like CRM or supply-chain management software.
  • Complex Event Processing (CEP): Fairly certain nobody knows what this means or how it actually differs from stream processing. The difference seems to be that the focus is on unordered streams and on event filtering and detection rather than aggregation, but this, in my opinion, is a distinction without a difference. I think any system that is good at one should be good at the other.
  • Enterprise Service Bus: I think the enterprise service bus concept is very similar to some of the ideas I have described around data integration. This idea seems to have been moderately successful in enterprise software communities and is mostly unknown among web folks or the distributed data infrastructure crowd.

Interesting open source stuff:

  • Kafka is the "log as a service" project that is the basis for much of this post.
  • BookKeeper and Hedwig comprise another open source "log as a service". They seem to be more targeted at data system internals than at event data.
  • Databus is a system that provides a log-like overlay for database tables.
  • Akka is an actor framework for Scala. It has an add-on, eventsourced, that provides persistence and journaling.
  • Samza is a stream processing framework we are working on at LinkedIn. It uses a lot of the ideas in this article as well as integrating with Kafka as the underlying log.
  • Storm is a popular stream processing framework that integrates well with Kafka.
  • Spark Streaming is a stream processing framework that is part of Spark.
  • Summingbird is a layer on top of Storm or Hadoop that provides a convenient computing abstraction.

I try to keep up on this area so if you know of some things I've left out, let me know.
