5 reasons why Spark Streaming’s batch processing of data streams is not stream processing

Systems can handle real-time data in several ways before it is persisted in a database. Two of the most common open source platforms for this are Apache Storm and Apache Spark (with its Spark Streaming framework), and they take very different approaches to processing data streams. Storm, like SQLstream Blaze, IBM InfoSphere Streams and many others, is a true record-by-record stream processing engine. Others, such as Apache Spark, collect events together and process them in batches. I’ve summarized here the main points to weigh when deciding which paradigm is most appropriate.

#1 Stream Processing versus batch-based processing of data streams

Data stream processing has two fundamental attributes. First, each and every record in the system must have a timestamp, which in 99% of cases is the time at which the data was created. Second, each and every record is processed as it arrives. Together, these two attributes give a system that can react to the contents of every record and can correlate across multiple records over time, even at millisecond latency. In contrast, approaches such as Spark Streaming process data streams in batches, where each batch contains the collection of events that arrived during the batch period (regardless of when the data was actually created). This is fine for some applications, such as simple counts and ETL into Hadoop, but the lack of true record-by-record processing makes stream processing and time-series analytics impossible.
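The difference between the two models can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Storm or Spark Streaming are implemented; the `Record` type, the function names and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    created_ms: int   # when the data was created (hypothetical field)
    arrived_ms: int   # when the system received it (hypothetical field)
    value: float


def process_per_record(records: Iterable[Record],
                       handler: Callable[[Record], None]) -> None:
    """Record-by-record model: the handler runs once for every record, in arrival order."""
    for record in records:
        handler(record)


def group_into_batches(records: Iterable[Record],
                       batch_ms: int) -> List[Tuple[int, List[Record]]]:
    """Micro-batch model: bucket records by arrival time, ignoring creation time."""
    batches: Dict[int, List[Record]] = {}
    for record in records:
        batch_start = (record.arrived_ms // batch_ms) * batch_ms
        batches.setdefault(batch_start, []).append(record)
    return sorted(batches.items())


records = [
    Record(created_ms=100, arrived_ms=120, value=1.0),
    Record(created_ms=900, arrived_ms=950, value=2.0),
    Record(created_ms=400, arrived_ms=1300, value=3.0),  # delayed in transit
]

alerts: List[int] = []


def alert_on_high_value(record: Record) -> None:
    # React to the contents of a single record, using its own creation timestamp.
    if record.value > 1.5:
        alerts.append(record.created_ms)


process_per_record(records, alert_on_high_value)
batches = group_into_batches(records, batch_ms=1000)
```

In this sample, the record created at 400 ms arrives late and therefore falls into the second batch, after a record that was created at 900 ms. The per-record handler, by contrast, sees each record's own creation timestamp at the moment it arrives.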

#2 Data arriving out of time order is a problem for batch-based processing

Processing data in the real world is a messy business. Data is often of poor quality, records can be missing, and streams arrive with data out of (creation) time order. Data from multiple remote sources may be generated at the same time, but network or other issues can delay some of the streams. A corollary of stored batch processing of data streams is that these real-time factors cannot be addressed easily, which makes it impossible, or at best expensive in computing resources and therefore performance, to detect missing data and data gaps, to correct data that is out of time order, and so on. This is a simple problem to overcome for a record-by-record stream processing platform, where each record has its own timestamp and is processed individually.
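As a rough sketch of why per-record timestamps make this tractable, the following Python restores creation-time order with a small buffer and then looks for gaps. The function names, the `max_delay_ms` bound and the sample data are assumptions made for illustration; they are not taken from any particular engine.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple


def reorder_by_creation_time(records: Iterable[Tuple[int, str]],
                             max_delay_ms: int) -> Iterator[Tuple[int, str]]:
    """Yield (created_ms, payload) pairs in creation-time order.

    A record is held until a record at least max_delay_ms newer has been seen,
    so anything delayed by less than that bound is emitted in the right place.
    """
    heap: List[Tuple[int, str]] = []
    newest_seen: Optional[int] = None
    for created_ms, payload in records:
        heapq.heappush(heap, (created_ms, payload))
        newest_seen = created_ms if newest_seen is None else max(newest_seen, created_ms)
        while heap and heap[0][0] <= newest_seen - max_delay_ms:
            yield heapq.heappop(heap)
    while heap:  # end of input: flush whatever is still buffered
        yield heapq.heappop(heap)


def find_gaps(ordered: Iterable[Tuple[int, str]],
              expected_interval_ms: int) -> List[Tuple[int, int]]:
    """Return (previous_ms, next_ms) pairs that are further apart than expected."""
    gaps: List[Tuple[int, int]] = []
    previous: Optional[int] = None
    for created_ms, _ in ordered:
        if previous is not None and created_ms - previous > expected_interval_ms:
            gaps.append((previous, created_ms))
        previous = created_ms
    return gaps


# Records listed in arrival order; "b" and "e" were delayed in transit.
arrivals = [(1000, "a"), (3000, "c"), (2000, "b"), (6000, "f"), (5000, "e")]
ordered = list(reorder_by_creation_time(arrivals, max_delay_ms=2000))
gaps = find_gaps(ordered, expected_interval_ms=1000)
```

The trade-off is in `max_delay_ms`: a larger bound tolerates later records but holds results back for longer. A record that arrives later than the bound is emitted immediately and out of order in this sketch, so a real system would need a policy for such stragglers.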

#3 Batch length restricts Window-based analytics

Any system that uses batch-based processing of data streams limits the granularity of its response to the batch length. Window operations can be simulated by iterating repeatedly over a series of micro-batches, in much the same way as static queries operate over stored data. However, this is expensive in processing resources and adds further to the computation overhead. Processing is also still tied to the arrival time of the data rather than the time at which the data was created.
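For comparison, here is a minimal sketch of a window that slides with every record and is keyed on creation time, so its granularity is one record rather than one batch. The `SlidingAverage` class is hypothetical, and it assumes records are already in creation-time order.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingAverage:
    """Average of the values whose creation time falls within the last window_ms."""

    def __init__(self, window_ms: int) -> None:
        if window_ms <= 0:
            raise ValueError("window_ms must be positive")
        self.window_ms = window_ms
        self._items: Deque[Tuple[int, float]] = deque()
        self._total = 0.0

    def add(self, created_ms: int, value: float) -> float:
        """Add one record and return the updated window average."""
        self._items.append((created_ms, value))
        self._total += value
        # Evict records that have slid out of the window. The record just added
        # is always inside the window, so the deque never becomes empty here.
        while self._items[0][0] <= created_ms - self.window_ms:
            _, old_value = self._items.popleft()
            self._total -= old_value
        return self._total / len(self._items)


window = SlidingAverage(window_ms=1000)
readings = [(0, 10.0), (400, 20.0), (900, 30.0), (1200, 40.0)]
averages = [window.add(created_ms, value) for created_ms, value in readings]
```

Each call to `add` does a constant amount of work on average, because every record is appended once and evicted at most once; nothing is recomputed over earlier batches.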

#4 Spark claims to be faster than Storm but is still performance limited

Spark Streaming’s Java or Scala-based execution architecture is claimed to be 4X to 8X faster than Apache Storm on the WordCount benchmark. However, by today’s stream processing standards Apache Storm offers limited performance per server, although it does scale out over large numbers of servers to gain overall system performance. (This can make larger systems expensive, both in server, power and cooling costs and in the additional complexity of a distributed system.) The point here is that Spark Streaming’s performance can be improved by using larger batches, which may explain the performance increase, but larger batches move further away from real-time processing towards stored batch mode, and exacerbate the issues with stream processing and real-time, time-based analytics.
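The batch-size trade-off can be made concrete with a back-of-the-envelope model. Everything here is an assumption for illustration: the fixed per-batch overhead, the per-record cost and the arrival rate are made-up parameters, not measurements of Spark Streaming or Storm.

```python
from typing import Dict


def micro_batch_tradeoff(batch_interval_ms: float,
                         arrival_rate_per_ms: float,
                         fixed_overhead_ms: float,
                         per_record_cost_ms: float) -> Dict[str, float]:
    """Toy cost model: each batch pays a fixed overhead plus a per-record cost."""
    records_per_batch = batch_interval_ms * arrival_rate_per_ms
    processing_ms = fixed_overhead_ms + records_per_batch * per_record_cost_ms
    return {
        # Share of the processing time that is pure per-batch overhead.
        "overhead_share": fixed_overhead_ms / processing_ms,
        # A record waits, on average, half a batch interval before work starts.
        "mean_wait_ms": batch_interval_ms / 2,
        # A record arriving just after a batch opens waits the whole interval,
        # then for the batch to be processed.
        "worst_case_latency_ms": batch_interval_ms + processing_ms,
    }


small = micro_batch_tradeoff(batch_interval_ms=500, arrival_rate_per_ms=10,
                             fixed_overhead_ms=50, per_record_cost_ms=0.01)
large = micro_batch_tradeoff(batch_interval_ms=5000, arrival_rate_per_ms=10,
                             fixed_overhead_ms=50, per_record_cost_ms=0.01)
```

Under these assumed numbers, the larger batch spends a smaller share of its time on overhead, which is where the throughput gain comes from, but every record waits roughly ten times longer before it is seen.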

#5 Writing stream processing operations from scratch is not easy

Batch-based platforms such as Spark Streaming typically offer limited libraries of stream functions that are called programmatically to perform aggregations and counts on the arriving data. Developing a streaming analytics application on Spark Streaming, for example, requires writing code in Java or Scala. Processing data streams is a different paradigm, and moreover Java is typically 50X less compact than, say, SQL, so significantly more code is required. Java and Scala also rely heavily on garbage collection, which is particularly inefficient and troublesome for in-memory processing.

posted @ 2017-08-08 09:38  静若清池