Flume Official Documentation Translation: Flume 1.7.0 User Guide (unreleased version) (Part 1)

Flume 1.7.0 User Guide

 

 

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Apache Flume is not restricted to log data aggregation. Because data sources are customizable, Flume can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and almost any other data source.

Apache Flume is a top-level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.4.x track.

New and existing users are encouraged to use the 1.x releases to take advantage of the performance improvements and configuration flexibility of the latest architecture.

System Requirements

    1. Java Runtime Environment - Java 1.7 or later
    2. Memory - Sufficient memory for configurations used by sources, channels or sinks
    3. Disk Space - Sufficient disk space for configurations used by channels or sinks
    4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

 

A Flume source consumes events delivered to it by an external source such as a web server. The external source sends events to Flume in a format that the target Flume source recognizes. For example, an Avro Flume source can receive Avro events from Avro clients or from other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol.

When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink. The file channel is one example; it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via the Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel.

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing, and backup routes (fail-over) for failed hops.
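A two-hop flow with a fail-over route might be configured along the following lines. This is a minimal sketch, not taken from the guide: the agent names web and collector, the component names, the host names collector1.example.com and collector2.example.com, and port 4545 are all hypothetical. The first agent forwards events through Avro sinks, and the second receives them with an Avro source.

```properties
# Hypothetical first-hop agent "web": reads lines from a netcat source
# and forwards them to a collector agent over Avro.
web.sources = r1
web.channels = c1
web.sinks = k1 k2
web.sinkgroups = g1

web.sources.r1.type = netcat
web.sources.r1.bind = localhost
web.sources.r1.port = 44444
web.sources.r1.channels = c1

web.channels.c1.type = memory

# Primary route to the next hop (host name is an assumption)
web.sinks.k1.type = avro
web.sinks.k1.hostname = collector1.example.com
web.sinks.k1.port = 4545
web.sinks.k1.channel = c1

# Backup route, used when the primary hop fails (host name is an assumption)
web.sinks.k2.type = avro
web.sinks.k2.hostname = collector2.example.com
web.sinks.k2.port = 4545
web.sinks.k2.channel = c1

# Failover sink processor: the sink with the higher priority is tried first
web.sinkgroups.g1.sinks = k1 k2
web.sinkgroups.g1.processor.type = failover
web.sinkgroups.g1.processor.priority.k1 = 10
web.sinkgroups.g1.processor.priority.k2 = 5

# Hypothetical second-hop agent "collector", run on each collector host
collector.sources = r1
collector.channels = c1
collector.sinks = k1

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

collector.channels.c1.type = file

collector.sinks.k1.type = logger
collector.sinks.k1.channel = c1
```

Both agents could live in one configuration file or in separate files; each Flume process only runs the agent whose name it is given at startup.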

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of events. Sources and sinks each wrap their work in a transaction provided by the channel: a source stores events into the channel within a transaction, and a sink retrieves events from the channel within a transaction. This ensures that the set of events is reliably passed from point to point in the flow. In a multi-hop flow, the sink of the previous hop and the source of the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
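How many events move in one channel transaction is a matter of configuration. The fragment below is a sketch with hypothetical names (a1, c1, k1) and an assumed HDFS path; it pairs the memory channel's transactionCapacity with the HDFS sink's hdfs.batchSize so that one sink batch fits within a single transaction.

```properties
a1.channels.c1.type = memory
# Maximum number of events held in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction (a put by a source or a take by a sink)
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Assumed path, for illustration only
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com/flume/events
# Events written per batch before flushing to HDFS; kept at or below transactionCapacity
a1.sinks.k1.hdfs.batchSize = 100
```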

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel, which is backed by the local file system. There is also a memory channel, which simply stores the events in an in-memory queue. The memory channel is faster, but any events still left in it when an agent process dies cannot be recovered.
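The trade-off between the two channel types shows up directly in configuration. The following sketch declares one channel of each kind; the agent name a1, the channel names durable and fast, and the directory paths are assumptions chosen for illustration.

```properties
a1.channels = durable fast

# File channel: events are written to local disk, so they survive an agent restart
a1.channels.durable.type = file
a1.channels.durable.checkpointDir = /var/flume/checkpoint
a1.channels.durable.dataDirs = /var/flume/data

# Memory channel: faster, but events still queued when the agent dies are lost
a1.channels.fast.type = memory
a1.channels.fast.capacity = 10000
```

The agent's user needs read/write permission on the file channel directories.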

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes the properties of each source, sink, and channel in an agent and how they are wired together to form data flows.

Configuring individual components

Each component (source, sink, or channel) in the flow has a name, a type, and a set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a maximum queue size ("capacity"), and an HDFS sink needs to know the file system URI, the path in which to create files, the frequency of file rotation ("hdfs.rollInterval"), and so on. All such attributes of a component need to be set in the properties file of the hosting Flume agent.
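The three components mentioned above could be configured roughly as follows. This is a sketch: the agent name a1, the component names r1, c1 and k1, the port, and the HDFS URI are hypothetical. It only sets per-component properties; the declarations that name the components and bind them to channels are omitted.

```properties
# Avro source: address and port to listen on
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Memory channel: maximum number of events held in the queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: file system URI and path, plus file rotation interval in seconds
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com/flume/webdata
a1.sinks.k1.hdfs.rollInterval = 30
```

Every property key follows the pattern agent name, component kind, component name, property name.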

Wiring the pieces together

The agent needs to know which individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks, and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink called hdfs-cluster1 via a file channel called file-channel. The configuration file will contain the names of these components, with file-channel as the shared channel for both the avroWeb source and the hdfs-cluster1 sink.
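The avroWeb example could be wired like this. The component names come from the text above; the agent name agent1 is an assumption, and the per-component type and property settings are left out to keep the focus on the wiring.

```properties
# List the components that make up the agent
agent1.sources = avroWeb
agent1.sinks = hdfs-cluster1
agent1.channels = file-channel

# Connect the source and the sink through the shared channel
agent1.sources.avroWeb.channels = file-channel
agent1.sinks.hdfs-cluster1.channel = file-channel
```

Note the asymmetry: a source uses the plural key channels because it can write to several channels, while a sink uses the singular key channel because it reads from exactly one.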

Starting an agent

An agent is started using a shell script called flume-ng, which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

 

Now the agent will start running the sources and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched, a flag is passed telling it which named agent to run.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

 

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console, and we go without a custom environment script.

From a separate terminal, we can then telnet to port 44444 and send Flume an event:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

 

The original Flume terminal will output the event in a log message:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

 

Congratulations! You have successfully configured and deployed a Flume agent. Subsequent sections cover agent configuration in much more detail.

Posted 2016-12-05 18:31 by 孙朝和