apache flink
apache flink 介绍
1,apache flink 是为分布式,高性能,随时可用以及准确的流处理应用程序打造的开源流处理框架
2,apache flink 是一个框架和分布式处理引擎,用于对有界和无界数据进行状态计算,以内存速度和任意规模来执行计算
批处理与流处理:
1,批处理是从数据仓库读取某一天,某一个月数据进行处理,需要访问全套记录才能完成计算工作,关注高吞吐,高效处理
2,流处理无需对整个数据集进行操作,主要是针对系统传输的每个数据项执行操作,一般用于实时统计,要求低延迟与精确一致保证
3,flink 既可以流处理也可以批处理(当做特殊的流处理)
无界数据流:
1, 无界数据流 只有开始但是没有结束,划分区间进行统计操作
2, 通常要求以特定的顺序(事件发生顺序)获取 event,以便于推断结果完整性
有界数据流:
1,有明确的定义的开始和结束
flink 架构:
1,JobManager(JVM 进程) 处理器:
master(driver), 用于协调分布式执行,调度 task ,协调检查点(失败恢复). flink 运行时至少存在一个 master(可多个,一个 leader 其他standby)
2,TaskManger(JVM 进程) 处理器:
1).work(executor) 用于执行一个 dataflow 的 task(或者特殊的 subtask), 数据缓冲和 data stream, flink 运行至少需要一个 worker
2),TaskManger 会定时向 JobManager 进行心跳汇报
一,standalone 安装
1,首先在 JobManager 进行解压, 在 flink-conf.yaml 配置jobmanager.rpc.address
################################################################################ # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. ################################################################################ #============================================================================== # Common #============================================================================== # The external address of the host on which the JobManager runs and can be # reached by the TaskManagers and any clients which want to connect. This setting # is only used in Standalone mode and may be overwritten on the JobManager side # by specifying the --host <hostname> parameter of the bin/jobmanager.sh executable. # In high availability mode, if you use the bin/start-cluster.sh script and setup # the conf/masters file, this will be taken care of automatically. Yarn/Mesos # automatically configure the host name based on the hostname of the node where the # JobManager runs. jobmanager.rpc.address: master # The RPC port where the JobManager is reachable. jobmanager.rpc.port: 6123 # The heap size for the JobManager JVM jobmanager.heap.size: 1024m # The heap size for the TaskManager JVM taskmanager.heap.size: 1024m # The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline. taskmanager.numberOfTaskSlots: 1 # The parallelism used for programs that did not specify and other parallelism. parallelism.default: 1 # The default file system scheme and authority. # # By default file paths without scheme are interpreted relative to the local # root file system 'file:///'. Use this to override the default and interpret # relative paths relative to a different file system, # for example 'hdfs://mynamenode:12345' # # fs.default-scheme #============================================================================== # High Availability #============================================================================== # The high-availability mode. Possible options are 'NONE' or 'zookeeper'. # # high-availability: zookeeper # The path where metadata for master recovery is persisted. While ZooKeeper stores # the small ground truth for checkpoint and leader election, this location stores # the larger objects, like persisted dataflow graphs. # # Must be a durable file system that is accessible from all nodes # (like HDFS, S3, Ceph, nfs, ...) # # high-availability.storageDir: hdfs:///flink/ha/ # The list of ZooKeeper quorum peers that coordinate the high-availability # setup. This must be a list of the form: # "host1:clientPort,host2:clientPort,..." (default clientPort: 2181) # # high-availability.zookeeper.quorum: localhost:2181 # ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes # It can be either "creator" (ZOO_CREATE_ALL_ACL) or "open" (ZOO_OPEN_ACL_UNSAFE) # The default value is "open" and it can be changed to "creator" if ZK security is enabled # # high-availability.zookeeper.client.acl: open #============================================================================== # Fault tolerance and checkpointing #============================================================================== # The backend that will be used to store operator state checkpoints if # checkpointing is enabled. # # Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the # <class-name-of-factory>. # # state.backend: filesystem # Directory for checkpoints filesystem, when using any of the default bundled # state backends. # # state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints # Default target directory for savepoints, optional. # # state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints # Flag to enable/disable incremental checkpoints for backends that # support incremental checkpoints (like the RocksDB state backend). # # state.backend.incremental: false #============================================================================== # Web Frontend #============================================================================== # The address under which the web-based runtime monitor listens. # #jobmanager.web.address: 0.0.0.0 # The port under which the web-based runtime monitor listens. # A value of -1 deactivates the web server. rest.port: 8081 # Flag to specify whether job submission is enabled from the web-based # runtime monitor. Uncomment to disable. #jobmanager.web.submit.enable: false #============================================================================== # Advanced #============================================================================== # Override the directories for temporary files. If not specified, the # system-specific Java temporary directory (java.io.tmpdir property) is taken. # # For framework setups on Yarn or Mesos, Flink will automatically pick up the # containers' temp directories without any need for configuration. # # Add a delimited list for multiple directories, using the system directory # delimiter (colon ':' on unix) or a comma, e.g.: # /data1/tmp:/data2/tmp:/data3/tmp # # Note: Each directory entry is read from and written to by a different I/O # thread. You can include the same directory multiple times in order to create # multiple I/O threads against that directory. This is for example relevant for # high-throughput RAIDs. # # io.tmp.dirs: /tmp # Specify whether TaskManager's managed memory should be allocated when starting # up (true) or when memory is requested. # # We recommend to set this value to 'true' only in setups for pure batch # processing (DataSet API). Streaming setups currently do not use the TaskManager's # managed memory: The 'rocksdb' state backend uses RocksDB's own memory management, # while the 'memory' and 'filesystem' backends explicitly keep data as objects # to save on serialization cost. # # taskmanager.memory.preallocate: false # The classloading resolve order. Possible values are 'child-first' (Flink's default) # and 'parent-first' (Java's default). # # Child first classloading allows users to use different dependency/library # versions in their application than those in the classpath. Switching back # to 'parent-first' may help with debugging dependency issues. # # classloader.resolve-order: child-first # The amount of memory going to the network stack. These numbers usually need # no tuning. Adjusting them may be necessary in case of an "Insufficient number # of network buffers" error. The default min is 64MB, teh default max is 1GB. # # taskmanager.network.memory.fraction: 0.1 # taskmanager.network.memory.min: 67108864 # taskmanager.network.memory.max: 1073741824 #============================================================================== # Flink Cluster Security Configuration #============================================================================== # Kerberos authentication for various components - Hadoop, ZooKeeper, and connectors - # may be enabled in four steps: # 1. configure the local krb5.conf file # 2. provide Kerberos credentials (either a keytab or a ticket cache w/ kinit) # 3. make the credentials available to various JAAS login contexts # 4. configure the connector to use JAAS/SASL # The below configure how Kerberos credentials are provided. A keytab will be used instead of # a ticket cache if the keytab path and principal are set. # security.kerberos.login.use-ticket-cache: true # security.kerberos.login.keytab: /path/to/kerberos/keytab # security.kerberos.login.principal: flink-user # The configuration below defines which JAAS login contexts # security.kerberos.login.contexts: Client,KafkaClient #============================================================================== # ZK Security Configuration #============================================================================== # Below configurations are applicable if ZK ensemble is configured for security # Override below configuration to provide custom ZK service name if configured # zookeeper.sasl.service-name: zookeeper # The configuration below must match one of the values set in "security.kerberos.login.contexts" # zookeeper.sasl.login-context-name: Client #============================================================================== # HistoryServer #============================================================================== # The HistoryServer is started and stopped via bin/historyserver.sh (start|stop) # Directory to upload completed jobs to. Add this directory to the list of # monitored directories of the HistoryServer as well (see below). #jobmanager.archive.fs.dir: hdfs:///completed-jobs/ # The address under which the web-based HistoryServer listens. #historyserver.web.address: 0.0.0.0 # The port under which the web-based HistoryServer listens. #historyserver.web.port: 8082 # Comma separated list of directories to monitor for completed jobs. #historyserver.archive.fs.dir: hdfs:///completed-jobs/ # Interval in milliseconds for refreshing the monitored directories. #historyserver.archive.fs.refresh-interval: 10000
2,配置 slaves(只配置 TaskManger)
slave1
slave2
3,在 flink JobManager 启动集群
#StandaloneSessionClusterEntrypoint 进程与 TaskManagerRunner 进程启动 bin/start-cluster.sh
4,打开 UI 界面 IP:8081,并分发
二,yarn 模式安装(前四步同上)
5,开启hdfs 与 yarn
sbin/start-all.sh
6,在 JobManager 提交 yarn-session
bin/yarn-session.sh -n 2 -s 6 -jm 1024 -tm 1024 -nm test -d ''' -n 指的是 TaskManager 数量 -s 值得是 一个 TaskManager 的 slots(task) 数量, slots 平均分配内存 -jm JobManager 内存(MB) -tm 每个TaskManager 内存(MB) -nm yarn 的 appName -d 后台运行 '''
7,提交 jar 包到 集群
bin/flink run -m yarn-cluster examples/batch/WordCount.jar