Migrating ES with Logstash

Logstash basics

1. Architecture diagram

The overall Logstash architecture (diagram omitted): arrows in the diagram represent the direction of data flow. Multiple inputs can be configured; the central queue dispatches data to the different pipelines; and each pipeline consists of batcher, filter, and output stages.

The batcher pulls data from the queue in batches; the number of events per batch is configurable.

2. Logstash data processing flow

Overall, Logstash processes data in three stages, input → filter → output, corresponding to data collection → data parsing/filtering → data output.
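
As a purely illustrative sketch (unrelated to the migration job configured later), a minimal pipeline that reads from stdin, adds a field in the filter stage, and prints the result to stdout looks like this:

input { stdin { } }

filter {
   mutate { add_field => { "stage" => "filtered" } }
}

output { stdout { codec => rubydebug } }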

3. The Logstash message queue

Because Logstash needs to parse and normalize the data it processes, an intermediate storage layer is required. Logstash uses a queue to hold the data that arrives at the process from the inputs.

Logstash uses an ACK mechanism to confirm that the output has consumed the events taken from the queue.

3.1 Queue types

In Memory: the queue is kept in memory, which is fast, but it cannot survive a process crash or a machine outage; in those extreme cases data is lost.

Persistent Queue (PQ): data is persisted to disk, and once the PQ receives the ACK for an event from the downstream filter/output stage, the backup data on disk is deleted.

In terms of performance, PQ disk operations are mostly sequential, so the difference from the in-memory queue is small; unless you need to squeeze out the last bit of speed, enabling PQ is generally recommended.
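
PQ is enabled in logstash.yml (or per pipeline in pipelines.yml); a minimal sketch, with illustrative values:

queue.type: persisted
queue.max_bytes: 5gb           # upper bound on disk space used by the queue
# path.queue: /data/ls_queue   # optional; defaults to path.data/queue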

4. Core configuration files

Sections 4.1-4.3 are Logstash's own configuration files;

Sections 4.4-4.5 are the configuration for the migration task itself.

4.1 jvm.options

## JVM configuration
 
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
 
-Xms5g
-Xmx5g
 
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
 
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
 
## Locale
# Set the locale language
#-Duser.language=en
 
# Set the locale country
#-Duser.country=US
 
# Set the locale variant, if any
#-Duser.variant=
 
## basic
 
# set the I/O temp directory
#-Djava.io.tmpdir=$HOME
 
# set to headless, just in case
-Djava.awt.headless=true
 
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
 
# use our provided JNA always versus the system one
#-Djna.nosys=true
 
# Turn on JRuby invokedynamic
-Djruby.compile.invokedynamic=true
# Force Compilation
-Djruby.jit.threshold=0
 
## heap dumps
 
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
 
# specify an alternative path for heap dumps
# ensure the directory exists and has sufficient space
#-XX:HeapDumpPath=${LOGSTASH_HOME}/heapdump.hprof
 
## GC logging
#-XX:+PrintGCDetails
#-XX:+PrintGCTimeStamps
#-XX:+PrintGCDateStamps
#-XX:+PrintClassHistogram
#-XX:+PrintTenuringDistribution
#-XX:+PrintGCApplicationStoppedTime
 
# log GC status to a file with time stamps
# ensure the directory exists
#-Xloggc:${LS_GC_LOG_FILE}
 
# Entropy source for randomness
-Djava.security.egd=file:/dev/urandom

The key settings are -Xms, -Xmx, and the garbage collector.

In most cases, setting the heap size is enough; everything else can be left at its default.
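
As a rule-of-thumb sketch (the 16 GB figure is an assumption, not a measured requirement): on a migration host with 16 GB of RAM, giving Logstash roughly half of it would look like:

-Xms8g
-Xmx8g

Keeping -Xms and -Xmx equal avoids heap resizing at runtime.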

4.2 logstash.yml

# Settings file in YAML
#
# Settings can be specified either in hierarchical form, e.g.:
#
#   pipeline:
#     batch:
#       size: 125
#       delay: 5
#
# Or as flat keys:

#  pipeline.batch.size: 125
#  pipeline.batch.delay: 5

# ------------  Node identity ------------
#
# Use a descriptive name for the node:
#
# node.name: test
#
# If omitted the node name will default to the machine's host name
#
# ------------ Data path ------------------
#
# Which directory should be used by logstash and its plugins
# for any persistent needs. Defaults to LOGSTASH_HOME/data
#
# path.data:
#
# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
#
pipeline.id: note_comment_main
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
#
pipeline.workers: 4
#
# How many events to retrieve from inputs before sending to filters+workers
#
pipeline.batch.size: 1000
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
pipeline.batch.delay: 50
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: enabling this can lead to data loss during shutdown
#
pipeline.unsafe_shutdown: false
#
# ------------ Pipeline Configuration Settings --------------
#
# Where to fetch the pipeline configuration for the main pipeline
#
# path.config:
#
# Pipeline configuration string for the main pipeline
#
# config.string:
#
# At startup, test if the configuration is valid and exit (dry run)
#
# config.test_and_exit: false
#
# Periodically check if the configuration has changed and reload the pipeline
# This can also be triggered manually through the SIGHUP signal
#
# config.reload.automatic: false
#
# How often to check if the pipeline configuration has changed (in seconds)
#
# config.reload.interval: 3s
#
# Show fully compiled configuration as debug log message
# NOTE: --log.level must be 'debug'
#
# config.debug: false
#
# When enabled, process escaped characters such as \n and \" in strings in the
# pipeline configuration files.
#
# config.support_escapes: false
#
# ------------ Module Settings ---------------
# Define modules here.  Modules definitions must be defined as an array.
# The simple way to see this is to prepend each `name` with a `-`, and keep
# all associated variables under the `name` they are associated with, and
# above the next, like this:
#
# modules:
#   - name: MODULE_NAME
#     var.PLUGINTYPE1.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE1.PLUGINNAME1.KEY2: VALUE
#     var.PLUGINTYPE2.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE3.PLUGINNAME3.KEY1: VALUE
#
# Module variable names must be in the format of
#
# var.PLUGIN_TYPE.PLUGIN_NAME.KEY
#
# modules:
#
# ------------ Cloud Settings ---------------
# Define Elastic Cloud settings here.
# Format of cloud.id is a base64 value e.g. dXMtZWFzdC0xLmF3cy5mb3VuZC5pbyRub3RhcmVhbCRpZGVudGlmaWVy
# and it may have an label prefix e.g. staging:dXMtZ...
# This will overwrite 'var.elasticsearch.hosts' and 'var.kibana.host'
# cloud.id: <identifier>
#
# Format of cloud.auth is: <user>:<pass>
# This is optional
# If supplied this will overwrite 'var.elasticsearch.username' and 'var.elasticsearch.password'
# If supplied this will overwrite 'var.kibana.username' and 'var.kibana.password'
# cloud.auth: elastic:<password>
#
# ------------ Queuing Settings --------------
#
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
#
queue.type: persisted
#
# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue
#
# path.queue:
#
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 64mb
#
queue.page_capacity: 64mb
#
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
#
queue.max_events: 0
#
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
#
queue.max_bytes: 5gb
#
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
queue.checkpoint.acks: 1024
#
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
queue.checkpoint.writes: 1024
#
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
#
queue.checkpoint.interval: 1000
#
# ------------ Dead-Letter Queue Settings --------------
# Flag to turn on dead-letter queue.
#
# dead_letter_queue.enable: false

# If using dead_letter_queue.enable: true, the maximum size of each dead letter queue. Entries
# will be dropped if they would increase the size of the dead letter queue beyond this setting.
# Default is 1024mb
# dead_letter_queue.max_bytes: 1024mb

# If using dead_letter_queue.enable: true, the directory path where the data files will be stored.
# Default is path.data/dead_letter_queue
#
# path.dead_letter_queue:
#
# ------------ Metrics Settings --------------
#
# Bind address for the metrics REST endpoint
#
# http.host: "127.0.0.1"
#
# Bind port for the metrics REST endpoint, this option also accept a range
# (9600-9700) and logstash will pick up the first available ports.
#
# http.port: 9600-9700
#
# ------------ Debugging Settings --------------
#
# Options for log.level:
#   * fatal
#   * error
#   * warn
#   * info (default)
#   * debug
#   * trace
#
# log.level: info
# path.logs:
#
# ------------ Other Settings --------------
#
# Where to find custom plugins
# path.plugins: []
#
# ------------ X-Pack Settings (not applicable for OSS build)--------------
#
# X-Pack Monitoring
# https://www.elastic.co/guide/en/logstash/current/monitoring-logstash.html
#xpack.monitoring.enabled: false
#xpack.monitoring.elasticsearch.username: logstash_system
#xpack.monitoring.elasticsearch.password: password
#xpack.monitoring.elasticsearch.hosts: ["https://es1:9200", "https://es2:9200"]
#xpack.monitoring.elasticsearch.ssl.certificate_authority: [ "/path/to/ca.crt" ]
#xpack.monitoring.elasticsearch.ssl.truststore.path: path/to/file
#xpack.monitoring.elasticsearch.ssl.truststore.password: password
#xpack.monitoring.elasticsearch.ssl.keystore.path: /path/to/file
#xpack.monitoring.elasticsearch.ssl.keystore.password: password
#xpack.monitoring.elasticsearch.ssl.verification_mode: certificate
#xpack.monitoring.elasticsearch.sniffing: false
#xpack.monitoring.collection.interval: 10s
#xpack.monitoring.collection.pipeline.details.enabled: true
#
# X-Pack Management
# https://www.elastic.co/guide/en/logstash/current/logstash-centralized-pipeline-management.html
#xpack.management.enabled: false
#xpack.management.pipeline.id: ["main", "apache_logs"]
#xpack.management.elasticsearch.username: logstash_admin_user
#xpack.management.elasticsearch.password: password
#xpack.management.elasticsearch.hosts: ["https://es1:9200", "https://es2:9200"]
#xpack.management.elasticsearch.ssl.certificate_authority: [ "/path/to/ca.crt" ]
#xpack.management.elasticsearch.ssl.truststore.path: /path/to/file
#xpack.management.elasticsearch.ssl.truststore.password: password
#xpack.management.elasticsearch.ssl.keystore.path: /path/to/file
#xpack.management.elasticsearch.ssl.keystore.password: password
#xpack.management.elasticsearch.ssl.verification_mode: certificate
#xpack.management.elasticsearch.sniffing: false
#xpack.management.logstash.poll_interval: 5s

The settings that matter most here are the pipeline-related ones: the pipeline id, the queue type and its maximum capacity, and the batch size.
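
For quick reference, these are the values explicitly set in the file above (everything else is left commented out at its default):

pipeline.id: note_comment_main
pipeline.workers: 4
pipeline.batch.size: 1000
pipeline.batch.delay: 50
pipeline.unsafe_shutdown: false
queue.type: persisted
queue.page_capacity: 64mb
queue.max_events: 0
queue.max_bytes: 5gb
queue.checkpoint.acks: 1024
queue.checkpoint.writes: 1024
queue.checkpoint.interval: 1000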

4.3 pipelines.yml

# List of pipelines to be loaded by Logstash
#
# This document must be a list of dictionaries/hashes, where the keys/values are pipeline settings.
# Default values for omitted settings are read from the `logstash.yml` file.
# When declaring multiple pipelines, each MUST have its own `pipeline.id`.
#
# Example of two pipelines:
#
# - pipeline.id: test
#   pipeline.workers: 1
#   pipeline.batch.size: 1
#   config.string: "input { generator {} } filter { sleep { time => 1 } } output { stdout { codec => dots } }"
# - pipeline.id: another_test
#   queue.type: persisted
#   path.config: "/tmp/logstash/*.config"
#
# Available options:
#
#   # name of the pipeline
#   pipeline.id: mylogs
#
#   # The configuration string to be used by this pipeline
#   config.string: "input { generator {} } filter { sleep { time => 1 } } output { stdout { codec => dots } }"
#
#   # The path from where to read the configuration text
#   path.config: "/etc/conf.d/logstash/myconfig.cfg"
#
#   # How many worker threads execute the Filters+Outputs stage of the pipeline
#   pipeline.workers: 1 (actually defaults to number of CPUs)
#
#   # How many events to retrieve from inputs before sending to filters+workers
#   pipeline.batch.size: 125
#
#   # How long to wait in milliseconds while polling for the next event
#   # before dispatching an undersized batch to filters+outputs
#   pipeline.batch.delay: 50
#
#   # Internal queuing model, "memory" for legacy in-memory based queuing and
#   # "persisted" for disk-based acked queueing. Defaults is memory
#   queue.type: memory
#
#   # If using queue.type: persisted, the page data files size. The queue data consists of
#   # append-only data files separated into pages. Default is 64mb
#   queue.page_capacity: 64mb
#
#   # If using queue.type: persisted, the maximum number of unread events in the queue.
#   # Default is 0 (unlimited)
#   queue.max_events: 0
#
#   # If using queue.type: persisted, the total capacity of the queue in number of bytes.
#   # Default is 1024mb or 1gb
#   queue.max_bytes: 1024mb
#
#   # If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
#   # Default is 1024, 0 for unlimited
#   queue.checkpoint.acks: 1024
#
#   # If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
#   # Default is 1024, 0 for unlimited
#   queue.checkpoint.writes: 1024
#
#   # If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
#   # Default is 1000, 0 for no periodic checkpoint.
#   queue.checkpoint.interval: 1000
#
#   # Enable Dead Letter Queueing for this pipeline.
#   dead_letter_queue.enable: false
#
#   If using dead_letter_queue.enable: true, the maximum size of dead letter queue for this pipeline. Entries
#   will be dropped if they would increase the size of the dead letter queue beyond this setting.
#   Default is 1024mb
#   dead_letter_queue.max_bytes: 1024mb
#
#   If using dead_letter_queue.enable: true, the directory path where the data files will be stored.
#   Default is path.data/dead_letter_queue
#
#   path.dead_letter_queue:

This file is used to run multiple pipelines within a single Logstash process; each pipeline is declared in pipelines.yml in the following format:

- pipeline.id: my-pipeline_1
  path.config: "/etc/path/to/p1.config"
  pipeline.workers: 3
- pipeline.id: my-other-pipeline
  path.config: "/etc/different/path/p2.cfg"
  queue.type: persisted

Each pipeline can have its own settings; anything not set explicitly falls back to the defaults in logstash.yml.

When Logstash is started without arguments, it reads pipelines.yml; when started with -e or -f, pipelines.yml is ignored.

More advanced startup options can be explored in startup.options.
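
A minimal sketch of the two startup modes (install path and config file name are illustrative):

# no arguments: Logstash loads config/pipelines.yml and starts every pipeline defined there
bin/logstash

# -f: run a single pipeline from the given file, ignoring pipelines.yml
bin/logstash -f /etc/logstash/conf.d/migrate.conf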

4.4 logstash.conf

input {
   elasticsearch {
       hosts => ["es-cn-v0h180kgv000f92cs.public.elasticsearch.aliyuncs.com:9200"]
       index => "some_index"
       user => "xxxxx"
       password => "xxxxxx"
       query => '{ "query": {"match_all" : {} }, "sort": ["_doc"] }'
       size => 10000
       scroll => "100m"
       docinfo => true
   }
}

output {
   elasticsearch {
       hosts => ["es-cn-st21r3mgu00053ob9.elasticsearch.aliyuncs.com:9200"]
       index => "%{[@metadata][_index]}"
       document_type => "%{[@metadata][_type]}"
       document_id => "%{[@metadata][_id]}"
       user => "xxxxxx"
       password => "xxxxxxx"
     }
}

This file defines the input → filter → output stages of the task and their parameters; for details, consult the documentation of the specific plugins you use:

input : https://www.elastic.co/guide/en/logstash/7.x/input-plugins.html

filter : https://www.elastic.co/guide/en/logstash/7.x/filter-plugins.html

output: https://www.elastic.co/guide/en/logstash/7.x/output-plugins.html
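
The migration config above uses only input and output. If documents need to be reshaped in flight, a filter block can be added between them; a minimal sketch (the field name obsolete_field is hypothetical):

filter {
   mutate {
       remove_field => ["obsolete_field"]
   }
}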

4.5 path.data

When PQ is enabled in logstash.yml or in a pipeline definition, Logstash buffers the queue contents under the data directory, so that directory needs to be specified; this is usually done on the command line at startup, e.g. --path.data=/root/log_stash_dir/logstash-7.1.0/path_data/note

Note: different tasks should use different data directories; otherwise they will interfere with one another.
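
For example, two migration jobs run from separate shells, each with its own config file and data directory (the .conf file names are illustrative):

bin/logstash -f note_comment_main.conf --path.data=/root/log_stash_dir/logstash-7.1.0/path_data/note
bin/logstash -f user_index.conf --path.data=/root/log_stash_dir/logstash-7.1.0/path_data/user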
