How Filebeat Works
Filebeat has two major parts:
- inputs: responsible for finding files (similar to the find command) and managing harvesters
- harvesters: each harvester maps one-to-one to a file, reading it line by line and sending the lines to the output (similar to tail -f)
![0](https://img2024.cnblogs.com/blog/2007173/202410/2007173-20241009142711434-759779553.png)
Filebeat Core Configuration in Detail
Basic configuration
Multiple input blocks can be configured.
paths can list multiple files; both the directory path and the file name support glob patterns.
```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/system.log
    - /var/log/wifi.log
- type: log
  paths:
    - "/var/log/apache2/*"
  fields:
    apache: true
  fields_under_root: true
```
ignore_older and scan_frequency
Problem scenarios
Scenario 1: a path may contain many historical files, e.g. when logs are rotated daily; we usually do not need the old ones.
Scenario 2: how do we control the scan frequency? With complex glob patterns, scanning the files frequently is itself a significant cost.
Solutions
Scenario 1 is handled by the ignore_older parameter, which tells Filebeat to skip files older than the given age. For example, with 1h, logs in files whose modification time is more than an hour old are not collected by the input module until new log lines are written to them.
Scenario 2 is handled by the scan_frequency parameter, which controls how often Filebeat scans for new files. With 10s (the default), a newly created file is discovered within 10s, and a file previously skipped by ignore_older that receives a new log line takes up to 10s to be noticed again.
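A minimal input sketch combining the two options (the path is illustrative):

```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  ignore_older: 1h         # skip files whose last modification is over an hour old
  scan_frequency: 10s      # look for new or re-activated files every 10s (the default)
```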
close_* and clean_*
Does a harvester hold on to a file forever once it has picked it up? And what happens after the file is renamed or deleted?
The close_* option family
The close_* configuration options are used to close the harvester after a certain criteria or time. Closing the harvester means closing the file handler.
close_inactive
How long before an inactive file is closed: for example, if a log file has produced no new content for 10 minutes, its file handle is released.
The clock here is not the file's modification time but Filebeat's own bookkeeping: the time elapsed between the last successful read and the current read attempt.
The official advice is to set this about an order of magnitude above the interval at which the file receives data (the default is 5m); for example, if a log line arrives every second, 1m is a reasonable value.
close_renamed: whether to close a file after it has been renamed.
close_removed: enabled by default; once a file is deleted, its handle is closed.
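A minimal sketch of the close_* family for a file that receives roughly one line per second (values are illustrative, following the sizing advice above):

```yaml
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  close_inactive: 1m       # ~an order of magnitude above the one-line-per-second rate
  close_renamed: true      # release the handle once the file is renamed/rotated
  close_removed: true      # default behaviour: release the handle once the file is deleted
```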
The clean_* option family
The clean_* options are used to clean up the state entries in the registry file.
Filebeat keeps per-file state internally, saved in data/registry/filebeat/data.json. Left uncleaned, this file keeps growing and hurts efficiency. A sample entry:
{ "source": "/xxx/logs/logFile.2021-09-20.log", "offset": 661620031, "timestamp": "2021-09-21T00:04:23.050179808+08:00", "ttl": 10800000000000, "type": "log", "meta": null, "FileStateOS": { "inode": 184559118, "device": 2056 } }
clean_inactive
How long until a file's registry entry is removed. The default is 0, which disables all clean_* functionality.
An entry may only be cleaned once the file is certain to be inactive, so this value must be greater than ignore_older + scan_frequency; otherwise the file could be rediscovered after its state was removed and would be read again from the beginning, producing duplicate events.
If this constraint is violated, Filebeat exits at startup with:
Exiting: Error in initing input: clean_inactive must be > ignore_older + scan_frequency to make sure only files which are not monitored anymore are removed accessing 'filebeat.inputs.0' (source:'filebeat.yml')
clean_removed
Whether to remove a file's registry entry once the file has been deleted from disk; enabled by default.
This should be kept consistent with close_removed.
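A minimal sketch of the sizing constraint (values are illustrative): with ignore_older: 60m and scan_frequency: 10s, clean_inactive must exceed 60m + 10s, so 70m leaves a comfortable margin.

```yaml
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  ignore_older: 60m
  scan_frequency: 10s
  clean_inactive: 70m      # must be > ignore_older + scan_frequency, or Filebeat exits at startup
  close_removed: true
  clean_removed: true      # kept consistent with close_removed
```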
Recommended configuration
```yaml
tail_files: false
scan_frequency: 10s
ignore_older: 60m
close_inactive: 10m
close_renamed: true
close_removed: true
clean_inactive: 70m
clean_removed: true
```
Resource limits
When log volume is very high and the machine is already under heavy load, Filebeat adds to the burden, so limiting its resources in production is recommended:
max_procs: the maximum number of CPU cores to use. By default all cores are used; capping it at 1-4 cores depending on the machine has little impact on shipping throughput.
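A sketch of the corresponding top-level setting in filebeat.yml (the core count is illustrative):

```yaml
# Cap Filebeat at 2 CPU cores; by default it may use all of them
max_procs: 2
```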
Automatic configuration reloading
```yaml
filebeat.config.inputs:
  enabled: true
  path: configs/*.yml
  reload.enabled: true
  reload.period: 10s
```
The individual input configuration files are placed under the configs directory, for example:
```yaml
- type: log
  paths:
    - /var/log/messages
    - /var/log/*.log
```
Installing Filebeat
Filebeat normally runs on the same machine as the log source, so here we install it on the other Linux machine that runs the business application and produces its logs.
When collecting a business system's logs, make sure Filebeat has read permission on the log files.
1. Download and extract
```bash
cd /home/zyplanke/elk
tar -xvf filebeat-6.8.3-linux-x86_64.tar.gz
```
2. Configure log file collection
See "Filebeat Core Configuration in Detail" above; the complete filebeat.yml used here is listed at the end of this section.
3. Start
1. Write the management script run_filebeat.sh (full listing below).
Because Filebeat's own console output is huge, the script discards it (nohup ... >/dev/null 2>&1).
Make it executable, then start it:

```bash
chmod +x run_filebeat.sh
./run_filebeat.sh start
```
```bash
##################################################################################
# desc: Filebeat process management script
##################################################################################
CURR_PWD=`pwd -P`

Usage() {
    echo "*******************************************************"
    echo " Usage: "
    echo "   `basename $0`       : print this usage info "
    echo "   `basename $0` show  : show current running process "
    echo "   `basename $0` start : start process"
    echo "   `basename $0` stop  : stop process"
    echo "   `basename $0` kill  : force kill process"
    echo ""
    exit 0
}

# Check the number of arguments; print the usage if it does not match
if [ $# -ne 1 ]; then
    Usage
fi

case $1 in
"show")
    # Show the currently running process
    echo ""
    echo " Currently, running processes as follows....."
    echo "*******************************************************"
    #ps -f | head -1
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r ps -f -p | grep -v "grep"
    echo "*******************************************************"
    echo ""
    ;;
"start")
    # Discard Filebeat's own console output, which can be huge
    nohup ${CURR_PWD}/filebeat -e -c filebeat.yml >/dev/null 2>&1 &
    echo " starting... "
    sleep 1
    echo " Please check the result via logs files or nohup.out!"
    echo ""
    ;;
"stop")
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill > /dev/null 2>&1
    echo " stopping... "
    sleep 1
    echo " Please check the result by yourself!"
    echo ""
    ;;
"kill")
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill > /dev/null 2>&1
    sleep 5
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill -9 > /dev/null 2>&1
    ;;
*)
    echo " input error!!! "
    Usage
    ;;
esac
exit 0
```
filebeat.yml
```yaml
###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

#=========================== Filebeat inputs =============================

# List of inputs to fetch data.
filebeat.inputs:
# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# Type of the files. Based on this the way the file is read is decided.
# The different types cannot be mixed in one input
#
# Possible options are:
# * log: Reads every line of the log file (default)
# * stdin: Reads the standard in

#------------------------------ Log input --------------------------------
- type: log

  # Change to true to enable this input configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  # To fetch all ".log" files from a specific level of subdirectories
  # /var/log/*/*.log can be used.
  # For each file found under this path, a harvester is started.
  # Make sure not file is defined twice as this can lead to unexpected behaviour.
  paths:
    - /app/tomcat/apache-tomcat-8.5.38/logs/*
    - /app/tomcat/apache-tomcat-8.5.38/bin/log/*.log
    #- c:\programdata\elasticsearch\logs\*

  # Configure the file encoding for reading files with international characters
  # following the W3C recommendation for HTML5 (http://www.w3.org/TR/encoding).
  # Some sample encodings:
  #   plain, utf-8, utf-16be-bom, utf-16be, utf-16le, big5, gb18030, gbk,
  #   hz-gb-2312, euc-kr, euc-jp, iso-2022-jp, shift-jis, ...
  #encoding: plain

  # Exclude lines. A list of regular expressions to match. It drops the lines that are
  # matching any regular expression from the list. The include_lines is called before
  # exclude_lines. By default, no lines are dropped.
  #exclude_lines: ['^DBG']

  # Include lines. A list of regular expressions to match. It exports the lines that are
  # matching any regular expression from the list. The include_lines is called before
  # exclude_lines. By default, all the lines are exported.
  #include_lines: ['^ERR', '^WARN']

  # Exclude files. A list of regular expressions to match. Filebeat drops the files that
  # are matching any regular expression from the list. By default, no files are dropped.
  #exclude_files: ['.gz$']

  # Optional additional fields. These fields can be freely picked
  # to add additional information to the crawled log files for filtering
  fields:
    #level: debug
    #review: 1
    logcategory: vhlloguat

  # Set to true to store the additional fields as top level fields instead
  # of under the "fields" sub-dictionary. In case of name conflicts with the
  # fields added by Filebeat itself, the custom fields overwrite the default
  # fields.
  #fields_under_root: false

  # Ignore files which were modified more then the defined timespan in the past.
  # ignore_older is disabled by default, so no files are ignored by setting it to 0.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  ignore_older: 60m

  # How often the input checks for new files in the paths that are specified
  # for harvesting. Specify 1s to scan the directory as frequently as possible
  # without causing Filebeat to scan too frequently. Default: 10s.
  scan_frequency: 10s

  # Defines the buffer size every harvester uses when fetching the file
  harvester_buffer_size: 16384

  # Maximum number of bytes a single log event can have
  # All bytes after max_bytes are discarded and not sent. The default is 10MB.
  # This is especially useful for multiline log messages which can get large.
  max_bytes: 10485760

  ### Recursive glob configuration

  # Expand "**" patterns into regular glob patterns.
  #recursive_glob.enabled: true

  ### JSON configuration

  # Decode JSON options. Enable this if your logs are structured in JSON.
  # JSON key on which to apply the line filtering and multiline settings. This key
  # must be top level and its value must be string, otherwise it is ignored. If
  # no text key is defined, the line filtering and multiline features cannot be used.
  #json.message_key:

  # By default, the decoded JSON is placed under a "json" key in the output document.
  # If you enable this setting, the keys are copied top level in the output document.
  #json.keys_under_root: false

  # If keys_under_root and this setting are enabled, then the values from the decoded
  # JSON object overwrite the fields that Filebeat normally adds (type, source, offset, etc.)
  # in case of conflicts.
  #json.overwrite_keys: false

  # If this setting is enabled, Filebeat adds a "error.message" and "error.key: json" key in case of JSON
  # unmarshaling errors or when a text key is defined in the configuration but cannot
  # be used.
  #json.add_error_key: false

  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java Stack Traces or C-Line Continuation

  # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
  multiline.pattern: ^\[

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  #multiline.negate: false

  # Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
  # that was (not) matched before or after or as long as a pattern is not matched based on negate.
  # Note: After is the equivalent to previous and before is the equivalent to to next in Logstash
  multiline.match: after

  # The maximum number of lines that are combined to one event.
  # In case there are more the max_lines the additional lines are discarded.
  # Default is 500
  multiline.max_lines: 500

  # After the defined timeout, an multiline event is sent even if no new pattern was found to start a new event
  # Default is 5s.
  multiline.timeout: 5s

  # Setting tail_files to true means filebeat starts reading new files at the end
  # instead of the beginning. If this is used in combination with log rotation
  # this can mean that the first entries of a new file are skipped.
  tail_files: false

  # The Ingest Node pipeline ID associated with this input. If this is set, it
  # overwrites the pipeline option from the Elasticsearch output.
  #pipeline:

  # If symlinks is enabled, symlinks are opened and harvested. The harvester is opening the
  # original for harvesting but will report the symlink name as source.
  symlinks: true

  # Backoff values define how aggressively filebeat crawls new files for updates
  # The default values can be used in most cases. Backoff defines how long it is waited
  # to check a file again after EOF is reached. Default is 1s which means the file
  # is checked every second if new lines were added. This leads to a near real time crawling.
  # Every time a new line appears, backoff is reset to the initial value.
  backoff: 1s

  # Max backoff defines what the maximum backoff time is. After having backed off multiple times
  # from checking the files, the waiting time will never exceed max_backoff independent of the
  # backoff factor. Having it set to 10s means in the worst case a new line can be added to a log
  # file after having backed off multiple times, it takes a maximum of 10s to read the new line
  max_backoff: 10s

  # The backoff factor defines how fast the algorithm backs off. The bigger the backoff factor,
  # the faster the max_backoff value is reached. If this value is set to 1, no backoff will happen.
  # The backoff value will be multiplied each time with the backoff_factor until max_backoff is reached
  backoff_factor: 2

  # Max number of harvesters that are started in parallel.
  # Default is 0 which means unlimited
  #harvester_limit: 0

  ### Harvester closing options

  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was, not the file ModTime.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  close_inactive: 10m

  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: true

  # When enabling this option, a file handler is closed immediately in case a file can't be found
  # any more. In case the file shows up again later, harvesting will continue at the last known position
  # after scan_frequency.
  close_removed: true

  # Closes the file handler as soon as the harvesters reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: true

  ### State options

  # Files for the modification data is older then clean_inactive the state from the registry is removed
  # By default this is disabled.
  clean_inactive: 70m

  # Removes the state for file which cannot be found on disk anymore immediately
  clean_removed: true

  # Close timeout closes the harvester after the predefined time.
  # This is independent if the harvester did finish reading the file or not.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 1

  # Defines if inputs is enabled
  enabled: true

#============================= Filebeat modules ===============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: true

  # Period on which files under path should be checked for changes
  reload.period: 10s

#==================== Elasticsearch template setting ==========================

setup.template.settings:
  index.number_of_shards: 3
  #index.codec: best_compression
  #_source.enabled: false

#================================ General =====================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

#============================== Dashboards =====================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here, or by using the `-setup` CLI flag or the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

#============================== Kibana =====================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  #host: "localhost:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

#============================= Elastic Cloud ==================================

# These settings simplify using filebeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

#================================ Outputs =====================================

# Configure what output to use when sending the data collected by the beat.

#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch:
  # Array of hosts to connect to.
  #hosts: ["localhost:9200"]

  # Enabled ilm (beta) to use index lifecycle management instead daily indices.
  #ilm.enabled: false

  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["10.1.110.153:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

#================================ Processors =====================================

# Configure processors to enhance or manipulate events generated by the beat.

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
logging.selectors: ["*"]

#============================== Xpack Monitoring ===============================
# filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#xpack.monitoring.enabled: false

# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well. Any setting that is not set is
# automatically inherited from the Elasticsearch output configuration, so if you
# have the Elasticsearch output configured, you can simply uncomment the
# following line.
#xpack.monitoring.elasticsearch:
```