How Filebeat Works
Filebeat has two major parts:
- inputs: responsible for finding files (similar to the find command) and managing harvesters
- harvesters: each harvester maps one-to-one to a file, reading it line by line and sending the lines to the output (similar to tail -f)
![0](https://img2024.cnblogs.com/blog/2007173/202410/2007173-20241009142711434-759779553.png)
Filebeat Core Configuration in Detail
Basic configuration
Multiple input blocks can be configured.
paths can list multiple files; both the directory path and the file name support glob patterns.
```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/system.log
    - /var/log/wifi.log
- type: log
  paths:
    - "/var/log/apache2/*"
  fields:
    apache: true
  fields_under_root: true
```
ignore_older and scan_frequency
Problem scenarios
Scenario 1: a path may contain many historical files, e.g. when logs are rotated daily; we usually do not need the old ones.
Scenario 2: how do we control the scan frequency? With complex glob patterns, scanning the files frequently is itself a significant cost.
Solutions
Scenario 1 is handled by the ignore_older parameter, which tells Filebeat to skip files older than the given age. For example, with 1h, logs in files whose modification time is more than an hour old are not collected by the input module until new log lines are written to them.
Scenario 2 is handled by the scan_frequency parameter, which controls how often Filebeat scans for new files. With 10s (the default), a newly created file is discovered within 10s, and a file previously skipped by ignore_older that receives a new log line takes up to 10s to be noticed again.
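A minimal input sketch combining the two options (the path is illustrative):

```yaml
filebeat.inputs:
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  ignore_older: 1h         # skip files whose last modification is over an hour old
  scan_frequency: 10s      # look for new or re-activated files every 10s (the default)
```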
close_* and clean_*
Does a harvester hold on to a file forever once it has picked it up? And what happens after the file is renamed or deleted?
The close_* option family
The close_* configuration options are used to close the harvester after a certain criteria or time. Closing the harvester means closing the file handler.
close_inactive
How long before an inactive file is closed: for example, if a log file has produced no new content for 10 minutes, its file handle is released.
The clock here is not the file's modification time but Filebeat's own bookkeeping: the time elapsed between the last successful read and the current read attempt.
The official advice is to set this about an order of magnitude above the interval at which the file receives data (the default is 5m); for example, if a log line arrives every second, 1m is a reasonable value.
close_renamed: whether to close a file after it has been renamed.
close_removed: enabled by default; once a file is deleted, its handle is closed.
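A minimal sketch of the close_* family for a file that receives roughly one line per second (values are illustrative, following the sizing advice above):

```yaml
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  close_inactive: 1m       # ~an order of magnitude above the one-line-per-second rate
  close_renamed: true      # release the handle once the file is renamed/rotated
  close_removed: true      # default behaviour: release the handle once the file is deleted
```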
The clean_* option family
The clean_* options are used to clean up the state entries in the registry file.
Filebeat keeps per-file state internally, saved in data/registry/filebeat/data.json. Left uncleaned, this file keeps growing and hurts efficiency. A sample entry:
{ "source": "/xxx/logs/logFile.2021-09-20.log", "offset": 661620031, "timestamp": "2021-09-21T00:04:23.050179808+08:00", "ttl": 10800000000000, "type": "log", "meta": null, "FileStateOS": { "inode": 184559118, "device": 2056 } }
clean_inactive
How long until a file's registry entry is removed. The default is 0, which disables all clean_* functionality.
An entry may only be cleaned once the file is certain to be inactive, so this value must be greater than ignore_older + scan_frequency; otherwise the file could be rediscovered after its state was removed and would be read again from the beginning, producing duplicate events.
If this constraint is violated, Filebeat exits at startup with:
Exiting: Error in initing input: clean_inactive must be > ignore_older + scan_frequency to make sure only files which are not monitored anymore are removed accessing 'filebeat.inputs.0' (source:'filebeat.yml')
clean_removed
Whether to remove a file's registry entry once the file has been deleted from disk; enabled by default.
This should be kept consistent with close_removed.
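A minimal sketch of the sizing constraint (values are illustrative): with ignore_older: 60m and scan_frequency: 10s, clean_inactive must exceed 60m + 10s, so 70m leaves a comfortable margin.

```yaml
- type: log
  paths:
    - /var/log/app/*.log   # hypothetical path
  ignore_older: 60m
  scan_frequency: 10s
  clean_inactive: 70m      # must be > ignore_older + scan_frequency, or Filebeat exits at startup
  close_removed: true
  clean_removed: true      # kept consistent with close_removed
```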
Recommended configuration
```yaml
tail_files: false
scan_frequency: 10s
ignore_older: 60m
close_inactive: 10m
close_renamed: true
close_removed: true
clean_inactive: 70m
clean_removed: true
```
Resource limits
When log volume is very high and the machine is already under heavy load, Filebeat adds to the burden, so limiting its resources in production is recommended:
max_procs: the maximum number of CPU cores to use. By default all cores are used; capping it at 1-4 cores depending on the machine has little impact on shipping throughput.
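A sketch of the corresponding top-level setting in filebeat.yml (the core count is illustrative):

```yaml
# Cap Filebeat at 2 CPU cores; by default it may use all of them
max_procs: 2
```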
Automatic configuration reloading
```yaml
filebeat.config.inputs:
  enabled: true
  path: configs/*.yml
  reload.enabled: true
  reload.period: 10s
```
The individual input configuration files are placed under the configs directory, for example:
```yaml
- type: log
  paths:
    - /var/log/messages
    - /var/log/*.log
```
Installing Filebeat
Filebeat normally runs on the same machine as the log source, so here we install it on the other Linux machine that runs the business application and produces its logs.
When collecting a business system's logs, make sure Filebeat has read permission on the log files.
1. Download and extract
```bash
cd /home/zyplanke/elk
tar -xvf filebeat-6.8.3-linux-x86_64.tar.gz
```
2. Configure log file collection
See "Filebeat Core Configuration in Detail" above; the complete filebeat.yml used here is listed at the end of this section.
3. Start
1. Write the management script run_filebeat.sh (full listing below).
Because Filebeat's own console output is huge, the script discards it (nohup ... >/dev/null 2>&1).
Make it executable, then start it:

```bash
chmod +x run_filebeat.sh
./run_filebeat.sh start
```
```bash
##################################################################################
# desc: Filebeat process management script
##################################################################################
CURR_PWD=`pwd -P`

Usage() {
    echo "*******************************************************"
    echo " Usage: "
    echo "   `basename $0`       : print this usage info "
    echo "   `basename $0` show  : show current running process "
    echo "   `basename $0` start : start process"
    echo "   `basename $0` stop  : stop process"
    echo "   `basename $0` kill  : force kill process"
    echo ""
    exit 0
}

# Check the number of arguments; print the usage if it does not match
if [ $# -ne 1 ]; then
    Usage
fi

case $1 in
"show")
    # Show the currently running process
    echo ""
    echo " Currently, running processes as follows....."
    echo "*******************************************************"
    #ps -f | head -1
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r ps -f -p | grep -v "grep"
    echo "*******************************************************"
    echo ""
    ;;
"start")
    # Discard Filebeat's own console output, which can be huge
    nohup ${CURR_PWD}/filebeat -e -c filebeat.yml >/dev/null 2>&1 &
    echo " starting... "
    sleep 1
    echo " Please check the result via logs files or nohup.out!"
    echo ""
    ;;
"stop")
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill > /dev/null 2>&1
    echo " stopping... "
    sleep 1
    echo " Please check the result by yourself!"
    echo ""
    ;;
"kill")
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill > /dev/null 2>&1
    sleep 5
    ps -f -u `whoami` | grep -w "filebeat" | grep -v "grep" | awk '{print $2}' | xargs -r pwdx | grep -w "${CURR_PWD}" | awk -F: '{print $1}' | xargs -r kill -9 > /dev/null 2>&1
    ;;
*)
    echo " input error!!! "
    Usage
    ;;
esac
exit 0
```
filebeat.yml
```yaml
###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

#=========================== Filebeat inputs =============================

# List of inputs to fetch data.
filebeat.inputs:
# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# Type of the files. Based on this the way the file is read is decided.
# The different types cannot be mixed in one input
#
# Possible options are:
# * log: Reads every line of the log file (default)
# * stdin: Reads the standard in

#------------------------------ Log input --------------------------------
- type: log

  # Change to true to enable this input configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  # To fetch all ".log" files from a specific level of subdirectories
  # /var/log/*/*.log can be used.
  # For each file found under this path, a harvester is started.
  # Make sure not file is defined twice as this can lead to unexpected behaviour.
  paths:
    - /app/tomcat/apache-tomcat-8.5.38/logs/*
    - /app/tomcat/apache-tomcat-8.5.38/bin/log/*.log
    #- c:\programdata\elasticsearch\logs\*

  # Configure the file encoding for reading files with international characters
  # following the W3C recommendation for HTML5 (http://www.w3.org/TR/encoding).
  # Some sample encodings:
  #   plain, utf-8, utf-16be-bom, utf-16be, utf-16le, big5, gb18030, gbk,
  #   hz-gb-2312, euc-kr, euc-jp, iso-2022-jp, shift-jis, ...
  #encoding: plain

  # Exclude lines. A list of regular expressions to match. It drops the lines that are
  # matching any regular expression from the list. The include_lines is called before
  # exclude_lines. By default, no lines are dropped.
  #exclude_lines: ['^DBG']

  # Include lines. A list of regular expressions to match. It exports the lines that are
  # matching any regular expression from the list. The include_lines is called before
  # exclude_lines. By default, all the lines are exported.
  #include_lines: ['^ERR', '^WARN']

  # Exclude files. A list of regular expressions to match. Filebeat drops the files that
  # are matching any regular expression from the list. By default, no files are dropped.
  #exclude_files: ['.gz$']

  # Optional additional fields. These fields can be freely picked
  # to add additional information to the crawled log files for filtering
  fields:
    #level: debug
    #review: 1
    logcategory: vhlloguat

  # Set to true to store the additional fields as top level fields instead
  # of under the "fields" sub-dictionary. In case of name conflicts with the
  # fields added by Filebeat itself, the custom fields overwrite the default
  # fields.
  #fields_under_root: false

  # Ignore files which were modified more then the defined timespan in the past.
  # ignore_older is disabled by default, so no files are ignored by setting it to 0.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  ignore_older: 60m

  # How often the input checks for new files in the paths that are specified
  # for harvesting. Specify 1s to scan the directory as frequently as possible
  # without causing Filebeat to scan too frequently. Default: 10s.
  scan_frequency: 10s

  # Defines the buffer size every harvester uses when fetching the file
  harvester_buffer_size: 16384

  # Maximum number of bytes a single log event can have
  # All bytes after max_bytes are discarded and not sent. The default is 10MB.
  # This is especially useful for multiline log messages which can get large.
  max_bytes: 10485760

  ### Recursive glob configuration

  # Expand "**" patterns into regular glob patterns.
  #recursive_glob.enabled: true

  ### JSON configuration

  # Decode JSON options. Enable this if your logs are structured in JSON.
  # JSON key on which to apply the line filtering and multiline settings. This key
  # must be top level and its value must be string, otherwise it is ignored. If
  # no text key is defined, the line filtering and multiline features cannot be used.
  #json.message_key:

  # By default, the decoded JSON is placed under a "json" key in the output document.
  # If you enable this setting, the keys are copied top level in the output document.
  #json.keys_under_root: false

  # If keys_under_root and this setting are enabled, then the values from the decoded
  # JSON object overwrite the fields that Filebeat normally adds (type, source, offset, etc.)
  # in case of conflicts.
  #json.overwrite_keys: false

  # If this setting is enabled, Filebeat adds a "error.message" and "error.key: json" key in case of JSON
  # unmarshaling errors or when a text key is defined in the configuration but cannot
  # be used.
  #json.add_error_key: false

  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java Stack Traces or C-Line Continuation

  # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
  multiline.pattern: ^\[

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  #multiline.negate: false

  # Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
  # that was (not) matched before or after or as long as a pattern is not matched based on negate.
  # Note: After is the equivalent to previous and before is the equivalent to to next in Logstash
  multiline.match: after

  # The maximum number of lines that are combined to one event.
  # In case there are more the max_lines the additional lines are discarded.
  # Default is 500
  multiline.max_lines: 500

  # After the defined timeout, an multiline event is sent even if no new pattern was found to start a new event
  # Default is 5s.
  multiline.timeout: 5s

  # Setting tail_files to true means filebeat starts reading new files at the end
  # instead of the beginning. If this is used in combination with log rotation
  # this can mean that the first entries of a new file are skipped.
  tail_files: false

  # The Ingest Node pipeline ID associated with this input. If this is set, it
  # overwrites the pipeline option from the Elasticsearch output.
  #pipeline:

  # If symlinks is enabled, symlinks are opened and harvested. The harvester is opening the
  # original for harvesting but will report the symlink name as source.
  symlinks: true

  # Backoff values define how aggressively filebeat crawls new files for updates
  # The default values can be used in most cases. Backoff defines how long it is waited
  # to check a file again after EOF is reached. Default is 1s which means the file
  # is checked every second if new lines were added. This leads to a near real time crawling.
  # Every time a new line appears, backoff is reset to the initial value.
  backoff: 1s

  # Max backoff defines what the maximum backoff time is. After having backed off multiple times
  # from checking the files, the waiting time will never exceed max_backoff independent of the
  # backoff factor. Having it set to 10s means in the worst case a new line can be added to a log
  # file after having backed off multiple times, it takes a maximum of 10s to read the new line
  max_backoff: 10s

  # The backoff factor defines how fast the algorithm backs off. The bigger the backoff factor,
  # the faster the max_backoff value is reached. If this value is set to 1, no backoff will happen.
  # The backoff value will be multiplied each time with the backoff_factor until max_backoff is reached
  backoff_factor: 2

  # Max number of harvesters that are started in parallel.
  # Default is 0 which means unlimited
  #harvester_limit: 0

  ### Harvester closing options

  # Close inactive closes the file handler after the predefined period.
  # The period starts when the last line of the file was, not the file ModTime.
  # Time strings like 2h (2 hours), 5m (5 minutes) can be used.
  close_inactive: 10m

  # Close renamed closes a file handler when the file is renamed or rotated.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_renamed: true

  # When enabling this option, a file handler is closed immediately in case a file can't be found
  # any more. In case the file shows up again later, harvesting will continue at the last known position
  # after scan_frequency.
  close_removed: true

  # Closes the file handler as soon as the harvesters reaches the end of the file.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_eof: true

  ### State options

  # Files for the modification data is older then clean_inactive the state from the registry is removed
  # By default this is disabled.
  clean_inactive: 70m

  # Removes the state for file which cannot be found on disk anymore immediately
  clean_removed: true

  # Close timeout closes the harvester after the predefined time.
  # This is independent if the harvester did finish reading the file or not.
  # By default this option is disabled.
  # Note: Potential data loss. Make sure to read and understand the docs for this option.
  close_timeout: 1

  # Defines if inputs is enabled
  enabled: true

#============================= Filebeat modules ===============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: true

  # Period on which files under path should be checked for changes
  reload.period: 10s

#==================== Elasticsearch template setting ==========================

setup.template.settings:
  index.number_of_shards: 3
  #index.codec: best_compression
  #_source.enabled: false

#================================ General =====================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

#============================== Dashboards =====================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here, or by using the `-setup` CLI flag or the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

#============================== Kibana =====================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  #host: "localhost:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

#============================= Elastic Cloud ==================================

# These settings simplify using filebeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

#================================ Outputs =====================================

# Configure what output to use when sending the data collected by the beat.

#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch:
  # Array of hosts to connect to.
  #hosts: ["localhost:9200"]

  # Enabled ilm (beta) to use index lifecycle management instead daily indices.
  #ilm.enabled: false

  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["10.1.110.153:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

#================================ Processors =====================================

# Configure processors to enhance or manipulate events generated by the beat.

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
logging.selectors: ["*"]

#============================== Xpack Monitoring ===============================
# filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#xpack.monitoring.enabled: false

# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well. Any setting that is not set is
# automatically inherited from the Elasticsearch output configuration, so if you
# have the Elasticsearch output configured, you can simply uncomment the
# following line.
#xpack.monitoring.elasticsearch:
```