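Monitoring Windows with Prometheus
Use Ansible to operate on Windows machines in bulk and deploy windows_exporter-0.21.0-amd64.exe.
1. Things to check:
1) The Ansible login username must match the member name recorded in the system group; otherwise you will keep getting the error "ntlm: the specified credentials were rejected by the server".
2) Check that the network profile is set to Private.
2. On the managed Windows machine, open PowerShell as administrator and run the following: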
winrm quickconfig
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'
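3. The control node is a Linux machine. Install the following on it (the quotes keep the shell from treating >= as a redirect):
pip install "pywinrm>=0.2.2"
4. The Ansible hosts file can look like this: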
[wind]
my_server ansible_host=192.168.61.153 ansible_ssh_user="aaa\bbb" ansible_ssh_pass="xxx"

[wind:vars]
ansible_ssh_port=5985
ansible_connection="winrm"
ansible_winrm_server_cert_validation=ignore
ansible_winrm_transport=ntlm
5. Test the connection:
[root@www ~]# ansible -i /etc/ansible/myhost wind -m win_ping
my_server | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
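If win_ping fails, it can help to first rule out plain network problems before digging into WinRM credentials. The following is a minimal sketch using only the Python standard library; the helper name port_open is made up for illustration and is not part of Ansible or pywinrm.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the port is reachable; it says nothing about
    whether the WinRM credentials are accepted.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, port_open("192.168.61.153", 5985) checks the WinRM HTTP port from the inventory above, and the same call with port 9182 checks the exporter port once it is running.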
6. Start the bulk operations:
---
- hosts: wind
  gather_facts: no
  tasks:
    - name: create directory
      win_file:
        path: D:\ops_control
        state: directory
    - name: copy pkg
      win_copy:
        src: "{{item}}"
        dest: D:\ops_control\
        force: yes
      with_items:
        - /etc/ansible/pkg/windows_exporter-0.21.0-amd64.exe
        - /etc/ansible/pkg/wind_exporter.exe
        - /etc/ansible/config/config.yml
        - /etc/ansible/pkg/start_win_exporter.bat

# Create the directory
ansible -i /etc/ansible/myhost wind -m win_file -a "path=D:\ops_control state=directory"
# Copy
ansible -i /etc/ansible/myhost wind -m win_copy -a "src=/etc/ansible/pkg/windows_exporter-0.21.0-amd64.exe dest=D:\ops_control\ force=yes"
ansible -i /etc/ansible/myhost wind -m win_copy -a "src=/etc/ansible/config/config.yml dest=D:\ops_control\ force=yes"
ansible -i /etc/ansible/myhost wind -m win_copy -a "src=/etc/ansible/pkg/start_win_exporter.bat dest=D:\ops_control\ force=yes"
# Run
ansible -i /etc/ansible/myhost wind -m win_command -a "chdir=D:\ops_control .\start_win_exporter.bat"
# Restart (kills the running exporter process)
ansible -i /etc/ansible/myhost wind -m raw -a "taskkill /F /IM windows_exporter-0.21.0-amd64.exe /T"
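The start step is honestly not very friendly. If anyone knows a better way to launch a .bat in the background through Ansible, I would appreciate the advice.
1) Example config.yml for windows_exporter, which monitors the state of specific processes. I also uploaded a dashboard (itself based on someone else's work); feel free to grab it if you need it: https://grafana.com/grafana/dashboards/18236-1-windows-exporter-for-prometheus-dashboard-cn-v20201012/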
---
# Note this is not an exhaustive list of all configuration values
collectors:
  enabled: cpu,cs,logical_disk,net,os,process,system,textfile
collector:
  process:
    whitelist: "QQ.*"
#  service:
#    services-where: Name='windows_exporter'
#  scheduled_task:
#    blacklist: /Microsoft/.+
log:
  level: warn
scrape:
  timeout-margin: 0.5
telemetry:
  addr: ":9182"
  path: /metrics
  max-requests: 5
2) start_win_exporter.bat
start D:\ops_control\windows_exporter-0.21.0-amd64.exe --config.file=D:\ops_control\config.yml
Exporter repository: https://github.com/prometheus-community/windows_exporter
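Once the exporter is running, a quick way to confirm that the process whitelist in config.yml does what you expect is to look at the /metrics output for windows_process_threads. Below is a rough sketch of a parser for the Prometheus text format, using only the Python standard library. The sample text, its label names and its values are invented for illustration; real output from windows_exporter will carry different labels and numbers.

```python
import re

# One sample line of the text format: name{labels} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical exporter output, for illustration only.
SAMPLE_TEXT = """\
# HELP windows_process_threads Example help text.
# TYPE windows_process_threads gauge
windows_process_threads{process="QQ",process_id="1234"} 87
windows_process_threads{process="QQProtect",process_id="2345"} 12
windows_cs_logical_processors 8
"""


def find_samples(text, metric):
    """Return a list of (labels_dict, value) for every sample of `metric`."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        out.append((labels, float(m.group("value"))))
    return out


for labels, value in find_samples(SAMPLE_TEXT, "windows_process_threads"):
    print(labels.get("process"), value)
```

To run it against a live exporter, the text could be fetched with urllib.request.urlopen("http://192.168.61.153:9182/metrics").read().decode() and passed to find_samples in place of SAMPLE_TEXT.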
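7. Deploy wind_exporter.exe as a Windows scheduled task to handle health checks and startup.
For this exe, see my other post:
"Simple liveness monitoring of a Win10 app in Golang"
8. Reference for the Ansible part: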
https://cloud.tencent.com/developer/article/1788393
II. While I was at it, I also set up Prometheus alerting
1. Prometheus configuration
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 1m # Evaluate rules every 1 minute, which is also the default.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/soft/alertmanager-0.25.0.linux-amd64/local_rules.yml"
  # - "/opt/soft/alertmanager-0.25.0.linux-amd64/hoststatus_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  #- job_name: "prometheus"
  #  # metrics_path defaults to '/metrics'
  #  # scheme defaults to 'http'.
  #  static_configs:
  #    - targets: ["localhost:9090"]
  #- job_name: 'redis'
  #  static_configs:
  #    - targets: ['192.168.60.174:9121']
  #      labels:
  #        ecs: redis
  #  metrics_path: /scrape
  #  relabel_configs:
  #    - source_labels: [__address__]
  #      target_label: __param_target
  #    - source_labels: [__param_target]
  #      target_label: instance
  - job_name: "windows_prometheus"
    static_configs:
      - targets: ["192.168.61.153:9182","192.168.62.238:9182"]
2. Alertmanager configuration
alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - 'notify.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'default-receiver'
  routes:
    - receiver: 'ops'
      match:
        severity: 'error'
      group_wait: 10s
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'xxx@dingtalk.com'
        html: '{{ template "notify.html" . }}'
        headers: { Subject: "Alert email from prometheus_60.203" }
        send_resolved: true
  - name: 'ops'
    email_configs:
      - to: 'xxx@qq.com'
        html: '{{ template "notify.html" . }}'
        headers: { Subject: "Alert email from prometheus_60.203" }
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
local_rules.yml
groups:
  - name: local_rules
    rules:
      # Alert for any instance that is unreachable for more than 2 minutes.
      - alert: InstanceDown
        expr: up{job="windows_prometheus"} == 0
        for: 2m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes, current value: {{ $value }}"
      # Note: a threshold of 0 matches every monitored process that has any threads; raise it for real use.
      - alert: processThreadsTooMany
        expr: windows_process_threads{} > 0
        for: 10m
        annotations:
          summary: "Threads too many {{ $labels.instance }} {{ $labels.process }}"
          description: "{{ $labels.instance }} {{ $labels.process }} (current value: {{ $value }})"
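It is worth tracing which receiver each of these two rules ends up at. The sketch below is a deliberately simplified Python model of the route tree in alertmanager.yml above: it only covers the single child route (severity equal to error goes to ops) and the top-level fallback, and it ignores grouping, timing and inhibition. The function name pick_receiver is hypothetical and is not an Alertmanager API.

```python
def pick_receiver(labels):
    """Return the receiver name for an alert with the given labels.

    Simplified model: the first child route whose match labels all
    equal the alert's labels wins; otherwise the top-level receiver
    is used.
    """
    child_routes = [
        ({"severity": "error"}, "ops"),
    ]
    for match, receiver in child_routes:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return "default-receiver"


# InstanceDown carries severity=error, so it goes to 'ops'.
print(pick_receiver({"alertname": "InstanceDown", "severity": "error"}))
# processThreadsTooMany sets no severity label, so it falls through.
print(pick_receiver({"alertname": "processThreadsTooMany"}))
```

Under this model, InstanceDown is mailed to the ops receiver, while processThreadsTooMany, which has no severity label, lands on default-receiver.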
notify.tmpl
{{ define "notify.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
@Alert firing <br>
Instance: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br><br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
@Alert resolved <br>
Instance: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved time: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- end }}
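The constant 28800e9 in the template is a Go duration in nanoseconds: 28800 seconds, or 8 hours, which shifts the timestamps to UTC+8. The Python sketch below reproduces the same arithmetic and output format so the offset is easier to verify or adapt to another time zone. It assumes the timestamps handed to the template are in UTC; the function name to_local_string is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Same constant as in the template, expressed in nanoseconds.
OFFSET_NS = 28800e9


def to_local_string(starts_at_utc):
    """Shift a UTC datetime by OFFSET_NS and format it like the template.

    The strftime pattern below corresponds to the Go layout
    "2006-01-02 15:04:05".
    """
    shifted = starts_at_utc + timedelta(seconds=OFFSET_NS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


print(timedelta(seconds=OFFSET_NS / 1e9))
print(to_local_string(datetime(2023, 1, 1, 20, 30, 0, tzinfo=timezone.utc)))
```

For a different offset, multiply the number of hours by 3600e9 and use that value in place of 28800e9 in the template.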
After changing the configuration, reload it:
curl -X POST http://127.0.0.1:9093/-/reload
curl -X POST http://127.0.0.1:9091/-/reload
I had not touched alerting in a long time, so I also leaned on other people's examples:
https://blog.csdn.net/W1124824402/article/details/128408171
https://ost.51cto.com/posts/14124
https://github.com/prometheus/alertmanager/blob/main/template/default.tmpl
https://prometheus.io/docs/alerting/latest/notification_examples/