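Preface:
A rough automated operations platform should automate the following three layers:
1. Operating-system-level automation
If you want ten thousand servers to dance together, how can they dance without the operating system as the stage?
1.1: Physical environment:
Automated OS provisioning (PXE/Kickstart/Cobbler)
1.2: Cloud environment
IaaS framework (OpenStack)
2. Application-software-level automation
Once the operating systems are in place and the crowd is on stage, how do you direct everyone in unison (install, configure, manage, update)? Tools that run shell commands in batches are indispensable here, for example:
Puppet, SaltStack
3. Monitoring system
What if one of the components above fails? The two layers above (system and software), along with hardware and the network, need to be monitored in real time, so that when an alert fires the responsible person is notified promptly and can act. This is where the monitoring system comes in:
Zabbix
Only with a monitoring system in place is a basic automated operations framework roughly complete.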
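I. Zabbix overview
What is Zabbix?
Zabbix is a distributed monitoring tool that handles:
Data collection (Zabbix-Agent/SNMP/SSH....)
Data storage (MySQL/Oracle)
Data display (PHP graphing library)
Data analysis
Alerting (Zabbix calls a media)
Alert escalation
Zabbix features:
Distributed monitoring: Agent -----> regional proxy -----> each regional proxy reports to the main server
Automatic discovery of the hosts and devices that need monitoring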

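Active mode:
The agent opens a TCP connection to the server
The agent requests the list of data to check (it actively pulls the item configuration from the server)
The server responds to the agent with a list of items
The agent parses the response
The TCP connection for this session is closed
The agent starts collecting data periodically
Passive mode:
The server opens a TCP connection
The server sends a request whose key is agent.ping
The agent receives the request and responds
The server processes the data it received
The TCP connection is closed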
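The first half of the active-mode exchange can be sketched as two small helpers: one builds the agent's request for its item list, the other reads the server's reply. This is only an illustration. The JSON field names ("request", "host", "response", "data", "key", "delay") follow the agent protocol as commonly documented and should be treated as assumptions to verify against your Zabbix version; the host name web-01 and the sample reply are made up.

```python
import json


def build_active_checks_request(hostname):
    """JSON body an agent sends to ask the server which items to collect."""
    return json.dumps({"request": "active checks", "host": hostname})


def parse_active_checks_response(body):
    """Return (key, delay) pairs from the server's item list."""
    reply = json.loads(body)
    if reply.get("response") != "success":
        raise ValueError(reply.get("info", "server rejected the request"))
    return [(item["key"], item["delay"]) for item in reply.get("data", [])]


# Hypothetical exchange for a host named web-01
request = build_active_checks_request("web-01")
sample_reply = '{"response": "success", "data": [{"key": "agent.ping", "delay": 60}]}'
items = parse_active_checks_response(sample_reply)
print(request)
print(items)
```

In passive mode the roles are reversed: the server names one key per connection and the agent only answers.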


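P.S.:
Zabbix has three core components: zabbix-server, zabbix-proxy and zabbix-agent. Each runs as an independent service with its own configuration file and log file.
Zabbix is configured, and its results are viewed, through the Zabbix web GUI and the API.
c. Zabbix workflow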
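Common Zabbix terms
Host: a device to be monitored (server, router, printer, switch)
Host group: a logical unit for managing hosts (put N hosts into one host group so that templates can be applied to them in bulk)
Item: a specific metric for which data is collected from a monitored host
Key: every data request between Zabbix-server and Zabbix-agent is made through a key.
One agent may carry many items. For example, item 0 monitors the agent's NIC rate over 10 minutes and item 1 monitors it over 20 minutes, and both report data to the Zabbix server. How are they told apart? The key identifies each one uniquely.
A key accepts parameters, so a single key can collect several kinds of information.
You can also define custom keys in the zabbix-agent configuration file that call your own scripts, and report to zabbix-server through this kind of custom monitoring.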
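To make the "key with parameters" idea concrete, here is a minimal sketch that splits a key such as net.if.in[eth0,bytes] into its name and parameter list. The function name parse_item_key is hypothetical, and the parser is deliberately simplified: it does not handle quoted parameters or nested arrays, which real Zabbix keys allow.

```python
def parse_item_key(key):
    """Split 'name[p1,p2]' into ('name', ['p1', 'p2']).

    Simplified: quoted parameters and nested arrays are not supported.
    """
    if "[" not in key:
        return key, []
    if not key.endswith("]"):
        raise ValueError("unbalanced brackets in key: %r" % key)
    name, _, raw = key[:-1].partition("[")
    return name, [param.strip() for param in raw.split(",")]


print(parse_item_key("net.if.in[eth0,bytes]"))
print(parse_item_key("agent.ping"))
```

The same key name with different parameters (eth0 versus eth1, bytes versus packets) yields different items, which is why the full key, parameters included, is what identifies an item on a host.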
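Graph: displays the data collected by an item as a chart, showing its trend (a custom graph can also combine several items)
Screen: a single graph is often not enough. For example, inbound and outbound NIC traffic are monitored by two separate items, but they can be shown together on one screen
Application: several items can be grouped into one category, called an application
Trigger: created on an item to judge whether the collected data is within a reasonable range; it is an expression that defines the threshold at which an alert fires
Event: when a trigger's threshold is crossed, the trigger's state changes from OK to PROBLEM; when it recovers, the state changes from PROBLEM back to OK. Each state change generates an event.
Triggers are not the only source of events. Discovery can also generate an event, for example when a host is discovered.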
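The relationship between triggers and events can be sketched as a tiny state machine: an event is produced only when the trigger's state changes, not on every sample. This is a conceptual model, not Zabbix's implementation; the class name, the event fields and the CPU-load threshold are all invented for illustration.

```python
OK, PROBLEM = "OK", "PROBLEM"


class Trigger:
    """Toy trigger: PROBLEM while the latest value exceeds the threshold."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold
        self.state = OK

    def evaluate(self, value):
        """Return an event dict if the state changed, otherwise None."""
        new_state = PROBLEM if value > self.threshold else OK
        if new_state == self.state:
            return None
        self.state = new_state
        return {"source": "trigger", "trigger": self.name,
                "state": new_state, "value": value}


trigger = Trigger("CPU load too high", threshold=5.0)
events = [e for e in (trigger.evaluate(v) for v in [1.2, 6.3, 7.0, 2.1]) if e]
for event in events:
    print(event["state"], event["value"])
```

Four samples produce only two events: one when the value first crosses the threshold and one when it recovers.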
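Action: an action is triggered by an event that matches its conditions -----> operation (remote command / send message) -----> via a media -----> the operations staff are notified.
An action decides, based on the events we care about, whether anything should be done.
An action comes down to two kinds of behavior:
Notification: send information about the event to users through the chosen media
Operation: an action does not always have to notify someone; it can also run a remediation command
Condition: decides whether the notification/operation is carried out (if some hosts are already known to be down, conditions can keep them from raising alerts, as with Zabbix's maintenance feature)
In short: the trigger state changes -----> the configured conditions are evaluated -----> the operation is executed or not -----> media -----> alert
Media type: the channels Zabbix currently supports for alert notifications (external script/SMS/E-mail/Jabber)
Media: media types and media have a one-to-many relationship; one media type contains N media (for example, register N WeChat official accounts and use different scripts to push to different departments)
Zabbix does not deliver alerts on its own: Zabbix triggers an action -----> the action calls the configured media -----> the media runs a script (which calls the e-mail/WeChat/automated phone call API) -----> the responsible person is notified!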
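The chain "event, then conditions, then operations, then media" can be modelled in a few lines. The sketch below is a conceptual illustration only: the Action class, the maintenance list, the host names and the recipient are all hypothetical, and real Zabbix actions are configured in the web GUI or through the API rather than written as code.

```python
class Action:
    """Toy action: run every operation if all conditions accept the event."""

    def __init__(self, name, conditions, operations):
        self.name = name
        self.conditions = conditions    # callables: event -> bool
        self.operations = operations    # callables: event -> str

    def handle(self, event):
        if not all(condition(event) for condition in self.conditions):
            return []
        return [operation(event) for operation in self.operations]


IN_MAINTENANCE = {"db-02"}  # hypothetical hosts under maintenance


def is_problem(event):
    return event["state"] == "PROBLEM"


def not_in_maintenance(event):
    return event["host"] not in IN_MAINTENANCE


def make_media(media_type, recipient):
    """Return an operation that 'sends' the event through one media."""
    def send(event):
        return "%s -> %s: %s on %s" % (
            media_type, recipient, event["trigger"], event["host"])
    return send


action = Action("notify ops",
                conditions=[is_problem, not_in_maintenance],
                operations=[make_media("wechat-script", "ops-team")])

sent = action.handle({"host": "web-01", "state": "PROBLEM", "trigger": "CPU load too high"})
muted = action.handle({"host": "db-02", "state": "PROBLEM", "trigger": "CPU load too high"})
print(sent)
print(muted)
```

Keeping conditions and operations as separate lists mirrors the way the text separates "should we act?" from "how do we act?".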




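2. How Zabbix collects data
Zabbix components exchange data in the following format:
<HEADER> - "ZBXD\x01" (5 bytes)
<DATALEN> - data length (8 bytes). 1 will be formatted as 01/00/00/00/00/00/00/00 (eight bytes in HEX, 64 bit number)
<DATA>
<DATA>: in JSON format; its content differs between active checks and passive checks
To avoid exhausting memory, Zabbix accepts at most 128 MB of data per connection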
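The framing described above (5-byte header, 8-byte little-endian length, then the data) can be sketched with the standard struct module. The helper names pack_message and unpack_message are hypothetical; only the byte layout comes from the text.

```python
import json
import struct

HEADER = b"ZBXD\x01"  # fixed 5-byte header


def pack_message(payload):
    """Frame a JSON-serialisable payload as <HEADER><DATALEN><DATA>."""
    data = json.dumps(payload).encode("utf-8")
    return HEADER + struct.pack("<Q", len(data)) + data


def unpack_message(packet):
    """Validate the header and length, then decode the JSON body."""
    if packet[:5] != HEADER:
        raise ValueError("not a Zabbix packet")
    (length,) = struct.unpack("<Q", packet[5:13])
    data = packet[13:13 + length]
    if len(data) != length:
        raise ValueError("truncated packet")
    return json.loads(data.decode("utf-8"))


packet = pack_message({"request": "sender data", "data": []})
print(packet[:13])
print(unpack_message(packet))
```

The "<Q" format is an unsigned 64-bit little-endian integer, which is why a length of 1 comes out as 01 followed by seven zero bytes. The length must count the encoded bytes, not the characters, or non-ASCII values will corrupt the frame.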

#!/usr/bin/python
# -*- coding: utf-8 -*-
# script simulate zabbix_sender
__author__ = 'pdd'
__date__ = '2016/11/28'

import json
import time
import struct
import socket
import argparse

parser = argparse.ArgumentParser(description='script simulate zabbix_sender')
parser.add_argument('-z', '--server', dest='server', action='store', help='Zabbix server ip')
parser.add_argument('-p', '--port', dest='port', action='store', help='Zabbix server port', default=10051, type=int)
parser.add_argument('-s', '--host', dest='host', action='store')
parser.add_argument('-k', '--key', dest='key', action='store', help='item key')
parser.add_argument('-o', '--value', dest='value', action='store', help='item value')
args = parser.parse_args()


class Metric(object):
    def __init__(self, host, key, value):
        self.host = host
        self.key = key
        self.value = value

    def __repr__(self):
        result = 'Metric(%r, %r, %r)' % (self.host, self.key, self.value)
        return result


def recv_exact(sock, size):
    # A single recv() may return fewer bytes than requested, so loop until done.
    buf = b''
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise socket.error('connection closed before the full reply arrived')
        buf += chunk
    return buf


def send_to_zabbix():
    m = Metric(args.host, args.key, args.value)
    payload = {
        "request": "sender data",
        "data": [{"host": m.host, "key": m.key, "value": m.value, "clock": int(time.time())}],
    }
    # Encode first: the length field counts bytes, and sockets only accept bytes.
    json_data = json.dumps(payload).encode('utf-8')
    data_len = struct.pack('<Q', len(json_data))
    packet = b'ZBXD\x01' + data_len + json_data
    try:
        zabbix = socket.socket()
        zabbix.connect((args.server, args.port))
        zabbix.sendall(packet)
        resp_hdr = recv_exact(zabbix, 13)  # 5-byte header + 8-byte length
        resp_body_len = struct.unpack('<Q', resp_hdr[5:])[0]
        resp_body = recv_exact(zabbix, resp_body_len)
        zabbix.close()
        resp = json.loads(resp_body.decode('utf-8'))
        print(resp)
    except Exception as e:
        print('Error while sending data to Zabbix: %s' % e)


if __name__ == '__main__':
    send_to_zabbix()

II. Installing and deploying Zabbix
0. Install the Zabbix yum repository: rpm -i http://repo.zabbix.com/zabbix/3.4/rhel/7/x86_64/zabbix-release-3.4-2.el7.noarch.rpm
1. Install the LAMP environment: yum -y install maria* httpd php php-mysql
2. Install zabbix-server and zabbix-agent: yum -y install zabbix-server-mysql zabbix-web-mysql zabbix-agent
3. Create the zabbix database and grant privileges to the zabbix user (enter the root password when prompted)
mysql -uroot -p
mysql> create database zabbix character set utf8 collate utf8_bin;
mysql> grant all privileges on zabbix.* to zabbix@localhost identified by 'password';
mysql> quit;
4. Import the initial schema and data: zcat /usr/share/doc/zabbix-server-mysql*/create.sql.gz | mysql -uzabbix -p zabbix
5. Start the LAMP environment: systemctl restart mariadb.service httpd.service
6. Start zabbix server and agent: systemctl start zabbix-server zabbix-agent
Reference: https://www.zabbix.com/download
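III. The judge monitoring system
Zabbix alerts are the lifeline of the operations staff. Our Zabbix alerts are mainly pushed to the operations staff through the WeChat platform. If any link in the alert path (Zabbix configuration, the Zabbix queue, zabbix-wechart---ops) fails, I dare not imagine what happens to my KPI.
So I developed a monitoring system of my own, called judge, with the following functions:
Monitor whether the zabbix-server process and the API are healthy
Monitor whether alerts have actually been pushed to the WeChat interface
When a Zabbix media is abnormal, automatically call the Zabbix API to raise an alert
judge reports its own liveness every day
1. After each call to the WeChat interface, the alert is recorded in a JSON file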

{"touser": "markguo@bestseller.com.cn", "subject": "PROBLEM: 注意:【Zabbix agent on EMS测试应用22.23 is unreachable for 5 minutes】", "message": "【告警主机】:172.16.22.23\r\n【主机地址】:172.16.22.23\r\n【告警时间】:2019.05.28_09:31:00\r\n【告警等级】:Average\r\n\r\n【告警信息】:Zabbix agent on EMS测试应用22.23 is unreachable for 5 minutes\r\n【告警项目】:Agent ping\r\n【告警详情】:agent.ping:Up (1)\r\n【事件ID】:5115222\r\n【告警URL】:\r\n\r\n【当前状态】:PROBLEM"}
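The consistency check that judge performs relies on pulling the 【告警时间】 (alarm time) field out of such a record and comparing it with the same field in the latest alert returned by the Zabbix API. A minimal sketch of that extraction follows; the function names and the shortened sample record are made up, while the regular expression matches the message template shown above.

```python
import json
import re

# Matches the alarm-time line of the alert message template.
ALARM_TIME = re.compile("【告警时间】:(.*?)\r\n")


def alarm_time(message):
    """Return the alarm time found in an alert message, or None."""
    match = ALARM_TIME.search(message)
    return match.group(1) if match else None


def is_consistent(log_record_json, api_message):
    """True if the logged push and the API alert carry the same alarm time."""
    logged = alarm_time(json.loads(log_record_json).get("message", ""))
    return logged is not None and logged == alarm_time(api_message)


sample_message = "【告警主机】:172.16.22.23\r\n【告警时间】:2019.05.28_09:31:00\r\n【告警等级】:Average\r\n"
sample_record = json.dumps({"touser": "ops@example.com", "subject": "PROBLEM",
                            "message": sample_message}, ensure_ascii=False)
print(alarm_time(sample_message))
print(is_consistent(sample_record, sample_message))
```

Returning None instead of indexing the first match avoids an IndexError when a message does not follow the template.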
2. Configure the Zabbix user name and password, the WeChat API corpid and corpsecret, the alert recipient, and the time intervals for alert pushing and so on

[zabbix]
host=10.150.22.211
port=10050
username=admin
password=123123xxx
action_id=13
windows_log_path=a.json
linux_log_path=/zabbix_alert_log/zabbix_to_wchart_log.json

[wechart]
corpid =
corpsecret =
agentid=
to_user=zhanggen@bestseller.com.cn
interval=500

[judge]
collect_alert_interval=120
confirm_different_interval=600
notify_interval_hour=3
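A configuration like this can be read with the standard configparser module. The sketch below parses an abbreviated copy of the file from a string so that it runs without judge.conf on disk; the load_settings helper and the fallback values are assumptions for illustration, not part of judge.

```python
from configparser import ConfigParser

SAMPLE_CONF = """
[zabbix]
host=10.150.22.211
port=10050
action_id=13

[judge]
collect_alert_interval=120
confirm_different_interval=600
notify_interval_hour=3
"""


def load_settings(text):
    """Parse the settings, converting numeric options to int."""
    cp = ConfigParser()
    cp.read_string(text)
    return {
        "host": cp.get("zabbix", "host"),
        "port": cp.getint("zabbix", "port"),
        # Kept as a string: the Zabbix API returns IDs as strings.
        "action_id": cp.get("zabbix", "action_id"),
        "collect_alert_interval": cp.getint("judge", "collect_alert_interval", fallback=120),
        "notify_interval_hour": cp.getint("judge", "notify_interval_hour", fallback=3),
    }


settings = load_settings(SAMPLE_CONF)
print(settings)
```

Reading intervals with getint fails early on a malformed value, instead of surfacing later as a confusing error inside time.sleep.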
3. Criticism welcome

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author: Martin
import socket, requests, json, sys, re, time, datetime, multiprocessing, os
from configparser import ConfigParser

cp = ConfigParser()
cp.read("judge.conf")  # read the configuration file
zabbix_host = cp.get('zabbix', "host")
zabbix_port = cp.getint('zabbix', "port")
zabbix_username = cp.get('zabbix', "username")
zabbix_password = cp.get('zabbix', "password")
zabbix_action_id = cp.get('zabbix', 'action_id')  # the action to monitor
log_path = cp.get('zabbix', 'windows_log_path') if sys.platform == 'win32' else cp.get('zabbix', 'linux_log_path')
confirm_different_interval = cp.getint('judge', 'confirm_different_interval')
wechart_corpid = cp.get('wechart', 'corpid')
wechart_corpsecret = cp.get('wechart', 'corpsecret')
agentid = cp.get('wechart', 'agentid')
to_user = cp.get('wechart', 'to_user')
notify_interval_hour = cp.getint('judge', 'notify_interval_hour')

requests.packages.urllib3.disable_warnings()
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False


class Zabbix_message(object):
    def __init__(self, host, port, username, password, action_id, log_path, confirm_different_interval):
        self.host = host
        self.port = port
        self.user = username  # admin
        self.pwd = password  # 123123xxx
        self.action_id = action_id
        self.zabbix_api = 'http://%s/zabbix/api_jsonrpc.php' % (self.host)
        self.headers = {'content-type': 'application/json-rpc'}
        self.log_file = log_path  # '/zabbix_alert_log/zabbix_to_wchart_log.json' on Linux, "a.json" on Windows
        # how long to wait before re-checking when the log file and the API data disagree
        self.confirm_different_interval = confirm_different_interval
        self.message = {"item": None, "status": None, "log_data": None, "API_data": None,
                        "error": '', "latest_alert_date": '', 'next_alert_date': ''}

    def reture_info(self, item, status, info, API_data=False):
        self.message['item'] = item
        self.message['status'] = status
        if status:
            self.message['log_data'] = info
        else:
            self.message['API_data'] = API_data
            self.message['error'] = info
        return self.message

    def check_process_available(self):  # check that the zabbix-server process is alive
        item_name = 'Zabbix-server'
        try:
            sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sk.connect((self.host, self.port))
            sk.close()
        except Exception:
            error = '%s %s:%s was down.' % (item_name, self.host, self.port)
            r = self.reture_info(item_name, 0, error)
            return r
        data = '%s %s:%s was active.' % (item_name, self.host, self.port)
        r = self.reture_info(item_name, 1, data)
        return r

    def check_zabbix_api(self):
        item_name = 'Zabbix-API'
        data = self.get_token()
        if isinstance(data, (str,)):
            data = '%s was active.' % (item_name)
            r = self.reture_info(item_name, 1, data)
        else:
            error = '%s was down.' % (item_name)
            r = self.reture_info(item_name, 0, error)
        return r

    def check_log_file(self):
        if sys.platform == 'linux':
            # make sure the zabbix user can write to the log file
            os.system("/bin/chown zabbix:zabbix %s" % (self.log_file))
        latest_one_log = self.get_log()
        return latest_one_log

    def check_action(self):
        item = 'Zabbix-action'
        get_actions_data = json.dumps({
            "jsonrpc": "2.0",
            "method": "action.get",
            "params": {
                "output": "extend",
                "selectOperations": "extend",
                "selectFilter": "extend",
                "filter": {"eventsource": 0}
            },
            "auth": self.get_token(),
            "id": 1
        })
        actions = requests.post(url=self.zabbix_api, headers=self.headers, data=get_actions_data).json()['result']
        for action in actions:  # walk through all actions
            action_id = action.get('actionid')
            if action_id == self.action_id and action.get('status') == '0':
                data = self.reture_info(item, 1, action)
                return data  # found the configured action, return immediately
        # not found: the configured action_id is wrong, or the action is disabled
        data = self.reture_info(item, 0, 'Please check the configured action: is it disabled?')
        return data

    def check_alert_log(self):  # check the result of executing the action
        item = 'Consistence'
        latest_one_alert = self.get_latest_one_alert()
        alert_message = latest_one_alert['message']
        latest_one_log = self.get_latest_one_log()
        log_message = latest_one_log
        # the pattern matches the alarm-time line of the alert message template
        alarm_time = re.findall("【告警时间】:(.*?)\r\n", alert_message)[0]
        log_time = re.findall("【告警时间】:(.*?)\r\n", log_message)[0]
        API_data = latest_one_alert
        log_data = 'Consistency check passed!'
        if alarm_time != log_time:  # the push from Zabbix to WeChat can lag behind
            time.sleep(self.confirm_different_interval)  # wait N seconds and check once more before alerting
            second_check_latest_one_log = self.get_latest_one_log()  # read the log file again
            second_check_alert_message = self.get_latest_one_alert()['message']  # fetch the Zabbix alert again
            log_time = re.findall("【告警时间】:(.*?)\r\n", second_check_latest_one_log)[0]
            alarm_time = re.findall("【告警时间】:(.*?)\r\n", second_check_alert_message)[0]
            if alarm_time != log_time:
                log_data = 'The latest pushed record and the latest alert do not match!'
        r = self.reture_info(item, alarm_time == log_time, log_data, API_data)
        return r

    def get_token(self):  # get the token used by the Zabbix API
        self.auth_data = json.dumps(
            {"jsonrpc": "2.0", 'method': 'user.login',
             'params': {"user": self.user, "password": self.pwd}, "auth": None, 'id': 0})
        try:
            token = self.auth = requests.post(url=self.zabbix_api, headers=self.headers,
                                              data=self.auth_data).json()['result']
        except Exception:
            # reture_info needs (item, status, info); the original call passed only two arguments
            token = self.reture_info('Zabbix-API', 0,
                                     'Could not get a token; are the Zabbix user name and password correct?')
        return token

    def get_latest_one_alert(self):  # get the latest alert
        get_alert_data = json.dumps({
            "jsonrpc": "2.0",
            "method": "alert.get",
            "params": {
                "mediatypeid": "1",
                "output": "extend",
                "actionids": self.action_id,
            },
            "auth": self.get_token(),
            "id": 1
        })
        response = requests.get(url=self.zabbix_api, data=get_alert_data, headers=self.headers).json()['result'][-1]
        return response

    def get_latest_one_log(self):
        d = self.check_log_file()
        return d['log_data']

    def get_log(self):
        item = 'Log_file'
        try:
            with open(self.log_file, 'r', encoding='utf-8') as f:
                log_data = json.loads(f.read())
            latest_one_log = log_data.get('message')
            if not latest_one_log:
                error = 'Bad file format: does the JSON log %s contain the "message" key?' % (self.log_file)
                r = self.reture_info(item, 0, error)
            else:
                r = self.reture_info(item, 1, latest_one_log)
        except Exception:
            error = 'Check that the path %s is correct and that the file exists' % (self.log_file)
            r = self.reture_info(item, 0, error)
        return r

    def write_log(self, **kwargs):
        if not kwargs:
            alert_info = self.get_latest_one_alert()
            sendto = alert_info['sendto']
            subject = alert_info['subject']
            content = alert_info['message']
            kwargs = {"touser": sendto, "subject": subject, "message": content}
        with open(self.log_file, 'w', encoding='utf-8') as f:
            json_str = json.dumps(kwargs, ensure_ascii=False)
            f.write(json_str)

    def collect(self):
        zabbix_available = self.check_process_available
        api_available = self.check_zabbix_api
        action_available = self.check_action
        log_available = self.check_log_file
        the_same_available = self.check_alert_log  # compare the log file with the latest alert from the API
        item_list = [zabbix_available, api_available, action_available, log_available, the_same_available]
        for item in item_list:
            collect_result = item()
            if not collect_result['status']:
                print(collect_result['error'])
                return collect_result
        print('Every thing is okay...............')


class WeChatClass(object):
    def __init__(self, corpid, corpsecret, agentid, user):
        self.corpid = corpid  # CorpID identifies the enterprise account
        self.corpsecret = corpsecret  # corpsecret is the credential key of the management group
        self.agentid = agentid
        self.gettoken_url = 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=' + self.corpid + '&corpsecret=' + self.corpsecret
        self.user = user

    def gettoken(self):
        try:
            token_info = json.loads(requests.get(url=self.gettoken_url, verify=False).text)
            access_token = token_info.get("access_token")
        except Exception as e:
            print('WeChat plugin: failed to get the WeChat token: %s' % (e))
            sys.exit()
        return access_token

    def send_senddata(self, content):
        if self.user:
            send_values = {
                "touser": self.user,  # member ID in the enterprise account
                "msgtype": "text",  # message type
                "agentid": self.agentid,  # application ID in the enterprise account
                "text": {"content": content},  # message content
                "safe": "0"
            }
        else:
            send_values = {
                "toparty": "1",  # department ID in the enterprise account
                "msgtype": "text",  # message type
                "agentid": self.agentid,  # application ID in the enterprise account
                "text": {"content": content},
                "safe": "0"
            }
        send_url = 'https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token=%s' % (self.gettoken())
        send_data = json.dumps(send_values, ensure_ascii=False).encode('utf-8')
        head = {'Content-Type': 'application/json;charset=utf-8'}
        ret = requests.post(url=send_url, headers=head, data=send_data)
        return ret


zabbix = Zabbix_message(host=zabbix_host, port=zabbix_port, username=zabbix_username,
                        password=zabbix_password, action_id=zabbix_action_id, log_path=log_path,
                        confirm_different_interval=confirm_different_interval)
wechart = WeChatClass(corpid=wechart_corpid, corpsecret=wechart_corpsecret, agentid=agentid, user=to_user)


class Monitor_zabbix(multiprocessing.Process):
    def __init__(self, zabbix_obj, wechart):
        super().__init__()
        self.zabbix = zabbix_obj
        self.wechart = wechart

    def init_zabbix(self):  # initialize, then monitoring can start afresh
        self.zabbix.write_log()  # write the log (the core step)

    def notify(self, send_info, send_to_depart=False):  # notification policy
        if send_to_depart:
            self.wechart.user = ''  # an empty user makes send_senddata address the department
        if isinstance(send_info, (str,)):
            self.wechart.send_senddata(content=send_info)
            return
        next_alert_time = send_info.get('next_alert_date')
        delta = datetime.timedelta(hours=notify_interval_hour)
        if not next_alert_time or next_alert_time < datetime.datetime.now():
            send_info['latest_alert_date'] = datetime.datetime.now()
            send_info['next_alert_date'] = datetime.datetime.now() + delta
            # send_senddata takes no to_user argument, so the recipient is chosen via self.wechart.user above
            self.wechart.send_senddata(content=send_info['error'])

    def check_item(self):
        while True:
            alert_info = self.zabbix.collect()
            if alert_info:
                self.notify(alert_info)
            collect_alert_interval = cp.getint('judge', 'collect_alert_interval')
            time.sleep(collect_alert_interval)  # collect data every N seconds

    def check_self(self):  # report every day that judge itself is alive
        while True:
            curent_time = time.localtime(time.time())
            wday = curent_time.tm_wday  # day of the week
            hour = curent_time.tm_hour  # hour
            minute = curent_time.tm_min  # minute
            if wday in [0, 1, 2, 3, 4] and hour in [9, 17] and minute in [30]:  # acts like a daily cron job
                try:
                    latest_one_alert = self.zabbix.get_latest_one_alert()['message']
                except Exception:
                    latest_one_alert = 'Sorry: API error!'
                daily_notify_message = 'Latest Zabbix alert:\r\n%s\r\n%s' % (latest_one_alert, str(datetime.datetime.now()))
                self.notify(daily_notify_message, send_to_depart=True)
            time.sleep(60)

    def run(self):
        self.check_self()


if __name__ == '__main__':
    OBJ = Monitor_zabbix(zabbix, wechart)
    if len(sys.argv) > 1 and sys.argv[1] == 'init':
        OBJ.init_zabbix()
    else:
        OBJ.start()
        OBJ.check_item()
4. Start judge
[root@monitor ~]# cd /monitor_zabbix/
[root@monitor monitor_zabbix]# ls
judge.conf  judge.py  nohup.out
[root@monitor monitor_zabbix]# nohup python judge.py &
[1] 20128
[root@monitor monitor_zabbix]# nohup: ignoring input and appending output to "nohup.out"
[root@monitor monitor_zabbix]# ps -ef | grep judge
root     20128 19834 13 16:37 pts/3    00:00:00 python judge.py
root     20145 20128 81 16:37 pts/3    00:00:04 python judge.py
root     20270 19834  0 16:37 pts/3    00:00:00 grep judge
[root@monitor monitor_zabbix]#
5. Stop judge
ps -ef | grep judge | grep -v grep | awk '{print "kill -9 "$2}'|sh
6. Restart monitoring (init rewrites the log file from the latest Zabbix alert)
[root@monitor monitor_zabbix]# nohup python judge.py init
Monitoring page response time and restarting Tomcat

# Thread-based version
import time
import datetime
import sys
import paramiko
import requests
import threading

config = {
    "http://192.168.0.74:8081/bklogin.screen?tdsourcetag=s_pcqq_aiomsg": {
        "login_host": "192.168.0.74",
        "login_port": 3444,
        'login_user': 'bestseller',
        'login_password': "Best+2017",
        "JAVA_HOME": '/usr/local/jdk1.6.0_11',  # JAVA_HOME path
        "JRE_HOME": "/usr/local/jdk1.6.0_11/jre",  # JRE_HOME path
        "startup": "/opt/apache-tomcat-7.0.40/bin/startup.sh",  # path of the Tomcat startup script
        "shutdown": "/opt/apache-tomcat-7.0.40/bin/shutdown.sh",  # path of the Tomcat shutdown script
        "monitor_interval": 5,  # probe interval
        "timeout_trager": 10,  # response-time threshold
        "lastest_request_cost": None,  # cost of the latest probe
        "lastest_rebbot_date": None,
        "authorized_ssh": None,
    },
}


class Monitor_EMS(object):
    def __init__(self, config):
        self.config = config

    def ssh(self, url):
        ssh = paramiko.SSHClient()
        ssh.load_system_host_keys()
        # use the public API instead of assigning the private _policy attribute
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(self.config[url]['login_host'], self.config[url]['login_port'],
                        self.config[url]['login_user'], self.config[url]['login_password'])
        except Exception:
            print('Please check whether the user or password for %s is valid' % (self.config[url]['login_host']))
            ssh.close()
            sys.exit()
        return ssh

    def validate_url(self):
        for url in self.config:
            try:
                response = requests.get(url=url, timeout=2)  # wait at most 2 seconds
            except Exception:
                print('Please check whether this URL is reachable, or whether the response took too long:', url)
                sys.exit()
            status_code = response.status_code
            if status_code == 200:
                ssh_socket = self.ssh(url)
                print(url, 'validated')
                self.config[url]['authorized_ssh'] = ssh_socket
            else:
                print(url, 'validation failed, status code', status_code)
                sys.exit()

    def check_schedule(self):  # is the operations staff off duty?
        flag = False
        curent_time = time.localtime(time.time())
        wday = curent_time.tm_wday  # day of the week
        hour = curent_time.tm_hour  # hour
        if wday in [5, 6]:  # Saturday and Sunday
            flag = True
        elif wday in [0, 1, 2, 3, 4] and (hour > 18 or hour < 9):  # off hours ("and" could never be true)
            flag = True
        return flag

    def check_url_timeout(self, url):  # check whether the URL times out or returns an error status
        while True:
            monitor_interval = self.config[url]['monitor_interval']
            if self.check_schedule():  # off hours
                time.sleep(monitor_interval)  # interval between probes
                start_request_time = time.time()
                try:
                    status_code = requests.get(url=url).status_code
                except Exception:
                    print('URL %s is unreachable!' % (url))
                    status_code = 500  # there is no response object here, so set the code directly
                if status_code > 200:
                    print('Please check the URL, status code: %s' % (status_code))
                    self.reboot_ems(url)
                    continue
                end_request_time = time.time()
                cost_seconds = end_request_time - start_request_time
                timeout_trager = self.config[url]['timeout_trager']  # timeout trigger
                if cost_seconds > timeout_trager:
                    self.reboot_ems(url)
                    self.config[url]['lastest_rebbot_date'] = str(datetime.datetime.now())
                self.config[url]['lastest_request_cost'] = cost_seconds
                print(self.config)
            else:
                print('Not off hours')
                time.sleep(monitor_interval)  # avoid a busy loop during working hours

    def reboot_ems(self, url):
        JAVA_HOME = self.config[url]['JAVA_HOME']
        JRE_HOME = self.config[url]['JRE_HOME']
        shutdown_script = self.config[url]["shutdown"]
        startup_script = self.config[url]["startup"]
        # load the Java environment variables
        load_jave_environment_cmd = 'export JAVA_HOME=%s;export JRE_HOME=%s;. ./.bash_profile;' % (JAVA_HOME, JRE_HOME)
        shutdown_script_cmd = '%s /bin/bash %s' % (load_jave_environment_cmd, shutdown_script)
        startup_script_cmd = '%s /bin/bash %s' % (load_jave_environment_cmd, startup_script)
        ssh = self.ssh(url=url)
        # read() blocks until the command exits, so shutdown finishes before startup begins
        shutdown_stdin, shutdown_stdout, shutdown_stderr = ssh.exec_command(shutdown_script_cmd)  # run the shutdown script
        shutdown_result = shutdown_stdout.read() + shutdown_stderr.read()
        start_stdin, start_stdout, start_stderr = ssh.exec_command(startup_script_cmd)  # run the startup script
        start_reault = start_stdout.read() + start_stderr.read()
        reboot_resault = shutdown_result + start_reault
        print('%s restart log: %s' % (self.config[url]['login_host'], reboot_resault.decode('utf-8', 'replace')))
        ssh.close()
        time.sleep(300)  # the page is reachable about 5 minutes after the restart; then resume probing

    def monitor_constantly(self):
        for url in self.config:
            task = threading.Thread(name=url, target=self.check_url_timeout, args=(url,))
            task.start()


if __name__ == '__main__':
    obj = Monitor_EMS(config)
    obj.validate_url()
    obj.monitor_constantly()

# gevent (coroutine) version
from gevent import monkey; monkey.patch_all()  # patch before importing modules that use sockets
import gevent
import time
import datetime
import sys
import paramiko
import requests

config = {
    "http://192.168.0.74:8081/bklogin.screen?tdsourcetag=s_pcqq_aiomsg": {
        "login_host": "192.168.0.74",
        "login_port": 3444,
        'login_user': 'bestseller',
        'login_password': "Best+2017",
        "JAVA_HOME": '/usr/local/jdk1.6.0_11',  # JAVA_HOME path
        "JRE_HOME": "/usr/local/jdk1.6.0_11/jre",  # JRE_HOME path
        "startup": "/opt/apache-tomcat-7.0.40/bin/startup.sh",  # path of the Tomcat startup script
        "shutdown": "/opt/apache-tomcat-7.0.40/bin/shutdown.sh",  # path of the Tomcat shutdown script
        "monitor_interval": 5,  # probe interval
        "timeout_trager": 10,  # response-time threshold
        "lastest_request_cost": None,  # cost of the latest probe
        "lastest_rebbot_date": None,
        "authorized_ssh": None,
    },
}


class Monitor_EMS(object):
    def __init__(self, config):
        self.config = config

    def ssh(self, url):
        ssh = paramiko.SSHClient()
        ssh.load_system_host_keys()
        # use the public API instead of assigning the private _policy attribute
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            ssh.connect(self.config[url]['login_host'], self.config[url]['login_port'],
                        self.config[url]['login_user'], self.config[url]['login_password'])
        except Exception:
            print('Please check whether the user or password for %s is valid' % (self.config[url]['login_host']))
            ssh.close()
            sys.exit()
        return ssh

    def validate_url(self):
        for url in self.config:
            try:
                response = requests.get(url=url, timeout=2)  # wait at most 2 seconds
            except Exception:
                print('Please check whether this URL is reachable, or whether the response took too long:', url)
                sys.exit()
            status_code = response.status_code
            if status_code == 200:
                ssh_socket = self.ssh(url)
                print(url, 'validated')
                self.config[url]['authorized_ssh'] = ssh_socket
            else:
                print(url, 'validation failed, status code', status_code)
                sys.exit()

    def check_schedule(self):  # is the operations staff off duty?
        flag = False
        curent_time = time.localtime(time.time())
        wday = curent_time.tm_wday  # day of the week
        hour = curent_time.tm_hour  # hour
        if wday in [5, 6]:  # Saturday and Sunday
            flag = True
        elif wday in [0, 1, 2, 3, 4] and (hour > 18 or hour < 9):  # off hours ("and" could never be true)
            flag = True
        return flag

    def check_url_timeout(self, url):  # check whether the URL times out or returns an error status
        while True:
            monitor_interval = self.config[url]['monitor_interval']
            if self.check_schedule():  # off hours
                time.sleep(monitor_interval)  # interval between probes
                start_request_time = time.time()
                try:
                    status_code = requests.get(url=url).status_code
                except Exception:
                    print('URL %s is unreachable!' % (url))
                    status_code = 500  # there is no response object here, so set the code directly
                if status_code > 200:
                    print('Please check the URL, status code: %s' % (status_code))
                    self.reboot_ems(url)
                    continue
                end_request_time = time.time()
                cost_seconds = end_request_time - start_request_time
                timeout_trager = self.config[url]['timeout_trager']  # timeout trigger
                if cost_seconds > timeout_trager:
                    self.reboot_ems(url)
                    self.config[url]['lastest_rebbot_date'] = str(datetime.datetime.now())
                self.config[url]['lastest_request_cost'] = cost_seconds
                print(self.config)
            else:
                print('Not off hours')
                time.sleep(monitor_interval)  # yield to other coroutines instead of spinning

    def reboot_ems(self, url):
        JAVA_HOME = self.config[url]['JAVA_HOME']
        JRE_HOME = self.config[url]['JRE_HOME']
        shutdown_script = self.config[url]["shutdown"]
        startup_script = self.config[url]["startup"]
        # load the Java environment variables
        load_jave_environment_cmd = 'export JAVA_HOME=%s;export JRE_HOME=%s;. ./.bash_profile;' % (JAVA_HOME, JRE_HOME)
        shutdown_script_cmd = '%s /bin/bash %s' % (load_jave_environment_cmd, shutdown_script)
        startup_script_cmd = '%s /bin/bash %s' % (load_jave_environment_cmd, startup_script)
        ssh = self.ssh(url=url)
        # read() blocks until the command exits, so shutdown finishes before startup begins
        shutdown_stdin, shutdown_stdout, shutdown_stderr = ssh.exec_command(shutdown_script_cmd)  # run the shutdown script
        shutdown_result = shutdown_stdout.read() + shutdown_stderr.read()
        start_stdin, start_stdout, start_stderr = ssh.exec_command(startup_script_cmd)  # run the startup script
        start_reault = start_stdout.read() + start_stderr.read()
        reboot_resault = shutdown_result + start_reault
        print('%s restart log: %s' % (self.config[url]['login_host'], reboot_resault.decode('utf-8', 'replace')))
        ssh.close()
        time.sleep(300)  # the page is reachable about 5 minutes after the restart; then resume probing

    def monitor_constantly(self):
        Coroutine_list = []
        for url in self.config:
            current_coroutine = gevent.spawn(self.check_url_timeout, url)
            Coroutine_list.append(current_coroutine)
        gevent.joinall(Coroutine_list)


if __name__ == '__main__':
    obj = Monitor_EMS(config)
    obj.validate_url()
    obj.monitor_constantly()
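The off-hours test in check_schedule hinges on how the two hour comparisons are combined: no hour is both greater than 18 and less than 9, so they have to be joined with "or". A small standalone sketch that takes the time as a parameter, which makes the rule easy to test, is shown below. The function name is_off_hours is hypothetical; the boundaries (weekends, plus weekdays after 18 or before 9) are taken from the script.

```python
import datetime


def is_off_hours(now):
    """True on Saturday/Sunday, or on weekdays when hour > 18 or hour < 9."""
    if now.weekday() in (5, 6):  # Saturday, Sunday
        return True
    return now.hour > 18 or now.hour < 9


print(is_off_hours(datetime.datetime(2019, 5, 28, 10, 0)))  # Tuesday morning
print(is_off_hours(datetime.datetime(2019, 5, 28, 20, 0)))  # Tuesday evening
print(is_off_hours(datetime.datetime(2019, 6, 1, 12, 0)))   # Saturday noon
```

Passing the time in, rather than calling time.localtime() inside the function, lets the schedule be checked without waiting for the clock.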