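Python Log Processing (Part 1): Splitting Log Records by the Nginx log_format
Requirement: do not use regular expressions.
Split each log record according to nginx's default log format.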
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';
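Order of the fields in a log record:
visitor IP - visitor user, request time, request line, status code, response size in bytes, referer, browser identifier (User-Agent), forwarded-for identifier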
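The field order above could be written down once as a tuple of names, so that split values can later be paired with them. This is only a minimal sketch: `FIELD_NAMES`, `name_fields` and the key `separator` (for the literal "-" in the format) are names chosen for this illustration, not part of nginx or of the code in this post.

```python
# Illustrative names that mirror the variables in log_format, in order.
# "separator" stands for the literal "-" between the IP and the user.
FIELD_NAMES = (
    "remote_addr", "separator", "remote_user", "time_local", "request",
    "status", "body_bytes_sent", "http_referer", "http_user_agent",
    "http_x_forwarded_for",
)


def name_fields(values):
    """Pair already-split values with field names, in order.

    zip() stops at the shorter input, so a record that lacks
    trailing fields simply yields fewer keys.
    """
    return dict(zip(FIELD_NAMES, values))


record = name_fields(["183.60.212.153", "-", "-",
                      "19/Feb/2013:10:23:29 +0800"])
print(record["time_local"])
```

A dict keeps later analysis readable (`record["status"]` rather than `lst[5]`), at the cost of assuming that every line follows the same format.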
A single log line (this sample ends at the user agent and has no trailing $http_x_forwarded_for value):
183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"
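This post handles a single line only; once single-line parsing works, the same logic can be applied to the analysis of large log files.
Special cases for each field: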
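As a rough sketch of how single-line logic could later be applied to a whole file, the generator below walks over lines lazily and hands each one to a parser. `parse_lines` and its `parse_line` parameter are hypothetical names for this illustration; any function that turns one log line into a list of fields could be passed in.

```python
from typing import Callable, Iterable, Iterator, List


def parse_lines(lines: Iterable[str],
                parse_line: Callable[[str], List[str]]) -> Iterator[List[str]]:
    """Yield one parsed record per non-empty line.

    `lines` can be an open file object, so a large log is never
    loaded into memory all at once.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield parse_line(line)


# In-memory stand-in for a file; str.split stands in for a real parser.
sample = ["1.2.3.4 - -\n", "\n", "5.6.7.8 - -\n"]
for record in parse_lines(sample, str.split):
    print(record)
```

With a real file, the same call would sit inside a `with open(...)` block and receive the file object in place of `sample`.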
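183.60.212.153 # does not start or end with ", [ or ]
[19/Feb/2013:10:23:29 +0800] # starts with [ and ends with ]
"GET /o2o/media.html?menu=3 HTTP/1.1" # starts with " and ends with " (case 1)
"-" # starts with " and ends with ", but contains only a single character
"Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)" # starts with " and ends with " (case 2)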
logline = '''183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"'''

fields = logline.split()  # split on whitespace first
flag = False  # True while inside a [...] or "..." field that spans several words
tmp = ''      # accumulates the words of the current multi-word field
lst = []      # final list of fields

for word in fields:
    if not flag:
        if word.startswith(('[', '"')):
            if word.endswith((']', '"')):
                # the whole field fits in one word, e.g. "-"
                lst.append(word.strip('[]"'))
            else:
                # first word of a longer field, e.g. '[19/Feb/2013:10:23:29'
                tmp = word[1:]
                flag = True
        else:
            lst.append(word)
    elif word.endswith((']', '"')):
        # last word of the field, e.g. '+0800]'
        tmp += ' ' + word[:-1]  # '19/Feb/2013:10:23:29 +0800'
        lst.append(tmp)
        tmp = ''
        flag = False
    else:
        tmp += ' ' + word

print(lst)
Output:
['183.60.212.153', '-', '-', '19/Feb/2013:10:23:29 +0800', 'GET /o2o/media.html?menu=3 HTTP/1.1', '200', '16691', '-', 'Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)']
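One limitation of the loop above: while a multi-word field is open, any word ending in ] or " closes it, so a quoted user agent that happens to contain a word such as `foo]` would be cut short. Below is a minimal sketch of one possible variant that remembers which closing delimiter it is waiting for. `split_fields` and `CLOSERS` are names invented for this illustration, and the sketch still assumes that delimiters are never escaped inside a field.

```python
# Opening delimiter -> the delimiter that must close the field.
CLOSERS = {'[': ']', '"': '"'}


def split_fields(logline):
    """Split one log line into fields without regular expressions."""
    fields = []
    parts = []     # words of the field currently being assembled
    closer = None  # delimiter we are waiting for, or None outside a field
    for word in logline.split():
        if closer is None:
            expected = CLOSERS.get(word[:1])
            if expected is None:
                fields.append(word)            # plain field
            elif len(word) > 1 and word.endswith(expected):
                fields.append(word[1:-1])      # e.g. "-"
            else:
                parts = [word[1:]]             # field continues
                closer = expected
        elif word.endswith(closer):
            parts.append(word[:-1])            # last word of the field
            fields.append(' '.join(parts))
            closer = None
        else:
            parts.append(word)
    return fields


print(split_fields('1.1.1.1 [19/Feb/2013:10:23:29 +0800] "a b] c" 200'))
```

Like the original loop, this collapses runs of whitespace inside a field to single spaces, and a field that is never closed is silently dropped.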