Chrome历史记录分析

360浏览器采用的是Chrome开源内核。两个浏览器的数据库文件结构一样。

参考文献:www.doc88.com/p-6971366976640.html  杨雪 靳慧云  《Chrome 浏览器历史记录提取与分析》

 

1、历史记录的提取

sqlite是一款基于磁盘的轻量级关系型数据库管理系统,。浏览器中 Chrome、Firefox、Safari等常用浏览器均使用sqlite存储及管理用户的web访问行为。

Chrome浏览器历史记录数据存储在名为History的sqlite数据库文件中。

  • 数据库表urls存储用户访问的URL,
  • visits 存储用户每次web浏览的属性,
  • keyword_search_terms 存储用户检索过的关键字。

URLS表结构

 

visits 表结构

 keyword_search_terms  表结构

 

时间数据的处理

由于Chrome浏览器在sqlite中存储的时间是以1601-01-01 00:00:00 为起始时间点的微妙计数,与Unix时间戳存在时间间隔。因此研究人员输入时间后,需要先把该时间转换为Unix时间戳,再通过时间间隔补偿得到与Chrome浏览器历史记录相匹配的时间戳格式,从而筛选出对应的记录。

此处进行时间转换的具体代码为:

datetime(datetime(visite_time/1000000)+(strftime('%s','1601-01-01')),'unixe-poch','localtime')

举例:

 

 


DOS的时间基准是1980年1月1日,
Unix的时间基准是1970年1月1日上午12 点,
Linux的时间基准是1970年1月1日凌晨0点。
Windows的时间基准是1601年1月1日。

 

时间转换:

1秒(s) = 1000毫秒(ms)

1毫秒(ms) = 1000微秒(us)

1微秒(us) = 1000纳秒(ns)

1 纳秒(ns) = 1000皮秒(ps) 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import time
'''
-- 13181125268490285 微秒
-- select datetime('now')
-- 2018-10-17 03:07:55
-- select datetime('now','localtime')
-- 2018-10-17 11:07:23

-- 13181125268490285 微秒   Chrome时间戳
-- 13181125268.490285 秒    Chrome时间戳
-- 11644473600 间隔

-- 13181125268490-11644473600  拿到的是Unix时间戳

--  先转换为时间元组'''

time_c=13181125268490285/1000000-11644473600
print(time_c)
print(time.strftime("%Y-%m-%d %X",time.gmtime(time_c)))
时间转换验证

 

print("转换为时间元组>>>",time.strptime('2018-01-01 00:00:00','%Y-%m-%d %X'))
print("转换为时间Unix时间戳,单位是:秒>>>",time.mktime(time.strptime('2018-01-01 00:00:00','%Y-%m-%d %X')))

  

2、域名聚合

 取到二级域名。将用户访问的大量URL访问记录转化为用户访问的域名,以便在此基础上进行数据挖掘及关联分析。

3、频繁项集挖掘

假设 所有聚合后的域名为集合D,统计D中的中子项出现的频数。  可以展示出用户最常访问的TOP-N 个网站的域名。

4、关键词关联分析

用户检索的关键词是其WEB浏览行为的重要组成部分。支持按照设定的时间段展示检索关键词,且能够把用户检索的关键词与搜索引擎相关联。


 采用Python+window平台+pycharmIDE

历史记录提取与筛选

SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id

从数据库中提取历史记录信息。提取用户的访问url 、网页标题、及访问时间。

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time
def parse(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print('URL format error!')







if __name__=="__main__":
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    # new_data={}
    # for item in data:
    #     url=parse(item[0])
    #     title=item[1]
    #     visit_time=(item[2]/1000000)-11644473600
    #     if url in new_data:
    #         new_data[url][2] += 1
    #     else:
    #         new_data[url]=[title,visit_time,1]
    new_data = []
    for item in data:
        url = parse(item[0])
        title = item[1]
        # visit_time = (item[2] / 1000000) - 11644473600#这是时间戳 Unix
        visit_time = time.strftime("%Y-%m-%d %X",time.localtime((item[2] / 1000000) - 11644473600))
        new_data.append([url,title, visit_time])
    print(new_data)
    cursor.close()
    conn.close()
View Code

部分整理的结果

['hao.360.cn', '360导航_新一代安全上网导航', '2018-10-17 09:10:11']
['hao.360.cn', '360导航_新一代安全上网导航', '2018-10-17 11:43:05']
['baidu.com', '百度一下,你就知道', '2018-10-17 11:43:09']
['baidu.com', '中国_百度搜索', '2018-10-17 11:43:13']
['baidu.com', '中华人民共和国_百度百科', '2018-10-17 11:43:15']
['baike.baidu.com', '中华人民共和国_百度百科', '2018-10-17 11:43:16']
['hao.360.cn', '360导航_新一代安全上网导航', '2018-10-17 11:49:20']
['hao.360.cn', '360导航_新一代安全上网导航', '2018-10-17 13:16:57']
['liaoxuefeng.com', 'Day 1 - 搭建开发环境 - 廖雪峰的官方网站', '2018-10-17 13:17:09']
['liaoxuefeng.com', '访问数据库 - 廖雪峰的官方网站', '2018-10-17 13:17:15']
['liaoxuefeng.com', '使用SQLite - 廖雪峰的官方网站', '2018-10-17 13:17:17']

Process finished with exit code 0
 1 #!/usr/bin/env python3
 2 #-*- coding:utf-8 -*-
 3 '''
 4 Administrator 
 5 2018/10/17 
 6 '''
 7 import os  # 访问系统的包
 8 import sqlite3  # 链接数据库文件胡包
 9 import time
10 def parse(url):
11     try:
12         parsed_url_components = url.split('//')
13         sublevel_split = parsed_url_components[1].split('/', 1)
14         domain = sublevel_split[0].replace("www.", "")
15         return domain
16     except IndexError:
17         print('URL format error!')
18 def FormatToStamp():
19     flag=True
20     while flag:
21         try:
22             formatTime=input("请输入时间(格式:2018-06-24 11:50:00)").strip()
23             formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
24             return formatTime
25         except Exception as e:
26             print("转换失败,请重新输入。失败原因:%s"%e)
27 
28 
29 
30 if __name__=="__main__":
31     data_path = r"D:\360浏览器\360se6\User Data\Default"
32     files=os.listdir(data_path)
33     history_db = os.path.join(data_path, 'History')
34     conn = sqlite3.connect(history_db)
35     cursor = conn.cursor()
36     SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
37     cursor.execute(SQL)
38     data=cursor.fetchall()
39     # new_data={}
40     # for item in data:
41     #     url=parse(item[0])
42     #     title=item[1]
43     #     visit_time=(item[2]/1000000)-11644473600
44     #     if url in new_data:
45     #         new_data[url][2] += 1
46     #     else:
47     #         new_data[url]=[title,visit_time,1]
48     new_data = []
49     start_time=FormatToStamp()
50     for item in data:
51         url = parse(item[0])
52         title = item[1]
53         # visit_time = (item[2] / 1000000) - 11644473600#这是时间戳 Unix
54         visit_time=(item[2] / 1000000) - 11644473600
55         if visit_time>start_time:
56             visit_time = time.strftime("%Y-%m-%d %X",time.localtime(visit_time))
57 
58             new_data.append([url,title, visit_time])
59     for i in new_data:
60         print(i)
61     cursor.close()
62     conn.close()
支持选择时间间隔查看数据

 

频繁访问网站的数据挖掘

首先对数据进行域名聚合

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time,re
def parse(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print('URL format error!')
def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def FormatToStamp():
    flag=True
    while flag:
        try:
            formatTime=input("请输入时间(格式:2018-06-24 11:50:00)").strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)



if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    # new_data={}
    # for item in data:
    #     url=parse(item[0])
    #     title=item[1]
    #     visit_time=(item[2]/1000000)-11644473600
    #     if url in new_data:
    #         new_data[url][2] += 1
    #     else:
    #         new_data[url]=[title,visit_time,1]
    new_data = []
    # start_time=FormatToStamp()
    for item in data:
        url=item[0]
        url_2= filter_data(item[0])
        title = item[1]
        # visit_time = (item[2] / 1000000) - 11644473600#这是时间戳 Unix
        visit_time=(item[2] / 1000000) - 11644473600
        # if visit_time>start_time:
        visit_time = time.strftime("%Y-%m-%d %X",time.localtime(visit_time))
        new_data.append([url,url_2,title, visit_time])
    for i in new_data:
        print(i)

    print(address_count)
    cursor.close()
    conn.close()
域名聚合 代码1,提取二级域名

 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time,re
def parse(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print('URL format error!')
def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def FormatToStamp():
    flag=True
    while flag:
        try:
            formatTime=input("请输入时间(格式:2018-06-24 11:50:00)").strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)



if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    # new_data={}
    # for item in data:
    #     url=parse(item[0])
    #     title=item[1]
    #     visit_time=(item[2]/1000000)-11644473600
    #     if url in new_data:
    #         new_data[url][2] += 1
    #     else:
    #         new_data[url]=[title,visit_time,1]
    # new_data = []
    new_data = {}
    # start_time=FormatToStamp()
    for item in data:
        # url=item[0]
        url_2= filter_data(item[0])
        title = item[1]
        if url_2 in new_data:
            new_data[url_2]+=1
        else:
            new_data[url_2]=1
        # visit_time = (item[2] / 1000000) - 11644473600#这是时间戳 Unix
        # visit_time=(item[2] / 1000000) - 11644473600
        # # if visit_time>start_time:
        # visit_time = time.strftime("%Y-%m-%d %X",time.localtime(visit_time))
        # new_data.append([url,url_2,title, visit_time])
    for i,j in new_data.items():
        print(i,">>>",j)

    print(address_count)
    cursor.close()
    conn.close()
域名聚合后统计的数量

频繁访问网站前25的数据(top-X可设置)

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time,re
import matplotlib.pyplot as plt

def parse(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print('URL format error!')
def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def FormatToStamp():
    flag=True
    while flag:
        try:
            formatTime=input("请输入时间(格式:2018-06-24 11:50:00)").strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)
def analyze(results):
    #条形图
    prompt = input("[.] Type <c> to print or <p> to plot\n[>] ")

    if prompt == "c":
        with open('./history.txt', 'w') as f:
            for site, count in sites_count_sorted.items():
                f.write(site + '\t' + str(count) + '\n')
    elif prompt == "p":
        key = []
        value = []
        for i in results:
            key.append(i[0])
            value.append(i[1])
        n = 25
        X=range(n)
        Y = value[:n]#数量

        plt.bar( X,Y, align='edge')
        plt.xticks(rotation=45)
        plt.xticks(X, key[:n])
        for x, y in zip(X, Y):
            plt.text(x + 0.4, y + 0.05, y, ha='center', va='bottom')



        plt.show()
    else:
        print("[.] Uh?")
        quit()
def analyze2(results):
    print("我一看就知道你要打印折线图")
    key=[]
    value=[]
    for i in results:
        key.append(i[0])
        value.append(i[1])
    n = 20
    X = key[:n]
    Y = value[:n]

    plt.plot(X,Y,label="number count")
    plt.xticks(rotation=45)
    plt.xlabel('numbers')
    plt.ylabel('webname')
    plt.title('number count')
    plt.show()

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    # new_data={}
    # for item in data:
    #     url=parse(item[0])
    #     title=item[1]
    #     visit_time=(item[2]/1000000)-11644473600
    #     if url in new_data:
    #         new_data[url][2] += 1
    #     else:
    #         new_data[url]=[title,visit_time,1]
    # new_data = []
    new_data = {}
    # start_time=FormatToStamp()
    for item in data:
        # url=item[0]
        url_2= filter_data(item[0])
        title = item[1]
        if url_2 in new_data:
            new_data[url_2]+=1
        else:
            new_data[url_2]=1
        # visit_time = (item[2] / 1000000) - 11644473600#这是时间戳 Unix
        # visit_time=(item[2] / 1000000) - 11644473600
        # # if visit_time>start_time:
        # visit_time = time.strftime("%Y-%m-%d %X",time.localtime(visit_time))
        # new_data.append([url,url_2,title, visit_time])


    # for i,j in new_data.items():
    #     print(i,">>>",j)
    #我们对拿到的字典数据进行排序
    del new_data["ok"]
    sites_count_sorted = sorted(new_data.items(),key=lambda item:item[1],reverse=True)
    # print(sites_count_sorted)
    analyze(sites_count_sorted)
    # print(type(sites_count_sorted))
    #print(sites_count_sorted)
    #print("不符合条件的域名",address_count)
    cursor.close()
    conn.close()
数据可视化 top-25

可以看出 访问百度的频数最高,依次递减。  


 

关联检索关键词

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import time,os,sqlite3,re
def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT keyword_search_terms.term,urls.url,urls.visit_count,urls.last_visit_time from keyword_search_terms LEFT JOIN urls on keyword_search_terms.url_id=urls.id;"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    print("取得数据:关键词keyword,url,访问次数,访问时间",data)

    new_data = []
    for item in data:
        keyword=item[0]
        url= filter_data(item[1])
        count=item[2]
        last_vist_time=time.strftime("%Y-%m-%d %X",time.localtime((item[3] / 1000000) - 11644473600))
        new_data.append([keyword,url,count, last_vist_time])
    for i in new_data:
        print(i)
拿到关键词 代码

部分数据如下:

 用户行为分析——统计搜索的前十大关键词:

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/17 
'''
import time,os,sqlite3,re
def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')

#统计一下搜索引擎使用占比



if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT keyword_search_terms.term,urls.url,urls.visit_count,urls.last_visit_time from keyword_search_terms LEFT JOIN urls on keyword_search_terms.url_id=urls.id;"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    #print("取得数据:关键词keyword,url,访问次数,访问时间",data)

    new_data = []
    ky_count={}#统计关键词数量
    search_engines_count={}#统计搜索引擎使用数量
    for item in data:
        keyword=item[0]
        url= filter_data(item[1])
        count=item[2]
        last_vist_time=time.strftime("%Y-%m-%d %X",time.localtime((item[3] / 1000000) - 11644473600))
        new_data.append([keyword,url,count, last_vist_time])
        if keyword in ky_count:
            ky_count[keyword]+=1
        # elif url in search_engines_count:
        #     search_engines_count[url]+=1
        else:
            ky_count[keyword]=1
            # search_engines_count[url] = 1

    # for i in new_data:
    # print(new_data)
    # print(ky_count)
    # print(search_engines_count)
    #对关键词进行排序
    ky_count_sorted = sorted(ky_count.items(), key=lambda item: item[1], reverse=True)
    print(ky_count_sorted)
对关键词进行排序

用直方图显示频数前20个 top-20的关键词。

 

 1 #!/usr/bin/env python3
 2 #-*- coding:utf-8 -*-
 3 '''
 4 Administrator 
 5 2018/10/17 
 6 '''
 7 import time,os,sqlite3,re
 8 import matplotlib.pyplot as plt
 9 
10 
11 def filter_data(url):
12     try:
13         parsed_url_components = url.split('//')
14         sublevel_split = parsed_url_components[1].split('/', 1)
15         data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
16         if data:
17             return data.group()
18         else:
19             address_count.add(sublevel_split[0])
20             return "ok"
21     except IndexError:
22         print('URL format error!')
23 
24 #统计一下搜索引擎使用占比
25 def analyze(results):
26     #条形图
27     plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
28     plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
29     # 有中文出现的情况,需要u'内容'
30     print("I will generate histograms directly !!")
31     key = []
32     value = []
33     for i in results:
34         key.append(i[0])
35         value.append(i[1])
36     n = 20
37     X=range(n)
38     Y = value[:n]#数量
39 
40     plt.bar( X,Y, align='edge',color="g")
41     plt.xticks(rotation=45)
42     plt.xticks(X, key[:n])
43     for x, y in zip(X, Y):
44         plt.text(x + 0.4, y + 0.05, y, ha='center', va='bottom')
45     plt.show()
46 
47 
48 
49 if __name__=="__main__":
50     address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
51     data_path = r"D:\360浏览器\360se6\User Data\Default"
52     files = os.listdir(data_path)
53     history_db = os.path.join(data_path, 'History')
54     conn = sqlite3.connect(history_db)
55     cursor = conn.cursor()
56     SQL = "SELECT keyword_search_terms.term,urls.url,urls.visit_count,urls.last_visit_time from keyword_search_terms LEFT JOIN urls on keyword_search_terms.url_id=urls.id;"
57     cursor.execute(SQL)
58     data = cursor.fetchall()
59     cursor.close()
60     conn.close()
61     #print("取得数据:关键词keyword,url,访问次数,访问时间",data)
62 
63     new_data = []
64     ky_count={}#统计关键词数量
65     search_engines_count={}#统计搜索引擎使用数量
66     for item in data:
67         keyword=item[0]
68         url= filter_data(item[1])
69         count=item[2]
70         last_vist_time=time.strftime("%Y-%m-%d %X",time.localtime((item[3] / 1000000) - 11644473600))
71         new_data.append([keyword,url,count, last_vist_time])
72         if keyword in ky_count:
73             ky_count[keyword]+=count
74         # elif url in search_engines_count:
75         #     search_engines_count[url]+=1
76         else:
77             ky_count[keyword]=count
78             # search_engines_count[url] = 1
79 
80     # for i in new_data:
81     # print(new_data)
82     # print(ky_count)
83     # print(search_engines_count)
84     #对关键词进行排序
85     ky_count_sorted = sorted(ky_count.items(), key=lambda item: item[1], reverse=True)
86     analyze(ky_count_sorted)
统计关键词,生成上面的直方图

从关键词上可以看出,作者还是挺喜欢看小说的。业余时间喜欢用浏览器去搜索小说。特别是《道君》的搜索频数最大,O(∩_∩)O哈哈~        其次技术类的信息也喜欢使用浏览器查找。

 

使用最多的搜索引擎是baidu.com ,其次是so.com    pypi.org则使用最少。仅占比0.4%

  1 #!/usr/bin/env python3
  2 #-*- coding:utf-8 -*-
  3 '''
  4 Administrator 
  5 2018/10/17 
  6 '''
  7 import time,os,sqlite3,re
  8 import matplotlib.pyplot as plt
  9 
 10 
 11 def filter_data(url):
 12     try:
 13         parsed_url_components = url.split('//')
 14         sublevel_split = parsed_url_components[1].split('/', 1)
 15         data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
 16         if data:
 17             return data.group()
 18         else:
 19             address_count.add(sublevel_split[0])
 20             return "ok"
 21     except IndexError:
 22         print('URL format error!')
 23 
 24 #统计一下搜索引擎使用占比
 25 def analyze(results):
 26     #条形图
 27     plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
 28     plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
 29     # 有中文出现的情况,需要u'内容'
 30     print("I will generate histograms directly !!")
 31     key = []
 32     value = []
 33     for i in results:
 34         key.append(i[0])
 35         value.append(i[1])
 36     n = 20
 37     X=range(n)
 38     Y = value[:n]#数量
 39 
 40     plt.bar( X,Y, align='edge',color="g")
 41     plt.xticks(rotation=45)
 42     plt.xticks(X, key[:n])
 43     for x, y in zip(X, Y):
 44         plt.text(x + 0.4, y + 0.05, y, ha='center', va='bottom')
 45     plt.show()
 46 
 47 def analyze2(results):
 48     #饼图
 49     plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
 50     plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
 51     # 有中文出现的情况,需要u'内容'
 52     slices=[]
 53     activities=[]
 54     for i,j in results.items():
 55         activities.append(i)
 56         slices.append(j)
 57     cols=['c','m','y','b']#给饼图配颜色
 58     num=len(slices)#取到一个需要几个数组
 59     cols=cols[:num]#需要一个颜色,取几个
 60     plt.pie(slices,
 61             labels=activities,
 62             colors=cols,
 63             startangle=90,
 64             # shadow=True,
 65             explode=(0,0,0,0.1),
 66             autopct='%1.1f%%')
 67     plt.title(u"搜索引擎使用占比")
 68     plt.show()
 69 
 70 if __name__=="__main__":
 71     address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
 72     data_path = r"D:\360浏览器\360se6\User Data\Default"
 73     files = os.listdir(data_path)
 74     history_db = os.path.join(data_path, 'History')
 75     conn = sqlite3.connect(history_db)
 76     cursor = conn.cursor()
 77     SQL = "SELECT keyword_search_terms.term,urls.url,urls.visit_count,urls.last_visit_time from keyword_search_terms LEFT JOIN urls on keyword_search_terms.url_id=urls.id;"
 78     cursor.execute(SQL)
 79     data = cursor.fetchall()
 80     cursor.close()
 81     conn.close()
 82     #print("取得数据:关键词keyword,url,访问次数,访问时间",data)
 83 
 84     new_data = []
 85     ky_count={}#统计关键词数量
 86     search_engines_count={}#统计搜索引擎使用数量
 87     for item in data:
 88         keyword=item[0]
 89         url= filter_data(item[1])
 90         count=item[2]
 91         last_vist_time=time.strftime("%Y-%m-%d %X",time.localtime((item[3] / 1000000) - 11644473600))
 92         new_data.append([keyword,url,count, last_vist_time])
 93         # if keyword in ky_count:
 94             # ky_count[keyword]+=count
 95         if url in search_engines_count:
 96             search_engines_count[url]+=1
 97         else:
 98             # ky_count[keyword]=count
 99             search_engines_count[url] = 1
100 
101     # for i in new_data:
102     # print(new_data)
103     # print(ky_count)
104     # print(search_engines_count)
105     #对关键词进行排序
106     # ky_count_sorted = sorted(ky_count.items(), key=lambda item: item[1], reverse=True)
107 
108     # analyze(ky_count_sorted)
109     analyze2(search_engines_count)
搜索引擎使用率分析

 


 

统计每天打开网页的次数

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/18 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time,re
import matplotlib.pyplot as plt

class dataForm:
    """
    本类适用于生成连续的时间。包括时,天,月。
    数据用于数据分析,只考虑今年的情况-----2018年。
    hours:# 用于生成2018年的任意一个月份的每天的时间字典。单位是小时
    """
    #hours_dic = {}
    days_dic={}
    months_dic={}
    def __init__(self):
        pass
    def hours(self,num):
        hours_dic = {}
        if num in [1, 3, 5, 7, 8, 10, 12]:
            for day in range(1,32):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        elif num == 2:
            for day in range(1,29):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        elif num in [4, 6, 9, 11]:
            for day in range(1,31):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        else:
            print("输入格式不正确")
        return hours_dic
    def days(self):
        pass
    def months(self):
        pass
    def years(self):
        pass
    def FormatToStamp(self,string):
        #转成Unix时间戳
        flag = True
        while flag:
            try:
                formatTime = string.strip()
                formatTime = time.mktime(time.strptime(formatTime, '%Y-%m-%d %X'))
                return formatTime
            except Exception as e:
                print("转换失败,请重新输入。失败原因:%s" % e)



def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def analyze2(results):
    #折线图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    s1="2018-10-15"
    select_day=a.FormatToStamp("%s 00:00:00"%s1)
    key=[]
    value=[]
    for i in results:
        if select_day<i[1][1]:
            key.append(i[0][-8:])
            value.append(i[1][0])
    X = key[:24]
    Y = value[:24]
    #X=list(range(0, 24))
    #group_labels=list(range(0, 24))

    plt.plot(X,Y,label="%s 当日登陆网站走势图"%s1)
    plt.xticks(X,rotation=45)
    plt.xlabel('当日时刻:小时')
    plt.ylabel('登陆次数:次')
    plt.title("用户登陆网站走势图")
    plt.legend()
    plt.show()

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    cursor.close()
    conn.close()

    new_data1=[]#统计每小时访问网站数量
    new_data2=[]#统计每天的访问网站数量
    new_data3=[]#统计每月的访问网站数量
    new_data4=[]#统计每年的访问网站数量   暂时用不到
    for num,item in enumerate(data):
        #new_data1.append([item,time.strftime("%Y-%m-%d %X",time.localtime((item[2] / 1000000) - 11644473600))])
        new_data1.append([item, time.strftime("%Y-%m-%d %X", time.localtime((item[2] / 1000000) - 11644473600))[:-6]+":00:00"])
        new_data2.append([item, time.strftime("%Y-%m-%d", time.localtime((item[2] / 1000000) - 11644473600))])
        new_data3.append([item, time.strftime("%Y-%m", time.localtime((item[2] / 1000000) - 11644473600))])

    new_data_hours = {}  # 统计每小时访问网站数量
    new_data_days = {}  # 统计每天的访问网站数量
    new_data_months = {}  # 统计每月的访问网站数量
    new_data_years = {}  # 统计每年的访问网站数量   暂时用不到

    # for item in new_data1:
    #     for h in range(24):
    #         if item[1] in new_data_hours:
    #             new_data_hours[item[1]]+=1
    #         else:
    #             new_data_hours[item[1]]=1
    #print("统计每小时访问网站数量", new_data_hours)

    #print(new_data1)
    ##############  统计每小时访问网站数量 #############################################
    a=dataForm()#定义一个时间生成对象
    # print(new_data1)
    for item in new_data1:
        if item[1] in new_data_hours:
            new_data_hours[item[1]][0]+=1
        elif item[1] not in new_data_hours:
            num=item[1][5:7]
            hours_dic = a.hours(int(num))
            new_data_hours.update(hours_dic)
            new_data_hours[item[1]][0]=1


    #对数据进行排序
    new_data_hours_sorted=sorted(new_data_hours.items(), key=lambda item: item[0])
    # for i in new_data_hours_sorted:
    #     print(i)
    analyze2(new_data_hours_sorted)


    #print("统计每小时访问网站数量", new_data_hours)
    # for item in new_data2:
    #     if item[1] in new_data_days:
    #         new_data_days[item[1]]+=1
    #     else:
    #         new_data_days[item[1]]=1
    # print("统计每天的访问网站数量",new_data_days)
    # for item in new_data3:
    #     if item[1] in new_data_months:
    #         new_data_months[item[1]]+=1
    #     else:
    #         new_data_months[item[1]]=1
    #
    # print("统计每月的访问网站数量",new_data_months)
代码

效果图:

 

 

  1 #!/usr/bin/env python3
  2 #-*- coding:utf-8 -*-
  3 '''
  4 Administrator 
  5 2018/10/18 
  6 '''
  7 import os  # 访问系统的包
  8 import sqlite3  # 链接数据库文件胡包
  9 import time,re
 10 import matplotlib.pyplot as plt
 11 
 12 class dataForm:
 13     """
 14     本类适用于生成连续的时间。包括时,天,月。
 15     数据用于数据分析,只考虑今年的情况-----2018年。
 16     hours:# 用于生成2018年的任意一个月份的每天的时间字典。单位是小时
 17     """
 18     #hours_dic = {}
 19     days_dic={}
 20     months_dic={}
 21     def __init__(self):
 22         pass
 23     def hours(self,num):
 24         hours_dic = {}
 25         if num in [1, 3, 5, 7, 8, 10, 12]:
 26             for day in range(1,32):
 27                 day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
 28                 for h in range(24):
 29                     hour = day_to_str + " "+str(h).zfill(2)+":00:00"
 30                     hours_dic[hour]=[0,self.FormatToStamp(hour)]
 31         elif num == 2:
 32             for day in range(1,29):
 33                 day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
 34                 for h in range(24):
 35                     hour = day_to_str + " "+str(h).zfill(2)+":00:00"
 36                     hours_dic[hour]=[0,self.FormatToStamp(hour)]
 37         elif num in [4, 6, 9, 11]:
 38             for day in range(1,31):
 39                 day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
 40                 for h in range(24):
 41                     hour = day_to_str + " "+str(h).zfill(2)+":00:00"
 42                     hours_dic[hour]=[0,self.FormatToStamp(hour)]
 43         else:
 44             print("输入格式不正确")
 45         return hours_dic
 46     def days(self):
 47         pass
 48     def months(self):
 49         pass
 50     def years(self):
 51         pass
 52     def FormatToStamp(self,string):
 53         #转成Unix时间戳
 54         flag = True
 55         while flag:
 56             try:
 57                 formatTime = string.strip()
 58                 formatTime = time.mktime(time.strptime(formatTime, '%Y-%m-%d %X'))
 59                 return formatTime
 60             except Exception as e:
 61                 print("转换失败,请重新输入。失败原因:%s" % e)
 62 
 63 
 64 
 65 def filter_data(url):
 66     try:
 67         parsed_url_components = url.split('//')
 68         sublevel_split = parsed_url_components[1].split('/', 1)
 69         data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
 70         if data:
 71             return data.group()
 72         else:
 73             address_count.add(sublevel_split[0])
 74             return "ok"
 75     except IndexError:
 76         print('URL format error!')
 77 def analyze2(results):
 78     #折线图
 79 
 80     plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
 81     plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
 82     # 有中文出现的情况,需要u'内容'
 83     def go(string):
 84         col=['black','brown','red','goldenrod',"blue","y","g"]
 85         s_to=string
 86         for i,j in enumerate(col):
 87             # s=string[:8]+str(int(string[-2:])-i)
 88             select_day = a.FormatToStamp("%s 00:00:00" % s_to)-i*86400*2
 89             keys = []
 90             values = []
 91             for i in results:
 92                 if select_day < i[1][1]:
 93                     keys.append(i[0][-11:-3])
 94                     values.append(i[1][0])
 95             # x = keys[:48]
 96             x=range(48)
 97             y = values[:48]
 98             plt.plot(x, y, label="%s 登陆网站走势图" %(time.strftime("%Y-%m-%d %X",time.localtime(select_day))), color=j)
 99             plt.xticks(x, rotation=0)
100     go("2018-10-05")
101 
102     plt.xlabel('当日时刻:小时')
103     plt.ylabel('登陆次数:次')
104     plt.title("用户登陆网站48小时走势图")
105     plt.legend()
106     plt.show()
107 
108 if __name__=="__main__":
109     address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
110     data_path = r"D:\360浏览器\360se6\User Data\Default"
111     files=os.listdir(data_path)
112     history_db = os.path.join(data_path, 'History')
113     conn = sqlite3.connect(history_db)
114     cursor = conn.cursor()
115     SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
116     cursor.execute(SQL)
117     data=cursor.fetchall()
118     cursor.close()
119     conn.close()
120 
121     new_data1=[]#统计每小时访问网站数量
122     new_data2=[]#统计每天的访问网站数量
123     new_data3=[]#统计每月的访问网站数量
124     new_data4=[]#统计每年的访问网站数量   暂时用不到
125     for num,item in enumerate(data):
126         #new_data1.append([item,time.strftime("%Y-%m-%d %X",time.localtime((item[2] / 1000000) - 11644473600))])
127         new_data1.append([item, time.strftime("%Y-%m-%d %X", time.localtime((item[2] / 1000000) - 11644473600))[:-6]+":00:00"])
128         new_data2.append([item, time.strftime("%Y-%m-%d", time.localtime((item[2] / 1000000) - 11644473600))])
129         new_data3.append([item, time.strftime("%Y-%m", time.localtime((item[2] / 1000000) - 11644473600))])
130 
131     new_data_hours = {}  # 统计每小时访问网站数量
132     new_data_days = {}  # 统计每天的访问网站数量
133     new_data_months = {}  # 统计每月的访问网站数量
134     new_data_years = {}  # 统计每年的访问网站数量   暂时用不到
135 
136     # for item in new_data1:
137     #     for h in range(24):
138     #         if item[1] in new_data_hours:
139     #             new_data_hours[item[1]]+=1
140     #         else:
141     #             new_data_hours[item[1]]=1
142     #print("统计每小时访问网站数量", new_data_hours)
143 
144     #print(new_data1)
145     ##############  统计每小时访问网站数量 #############################################
146     a=dataForm()#定义一个时间生成对象
147     # print(new_data1)
148     for item in new_data1:
149         if item[1] in new_data_hours:
150             new_data_hours[item[1]][0]+=1
151         elif item[1] not in new_data_hours:
152             num=item[1][5:7]
153             hours_dic = a.hours(int(num))
154             new_data_hours.update(hours_dic)
155             new_data_hours[item[1]][0]=1
156 
157 
158     #对数据进行排序
159     new_data_hours_sorted=sorted(new_data_hours.items(), key=lambda item: item[0])
160     # for i in new_data_hours_sorted:
161     #     print(i)
162     analyze2(new_data_hours_sorted)
163 
164 
165     #print("统计每小时访问网站数量", new_data_hours)
166     # for item in new_data2:
167     #     if item[1] in new_data_days:
168     #         new_data_days[item[1]]+=1
169     #     else:
170     #         new_data_days[item[1]]=1
171     # print("统计每天的访问网站数量",new_data_days)
172     # for item in new_data3:
173     #     if item[1] in new_data_months:
174     #         new_data_months[item[1]]+=1
175     #     else:
176     #         new_data_months[item[1]]=1
177     #
178     # print("统计每月的访问网站数量",new_data_months)
代码 取一个7*2 天的走势图

效果图 如下:

 

从图中可以看出,每天都是从早上七点开始到下午五点之间会产生登陆行为。非上班时间,不会使用本机电脑。至少证明,本人上班还是听规律的呀。哈哈·······

 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/18 
'''
import os  # 访问系统的包
import sqlite3  # 链接数据库文件胡包
import time,re
import matplotlib.pyplot as plt

class dataForm:
    """
    本类适用于生成连续的时间。包括时,天,月。
    数据用于数据分析,只考虑今年的情况-----2018年。
    hours:# 用于生成2018年的任意一个月份的每天的时间字典。单位是小时
    """
    #hours_dic = {}
    days_dic={}
    months_dic={}
    def __init__(self):
        pass
    def hours(self,num):
        hours_dic = {}
        if num in [1, 3, 5, 7, 8, 10, 12]:
            for day in range(1,32):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        elif num == 2:
            for day in range(1,29):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        elif num in [4, 6, 9, 11]:
            for day in range(1,31):
                day_to_str="2018-"+str(num).zfill(2)+"-"+str(day).zfill(2)
                for h in range(24):
                    hour = day_to_str + " "+str(h).zfill(2)+":00:00"
                    hours_dic[hour]=[0,self.FormatToStamp(hour)]
        else:
            print("输入格式不正确")
        return hours_dic
    def days(self):
        pass
    def months(self):
        pass
    def years(self):
        pass
    def FormatToStamp(self,string):
        #转成Unix时间戳
        flag = True
        while flag:
            try:
                formatTime = string.strip()
                formatTime = time.mktime(time.strptime(formatTime, '%Y-%m-%d %X'))
                return formatTime
            except Exception as e:
                print("转换失败,请重新输入。失败原因:%s" % e)



def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',sublevel_split[0])
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def analyze2(results):
    #折线图

    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    def go(string):
        col=['black','brown','red','goldenrod',"blue","y","g",'black','brown','red','goldenrod',"blue","y","g",'black','brown','red','goldenrod',"blue","y","g",'black','brown','red','goldenrod',"blue","y","g"]
        s_to=string
        for i,j in enumerate(col):
            # s=string[:8]+str(int(string[-2:])-i)
            select_day = a.FormatToStamp("%s 00:00:00" % s_to)-i*86400
            keys = []
            values = []
            for i in results:
                if select_day < i[1][1]:
                    keys.append(i[0][-11:-3])
                    values.append(i[1][0])
            # x = keys[:48]
            x=range(24)
            y = values[:24]
            plt.plot(x, y, label="%s 登陆网站走势图" %(time.strftime("%Y-%m-%d %X",time.localtime(select_day))), color=j)
            plt.xticks(x, rotation=0)
    go("2018-10-05")

    plt.xlabel('当日时刻:小时')
    plt.ylabel('登陆次数:次')
    plt.title("用户登陆网站24小时走势图")
    plt.legend()
    plt.show()
def analyze3(results):
    #散点图

    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    def go(string):
        s_to=string
        for i in range(30):
            select_day = a.FormatToStamp("%s 00:00:00" % s_to)-i*86400
            keys = []
            values = []
            for i in results:
                if select_day < i[1][1]:
                    keys.append(i[0][-11:-3])
                    values.append(i[1][0])
            # x = keys[:48]
            x=range(24)
            y = values[:24]
            print(x)
            print(y)
            plt.scatter(x, y, label="%s 登陆网站散点图" %(time.strftime("%Y-%m-%d %X",time.localtime(select_day))))
            plt.xticks(x, rotation=0)
    go("2018-10-18")
    #
    plt.xlabel('当日时刻:小时')
    plt.ylabel('登陆次数:次')
    plt.title("用户登陆网站24小时散点图")
    plt.legend()
    plt.show()
if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files=os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL="SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data=cursor.fetchall()
    cursor.close()
    conn.close()

    new_data1=[]#统计每小时访问网站数量
    new_data2=[]#统计每天的访问网站数量
    new_data3=[]#统计每月的访问网站数量
    new_data4=[]#统计每年的访问网站数量   暂时用不到
    for num,item in enumerate(data):
        #new_data1.append([item,time.strftime("%Y-%m-%d %X",time.localtime((item[2] / 1000000) - 11644473600))])
        new_data1.append([item, time.strftime("%Y-%m-%d %X", time.localtime((item[2] / 1000000) - 11644473600))[:-6]+":00:00"])
        new_data2.append([item, time.strftime("%Y-%m-%d", time.localtime((item[2] / 1000000) - 11644473600))])
        new_data3.append([item, time.strftime("%Y-%m", time.localtime((item[2] / 1000000) - 11644473600))])

    new_data_hours = {}  # 统计每小时访问网站数量
    new_data_days = {}  # 统计每天的访问网站数量
    new_data_months = {}  # 统计每月的访问网站数量
    new_data_years = {}  # 统计每年的访问网站数量   暂时用不到

    # for item in new_data1:
    #     for h in range(24):
    #         if item[1] in new_data_hours:
    #             new_data_hours[item[1]]+=1
    #         else:
    #             new_data_hours[item[1]]=1
    #print("统计每小时访问网站数量", new_data_hours)

    #print(new_data1)
    ##############  统计每小时访问网站数量 #############################################
    a=dataForm()#定义一个时间生成对象
    # print(new_data1)
    for item in new_data1:
        if item[1] in new_data_hours:
            new_data_hours[item[1]][0]+=1
        elif item[1] not in new_data_hours:
            num=item[1][5:7]
            hours_dic = a.hours(int(num))
            new_data_hours.update(hours_dic)
            new_data_hours[item[1]][0]=1


    #对数据进行排序
    new_data_hours_sorted=sorted(new_data_hours.items(), key=lambda item: item[0])
    # for i in new_data_hours_sorted:
    #     print(i)
    analyze3(new_data_hours_sorted)


    #print("统计每小时访问网站数量", new_data_hours)
    # for item in new_data2:
    #     if item[1] in new_data_days:
    #         new_data_days[item[1]]+=1
    #     else:
    #         new_data_days[item[1]]=1
    # print("统计每天的访问网站数量",new_data_days)
    # for item in new_data3:
    #     if item[1] in new_data_months:
    #         new_data_months[item[1]]+=1
    #     else:
    #         new_data_months[item[1]]=1
    #
    # print("统计每月的访问网站数量",new_data_months)
散点图代码


分析用户上网行为

由于用户的网站的URL请求特别多,每次请求一次就会产生一条记录。使用穷举法,我们对用户访问的所有网站进行一次分类。 根据需求,设置分类。 假设如果分成9类,每条记录可以用列表表示: visit_type=[0,0,0,0,0,0,0,0,0] 

1、大型门户网站
比如国内知名的新浪、搜狐、网易、腾讯等都属于大型门户网站。大型门户网站类型的特点:网站信息量非常大,海量信息量,同时网站以咨询、新闻等内容为主。网站内容比较全面,包括很多分支信息,比如房产、经济、科技、旅游等。大型门户网站通常访问量非常大,每天有数千万甚至上亿的访问量,是互联网的最重要组成部分。
2、行业网站
以某一个行业内容为主题的网站,行业网站通常包括行业资讯,行业技术信息,产品广告发布等。目前基本每个行业都有行业网站,比如五金行业网站、机电行业网站、工程机械行业网站、旅游服务行业网站等。行业网站类型在该行业有一定的知名度,通常流量也比较大,每天有上万的流量。行业网站盈利模式主要靠广告输入,付费商铺,联盟广告,软文,链接买卖等方式盈利。
3、交易类网站
交易类网站主要包括B2B、B2C、C2C等类型。交易类网站以在网站产生销售为目的,通过产品选择-》订购-》付款-》物流发货-》确认发货等流程实现产品的销售。国内知名的交易网站类型有阿里巴巴、淘宝、京东等。
4、分类信息网站
分类信息网站好比互联网的集贸市场,有人在上面发布信息销售产品,有人在上面购买物品。分类信息主要面向同城,是同城产品销售的重要平台。国内知名的分类信息包括58同城、百姓、列表等。如果你有闲置的物品,那么分类信息为你提供了最好的销售平台,而且还是免费的。
5、论坛
说起论坛,估计大家都不陌生,论坛是一个交流的平台,注册论坛账号并登陆以后,就可以发布信息,也可以信息回帖等,实现交流的功能。
6、政府网站
政府网站有政府和事业单位主办,通常内容比较权威,是政府对外发布信息的平台。目前国内政府和事业单位基本都有自己的网站。
7、功能性质网站
网站提供某一种或者几种功能,比如站长工具、电话手机号码查询、物流信息查询、火车票购买等。功能性网站以实现某一种或者几种功能为主要服务内容。用户也是为了实现某一种功能来浏览该网站。
8、娱乐类型网站
娱乐类型网站主要包括视频网站(优酷、土豆)、音乐网站、游戏网站等。虽然互联网发展非常迅速,但是互联网还是以娱乐为主,大部分人上网还是为了娱乐。通常娱乐网站浏览量非常大,主要是需求非常大,以视频、游戏娱乐网站最为突出。
9、企业网站
企业网站是互联网网站数量最多的类型,现在几乎每一个企业都有自己的企业网站。企业网站内容包括企业的新闻动态、企业的产品信息、企业的简介、企业的联系方式等内容。企业网站是企业对外展示的窗口,也是企业销售产品的最主要方式。
网站分类 9大类

同时由于现在大部分用户习惯使用搜索引擎,在表中增加一列,用来统计搜索引擎的使用情况。 

时间 搜索引擎 大型门户网站 行业网站 交易类网站 分类信息网站 论坛 政府网站 功能性质网站 娱乐类型网站 企业网站
2018-09-01 00:00:00                    
2018-09-01 00:00:01                    
2018-09-01 00:00:02                    
2018-09-01 00:00:03                    
......                    

 

 1 #!/usr/bin/env python3
 2 #-*- coding:utf-8 -*-
 3 '''
 4 Administrator 
 5 2018/10/19 
 6 '''
 7 import os,sqlite3,re,time
 8 def filter_data(url):
 9     try:
10         parsed_url_components = url.split('//')
11         sublevel_split = parsed_url_components[1].split('/', 1)
12         domain = sublevel_split[0].replace("www.", "")
13 
14         data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',domain)
15         if data:
16             return data.group()
17         else:
18             address_count.add(sublevel_split[0])
19             return "ok"
20     except IndexError:
21         print('URL format error!')
22 
23 if __name__=="__main__":
24     address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
25     data_path = r"D:\360浏览器\360se6\User Data\Default"
26     files = os.listdir(data_path)
27     history_db = os.path.join(data_path, 'History')
28     conn = sqlite3.connect(history_db)
29     cursor = conn.cursor()
30     SQL = "SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
31     cursor.execute(SQL)
32     data = cursor.fetchall()
33     cursor.close()
34     conn.close()
35 
36     new_list=[]
37 
38     web_dic={0:["360.cn","baidu.com","so.com","pypi.org","bing.com","stackoverflow.com","google.cn"],#搜索引擎
39              1: ["qihoo.com","eastday.com","toutiao.com","ufochn.com","timeacle.cn","bilibili.com","xfwed.com",
40                  "docin.com","sohu.com","msn.cn","chouti.com","milkjpg.com","btime.com","51xuedu.com",
41                  "joyme.com"],#大型门户网站
42              2: ["699pic.com","epub360.com","wix.com","crsky.com","runoob.com","scrapy.org","debugbar.com","linuxidc.com",
43                  "911cha.com","tuke88.com","readthedocs.org","jb51.net","readthedocs.io","github.com","888pic.com",
44                  "ibaotu.com","w3school.com","17sucai.com","mycodes.net","freemoban.com"],#行业网站
45              3: ["www.cc","b2b.cc","mall.cc","creditcard.cc","taobao.com","alipay.com",
46                  "tmall.com","jd.com","dangdang.com","company1.cc","ebank.cc"],#交易类网站
47              4: ["qcloud.com","360doc.com","eol.cn","xiaohuar.com","exam8.com"],#分类信息网站
48              5: ["csdn.net","2cto.com","liaoxuefeng.com","w3cschool.cn","cnblogs.com","zol.com",
49                  "python.org","douban.com","shidastudy.com","v2ex.com","i.cn","home.cn","passport.cn",
50                  "testerhome.com","aosabook.org","ggdoc.com","biqukan.com"],#论坛
51              6: ["apply.cc","ccopyright.com","gov.cn","jzjyy.cn","jseea.cn","jszk.net"],#政府网站
52              7: ["ccb.com",'ok',"163.com","ibsbjstar.cc","app.com","maiziedu.com","browsershots.org","gerensuodeshui.cn",
53                  "gitlab.com","youdao.com","alicdn.com","qmail.com"],#功能性质网站
54              8:["biqudao.com","biqudu.com","siluke.tw","biquge.cc","6mao.com","biqiuge.com","biquge.info",
55                 "dingdiann.com","qq.com","booktxt.net","biquge.com","xs.la","208xs.com","xxbiquge.com",
56                 "xuexi111.com","ufochn.com","ahzww.net","555zw.com","biquge5200.com","bequge.com","bqg99.com",
57                 "bqg99.cc","sbiquge.com","biquge5200.cc","166xs.com"],#    娱乐类型网站
58              9: ["microsoft.com","weaver.com","oa8000.com","goodwaysoft.com","lenosoft.net",
59                  "lizheng.com","adobe.com","pexels.com","dingtalk.com","autodesk.com","qifeiye.com",
60                  "lizhengyun.com","shuishandt.com","windows7en.com","coursera.org","smoson.com","gongboshi.com",
61                  "huawei.com","spoon.net","jetbrains.com","sqlite.org","yesacc.com","sunlandzk.com",
62                  "alibabagroup.com","turbo.net","cqttech.com","prezi.com","tradedoubler.com",
63                  "renrendai.com","alizila.com","hongzhixx.cn","studyol.com","cnbim.com"],#企业网站
64              }
65     for row in data:
66         web_type=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
67         stime=(row[2] / 1000000) - 11644473600#这是时间戳 Unix
68         stime = time.strftime("%Y-%m-%d %X", time.localtime(stime))
69         url=filter_data(row[0])
70         for i in web_dic:
71             for j in web_dic[i]:
72                 if url in j:
73                     web_type[i]=1
74         data_old=row
75         new_list.append([stime,web_type,url,data_old])
76     for row_new in new_list:
77         # if not 1 in row_new[1]:
78         #     print(row_new)
79         print(row_new)
需要一个二级域名数据库来对已有的url 比较,帮其分类。本例子,采用手写的字典形式的数据源。

部分数据如上图所示,1代表该行URL地址的类型。

 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/19 
'''
import os,sqlite3,re,time
import matplotlib.pyplot as plt
def FormatToStamp(string):
    flag=True
    while flag:
        try:
            formatTime=input("请输入%s(格式:2018-06-24 11:50:00)"%string).strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)
def per_seconds(start_time,end_time):
    time_dict={}
    start_time=int(start_time)
    end_time=int(end_time)
    for i in range(start_time,end_time):
        time_dict[i]=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    return time_dict

def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")

        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',domain)
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()

    new_list=[]

    web_dic={0:["360.cn","baidu.com","so.com","pypi.org","bing.com","stackoverflow.com","google.cn"],#搜索引擎
             1: ["qihoo.com","eastday.com","toutiao.com","ufochn.com","timeacle.cn","bilibili.com","xfwed.com",
                 "docin.com","sohu.com","msn.cn","chouti.com","milkjpg.com","btime.com","51xuedu.com",
                 "joyme.com"],#大型门户网站
             2: ["699pic.com","epub360.com","wix.com","crsky.com","runoob.com","scrapy.org","debugbar.com","linuxidc.com",
                 "911cha.com","tuke88.com","readthedocs.org","jb51.net","readthedocs.io","github.com","888pic.com",
                 "ibaotu.com","w3school.com","17sucai.com","mycodes.net","freemoban.com"],#行业网站
             3: ["www.cc","b2b.cc","mall.cc","creditcard.cc","taobao.com","alipay.com",
                 "tmall.com","jd.com","dangdang.com","company1.cc","ebank.cc"],#交易类网站
             4: ["qcloud.com","360doc.com","eol.cn","xiaohuar.com","exam8.com"],#分类信息网站
             5: ["csdn.net","2cto.com","liaoxuefeng.com","w3cschool.cn","cnblogs.com","zol.com",
                 "python.org","douban.com","shidastudy.com","v2ex.com","i.cn","home.cn","passport.cn",
                 "testerhome.com","aosabook.org","ggdoc.com","biqukan.com"],#论坛
             6: ["apply.cc","ccopyright.com","gov.cn","jzjyy.cn","jseea.cn","jszk.net"],#政府网站
             7: ["ccb.com",'ok',"163.com","ibsbjstar.cc","app.com","maiziedu.com","browsershots.org","gerensuodeshui.cn",
                 "gitlab.com","youdao.com","alicdn.com","qmail.com"],#功能性质网站
             8:["biqudao.com","biqudu.com","siluke.tw","biquge.cc","6mao.com","biqiuge.com","biquge.info",
                "dingdiann.com","qq.com","booktxt.net","biquge.com","xs.la","208xs.com","xxbiquge.com",
                "xuexi111.com","ufochn.com","ahzww.net","555zw.com","biquge5200.com","bequge.com","bqg99.com",
                "bqg99.cc","sbiquge.com","biquge5200.cc","166xs.com"],#    娱乐类型网站
             9: ["microsoft.com","weaver.com","oa8000.com","goodwaysoft.com","lenosoft.net",
                 "lizheng.com","adobe.com","pexels.com","dingtalk.com","autodesk.com","qifeiye.com",
                 "lizhengyun.com","shuishandt.com","windows7en.com","coursera.org","smoson.com","gongboshi.com",
                 "huawei.com","spoon.net","jetbrains.com","sqlite.org","yesacc.com","sunlandzk.com",
                 "alibabagroup.com","turbo.net","cqttech.com","prezi.com","tradedoubler.com",
                 "renrendai.com","alizila.com","hongzhixx.cn","studyol.com","cnbim.com"],#企业网站
             }

    start_time=FormatToStamp("请输入开始时间:")
    end_time = FormatToStamp("请输入结束时间:")
    new_time_dic=per_seconds(start_time,end_time)
    for row in data:
        stime=int((row[2] / 1000000) - 11644473600)#这是时间戳 Unix
        url=filter_data(row[0])
        data_old = row
        if stime in new_time_dic:
            for i in web_dic:
                for j in web_dic[i]:
                    if url in j:
                        new_time_dic[stime][i] = 1


    # for row in data:
    #     web_type=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    #     stime=(row[2] / 1000000) - 11644473600#这是时间戳 Unix
    #     #stime = time.strftime("%Y-%m-%d %X", time.localtime(stime))
    #     url=filter_data(row[0])
    #     for i in web_dic:
    #         for j in web_dic[i]:
    #             if url in j:
    #                 web_type[i]=1
    #     data_old=row
    #     if stime>start_time and stime<end_time:
    #         new_list.append([stime,web_type,url,data_old])

    for row_new in new_time_dic:
        if 1 in new_time_dic[row_new]:
            print(row_new,new_time_dic[row_new])
取自定义时间段的数据
"D:\Program Files (x86)\python36\python.exe" F:/python从入门到放弃/10.17/用户行为分析.py
请输入请输入开始时间:(格式:2018-06-24 11:50:00)2018-10-17 00:00:00
请输入请输入结束时间:(格式:2018-06-24 11:50:00)2018-10-19 00:00:00
1539738611 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1539747785 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1539747789 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1539747793 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1539747795 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...........................
部分数据 时间戳 和登陆类别状态

 

Python位运算符
按位运算符是把数字看作二进制来进行计算的。Python中的按位运算法则如下:

下表中变量 a 为 60,b 为 13二进制格式如下:
a = 0011 1100

b = 0000 1101

-----------------

a&b = 0000 1100

a|b = 0011 1101

a^b = 0011 0001

~a  = 1100 0011
小知识点--python的位运算

 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/19 
'''
import os,sqlite3,re,time
import matplotlib.pyplot as plt
def FormatToStamp(string):
    flag=True
    while flag:
        try:
            formatTime=input("请输入%s(格式:2018-06-24 11:50:00)"%string).strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)
def per_seconds(start_time,end_time):
    time_dict={}
    start_time=int(start_time)
    end_time=int(end_time)
    for i in range(start_time,end_time):
        time_dict[i]=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    return time_dict

def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")

        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',domain)
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def scatter_plot(results):
    # 散点图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    keys = []
    values = []
    for items in results:
        keys.append(items[0])
        state=0
        for i in range(10):
            if 1 == items[1][i]:
                state=i+1
                break
        values.append(state)


    # x = keys[:48]
    x = range(24000)
    # for i in values:
    #     if i != 0:
    #         print("ok",i)

    # print(">>",values)
    y = values[-24000:]
    #y_title=["搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站"]
    # print(x)
    # print(y)
    plt.scatter(x, y, label="登陆网站散点图")



    #
    plt.xlabel('当日时刻:秒')
    plt.ylabel('登陆网站类别')
    plt.title("用户登陆网站行为分析")
    plt.show()

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()

    new_list=[]

    web_dic={0:["360.cn","baidu.com","so.com","pypi.org","bing.com","stackoverflow.com","google.cn"],#搜索引擎
             1: ["qihoo.com","eastday.com","toutiao.com","ufochn.com","timeacle.cn","bilibili.com","xfwed.com",
                 "docin.com","sohu.com","msn.cn","chouti.com","milkjpg.com","btime.com","51xuedu.com",
                 "joyme.com"],#大型门户网站
             2: ["699pic.com","epub360.com","wix.com","crsky.com","runoob.com","scrapy.org","debugbar.com","linuxidc.com",
                 "911cha.com","tuke88.com","readthedocs.org","jb51.net","readthedocs.io","github.com","888pic.com",
                 "ibaotu.com","w3school.com","17sucai.com","mycodes.net","freemoban.com"],#行业网站
             3: ["www.cc","b2b.cc","mall.cc","creditcard.cc","taobao.com","alipay.com",
                 "tmall.com","jd.com","dangdang.com","company1.cc","ebank.cc"],#交易类网站
             4: ["qcloud.com","360doc.com","eol.cn","xiaohuar.com","exam8.com"],#分类信息网站
             5: ["csdn.net","2cto.com","liaoxuefeng.com","w3cschool.cn","cnblogs.com","zol.com",
                 "python.org","douban.com","shidastudy.com","v2ex.com","i.cn","home.cn","passport.cn",
                 "testerhome.com","aosabook.org","ggdoc.com","biqukan.com"],#论坛
             6: ["apply.cc","ccopyright.com","gov.cn","jzjyy.cn","jseea.cn","jszk.net"],#政府网站
             7: ["ccb.com",'ok',"163.com","ibsbjstar.cc","app.com","maiziedu.com","browsershots.org","gerensuodeshui.cn",
                 "gitlab.com","youdao.com","alicdn.com","qmail.com"],#功能性质网站
             8:["biqudao.com","biqudu.com","siluke.tw","biquge.cc","6mao.com","biqiuge.com","biquge.info",
                "dingdiann.com","qq.com","booktxt.net","biquge.com","xs.la","208xs.com","xxbiquge.com",
                "xuexi111.com","ufochn.com","ahzww.net","555zw.com","biquge5200.com","bequge.com","bqg99.com",
                "bqg99.cc","sbiquge.com","biquge5200.cc","166xs.com"],#    娱乐类型网站
             9: ["microsoft.com","weaver.com","oa8000.com","goodwaysoft.com","lenosoft.net",
                 "lizheng.com","adobe.com","pexels.com","dingtalk.com","autodesk.com","qifeiye.com",
                 "lizhengyun.com","shuishandt.com","windows7en.com","coursera.org","smoson.com","gongboshi.com",
                 "huawei.com","spoon.net","jetbrains.com","sqlite.org","yesacc.com","sunlandzk.com",
                 "alibabagroup.com","turbo.net","cqttech.com","prezi.com","tradedoubler.com",
                 "renrendai.com","alizila.com","hongzhixx.cn","studyol.com","cnbim.com"],#企业网站
             }

    start_time=FormatToStamp("请输入开始时间:")
    end_time = FormatToStamp("请输入结束时间:")
    new_time_dic=per_seconds(start_time,end_time)
    for row in data:
        stime=int((row[2] / 1000000) - 11644473600)#这是时间戳 Unix
        url=filter_data(row[0])
        data_old = row
        if stime in new_time_dic:
            for i in web_dic:
                for j in web_dic[i]:
                    if url in j:
                        new_time_dic[stime][i] = 1
    sites_count_sorted = sorted(new_time_dic.items(), key=lambda item: item[0])
    # print(sites_count_sorted)

    for row_new in new_time_dic:
        if 1 in new_time_dic[row_new]:
            print(row_new,new_time_dic[row_new])
    scatter_plot(sites_count_sorted)
时间单位用秒的话,太小了。显示不出数据

 


#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/19 
'''
import os,sqlite3,re,time
import matplotlib.pyplot as plt
def FormatToStamp(string):
    flag=True
    while flag:
        try:
            formatTime=input("请输入%s(格式:2018-06-24 11:50:00)"%string).strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)
def per_seconds(start_time,end_time):
    time_dict={}
    start_time=int(start_time)
    end_time=int(end_time)
    for i in range(start_time,end_time):
        time_dict[i]=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    return time_dict

def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")

        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',domain)
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def fold_line_diagram(results):
    # 折线图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    keys = []
    values = []
    for items in results:
        ntime=time.strftime("%Y-%m-%d %X", time.localtime(items[0]))
        keys.append(ntime)
        state=0
        for i in range(10):
            if 1 == items[1][i]:
                state=i+1
                break
        values.append(state)


    x = keys[-500:]
    #x = range(500)
    y = values[-500:]
    y_title=["搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站"]
    #plt.xticks(y_title)
    plt.xticks(rotation=45,fontsize='small')
    plt.plot(x, y, label="登陆网站散点图")
    plt.xlabel('当日时刻:秒')
    plt.ylabel('登陆网站类别')
    plt.title("用户登陆网站行为分析")
    plt.show()
def scatter_plot(results):
    # 散点图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    keys = []
    values = []
    for items in results:
        ntime=time.strftime("%Y-%m-%d %X", time.localtime(items[0]))
        keys.append(ntime)
        state=0
        for i in range(10):
            if 1 == items[1][i]:
                state=i+1
                break
        values.append(state)


    x = keys[-100:]
    #x = range(500)
    y = values[-100:]
    y_title=["搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站","搜索引擎","大型门户网站"]
    #plt.xticks(y_title)
    plt.xticks(rotation=45,fontsize='small')
    plt.scatter(x, y, label="1:搜索引擎,2:大型门户网站")
    plt.xlabel('当日时刻:秒')
    plt.ylabel('登陆网站类别')
    plt.title("用户登陆网站行为分析")
    plt.legend()
    plt.show()

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    web_dic={0:["360.cn","baidu.com","so.com","pypi.org","bing.com","stackoverflow.com","google.cn"],#搜索引擎
             1: ["qihoo.com","eastday.com","toutiao.com","ufochn.com","timeacle.cn","bilibili.com","xfwed.com",
                 "docin.com","sohu.com","msn.cn","chouti.com","milkjpg.com","btime.com","51xuedu.com",
                 "joyme.com"],#大型门户网站
             2: ["699pic.com","epub360.com","wix.com","crsky.com","runoob.com","scrapy.org","debugbar.com","linuxidc.com",
                 "911cha.com","tuke88.com","readthedocs.org","jb51.net","readthedocs.io","github.com","888pic.com",
                 "ibaotu.com","w3school.com","17sucai.com","mycodes.net","freemoban.com"],#行业网站
             3: ["www.cc","b2b.cc","mall.cc","creditcard.cc","taobao.com","alipay.com",
                 "tmall.com","jd.com","dangdang.com","company1.cc","ebank.cc"],#交易类网站
             4: ["qcloud.com","360doc.com","eol.cn","xiaohuar.com","exam8.com"],#分类信息网站
             5: ["csdn.net","2cto.com","liaoxuefeng.com","w3cschool.cn","cnblogs.com","zol.com",
                 "python.org","douban.com","shidastudy.com","v2ex.com","i.cn","home.cn","passport.cn",
                 "testerhome.com","aosabook.org","ggdoc.com","biqukan.com"],#论坛
             6: ["apply.cc","ccopyright.com","gov.cn","jzjyy.cn","jseea.cn","jszk.net"],#政府网站
             7: ["ccb.com",'ok',"163.com","ibsbjstar.cc","app.com","maiziedu.com","browsershots.org","gerensuodeshui.cn",
                 "gitlab.com","youdao.com","alicdn.com","qmail.com"],#功能性质网站
             8:["biqudao.com","biqudu.com","siluke.tw","biquge.cc","6mao.com","biqiuge.com","biquge.info",
                "dingdiann.com","qq.com","booktxt.net","biquge.com","xs.la","208xs.com","xxbiquge.com",
                "xuexi111.com","ufochn.com","ahzww.net","555zw.com","biquge5200.com","bequge.com","bqg99.com",
                "bqg99.cc","sbiquge.com","biquge5200.cc","166xs.com"],#    娱乐类型网站
             9: ["microsoft.com","weaver.com","oa8000.com","goodwaysoft.com","lenosoft.net",
                 "lizheng.com","adobe.com","pexels.com","dingtalk.com","autodesk.com","qifeiye.com",
                 "lizhengyun.com","shuishandt.com","windows7en.com","coursera.org","smoson.com","gongboshi.com",
                 "huawei.com","spoon.net","jetbrains.com","sqlite.org","yesacc.com","sunlandzk.com",
                 "alibabagroup.com","turbo.net","cqttech.com","prezi.com","tradedoubler.com",
                 "renrendai.com","alizila.com","hongzhixx.cn","studyol.com","cnbim.com"],#企业网站
             }

    start_time=FormatToStamp("请输入开始时间:")
    end_time = FormatToStamp("请输入结束时间:")
    new_time_list =[]
    for row in data:
        web_type=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        stime=(row[2] / 1000000) - 11644473600#这是时间戳 Unix
        url=filter_data(row[0])

        data_old=row
        if stime>start_time and stime<end_time:
            for i in web_dic:
                for j in web_dic[i]:
                    if url in j:
                        web_type[i] = 1
                        new_time_list.append([stime,web_type])

    for row_new in new_time_list:
        if 1 in row_new[1]:
            print(row_new,new_time_list[1])
    scatter_plot(new_time_list)
散点图

 

 

#!/usr/bin/env python3
#-*- coding:utf-8 -*-
'''
Administrator 
2018/10/19 
'''
import os,sqlite3,re,time
import matplotlib.pyplot as plt
def FormatToStamp(string):
    flag=True
    while flag:
        try:
            formatTime=input("请输入%s(格式:2018-06-24 11:50:00)"%string).strip()
            formatTime=time.mktime(time.strptime(formatTime,'%Y-%m-%d %X'))
            return formatTime
        except Exception as e:
            print("转换失败,请重新输入。失败原因:%s"%e)
def per_seconds(start_time,end_time):
    time_dict={}
    start_time=int(start_time)
    end_time=int(end_time)
    for i in range(start_time,end_time):
        time_dict[i]=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    return time_dict

def filter_data(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")

        data=re.search('\w+\.(com|cn|net|tw|la|io|org|cc|info|cm|us|tv|club|co|in)',domain)
        if data:
            return data.group()
        else:
            address_count.add(sublevel_split[0])
            return "ok"
    except IndexError:
        print('URL format error!')
def fold_line_diagram(results):
    # 折线图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    keys = []
    values = []
    for items in results:
        ntime=time.strftime("%Y-%m-%d %X", time.localtime(items[0]))
        keys.append(ntime)
        state=0
        for i in range(10):
            if 1 == items[1][i]:
                state=i+1
                break
        values.append(state)


    x = keys[-100:]
    #x = range(500)
    y = values[-100:]
    #plt.xticks(y_title)
    plt.xticks(rotation=45,fontsize='small')
    plt.plot(x, y, label="1:搜索引擎 2:大型门户网站 3:行业网站 4:交易类网站,\n5:分类信息网站 6:论坛7 :政府网站 8:功能性质网站\n9:娱乐类型网站 10:企业网站")
    plt.xlabel('当日时刻:秒')
    plt.ylabel('登陆网站类别')
    plt.title("用户登陆网站行为分析")
    plt.legend()
    plt.show()
def scatter_plot(results):
    # 散点图
    plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
    plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
    # 有中文出现的情况,需要u'内容'
    keys = []
    values = []
    for items in results:
        ntime=time.strftime("%Y-%m-%d %X", time.localtime(items[0]))
        keys.append(ntime)
        state=0
        for i in range(10):
            if 1 == items[1][i]:
                state=i+1
                break
        values.append(state)


    x = keys[-100:]
    #x = range(500)
    y = values[-100:]
    #plt.xticks(y_title)
    plt.xticks(rotation=45,fontsize='small')
    plt.scatter(x, y, label="1:搜索引擎 2:大型门户网站 3:行业网站 4:交易类网站,\n5:分类信息网站 6:论坛7 :政府网站 8:功能性质网站\n9:娱乐类型网站 10:企业网站")
    plt.xlabel('当日时刻:秒')
    plt.ylabel('登陆网站类别')
    plt.title("用户登陆网站行为分析")
    plt.legend()
    plt.show()

if __name__=="__main__":
    address_count = set()  # 创建一个空的集合,用来收集已经存在国际域名
    data_path = r"D:\360浏览器\360se6\User Data\Default"
    files = os.listdir(data_path)
    history_db = os.path.join(data_path, 'History')
    conn = sqlite3.connect(history_db)
    cursor = conn.cursor()
    SQL = "SELECT urls.url,urls.title,visits.visit_time from visits LEFT JOIN urls on visits.url=urls.id"
    cursor.execute(SQL)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    web_dic={0:["360.cn","baidu.com","so.com","pypi.org","bing.com","stackoverflow.com","google.cn"],#搜索引擎
             1: ["qihoo.com","eastday.com","toutiao.com","ufochn.com","timeacle.cn","bilibili.com","xfwed.com",
                 "docin.com","sohu.com","msn.cn","chouti.com","milkjpg.com","btime.com","51xuedu.com",
                 "joyme.com"],#大型门户网站
             2: ["699pic.com","epub360.com","wix.com","crsky.com","runoob.com","scrapy.org","debugbar.com","linuxidc.com",
                 "911cha.com","tuke88.com","readthedocs.org","jb51.net","readthedocs.io","github.com","888pic.com",
                 "ibaotu.com","w3school.com","17sucai.com","mycodes.net","freemoban.com"],#行业网站
             3: ["www.cc","b2b.cc","mall.cc","creditcard.cc","taobao.com","alipay.com",
                 "tmall.com","jd.com","dangdang.com","company1.cc","ebank.cc"],#交易类网站
             4: ["qcloud.com","360doc.com","eol.cn","xiaohuar.com","exam8.com"],#分类信息网站
             5: ["csdn.net","2cto.com","liaoxuefeng.com","w3cschool.cn","cnblogs.com","zol.com",
                 "python.org","douban.com","shidastudy.com","v2ex.com","i.cn","home.cn","passport.cn",
                 "testerhome.com","aosabook.org","ggdoc.com","biqukan.com"],#论坛
             6: ["apply.cc","ccopyright.com","gov.cn","jzjyy.cn","jseea.cn","jszk.net"],#政府网站
             7: ["ccb.com",'ok',"163.com","ibsbjstar.cc","app.com","maiziedu.com","browsershots.org","gerensuodeshui.cn",
                 "gitlab.com","youdao.com","alicdn.com","qmail.com"],#功能性质网站
             8:["biqudao.com","biqudu.com","siluke.tw","biquge.cc","6mao.com","biqiuge.com","biquge.info",
                "dingdiann.com","qq.com","booktxt.net","biquge.com","xs.la","208xs.com","xxbiquge.com",
                "xuexi111.com","ufochn.com","ahzww.net","555zw.com","biquge5200.com","bequge.com","bqg99.com",
                "bqg99.cc","sbiquge.com","biquge5200.cc","166xs.com"],#    娱乐类型网站
             9: ["microsoft.com","weaver.com","oa8000.com","goodwaysoft.com","lenosoft.net",
                 "lizheng.com","adobe.com","pexels.com","dingtalk.com","autodesk.com","qifeiye.com",
                 "lizhengyun.com","shuishandt.com","windows7en.com","coursera.org","smoson.com","gongboshi.com",
                 "huawei.com","spoon.net","jetbrains.com","sqlite.org","yesacc.com","sunlandzk.com",
                 "alibabagroup.com","turbo.net","cqttech.com","prezi.com","tradedoubler.com",
                 "renrendai.com","alizila.com","hongzhixx.cn","studyol.com","cnbim.com"],#企业网站
             }

    start_time=FormatToStamp("请输入开始时间:")
    end_time = FormatToStamp("请输入结束时间:")
    new_time_list =[]
    for row in data:
        web_type=[0 for x in range(10)]#生成一个记录单条数据的列表 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        stime=(row[2] / 1000000) - 11644473600#这是时间戳 Unix
        url=filter_data(row[0])

        data_old=row
        if stime>start_time and stime<end_time:
            for i in web_dic:
                for j in web_dic[i]:
                    if url in j:
                        web_type[i] = 1
                        new_time_list.append([stime,web_type])

    # for row_new in new_time_list:
    #     if 1 in row_new[1]:
    #         print(row_new,new_time_list[1])
    fold_line_diagram(new_time_list)
折线图

 

 

 

posted @ 2018-10-17 09:25  巨兽~墨菲特  阅读(3298)  评论(0编辑  收藏  举报