A beginner's guide to crawling the comments of a single Weibo user

 

I. Overview

Crawl every post published by the Weibo user "深圳移动" (Shenzhen Mobile), together with the comments on each post.

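II. Tools

Language: Python 2.7
Library: requests (import requests)
Weibo accounts: several, purchased online
IP proxy: a proxy server with a dynamic IP, rented online
User-agent: several strings collected by searching online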

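III. Overall approach

1. First, find the mobile Weibo page of "深圳移动":

https://m.weibo.cn/u/1922826034


2. The mobile site shows no pagination; it keeps loading more posts as you scroll down (1671 pages in total). The underlying JSON data, however, is still served page by page.
https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page=2

Changing the value after page= retrieves the JSON data for each page of the mobile timeline.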
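As a minimal sketch of that idea, the helper below only builds the page URLs; it does not send any request. The function names are hypothetical, and the default uid and containerid are simply the values from the URL above.

```python
# Hypothetical helpers: only the URL template comes from the post itself.
TIMELINE_URL = ('https://m.weibo.cn/api/container/getIndex'
                '?type=uid&value={uid}&containerid={containerid}&page={page}')


def timeline_page_url(page, uid='1922826034', containerid='1076031922826034'):
    """Return the JSON URL for one page of the user's timeline."""
    return TIMELINE_URL.format(uid=uid, containerid=containerid, page=page)


def timeline_page_urls(last_page):
    """Yield the URLs for pages 1..last_page inclusive."""
    for page in range(1, last_page + 1):
        yield timeline_page_url(page)


if __name__ == '__main__':
    for url in timeline_page_urls(3):
        print(url)
```

Keeping URL construction separate from fetching makes it easy to check the URLs by eye before any request goes out.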


3. From the JSON data above, take the idstr field, which is the post id.
A single post's mobile page is available at https://m.weibo.cn/status/4177994643916324.
Format: https://m.weibo.cn/status/【id】
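The extraction step might look like the sketch below. It assumes the JSON layout used by the scripts at the end of this post (a top-level cards list whose entries hold the post under mblog); the function names and the sample data are made up for illustration.

```python
def extract_post_ids(page_json):
    """Collect the idstr of every card that carries a post (mblog)."""
    ids = []
    for card in page_json.get('cards') or []:
        mblog = card.get('mblog')
        if mblog and mblog.get('idstr'):
            ids.append(mblog['idstr'])
    return ids


def status_url(post_id):
    """Return the mobile page URL of a single post."""
    return 'https://m.weibo.cn/status/{}'.format(post_id)


if __name__ == '__main__':
    # Made-up sample in the assumed layout; the second card is not a post.
    sample = {'cards': [{'mblog': {'idstr': '4177994643916324'}},
                        {'card_type': 11}]}
    for post_id in extract_post_ids(sample):
        print(status_url(post_id))
```

Cards without an mblog entry are skipped rather than raising, so one unusual card does not abort the whole page.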


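4. The URL https://m.weibo.cn/api/comments/show?id=4131150395559419&page=1
returns a post's comments as JSON; id is the post id and page is the comment page number.
Format: https://m.weibo.cn/api/comments/show?id=【id】&page=【page_num】

Check the ok field at the top of the response: ok=1 means the requested page contains comments; ok=0 means there are none, either because the post has no comments or because the page number is past the last page.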
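The paging rule can be sketched as a loop that stops at the first page whose ok field is not 1. Here fetch_json is a hypothetical callable that takes a URL and returns the decoded JSON as a dict (for example a thin wrapper around requests.get(...).json()); the max_pages cap is an extra safety limit that is not in the original.

```python
def comment_page_url(post_id, page):
    """Return the JSON URL for one page of a post's comments."""
    return 'https://m.weibo.cn/api/comments/show?id={}&page={}'.format(post_id, page)


def iter_comment_pages(post_id, fetch_json, max_pages=1000):
    """Yield the 'data' list of each comment page until ok is not 1."""
    for page in range(1, max_pages + 1):
        payload = fetch_json(comment_page_url(post_id, page))
        if payload.get('ok') != 1:
            break  # no comments at all, or past the last page
        yield payload.get('data') or []


if __name__ == '__main__':
    # Fake fetcher with made-up data: two pages of comments, then ok=0.
    fake_pages = {
        comment_page_url('42', 1): {'ok': 1, 'data': [{'text': 'first'}]},
        comment_page_url('42', 2): {'ok': 1, 'data': [{'text': 'second'}]},
    }
    for data in iter_comment_pages('42', lambda url: fake_pages.get(url, {'ok': 0})):
        print(data)
```

Passing the fetcher in as a parameter keeps the loop testable without touching the network.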

 

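IV. Implementation

1. Set up the user-agent, cookies, and headers.

Collect a large number of user-agent strings online, buy several Weibo accounts on Taobao, and extract each account's cookie.
random.choice() returns one randomly chosen element of a list on every call. Picking a fresh value for each request avoids hitting Weibo with the same cookie or the same user-agent many times in a short period, which can get that cookie or user-agent banned.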
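A minimal sketch of that rotation is shown below. The build_headers name and the placeholder lists are assumptions for illustration; in the author's project the real lists live in crawler.user_agents and crawler.cookies.

```python
import random


def build_headers(agents, cookies):
    """Return request headers with a freshly picked user-agent and cookie."""
    return {
        'User-Agent': random.choice(agents),
        'Cookie': random.choice(cookies),
        'Host': 'm.weibo.cn',
        'Referer': 'https://m.weibo.cn/u/1922826034',
    }


if __name__ == '__main__':
    # Placeholder values, not real user-agents or cookies.
    demo_agents = ['agent-A', 'agent-B', 'agent-C']
    demo_cookies = ['cookie-1', 'cookie-2']
    for _ in range(3):
        headers = build_headers(demo_agents, demo_cookies)
        print(headers['User-Agent'], headers['Cookie'])
```

The important detail is that build_headers is called once per request; choosing the values a single time at the top of a script would send the same pair with every request.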

 

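2. Fetch the JSON data for each page of posts and extract the idstr field to get the id of every post.
time.sleep(random.randint(1,4)): the pause between requests is a random number of seconds rather than a fixed value.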
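The randomized pause can be wrapped in a small helper, sketched here. The polite_sleep name and the injectable sleep parameter are assumptions added so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(low=1, high=4, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it."""
    delay = random.randint(low, high)  # randint includes both endpoints
    sleep(delay)
    return delay


if __name__ == '__main__':
    recorded = []
    # Pass recorded.append instead of time.sleep to see the delays instantly.
    for _ in range(5):
        polite_sleep(sleep=recorded.append)
    print(recorded)
```

An irregular interval looks less like an automated client than a fixed one, which is the point of the original one-liner.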

 

3. In the same way, fetch the comment data from the comment JSON pages.
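Turning one comment entry into a flat record could look like the sketch below. The field names (text, like_counts, user.id, user.screen_name) are the ones read by the comment script at the end of this post; the parse_comment name and the sample entry are made up.

```python
def parse_comment(item):
    """Pick the fields of interest out of one entry of a comment page's data list."""
    user = item.get('user') or {}
    return {
        'user_id': user.get('id'),
        'user_name': user.get('screen_name'),
        'like_count': item.get('like_counts'),
        'text': item.get('text'),
    }


if __name__ == '__main__':
    # Made-up entry in the assumed layout.
    sample = {'text': 'hello', 'like_counts': 3,
              'user': {'id': 123, 'screen_name': 'someone'}}
    print(parse_comment(sample))
```

Using .get() throughout means a missing field becomes None instead of raising and discarding the rest of the page.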

 

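V. Lessons learned

1. After the crawler has been running for a while, it starts failing with "No JSON object could be decoded". Debugging showed that the page source could not be fetched and the response was a 404. The user-agent had been banned after too many uses, which mostly happens when all requests come from a single IP address. I ran the crawler on a server with a dynamic IP address, so I did not need to set a proxy IP in the crawler itself; if you do need one, a proxy can be picked the same way as the user-agent, with random.choice(). Even with a dynamic IP, a user-agent can still be banned.
For how a site's anti-crawler rules can block specific user-agents, see:

Source: 《Nginx反爬虫攻略:禁止某些User Agent抓取网站》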
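One way to cope with this failure mode is to check the status code before decoding and to retry with a different user-agent. The sketch below is an assumption-laden illustration, not the author's code: get stands for any callable that behaves like requests.get, make_headers for a function that returns freshly rotated headers, and the proxy addresses are placeholders.

```python
import random


def decode_json(resp):
    """Return the decoded body, or None if the response is not usable JSON."""
    if resp.status_code != 200:
        return None
    try:
        return resp.json()
    except ValueError:  # raised when the body is not valid JSON
        return None


def fetch_with_retry(url, get, make_headers, attempts=3):
    """Try up to `attempts` times, building new headers for each try."""
    for _ in range(attempts):
        payload = decode_json(get(url, headers=make_headers()))
        if payload is not None:
            return payload
    return None


def random_proxies(proxy_urls):
    """Build a proxies mapping in the form requests accepts via proxies=."""
    chosen = random.choice(proxy_urls)
    return {'http': chosen, 'https': chosen}


if __name__ == '__main__':
    # Placeholder addresses, not real proxies.
    print(random_proxies(['http://10.0.0.1:8080', 'http://10.0.0.2:8080']))
```

Returning None instead of raising lets the caller decide whether to skip the page, log it, or queue it for a later retry.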

 

2. When crawling a large amount of data, the crawler needs code that refreshes the Weibo accounts' cookies automatically.
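The post does not show such code, so the following is only a hypothetical sketch of one possible shape: a pool that drops cookies once they stop working and asks a caller-supplied refresh function for new ones when it runs empty. How fresh cookies are actually obtained (for example by logging in again) is outside this sketch and left to that function.

```python
import random


class CookiePool(object):
    """Hypothetical pool that retires dead cookies and refills itself on demand."""

    def __init__(self, cookies, refresh=None):
        self._cookies = list(cookies)
        self._refresh = refresh  # callable returning a list of fresh cookie strings

    def pick(self):
        """Return a random live cookie, refilling the pool first if it is empty."""
        if not self._cookies and self._refresh is not None:
            self._cookies = list(self._refresh())
        if not self._cookies:
            raise RuntimeError('no usable cookies left')
        return random.choice(self._cookies)

    def retire(self, cookie):
        """Remove a cookie that has stopped working."""
        if cookie in self._cookies:
            self._cookies.remove(cookie)


if __name__ == '__main__':
    # Placeholder strings, not real cookies.
    pool = CookiePool(['cookie-1'], refresh=lambda: ['cookie-2', 'cookie-3'])
    first = pool.pick()
    pool.retire(first)  # pretend the server rejected it
    print(pool.pick())
```

The crawler would call retire() whenever a request made with that cookie comes back as a login page or an error, and pick() before every request.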

 

VI. References

An article that contributed a great deal to this crawl: 《pyhton微博爬虫(3)——获取微博评论数据》
http://blog.csdn.net/FlySky1991/article/details/76924443

 

VII. Code that only I can understand

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Script 1: collect the id of every post on the user's timeline.
# Written for Python 2.7; print_function keeps it runnable on Python 3 as well.
from __future__ import print_function

import random
import time

import requests

import crawler.user_agents as ua
from crawler import cookies as ck


def writeintxt(rows, filename):
    # Append one "page,post_id" line per collected post.
    with open(filename, 'a') as output:
        for page, post_id in rows:
            output.write(str(page) + ',' + str(post_id) + '\n')


def build_headers():
    # Pick a fresh user-agent and cookie for every request, not once per run.
    return {
        'User-Agent': random.choice(ua.agents),
        'Host': 'm.weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        # Only advertise encodings that requests decodes without extra packages.
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://m.weibo.cn/u/1922826034',
        'Cookie': random.choice(ck.cookies),
        'Connection': 'keep-alive',
    }


id_list = []
base_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page='
for page in range(1, 1672):  # pages 1..1671
    try:
        resp = requests.get(base_url + str(page), headers=build_headers(), timeout=5)
        jsondata = resp.json()
        for card in jsondata.get('cards') or []:
            mblog = card.get('mblog')
            if mblog:  # some cards are not posts and carry no mblog
                id_list.append([page, mblog.get('idstr')])
    except (requests.RequestException, ValueError) as e:
        # Network failure, or a body that is not JSON (for example a 404 page).
        print(page, e)
        print('*' * 100)
    # Pause after every request, including failed ones.
    time.sleep(random.randint(1, 4))
print("ok")

writeintxt(id_list, 'weibo_id')

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Script 2: fetch every comment page for each collected post id.
# Written for Python 2.7; print_function and io.open keep it runnable on Python 3 as well.
from __future__ import print_function

import io
import random
import time

import requests

import crawler.user_agents as ua
from crawler import cookies as ck

DATA_DIR = u'D:/MattDoc/实习/1124爬取深圳移动新浪微博/网页/'


def readfromtxt(filename):
    with io.open(DATA_DIR + filename, 'r', encoding='utf-8') as f:
        return f.read()


def writeintxt(comments_by_id, filename):
    # One line per post: post_id####record1####record2####...
    with io.open(DATA_DIR + filename, 'a', encoding='utf-8') as output:
        for post_id, records in comments_by_id.items():
            comment_str = u''.join(r + u'####' for r in records)
            output.write(post_id + u'####' + comment_str + u'\n')


def build_headers():
    # Pick a fresh user-agent and cookie for every request, not once per run.
    return {
        'User-Agent': random.choice(ua.agents),
        'Host': 'm.weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        # Only advertise encodings that requests decodes without extra packages.
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://m.weibo.cn/u/1922826034',
        'Cookie': random.choice(ck.cookies),
        'Connection': 'keep-alive',
    }


base_url = 'https://m.weibo.cn/api/comments/show?id='
result_dict = {}
for line in readfromtxt('weibo_id1.txt').splitlines():
    # Accept both "page,id" lines (the format script 1 writes) and bare ids.
    weibo_id = line.strip().split(',')[-1]
    if not weibo_id:
        continue  # skip blank lines
    try:
        record_list = []
        page = 1
        while True:
            url = base_url + weibo_id + '&page=' + str(page)
            resp = requests.get(url, headers=build_headers(), timeout=100)
            jsondata = resp.json()
            if jsondata.get('ok') != 1:
                break  # ok=0: no comments, or past the last page
            for d in jsondata.get('data') or []:
                user = d.get('user') or {}
                # '$$' separates fields, so strip it from free-text values.
                comment = (d.get('text') or u'').replace('$$', '')
                user_name = (user.get('screen_name') or u'').replace('$$', '')
                one_record = u'$$'.join([u'%s' % user.get('id'),
                                         u'%s' % d.get('like_counts'),
                                         user_name, comment])
                record_list.append(one_record)
            page += 1
        result_dict[weibo_id] = record_list
        time.sleep(random.randint(2, 3))
    except (requests.RequestException, ValueError) as e:
        # Network failure, or a body that is not JSON (for example a 404 page).
        print(weibo_id, e)
        print('*' * 100)
print("ok")

writeintxt(result_dict, 'comment1.txt')

 

# encoding=utf-8
""" User-Agents """
agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]

 

# encoding=utf-8

""" cookies """

cookies = [

"SINAGLOBAL=6061592354656.324.1489207743838; un=18240343109; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; cross_origin_proto=SSL; WBStorage=82ca67f06fa80da0|undefined; UOR=,,login.sina.com.cn; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; SSOLoginState=1511882345; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0OIIQzSqq7xsdv-_GhEe8XWdkHikzsFJyqtvqej6OkaM.; SUB=_2A253GQ45DeThGeRP71IQ9y7NyDyIHXVUb3jxrDV8PUNbmtAKLWrSkW9NTjfYoWTfrO0PkXSICRzowbfjExbQidve; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFaVAdSwLmvOo1VRiSlRa3q5JpX5KzhUgL.FozpSh5pS05pe052dJLoIfMLxKBLBonL122LxKnLB.qL1-z_i--fiKyFi-2Xi--fi-2fiKyFTCH8SFHF1C-4eFH81FHWSE-RebH8SE-4BC-RSFH8SFHFBbHWeEH8SEHWeF-RegUDMJ7t; SUHB=04W-u1HCo6armH; ALF=1543418344; wvr=6",
"SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882443; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0-14gBQox9IhSK8vZVaZYWsLxUaOWNkudAR9iT6NFJkg.; SUB=_2A253GQ6bDeRhGeNH6FsZ8CjLzj2IHXVUb2dTrDV8PUNbmtAKLWTjkW9NSqHIBUvGapKd6-MQhJTejk3w_ivUUNXZ; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5gYdHWIHRmedh9Nyrij6XN5JpX5K2hUgL.Fo-4e0.RehqNSK22dJLoI0.LxK-L122LB.qLxK-LB.BLBKqLxKMLB.2LBKzLxKnL12-L122LxK.LBK2L12qLxKqLBKqL1KHiqc-t; SUHB=0auwlDzUYulNGs; ALF=1543418442; un=13728408992; wvr=6",
"SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882512; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog089iFKjxeT1Oc6cbJkkqgWrnQAuMVukRrJy3898cKIb8.; SUB=_2A253GQ9ADeRhGeNH6FsZ8ynJzz6IHXVUb2eIrDV8PUNbmtAKLVWhkW9NSqG4DzNeLkyPCmJIKq6bXfKXpSRCPLqO; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W50J-rDh2D6-QEqNOZ2NddF5JpX5K2hUgL.Fo-4e0.Re0MfShz2dJLoIEeLxK-LB--L1KeLxK-L1hqLBoMLxKnL1K5LBo8IC281xEfIg5tt; SUHB=0gHiPrbPWNJvao; ALF=1543418511; un=15614187608; wvr=6",
"SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; wb_cusLike_5939837542=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882567; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog02c5hBW41ia6vpj1cAqbFzE2KCcsXvDxToS_KOeUnwRc.; SUB=_2A253GQ8XDeRhGeNH6FsZ9CjKyjuIHXVUb2ffrDV8PUNbmtAKLU7wkW9NSqGOexL53l1CujvuLpAFNeOEsl05T_5E; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWuISqBnuGqpyxGiWdJ4bOv5JpX5K2hUgL.Fo-4e0.RShqceKM2dJLoI0YLxK-L1K5L1K2LxK.L1KnLBoeLxK-L1K5L1K2LxKqL1-2L1KqLxK.L1KMLBo-LxKMLB.zLB.qLxK-L1hML1-Bt; SUHB=0LcSwyK5XYMzbr; ALF=1543418566; un=13242833134; wvr=6"


]

 

Posted on 2017-11-30 10:41 by Denise_hzf
