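A Beginner Scrapes the Comments of a Single Weibo User
I. Overview
Scrape every post published by the Weibo user “深圳移动” (Shenzhen Mobile), together with the comments on each post.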
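II. Tools
Language: Python 2.7
Library: requests (import requests)
Weibo accounts: several, bought online
IP proxy: a proxy server with a dynamic IP, rented online
User-Agent strings: a batch collected from the web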
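III. Overall approach
1. First, find the mobile Weibo page of “深圳移动”:
https://m.weibo.cn/u/1922826034
2. The mobile site shows no pagination; it keeps loading more posts as you scroll down (1,671 pages in total). The JSON data behind it, however, is still served page by page:
https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page=2
Changing the value after page= fetches the JSON data for each page of the mobile timeline.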
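3. From that JSON data, take the idstr field, which is the post id.
The mobile page of a single post is available at an address such as https://m.weibo.cn/status/4177994643916324.
Format: https://m.weibo.cn/status/[id]
4. The comments on a post are available as JSON from an address such as
https://m.weibo.cn/api/comments/show?id=4131150395559419&page=1
where id is the post id and page is the comment page number.
Format: https://m.weibo.cn/api/comments/show?id=[id]&page=[page_num]
If the ok field at the top of the response is 1, the page contains comments; if it is 0, there are none. The code relies on this to stop paging: it keeps requesting the next page until ok is no longer 1.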
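Steps 2 to 4 above boil down to two URL templates and one stopping rule. The sketch below is a minimal illustration of that idea, not the code used for the crawl. The helper names are made up for this example, fake_fetch stands in for the network so the snippet runs offline, the sample dictionaries are hand-written, max_pages is a hypothetical safety cap, and the rule that containerid is 107603 followed by the user id is only inferred from the example URL.

```python
def timeline_url(uid, page):
    """JSON URL for one page of a user's mobile timeline."""
    # Assumption: containerid = '107603' + uid, inferred from the example URL.
    return ('https://m.weibo.cn/api/container/getIndex'
            '?type=uid&value={0}&containerid=107603{0}&page={1}').format(uid, page)


def comments_url(post_id, page):
    """JSON URL for one page of comments on one post."""
    return 'https://m.weibo.cn/api/comments/show?id={0}&page={1}'.format(post_id, page)


def extract_post_ids(timeline_json):
    """Collect the idstr of every post in one page of timeline JSON."""
    ids = []
    for card in timeline_json.get('cards') or []:
        mblog = card.get('mblog')
        if mblog and mblog.get('idstr'):
            ids.append(mblog['idstr'])
    return ids


def collect_comment_pages(post_id, fetch_json, max_pages=1000):
    """Request comment pages 1, 2, 3, ... until a page reports ok != 1.

    fetch_json is any callable that maps a URL to decoded JSON;
    max_pages is a hypothetical safety cap, not something the API requires.
    """
    pages = []
    for page in range(1, max_pages + 1):
        payload = fetch_json(comments_url(post_id, page))
        if payload.get('ok') != 1:
            break
        pages.append(payload.get('data') or [])
    return pages


def fake_fetch(url):
    """Offline stand-in for the API: two pages of comments, then ok=0."""
    page = int(url.rsplit('=', 1)[1])
    if page <= 2:
        return {'ok': 1, 'data': [{'text': 'comment on page {0}'.format(page)}]}
    return {'ok': 0}


# Hand-written sample in the assumed shape; the empty dict is a card without a post.
sample_timeline = {'cards': [{'mblog': {'idstr': '4177994643916324'}}, {}]}
post_ids = extract_post_ids(sample_timeline)
comment_pages = collect_comment_pages(post_ids[0], fake_fetch)
print(timeline_url('1922826034', 2))
print(post_ids, len(comment_pages))
```

In a real run, fetch_json would wrap requests.get(...).json(); keeping it as a parameter makes the paging logic testable without touching Weibo.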
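IV. Implementation
1. Set the User-Agent, cookies, and headers.
Collect a large number of User-Agent strings from the web, buy several Weibo accounts on Taobao, and extract their cookies.
random.choice() returns one randomly chosen element of a list on each call. Using it for every request avoids hitting Weibo with the same cookie or the same User-Agent many times in a short period, which would get that cookie or User-Agent banned.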
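2. Fetch the JSON data for each timeline page and extract the idstr field to get the id of every post.
time.sleep(random.randint(1,4)) makes the pause between requests a random number of seconds rather than a fixed value.
3. In the same way, fetch the comment data from the comment JSON pages.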
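The three steps above can be sketched as three small helpers. This is an illustration under stated assumptions rather than the crawl code itself: the function names are invented for the example, the User-Agent and cookie pools are placeholders (in the post's own scripts they come from the crawler.user_agents and crawler.cookies modules), and the sample comment is hand-written. The comment fields (text, like_counts, user.id, user.screen_name) and the $$ separator are the ones the full script in section VII uses.

```python
import random
import time

# Placeholder pools for the example only.
USER_AGENTS = ['example-agent/1.0', 'example-agent/2.0']
COOKIES = ['cookie-of-account-1', 'cookie-of-account-2']


def build_headers(user_agents, cookies):
    """Pick a fresh User-Agent and cookie; meant to be called once per request."""
    return {
        'User-Agent': random.choice(user_agents),
        'Cookie': random.choice(cookies),
        'Host': 'm.weibo.cn',
        'Referer': 'https://m.weibo.cn/u/1922826034',
    }


def polite_sleep(low=1, high=4, sleep=time.sleep):
    """Pause for a random whole number of seconds in [low, high]."""
    delay = random.randint(low, high)  # both ends are inclusive
    sleep(delay)
    return delay


def comment_to_record(comment, sep=u'$$'):
    """Flatten one comment dict into 'user_id$$likes$$user_name$$text'."""
    user = comment.get('user') or {}
    fields = [user.get('id'), comment.get('like_counts'),
              user.get('screen_name'), comment.get('text')]
    # Remove the separator from every field so the record stays splittable.
    return sep.join(u'{0}'.format(field).replace(sep, u'') for field in fields)


headers = build_headers(USER_AGENTS, COOKIES)
slept = []
delay = polite_sleep(sleep=slept.append)  # record the delay instead of waiting
sample_comment = {  # hand-written sample, not real API output
    'text': u'Nice $$post',
    'like_counts': 3,
    'user': {'id': 42, 'screen_name': u'alice'},
}
record = comment_to_record(sample_comment)
print(record)
```

Note that build_headers only rotates anything if it is called inside the request loop; calling random.choice() once at import time fixes a single User-Agent and cookie for the whole run.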
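V. Lessons learned
1. After the crawler has run for a while, it fails with the error "No JSON object could be decoded". Debugging showed that the page source could no longer be fetched: the response was a 404. The cause was a User-Agent that had been used too many times and got banned, mainly because all requests came from a single IP address. I ran the crawler on a server with a dynamic IP address, so I did not need to configure proxy IPs in the crawler itself; doing so would work much like picking a User-Agent with random.choice(). Even with a dynamic IP, though, a User-Agent can still get banned.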
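For how an anti-crawler setup blocks particular User-Agents from a site, see:
Source: "Nginx anti-crawler guide: blocking certain User Agents from crawling a site"
2. When scraping a large amount of data, you need code that automatically refreshes the cookies of the Weibo accounts.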
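The two lessons above suggest a few defensive pieces, sketched here in Python. None of this is from the original crawl; it is a minimal sketch under assumptions. decode_json_or_none checks the status code before decoding, so a 404 no longer surfaces as a JSON error. pick_proxies returns a mapping in the shape accepted by the proxies= argument of requests. CookiePool and its refresh callable are hypothetical: how fresh cookies are actually obtained (logging in again, for instance) is not covered here. FakeResponse only imitates the two response attributes the example needs.

```python
import random


def decode_json_or_none(response):
    """Return the decoded JSON body, or None when the response is unusable."""
    if response.status_code != 200:  # e.g. the 404 seen once a User-Agent is banned
        return None
    try:
        return response.json()
    except ValueError:  # the body was not valid JSON
        return None


def pick_proxies(proxy_urls):
    """Choose one proxy at random, shaped for the proxies= argument of requests."""
    proxy = random.choice(proxy_urls)
    return {'http': proxy, 'https': proxy}


class CookiePool(object):
    """Hands out cookies at random and drops the ones reported as dead."""

    def __init__(self, cookies, refresh=None):
        self._cookies = list(cookies)
        self._refresh = refresh  # hypothetical callable returning fresh cookies

    def pick(self):
        # Refill from the refresh callable only when the pool has run dry.
        if not self._cookies and self._refresh is not None:
            self._cookies = list(self._refresh())
        if not self._cookies:
            raise RuntimeError('no usable cookies left')
        return random.choice(self._cookies)

    def retire(self, cookie):
        if cookie in self._cookies:
            self._cookies.remove(cookie)


class FakeResponse(object):
    """Offline stand-in for a requests response, used only in this example."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        if self._payload is None:
            raise ValueError('No JSON object could be decoded')
        return self._payload


banned = decode_json_or_none(FakeResponse(404))
good = decode_json_or_none(FakeResponse(200, {'ok': 1}))
pool = CookiePool(['cookie-a'], refresh=lambda: ['cookie-b'])
first = pool.pick()
pool.retire(first)   # pretend the cookie stopped working
second = pool.pick()  # the pool is empty, so refresh() is consulted
print(banned, good, first, second)
```

A caller that gets None back can retire the cookie it used, pick a new User-Agent or proxy, and retry the same URL instead of losing the page.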
VI. References
The article that contributed most to this scrape: "Python Weibo crawler (3): fetching Weibo comment data"
http://blog.csdn.net/FlySky1991/article/details/76924443
VII. Code only I can understand
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Step 1: collect the id of every post on the account's timeline."""
import random
import time

import requests

import crawler.user_agents as ua
from crawler import cookies as ck

BASE_URL = ('https://m.weibo.cn/api/container/getIndex'
            '?type=uid&value=1922826034&containerid=1076031922826034&page=')


def make_headers():
    """Build headers with a freshly chosen User-Agent and cookie for each request."""
    return {
        'User-Agent': random.choice(ua.agents),
        'Host': 'm.weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        # requests decodes gzip and deflate on its own; Brotli ('br') needs an
        # extra package and 'sdch' is not supported, so they are not advertised.
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://m.weibo.cn/u/1922826034',
        'Cookie': random.choice(ck.cookies),
        'Connection': 'keep-alive',
    }


def write_in_txt(rows, filename):
    """Append one "page,id" line per row."""
    with open(filename, 'a') as output:
        for page, weibo_id in rows:
            output.write('{0},{1}\n'.format(page, weibo_id))


id_list = []
for page in range(0, 1672):
    try:
        resp = requests.get(BASE_URL + str(page), headers=make_headers(), timeout=5)
        for card in resp.json().get('cards') or []:
            mblog = card.get('mblog')
            if mblog:  # skip any card without an 'mblog' entry
                id_list.append([page, mblog.get('idstr')])
    except Exception:
        # Log the failed page number and carry on with the next page.
        print(page)
        print('*' * 100)
    # Random pause between requests, also after a failure.
    time.sleep(random.randint(1, 4))
print('ok')

write_in_txt(id_list, 'weibo_id')
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Step 2: download the comments of every post id collected in step 1."""
import io
import random
import time

import requests

import crawler.user_agents as ua
from crawler import cookies as ck

DATA_DIR = u'D:/MattDoc/实习/1124爬取深圳移动新浪微博/网页/'
BASE_URL = 'https://m.weibo.cn/api/comments/show?id='
SEP = u'$$'


def make_headers():
    """Build headers with a freshly chosen User-Agent and cookie for each request."""
    return {
        'User-Agent': random.choice(ua.agents),
        'Host': 'm.weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        # requests decodes gzip and deflate on its own; Brotli ('br') needs an
        # extra package and 'sdch' is not supported, so they are not advertised.
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://m.weibo.cn/u/1922826034',
        'Cookie': random.choice(ck.cookies),
        'Connection': 'keep-alive',
    }


def read_from_txt(filename):
    with io.open(DATA_DIR + filename, 'r', encoding='utf-8') as f:
        return f.read()


def write_in_txt(comments_by_id, filename):
    """Append one line per post: id####record1####record2####..."""
    with io.open(DATA_DIR + filename, 'a', encoding='utf-8') as output:
        for weibo_id, records in comments_by_id.items():
            comment_str = u''.join(record + u'####' for record in records)
            output.write(weibo_id + u'####' + comment_str + u'\n')


result_dict = {}
# Input: the id list from step 1 (written there as 'weibo_id', read here as 'weibo_id1.txt').
for line in read_from_txt('weibo_id1.txt').split('\n'):
    line = line.strip()
    if not line:
        continue  # skip blank lines such as the one after the final newline
    # Each line is "page,id" as written in step 1; a bare id is accepted too.
    weibo_id = line.split(',')[-1]
    try:
        record_list = []
        page = 1
        while True:
            url = BASE_URL + weibo_id + '&page=' + str(page)
            jsondata = requests.get(url, headers=make_headers(), timeout=100).json()
            if jsondata.get('ok') != 1:
                break  # ok is no longer 1: nothing (more) to read for this post
            for d in jsondata.get('data') or []:
                user = d.get('user') or {}
                # Record layout: user_id$$like_count$$user_name$$comment
                record_list.append(SEP.join([
                    u'{0}'.format(user.get('id')),
                    u'{0}'.format(d.get('like_counts')),
                    u'{0}'.format(user.get('screen_name', u'')).replace(SEP, u''),
                    u'{0}'.format(d.get('text', u'')).replace(SEP, u''),
                ]))
            page += 1
        result_dict[weibo_id] = record_list
        time.sleep(random.randint(2, 3))
    except Exception:
        # Log the failed post id and carry on with the next post.
        print(weibo_id)
        print('*' * 100)
print('ok')

write_in_txt(result_dict, 'comment1.txt')
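The comment script above writes one line per post in the form id####record####record####, where each record is user_id$$like_count$$user_name$$comment. A minimal sketch for reading such a line back follows; parse_comment_line is a hypothetical helper and the sample line is hand-written, not real data. One limitation to keep in mind: the script strips $$ from the text but not ####, so a comment that itself contains #### would be split incorrectly.

```python
def parse_comment_line(line):
    """Split one output line into (post_id, list of comment dicts)."""
    parts = [part for part in line.rstrip('\n').split('####') if part]
    post_id, comments = parts[0], []
    for raw in parts[1:]:
        # maxsplit=3: only the first three '$$' act as field separators
        user_id, like_count, user_name, text = raw.split('$$', 3)
        comments.append({'user_id': user_id, 'like_count': like_count,
                         'user_name': user_name, 'text': text})
    return post_id, comments


# Hand-written sample line in the script's output format (not real data).
sample_line = '4131150395559419####42$$3$$alice$$Nice post####43$$0$$bob$$Thanks####\n'
post_id, comments = parse_comment_line(sample_line)
print(post_id, len(comments), comments[1]['user_name'])
```

All fields come back as strings; convert like_count with int() if you need to sort or sum by it.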
# encoding=utf-8
""" User-Agents """
agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
# encoding=utf-8
""" cookies """
cookies = [
    "SINAGLOBAL=6061592354656.324.1489207743838; un=18240343109; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; cross_origin_proto=SSL; WBStorage=82ca67f06fa80da0|undefined; UOR=,,login.sina.com.cn; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; SSOLoginState=1511882345; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0OIIQzSqq7xsdv-_GhEe8XWdkHikzsFJyqtvqej6OkaM.; SUB=_2A253GQ45DeThGeRP71IQ9y7NyDyIHXVUb3jxrDV8PUNbmtAKLWrSkW9NTjfYoWTfrO0PkXSICRzowbfjExbQidve; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFaVAdSwLmvOo1VRiSlRa3q5JpX5KzhUgL.FozpSh5pS05pe052dJLoIfMLxKBLBonL122LxKnLB.qL1-z_i--fiKyFi-2Xi--fi-2fiKyFTCH8SFHF1C-4eFH81FHWSE-RebH8SE-4BC-RSFH8SFHFBbHWeEH8SEHWeF-RegUDMJ7t; SUHB=04W-u1HCo6armH; ALF=1543418344; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882443; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0-14gBQox9IhSK8vZVaZYWsLxUaOWNkudAR9iT6NFJkg.; SUB=_2A253GQ6bDeRhGeNH6FsZ8CjLzj2IHXVUb2dTrDV8PUNbmtAKLWTjkW9NSqHIBUvGapKd6-MQhJTejk3w_ivUUNXZ; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5gYdHWIHRmedh9Nyrij6XN5JpX5K2hUgL.Fo-4e0.RehqNSK22dJLoI0.LxK-L122LB.qLxK-LB.BLBKqLxKMLB.2LBKzLxKnL12-L122LxK.LBK2L12qLxKqLBKqL1KHiqc-t; SUHB=0auwlDzUYulNGs; ALF=1543418442; un=13728408992; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882512; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog089iFKjxeT1Oc6cbJkkqgWrnQAuMVukRrJy3898cKIb8.; SUB=_2A253GQ9ADeRhGeNH6FsZ8ynJzz6IHXVUb2eIrDV8PUNbmtAKLVWhkW9NSqG4DzNeLkyPCmJIKq6bXfKXpSRCPLqO; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W50J-rDh2D6-QEqNOZ2NddF5JpX5K2hUgL.Fo-4e0.Re0MfShz2dJLoIEeLxK-LB--L1KeLxK-L1hqLBoMLxKnL1K5LBo8IC281xEfIg5tt; SUHB=0gHiPrbPWNJvao; ALF=1543418511; un=15614187608; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; wb_cusLike_5939837542=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882567; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog02c5hBW41ia6vpj1cAqbFzE2KCcsXvDxToS_KOeUnwRc.; SUB=_2A253GQ8XDeRhGeNH6FsZ9CjKyjuIHXVUb2ffrDV8PUNbmtAKLU7wkW9NSqGOexL53l1CujvuLpAFNeOEsl05T_5E; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWuISqBnuGqpyxGiWdJ4bOv5JpX5K2hUgL.Fo-4e0.RShqceKM2dJLoI0YLxK-L1K5L1K2LxK.L1KnLBoeLxK-L1K5L1K2LxKqL1-2L1KqLxK.L1KMLBo-LxKMLB.zLB.qLxK-L1hML1-Bt; SUHB=0LcSwyK5XYMzbr; ALF=1543418566; un=13242833134; wvr=6",
]
Posted on 2017-11-30 10:41 by Denise_hzf