Simple Web Scraper Examples
Coding tool: Jupyter
Packet capture tool: Fiddler
1: Scraping the Sogou home page
import requests

# Request the Sogou home page and read the decoded HTML
url = 'https://www.sogou.com/'
response = requests.get(url=url)
text = response.text
text
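Displaying the whole `text` value in a notebook cell can be unwieldy, so a quick check of the page title is one way to confirm that the expected page came back. The sketch below is only an illustration: `page_title` and `sample_html` are hypothetical names that do not appear in the original, and the sample string stands in for `response.text` so the snippet runs without a network connection.

```python
import re

def page_title(html):
    """Return the text inside the first <title> tag, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical sample standing in for response.text
sample_html = "<html><head><title> Example Page </title></head><body></body></html>"
print(page_title(sample_html))
```

A regular expression is enough for this kind of sanity check; for real extraction work a proper HTML parser would be the more robust choice.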
2: Scraping Douban movies by category
import requests

url = 'https://movie.douban.com/j/new_search_subjects'
# Query-string parameters captured from the browser request
param = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': '0',
    'genres': '爱情'
}
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
response = requests.get(
    url=url,
    headers=headers,
    params=param,
)
# The endpoint returns JSON, so parse it instead of reading raw text
text = response.json()
text
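The `params` dictionary above ends up in the URL's query string. As a rough illustration of what that looks like, the sketch below builds the query string by hand with the standard library's `urlencode`. This is an approximation of the encoding step, not a claim about exactly what `requests` does internally; `full_url` and `decoded` are names introduced here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

url = 'https://movie.douban.com/j/new_search_subjects'
param = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': '0',
    'genres': '爱情',
}

# Non-ASCII values are percent-encoded as UTF-8 bytes
full_url = url + '?' + urlencode(param)
print(full_url)

# Decoding the query string recovers the original values
decoded = parse_qs(urlsplit(full_url).query, keep_blank_values=True)
print(decoded['genres'])
```

Comparing a URL built this way with the one shown in Fiddler is a convenient way to check that every parameter was copied correctly.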
3: Scraping search results for a term and writing them to a file
import requests

url = 'https://www.sogou.com/web'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
# The search term is passed as the 'query' parameter
param = {
    'query': '校花'
}
response = requests.get(
    url=url,
    headers=headers,
    params=param
)
# response.content is the raw bytes, so open the file in binary mode
text = response.content
with open('xh.html', 'wb') as f:
    f.write(text)
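This example uses `response.content` (bytes) rather than `response.text` (str), which is why the file is opened with `'wb'`. The sketch below shows the same write step in isolation, without a network call. The `raw` value is a hypothetical stand-in for `response.content`, and the file goes to a temporary directory so nothing in the working directory is overwritten.

```python
import os
import tempfile

# Hypothetical stand-in for response.content (bytes, here UTF-8 encoded)
raw = '<html><body>校花</body></html>'.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'xh.html')

# Binary mode writes the bytes exactly as received, with no re-encoding
with open(path, 'wb') as f:
    f.write(raw)

with open(path, 'rb') as f:
    saved = f.read()

print(saved == raw)
print(saved.decode('utf-8'))
```

Writing the bytes unchanged means the saved page keeps whatever encoding the server used, so a browser opening the file can still rely on the page's own charset declaration.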
4: Scraping the national drug regulator (国家药监总局) site: fetching dynamically generated content
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
# Step 1: the list is loaded dynamically through a POST request
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
data = {
    'on': 'true',
    'page': '3',
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': ''
}
response = requests.post(url=url, headers=headers, data=data)
conn_list = response.json()['list']

# Step 2: request the detail record for each entry by its ID
url_c = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
comm_info_list = []
for i in conn_list:
    # Skip entries without an XC_DATE value instead of posting an empty id
    if i['XC_DATE']:
        data_c = {'id': i['ID']}
        res = requests.post(url=url_c, data=data_c, headers=headers)
        comm_info_list.append(res.json())
comm_info_list
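The pattern in this example is two-step: fetch a list, then use each entry's `ID` to fetch its detail record. The sketch below isolates the step in between, turning list entries into detail payloads, so it can be tried without hitting the server. `detail_payloads` and `sample_list` are hypothetical; the sample records only mimic the `ID` and `XC_DATE` keys used above and are not real data from the site.

```python
def detail_payloads(records):
    """Build one POST payload per record that has a non-empty XC_DATE."""
    return [{'id': r['ID']} for r in records if r.get('XC_DATE')]

# Hypothetical records shaped like entries of response.json()['list']
sample_list = [
    {'ID': 'aaa111', 'XC_DATE': '2019-01-01'},
    {'ID': 'bbb222', 'XC_DATE': ''},
    {'ID': 'ccc333', 'XC_DATE': '2019-02-01'},
]

payloads = detail_payloads(sample_list)
print(payloads)
```

Each payload could then be passed as `data=` to `requests.post` for the detail URL, as in the loop above. Keeping the payload construction separate makes the filtering rule easy to check on its own.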