Python Crawler Examples (4): Scraping NetEase News
I put this crawler together in some idle time to scrape NetEase news. The main effort is in analyzing the site: a packet-capture tool was used to trace every request behind the news page, and the scraped data is stored in SQLite. Only the text of each news page is parsed here; images are not handled.

For reference only; corrections are welcome.
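Before the full script, a note on how the feed itself works: packet capture shows that the headline list is not in the page HTML but is served as JSONP, a JSON array wrapped in a data_callback(...) call. Below is a minimal sketch of decoding one page of that feed without eval(), assuming the wrapped payload is strict JSON; if the feed emits looser JavaScript (unquoted keys, trailing commas), fall back to the eval approach used in the full script.

import json
import requests

# Fetch the first page of the "yaowen" feed and strip the JSONP wrapper.
url = "http://temp.163.com/special/00804KVA/cm_yaowen.js?callback=data_callback"
text = requests.get(url, headers={"Referer": "http://news.163.com/"}).text
# Keep everything between the first '(' and the last ')'.
payload = text[text.index("(") + 1 : text.rindex(")")]
items = json.loads(payload)  # assumes the payload is valid JSON
for item in items:
    print("%s -> %s" % (item["title"], item["tlink"]))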
# coding:utf-8
# NOTE: this is Python 2 code (print statements, reload(sys), setdefaultencoding).
import hashlib
import json
import re
import sqlite3
import sys

import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')

session = requests.session()


def md5(s):
    # Hex MD5 digest of a string; used as a de-duplication key for article links.
    m = hashlib.md5()
    m.update(s)
    return m.hexdigest()


def wangyi():
    # The "yaowen" (top stories) feed is paginated: page 1 is cm_yaowen.js,
    # later pages are cm_yaowen_02.js, cm_yaowen_03.js, and so on.
    for i in range(1, 3):
        if i == 1:
            k = ""
        else:
            k = "_0" + str(i)
        url = "http://temp.163.com/special/00804KVA/cm_yaowen" + k + ".js?callback=data_callback"
        print url
        headers = {
            "Host": "temp.163.com",
            "Connection": "keep-alive",
            "Accept": "*/*",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 LBBROWSER",
            "Referer": "http://news.163.com/",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "zh-CN,zh;q=0.8",
        }
        result = session.get(url=url, headers=headers).text
        try:
            # The response is JSONP: data_callback([...]). Strip the wrapper
            # and eval the remaining array literal into a Python list of dicts.
            result1 = eval(eval(json.dumps(result).replace('data_callback(', '').replace(')', '').replace(' ', '')))
        except:
            # Could not parse this page of the feed; skip it.
            continue
        try:
            for item in result1:
                tlink = item['tlink']
                headers2 = {
                    "Host": "news.163.com",
                    "Connection": "keep-alive",
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                    "Upgrade-Insecure-Requests": "1",
                    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 LBBROWSER",
                    "Accept-Encoding": "gzip, deflate, sdch",
                    "Accept-Language": "zh-CN,zh;q=0.8",
                }
                print "fetching article:", tlink
                return_data = session.get(url=tlink, headers=headers2).text
                try:
                    # The article body lives in <div id="endText">; pull the text
                    # out of its <p> tags and join the paragraphs with a separator.
                    soup = BeautifulSoup(return_data, 'html.parser')
                    returnSoup = soup.find_all("div", attrs={"id": "endText"})[0]
                    try:
                        returnList = re.findall('<p>(.*?)</p>', str(returnSoup))
                        content1 = '<-->'.join(returnList)
                    except:
                        content1 = ""
                    try:
                        returnList1 = re.findall('<p class="f_center">(.*?)</p>', str(returnSoup))
                        content2 = '<-->'.join(returnList1)
                    except:
                        content2 = ""
                    content = content1 + content2
                except:
                    content = ""

                cx = sqlite3.connect("C:\\Users\\xuchunlin\\PycharmProjects\\study\\db.sqlite3", check_same_thread=False)
                cx.text_factory = str
                try:
                    print "inserting data for %s" % tlink
                    title = item['title'].decode('unicode_escape')
                    commenturl = item['commenturl']
                    tienum = item['tienum']
                    opentime = item['time']
                    url2 = md5(str(tlink))  # MD5 of the article link, used as a de-duplication key
                    cx.execute(
                        "INSERT INTO wangyi (title,tlink,commenturl,tienum,opentime,content,url) VALUES (?,?,?,?,?,?,?)",
                        (str(title), str(tlink), str(commenturl), str(tienum), str(opentime), str(content), str(url2)))
                except Exception as e:
                    print e
                    print "insert failed"
                cx.commit()
                cx.close()
        except:
            pass


wangyi()
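One thing the script takes for granted: the INSERT assumes a wangyi table already exists in db.sqlite3, and nothing above creates it. Here is a hypothetical one-off setup snippet; the column names are taken from the INSERT statement, while the TEXT types are an assumption (the scraper casts every value to str before inserting).

import sqlite3

# Hypothetical setup: column names match the scraper's INSERT statement;
# TEXT types are an assumption, since every value is stringified anyway.
cx = sqlite3.connect("db.sqlite3")
cx.execute("""
    CREATE TABLE IF NOT EXISTS wangyi (
        title      TEXT,
        tlink      TEXT,
        commenturl TEXT,
        tienum     TEXT,
        opentime   TEXT,
        content    TEXT,
        url        TEXT
    )
""")
cx.commit()
cx.close()

The url column stores the MD5 of the article link; adding a UNIQUE constraint on it would let SQLite reject duplicate articles automatically.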
If you found this post helpful, please click "Recommend". Thank you!
Even the best memory is no match for a dull pen.