Fetching Web Page Source Code with Python 3.x

1. Fetch the page's response headers to determine its encoding:

import urllib.request
res = urllib.request.urlopen('http://www.163.com')
# info() returns the response headers
print(res.info())
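
The headers printed above usually include a Content-Type line, and the charset can be read from it in code rather than by eye. A minimal sketch, assuming the server declares a charset there; the helper name `charset_from_headers` and the 'utf-8' fallback are illustrative choices, not part of the original post:

```python
from email.message import Message


def charset_from_headers(headers, default='utf-8'):
    """Return the charset named in the Content-Type header, or `default`."""
    # res.info() returns an email.message.Message subclass, so
    # get_content_charset() is available on it; it returns None
    # when the header carries no charset parameter.
    return headers.get_content_charset() or default


# Offline demonstration with a hand-built header set (hypothetical values).
sample = Message()
sample['Content-Type'] = 'text/html; charset=GBK'
print(charset_from_headers(sample))
```

With a live response the call would be charset_from_headers(res.info()). The charset name comes back in lowercase.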

2. Fetch the page source:

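# Import the request module from the urllib package
import urllib.request
# URL of the page to fetch; it must include the scheme (http:// or https://)
url = r'http://fund.eastmoney.com/340007.html?spm=search'
# urlopen() sends the request and returns the server's response as an object
res = urllib.request.urlopen(url)
# read() returns the HTML as raw bytes, which must be decoded. To find the right
# encoding, open the target page in a browser, right-click, and view the source
html = res.read().decode('utf-8')
print(html)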
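
Hard-coding 'utf-8' fails on pages served in another encoding such as GBK. The sketch below shows one way to decode the bytes from read() more defensively: try a declared charset first, then look for a charset declaration in a meta tag of the page itself, and finally fall back to UTF-8 with replacement characters. The function name `decode_html` and the regular expression are assumptions for illustration; real pages can declare their encoding in ways this simple pattern misses.

```python
import re

# Matches charset=... inside a <meta> tag near the top of the page.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, declared=None):
    """Decode page bytes using the declared charset, a <meta> hint, or UTF-8."""
    candidates = []
    if declared:
        candidates.append(declared)
    match = META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode('ascii'))
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that do not fit it: try the next one.
            continue
    # Last resort: never raise, but mark undecodable bytes.
    return raw.decode('utf-8', errors='replace')


# Offline demonstration with a hand-built GBK page (hypothetical content).
page = '<html><head><meta charset="gbk"></head><body>基金</body></html>'.encode('gbk')
print(decode_html(page))
```

With a live response, the call might look like decode_html(res.read(), res.headers.get_content_charset()).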

3. Full version (adds request headers to pose as a browser):

import urllib.request
url = r'http://fund.eastmoney.com/340007.html?spm=search'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode('utf-8')
print(html)
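
For repeated use, the same steps can be wrapped in one function that also sets a timeout and reports failures instead of crashing. This is a sketch under assumptions: the function name `fetch_html`, the 10-second timeout, and the choice to return None on failure are illustrative, not from the original post.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent string, as in the listing above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails."""
    req = urllib.request.Request(url=url, headers=HEADERS)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as res:
            # Use the charset from Content-Type when present, else UTF-8.
            charset = res.headers.get_content_charset() or 'utf-8'
            return res.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404 or 500.
        print('HTTP error:', err.code, err.reason)
    except urllib.error.URLError as err:
        # The request could not be completed (DNS failure, refused connection, ...).
        print('Connection error:', err.reason)
    return None
```

Calling fetch_html('http://fund.eastmoney.com/340007.html?spm=search') would then return the page text, or print the error and return None.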

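Note: urllib.request.Request() only builds a request object (URL, headers, and optional data); nothing is sent to the server at that point.

urllib.request.urlopen() is what actually sends the request, as an HTTP client would, and its return value is the server's response.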
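
The division of labor between Request() and urlopen() can be checked without touching the network, because a Request object can be built and inspected before anything is sent. A small sketch using only documented attributes of urllib.request.Request; the shortened User-Agent value is a placeholder:

```python
import urllib.request

url = 'http://fund.eastmoney.com/340007.html?spm=search'
req = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/5.0'})

# Nothing has been sent yet; req only describes the request to make.
print(req.full_url)                  # the URL the request targets
print(req.get_method())              # GET, because no data was attached
print(req.get_header('User-agent'))  # header names are stored capitalized

# Only this call would open a connection and return the response:
# res = urllib.request.urlopen(req)
```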

posted @ 2017-12-28 17:13 整合侠
