Fetching a web page's HTML with urllib.request
Syntax: urllib.request.urlopen
As the name suggests, it opens a URL and returns a response object.
# Import urllib
import urllib.request

# Open the URL (no POST data, 10-second timeout)
response = urllib.request.urlopen('https://movie.douban.com/', None, 10)
# Read the response body and decode it
html = response.read().decode('utf-8')
# Write it to a text file named 'html'
with open('html', 'w', encoding='utf-8') as f:
    f.write(html)
In other words: open a web page, read the response, decode it, and save the result to a text file.
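As an aside on the decoding step: the script hard-codes 'utf-8', which only works when the page really is UTF-8. A minimal sketch of decoding with the charset the server declares, falling back to UTF-8 otherwise, might look like this. The helper name decode_body and its fallback parameter are made up for illustration and are not part of urllib.

```python
def decode_body(body, headers, fallback="utf-8"):
    """Decode raw response bytes using the charset declared in the headers.

    `headers` is expected to behave like `response.headers`, which is an
    email.message.Message subclass and so offers get_content_charset().
    The function name and the `fallback` parameter are illustrative only.
    """
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or fallback
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The server declared a charset name Python does not know.
        return body.decode(fallback, errors="replace")
```

With a response object from urlopen, this would be called as decode_body(response.read(), response.headers).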
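Running this first script, however, raised an error: urllib.error.HTTPError: HTTP Error 418:
The likely cause is that urllib identifies itself with a default User-Agent such as Python-urllib/3.x, and the site refuses requests that look like they come from a script.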
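To see what the server actually answered rather than letting the script crash, the error can be caught. The following is only a sketch: fetch_status and its opener parameter are hypothetical names, and opener exists purely so the function can be tried without a network connection.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes) instead of raising on HTTP errors.

    `opener` is a hypothetical injection point; by default it is the
    real urllib.request.urlopen.
    """
    try:
        with opener(url, None, timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # An HTTPError doubles as a response object: it carries the
        # status code and whatever body the server sent with the error.
        return err.code, err.read()
```

Calling fetch_status('https://movie.douban.com/') with the default opener should then return the 418 status and the error body instead of raising, assuming the site still responds the way it did for the script above.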
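Solution:
Add a request header to the request (not to the URL itself) so that it looks like it comes from a browser.
Capture the browser's traffic with Fiddler, find the request headers, and copy the browser's User-Agent: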
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36
The code is as follows:
# Import urllib
import urllib.request

# Define the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'}
# Wrap the URL in a Request object that carries the headers
_url = urllib.request.Request('https://movie.douban.com/', headers=headers)
# Open the request (no POST data, 10-second timeout)
response = urllib.request.urlopen(_url, None, 10)
# Read the response body and decode it
html = response.read().decode('utf-8')
# Write it to a text file named 'html'
with open('html', 'w', encoding='utf-8') as f:
    f.write(html)
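Before sending anything, it can be useful to confirm that the header really is attached to the request. Here is a small sketch that does this offline; build_request and BROWSER_UA are illustrative names, not part of urllib. One detail worth knowing is that Request normalizes header names with str.capitalize(), so the stored key is 'User-agent' rather than 'User-Agent'.

```python
import urllib.request

# The User-Agent string captured with Fiddler earlier in this post.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a browser-like User-Agent.

    The helper name is hypothetical; it only wraps urllib.request.Request.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://movie.douban.com/")
# Header names are stored capitalized, so look them up as 'User-agent'.
print(req.get_header("User-agent"))
```

The resulting object can be passed to urllib.request.urlopen in place of _url in the script above.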