Problems Encountered When Writing Crawlers
I. urllib and urllib2 issues in Python 2 and Python 3
1. urllib2 no longer exists in Python 3. Fix:
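In Python 3.x, urllib2 was replaced by urllib.request (its exception classes moved to urllib.error), so import from urllib.request instead.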
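As a rough illustration of the rename, the sketch below imports the same names from either location so one script can run on both interpreters. The helper build_request and the User-Agent string are made up for this example; no network request is sent.

```python
try:
    # Python 3: the former urllib2 API lives in urllib.request / urllib.error
    from urllib.request import Request, urlopen
    from urllib.error import URLError
except ImportError:
    # Python 2 fallback
    from urllib2 import Request, urlopen, URLError


def build_request(url, user_agent="example-crawler/0.1"):
    """Hypothetical helper: build a Request with a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.pythonscraping.com/pages/page1.html")
print(req.get_full_url())
```

Calling urlopen(req) would then perform the actual fetch; it is left out here so the snippet runs offline.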
2. AttributeError: 'module' object has no attribute 'urlencode'. Fix:
In Python 3, urlencode moved from urllib to urllib.parse, so add import urllib.parse and call urllib.parse.urlencode(...).
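A minimal sketch of the Python 3 call, assuming a small made-up dictionary of form fields (the keys and values are illustrative only):

```python
import urllib.parse

# Hypothetical form fields, used only for illustration
values = {"username": "alice", "page": 2}

# urlencode returns a str; non-string values are converted with str()
query = urllib.parse.urlencode(values)
print(query)

# parse_qs reverses the encoding (every value comes back as a list of strings)
print(urllib.parse.parse_qs(query))
```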
3. TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str. Fix:
Change the original:
data = urllib.urlencode(values)
to:
data = urllib.parse.urlencode(values).encode(encoding='UTF8')
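The fix above can be sketched as follows. The URL and form fields are placeholders, and the request is only constructed, not sent, so the snippet runs without network access.

```python
import urllib.parse
import urllib.request

# Placeholder form fields and URL, for illustration only
values = {"username": "alice", "password": "secret"}
login_url = "http://example.com/login"

# urlencode gives a str; POST bodies must be bytes in Python 3
data = urllib.parse.urlencode(values).encode("utf-8")

# Supplying data makes this a POST request
req = urllib.request.Request(login_url, data=data)
print(type(data).__name__, req.get_method())
```

In practice the error tends to surface when the request is opened with urlopen rather than when the Request object is created, which is why the traceback may point at the urlopen call.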
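4. TypeError: Can't convert 'bytes' object to str implicitly. Fix:
bytes and str cannot be concatenated directly, so an explicit encode or decode step is needed:
import urllib.parse
data = urllib.parse.urlencode(values).encode(encoding='utf8')
url = 'https://passport.cnblogs.com/user/signin?ReturnUrl=xxxxxxxxxx'
# data is bytes, so decode it back to str before joining it to the URL;
# the URL already has a query string, so join with '&' rather than '?'
geturl = url + '&' + data.decode()
print(geturl)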
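One way to sidestep both the bytes/str mix-up and the choice of separator is a small helper like the one below. append_query is a hypothetical name, not part of urllib, and the URLs are placeholders.

```python
from urllib.parse import urlencode, urlsplit


def append_query(url, params):
    """Hypothetical helper: append params to url as a query string (all str)."""
    query = urlencode(params)  # str, so no decode step is needed
    # Use '&' if the URL already carries a query string, otherwise '?'
    separator = "&" if urlsplit(url).query else "?"
    return url + separator + query


print(append_query("http://example.com/signin", {"user": "alice"}))
print(append_query("http://example.com/signin?ReturnUrl=home", {"user": "alice"}))
```

For a GET URL everything stays as str; encoding to bytes is needed only for data passed as a POST body.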
5. UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser")
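Fix: pass "html.parser" as the second argument to BeautifulSoup so that the parser is chosen explicitly:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Naming the parser explicitly silences the warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)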