Web Scraping Notes

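1. Do not put tbody in XPath expressions. Browsers insert tbody when they render a table, so it shows up in developer tools, but it is often missing from the HTML the request actually returns.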
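A minimal sketch of the tbody problem, assuming lxml is installed. The table markup and the `prices` id are made up for illustration; the point is only that a path copied from the browser can match nothing in the raw HTML.

```python
from lxml import etree

# Hypothetical raw HTML as a server might send it: no <tbody> element.
raw_html = """
<table id="prices">
  <tr><td>Beijing</td><td>42</td></tr>
  <tr><td>Shanghai</td><td>37</td></tr>
</table>
"""

tree = etree.HTML(raw_html)

# Path as copied from browser developer tools, which display an inserted <tbody>.
with_tbody = tree.xpath('//table[@id="prices"]/tbody/tr/td[1]/text()')

# Same path with the tbody step removed.
without_tbody = tree.xpath('//table[@id="prices"]/tr/td[1]/text()')

print(with_tbody)     # nothing matches, because the parsed tree has no tbody
print(without_tbody)  # the first cell of each row
```

If a page really does ship tbody in its source, `//table//tr` matches both cases.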

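2. When scraped text comes out garbled

Encode the string back with the codec it was wrongly decoded with, then decode it with the page's real encoding:

name = name.encode('iso-8859-1').decode('gbk')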
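The round trip can be checked offline. This sketch simulates the bug (GBK bytes decoded as ISO-8859-1) and then undoes it; the sample string is my own, not taken from any page.

```python
original = '北京'

# Simulate the bug: the server sends GBK bytes, the client decodes them as ISO-8859-1.
garbled = original.encode('gbk').decode('iso-8859-1')

# Undo it: recover the raw bytes, then decode them with the correct codec.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

With requests, setting `response.encoding = 'gbk'` before reading `response.text` should avoid the wrong decode in the first place, assuming the page really is GBK.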

3. Make XPath expressions more general: use the pipe operator (|)
# url="https://www.aqistudy.cn/historydata/"
# Get the names of both the hot cities and the ordinary cities

import requests
from lxml import etree
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


url="https://www.aqistudy.cn/historydata/"


response=requests.get(url=url,headers=headers)
page_text=response.text

tree=etree.HTML(page_text)
# Separate expressions, one per city group:
# hot_city=tree.xpath('//div[@class="hot"]//div[@class="bottom"]/ul/li/a/text()')
# print(hot_city)
# all_city=tree.xpath('//div[@class="all"]//div[@class="bottom"]/ul/div[2]/li/a/text()')
# print(all_city)

# One expression covering both groups, joined with the pipe operator:
cities=tree.xpath('//div[@class="hot"]//div[@class="bottom"]/ul/li/a/text() | //div[@class="all"]//div[@class="bottom"]/ul/div[2]/li/a/text()')
print(cities)

# Swapping the two operands selects the same set of nodes:
# cities=tree.xpath('//div[@class="all"]//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="hot"]//div[@class="bottom"]/ul/li/a/text()')
# print(cities)
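The same idea can be tried without hitting the site. This is only a sketch: the markup below is invented and much simpler than the real page, and it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical markup: two city lists whose links sit at different depths.
sample = """
<div class="hot"><ul>
  <li><a>Beijing</a></li><li><a>Shanghai</a></li>
</ul></div>
<div class="all"><ul>
  <li><span><a>Anshan</a></span></li><li><span><a>Anyang</a></span></li>
</ul></div>
"""

tree = etree.HTML(sample)

hot_path = '//div[@class="hot"]/ul/li/a/text()'
all_path = '//div[@class="all"]/ul/li/span/a/text()'

# The pipe operator takes the union of the two node sets in a single query.
cities = tree.xpath(hot_path + ' | ' + all_path)

# Operand order does not change which nodes are selected.
swapped = tree.xpath(all_path + ' | ' + hot_path)

print(cities)
print(set(cities) == set(swapped))
```

One expression means one place to maintain when the page layout changes, at the cost of a longer string.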

4. Common errors
HTTPConnectionPool(host:XX) Max retries exceeded with url:
To close the connection as soon as the request finishes and release the pooled connection: headers={'Connection': 'close'}
To use a proxy IP: requests.get(url=url, headers=headers, proxies={'https': '134.209.13.16:8080'}).text
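Both fixes can be combined in one helper. This is a rough sketch, not a guaranteed cure for the error: `make_session` and `fetch` are names I made up, the timeout value is arbitrary, and the proxy address in the usage line is the one from the note above (with an `http://` prefix added by me), which may no longer be reachable.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    # Ask the server to close the connection after each response.
    'Connection': 'close',
}


def make_session(proxy=None):
    """Return a Session that sends 'Connection: close' and optionally uses a proxy."""
    session = requests.Session()
    session.headers.update(HEADERS)
    if proxy is not None:
        # Route both plain and TLS traffic through the same proxy.
        session.proxies.update({'http': proxy, 'https': proxy})
    return session


def fetch(url, proxy=None, timeout=10):
    """Fetch a page and release the session's pooled connections afterwards."""
    # Leaving the with-block closes the session and its connection pool.
    with make_session(proxy) as session:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
```

Usage would look like `fetch("https://www.aqistudy.cn/historydata/", proxy="http://134.209.13.16:8080")`.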

5. Request parameters for dynamic pages are usually hidden somewhere in the front-end source code.
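One common case is a value assigned to a JavaScript variable in an inline script. The sketch below pulls such values out with a regular expression; the page source, the variable names `token` and `pageSize`, and the helper `extract_js_var` are all hypothetical, and real pages may hide parameters in other ways (hidden inputs, separate .js files).

```python
import re

# Hypothetical page source with request parameters set in an inline script.
page_source = """
<html><body>
<script>
  var token = "a1b2c3";
  var pageSize = 20;
</script>
</body></html>
"""


def extract_js_var(source, name):
    """Return the value assigned in `var name = ...;` as a string, or None."""
    pattern = r'var\s+' + re.escape(name) + r'\s*=\s*["\']?([^"\';]+)["\']?\s*;'
    match = re.search(pattern, source)
    return match.group(1) if match else None


# Collect the values to send along with the follow-up request.
params = {
    'token': extract_js_var(page_source, 'token'),
    'pageSize': extract_js_var(page_source, 'pageSize'),
}
print(params)
```

The resulting dict could then be passed as `params=` or `data=` to the request that loads the dynamic content.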

posted on 2021-03-06 15:22 Plyc
