Scraping Taobao Products

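I. Project Requirements

1. Taobao's pages are loaded entirely through Ajax, and the requests carry encrypted parameters, so we use Selenium to drive a real browser and scrape the product information.

2. Scrape the Taobao search results for the keyword iPad and store the data in MongoDB.

3. The scraped data must include each product's image, name, price, number of buyers, shop name, and shop location.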

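II. Project Analysis

The entry point for the crawl is Taobao's search page, https://s.taobao.com/search?q=iPad:

[Screenshot: Taobao search results page for "iPad"]

At the bottom of the page there is a pagination bar. It contains links to the first five pages, a "next page" link, and a text box for jumping to an arbitrary page number. The search returns 100 pages of results, so the page count is fixed and fetching everything only requires iterating over page numbers 1 to 100. To reach a given page, we type the page number into the jump box and click the confirm button.

Why not simply click "next page"? If the crawl exits abnormally partway through, say at page 50, clicking "next page" offers no quick way to get back to where the crawl left off. We would also have to track the current page number during the crawl, and whenever a page failed to load after a click we would need extra checks to work out which page the browser was actually on. That makes the whole flow fairly complicated, so we take the blunt approach instead: locate the input box, type the page number, and click the button to jump. With that decided, we can start scraping with Selenium: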

from urllib.parse import quote

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)  # wait at most 10 seconds per condition
KEYWORD = 'iPad'


def index_page(page):
    """Scrape one search-results (index) page."""
    print(f'Scraping page {page}')
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Locate the page-number box and the confirm button in the pager.
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))  # type the target page number
            submit.click()
        # The jump succeeded once the highlighted pager item shows the target page.
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait until at least one product has been rendered.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retry the same page if any wait timed out.
        index_page(page)

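Here we first create a WebDriver object and set the keyword to 'iPad', then define index_page() to scrape a results page. The method first opens the search URL and then checks the page number: if it is greater than 1, it performs the page jump; otherwise it simply waits for the page to finish loading.

Waiting is handled by a WebDriverWait object, which takes a wait condition and a maximum wait time, here 10 seconds. If the condition is met within that time, meaning the element has loaded, the result is returned immediately and execution continues. If the element still has not loaded when the time runs out, a timeout exception is raised, and index_page() calls itself to retry the same page.

To jump to a page, the code first gets the page-number input box and assigns it to input, then gets the confirm button and assigns it to submit. It clears the input box, calls send_keys() to type the page number into it, and clicks the confirm button.

How do we know whether the jump succeeded? The current page number is highlighted in the pager, so it is enough to check that the highlighted number is the page we asked for. For this we use another wait condition, text_to_be_present_in_element, which succeeds once the given text appears inside the given node. We pass it the CSS selector of the highlighted page-number node and the target page number; when the highlighted node contains that number, the jump has succeeded. Next, we implement get_products() to parse the products: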

from pyquery import PyQuery as pq


def get_products():
    """Parse every product on the current page and save it."""
    html = browser.page_source
    document = pq(html)
    items = document('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            # Product name, as listed in the requirements; the '.title'
            # selector is an assumption about the page markup.
            'title': item.find('.title').text(),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text(),
        }
        print(product)
        save_to_mongo(product)

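First, the page_source property gives the page's HTML source, from which we build a PyQuery object. We then select the product list with the CSS selector #mainsrp-itemlist .items .item, which matches every product on the page. The match yields multiple results, so we iterate over them with a for loop. Each iteration binds one result to item, which is itself a PyQuery object, and calling its find() method with a CSS selector extracts a specific piece of that product's data. The last step is to save the data we need to MongoDB.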
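The 'price' and 'deal' fields collected by get_products() are plain strings taken from the page. Before saving them, it may be convenient to add numeric versions so they can be sorted or filtered in MongoDB. The following is a minimal sketch of that idea using only the standard library. The helper names (parse_price, parse_deal_count, normalize_product), the added keys, and the sample strings are all assumptions made for illustration; the actual text Taobao renders may differ.

```python
import re


def parse_price(text):
    """Return the first decimal number found in a price string, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text or '')
    return float(match.group()) if match else None


def parse_deal_count(text):
    """Return the first integer found in a buyer-count string, or None."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None


def normalize_product(product):
    """Return a copy of product with numeric price and buyer-count fields added."""
    cleaned = dict(product)
    cleaned['price_value'] = parse_price(product.get('price'))
    cleaned['deal_count'] = parse_deal_count(product.get('deal'))
    return cleaned


# Hypothetical sample record; the string formats are assumed, not taken from the page.
sample = {'price': '¥2888.00', 'deal': '1234人付款', 'shop': 'example shop'}
print(normalize_product(sample))
```

The original string fields are kept alongside the numeric ones, so nothing scraped from the page is lost if a pattern fails to match.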

III. Full Source Code

from urllib.parse import quote

import pymongo
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)  # wait at most 10 seconds per condition
KEYWORD = 'iPad'
MAX_PAGE = 100

MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_COLLECTION = 'products'
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def index_page(page):
    """Scrape one search-results (index) page."""
    print(f'Scraping page {page}')
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Locate the page-number box and the confirm button in the pager.
            input = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))  # type the target page number
            submit.click()
        # The jump succeeded once the highlighted pager item shows the target page.
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait until at least one product has been rendered.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        get_products()
    except TimeoutException:
        # Retry the same page if any wait timed out.
        index_page(page)


def get_products():
    """Parse every product on the current page and save it."""
    html = browser.page_source
    document = pq(html)
    items = document('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            # Product name, as listed in the requirements; the '.title'
            # selector is an assumption about the page markup.
            'title': item.find('.title').text(),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text(),
        }
        print(product)
        save_to_mongo(product)


def save_to_mongo(result):
    """Insert one product document into MongoDB."""
    try:
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed.
        db[MONGO_COLLECTION].insert_one(result)
        print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB')


def main():
    """Scrape every page from 1 to MAX_PAGE, then close the browser."""
    try:
        for i in range(1, MAX_PAGE + 1):
            index_page(i)
    finally:
        browser.quit()


if __name__ == '__main__':
    main()
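The analysis in section II notes that a crawl may exit partway through, for example at page 50, and index_page() retries a timed-out page by calling itself with no upper limit. One possible way to handle both is sketched below: record the next page to fetch in a small checkpoint file and cap the number of attempts per page. Everything here is hypothetical and not part of the original program: the names crawl, load_next_page and save_next_page, the checkpoint file, and the retry_on parameter. The sketch also assumes a scrape_page callable that raises on failure instead of retrying internally.

```python
import json
from pathlib import Path


def load_next_page(checkpoint_path):
    """Return the page to resume from, or 1 if no checkpoint exists yet."""
    path = Path(checkpoint_path)
    if not path.exists():
        return 1
    return json.loads(path.read_text())['next_page']


def save_next_page(checkpoint_path, page):
    """Record the next page that still needs to be scraped."""
    Path(checkpoint_path).write_text(json.dumps({'next_page': page}))


def crawl(scrape_page, max_page, checkpoint_path, max_retries=3, retry_on=(Exception,)):
    """Scrape pages from the checkpoint up to max_page with bounded retries."""
    for page in range(load_next_page(checkpoint_path), max_page + 1):
        for attempt in range(1, max_retries + 1):
            try:
                scrape_page(page)
                break  # this page succeeded, stop retrying
            except retry_on as exc:
                print(f'Page {page}, attempt {attempt} failed: {exc}')
        else:
            # The retry loop ended without a successful attempt.
            raise RuntimeError(f'Giving up on page {page} after {max_retries} attempts')
        # Only advance the checkpoint after the page was scraped successfully.
        save_next_page(checkpoint_path, page + 1)
```

With a variant of index_page() that lets TimeoutException propagate, the loop in main() could become crawl(index_page, MAX_PAGE, 'checkpoint.json', retry_on=(TimeoutException,)). Rerunning the script after a crash would then resume from the recorded page, which fits the jump-by-page-number approach chosen in the analysis.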

Posted 2018-06-21 17:05 by jonas_von
