China University MOOC: Study Notes (4)

 

Targeted Crawler for Taobao Product Price Comparison

Goal: fetch Taobao search result pages and extract the product names and prices from them.

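Program structure:

  1. Submit the product search request and fetch the result pages in a loop
  2. For each page, extract the product names and prices
  3. Print the results to the screen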
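
Step 1 amounts to requesting the same search URL with a growing item offset. The sketch below shows one way such page URLs could be generated with `urllib.parse.urlencode`, which also percent-encodes the keyword. The helper name `build_search_urls` is hypothetical, and the `page_size` default of 44 is an assumption taken from the `44*i` offset used in the program, not from any documented Taobao API.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"


def build_search_urls(keyword, depth, page_size=44):
    """Return one search URL per page, stepping the 's' offset by page_size.

    page_size=44 is an assumption mirroring the offset used in these notes.
    """
    return [
        BASE_URL + "?" + urlencode({"q": keyword, "s": page_size * i})
        for i in range(depth)
    ]


if __name__ == "__main__":
    for url in build_search_urls("书包", 2):
        print(url)
```

Building the query with `urlencode` keeps the keyword and offset in one place, so changing the number of pages only means changing `depth`.
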
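
import re

import requests


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # A timeout keeps the crawler from hanging on an unresponsive server.
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""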
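
def parsePage(ilt, html):
    """Append [price, title] pairs found in html to ilt."""
    # Capture groups return the quoted values directly, so eval() is not
    # needed and a title containing ':' is no longer cut off by split(':').
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    # zip() stops at the shorter list, so a missing field cannot raise IndexError.
    for price, title in zip(plt, tlt):
        ilt.append([price, title])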

def printGoodList(ilt):
    """Print the collected [price, title] pairs as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

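def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        # 's' is the offset of the first item; the code assumes 44 items per page.
        url = start_url + '&s=' + str(44 * i)
        html = getHTMLText(url)  # returns "" on failure, so the page is skipped
        parsePage(infoList, html)
    printGoodList(infoList)


if __name__ == '__main__':
    main()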
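
To see what the two regular expressions in parsePage pick up without touching the network, they can be tried on a small hand-written string. The sample below is made up for illustration and only imitates the `"view_price"` and `"raw_title"` fields the patterns look for; a real page may be laid out differently. `extract_items` is a hypothetical helper name, not part of the original program.

```python
import re

# Made-up fragment imitating the fields the patterns target (an assumption,
# not captured from a real Taobao page).
SAMPLE_HTML = (
    '{"raw_title":"Canvas backpack","view_price":"59.00"},'
    '{"raw_title":"Leather bag: large","view_price":"128.50"}'
)


def extract_items(html):
    """Return a list of [price, title] pairs found in html."""
    # Each pattern has one capture group, so findall returns the bare values.
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return [[p, t] for p, t in zip(prices, titles)]


if __name__ == "__main__":
    for price, title in extract_items(SAMPLE_HTML):
        print(price, title)
```

The second sample title contains a colon on purpose: splitting the match on `':'` would truncate it, while a capture group keeps it whole.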

 

posted @ 2018-02-20 13:40  未来分析师