009 Python Web Crawling and Information Extraction: Taobao Product Price-Comparison Targeted Crawler

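[A] Taobao price-comparison targeted crawler: introduction

　　　　Functional description

　　　　　　Goal:

　　　　　　　　Fetch Taobao search result pages and extract the name and price of each product

　　　　　　Analysis:

　　　　　　　　1. Taobao's search interface; 2. pagination handling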
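
The two analysis points, the search interface and pagination, can be sketched as a small URL builder. This is a minimal sketch that assumes the query-string layout used by the example code later in this post (a `cpage` page number and a `q` keyword on www.yiwugo.com). The helper name `build_search_urls` is hypothetical, and other sites, Taobao included, may use different parameters.

```python
from urllib.parse import urlencode

# Base URL taken from the example code in this post; treat it as an assumption
# about the site's current search interface.
SEARCH_URL = 'http://www.yiwugo.com/search/s.html'


def build_search_urls(keyword, depth):
    """Return one search URL per result page, pages numbered from 1 (hypothetical helper)."""
    urls = []
    for page in range(1, depth + 1):
        # urlencode percent-encodes non-ASCII keywords as UTF-8.
        query = urlencode({'cpage': page, 'q': keyword})
        urls.append(SEARCH_URL + '?' + query)
    return urls


if __name__ == '__main__':
    for url in build_search_urls('小火车', 2):
        print(url)
```

Building the query with `urlencode` keeps the keyword safely encoded, so the crawler does not depend on the HTTP library to repair a URL that contains raw Chinese characters.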

　　　　Technical approach:

　　　　　　requests, re
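
As an illustration of the re half of this approach, the following sketch extracts names and prices from page text with regular expressions alone. It assumes the page embeds each product as `"raw_title":"..."` and `"view_price":"..."` key/value pairs. These field names, the sample string, and the function name `extract_goods` are assumptions made for illustration and should be checked against the real page source.

```python
import re

# Assumed field layout: "raw_title":"<name>" and "view_price":"<price>".
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')
PRICE_RE = re.compile(r'"view_price":"([\d.]+)"')


def extract_goods(html):
    """Return (name, price) tuples found in the page text (hypothetical helper)."""
    titles = TITLE_RE.findall(html)
    prices = PRICE_RE.findall(html)
    # zip pairs the n-th title with the n-th price and stops at the shorter list.
    return list(zip(titles, prices))


if __name__ == '__main__':
    sample = ('{"raw_title":"toy train A","view_price":"19.90"},'
              '{"raw_title":"toy train B","view_price":"35.00"}')
    for name, price in extract_goods(sample):
        print(name, price)
```

Pairing by position is only reliable when every product carries both fields. If a page can omit one of them, matching each product record as a whole would be safer.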

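　　　　Program structure:

　　　　　　Step 1: submit the product search request and fetch the result pages in a loop

　　　　　　Step 2: for each page, extract the product names and prices

　　　　　　Step 3: print the results to the screen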

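[B] Taobao price-comparison targeted crawler: implementation

　　　　Example code (this version queries www.yiwugo.com rather than Taobao and parses the page with BeautifulSoup in addition to re):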

import re

import requests
from bs4 import BeautifulSoup


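# 1. Fetch the page and return its HTML text ('' on any request failure)
def get_HTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the body rather than trusting the headers
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''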


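# 2. Parse the page and return the product records as a list
def parsePage(html):
    ulist = []
    soup = BeautifulSoup(html, 'html.parser')
    # Each product card is a <div class="pro_list_product_img2">
    divs = soup.find_all('div', class_='pro_list_product_img2')
    for k in divs:
        try:
            shop_id = k.attrs['shopid']
            # [18:] drops the first 18 characters of the link text, as in the original code
            name = k.ul('li', 'titheight font_tit')[0].a.text[18:]
            try:
                # Price split across two <font> tags: integer part and decimal part
                price = k.em('font')[0].text + '.' + k.em('font')[1].text
            except Exception:
                # Otherwise take the whole text of <em> as the price
                price = k.em.text
            # Shop address: first run of 3+ Chinese characters, digits, or hyphens
            stress = re.findall('[\u4e00-\u9fa50-9-]{3,}', k('li', 'shshopname')[0].text)[0]
            ulist.append([shop_id, name, price, stress])
        except Exception:
            # Skip cards that do not match the expected layout
            continue
    return ulist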


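# 3. Print the product records collected by parsePage
def printGoodsList(ulist, page=1):
    # The header uses a wider name column than the rows, presumably to offset
    # full-width CJK characters in the product names
    splt1 = '{0:<5}\t{1:<10}\t{2:<60}\t{3:<20}\t{4:<20}'
    splt2 = '{0:<5}\t{1:<10}\t{2:<45}\t{3:<20}\t{4:<20}'
    if page == 1:
        print(splt1.format('No.', 'Shop ID', 'Product name', 'Price', 'Shop address'))
    # The running index assumes 50 results per page
    for k, item in enumerate(ulist):
        print(splt2.format(50 * (page - 1) + k + 1, item[0], item[1], item[2], item[3]))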


def main():
    # Example URL: 'http://www.yiwugo.com/search/s.html?cpage=1&q=连衣裙'
    depth = 2          # number of result pages to fetch
    keyword = '小火车'  # search keyword ("toy train")
    for k in range(1, depth + 1):
        url = 'http://www.yiwugo.com/search/s.html?cpage=' + str(k) + '&q=' + keyword
        html = get_HTMLText(url)
        ulist = parsePage(html)
        printGoodsList(ulist, k)


if __name__ == '__main__':
    main()
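
The two format templates in `printGoodsList` use different widths for the name column, presumably because CJK characters occupy two terminal columns while `str.format` counts each as one. A minimal sketch of padding by display width instead, using the standard `unicodedata` module, is shown here. The helper names `display_width` and `pad` are hypothetical and not part of the original code.

```python
import unicodedata


def display_width(text):
    """Count terminal columns, treating wide and full-width characters as 2."""
    width = 0
    for ch in text:
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width


def pad(text, width):
    """Left-align text in a field of `width` display columns."""
    return text + ' ' * max(0, width - display_width(text))


if __name__ == '__main__':
    for name in ['小火车', 'toy train']:
        print(pad(name, 12) + '|')
```

With a helper like this, the header and the rows could share one column width regardless of how many Chinese characters a product name contains.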

posted @ 2020-11-21 13:46 CarreyB
