Implementing Data Scraping in Python (Part 1)

Preface

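This started as a request from my wife. She has an account on a certain website holding a little over 200,000 records, and she wants to move the data under that account into another account. The site, however, only lets you export 20 records at a time to Excel, after which the Excel file has to be imported into the other account. On top of that, searches are capped: a single query can return at most 10,000 rows.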
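To put those limits in numbers, here is a rough back-of-the-envelope sketch. The 200,000 figure is the approximate count mentioned above, the page size of 200 is the value the script later in this post uses, and the variable names are purely illustrative.

```python
import math

total_records = 200_000   # approximate size of the account (rounded-down assumption)
export_batch = 20         # rows per manual Excel export on the site
search_cap = 10_000       # maximum rows a single search can return
page_size = 200           # page size requested by the script in this post

# Manual export/import round trips needed without automation
manual_exports = math.ceil(total_records / export_batch)
# Minimum number of separate searches needed to stay under the cap
min_queries = math.ceil(total_records / search_cap)
# Requests per search when paging through a full 10,000 rows
pages_per_query = math.ceil(search_cap / page_size)

print(manual_exports)   # 10000
print(min_queries)      # 20
print(pages_per_query)  # 50
```

Done by hand, that is on the order of 10,000 export and import round trips. The script instead needs at least 20 separate searches (split by phone-number prefix in the code below), each taking up to 50 requests.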

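As the programmer husband, I felt duty-bound to take this pain point off her hands. So I spent a week of spare time plus a weekend learning Python and building an automated export. It also made for good learning motivation. Here is the result: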

Data scraping code

hello.py

import requests
import json
import logging
import csv
import datetime
import time
import sys
import os
import xlwt

# Output directory for the exported files
output_dir = "/Users/xxxx/"
# Maximum number of rows a single search can return
maxNumber = 10000
# Page size (rows per request)
pageSize = 200

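# Search by page number, page size and phone number (adjust the parameters to fit your own use case)
def my_frist_func(page, size, phone):
    mkdir(phone)

    # Redirect warnings to the logging module; this silences the warning that verify=False would otherwise print
    logging.captureWarnings(True)
    fromIndex = (page-1)*size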

    # Request parameters captured with Charles; fill in your own
    data = {}
    # Request headers captured with Charles
    headers = {"Cookie": "CNZZDATA1261213312=270079913-1559008705-%7C1559524218; JSESSIONID=F042C28D5F0253C34E0EA5A003176565; last_login=xxxx; remember=1"}

    url = 'https://xxxx'
    # verify=False skips certificate verification, which makes the SSLError go away
    response = requests.post(url, json=data, headers=headers, verify=False)
    if response.status_code != 200:
        print(response.status_code)
        return
    # Parse the response body into a JSON object
    resp = json.loads(response.text)
    list = resp["result"]["resume"]["list"]
    totoal = resp["result"]["resume"]["total"]
    # Reshape the records into a new data model
    res_list = []
    year = datetime.datetime.now().year
    
    for people in list:
        newPeople = {}
        basicInfo = people["basicInfo"]
        # "姓名" means "name"; the key becomes the column header in the Excel file
        newPeople["姓名"] = basicInfo["Fname"]
        res_list.append(newPeople)

    # Output file: <output_dir>/<phone>/第<page>页.xls ("第N页" means "page N")
    file_full_path = output_dir + phone + "/第" + str(page)+"页.xls"
    # Write the reshaped records to Excel
    write_excel_people(file_full_path, res_list)
    next_from = fromIndex + size
    # Print progress, then fetch the next page if there is one
    if next_from < totoal and next_from < maxNumber:
        process = (next_from / totoal) * 100
        _output = sys.stdout
        _output.write(f"\r  percent:{process:.2f}% [prefix {phone}: {totoal} records in total]")
        _output.flush()
        my_frist_func(page=page+1, size=size, phone=phone)
    elif next_from >= maxNumber:
        print(f"\nReached the 10,000-row limit [prefix {phone}: {totoal} records in total]")
    else:
        print(f"\nExport finished [prefix {phone}: {totoal} records in total]")


def mkdir(dirName):
    path = output_dir + dirName
    # Strip leading and trailing whitespace
    path = path.strip()
    # Strip a trailing backslash
    path = path.rstrip("\\")

    # True if the path already exists, False otherwise
    isExists = os.path.exists(path)

    if not isExists:
        # The directory does not exist yet, so create it
        os.makedirs(path)
        print('Created directory ' + str(dirName))
        return True
    else:
        # The directory already exists, so there is nothing to create
        return False

def write_excel_people(path, rows=None):
    # Nothing to write for an empty or missing list
    if not rows:
        return

    xls = xlwt.Workbook()
    sht1 = xls.add_sheet('Sheet1')

    for row, data in enumerate(rows):
        for column, key in enumerate(data.keys()):
            # The first record's keys double as the header row
            if row == 0:
                sht1.write(0, column, key)
            # Data rows start below the header
            sht1.write(row+1, column, data[key])
    xls.save(path)





if __name__ == '__main__':

    # Phone-number prefixes to search for; replace with your own
    phones = ['123','123']

    for phone in phones:
        my_frist_func(page=1, size=pageSize, phone=phone)
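
The function above fetches the next page by calling itself. With at most 50 pages per search that stays well within Python's default recursion limit, but the same paging logic can also be written as a plain loop. The following is a minimal sketch of that idea, not the code used in this post: `fetch_page` is a hypothetical callback standing in for the HTTP request, and `iter_pages` and `fake_fetch` are illustrative names.

```python
def iter_pages(fetch_page, page_size=200, max_rows=10000):
    """Yield (page_number, rows) until the data or the search cap is exhausted.

    fetch_page(page, size) is a hypothetical callable returning (rows, total).
    """
    page = 1
    while True:
        rows, total = fetch_page(page, page_size)
        yield page, rows
        next_from = page * page_size  # index of the first row of the next page
        if next_from >= total or next_from >= max_rows:
            break
        page += 1


# Stand-in for the real request: pretends the server holds 450 rows.
def fake_fetch(page, size):
    total = 450
    start = (page - 1) * size
    rows = list(range(start, min(start + size, total)))
    return rows, total


pages = [(page, len(rows)) for page, rows in iter_pages(fake_fetch)]
print(pages)  # [(1, 200), (2, 200), (3, 50)]
```

In a real run, `fetch_page` would wrap the `requests.post` call and return the `list` and `total` fields from the response, and the loop body would write each page to its own Excel file.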

How to use

1. Open PyCharm and run main directly

2. Open a terminal, cd to the directory containing the file, then run the command

>> python3 hello.py

(If you get the error ModuleNotFoundError: No module named 'xlwt', run pip install xlwt to install the xlwt package, which is used to write the Excel files.)
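
The script imports two third-party packages, `requests` and `xlwt`. As a small convenience, one could check for both up front instead of waiting for the `ModuleNotFoundError`. This is only a sketch: `missing_modules` is a made-up helper name, and it relies on the standard-library function `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["requests", "xlwt"])
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```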

Posted on 2019-06-26 11:33 by kingBo0259
