Web Crawlers - Post Category - 市丸银

Running a Crawler as a Scheduled Task

Summary: 1. Import modules: import datetime import time 2. Code: def time_task(): while True: now = datetime.datetime.now() # print(now.hour, now.minute) if now.hour == 0 and Read more
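The summary above is cut off, so what follows is only a minimal sketch of a daily scheduled task, not the post's own code. It sleeps until the target time instead of polling `now.hour` in a tight loop. The names `seconds_until`, `job`, and `max_runs` are assumptions introduced for illustration.

```python
import datetime
import time


def seconds_until(hour, minute, now=None):
    """Return the number of seconds from `now` until the next hour:minute."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The target time has already passed today, so aim for tomorrow.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()


def time_task(job, hour=0, minute=0, max_runs=None):
    """Call `job` once a day at hour:minute.

    `max_runs` is a hypothetical cap that lets the loop end; None runs forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        time.sleep(seconds_until(hour, minute))
        job()
        runs += 1
```

Calling `time_task(my_crawl)` would start the crawl at midnight every day, where `my_crawl` stands for whatever function launches the crawler.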

posted @ 2020-08-05 17:49 市丸银 Views (262) Comments (0) Recommended (0)

Fiddler Configuration and Usage Tutorial

Summary: https://www.cnblogs.com/woaixuexi9999/p/9247705.html Read more

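posted @ 2019-11-29 17:42 市丸银 Views (356) Comments (0) Recommended (0)

Scrapy Selector

Summary: This section is supplementary material. 1. xpath() 2. css() 3. Regular expressions # multiple values, returned as a list response.xpath('//a/text()').re('(.*?):\s(.*)') # take the first value response.xpath('//a/text()').re_first('(.*?):\s Read more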

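posted @ 2019-11-25 21:00 市丸银 Views (118) Comments (0) Recommended (0)

Using Proxies

Summary: 1. Proxy pool: https://github.com/Python3WebSpider/ProxyPool It fetches proxies from the web, checks whether they work, stores them in Redis, and periodically re-checks their validity. API: get a proxy through a URL. 2. Usage: the proxy starts as None; if the IP is banned (judged by the response status code), get a new proxy from the pool and use it for the request Read more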
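The usage flow in step 2 can be sketched roughly as below. This is an illustration under assumptions, not the post's code: `send` and `get_proxy` are hypothetical hooks supplied by the caller, and the status codes treated as "banned" (403 and 429) are a guess that should be adapted to the target site.

```python
def fetch_with_proxy(url, send, get_proxy, banned_codes=(403, 429), max_retries=3):
    """Request `url` without a proxy first; switch proxies when the IP looks banned.

    `send(url, proxy)` must return an object with a `status_code` attribute.
    `get_proxy()` must return a proxy address such as "127.0.0.1:8888".
    Both are hypothetical hooks and not part of ProxyPool itself.
    """
    proxy = None  # start without a proxy
    for _ in range(max_retries + 1):
        response = send(url, proxy)
        if response.status_code not in banned_codes:
            return response, proxy
        # The status code suggests a ban: ask the pool for a fresh proxy.
        proxy = get_proxy()
    raise RuntimeError("all attempts were rejected")
```

With the requests library, `send` could wrap `requests.get(url, proxies=...)`, and `get_proxy` could read one address from the URL that the proxy pool's API exposes.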

posted @ 2019-11-24 20:46 市丸银 Views (121) Comments (0) Recommended (0)

pymysql Summary

Summary: I. Create a database import pymysql conn = pymysql.connect(host='ip', user='root', password='password') # return query results as dictionaries cursor = conn.cursor(cursor=pymysql.cursors.DictCu Read more

posted @ 2019-11-21 18:03 市丸银 Views (181) Comments (0) Recommended (0)

pyquery Parsing Library

Summary: The syntax is almost identical to jQuery. Install: conda install pyquery I. Initialization, standard usage from pyquery import PyQuery as pq import requests # r = requests.get(url='http://www.baidu.com') html Read more

posted @ 2019-11-21 13:00 市丸银 Views (245) Comments (0) Recommended (0)

urllib Basic Usage (Overview)

Summary: I. urllib.request 1. urlopen from urllib import request r = request.urlopen('http://www.baidu.com/') # get the status code print(r.status) # get the response headers print(r.getheaders( Read more

posted @ 2019-11-20 23:43 市丸银 Views (467) Comments (0) Recommended (0)

Saving Data to a txt File

Summary: A nice use of join. a = "Hello, world" b = "你好，世界" c = "How are you?" with open(file='a.txt', mode='w', encoding='utf-8') as f: f.write('\n'.join([a, b, c])) f.w Read more
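As a small runnable sketch of the join-then-write pattern in the summary above, the snippet below writes the lines and reads them back. The helper names `save_lines` and `load_lines` and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile


def save_lines(path, lines):
    """Write `lines` to `path`, one per line, without a trailing newline."""
    with open(path, mode="w", encoding="utf-8") as f:
        f.write("\n".join(lines))


def load_lines(path):
    """Read the file back and split it into the original lines."""
    with open(path, encoding="utf-8") as f:
        return f.read().split("\n")


lines = ["Hello, world", "你好，世界", "How are you?"]
path = os.path.join(tempfile.mkdtemp(), "a.txt")
save_lines(path, lines)
read_back = load_lines(path)
```

Passing `encoding='utf-8'` explicitly matters here because the default encoding depends on the platform, and the second line contains non-ASCII text.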

posted @ 2019-11-20 17:57 市丸银 Views (311) Comments (0) Recommended (0)

Saving Data to CSV

Summary: CSV stands for comma-separated values. I. Writing 1. Append a list as a single row import csv # with open(file='a.csv', mode='w', encoding='utf-8', newline='') as f: write = csv.writer(f) write.writerow(['id' Read more
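The row-writing step above might look like the following sketch. Only the `id` column appears in the truncated summary; the `name` column, the sample rows, and the helper names are assumptions.

```python
import csv
import os
import tempfile


def write_rows(path, header, rows):
    """Write a header row and data rows to `path` as UTF-8 CSV."""
    # newline='' leaves line endings to the csv module and avoids blank rows on Windows.
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)  # one list becomes one row
        writer.writerows(rows)   # a list of lists becomes many rows


def read_rows(path):
    """Read every row back as a list of strings."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "a.csv")
write_rows(path, ["id", "name"], [[1, "Tom"], [2, "Ann, Jr."]])
rows = read_rows(path)
```

Note that `csv.reader` returns every field as a string, so numeric columns need converting after reading.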

posted @ 2019-11-20 17:49 市丸银 Views (800) Comments (0) Recommended (0)

scrapy-splash

Summary: Project page: https://github.com/scrapy-plugins/scrapy-splash 1. Install: pip install scrapy-splash 2. Run Splash: docker run -p 8050:8050 scrapinghub/splash 3. Configure the settings file Read more
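The summary is cut off at step 3, so the fragment below is a sketch of the settings that the scrapy-splash README describes, not necessarily what the post used. The `SPLASH_URL` value assumes Splash is running locally on the port mapped by the docker command above.

```python
# settings.py (fragment); assumes Splash is reachable on localhost:8050
SPLASH_URL = "http://localhost:8050"

# Enable the Splash middlewares and keep HttpCompressionMiddleware after them.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicate Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Use a dupefilter and cache storage that account for Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

If Splash runs on another host, only `SPLASH_URL` should need to change.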

posted @ 2019-11-20 13:44 市丸银 Views (145) Comments (0) Recommended (0)

urllib parse

Summary: 1. urlparse Purpose: parse a URL from urllib import parse url = "https://book.qidian.com/info/1004608738" result = parse.urlparse(url=url) print(result) Result: ParseR Read more
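For reference, a short sketch of what `urlparse` returns for the URL in the summary. Each component is available as a named attribute of the result, and this URL has no query string or fragment.

```python
from urllib import parse

url = "https://book.qidian.com/info/1004608738"
result = parse.urlparse(url)

# The result is a named tuple; each component is available by attribute.
scheme = result.scheme   # "https"
netloc = result.netloc   # "book.qidian.com"
path = result.path       # "/info/1004608738"
query = result.query     # "" because the URL has no query string
```

`result.geturl()` reassembles the components into the original URL.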

posted @ 2019-11-20 12:43 市丸银 Views (150) Comments (0) Recommended (0)

Splash

Summary: Official docs: https://splash.readthedocs.io/en/stable/index.html Common endpoints (API) 1. render.html Format: http://10.63.32.49:8050/render.html?url=https://www.baidu.com&wait= Read more
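The render.html format above can be assembled with `urlencode`, which also escapes the target URL so it is passed safely as a query parameter. This is a sketch under assumptions: the helper name `render_html_url`, the local Splash address, and the `wait` value of 2 seconds are illustrative, since the summary is cut off before the value.

```python
from urllib.parse import urlencode


def render_html_url(splash_base, target_url, wait):
    """Build a Splash render.html request URL for `target_url`."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


endpoint = render_html_url("http://localhost:8050", "https://www.baidu.com", 2)
```

Fetching `endpoint` with any HTTP client should return the page's HTML after Splash has rendered it and waited the given number of seconds.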

posted @ 2019-11-19 13:00 市丸银 Views (662) Comments (0) Recommended (0)

A Simple Splash Application

Summary: jd -> iphone import requests from lxml import etree # search_key = 'iphone' jd_url = "https://search.jd.com/Search?keyword={}&enc=utf-8&wq={}&pvid=1a54a Read more

posted @ 2019-11-18 12:30 市丸银 Views (152) Comments (0) Recommended (0)

Installing and Starting Splash on Ubuntu 18.04

Summary: Official docs: https://splash.readthedocs.io/en/stable/ 1. Install Docker https://www.cnblogs.com/wt7018/p/11880666.html 2. Pull the image sudo docker pull scrapinghub/sp Read more

posted @ 2019-11-18 10:57 市丸银 Views (430) Comments (0) Recommended (0)

Installing Docker on Ubuntu 18.04

Summary: Reference: https://www.runoob.com/docker/ubuntu-docker-install.html 1. Uninstall sudo apt-get remove docker docker-engine docker.io containerd runc 2. Install Docker sudo ap Read more

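posted @ 2019-11-18 10:33 市丸银 Views (30699) Comments (2) Recommended (4)

Selenium Waits

Summary: 1. Implicit wait: when looking up a node, if it is not found immediately, Selenium keeps looking for up to 10 seconds and raises an exception if the node is still not found. from selenium import webdriver # browser = webdriver.Chrome() browser.implicitly_wait(10) browser.get Read more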

posted @ 2019-11-17 21:15 市丸银 Views (151) Comments (0) Recommended (0)

Selenium with Headless Chrome

Summary: Note: PhantomJS has been deprecated. Chrome headless: add the options before launching the browser. import time import sys from selenium import webdriver from selenium.webdriver.common.keys import Keys fr Read more

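posted @ 2019-11-17 00:40 市丸银 Views (271) Comments (0) Recommended (0)

Scraping JD.com with Selenium

Summary: Scrape iphone listings. Note: the browser object changes whenever any action is performed on the current page. import time from selenium import webdriver from selenium.webdriver.common.keys import Keys # if __name__ == '_ Read more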

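posted @ 2019-11-17 00:13 市丸银 Views (294) Comments (0) Recommended (0)

selenium

Summary: Note: the browser object changes every time the page is operated on, including scrolling down; I have been bitten by this. I. Example: open Baidu and search for python from selenium import webdriver browser = webdriver.Chrome() browser.get('https://www.baidu. Read more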

posted @ 2019-11-16 18:52 市丸银 Views (131) Comments (0) Recommended (0)

Installing ChromeDriver

Summary: The driver for Chrome. 0. Install selenium pip3 install -i https://pypi.douban.com/simple selenium 1. Check the Chrome version chrome://version/ 2. Download http://chromedriver.storage.googleap Read more

posted @ 2019-11-16 15:58 市丸银 Views (181) Comments (0) Recommended (0)
