Post archive for June 2020 - udbful

8_3 Simulating a Renren login with Scrapy

Summary: 1. Create the project. 2. Change the settings (settings.py, etc.). 3. Write the code:

    # -*- coding: utf-8 -*-
    import scrapy

    class RenrenSpider(scrapy.Spider):
        name = 'renren'
        allowed_domains = …

posted @ 2020-06-28 17:33 udbful Views (160) Comments (0) Recommended (0)

2_1 The yield keyword

Summary: (none)

posted @ 2020-06-25 01:00 udbful Views (97) Comments (0) Recommended (0)

1_4 Callback functions

Summary: (none)

posted @ 2020-06-25 00:58 udbful Views (69) Comments (0) Recommended (0)

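8_2 Scrapy in practice: CrawlSpider (crawling the WeChat Mini Program community tutorials)

Summary: CrawlSpider suits sites whose URLs follow regular patterns and can crawl an entire site. 1. Create the project:

    scrapy startproject wxapp
    cd wxapp
    scrapy genspider -t crawl wxapp_spider wxapp-union.com

2. Edit settings.py: ROBOTSTXT…

posted @ 2020-06-24 09:48 udbful Views (288) Comments (0) Recommended (0)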

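8_1 Scrapy in practice (crawling Qiushibaike jokes)

Summary: To begin, see https://www.cnblogs.com/sruzzg/p/13060159.html 1. Create a Scrapy crawler project named demo:

    scrapy startproject demo

This quickly creates a new project called demo. 2. Generate a Scrapy spider named qiushibaike inside the project. Step 1: enter the project: cd …

posted @ 2020-06-23 22:05 udbful Views (222) Comments (0) Recommended (0)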

7_3

Summary: Slider CAPTCHA verification. OpenCV + Python: https://www.jb51.net/article/161503.htm?tdsourcetag=s_pcqq_aiomsg Python + Selenium...: https://www.cnblogs.com/ohahastudy/p/11493971.…

posted @ 2020-06-22 16:53 udbful Views (127) Comments (0) Recommended (0)

7_2 Handling CAPTCHAs with tesseract

Summary: 1. Recognizing simple CAPTCHAs:

    """"""
    import pytesseract
    from PIL import Image
    from urllib import request
    import time

    def main():
        # This URL can also be found by analyzing the login page…

posted @ 2020-06-22 15:05 udbful Views (175) Comments (0) Recommended (0)

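7_1 Installing and using tesseract

Summary: 1. Installing tesseract. OCR (Optical Character Recognition) is the process of scanning characters and translating them into electronic text based on their shapes. Graphical CAPTCHAs consist of irregular characters, which are in fact ordinary characters that have been slightly distorted. tesseract download address: link: …

posted @ 2020-06-22 10:50 udbful Views (1116) Comments (0) Recommended (0)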

6_7 Using a proxy IP with Selenium

Summary:

    """"""
    from selenium import webdriver

    driver_path = r"D:\install\chromedriver\chromedriver.exe"
    options = webdriver.ChromeOptions()
    opti…

posted @ 2020-06-20 23:10 udbful Views (184) Comments (0) Recommended (0)

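6_6 Simulating browser back/forward and switching window handles

Summary: 1. driver.get() can be called several times to open several pages, but each call replaces the page currently shown in the window, so back and forward navigation can be used to move between them:

    from selenium import webdriver
    import time

    driver_path = r"D:\install\chromedriver\chromedriver.exe"
    dri…

posted @ 2020-06-20 22:49 udbful Views (221) Comments (0) Recommended (0)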

6_5 Working with cookies in Selenium

Summary:

    """Working with cookies in Selenium"""
    from selenium import webdriver

    driver_path = r"D:\install\chromedriver\chromedriver.exe"
    driver = webdriver.Chrom…

posted @ 2020-06-20 21:56 udbful Views (143) Comments (0) Recommended (0)

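6_4 Action chains

Summary: In the examples above, each interaction was performed on a specific node. For an input box, we call its methods for typing and clearing text; for a button, we call its click method. Other operations, such as mouse drags and key presses, have no specific target object. These are performed in a different way: with action chains.

    """Action chains"""
    from s…

posted @ 2020-06-20 21:39 udbful Views (194) Comments (0) Recommended (0)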

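6_3 Working with form elements in Selenium

Summary: Selenium can drive a browser to perform operations, that is, it can make the browser simulate user actions. The most common calls are send_keys() to type text, clear() to clear text, and click() to click a button. For example:

    """Working with form elements in Selenium"""
    # Common form elements:
    # i…

posted @ 2020-06-20 00:24 udbful Views (329) Comments (0) Recommended (0)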

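6_2 Locating elements with Selenium

Summary: webdriver provides a family of element-locating methods. The most commonly used are:

    find_element_by_id()          # locate by element ID
    find_element_by_name()        # locate by element name
    find_element_by_class_name()  # locate by class name
    find_eleme…

posted @ 2020-06-19 21:46 udbful Views (199) Comments (0) Recommended (0)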

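6_1 Installing Selenium and chromedriver

Summary: Introduction to Selenium. Selenium started out as an automated testing tool; crawlers use it mainly because requests cannot execute JavaScript directly. In essence, Selenium drives a real browser and fully simulates browser operations such as navigating, typing, clicking, and scrolling down, in order to obtain the page as rendered. It supports multiple browsers. Installing sele…

posted @ 2020-06-17 22:39 udbful Views (256) Comments (0) Recommended (0)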

4_4 Writing CSV files

Summary:

    """Writing CSV files"""
    import csv

    # Method 1:
    def write_csv_demo1():
        headers = ['username', 'age', 'height']
        values = [
            ('张三', 18, 180),
            ('李…

posted @ 2020-06-15 22:22 udbful Views (190) Comments (0) Recommended (0)
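
The excerpt above breaks off inside the first method, so here is a minimal sketch of two common ways to write the same rows: csv.writer with tuples and csv.DictWriter with dictionaries. It writes to an in-memory io.StringIO buffer instead of a file, and the second data row and the helper names are illustrative assumptions, not taken from the post.

```python
import csv
import io

HEADERS = ['username', 'age', 'height']


def write_with_writer(rows):
    """Write tuples with csv.writer and return the CSV text."""
    buffer = io.StringIO(newline='')  # newline='' leaves line endings to the csv module
    writer = csv.writer(buffer)
    writer.writerow(HEADERS)
    writer.writerows(rows)
    return buffer.getvalue()


def write_with_dictwriter(records):
    """Write dictionaries with csv.DictWriter and return the CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=HEADERS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data: only the first row appears in the excerpt.
tuple_rows = [('张三', 18, 180), ('lisi', 20, 175)]
dict_rows = [dict(zip(HEADERS, row)) for row in tuple_rows]

text_from_writer = write_with_writer(tuple_rows)
text_from_dictwriter = write_with_dictwriter(dict_rows)
print(text_from_writer)
```

When writing to a real file, the csv documentation recommends opening it with newline='' as well, and passing an explicit encoding such as utf-8 if the data contains non-ASCII text.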

4_3 Reading CSV files

Summary:

    """Reading CSV files"""
    import csv

    def readcsv_demo1():
        """List form, accessed by index"""
        with open('csvwriter.csv', 'r') as fp:
            # reader is an iterator
            reader = cs…

posted @ 2020-06-15 22:14 udbful Views (178) Comments (0) Recommended (0)

4_2 Converting a JSON string to a Python object

Summary:

    """JSON string to Python object"""
    import json

    json_str = '[{"username": "张三", "age": 18, "country": "china"}, {"username": "lisi", "age": 20, "countr…

posted @ 2020-06-15 22:01 udbful Views (739) Comments (0) Recommended (0)
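
A minimal sketch of the conversion this post is about, using json.loads. The JSON string is truncated in the excerpt, so the second record's country value below is an assumption.

```python
import json

# The tail of this string is assumed; the excerpt cuts off at "countr".
json_str = (
    '[{"username": "张三", "age": 18, "country": "china"}, '
    '{"username": "lisi", "age": 20, "country": "china"}]'
)

# json.loads parses a JSON array into a list and each JSON object into a dict.
persons = json.loads(json_str)

for person in persons:
    print(person['username'], person['age'])
```

JSON numbers become int or float and JSON strings become str, so the values can be used directly without further conversion.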

4_1 Converting a Python object to a JSON string

Summary:

    """Python object to JSON string"""
    import json

    persons = [
        {'username': '张三', 'age': 18, 'country': 'china'},
        {'username': 'lisi', 'age': 20, 'cou…

posted @ 2020-06-15 21:59 udbful Views (108) Comments (0) Recommended (0)
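
A minimal sketch of the reverse direction with json.dumps. The second record is completed with an assumed country value because the excerpt is truncated. One detail worth showing with Chinese text: by default json.dumps escapes non-ASCII characters, and ensure_ascii=False keeps them readable.

```python
import json

persons = [
    {'username': '张三', 'age': 18, 'country': 'china'},
    {'username': 'lisi', 'age': 20, 'country': 'china'},  # country value assumed
]

escaped = json.dumps(persons)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(persons, ensure_ascii=False)  # non-ASCII kept as-is

print(escaped)
print(readable)
```

Both strings decode back to the same Python object, so the choice only affects how the text looks on disk or on the wire.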

20 Crawling poems from the Gushiwen site (regex approach)

Summary:

    """Gushiwen crawler"""
    import re
    import requests

    def parse_page(url):
        headers = {
            'User-Agent': 'Mozilla/5.0',
        }

        response = requests.…

posted @ 2020-06-14 23:47 udbful Views (246) Comments (0) Recommended (0)

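1_3 Examples of commonly used functions (zip/)

Summary: 1. The zip function:

    """The zip function"""

    """
    zip() takes iterables as arguments, packs their corresponding elements into tuples, and returns an iterator over those tuples (in Python 2 it returned a list).
    If the iterables differ in length, the result is as long as the shortest one. Passing zipped pairs back to zip() with the * operator unzips them.
    "…

posted @ 2020-06-14 23:44 udbful Views (123) Comments (0) Recommended (0)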
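
A short sketch of the behavior described in the docstring above, written for Python 3, where zip() returns a lazy iterator rather than a list. The sample data is made up for illustration.

```python
names = ['zhangsan', 'lisi', 'wangwu']
ages = [18, 20]  # deliberately shorter than names

# zip() stops at the shortest input; list() materializes the iterator.
pairs = list(zip(names, ages))
print(pairs)

# Unzipping: the * operator spreads the pairs back out as arguments to zip().
unzipped_names, unzipped_ages = zip(*pairs)
print(unzipped_names, unzipped_ages)
```

Note that unzipping yields tuples, not lists; wrap each result in list() if a list is needed.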

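1_2 Lambda expressions

Summary:

    """Lambda expressions"""

    """
    Syntax for defining a function:
    def function_name([param1, param2, ...]):
        function body

    When the function body consists of a single return statement, the definition can be replaced by a lambda expression:
    lambda [param1, param2, ...]: expression in terms of the parameters
    …

posted @ 2020-06-13 23:44 udbful Views (199) Comments (0) Recommended (0)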
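
To make the equivalence in the docstring concrete, here is a minimal sketch: the same one-line function written with def and with lambda, followed by a typical use of a lambda as a sort key. All names and data are illustrative.

```python
# A function whose body is a single return statement...
def add(x, y):
    return x + y

# ...and the equivalent lambda expression.
add_lambda = lambda x, y: x + y

print(add(2, 3), add_lambda(2, 3))

# Lambdas are most useful inline, for example as a sort key.
people = [('lisi', 20), ('zhangsan', 18)]
by_age = sorted(people, key=lambda person: person[1])
print(by_age)
```

Binding a lambda to a name, as in add_lambda, is shown only for comparison; in real code a def is usually clearer for anything that needs a name.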

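1_1 Some important features of functions

Summary:

    """Some important features of functions"""

    """
    In Python, everything is an object. Functions are therefore objects too, so a function can be assigned to a variable.
    """
    def add(num1, num2):
        return num1 + num2

    print(add)  # <function ad…

posted @ 2020-06-13 23:25 udbful Views (162) Comments (0) Recommended (0)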
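
A small sketch of what "functions are objects" allows in practice: assigning a function to a variable, passing it as an argument, and storing it in a dictionary. Only add comes from the excerpt; sub, apply, and operations are hypothetical names added for illustration.

```python
def add(num1, num2):
    return num1 + num2


def sub(num1, num2):
    return num1 - num2


# 1. A function can be assigned to a variable; both names refer to the same object.
plus = add
print(plus(1, 2))


# 2. A function can be passed as an argument to another function.
def apply(func, a, b):
    return func(a, b)


print(apply(sub, 5, 3))

# 3. Functions can be stored in data structures and looked up later.
operations = {'+': add, '-': sub}
print(operations['+'](2, 2))
```

Passing functions around like this is also the basis of callbacks, where the function is handed over now and called later.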

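19 Small regular expression examples

Summary: 1. Common examples:

    """Small regular expression examples"""
    import re

    # 1. Validate a mobile phone number
    text = "13979391000"
    ret = re.match(r'1[34578]\d{9}', text)
    print(ret.group())
    # 2. Validate an email address
    text …

posted @ 2020-06-12 23:19 udbful Views (238) Comments (0) Recommended (0)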
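
One caveat with the phone-number example: re.match only anchors the start of the string, so a 12-digit input still matches on its first 11 digits. A hedged sketch of a stricter check, using re.fullmatch with the same pattern from the excerpt (the function name is hypothetical):

```python
import re

# Same pattern as in the excerpt, written as a raw string.
PHONE_PATTERN = re.compile(r'1[34578]\d{9}')


def is_valid_phone(text):
    """Return True only if the whole string is an 11-digit number matching the pattern."""
    return PHONE_PATTERN.fullmatch(text) is not None


print(is_valid_phone("13979391000"))   # valid: 11 digits, second digit allowed
print(is_valid_phone("139793910001"))  # one digit too many
print(is_valid_phone("12979391000"))   # second digit not in [34578]

# For comparison: match() accepts the over-long input because it ignores the tail.
print(PHONE_PATTERN.match("139793910001") is not None)
```

The pattern itself reflects the number prefixes the post assumes; it is not a complete or current rule for valid mobile numbers.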

18 Crawling China Weather Network data (sorted and visualized)

Summary:

    """China Weather Network crawler"""
    import requests
    from bs4 import BeautifulSoup
    from pyecharts import Bar

    HEADERS = {
        'User-Agent': 'Mozilla/5.0'
    }
    …

posted @ 2020-06-12 17:32 udbful Views (323) Comments (0) Recommended (0)

17 Crawling China Weather Network data

Summary:

    """China Weather Network crawler"""
    import requests
    from bs4 import BeautifulSoup

    HEADERS = {
        'User-Agent': 'Mozilla/5.0',
    }

    def parse_detail_page(url, …

posted @ 2020-06-12 00:11 udbful Views (224) Comments (0) Recommended (0)

16 select and CSS selectors (extracting elements in detail)

Summary: # 1. Get all tr tags

    from bs4 import BeautifulSoup
    text = """
    <table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
    <tr class="h">
    <td…

posted @ 2020-06-11 15:33 udbful Views (853) Comments (0) Recommended (0)

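15 Beautiful Soup (extracting data in detail with find_all())

Summary:

    # 1. Get all tr tags
    # 2. Get the second tr tag
    # 3. Get all tr tags whose class equals even
    # 4_1. Extract all a tags whose id and class both equal test
    # 4_2. Get the href attribute values of all a tags
    # 5. Get all job information (plain text)

# 1. Get all tr tags

    from b…

posted @ 2020-06-11 11:18 udbful Views (3379) Comments (0) Recommended (0)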

14 Crawling movie information from Dytt (电影天堂)

Summary:

    """Dytt (电影天堂) crawler"""
    import requests
    from lxml import etree

    BASE_DOMAIN = 'https://dytt8.net/'
    HEADERS = {
        'User-Agent': 'Mozilla/5.0'
    }
    …

posted @ 2020-06-11 01:19 udbful Views (345) Comments (0) Recommended (0)

13 Crawling movie information from Douban Movies

Summary:

    """Douban Movies crawler"""
    import requests
    from lxml import etree

    # 1. Fetch the pages from the target site
    headers = {
        'User-Agent': 'Mozilla/5.0',
    }

    url = 'htt…

posted @ 2020-06-10 01:41 udbful Views (129) Comments (0) Recommended (0)

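12 Using lxml together with XPath (extracting data in detail)

Summary: Goals:

    # 1. Get all tr tags
    # 2. Get the second tr tag
    # 3. Get all tr tags whose class equals even
    # 4. Get all a tags and their attribute values
    # 5. Get all job information (plain text)

    """Using lxml together with XPath"""
    from lxml import etree

    parser…

posted @ 2020-06-09 16:49 udbful Views (283) Comments (0) Recommended (0)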

11 Parsing HTML with the lxml library

Summary: 1. Parsing a string with lxml:

    """Parsing HTML with the lxml library"""
    from lxml import etree

    text = """
    <body>
    <div class="header clear">
    <div class="inner">
    <h1 class="logo_area" title="…

posted @ 2020-06-09 15:18 udbful Views (328) Comments (0) Recommended (0)

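10 Installing XPath and basic syntax

Summary: 1. Installation: to install the XPath extension, open the menu at the top right of Chrome, choose More tools, then Extensions, and drag the crx file onto the page. 2. Basic syntax: https://www.w3school.com.cn/xpath/index.asp

posted @ 2020-06-09 11:28 udbful Views (431) Comments (0) Recommended (0)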

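5 Logging in automatically to pages that require authorization

Summary: The following example only works for sites whose login does not require a CAPTCHA.

    """"""
    # Dapeng's home page: dapeng_url = "http://www.renren.com/880151247/profile"
    # Renren login: login_url = 'http://www.renren.com/PLogin…

posted @ 2020-06-08 23:13 udbful Views (318) Comments (0) Recommended (0)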

7 Basic use of the requests library

Summary: 1. The requests get() method. Similar to https://www.cnblogs.com/sruzzg/p/13041898.html

    """The requests get() method"""
    import requests

    # response = requests.get('https…

posted @ 2020-06-08 15:47 udbful Views (249) Comments (0) Recommended (0)

6 Saving and loading cookies

Summary: 1. Saving cookies:

    """Saving cookies"""
    from urllib import request
    from http.cookiejar import MozillaCookieJar

    cookieFilename = 'cookie.txt'
    # 声…

posted @ 2020-06-08 11:36 udbful Views (280) Comments (0) Recommended (0)
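
The excerpt stops before any cookie is saved, so here is a minimal offline sketch of the save/load round trip with MozillaCookieJar. Instead of fetching a page, it builds one cookie by hand; the cookie name, value, domain, and both helper functions are hypothetical and are not from the post.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Build a simple cookie by hand (normally the server would set it)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=int(time.time()) + 3600,
        discard=False, comment=None, comment_url=None, rest={},
    )


def save_and_reload(cookies):
    """Save cookies to a temporary file, load them into a new jar, and return them."""
    fd, filename = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        jar = MozillaCookieJar(filename)
        for cookie in cookies:
            jar.set_cookie(cookie)
        # ignore_discard / ignore_expires also keep session and expired cookies.
        jar.save(ignore_discard=True, ignore_expires=True)

        reloaded = MozillaCookieJar(filename)
        reloaded.load(ignore_discard=True, ignore_expires=True)
        return {cookie.name: cookie.value for cookie in reloaded}
    finally:
        os.remove(filename)


result = save_and_reload([make_cookie('sessionid', 'abc123', 'example.com')])
print(result)
```

In a crawler, the jar would usually be attached to an opener through urllib.request.HTTPCookieProcessor so that cookies from responses land in it before save() is called.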

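4 Requesting a page with a cookie copied manually from the browser

Summary: Some sites, such as Renren, cannot be viewed without logging in, so a program has to simulate the logged-in state. This can be done manually or by logging in automatically with a username and password. This post copies the cookie by hand, uses it to request the target page, and saves the response locally.

    """Requesting a page with a cookie copied manually from the browser"""
    from urllib imp…

posted @ 2020-06-08 10:06 udbful Views (1869) Comments (0) Recommended (0)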

3 Using ProxyHandler for proxy IPs

Summary: Kuaidaili: https://www.kuaidaili.com/ops/ Xici free proxies: http://www.xicidaili.com/ Dailiyun: http://www.dailiyun.com/

    """Using ProxyHandler for proxy IPs"""
    import urllib.reques…

posted @ 2020-06-08 09:18 udbful Views (192) Comments (0) Recommended (0)

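1 The urllib library (overview)

Summary: urllib is part of Python's standard library and contains four modules: request, error, parse, and robotparser. The most commonly used are request, for sending HTTP requests, and error, for handling request errors. parse handles URLs: splitting, joining, and so on. 1. The urlopen function in urllib:

    """u…

posted @ 2020-06-07 23:08 udbful Views (185) Comments (0) Recommended (0)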
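
The summary mentions that parse splits and joins URLs. A minimal offline sketch of that module, using a made-up example.com URL and query parameters (both are assumptions for illustration):

```python
from urllib import parse

# Encode query parameters; non-ASCII text and spaces are percent-encoded.
params = {'wd': 'Python 爬虫', 'page': 1}
query = parse.urlencode(params)
url = 'https://www.example.com/s?' + query
print(url)

# Split the URL into its components.
parts = parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Decode the query string back into a dict of lists.
decoded = parse.parse_qs(parts.query)
print(decoded)
```

parse_qs returns a list for every key because a query string may repeat the same parameter; all values come back as strings.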

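1 Creating a virtualenv virtual environment with Python 3 (Windows 10)

Summary: 1. Install virtualenv. (1) Install command:

    pip install virtualenv

(2) Find the Python interpreter path:

    where python

2. Create the virtual environment. (1) In the console, use cd to switch to the directory where the virtual environment should be created:

    C:\Users\udbfu>d:
    D:\>cd Virtualenv

2…

posted @ 2020-06-07 21:35 udbful Views (194) Comments (0) Recommended (0)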

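24 Basic use of Scrapy spiders

Summary: The main pieces are the Request, Response, and Item classes, plus the information-extraction methods that Scrapy spiders support: Beautiful Soup, lxml, re, XPath Selector, CSS Selector, and so on.

posted @ 2020-06-07 15:53 udbful Views (115) Comments (0) Recommended (0)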

23 A first Scrapy spider example

Summary: 1. Common Scrapy commands. 2. Creating the first project: https://docs.scrapy.org/en/latest/intro/tutorial.html (1) Create a Scrapy crawler project:

    scrapy startproject python123demo

This command creates a python123dem…

posted @ 2020-06-07 15:28 udbful Views (227) Comments (0) Recommended (0)

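22 Introduction to the Scrapy framework

Summary: 1. The 5+2 structure: Engine: handles communication, signals, and data transfer between the Spider, ItemPipeline, Downloader, and Scheduler. Spider: processes all Responses, analyzes them to extract the data needed for Item fields, and submits the URLs to follow to the Engine, which then re-enters the S…

posted @ 2020-06-07 12:30 udbful Views (145) Comments (0) Recommended (0)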

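21 Installing the Scrapy framework

Summary: pip install scrapy (Anaconda's bundled third-party libraries do not include Scrapy, so it has to be installed separately). Test: scrapy -h. The following indicates that the installation succeeded: …

posted @ 2020-06-07 11:36 udbful Views (156) Comments (0) Recommended (0)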

19 Regular expression basics

Summary: 1. Basic syntax. 2. The re library. 3. For more, see Python regular expressions: https://i.cnblogs.com/posts?cateId=1775942

posted @ 2020-06-05 21:43 udbful Views (166) Comments (0) Recommended (0)

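18 Walkthrough: a focused crawler for Chinese university rankings

Summary: 1. Functional description and program design. 2. Implementation:

    """Walkthrough: a focused crawler for Chinese university rankings"""
    import requests
    from bs4 import BeautifulSoup
    import bs4

    def getHTMLTest(url):

        try:
            r …

posted @ 2020-06-05 20:42 udbful Views (226) Comments (0) Recommended (0)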

17 HTML content search methods in the bs4 library

Summary: 1. Examples of the find_all() method:

    """HTML content search methods in the bs4 library"""
    import requests
    from bs4 import BeautifulSoup
    import re

    url = "https://python123.io/ws/demo.html"
    r = reques…

posted @ 2020-06-05 16:13 udbful Views (289) Comments (0) Recommended (0)

16 Forms of information markup and general methods of information extraction

Summary:

    """General methods of information extraction"""
    import requests
    from bs4 import BeautifulSoup

    url = "https://python123.io/ws/demo.html"
    r = requests.get(url)
    demo = r.text
    soup = Bea…

posted @ 2020-06-05 00:50 udbful Views (153) Comments (0) Recommended (0)

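15 HTML formatting and encoding with the bs4 library

Summary: 1. Formatting mainly uses the prettify() method:

    """HTML formatting with the bs4 library"""
    import requests
    from bs4 import BeautifulSoup

    # Method 1: downward traversal
    url = "https://python123.io/ws/demo.html"
    r = reques…

posted @ 2020-06-05 00:17 udbful Views (260) Comments (0) Recommended (0)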

14 HTML content traversal methods in the bs4 library

Summary: https://python123.io/ws/demo.html

    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces s…

posted @ 2020-06-04 23:17 udbful Views (237) Comments (0) Recommended (0)

13 Basic elements of the Beautiful Soup library

Summary: Example:

    """Basic elements of the Beautiful Soup library"""
    import requests
    from bs4 import BeautifulSoup

    url = "https://python123.io/ws/demo.html"
    r = requests.get(url)
    demo = r.text
    s…

posted @ 2020-06-04 22:20 udbful Views (217) Comments (0) Recommended (0)

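12 Installing the Beautiful Soup library

Summary: Installing the BeautifulSoup library: pip install BeautifulSoup4 (the BeautifulSoup library is already included in Anaconda's third-party libraries). Test:

    """BeautifulSoup installation test"""
    import requests
    from bs4 import Bea…

posted @ 2020-06-04 17:02 udbful Views (320) Comments (0) Recommended (0)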

11 Example 5: Automatic lookup of IP address locations

Summary: Automatic lookup of IP address locations

    """IP address location lookup"""
    import requests

    #url = "http://m.ip138.com/ip.asp?ip="
    url = "https://www.ip138.com/iplookup.asp?ip="
    try…

posted @ 2020-06-04 10:34 udbful Views (256) Comments (0) Recommended (0)

10 Example 4: Crawling videos with multiple threads

Summary:

    """Crawling Pear Video (梨视频) data with multiple threads"""
    """https://www.cnblogs.com/zivli/p/11614103.html"""
    import requests
    import re
    from lxml import etree
    from multipr…

posted @ 2020-06-04 10:25 udbful Views (241) Comments (0) Recommended (0)

9 Example 3: Crawling and storing web images

Summary: Crawling and storing web images

    """Crawling and storing web images"""
    import requests
    import os

    url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    r…

posted @ 2020-06-04 10:22 udbful Views (169) Comments (0) Recommended (0)

8 Example 2: Submitting search keywords to Baidu and 360

Summary:

    """Submitting a search keyword to Baidu"""
    import requests

    url = "https://www.baidu.com/s"
    keyword = "Python"  # Chinese keywords work too
    try:
        kv = {'wd': keyword}
        r = reques…

posted @ 2020-06-04 10:15 udbful Views (195) Comments (0) Recommended (0)

7 Example 1: Crawling a JD.com product page

Summary:

    """Example 1: Crawling a JD.com product page"""
    import requests

    url = "https://item.jd.com/100012545852.html"
    try:
        # Change the header information
        kv = {'user-agent': 'Mozilla/5.0'}
        …

posted @ 2020-06-04 10:05 udbful Views (252) Comments (0) Recommended (0)

6 Problems caused by web crawlers and the Robots protocol

Summary: 6 Problems caused by web crawlers and the Robots protocol

posted @ 2020-06-04 09:56 udbful Views (146) Comments (0) Recommended (0)

5 The main methods of the Requests library

Summary: The main methods of the Requests library

    """The main methods of the Requests library"""
    import requests

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://pyth…

posted @ 2020-06-04 09:53 udbful Views (140) Comments (0) Recommended (0)

4 The HTTP protocol and Requests library methods

Summary:

    """HTTP and requests library methods"""
    import requests

    # requests head() method: fetch the header information
    r = requests.head("http://httpbin.org/get")

    print(r.headers)
    pr…

posted @ 2020-06-04 09:49 udbful Views (189) Comments (0) Recommended (0)

3 A general code framework for crawling web pages

Summary: A general code framework for crawling web pages

    """General code framework"""
    import requests

    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # If the status code…

posted @ 2020-06-04 09:24 udbful Views (139) Comments (0) Recommended (0)

2 The get() method of the Requests library

Summary: The get() method of the Requests library

    """2 The requests get method"""
    import requests

    url = "https://www.baidu.com/"
    r = requests.get(url)
    # 200
    print(r.status_code)
    # <class 'requ…

posted @ 2020-06-04 09:21 udbful Views (171) Comments (0) Recommended (0)

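1 Installing the requests library

Summary: Installing the requests library: pip install requests (the requests library is already included in Anaconda's third-party libraries). Test:

    """requests installation test"""
    import requests

    r = requests.get("https://www.baidu.com/")
    # Output…

posted @ 2020-06-04 09:17 udbful Views (274) Comments (0) Recommended (0)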
