Python Web Scraping 1 - Post Category - udbful

Summary: Covers the Request, Response, and Item classes, plus the information-extraction methods that Scrapy spiders support: Beautiful Soup, lxml, re, XPath Selector, CSS Selector, and others.

posted @ 2020-06-07 15:53 udbful

Summary: 1. Common Scrapy commands. 2. Creating a first project (https://docs.scrapy.org/en/latest/intro/tutorial.html). Step 1, create a Scrapy project: scrapy startproject python123demo. The command creates a python123dem…

posted @ 2020-06-07 15:28 udbful

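22 Introduction to the Scrapy Framework

Summary: 1. The "5+2" structure. Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them. Spider: processes all Responses, parses them to extract the data that the Item fields need, and submits the URLs to follow to the Engine, which then re-enter the S…

posted @ 2020-06-07 12:30 udbful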
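
The data flow outlined in this summary can be pictured with a toy model. The sketch below is not Scrapy code and calls no Scrapy API: `FAKE_WEB`, `download`, `parse`, and `run_engine` are hypothetical stand-ins for the Downloader, Spider, and Engine, a `deque` plays the Scheduler, and a plain list plays the Item Pipeline. It assumes only the loop the summary describes, in which the Spider returns both items and follow-up URLs.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", []),
}


def download(url):
    """Downloader stand-in: return a 'response' dict for a URL."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider stand-in: yield one item, then the URLs to follow."""
    yield {"type": "item", "url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield {"type": "request", "url": link}


def run_engine(start_url):
    """Engine stand-in: pass data between scheduler, downloader, spider, pipeline."""
    scheduler = deque([start_url])  # Scheduler: queue of pending URLs
    seen = {start_url}              # never schedule the same URL twice
    items = []                      # Item Pipeline stand-in: collected items
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if result["type"] == "item":
                items.append(result)
            elif result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])
    return items


items = run_engine("http://example.com/")
titles = [item["title"] for item in items]
print(titles)
```

In real Scrapy the two middleware layers sit between these components; the toy model leaves them out.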

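21 Installing the Scrapy Framework

Summary: pip install scrapy (Scrapy is not among the third-party libraries bundled with Anaconda, so you have to install it yourself). Test: scrapy -h. The following indicates that the installation succeeded…

posted @ 2020-06-07 11:36 udbful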

19 Regular Expression Basics

Summary: 1. Basic syntax. 2. The re library. 3. For more, see "Python Regular Expressions": https://i.cnblogs.com/posts?cateId=1775942

posted @ 2020-06-05 21:43 udbful
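
As a small illustration of the re library mentioned in this summary, the sketch below uses four of its standard functions on a made-up string. The sample text and the patterns are assumptions chosen for the example and are not taken from the post.

```python
import re

# Made-up sample text for the example.
text = "Servers: 10.0.0.1 (up since 2019-03-05), 192.168.1.20 (up since 2020-06-05)"

# findall returns every non-overlapping match as a list of strings.
ips = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", text)

# search returns the first match; groups() gives the parenthesised parts.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year, month, day = m.groups()

# sub replaces every match; here each date is masked.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)

# compile builds a reusable pattern object with the same methods.
date_re = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = date_re.findall(text)

print(ips, (year, month, day), dates)
```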

18 Example: A Focused Crawler for Chinese University Rankings

Summary: 1. Functional description and program design. 2. Code implementation:

    """Example: a focused crawler for Chinese university rankings"""

    import requests
    from bs4 import BeautifulSoup
    import bs4


    def getHTMLTest(url):
        try:
            r…

posted @ 2020-06-05 20:42 udbful

17 HTML Content Search Methods in the bs4 Library

Summary: 1. Examples of the find_all() method:

    """HTML content search methods in the bs4 library"""
    import requests
    from bs4 import BeautifulSoup
    import re

    url = "https://python123.io/ws/demo.html"
    r = reques…

posted @ 2020-06-05 16:13 udbful
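
The excerpt is cut off before any find_all() call appears. As a minimal sketch of the ways find_all() can filter, the example below parses a small inline page instead of fetching the demo URL; the HTML string is a hypothetical stand-in and not the content of python123.io/ws/demo.html.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in page, so the example runs without a network request.
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="title"><b>Sample heading</b></p>
<p class="course">See
<a href="http://example.com/basic" class="py1" id="link1">Basic</a> and
<a href="http://example.com/advanced" class="py2" id="link2">Advanced</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# By tag name: every <a> tag, in document order.
hrefs = [a.get("href") for a in soup.find_all("a")]

# By a list of names: <a> or <b> tags.
names = [tag.name for tag in soup.find_all(["a", "b"])]

# By regular expression on the tag name: names starting with "b".
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# By attribute: a CSS class as the second argument, or id matched by a regex.
course = soup.find_all("p", "course")
links = soup.find_all(id=re.compile("^link"))

# By text content.
strings = soup.find_all(string="Basic")

print(hrefs, names, b_names, len(course), len(links), strings)
```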

16 Information Markup Formats and General Methods of Information Extraction

Summary:

    """General methods of information extraction"""
    import requests
    from bs4 import BeautifulSoup

    url = "https://python123.io/ws/demo.html"
    r = requests.get(url)
    demo = r.text
    soup = Bea…

posted @ 2020-06-05 00:50 udbful

15 HTML Formatting and Encoding with the bs4 Library

Summary: 1. Formatting mainly uses the prettify() method:

    """HTML formatting with the bs4 library"""
    import requests
    from bs4 import BeautifulSoup

    # Method 1: downward traversal
    url = "https://python123.io/ws/demo.html"
    r = reques…

posted @ 2020-06-05 00:17 udbful

14 HTML Content Traversal Methods in the bs4 Library

Summary: https://python123.io/ws/demo.html

    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces s…

posted @ 2020-06-04 23:17 udbful

13 Basic Elements of the Beautiful Soup Library

Summary: Example:

    """Basic elements of the Beautiful Soup library"""
    import requests
    from bs4 import BeautifulSoup
    url = "https://python123.io/ws/demo.html"
    r = requests.get(url)
    demo = r.text
    s…

posted @ 2020-06-04 22:20 udbful
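
The excerpt stops before it reaches the elements themselves. Assuming the title refers to the object types that Beautiful Soup exposes (Tag, its name and attributes, NavigableString, and Comment), a minimal sketch on an inline snippet looks like this; the HTML string is a hypothetical stand-in, not the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, so the example runs without a network request.
html = '<p class="title" id="top"><b>Bold text</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag: the first <p> element
tag_name = tag.name          # Name: the tag's name as a string
tag_attrs = tag.attrs        # Attributes: a dict; class is multi-valued, so a list
bold_string = soup.b.string  # NavigableString: the text inside <b>
comment = tag.contents[1]    # Comment: the second child of <p>

print(tag_name, tag_attrs, bold_string, repr(comment))
```

Comment is a subclass of NavigableString, so code that needs to tell the two apart has to check the type explicitly, as with `isinstance(node, Comment)`.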

12 Installing the Beautiful Soup Library

Summary: Installing the Beautiful Soup library: pip install beautifulsoup4 (Beautiful Soup is already included among the third-party libraries bundled with Anaconda). Test:

    """BeautifulSoup installation test"""


    import requests
    from bs4 import Bea…

posted @ 2020-06-04 17:02 udbful

11 Example 5: Automatic Lookup of IP Address Locations

Summary: Automatic lookup of IP address locations:

    """IP address location lookup"""


    import requests

    # url = "http://m.ip138.com/ip.asp?ip="
    url = "https://www.ip138.com/iplookup.asp?ip="
    try…

posted @ 2020-06-04 10:34 udbful

10 Example 4: Scraping Videos with Multithreading

Summary:

    """Scrape video data from Pear Video (梨视频) using multithreading"""
    """https://www.cnblogs.com/zivli/p/11614103.html"""


    import requests
    import re
    from lxml import etree
    from multipr…

posted @ 2020-06-04 10:25 udbful
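
The excerpt is truncated at its last import, so the post's own threading code is not visible. As a sketch of the general idea only, the example below runs a stand-in download function on a thread pool from the standard library's `concurrent.futures`, which may differ from what the post uses. The URLs and the `fetch` function are hypothetical, and nothing is actually requested.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical video URLs; no request is sent to them.
urls = [f"http://example.com/video/{i}.mp4" for i in range(5)]


def fetch(url):
    """Stand-in for a download: wait briefly, then return fake bytes."""
    time.sleep(0.05)  # simulates network latency
    return url, b"fake-bytes-for-" + url.encode()


# Four worker threads run fetch() concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))

sizes = {url: len(data) for url, data in results}
print(sizes)
```

Threads suit this kind of work because each download spends most of its time waiting on the network rather than using the CPU.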

9 Example 3: Scraping and Storing Web Images

Summary: Scraping and storing web images:

    """Scraping and storing web images"""


    import requests
    import os

    url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    r…

posted @ 2020-06-04 10:22 udbful
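
The storing step is not visible in the truncated excerpt. One common approach, sketched here as an assumption, is to name the file after the last segment of the URL path and write the bytes in binary mode. The `save_bytes` helper is hypothetical, and placeholder bytes stand in for the response body so that the example runs offline.

```python
import os
import tempfile
from urllib.parse import urlsplit

url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"


def save_bytes(url, data, root):
    """Save data under root, named after the last path segment of url."""
    filename = os.path.basename(urlsplit(url).path)
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, filename)
    if not os.path.exists(path):  # do not overwrite an existing file
        with open(path, "wb") as f:  # binary mode: image data is bytes
            f.write(data)
    return path


# Placeholder bytes stand in for r.content from a real response.
root = tempfile.mkdtemp()
path = save_bytes(url, b"\xff\xd8placeholder", root)
print(path)
```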

8 Example 2: Submitting Search Keywords to Baidu and 360

Summary:

    """Submitting a search keyword to Baidu"""


    import requests

    url = "https://www.baidu.com/s"
    keyword = "Python"  # Chinese keywords work too
    try:
        kv = {'wd': keyword}  # pass the variable, not the literal string 'keyword'
        r = reques…

posted @ 2020-06-04 10:15 udbful
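
To show what submitting the `wd` parameter amounts to, the sketch below builds the search URL offline with the standard library's `urlencode`. The base URL and the `wd` key come from the excerpt; the `search_url` helper is hypothetical. Requests performs a comparable encoding when a dict is passed as `params`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://www.baidu.com/s"


def search_url(keyword):
    """Build the search URL by encoding {'wd': keyword} as a query string."""
    return base + "?" + urlencode({"wd": keyword})


url_en = search_url("Python")
url_zh = search_url("爬虫")  # non-ASCII keywords are percent-encoded as UTF-8

# Decoding the query string recovers the original keyword.
decoded = parse_qs(urlsplit(url_zh).query)["wd"][0]

print(url_en)
print(url_zh)
```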

7 Example 1: Scraping a JD.com Product Page

Summary:

    """Example 1: scraping a JD.com product page"""


    import requests

    url = "https://item.jd.com/100012545852.html"
    try:
        # Override the request headers
        kv = {'user-agent': 'Mozilla/5.0'}
        …

posted @ 2020-06-04 10:05 udbful

6 Problems Caused by Web Crawlers and the Robots Protocol

Summary: 6 Problems Caused by Web Crawlers and the Robots Protocol

posted @ 2020-06-04 09:56 udbful
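
The summary gives no detail on how a crawler honors the Robots protocol. As a minimal sketch, the standard library's `urllib.robotparser` can evaluate robots.txt rules before a page is requested. The rules, crawler names, and URLs below are made up for the example, and the rules are parsed from a string so that nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real file lives at <site root>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "MyCrawler" has no entry of its own, so the "*" rules apply to it.
allowed_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")

# "BadBot" is disallowed everywhere.
allowed_badbot = parser.can_fetch("BadBot", "http://example.com/index.html")

print(allowed_public, allowed_private, allowed_badbot)
```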

5 The Main Methods of the Requests Library

Summary: The main methods of the Requests library:

    """The main methods of the Requests library"""


    import requests

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://pyth…

posted @ 2020-06-04 09:53 udbful

4 The HTTP Protocol and Requests Library Methods

Summary:

    """HTTP and Requests library methods"""


    import requests

    # requests.head(): fetch only the response headers
    r = requests.head("http://httpbin.org/get")

    print(r.headers)
    pr…

posted @ 2020-06-04 09:49 udbful
