Python Crawlers - Post Category - cltt

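Scrapy Crawler in Practice 1: Scraping Stock Data

Summary: Functional description. Get the stock list: http://quote.eastmoney.com/stock_list.html Get individual stock information: https://www.laohu8.com/stock/ Steps. Step 1: Create the project and Spider template > scrapy startproject laohustocks Read more

posted @ 2020-06-16 17:19 cltt Views(783) Comments(0) Recommended(0)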

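Basic Usage of Scrapy Crawlers

Summary: Steps for using a Scrapy crawler. Step 1: Create a project and Spider template. Step 2: Write the Spider. Step 3: Write the Item Pipeline. Step 4: Tune the configuration. Scrapy crawler data types: the Request class, the Response class, and the Item class. Request class: class scrapy.http.Reques Read more

posted @ 2020-06-15 10:03 cltt Views(171) Comments(0) Recommended(0)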

Common Scrapy Problems

Summary: scrapy -h: this problem occurs because the installed attrs version is too old. Fix: pip3 install attrs==19.2.0 -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com Read more

posted @ 2020-06-09 22:41 cltt Views(164) Comments(0) Recommended(0)

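Using the yield Keyword

Summary: Using the yield keyword. yield: generators. A generator is a function that keeps producing values. A function containing a yield statement is a generator. Each time the generator produces a value (at the yield statement), the function is frozen; when it is woken up, it produces the next value. Generator syntax: def gen(n): for i in range(n): yield i**2 for i Read more

posted @ 2020-06-09 11:09 cltt Views(168) Comments(0) Recommended(0)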
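
The summary of this post is cut off partway through its example, so the following is only a minimal sketch of the idea it describes, not the post's full code. The gen function comes from the summary; squares_list is a hypothetical helper added here for comparison with an ordinary list-building function.

```python
def gen(n):
    """Generator from the summary: yields one square at a time."""
    for i in range(n):
        # Execution pauses here after each value and resumes on the next request.
        yield i ** 2


def squares_list(n):
    """Hypothetical non-generator equivalent: builds the whole list up front."""
    return [i ** 2 for i in range(n)]


# Consuming the generator produces the values lazily, one per iteration.
squares = list(gen(5))
print(squares)  # [0, 1, 4, 9, 16]
```

The generator version holds only one value in memory at a time, which matters when n is large.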

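The Scrapy Crawler Framework

Summary: The "5+2" structure. The Scrapy crawler framework explained. Engine module (no user modification needed): controls the data flow between all modules and triggers events based on conditions. Downloader module (no user modification needed): downloads web pages according to requests. Scheduler module (no user modification needed): schedules and manages all crawl requests. Downloader Middleware Read more

posted @ 2020-06-01 14:05 cltt Views(208) Comments(0) Recommended(0)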

Stock Data Crawler

Summary: Laohu community: 'https://www.laohu8.com/stock/' (Baidu Stocks no longer works) import requests import re from bs4 import BeautifulSoup import collections import traceback def getHtmlTe Read more

posted @ 2020-05-31 15:06 cltt Views(422) Comments(2) Recommended(0)

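Practice 7: Targeted Crawler for Taobao Product Information

Summary: import requests import re def getHTMLText(url): try: # Taobao uses anti-crawler measures; a cookie must be supplied so that the request is treated as coming from a user headers = { "user-agent": "Mozilla/5.0", "cookie": "miid=16121344 Read more

posted @ 2020-05-21 12:20 cltt Views(1009) Comments(0) Recommended(0)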

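Regular Expressions

Summary: Regular expression: regular expression, regex, RE. A regular expression is an expression that concisely describes a set of strings. It is a general framework for expressing strings and a tool that captures the ideas of "conciseness" and "characteristics" for strings. It determines whether a given string has a given characteristic. Regular expressions are very widely used in text processing. Expressing the characteristics of text types (viruses, intrusions, etc.) Read more

posted @ 2020-05-21 08:37 cltt Views(201) Comments(0) Recommended(0)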
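
To illustrate the idea in the summary above (one expression standing for a whole set of strings, used to decide whether a string belongs to that set), here is a small sketch using Python's standard re module. The pattern and the candidate strings are made up for illustration and do not come from the post.

```python
import re

# 'PY+' describes the infinite set 'PY', 'PYY', 'PYYY', ...:
# a 'P' followed by one or more 'Y'.
pattern = re.compile(r"PY+")

# Illustrative inputs (assumed, not from the post).
candidates = ["P", "PY", "PYY", "PYYYY", "PYX"]

# fullmatch() succeeds only if the entire string fits the pattern,
# which is the "does this string belong to the set" test.
matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)  # ['PY', 'PYY', 'PYYYY']
```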

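Practice 6: Chinese University Rankings

Summary: Functional description. Input: the URL of the university ranking page. Output: university ranking information printed to the screen (rank, university name, total score). Technical approach: requests-bs4. Targeted crawler: crawls only the input URL and does not follow further links. Program structure. Step 1: Fetch the university ranking page from the web, getHTMLText(). Step 2: Extract the information from the page content into a suitable data structure Read more

posted @ 2020-05-19 11:28 cltt Views(203) Comments(0) Recommended(0)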

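Information Markup

Summary: Information markup in HTML: HTML organizes different types of information using predefined <>...</> tags. Three forms of information markup: XML, JSON, YAML. XML. JSON. Subkeys are shown as follows: JSON example. YAML. YAML: multi-line text. In summary, there are the following kinds. YAML example. Comparison of the three forms of information markup. XML: the earliest general-purpose information Read more

posted @ 2020-05-18 22:31 cltt Views(312) Comments(0) Recommended(0)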
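
The examples in this post are not visible in the summary, so the sketch below is an independent illustration of two of the three forms it lists (JSON and XML), using only the Python standard library. YAML is left out because the standard library has no YAML parser. The record, its field names, and the element names are all invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative record (assumed data, not from the post).
record = {
    "name": "Example University",
    "address": {"city": "Beijing", "zipcode": "100081"},
}

# JSON: typed key/value pairs; nesting is expressed with braces.
json_text = json.dumps(record, indent=2)
restored = json.loads(json_text)

# XML: the same information expressed with paired tags.
root = ET.Element("university")
ET.SubElement(root, "name").text = record["name"]
address = ET.SubElement(root, "address")
ET.SubElement(address, "city").text = record["address"]["city"]
ET.SubElement(address, "zipcode").text = record["address"]["zipcode"]
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)
```

Both texts carry the same information; the JSON form round-trips directly to Python dictionaries, while the XML form is read back by navigating elements.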

Beautiful Soup

Summary: Beautiful Soup: parsing HTML pages. Information markup and extraction methods. Getting the page source: import requests from bs4 import BeautifulSoup kv = {'user-agent':'Mozilla/5.0'} url = "https://python123.io/w Read more

posted @ 2020-05-17 22:37 cltt Views(367) Comments(0) Recommended(0)

Example 5: Automatic Lookup of IP Address Locations

Summary: # full code for the IP lookup import requests import time url='http://www.ip138.com/ips138.asp?ip=202.204.80.112' r = requests.get(url) print(r.status_code) print(r.reques Read more

posted @ 2020-05-17 22:14 cltt Views(1865) Comments(0) Recommended(1)

Example 4: Crawling and Storing Web Images

Summary: Format of a web image link: http://www.example.com/picture.jpg Image crawling code: import requests import os #url = 'https://image.baidu.com/search/detail?ct=503316480&z=&tn=baiduim Read more

posted @ 2020-05-17 17:18 cltt Views(430) Comments(0) Recommended(0)

Example 3: Submitting Search Keywords to Baidu and 360

Summary: Baidu search: import requests keyword = 'Python' try: kv = {'wd':keyword} r = requests.get('http://www.baidu.com/s',params=kv) print(r.request.url) r.raise_for Read more

posted @ 2020-05-17 16:34 cltt Views(1255) Comments(0) Recommended(0)
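
The summary above submits a keyword by passing params=kv to requests.get. The sketch below shows, without any network access, roughly what query URL such a parameter dictionary is expected to turn into. It uses the standard library's urlencode rather than requests, and build_search_url is a hypothetical helper introduced here, not a function from the post.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Hypothetical helper: append URL-encoded query parameters to base_url."""
    return f"{base_url}?{urlencode(params)}"


keyword = "Python"
kv = {"wd": keyword}  # 'wd' is the query key used in the summary above

search_url = build_search_url("http://www.baidu.com/s", kv)
print(search_url)  # http://www.baidu.com/s?wd=Python
```

Keywords containing spaces or non-ASCII characters are percent-encoded by urlencode, so the same helper should work for them as well.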

Crawler Practice 2: Amazon

Summary: import requests r= requests.get('https://www.amazon.cn/dp/B01MYH8A99') print(r.status_code) r.encoding = r.apparent_encoding print(r.text) print(r.req Read more

posted @ 2020-05-17 11:58 cltt Views(429) Comments(0) Recommended(0)

Crawler Practice 1: JD.com

Summary: url="https://item.jd.com/100012881854.html" kv = {'user-agent':'Mozilla/5.0'} r = requests.get(url,headers = kv) print(r.status_code) print(r.encoding Read more

posted @ 2020-05-17 11:51 cltt Views(488) Comments(0) Recommended(1)

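Problems Caused by Crawlers

Summary: Restricting crawlers: source review; published notices. Robots protocol example. Basic syntax of the Robots protocol. The robots file is always in the site's root directory. How to comply with the Robots protocol. Using a web crawler: identify robots.txt automatically or manually, then crawl the content. Binding force. How to comply Read more

posted @ 2020-05-17 11:38 cltt Views(186) Comments(0) Recommended(0)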
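
The summary above mentions identifying robots.txt automatically before crawling. One way to do that, sketched here under assumptions, is the standard library's urllib.robotparser. The rules, the crawler name "MyCrawler", and the example.com URLs are all invented for illustration; a real crawler would load the site's own robots.txt from its root directory instead of an inline string.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (assumed, not taken from any real site).
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse rules given as lines; no network access needed

# Ask whether a hypothetical crawler may fetch each URL.
allowed = parser.can_fetch("MyCrawler", "https://www.example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "https://www.example.com/private/data.html")

print(allowed)  # True
print(blocked)  # False
```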

Introduction to requests

Summary: import requests r = requests.get('http://www.baidu.com') print(r.status_code) r.encoding = 'utf-8' # otherwise the text is garbled print(r.text) 200<!DOCTYPE html><!--STATUS OK Read more

posted @ 2020-05-17 09:05 cltt Views(274) Comments(0) Recommended(0)
