晨星_star

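Hand-Building a Distributed Crawler Framework

Summary: Hand-building a distributed crawler framework. Distributed crawler: covers distributed processes and inter-process communication. Case study: crawl the title, summary, links, and other information for 2,000 Baidu Baike entries on "web crawler" and related entries, rewriting the basic crawler with a distributed architecture to make it more capable. Crawler structure, mode: the distributed crawler uses a master-slave mode, in which one host acts as the control node, responsible for all running web ...

posted @ 2020-11-08 11:07 晨星_star Views(215) Comments(0) Recommended(0)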
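
The master-slave idea in the entry above can be sketched on a single machine. This is only an illustration, not the post's code: threads and `queue.Queue` stand in for networked processes, and the names `control_node`, `worker_node`, and the fake "title" result are assumptions made for the example.

```python
import queue
import threading


def worker_node(task_q, result_q):
    """Worker (slave) node: take URLs from the task queue until told to stop."""
    while True:
        url = task_q.get()
        if url is None:  # sentinel sent by the control node: no more work
            break
        # Placeholder for download + parse; a real node would fetch the page here.
        result_q.put({"url": url, "title": "title of " + url})


def control_node(urls, n_workers=2):
    """Control (master) node: hand out URLs, then collect what the workers return."""
    task_q, result_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker_node, args=(task_q, result_q))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for url in urls:
        task_q.put(url)
    for _ in workers:  # one stop sentinel per worker
        task_q.put(None)
    for w in workers:
        w.join()

    results = []
    while not result_q.empty():
        results.append(result_q.get())
    # Workers finish in no fixed order, so sort for a stable result.
    return sorted(results, key=lambda r: r["url"])


if __name__ == "__main__":
    for row in control_node(["/item/a", "/item/b", "/item/c"]):
        print(row)
```

In a real deployment the two queues would be shared across machines (for example through `multiprocessing.managers.BaseManager`), but the task-queue and result-queue split stays the same.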

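Components of a Basic Crawler Framework

Summary: Basic crawler framework. Crawler scheduler. Base modules: the URL manager, HTML downloader, HTML parser, data store, and related modules. Scheduler: initializes each module, then receives the entry URL through the crawl(root_url) method, which drives each module according to the crawl workflow. Spider scheduling: from firstSpider.D ...

posted @ 2020-11-08 10:05 晨星_star Views(301) Comments(0) Recommended(0)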
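
A minimal sketch of the scheduler flow described in the entry above, assuming a simplified design: only `crawl(root_url)` comes from the post, while `UrlManager`, `SpiderScheduler`, the regex-based parser, and the in-memory `PAGES` "downloader" are hypothetical stand-ins so the example runs without network access.

```python
import re


class UrlManager:
    """Tracks URLs still to crawl and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


class SpiderScheduler:
    """Initializes the modules and drives them from crawl(root_url)."""

    def __init__(self, download, max_pages=10):
        self.manager = UrlManager()
        self.download = download  # callable: url -> HTML string or None
        self.max_pages = max_pages
        self.records = []  # stands in for the data store

    def parse(self, html):
        # Simplified parser: pull out href targets and the <title> text.
        links = re.findall(r'href="([^"]+)"', html)
        match = re.search(r"<title>(.*?)</title>", html)
        return links, (match.group(1) if match else "")

    def crawl(self, root_url):
        self.manager.add_new_url(root_url)
        while (self.manager.has_new_url()
               and len(self.manager.old_urls) < self.max_pages):
            url = self.manager.get_new_url()
            html = self.download(url)
            if html is None:
                continue
            links, title = self.parse(html)
            for link in links:
                self.manager.add_new_url(link)
            self.records.append({"url": url, "title": title})
        return self.records


# Fake pages so the sketch needs no network; a real downloader would fetch HTML.
PAGES = {
    "/a": '<title>A</title><a href="/b">b</a>',
    "/b": '<title>B</title><a href="/a">a</a>',
}

if __name__ == "__main__":
    print(SpiderScheduler(PAGES.get).crawl("/a"))
```

Passing the downloader in as a callable keeps the scheduler testable; swapping `PAGES.get` for a function that performs an HTTP request would not change `crawl`.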

Parsing with soup

Summary: BeautifulSoup data parsing and extraction. soup = BeautifulSoup(html_str, 'lxml', from_encoding='utf-8') soup = BeautifulSoup(open('index.html')) print(soup.prettify()) # 输 ...

posted @ 2020-11-08 09:37 晨星_star Views(541) Comments(0) Recommended(0)

Using tesserocr

Summary: Using tesserocr. Simple recognition: import tesserocr from PIL import Image image = Image.open('code.jpg') result = tesserocr.image_to_text(image) print(result) Interference from extra lines ...

posted @ 2020-09-30 23:07 晨星_star Views(475) Comments(0) Recommended(0)

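Installing tesserocr

Summary: About tesserocr: tesserocr is an OCR library for Python, but it is really a Python API wrapper around tesseract, so tesseract is its core. Before installing tesserocr, we therefore need to install tesseract first. CAPTCHAs: these can, through OC ...

posted @ 2020-09-30 22:53 晨星_star Views(194) Comments(0) Recommended(0)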

Python Environment Setup

Summary: Installing Python. Windows: download from https://www.python.org/downloads anaconda: https://www.continuum.io/downloads / https://mirrors.tuna.tsinghua.edu.cn/anacond ...

posted @ 2020-09-28 10:20 晨星_star Views(272) Comments(0) Recommended(0)

Crawling Stocks with scrapy

Summary: Crawling stocks with scrapy. stock.py # -*- coding: utf-8 -*- import scrapy from items import StockstarItem, StockstarItemLoader class StockSpider(scrapy.Spider): name ...

posted @ 2020-09-27 16:28 晨星_star Views(502) Comments(0) Recommended(0)

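Optimizing a Crawler with Multiprocessing

Summary: Crawler optimization with multiple processes. Multiprocessing: from qunar import get_all_data from qunar import dep_list from multiprocessing import Pool # multiprocessing if __name__ == "__main__": pool=Pool() ...

posted @ 2020-09-27 16:18 晨星_star Views(195) Comments(0) Recommended(0)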
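
The summary above is cut off right after `pool=Pool()`, so the following is only a guess at the general pattern, not the post's code. The `qunar` module is not available here; `dep_list` and `get_all_data` below are hypothetical stand-ins with made-up contents.

```python
from multiprocessing import Pool

# Hypothetical stand-in for qunar.dep_list: the departure cities to crawl.
dep_list = ["Beijing", "Shanghai", "Guangzhou"]


def get_all_data(dep):
    """Hypothetical stand-in for qunar.get_all_data.

    A real version would crawl every page for one departure city; this
    placeholder returns a small summary so the sketch runs offline.
    """
    return {"dep": dep, "name_length": len(dep)}


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module,
    # and without it they would try to create pools of their own.
    with Pool() as pool:
        # map() splits dep_list across the worker processes and
        # returns the results in the original order.
        results = pool.map(get_all_data, dep_list)
    for row in results:
        print(row)
```

Because each city is crawled independently, `pool.map` is enough; nothing has to be shared between the worker processes.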

Crawler Monitoring

Summary: Data monitoring: import requests import urllib import time import pymongo # must be written outside, otherwise it cannot be imported client=pymongo.MongoClient('localhost',27017) book_qunar=client['qunar' ...

posted @ 2020-09-27 16:12 晨星_star Views(270) Comments(0) Recommended(0)

Crawling Qunar with selenium

Summary: Crawling Qunar with selenium. import requests import urllib.request import time import random from selenium import webdriver from selenium.webdriver.common.by import ...

posted @ 2020-09-27 16:07 晨星_star Views(265) Comments(0) Recommended(0)