
Python Crawler All In One

Python Spider / Python Crawler

# coding:utf8

__author__ = 'xgqfrms'
__editor__ = 'vscode'
__version__ = '1.0.1'
__copyright__ = """
  Copyright (c) 2012-2050, xgqfrms; mailto:xgqfrms@xgqfrms.xyz
"""

# import whole modules
import logging

import url_manager
import html_downloader
import html_parser
import html_outputer
# import url_manager, html_downloader, html_parser, html_outputer

# import individual modules from a package
# from spider import url_manager
# from spider import html_downloader
# from spider import html_parser
# from spider import html_outputer

# import group modules
# from spider import url_manager, html_downloader, html_parser, html_outputer

# from url_manager import UrlManager
# from html_downloader import HtmlDownloader
# from html_parser import HtmlParser
# from html_outputer import HtmlOutputer


class SpiderMain(object):
  # constructor: initialize the four components
  def __init__(self):
    # URL manager
    self.urls = url_manager.UrlManager()
    # HTML downloader
    self.downloader = html_downloader.HtmlDownloader()
    # HTML parser
    self.parser = html_parser.HtmlParser()
    # HTML output collector
    self.outputer = html_outputer.HtmlOutputer()

  # main crawl loop
  def craw(self, root_url):
    # page counter
    count = 1
    self.urls.add_new_url(root_url)
    while self.urls.has_new_url():
      try:
        new_url = self.urls.get_new_url()
        print('craw %d: %s' % (count, new_url))
        html_cont = self.downloader.download(new_url)
        new_urls, new_data = self.parser.parse(new_url, html_cont)
        self.urls.add_new_urls(new_urls)
        self.outputer.collect_data(new_data)
        print(new_data['title'])
        # stop after 10 pages
        if count == 10:
          break
        count += 1
      # legacy Python 2 syntax: except Exception, e:
      except Exception as e:
        logging.exception(e)
        print('craw failed: %s' % e)
    self.outputer.output_html()
    print('finished')

# main: the program entry point
if __name__ == '__main__':
  # instantiate the spider
  obj_spider = SpiderMain()
  root_url = 'http://baike.baidu.com/view/21087.htm'
  # start the crawler
  obj_spider.craw(root_url)

PyCharm CE bug (screenshot)

A simple crawler architecture

Crawler scheduler: starts the crawler.

The program has three modules:

  1. The URL manager keeps track of URLs that have and have not yet been crawled.
  2. The URL manager hands each pending URL to the page downloader, which fetches it and passes the page to the page parser as a string.
  3. The parser extracts the valuable data, and any new URLs it finds are fed back into the URL manager, forming a loop.

Crawler runtime flow

URL manager

There are three ways to implement the URL manager (a minimal in-memory sketch follows this list):

  1. Small crawlers keep URLs in memory; a set() removes duplicates from a list.
  2. For persistent storage, use a relational database such as MySQL.
  3. Large companies use a cache database such as Redis for performance.
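Below is a minimal in-memory sketch of the first option. The class and method names follow the url_manager calls in the code above; the two-set layout (pending vs. crawled) is an assumption about the course's implementation, not its actual source.

# url_manager.py -- in-memory URL manager (a sketch)
class UrlManager(object):
  def __init__(self):
    self.new_urls = set()  # URLs waiting to be crawled
    self.old_urls = set()  # URLs already crawled

  def add_new_url(self, url):
    # ignore None and anything already queued or crawled
    if url is None:
      return
    if url not in self.new_urls and url not in self.old_urls:
      self.new_urls.add(url)

  def add_new_urls(self, urls):
    if urls is None:
      return
    for url in urls:
      self.add_new_url(url)

  def has_new_url(self):
    return len(self.new_urls) != 0

  def get_new_url(self):
    # pop a pending URL and mark it as crawled
    new_url = self.new_urls.pop()
    self.old_urls.add(new_url)
    return new_url

Swapping the two sets for MySQL tables or Redis sets gives options 2 and 3 without changing this interface.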

Web page downloader

The difference between urllib and urllib2 in Python

https://www.cnblogs.com/yuxc/archive/2011/08/01/2124073.html

https://www.zhihu.com/question/27621722

# coding:utf8
# Python 2 example: urllib2 and cookielib (see the Python 3 version below)
import urllib2
import cookielib

url = "http://www1.szu.edu.cn"

# method 1: plain urlopen
print("method 1")
response1 = urllib2.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# method 2: Request object with a custom User-Agent header
print("method 2")
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# method 3: opener with cookie handling
print("method 3")
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())
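In Python 3, urllib2 was merged into urllib.request and cookielib was renamed http.cookiejar. A sketch of the same three methods under Python 3, using the same test URL:

# coding: utf-8
# Python 3: urllib2 -> urllib.request, cookielib -> http.cookiejar
import urllib.request
import http.cookiejar

url = "http://www1.szu.edu.cn"

# method 1: plain urlopen
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# method 2: Request object with a custom User-Agent header
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# method 3: opener with cookie handling
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)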

Beyond plain downloads, the downloader also needs to handle HTTP request headers, HTTPS, proxies, and cookies; a proxy sketch follows.
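A sketch of routing requests through an HTTP proxy with urllib.request.ProxyHandler; the proxy address here is a placeholder, not a working server:

import urllib.request

# send traffic through an HTTP/HTTPS proxy (placeholder address)
proxy_handler = urllib.request.ProxyHandler({
  'http': 'http://127.0.0.1:8080',
  'https': 'http://127.0.0.1:8080',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www1.szu.edu.cn')
print(response.getcode())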

Common Python web page parsers

Of the common options, the first (regular expressions) does fuzzy string matching, while the other three do structured parsing. This course uses the third-party library Beautiful Soup as its parser; a regex sketch is shown below for comparison.
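A minimal sketch of the fuzzy-matching approach, pulling links out of a page with re; the pattern and the sample HTML are illustrative only:

# coding: utf-8
import re

html = '<a href="/view/21087.htm">Python</a> <a href="/view/5.htm">Home</a>'

# fuzzy matching: treat the page as a plain string and capture href values
links = re.findall(r'href="(/view/\d+\.htm)"', html)
print(links)  # ['/view/21087.htm', '/view/5.htm']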

Structured parsing builds and walks the HTML DOM tree:

https://www.crummy.com/software/BeautifulSoup/

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html

https://www.crummy.com/software/BeautifulSoup/bs4/download/

from bs4 import BeautifulSoup

# html_doc is the HTML string to parse, e.g. a downloaded page
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
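A sketch of an html_parser module built on Beautiful Soup, matching the parse(new_url, html_cont) -> (new_urls, new_data) interface that SpiderMain expects; the /view/...htm link pattern and the h1 title selector are assumptions about the Baike page layout:

# html_parser.py -- Beautiful Soup based parser (a sketch)
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class HtmlParser(object):
  def parse(self, page_url, html_cont):
    if page_url is None or html_cont is None:
      return None, None
    soup = BeautifulSoup(html_cont, 'html.parser')
    return self._get_new_urls(page_url, soup), self._get_new_data(page_url, soup)

  def _get_new_urls(self, page_url, soup):
    # collect links that look like /view/123.htm (assumed URL pattern)
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r'/view/\d+\.htm')):
      # resolve each relative href against the current page URL
      new_urls.add(urljoin(page_url, link['href']))
    return new_urls

  def _get_new_data(self, page_url, soup):
    # the h1 title selector is an assumption about the page layout
    title_node = soup.find('h1')
    title = title_node.get_text() if title_node is not None else ''
    return {'url': page_url, 'title': title}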


$ pip install beautifulsoup4

$ python -m pip install --upgrade pip

Setting the Python Script template content in PyCharm, so that a # coding: utf8 header is added automatically when a .py file is created:

File > Settings > Editor > File and Code Templates > Python Script
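A possible template body; PyCharm file templates expand Velocity variables such as ${USER} and ${DATE}, and the exact fields below are just one choice:

# coding: utf-8
# @author: ${USER}
# @date: ${DATE}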

Crawl strategies

Scrapy

Scrapy 2.3.0

$ pip install scrapy

web spiders

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.


https://scrapy.org/
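A minimal Scrapy spider sketch; quotes.toscrape.com is Scrapy's own tutorial target, and the CSS selectors follow that page's markup:

# quotes_spider.py -- run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
  name = 'quotes'
  start_urls = ['http://quotes.toscrape.com/']

  def parse(self, response):
    # yield one item per quote block on the page
    for quote in response.css('div.quote'):
      yield {
        'text': quote.css('span.text::text').get(),
        'author': quote.css('small.author::text').get(),
      }
    # follow the pagination link, if any
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
      yield response.follow(next_page, self.parse)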

ref

https://gaocegege.com/Blog/spider/scrapy

refs


https://www.cnblogs.com/xgqfrms/tag/Spider/

https://www.imooc.com/learn/563

Python Language Programming

https://www.icourse163.org/course/BIT-268001

Python Web Crawler and Information Extraction

https://www.icourse163.org/course/BIT-1001870001

https://www.youtube.com/watch?v=ZMjhBB17KVY

https://python123.io/index/courses/3044

https://python123.io/index/opencourses

https://www.runoob.com/w3cnote/python-spider-intro.html

Hands-on crawler practice

https://www.imooc.com/video/10688/0



