Python Web Crawling for Beginners (1): Setting Up the Development Environment

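A web crawler is an automated program that imitates what a browser does: it sends network requests, receives the site's responses, and extracts information from them. In principle, anything a browser (the client) can do, a crawler can do as well.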
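To make the request/response cycle concrete, here is a minimal sketch using only the Python standard library. So that it runs without internet access, it starts a throwaway local server and fetches a page from it. `DemoHandler`, `fetch`, `run_demo`, and the `demo-crawler/0.1` User-Agent string are names made up for this illustration, not part of any library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Serves one fixed HTML page so the example needs no internet access."""

    def do_GET(self):
        body = b"<html><head><title>Demo page</title></head></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch(url, user_agent="demo-crawler/0.1"):
    """Send a GET request with a User-Agent header; return (status, text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # An empty ProxyHandler ignores any system proxy settings for this demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


def run_demo():
    """Start the local server, fetch its page once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch(f"http://127.0.0.1:{server.server_port}/")
    finally:
        server.shutdown()
        server.server_close()


status, html = run_demo()
print(status, html)
```

Against a real site, only the URL passed to `fetch` changes; the crawler still sends a request, reads the response, and then picks the information it wants out of the returned text.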
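In today's era of internet big data, we enjoy great convenience, but we also face an explosion of data online. Beyond web pages there are mobile apps such as WeChat, Weibo, and Douyin, which generate status updates numbering in the hundreds of millions every day, and a Baidu search for almost any term returns countless results. Most of that information, however, is useless noise. Filtering it by hand to find what matters is slow and tedious; with a crawler and a database, we can save exactly the data we care about and then filter out the useful information programmatically.
Many languages can be used to write crawlers, including PHP, Java, C#, C++, and Python. Python has a rich set of relevant libraries, which makes writing a crawler comparatively simple while still covering most needs.
Let's start with the basic setup for crawling with Python: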
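1. I have been using the Anaconda distribution of Python all along and will continue to use it here. For installation details, see "Python Basics for Beginners (1): Installation".
2. For convenience, I have gathered the Python libraries you are likely to need. Paste the following commands into the Anaconda Prompt and they will install one after another until everything is done.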

pip3 install urllib3 -i https://mirrors.aliyun.com/pypi/simple/
pip3 install requests -i https://mirrors.aliyun.com/pypi/simple/
pip3 install selenium -i https://mirrors.aliyun.com/pypi/simple/
pip3 install lxml -i https://mirrors.aliyun.com/pypi/simple/
pip3 install beautifulsoup4 -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pyquery -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pymysql -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pymongo -i https://mirrors.aliyun.com/pypi/simple/
pip3 install redis -i https://mirrors.aliyun.com/pypi/simple/
pip3 install flask -i https://mirrors.aliyun.com/pypi/simple/
pip3 install django -i https://mirrors.aliyun.com/pypi/simple/
pip3 install jupyter -i https://mirrors.aliyun.com/pypi/simple/
pip3 install docker -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scapy -i https://mirrors.aliyun.com/pypi/simple/
pip3 install spyder -i https://mirrors.aliyun.com/pypi/simple/
pip3 install matplotlib -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pandas -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scikit-learn -i https://mirrors.aliyun.com/pypi/simple/
pip3 install aiohttp -i https://mirrors.aliyun.com/pypi/simple/
pip3 install tesserocr -i https://mirrors.aliyun.com/pypi/simple/
pip3 install tornado -i https://mirrors.aliyun.com/pypi/simple/
pip3 install mitmproxy -i https://mirrors.aliyun.com/pypi/simple/
pip3 install Appium-Python-Client -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pyspider -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapy -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapy-splash -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapy-redis -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapyd -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapyd-client -i https://mirrors.aliyun.com/pypi/simple/
pip3 install python-scrapyd-api -i https://mirrors.aliyun.com/pypi/simple/
pip3 install scrapyrt -i https://mirrors.aliyun.com/pypi/simple/
pip3 install gerapy -i https://mirrors.aliyun.com/pypi/simple/
pip3 install pygame -i https://mirrors.aliyun.com/pypi/simple/
pip3 install nbconvert -i https://mirrors.aliyun.com/pypi/simple/

Note that a few of the tools covered below are not Python packages and are set up separately rather than through pip: ChromeDriver and GeckoDriver are browser driver executables, PhantomJS is a standalone headless browser whose development has been suspended, Charles is a desktop application, the Appium server is a separate install (the Appium-Python-Client package above is only its Python client), Splash is normally run as a Docker container, and redis-dump is a Ruby gem. mitmdump is a command that ships with mitmproxy, the docker package is the Python SDK rather than Docker itself, and tesserocr also needs the Tesseract OCR engine installed on the system.
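Once the commands finish, it can be handy to confirm what actually got installed. The following is a small sketch, not an official tool: `PACKAGES` and `check_installed` are names chosen for this example, and the mapping lists only a few of the libraries. Note that the name used with pip and the name used with `import` sometimes differ (for example `beautifulsoup4` is imported as `bs4`).

```python
import importlib.util

# pip package name -> module name used in `import`; the two sometimes differ.
PACKAGES = {
    "requests": "requests",
    "selenium": "selenium",
    "lxml": "lxml",
    "beautifulsoup4": "bs4",
    "pyquery": "pyquery",
    "pymysql": "pymysql",
    "pymongo": "pymongo",
    "redis": "redis",
    "scikit-learn": "sklearn",
}


def check_installed(packages):
    """Return {package_name: True/False} by looking each module up, not importing it."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in packages.items()
    }


if __name__ == "__main__":
    for name, ok in check_installed(PACKAGES).items():
        print(f"{name:<16} {'OK' if ok else 'MISSING'}")
```

Run it from the same Anaconda Prompt used for installation; anything reported as MISSING can be installed again on its own.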

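As for what each library does, they fall roughly into the following categories:

1) Request libraries: requests, Selenium, ChromeDriver, GeckoDriver, PhantomJS, and aiohttp. These are used to issue HTTP requests, either directly (requests, aiohttp) or by driving a browser (Selenium together with ChromeDriver, GeckoDriver, or PhantomJS).

2) Parsing libraries: lxml, BeautifulSoup, pyquery, and tesserocr. Once a page's source has been fetched, the useful information has to be extracted from it. lxml, BeautifulSoup, and pyquery provide powerful, convenient ways to do that, while tesserocr is an OCR library for reading text out of images, for example CAPTCHAs.

3) Databases: MySQL, MongoDB, and Redis. Storage is an essential part of any crawler, and these databases hold the useful information that has been extracted. For installation instructions, a quick web search will do.

4) Storage libraries: PyMySQL, PyMongo, redis-py, and RedisDump. Think of these as the interface between your Python program and MySQL, MongoDB, or Redis: the database provides the storage service and holds the data, and the storage library is what lets the two talk to each other.

5) Web libraries: Flask and Tornado. These are mainly used to build small API services for the crawler to call.

6) App libraries: Charles, mitmproxy, mitmdump, and Appium. Mobile apps also hold a huge amount of data, and a crawler can collect it too, either by intercepting the app's traffic with packet-capture tools (Charles, mitmproxy, mitmdump) or by automating the app itself (Appium).

7) Crawler frameworks: pyspider, Scrapy, Scrapy-Splash, Scrapy-Redis, and others. A framework cuts down the code you write and gives the project a clear architecture, so you only need to focus on the crawling logic.

8) Deployment libraries: Docker, Scrapyd, Scrapyd-Client, Scrapyd API, Scrapyrt, and Gerapy.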
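As a quick taste of what a parsing library from category 2 does, here is a minimal sketch with BeautifulSoup. The HTML snippet, the `#news` id, and the `extract_links` helper are invented for this example; it uses Python's built-in `html.parser` backend so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for HTML fetched by a request library.
HTML = """
<html><body>
  <ul id="news">
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every link inside the #news list."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("#news a")]


links = extract_links(HTML)
print(links)
```

In a real crawler the `html` argument would be the text of a response rather than a hard-coded string.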

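As a rough illustration of how a storage library from category 4 sits between the program and the database, the sketch below uses `sqlite3` from the standard library as a stand-in, since it needs no running server. This is a swap made purely for illustration: PyMySQL follows the same connect / cursor / execute pattern but uses `%s` placeholders instead of `?`. The `pages` table and the `save_items` helper are hypothetical.

```python
import sqlite3


def save_items(conn, items):
    """Insert scraped (title, url) pairs, skipping URLs that are already stored."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Parameterized query: never build SQL by concatenating scraped text.
    # "INSERT OR IGNORE" is SQLite syntax; MySQL spells it "INSERT IGNORE".
    cur.executemany(
        "INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)",
        [(url, title) for title, url in items],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, gone when the script ends
save_items(conn, [("First post", "/post/1"), ("Second post", "/post/2")])
save_items(conn, [("First post again", "/post/1")])  # duplicate URL, ignored
rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```

Using the URL as the primary key is one simple way to keep a crawler from storing the same page twice when it is run repeatedly.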
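Large-scale crawling calls for a distributed crawler. A distributed crawler runs on multiple hosts, each running several crawl tasks, yet there is really only one copy of the source code: the same code is deployed to every host, and the hosts work together. Scrapy has a companion component called Scrapyd; once it is installed, you can manage Scrapy tasks remotely, including deploying the source, starting tasks, and monitoring them. Scrapyd-Client and Scrapyd API make deployment and monitoring more convenient still.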
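Scrapyd exposes its remote management over a small HTTP JSON API. The sketch below builds the request for its `schedule.json` endpoint using only the standard library. It assumes Scrapyd is listening on its default port 6800 on localhost, and `myproject` / `myspider` are placeholder names for a project that has already been deployed. Only `build_schedule_request` is executed here; `schedule` is the part that would actually contact the server.

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # assumption: Scrapyd on its default port


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build (but do not send) the POST request asking Scrapyd to run a spider."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    return urllib.request.Request(
        f"{base_url}/schedule.json", data=data, method="POST"
    )


def schedule(project, spider, base_url=SCRAPYD_URL):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(project, spider, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Placeholder project and spider names; nothing is sent over the network here.
request = build_schedule_request("myproject", "myspider")
print(request.get_method(), request.full_url, request.data)
```

With a Scrapyd instance running, `schedule("myproject", "myspider")` should return a dict with a `status` field and, on success, a `jobid` for the started task. The Scrapyd API package wraps these same endpoints in a Python client.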
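The other deployment option is a Docker cluster. Package the crawler as a Docker image, and any host with Docker installed can run it directly, with no need to worry about environment setup or version conflicts.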

Later posts will cover each of these parts in detail.

posted @ 2023-10-24 10:55 PursuitingPeak
