Installing pyspider
Operating system
CentOS Linux release 7.0.1406 (Core)
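Python environment
Installing Python
Install the dependencies:
yum install gcc # required to build Python
yum install zlib # the following four are required by setuptools; if they are installed after Python was built, Python must be rebuilt with make
yum install zlib-devel
yum install openssl
yum install openssl-devel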
cd Python-2.7.13
./configure --prefix=/python2.7
make
make install
Set the environment variable
# vi ~/.bash_profile
export PATH=/python2.7/bin:$PATH
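Because zlib-devel and openssl-devel have to be present before Python is built, it can help to confirm that the freshly built interpreter actually picked them up. The following is a minimal sketch, not part of the original procedure; the helper name missing_modules is hypothetical, and it assumes that the zlib and ssl standard-library modules are the ones backed by those packages.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means the interpreter was built with both libraries.
    print(missing_modules(["zlib", "ssl"]))
```

Run it with /python2.7/bin/python; if either name is printed, install the matching -devel package and rebuild Python.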
Installing pip
Dependency: setuptools
Dependencies: six-1.10.0.tar.gz packaging-16.8.tar.gz pyparsing-2.2.0.tar.gz appdirs-1.4.3.tar.gz
cd pip-9.0.1
# python setup.py install
Installing pyspider
Download the latest version of pyspider from GitHub.
Required system packages:
tcl protobuf libcurl-devel libxslt-devel libxml2
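Install them with yum install.
cd pyspider
# Install the dependencies, then pyspider itself
pip install -r requirements.txt
python setup.py install
The mysql-connector version listed in requirements.txt cannot be downloaded, so install a different version of mysql-connector instead: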
pip install mysql-connector==2.1.4
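An alternative to installing the connector separately is to pin it inside the requirements list before running pip. The sketch below only illustrates that idea; pin_requirement is a hypothetical helper, and the sample requirement lines are assumptions rather than the real contents of pyspider's requirements.txt.

```python
import re


def pin_requirement(lines, package, version):
    """Return requirement lines with `package` pinned to exactly `version`."""
    pinned = []
    for line in lines:
        stripped = line.strip()
        # The package name is whatever precedes the first version operator,
        # extras bracket, environment marker or space.
        name = re.split(r"[<>=!~\[; ]", stripped, maxsplit=1)[0]
        if name.lower() == package.lower():
            pinned.append("%s==%s" % (package, version))
        else:
            pinned.append(stripped)
    return pinned


if __name__ == "__main__":
    sample = ["Flask>=0.10", "mysql-connector>=1.2", "six"]  # assumed example lines
    print("\n".join(pin_requirement(sample, "mysql-connector", "2.1.4")))
```

The result can be written to a copy of requirements.txt and passed to pip install -r.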
Installing the MySQL database
After installing MySQL with yum, follow http://www.itnose.net/detail/6310643.html to complete the database setup.
# Restart mysql
service mysqld restart
# mysql -u root
# Change the root password
mysql> use mysql
mysql> update user set password=password('123456') where user='root';
# Create the databases and grant privileges
mysql> create database taskdb;
mysql> create database projectdb;
mysql> create database resultdb;
mysql> create user 'pyspider'@'%';
mysql> create user pyspider@'localhost' identified by 'pyspider-pass';
mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on taskdb.* to 'pyspider'@'%';
mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on projectdb.* to 'pyspider'@'%';
mysql> grant select,insert,update,references,delete,create,drop,alter,index,trigger,create view,show view,execute,alter routine,create routine,create temporary tables,lock tables,event on resultdb.* to 'pyspider'@'%';
mysql> flush privileges;
Edit the configuration file (in preparation for clustering)
vi /etc/my.cnf
bind-address = 0.0.0.0
# Restart the database
service mysqld restart
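The three grant statements differ only in the database name, so they can be generated rather than typed by hand. This is a small sketch under that observation; grant_statements is a hypothetical helper, and the privilege list is copied from the statements above.

```python
PRIVILEGES = (
    "select,insert,update,references,delete,create,drop,alter,index,"
    "trigger,create view,show view,execute,alter routine,create routine,"
    "create temporary tables,lock tables,event"
)


def grant_statements(databases, user, host):
    """Build one grant statement per database for user@host."""
    return [
        "grant %s on %s.* to '%s'@'%s';" % (PRIVILEGES, database, user, host)
        for database in databases
    ]


if __name__ == "__main__":
    for statement in grant_statements(["taskdb", "projectdb", "resultdb"], "pyspider", "%"):
        print(statement)
```

The printed statements can be pasted into the mysql prompt, followed by flush privileges.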
Installing redis
Download redis and extract it under /root/training.
Build and install redis:
cd /root/training/redis-2.8.12
make
make test
make install
# In preparation for clustering
cd /root/training/redis-3.2.8
cp redis.conf /etc/
vi /etc/redis.conf
bind 0.0.0.0
# Start redis
redis-server /etc/redis.conf &
Sign of a successful start: The server is now ready to accept connections on port 6379
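Besides watching for the startup message, a plain TCP connection attempt shows whether anything is listening on the redis port. The sketch below uses only the standard socket module; port_is_open is a hypothetical helper, and it confirms only that the port accepts connections, not that the listener is redis.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    connection.close()
    return True


if __name__ == "__main__":
    # 6379 is the port named in the redis startup message.
    print(port_is_open("127.0.0.1", 6379))
```

Running the same check from another machine against the server's address is a quick way to see the effect of the bind setting and of any firewall rules.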
Firewall
Check the firewall status:
firewall-cmd --state
Two rules I added myself (allow redis connections from localhost only, drop all others):
iptables -A INPUT -s 127.0.0.1 -p tcp --dport 6379 -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP
Turn off firewalld:
systemctl stop firewalld.service # stop firewalld
systemctl disable firewalld.service # do not start firewalld at boot
If you do not know how to configure the firewall, the simplest option is to stop it.
Installing phantomjs
Download: wget 'https://bbuseruploads.s3.amazonaws.com/fd96ed93-2b32-46a7-9d2b-ecbc0988516a/downloads/396e7977-71fd-4592-8723-495ca4cfa7cc/phantomjs-2.1.1-linux-x86_64.tar.bz2?Signature=guF7TAUW11qr9nZXcTBHu7dg1ds%3D&Expires=1488510600&AWSAccessKeyId=AKIAIVFPT2YJYYZY3H4A&versionId=null&response-content-disposition=attachment%3B%20filename%3D%22phantomjs-2.1.1-linux-x86_64.tar.bz2%22'
Download phantomjs-2.1.1-linux-x86_64.tar.bz2 into /root and extract it.
Copy the phantomjs file from the phantomjs/bin directory into /python2.7/bin.
Configuration file
====================================================================
The pyspider configuration file is as follows:
{
  "taskdb": "mysql+taskdb://pyspider:pyspider-pass@localhost:3306/taskdb",
  "projectdb": "mysql+projectdb://pyspider:pyspider-pass@localhost:3306/projectdb",
  "resultdb": "mysql+resultdb://pyspider:pyspider-pass@localhost:3306/resultdb",
  "message_queue": "redis://localhost:6379/db",
  "webui": {
    "port": 5555,
    "username": "pyspider",
    "password": "pyspider-pass",
    "need-auth": true
  }
}
=========================================
# For safety, create an ordinary user to hold the configuration file
useradd -md /pyspider pyspider
# Save the configuration file as
/pyspider/config.json
# Set the permissions
chown -R pyspider:pyspider /pyspider
chmod 400 /pyspider/config.json
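The configuration file holds database and web UI passwords, so it can also be created with mode 0400 from the start instead of being tightened afterwards. The following is a minimal sketch of that approach; write_private_json is a hypothetical helper and not part of pyspider, and the chown step above is still needed if the script runs as root.

```python
import json
import os


def write_private_json(path, data):
    """Write `data` as JSON to `path`, creating the file with mode 0400."""
    # O_EXCL makes the call fail instead of overwriting an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle, indent=2)


# Example call (the path and values are the ones used in this document):
# write_private_json("/pyspider/config.json", {"webui": {"port": 5555}})
```

Opening the file with os.open rather than open lets the permission bits be set at creation time, so the passwords are never readable by other users even briefly.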
Starting pyspider
Start pyspider:
/anaconda2/bin/pyspider -c /pyspider/config.json
The output looks like this:
# pyspider -c /pyspider/config.json
[W 170516 17:45:05 __init__:54] redis DB must zero-based numeric index, using 0 instead
[I 170516 17:45:05 result_worker:49] result_worker starting...
[W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:06 __init__:54] redis DB must zero-based numeric index, using 0 instead
[I 170516 17:45:06 processor:211] processor starting...
[W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
[I 170516 17:45:07 tornado_fetcher:638] fetcher starting...
[W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:07 __init__:54] redis DB must zero-based numeric index, using 0 instead
[I 170516 17:45:09 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 170516 17:45:09 scheduler:647] scheduler starting...
phantomjs fetcher running on port 25555
[I 170516 17:45:09 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
[W 170516 17:45:10 __init__:54] redis DB must zero-based numeric index, using 0 instead
[I 170516 17:45:10 app:76] webui running on 0.0.0.0:5555
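The repeated warning comes from the message_queue URL, whose last path segment is "db" rather than a number. The sketch below mimics the fallback that the warning text describes; it is an illustration and not pyspider's own code, and redis_db_index is a hypothetical name.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def redis_db_index(url):
    """Return the numeric DB index from a redis URL path, or 0 if it is not numeric."""
    path = urlparse(url).path.lstrip("/")
    try:
        return int(path)
    except ValueError:
        # Mirrors the logged behaviour: "using 0 instead".
        return 0


if __name__ == "__main__":
    print(redis_db_index("redis://localhost:6379/db"))
    print(redis_db_index("redis://localhost:6379/0"))
```

Both calls select database 0, which suggests that writing the URL as redis://localhost:6379/0 in config.json would give the same behaviour without the warnings.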
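/// This part still has problems at the moment
Installing supervisor to monitor all processes
supervisor monitors the pyspider processes and restarts them immediately if they stop. Download supervisor-3.3.1 into /root and extract it.
cd /root/supervisor-3.3.1
python setup.py install
or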
pip install supervisor
Create the default configuration file and edit it
# /python2.7/bin/echo_supervisord_conf > /python2.7/conf/supervisor.conf
; Sample supervisor config file.
;
; For more information on the config file, please see:
; http://supervisord.org/configuration.html
;
; Notes:
;  - Shell expansion ("~" or "$HOME") is not supported. Environment
;    variables can be expanded using this syntax: "%(ENV_HOME)s".
;  - Comments must have a leading space: "a=b ;comment" not "a=b;comment".

[unix_http_server]
file=/tmp/supervisor.sock     ; (the path to the socket file)
chmod=0700                    ; socket file mode (default 0700)
chown=root:root               ; socket file uid:gid owner
;username=user                ; (default is no username (open server))
;password=123                 ; (default is no password (open server))

[inet_http_server]            ; inet (TCP) server disabled by default
port=127.0.0.1:9001           ; (ip_address:port specifier, *:port for all iface)
username=supervisor           ; (default is no username (open server))
password=123                  ; (default is no password (open server))

[supervisord]
logfile=/tmp/supervisord.log  ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB         ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10            ; (num of main logfile rotation backups;default 10)
loglevel=info                 ; (log level;default info; others: debug,warn,trace)
pidfile=/tmp/supervisord.pid  ; (supervisord pidfile;default supervisord.pid)
nodaemon=false                ; (start in foreground if true;default false)
minfds=1024                   ; (min. avail startup file descriptors;default 1024)
minprocs=200                  ; (min. avail process descriptors;default 200)
;umask=022                    ; (process file creation umask;default 022)
;user=chrism                  ; (default is current user, required if root)
;identifier=supervisor        ; (supervisord identifier, default is 'supervisor')
;directory=/tmp               ; (default is not to cd during start)
;nocleanup=true               ; (don't clean up tempfiles at start;default false)
;childlogdir=/tmp             ; ('AUTO' child log dir, default $TEMP)
;environment=KEY="value"      ; (key value pairs to add to environment)
;strip_ansi=false             ; (strip ansi escape codes in logs; def. false)

; the below section must remain in the config file for RPC
; (supervisorctl/web interface) to work, additional interfaces may be
; added by defining them in separate rpcinterface: sections
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL for a unix socket
;serverurl=http://127.0.0.1:9001 ; use an http:// url to specify an inet socket
username=supervisor           ; should be same as http_username if set
password=123                  ; should be same as http_password if set
prompt=mysupervisor           ; cmd line prompt (default "supervisor")
history_file=~/.sc_history    ; use readline history if available

; The below sample program section shows all possible program subsection values,
; create one or more 'real' program: sections to be able to control them under
; supervisor.

;[program:theprogramname]
;command=/bin/cat              ; the program (relative uses PATH, can take args)
;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
;numprocs=1                    ; number of processes copies to start (def 1)
;directory=/tmp                ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
;priority=999                  ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)
;startretries=3                ; max # of serial start failures when starting (default 3)
;autorestart=unexpected        ; when to restart if exited after running (def: unexpected)
;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
;stopasgroup=false             ; send stop signal to the UNIX process group (default false)
;killasgroup=false             ; SIGKILL the UNIX process group (def false)
;user=chrism                   ; setuid to this UNIX account to run the program
;redirect_stderr=true          ; redirect proc stderr to stdout (default false)
;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A="1",B="2"       ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)

; The below sample eventlistener section shows all possible
; eventlistener subsection values, create one or more 'real'
; eventlistener: sections to be able to handle event notifications
; sent by supervisor.

;[eventlistener:theeventlistenername]
;command=/bin/eventlistener    ; the program (relative uses PATH, can take args)
;process_name=%(program_name)s ; process_name expr (default %(program_name)s)
;numprocs=1                    ; number of processes copies to start (def 1)
;events=EVENT                  ; event notif. types to subscribe to (req'd)
;buffer_size=10                ; event buffer queue size (default 10)
;directory=/tmp                ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
;priority=-1                   ; the relative start priority (default -1)
;autostart=true                ; start at supervisord start (default: true)
;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)
;startretries=3                ; max # of serial start failures when starting (default 3)
;autorestart=unexpected        ; autorestart if exited after running (def: unexpected)
;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
;stopasgroup=false             ; send stop signal to the UNIX process group (default false)
;killasgroup=false             ; SIGKILL the UNIX process group (def false)
;user=chrism                   ; setuid to this UNIX account to run the program
;redirect_stderr=false         ; redirect_stderr=true is not allowed for eventlisteners
;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A="1",B="2"       ; process environment additions
;serverurl=AUTO                ; override serverurl computation (childutils)

; The below sample group section shows all possible group values,
; create one or more 'real' group: sections to create "heterogeneous"
; process groups.

;[group:thegroupname]
;programs=progname1,progname2  ; each refers to 'x' in [program:x] definitions
;priority=999                  ; the relative start priority (default 999)

; The [include] section can just contain the "files" setting. This
; setting can list multiple files (separated by whitespace or
; newlines). It can also contain wildcards. The filenames are
; interpreted as relative to this file. Included files *cannot*
; include files themselves.

;[include]
;files = relative/directory/*.ini

[group:pyspider]
programs=pyspider-fetcher,pyspider-processor

[program:pyspider-fetcher]
command=/python2.7/bin/pyspider -c /pyspider/config.json fetcher
autorestart=true
autostart=true
user=root
group=pyspider
stopasgroup=true

[program:pyspider-processor]
command=/python2.7/bin/pyspider -c /pyspider/config.json processor
autorestart=true
autostart=true
user=root
group=pyspider
stopasgroup=true
stderr_logfile=/var/Spider/Log/Process/spider_process_err.log
stdout_logfile=/var/Spider/Log/Process/spider_process_out.log
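The two pyspider program sections follow the same pattern, so further components can be added by generating the sections from a template. This is only a sketch: program_sections is a hypothetical helper, the default paths are the ones used in this document, and only a subset of the options shown above is emitted.

```python
PROGRAM_TEMPLATE = """[program:pyspider-{component}]
command={pyspider} -c {config} {component}
autorestart=true
autostart=true
user=root
stopasgroup=true
"""


def program_sections(components,
                     pyspider="/python2.7/bin/pyspider",
                     config="/pyspider/config.json"):
    """Return supervisor [program:...] sections, one per pyspider component."""
    return "\n".join(
        PROGRAM_TEMPLATE.format(component=name, pyspider=pyspider, config=config)
        for name in components
    )


if __name__ == "__main__":
    print(program_sections(["fetcher", "processor"]))
```

The output can be appended to the supervisor configuration file; the programs= line of [group:pyspider] then needs the matching pyspider-<component> names.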
Starting supervisor
# supervisord -c /etc/supervisor.conf
Note: after config.json is modified, supervisor must be reloaded
# supervisorctl reload
At this point pyspider is fully installed.
Logging in to pyspider
http://ip:5555/
Troubleshooting:
ImportError: pycurl: libcurl link-time ssl backend (nss) is different from compile-time ssl backend (none/other)
# pip uninstall pycurl
# export PYCURL_SSL_LIBRARY=nss
# pip install pycurl
ImportError: No module named _sqlite3
# find / -name _sqlite*.so
/usr/lib64/python2.7/lib-dynload/_sqlite3.so
/usr/lib64/python2.7/site-packages/_sqlitecache.so
# cp /usr/lib64/python2.7/lib-dynload/_sqlite3.so /python2.7/lib/python2.7/lib-dynload/
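The find step above can also be done from Python, which is convenient when searching only under a specific prefix instead of the whole filesystem. The sketch below is an assumed equivalent of find <root> -name '<pattern>' built on os.walk and fnmatch; find_files is a hypothetical helper.

```python
import fnmatch
import os


def find_files(root, pattern):
    """Return sorted paths under `root` whose file name matches the shell-style `pattern`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # /usr/lib64 is where the system copy was found in the output above.
    for path in find_files("/usr/lib64", "_sqlite*.so"):
        print(path)
```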