[Keras] Install and environment setting
Documentation: https://keras.io/
Old notes
1. Using Anaconda to manage Python packages is a wise choice.
conda update conda
conda update anaconda
conda update --all
conda install mingw libpython
pip install --upgrade --no-deps theano
pip install keras
2. Test Theano
Run in Python:
import theano
theano.test()
If import theano fails, the installation did not succeed. When I ran theano.test() I hit the following error:
ERROR: Failure: ImportError (No module named nose_parameterized)
Installing nose_parameterized fixes it; run in cmd:
pip install nose_parameterized
The full test run is long and seems to go on forever.
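If you don't want to wait for the whole suite, a quicker sanity check (my own minimal snippet, not from the Theano docs) is to compile and evaluate a trivial symbolic function:

import theano
import theano.tensor as T

x = T.dscalar('x')               # symbolic scalar input
f = theano.function([x], 2 * x)  # compile a trivial graph
print(f(3.0))                    # prints 6.0 if Theano is working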
The default backend is TensorFlow:
un@un-UX303UB$ cat ~/.keras/keras.json
{
    "image_dim_ordering": "tf",
    "backend": "tensorflow",
    "epsilon": 1e-07,
    "floatx": "float32"
}
Switch to Theano by editing the file:
un@un-UX303UB$ vim ~/.keras/keras.json
un@un-UX303UB$ cat ~/.keras/keras.json
{
    "image_dim_ordering": "th",
    "backend": "theano",
    "epsilon": 1e-07,
    "floatx": "float32"
}
You can verify the configuration in code: [Keras] Develop Neural Network With Keras Step-By-Step
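A quick programmatic check of which backend Keras actually loaded (using the standard keras.backend API; image_dim_ordering is the Keras 1.x name):

from keras import backend as K

print(K.backend())              # 'theano' or 'tensorflow'
print(K.image_dim_ordering())   # 'th' or 'tf'
print(K.epsilon(), K.floatx())  # should match ~/.keras/keras.json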
Parallel Keras training
-
Docker + Spark + Keras
Ref: http://blog.csdn.net/cyh_24/article/details/49683221
-
Other parallelization options
[Link] TFSEQ Part I: A comparison of distributed training schemes and their efficiency
-
Advanced parallelism: Horovod
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Go to: eng.uber.com/horovod/
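For reference, a Horovod + Keras training script typically follows the pattern below (a sketch based on Horovod's documented horovod.keras API; the model and data here are dummy placeholders, and details vary by version). It is launched with something like horovodrun -np 4 python train.py, or with mpirun:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import horovod.keras as hvd

hvd.init()                                        # one process per GPU/worker

# Dummy data so the sketch is self-contained.
x_train = np.random.rand(1024, 20).astype('float32')
y_train = np.random.randint(10, size=(1024,))

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(10, activation='softmax'))

opt = keras.optimizers.Adam(0.001 * hvd.size())   # scale the learning rate by worker count
opt = hvd.DistributedOptimizer(opt)               # wrap the optimizer so gradients are allreduced

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # rank 0 broadcasts initial weights
model.fit(x_train, y_train, batch_size=64, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)    # only rank 0 prints progress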
Phase 1:
1. Install Docker from the Software Center.
2. Sequenceiq provides a Docker image with Spark pre-installed.
-
- Pull the image: docker pull sequenceiq/spark:1.5.1
- Run it: sudo docker run -it sequenceiq/spark:1.5.1 bash
bash-4.1# cd /usr/local/spark
bash-4.1# cp conf/spark-env.sh.template conf/spark-env.sh
bash-4.1# vi conf/spark-env.sh
Append at the end:
export SPARK_LOCAL_IP=<your IP address>
export SPARK_MASTER_IP=<your IP address>
-
- Start the master: bash-4.1# ./sbin/start-master.sh
- Start a slave: bash-4.1# ./sbin/start-slave.sh spark://localhost:7077
Submit an application to test the setup:
bash-4.1# ./bin/spark-submit examples/src/main/python/pi.py
16/12/30 19:29:02 INFO spark.SparkContext: Running Spark version 1.5.1
16/12/30 19:29:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/30 19:29:04 INFO util.Utils: Successfully started service 'sparkDriver' on port 34787.
16/12/30 19:29:04 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/12/30 19:29:04 INFO ui.SparkUI: Started SparkUI at http://127.0.0.1:4040
16/12/30 19:29:04 INFO executor.Executor: Starting executor ID driver on host localhost
...
16/12/30 19:29:05 INFO spark.SparkContext: Starting job: reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39
16/12/30 19:29:05 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/12/30 19:29:05 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (365 KB). The maximum recommended task size is 100 KB.
...
16/12/30 19:29:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/30 19:29:06 INFO scheduler.DAGScheduler: Job 0 finished: reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39, took 0.668884 s
Pi is roughly 3.146120
...
16/12/30 19:29:06 INFO ui.SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
16/12/30 19:29:06 INFO spark.SparkContext: Successfully stopped SparkContext
16/12/30 19:29:07 INFO util.ShutdownHookManager: Shutdown hook called
Congratulations, you've just run a Spark application!
3. Install elephas:
unsw@unsw-UX303UB$ pip install elephas
Collecting elephas
  Downloading elephas-0.3.tar.gz
Requirement already satisfied: keras in /usr/local/anaconda3/lib/python3.5/site-packages (from elephas)
Collecting hyperas (from elephas)
Collecting hyperopt (from hyperas->elephas)
Collecting entrypoints (from hyperas->elephas)
Collecting pymongo (from hyperopt->hyperas->elephas)
Collecting future (from hyperopt->hyperas->elephas)
...
Successfully built elephas hyperas hyperopt future
Installing collected packages: pymongo, future, hyperopt, entrypoints, hyperas, elephas
Successfully installed elephas-0.3 entrypoints-0.2.2 future-0.16.0 hyperas-0.3 hyperopt-0.1 pymongo-3.4.0
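After installation, training a Keras model on Spark with elephas looks roughly like the sketch below. This is written against the elephas 0.3-era API (SparkModel / to_simple_rdd); the constructor and training method have changed in later releases, and the model and data here are dummy placeholders:

import numpy as np
from pyspark import SparkConf, SparkContext
from keras.models import Sequential
from keras.layers import Dense
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

# Local Spark context using all available cores.
conf = SparkConf().setAppName('elephas_test').setMaster('local[*]')
sc = SparkContext(conf=conf)

# Dummy data so the sketch is self-contained.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

rdd = to_simple_rdd(sc, x_train, y_train)        # distribute the training data
spark_model = SparkModel(sc, model, frequency='epoch', mode='asynchronous', num_workers=4)
spark_model.train(rdd, nb_epoch=5, batch_size=32, verbose=0, validation_split=0.1)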
Phase 2:
√ If your machine has multiple CPUs (say 24):
You can start just one Docker container and simply combine Spark with elephas to train a CNN in parallel across the 24 cores (roughly as in the elephas sketch above).
√ If your machine has multiple GPUs (say 4):
You can start 4 Docker containers and edit ~/.theanorc inside each one to pin it to a specific GPU, giving 4-way GPU parallelism. (CUDA must be installed yourself.) A sample configuration is shown below.
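For example, a per-container ~/.theanorc might look like this (old-style Theano device naming; newer Theano versions use cuda0, cuda1, ... instead of gpu0):

[global]
device = gpu0        # use gpu1, gpu2, gpu3 in the other three containers
floatx = float32

[nvcc]
fastmath = True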
End.