[Keras] Install and environment setting

Documentation: https://keras.io/

Old notes

1. Using Anaconda to manage Python packages is a wise choice.

conda update conda
conda update anaconda
conda update --all
conda install mingw libpython
pip install --upgrade --no-deps theano
pip install keras

 

2. Test Theano

Run in Python:

import theano
theano.test()

If import theano fails, the Theano installation was unsuccessful. When running theano.test() I hit the following error:

ERROR: Failure: ImportError (No module named nose_parameterized)

Installing nose_parameterized fixes it; run in cmd:

pip install nose_parameterized

The test run is long and seems to go on forever.
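Since theano.test() takes so long, a quicker sanity check (a minimal sketch, not from the original notes, assuming import theano already works) is to compile and evaluate a trivial function:

# Quick sanity check: compile and run a trivial Theano function.
import theano
import theano.tensor as T

x = T.dscalar('x')
f = theano.function([x], x ** 2)   # forces the compilation pipeline to run
assert f(3.0) == 9.0
print("Theano OK, version:", theano.__version__)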

 

The default backend is TensorFlow:

un@un-UX303UB$ cat ~/.keras/keras.json 
{
  "image_dim_ordering": "tf",
  "backend": "tensorflow",
  "epsilon": 1e-07,
  "floatx": "float32"
}
To switch to the Theano backend, edit ~/.keras/keras.json:

un@un-UX303UB$ vim ~/.keras/keras.json
un@un-UX303UB$ cat ~/.keras/keras.json
{
  "image_dim_ordering": "th",
  "backend": "theano",
  "epsilon": 1e-07,
  "floatx": "float32"
}

 

The configuration can be tested with code: see [Keras] Develop Neural Network With Keras Step-By-Step.
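A minimal check from Python (assuming a Keras 1.x-era install, matching the image_dim_ordering key shown above) to confirm which backend is active:

# Print the active Keras backend and image dimension ordering.
from keras import backend as K

print(K.backend())             # "tensorflow" or "theano"
print(K.image_dim_ordering())  # "tf" or "th"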

 

 

Parallel Keras training

  • Docker + Spark + Keras

Ref: http://blog.csdn.net/cyh_24/article/details/49683221

 

  • Other parallelization approaches

[Experience sharing] How to do multi-host distributed training with Keras

Keras developer docs 11: Multi-GPU and distributed training

[Link] TFSEQ Part I: Distributed training approaches and an efficiency comparison

 

  • Advanced parallelism

[Deep Learning] Introduction to distributed modes (1)

[Deep Learning] Introduction to distributed TensorFlow (2)

[Deep Learning] Introduction to distributed PyTorch 1.0 (3)

[Deep Learning] Introduction to distributed Horovod (4)

Ref: horovod/horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
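For reference, a minimal sketch of what Horovod-style data-parallel training looks like with tf.keras (not from the original post; it assumes horovod and a recent TensorFlow are installed, and x_train / y_train stand in for your own data):

# Minimal Horovod + tf.keras sketch.
# Launch one process per worker, e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="binary_crossentropy", optimizer=opt)

# Broadcast the initial weights so every worker starts from the same state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(x_train, y_train, batch_size=32, epochs=5, callbacks=callbacks)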

 

 

Phase 1:

1. Install Docker from the Software Center.

2. Sequenceiq provides a Docker image with Spark pre-installed:

    • Pull the image: docker pull sequenceiq/spark:1.5.1
    • Start the container: sudo docker run -it sequenceiq/spark:1.5.1 bash
bash-4.1# cd /usr/local/spark 
bash-4.1# cp conf/spark-env.sh.template conf/spark-env.sh 
bash-4.1# vi conf/spark-env.sh

Append at the end:
export SPARK_LOCAL_IP=<your IP address>
export SPARK_MASTER_IP=<your IP address>

 

    • Start the master: bash-4.1# ./sbin/start-master.sh
    • Start a worker: bash-4.1# ./sbin/start-slave.sh spark://localhost:7077

 

Submit an application to test the setup:

bash-4.1# ./bin/spark-submit examples/src/main/python/pi.py
16/12/30 19:29:02 INFO spark.SparkContext: Running Spark version 1.5.1
16/12/30 19:29:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/30 19:29:03 INFO spark.SecurityManager: Changing view acls to: root
16/12/30 19:29:03 INFO spark.SecurityManager: Changing modify acls to: root
16/12/30 19:29:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/12/30 19:29:03 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/12/30 19:29:03 INFO Remoting: Starting remoting
16/12/30 19:29:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@127.0.0.1:34787]
16/12/30 19:29:04 INFO util.Utils: Successfully started service 'sparkDriver' on port 34787.
16/12/30 19:29:04 INFO spark.SparkEnv: Registering MapOutputTracker
16/12/30 19:29:04 INFO spark.SparkEnv: Registering BlockManagerMaster
16/12/30 19:29:04 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-7e33d1ab-0b51-4f73-82c0-49d97c3f3c0d
16/12/30 19:29:04 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
16/12/30 19:29:04 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-fa0f4397-bef6-4261-b167-005113a0b5ae/httpd-b150abe5-c149-4aa9-81fc-6a365f389cf4
16/12/30 19:29:04 INFO spark.HttpServer: Starting HTTP Server
16/12/30 19:29:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/12/30 19:29:04 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41503
16/12/30 19:29:04 INFO util.Utils: Successfully started service 'HTTP file server' on port 41503.
16/12/30 19:29:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/12/30 19:29:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/12/30 19:29:04 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/12/30 19:29:04 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/12/30 19:29:04 INFO ui.SparkUI: Started SparkUI at http://127.0.0.1:4040
16/12/30 19:29:04 INFO util.Utils: Copying /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py to /tmp/spark-fa0f4397-bef6-4261-b167-005113a0b5ae/userFiles-3cf70968-52fd-49e9-b35d-5eb5f029ec7a/pi.py
16/12/30 19:29:04 INFO spark.SparkContext: Added file file:/usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py at file:/usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py with timestamp 1483144144747
16/12/30 19:29:04 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/12/30 19:29:04 INFO executor.Executor: Starting executor ID driver on host localhost
16/12/30 19:29:05 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44598.
16/12/30 19:29:05 INFO netty.NettyBlockTransferService: Server created on 44598
16/12/30 19:29:05 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/12/30 19:29:05 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:44598 with 530.3 MB RAM, BlockManagerId(driver, localhost, 44598)
16/12/30 19:29:05 INFO storage.BlockManagerMaster: Registered BlockManager
16/12/30 19:29:05 INFO spark.SparkContext: Starting job: reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Got job 0 (reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39) with 2 output partitions
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39)
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Missing parents: List()
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39), which has no missing parents
16/12/30 19:29:05 INFO storage.MemoryStore: ensureFreeSpace(4136) called with curMem=0, maxMem=556038881
16/12/30 19:29:05 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KB, free 530.3 MB)
16/12/30 19:29:05 INFO storage.MemoryStore: ensureFreeSpace(2760) called with curMem=4136, maxMem=556038881
16/12/30 19:29:05 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 530.3 MB)
16/12/30 19:29:05 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:44598 (size: 2.7 KB, free: 530.3 MB)
16/12/30 19:29:05 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
16/12/30 19:29:05 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39)
16/12/30 19:29:05 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/12/30 19:29:05 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (365 KB). The maximum recommended task size is 100 KB.
16/12/30 19:29:05 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 374548 bytes)
16/12/30 19:29:06 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 502351 bytes)
16/12/30 19:29:06 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
16/12/30 19:29:06 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/12/30 19:29:06 INFO executor.Executor: Fetching file:/usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py with timestamp 1483144144747
16/12/30 19:29:06 INFO util.Utils: /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py has been previously copied to /tmp/spark-fa0f4397-bef6-4261-b167-005113a0b5ae/userFiles-3cf70968-52fd-49e9-b35d-5eb5f029ec7a/pi.py
16/12/30 19:29:06 INFO python.PythonRunner: Times: total = 306, boot = 164, init = 7, finish = 135
16/12/30 19:29:06 INFO python.PythonRunner: Times: total = 309, boot = 162, init = 11, finish = 136
16/12/30 19:29:06 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 998 bytes result sent to driver
16/12/30 19:29:06 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 998 bytes result sent to driver
16/12/30 19:29:06 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 419 ms on localhost (1/2)
16/12/30 19:29:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 450 ms on localhost (2/2)
16/12/30 19:29:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/12/30 19:29:06 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39) finished in 0.464 s
16/12/30 19:29:06 INFO scheduler.DAGScheduler: Job 0 finished: reduce at /usr/local/spark-1.5.1-bin-hadoop2.6/examples/src/main/python/pi.py:39, took 0.668884 s
Pi is roughly 3.146120
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/12/30 19:29:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/12/30 19:29:06 INFO ui.SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
16/12/30 19:29:06 INFO scheduler.DAGScheduler: Stopping DAGScheduler
16/12/30 19:29:06 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/12/30 19:29:06 INFO storage.MemoryStore: MemoryStore cleared
16/12/30 19:29:06 INFO storage.BlockManager: BlockManager stopped
16/12/30 19:29:06 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/12/30 19:29:06 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/12/30 19:29:06 INFO spark.SparkContext: Successfully stopped SparkContext
16/12/30 19:29:06 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/12/30 19:29:06 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/12/30 19:29:06 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/12/30 19:29:07 INFO util.ShutdownHookManager: Shutdown hook called
16/12/30 19:29:07 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fa0f4397-bef6-4261-b167-005113a0b5ae

Congratulations, you have run a Spark application!

 

3. Install elephas:

unsw@unsw-UX303UB$ pythonpip install elephas
pythonpip: command not found
unsw@unsw-UX303UB$ pip install elephas
Collecting elephas
  Downloading elephas-0.3.tar.gz
Requirement already satisfied: keras in /usr/local/anaconda3/lib/python3.5/site-packages (from elephas)
Collecting hyperas (from elephas)
  Downloading hyperas-0.3.tar.gz
Requirement already satisfied: pyyaml in /usr/local/anaconda3/lib/python3.5/site-packages (from keras->elephas)
Requirement already satisfied: theano in /usr/local/anaconda3/lib/python3.5/site-packages (from keras->elephas)
Requirement already satisfied: six in /usr/local/anaconda3/lib/python3.5/site-packages (from keras->elephas)
Collecting hyperopt (from hyperas->elephas)
  Downloading hyperopt-0.1.tar.gz (98kB)
    100% |████████████████████████████████| 102kB 1.7MB/s 
Collecting entrypoints (from hyperas->elephas)
  Downloading entrypoints-0.2.2-py2.py3-none-any.whl
Requirement already satisfied: jupyter in /usr/local/anaconda3/lib/python3.5/site-packages (from hyperas->elephas)
Requirement already satisfied: nbformat in /usr/local/anaconda3/lib/python3.5/site-packages (from hyperas->elephas)
Requirement already satisfied: nbconvert in /usr/local/anaconda3/lib/python3.5/site-packages (from hyperas->elephas)
Requirement already satisfied: numpy>=1.7.1 in /usr/local/anaconda3/lib/python3.5/site-packages (from theano->keras->elephas)
Requirement already satisfied: scipy>=0.11 in /usr/local/anaconda3/lib/python3.5/site-packages (from theano->keras->elephas)
Requirement already satisfied: nose in /usr/local/anaconda3/lib/python3.5/site-packages (from hyperopt->hyperas->elephas)
Collecting pymongo (from hyperopt->hyperas->elephas)
  Downloading pymongo-3.4.0-cp35-cp35m-manylinux1_x86_64.whl (359kB)
    100% |████████████████████████████████| 368kB 1.5MB/s 
Requirement already satisfied: networkx in /usr/local/anaconda3/lib/python3.5/site-packages (from hyperopt->hyperas->elephas)
Collecting future (from hyperopt->hyperas->elephas)
  Downloading future-0.16.0.tar.gz (824kB)
    100% |████████████████████████████████| 829kB 1.5MB/s 
Requirement already satisfied: decorator>=3.4.0 in /usr/local/anaconda3/lib/python3.5/site-packages (from networkx->hyperopt->hyperas->elephas)
Building wheels for collected packages: elephas, hyperas, hyperopt, future
  Running setup.py bdist_wheel for elephas ... done
  Stored in directory: /home/unsw/.cache/pip/wheels/b6/fe/74/8e079673e5048a583b547a0dc5d83a7fea883933472da1cefb
  Running setup.py bdist_wheel for hyperas ... done
  Stored in directory: /home/unsw/.cache/pip/wheels/85/7d/da/b417ee5e31b62d51c75afa6eb2ada9ccf8b7aff2de71d82c1b
  Running setup.py bdist_wheel for hyperopt ... done
  Stored in directory: /home/unsw/.cache/pip/wheels/4b/0f/9d/1166e48523d3bf7478800f250b0fceae31ac6a08b8a7cca820
  Running setup.py bdist_wheel for future ... done
  Stored in directory: /home/unsw/.cache/pip/wheels/c2/50/7c/0d83b4baac4f63ff7a765bd16390d2ab43c93587fac9d6017a
Successfully built elephas hyperas hyperopt future
Installing collected packages: pymongo, future, hyperopt, entrypoints, hyperas, elephas
Successfully installed elephas-0.3 entrypoints-0.2.2 future-0.16.0 hyperas-0.3 hyperopt-0.1 pymongo-3.4.0

 

 

Phase 2:


If your machine has multiple CPUs (say, 24):

You can start just one Docker container and simply use Spark together with elephas to train the CNN in parallel across the 24 CPUs (see the elephas sketch below).

If your machine has multiple GPUs (say, 4):

You can start 4 Docker containers and edit ~/.theanorc inside each one to pin it to a specific GPU, so training runs in parallel across the 4 GPUs. (CUDA must be installed separately; see the .theanorc sketch below.)
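As a rough sketch of the elephas workflow (not from the original notes; the exact API differs between elephas releases, so the 0.3 version installed above may instead expose SparkModel(sc, model, ...) and .train(...)):

from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

# Assumptions: sc is a running SparkContext, model is an already-compiled
# Keras model, and x_train / y_train are NumPy arrays.
rdd = to_simple_rdd(sc, x_train, y_train)            # distribute the training data
spark_model = SparkModel(model, frequency='epoch',   # sync weights once per epoch
                         mode='asynchronous')
spark_model.fit(rdd, epochs=10, batch_size=32, verbose=0, validation_split=0.1)

For the multi-GPU case, each container's ~/.theanorc could look roughly like this (old Theano device naming; newer Theano versions use device=cuda0 instead):

[global]
device = gpu0        # use gpu0 ... gpu3, one per container
floatx = float32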


End.