Detailed tutorial: configuring a deep learning (tensorflow, keras) environment on a linux server for kaggle competitions

This post was first published on my personal blog at https://kezunlin.me/post/6b505d27/; the latest version is kept there.

A full guide to installing and configuring deep learning environments on a linux server.

Quick Guide

prepare

tools

MobaXterm (for windows)
ssh + vscode

for windows:
drag and drop files onto MobaXterm to upload them to the server
pack many files into a single zip archive before uploading

commands

view disk

du -d 1 -h
df -h
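
The same disk check can be done from inside a training script, for example before writing large checkpoints. A minimal sketch using only the standard library; human_size and disk_report are hypothetical helper names, not part of any tool above.

```python
import shutil


def human_size(num_bytes):
    """Format a byte count with binary units, roughly like `df -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        # Stop at the first unit that fits, or at the largest unit we know.
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


def disk_report(path="."):
    """Return total/used/free space of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return {
        "total": human_size(usage.total),
        "used": human_size(usage.used),
        "free": human_size(usage.free),
    }
```

Calling disk_report("/data_1") would report on whichever filesystem holds that path.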

gpu and cpu usage

watch -n 1 nvidia-smi
top

view files and count

wc -l data.csv

# count how many folders
ls -lR | grep '^d' | wc -l
17

# count how many jpg files (escape the dot and anchor at end of line)
ls -lR | grep '\.jpg$' | wc -l
1360

# view 10 images 
ls train | head
ls test | head
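
The counting commands above can also be written in Python, which avoids false matches from grep patterns and handles upper-case extensions. This is only a sketch; count_by_suffix and count_dirs are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count regular files under root (recursively), grouped by lowercase suffix."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


def count_dirs(root):
    """Count directories under root (recursively), excluding root itself."""
    return sum(1 for path in Path(root).rglob("*") if path.is_dir())
```

For example, count_by_suffix("train")[".jpg"] gives the number of jpg images below train.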

link datasets

# link 
ln -s src dest
ln -s /data_1/kezunlin/datasets/ dl4cv/datasets

scp

scp -r node17:~/dl4cv  ~/git/
scp -r node17:~/.keras ~/

tmux for background tasks

tmux new -s notebook
tmux ls 
tmux attach -t notebook
tmux detach

wget download

# wget 
# continue download
wget -c url 

# background download for large file
wget -b -c url
tail -f wget-log

# kill background wget
pkill -9 wget

tips about training large model

terminal 1:

tmux new -s train
conda activate keras

time python train_alexnet.py

terminal 2:

tmux detach

tmux attach -t train

After detaching, it is safe to close vscode. If training is started in a plain vscode terminal instead of tmux, the training process exits when vscode is closed.

cuda driver and toolkits

see cuda-toolkit for the required cuda driver version

The usable cudatoolkit version depends on the installed cuda driver version.
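
As a rough illustration of that dependency, the check below compares a driver version against the minimum Linux driver each toolkit needs. The numbers in MIN_DRIVER are quoted from memory of NVIDIA's CUDA release notes and should be verified against the official table; MIN_DRIVER, version_tuple and driver_supports are hypothetical names.

```python
# Minimum Linux driver per CUDA toolkit. ASSUMPTION: values recalled from
# NVIDIA's release notes; verify them against the official table.
MIN_DRIVER = {
    "9.0": "384.81",
    "9.2": "396.26",
    "10.0": "410.48",
    "10.1": "418.39",
}


def version_tuple(text):
    """Turn '418.87.00' into (418, 87, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_version):
    """Return True if the driver is new enough for the given toolkit."""
    required = MIN_DRIVER.get(cuda_version)
    if required is None:
        raise KeyError(f"no entry for CUDA {cuda_version}")
    return version_tuple(driver_version) >= version_tuple(required)
```

With the driver used in this post, driver_supports("418.87.00", "10.1") is true, while a 384 driver would not be enough for cuda 10.0.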

install nvidia-drivers

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

sudo apt-cache search nvidia-*
# nvidia-384
# nvidia-396
sudo apt-get -y install nvidia-418

# test 
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

If nvidia-smi reports this version mismatch, reboot and test again. See
https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch

install cuda-toolkit (drivers)

remove all previous nvidia drivers

sudo apt-get -y purge 'nvidia-*'

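download cuda_10.1 from the NVIDIA download site

# wait for the download to finish before running the installer
wget -c http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sudo sh cuda_10.1.243_418.87.00_linux.run

# or make the installer executable and run it directly
chmod +x cuda_10.1.243_418.87.00_linux.run
sudo ./cuda_10.1.243_418.87.00_linux.run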

vim ~/.bashrc
# for cuda and cudnn
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

check cuda driver version

> cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.87.00  Thu Aug  8 15:35:46 CDT 2019
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11) 


>nvidia-smi
Tue Aug 27 17:36:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+


> nvidia-smi -L
GPU 0: Quadro RTX 8000 (UUID: GPU-acb01c1b-776d-cafb-ea35-430b3580d123)
GPU 1: Quadro RTX 8000 (UUID: GPU-df7f0fb8-1541-c9ce-e0f8-e92bccabf0ef)
GPU 2: Quadro RTX 8000 (UUID: GPU-67024023-20fd-a522-dcda-261063332731)
GPU 3: Quadro RTX 8000 (UUID: GPU-7f9d6a27-01ec-4ae5-0370-f0c356327913)

> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
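
The outputs above can be parsed from a script, for example to log the driver and toolkit versions at the start of a training run. A small sketch based only on the output formats shown in this post; other driver releases may print slightly different text, and both function names are hypothetical.

```python
import re


def parse_driver_version(proc_text):
    """Extract the driver version from the text of /proc/driver/nvidia/version."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def parse_nvcc_release(nvcc_text):
    """Extract the toolkit release (e.g. '10.1') from the output of `nvcc -V`."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_text)
    return match.group(1) if match else None
```

Both helpers return None when the expected pattern is missing, so the caller can decide how to react.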

install conda

./Anaconda3-2019.03-Linux-x86_64.sh 
[yes]
[yes]

config channels

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/

conda config --set show_channel_urls yes

install libraries

conclusions:

py37/keras: conda install -y tensorflow-gpu keras==2.2.5
py37/torch: conda install -y pytorch torchvision
py36/mxnet: conda install -y mxnet

keras 2.2.5 was released on 2019/8/23.
Add new Applications: ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2.

common libraries

conda install -y scikit-learn scikit-image pandas matplotlib pillow opencv seaborn
pip install imutils progressbar pydot pylint

Install imutils with pip rather than conda to avoid downgrading tensorflow-gpu.

py37

cudatoolkit               10.0.130                  0    
cudnn                     7.6.0                cuda10.0_0    
tensorflow-gpu            1.13.1

py36

cudatoolkit        anaconda/pkgs/main/linux-64::cudatoolkit-10.1.168-0
cudnn              anaconda/pkgs/main/linux-64::cudnn-7.6.0-cuda10.1_0
tensorboard        anaconda/pkgs/main/linux-64::tensorboard-1.14.0-py36hf484d3e_0
tensorflow         anaconda/pkgs/main/linux-64::tensorflow-1.14.0-gpu_py36h3fb9ad6_0
tensorflow-base    anaconda/pkgs/main/linux-64::tensorflow-base-1.14.0-gpu_py36he45bfe2_0
tensorflow-estima~ anaconda/cloud/conda-forge/linux-64::tensorflow-estimator-1.14.0-py36h5ca1d4c_0
tensorflow-gpu     anaconda/pkgs/main/linux-64::tensorflow-gpu-1.14.0-h0d30ee6_0

The conda build of imutils only supports py36 and py37 (among python 3 versions).
The conda build of mxnet only supports py35 and py36.

details

# remove py35
conda remove -n py35 --all

conda info --envs

conda create -n py37 python=3.7
conda activate py37

# common libraries
conda install -y scikit-learn pandas pillow opencv
pip install imutils

# imutils
conda search imutils  
# py36 and py37

# Name                       Version           Build  Channel             
imutils                        0.5.2          py27_0  anaconda/cloud/conda-forge
imutils                        0.5.2          py36_0  anaconda/cloud/conda-forge
imutils                        0.5.2          py37_0  anaconda/cloud/conda-forge

# tensorflow-gpu and keras
conda install -y tensorflow-gpu keras

# install pytorch
conda install -y pytorch torchvision

# install mxnet
# method 1: pip
pip search mxnet
mxnet-cu80[mkl]/mxnet-cu90[mkl]/mxnet-cu91[mkl]/mxnet-cu92[mkl]/mxnet-cu100[mkl]/mxnet-cu101[mkl]

# method 2: conda
conda install mxnet
# py35 and py36

TensorFlow Object Detection API

home page: tensorflow/models on GitHub (research/object_detection)

download tensorflow models and rename models-master to tfmodels

vim ~/.bashrc

export PYTHONPATH=/home/kezunlin/dl4cv:/data_1/kezunlin/tfmodels/research:$PYTHONPATH

source ~/.bashrc

jupyter notebook

conda activate py37
conda install -y jupyter

install kernels

python -m ipykernel install --user --name=py37
Installed kernelspec py37 in /home/kezunlin/.local/share/jupyter/kernels/py37

config for server

python -c "import IPython;print(IPython.lib.passwd())"
Enter password: 
Verify password: 
sha1:ef2fb2aacff2:4ea2998699638e58d10d594664bd87f9c3381c04

jupyter notebook --generate-config
Writing default config to: /home/kezunlin/.jupyter/jupyter_notebook_config.py

vim ~/.jupyter/jupyter_notebook_config.py

c.NotebookApp.ip = '*'  
c.NotebookApp.password = u'sha1:xxx:xxx' 
c.NotebookApp.open_browser = False 
c.NotebookApp.port = 8888 
c.NotebookApp.enable_mathjax = True
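
For reference, the sha1:salt:hash string above appears to be the hex digest of the passphrase followed by the salt. The sketch below reproduces that scheme with the standard library only. It is my understanding of the classic notebook format, not the notebook's own code, and newer notebook versions may use a different algorithm; hash_password and verify_password are hypothetical names.

```python
import hashlib
import hmac
import secrets


def hash_password(passphrase, algorithm="sha1"):
    """Build an 'algorithm:salt:digest' string (ASSUMED classic notebook format)."""
    salt = secrets.token_hex(6)  # 12 hex characters, like the sample above
    data = passphrase.encode("utf-8") + salt.encode("ascii")
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{salt}:{digest}"


def verify_password(hashed, passphrase):
    """Check a passphrase against an 'algorithm:salt:digest' string."""
    parts = hashed.split(":")
    if len(parts) != 3:
        return False
    algorithm, salt, digest = parts
    try:
        data = passphrase.encode("utf-8") + salt.encode("ascii")
        candidate = hashlib.new(algorithm, data).hexdigest()
    except ValueError:
        # Unknown hash algorithm name.
        return False
    return hmac.compare_digest(candidate, digest)
```

In practice the string generated by IPython.lib.passwd() should still be the one pasted into the config file; this sketch only explains what the three fields are.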

run jupyter in the background

tmux new -s notebook
jupyter notebook
# ctrl+b then d: detach from the session and keep it running
# ctrl+d: exit the shell and close the session

Open http://<server-ip>:8888 in a browser and enter the password.

test

py37

import cv2
cv2.__version__
import tensorflow as tf
import keras 
import torch
import torchvision
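
Instead of typing the imports one by one, a short script can try each module and report its version. A minimal sketch; check_imports is a hypothetical helper name, and it only catches ImportError, so other failures during import would still surface as exceptions.

```python
import importlib


def check_imports(module_names):
    """Return {name: version or status} for each module name."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError as exc:
            report[name] = f"MISSING ({exc})"
        else:
            # Not every module defines __version__.
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

In the py37 environment this could be called as check_imports(["cv2", "tensorflow", "keras", "torch", "torchvision"]).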

cat .keras/keras.json

{
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}
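
The same file can be read from Python, which is handy for asserting the backend and image_data_format before training. A sketch using only the json module; the default values are copied from the file shown above, and load_keras_config is a hypothetical helper name.

```python
import json
from pathlib import Path

# Defaults copied from the keras.json shown in this post.
DEFAULTS = {
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow",
    "image_data_format": "channels_last",
}


def load_keras_config(path=None):
    """Read keras.json, filling in the defaults for any missing key."""
    path = Path(path) if path else Path.home() / ".keras" / "keras.json"
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config
```

A training script could then check that load_keras_config()["image_data_format"] is "channels_last" before building the model.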

py36

import mxnet

train demo

export

# use CPU only
export CUDA_VISIBLE_DEVICES=""

# use gpu 0 1
export CUDA_VISIBLE_DEVICES="0,1"

code

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
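
The variable has to be set before tensorflow or torch initializes CUDA, so in practice it belongs at the very top of the script, before those imports. A small sketch of a helper for this; select_gpus is a hypothetical name.

```python
import os


def select_gpus(gpu_ids):
    """Restrict CUDA to the given GPU indices; an empty list means CPU only."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this first, and import tensorflow / keras / torch only afterwards:
# select_gpus([0, 1])
```

Passing an empty list sets the variable to an empty string, which matches the CPU-only export shown above.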

start train

python train.py

~/.keras folder

view keras models and datasets

ls .keras/
datasets  keras.json  models

models saved to /home/kezunlin/.keras/models/
datasets saved to /home/kezunlin/.keras/datasets/

models lists

xxx_kernels_notop.h5 for include_top = False
xxx_kernels.h5 for include_top = True
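
Following that naming rule, the cached weight files can be listed together with the include_top value they correspond to. A sketch that relies only on the _notop suffix described above; list_cached_weights is a hypothetical helper name.

```python
from pathlib import Path


def list_cached_weights(models_dir=None):
    """Return (file name, include_top) pairs for the .h5 files in models_dir."""
    if models_dir is None:
        models_dir = Path.home() / ".keras" / "models"
    result = []
    for path in sorted(Path(models_dir).glob("*.h5")):
        # Files ending in _notop.h5 were saved for include_top = False.
        include_top = not path.stem.endswith("_notop")
        result.append((path.name, include_top))
    return result
```

This makes it easy to see which weights are already cached before a model is built on a server without internet access.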

Datasets

mnist

cifar10

to skip download

wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
mkdir -p ~/.keras/datasets
# keras looks for the archive under this name
mv cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz

to load data

from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

flowers-17

animals

cat dog panda

panda images are WRONG !!!

counts

ls -lR animals/cat | grep ".jpg" | wc -l
1000
ls -lR animals/dog | grep ".jpg" | wc -l
1000
ls -lR animals/panda | grep ".jpg" | wc -l
1000

kaggle cats vs dogs

dogs-vs-cats

caltech101

caltech101

download in the background

wget -b -c http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz

Kaggle API

install and config

see kaggle-api

conda activate keras
conda install kaggle

# download kaggle.json, then move it into place
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

cat ~/.kaggle/kaggle.json
{"username":"xxx","key":"yyy"}

or by export

export KAGGLE_USERNAME=xxx
export KAGGLE_KEY=yyy

tips

Go to your account page and select 'Create API Token'; kaggle.json will be downloaded.

Ensure kaggle.json is in the location ~/.kaggle/kaggle.json to use the API.
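
A quick way to confirm the token file is in place, private, and complete is a small check like the one below. It only inspects the file as described above and does not contact Kaggle; check_kaggle_token is a hypothetical helper name.

```python
import json
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the token file (empty if it looks fine)."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]
    problems = []
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Readable or writable by group/others.
        problems.append(f"permissions are {oct(mode)}; run chmod 600")
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc}"]
    if not isinstance(token, dict):
        return problems + ["token file must contain a JSON object"]
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

An empty list means the file exists, is not readable by other users, and contains both fields.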

check version

kaggle --version
Kaggle API 1.5.5

commands overview

commands

kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}

download datasets

kaggle competitions download -c dogs-vs-cats

show leaderboard

kaggle competitions leaderboard dogs-vs-cats --show
teamId  teamName                           submissionDate       score    
------  ---------------------------------  -------------------  -------  
71046  Pierre Sermanet                    2014-02-01 21:43:19  0.98533  
66623  Maxim Milakov                      2014-02-01 18:20:58  0.98293  
72059  Owen                               2014-02-01 17:04:40  0.97973  
74563  Paul Covington                     2014-02-01 23:05:20  0.97946  
74298  we've been in KAIST                2014-02-01 21:15:30  0.97840  
71949  orchid                             2014-02-01 23:52:30  0.97733

set default competition

kaggle config set --name competition --value dogs-vs-cats
- competition is now set to: dogs-vs-cats

kaggle config set --name competition --value dogs-vs-cats-redux-kernels-edition

dogs-vs-cats
dogs-vs-cats-redux-kernels-edition

submit

kaggle c submissions
- Using competition: dogs-vs-cats
- No submissions found

kaggle c submit -f ./submission.csv -m "first submit"

The competition has already ended, so submissions are no longer accepted.
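
For an open competition, the submission.csv passed to the submit command could be written as sketched below. The id,label header is an assumption about what the competition expects; check the sample submission file of the actual competition. write_submission is a hypothetical helper name.

```python
import csv


def write_submission(path, predictions, header=("id", "label")):
    """Write (image_id, label) pairs to a CSV file, sorted by image id.

    ASSUMPTION: the competition expects an 'id,label' header; pass a
    different header tuple if its sample submission says otherwise.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for image_id, label in sorted(predictions):
            writer.writerow([image_id, label])
```

For example, write_submission("submission.csv", [(1, 0.98), (2, 0.03)]) produces a three-line file with the header followed by one row per image.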

Nvidia-docker and containers

install

# on Ubuntu the Docker engine package is docker.io (the package named "docker" is unrelated)
sudo apt-get -y install docker.io

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

restart (optional)

cat /etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

sudo systemctl enable docker
sudo systemctl start docker

if errors occur:
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
check /etc/docker/daemon.json
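
A broken daemon.json is easy to spot by parsing it before restarting docker. The sketch below checks that the text is valid JSON and that it defines the nvidia runtime shown above; check_daemon_config is a hypothetical helper name.

```python
import json


def check_daemon_config(text):
    """Return a list of problems found in the text of daemon.json."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    runtimes = config.get("runtimes", {})
    nvidia = runtimes.get("nvidia") if isinstance(runtimes, dict) else None
    if not isinstance(nvidia, dict) or "path" not in nvidia:
        return ['no "nvidia" runtime with a "path" entry']
    return []
```

It could be called as check_daemon_config(open("/etc/docker/daemon.json").read()); an empty list means the file parsed and the runtime entry is present.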

test

sudo docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi

Thu Aug 29 00:11:32 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:02:00.0 Off |                  Off |
| 43%   67C    P2   136W / 260W |  46629MiB / 48571MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:03:00.0 Off |                  Off |
| 34%   54C    P0    74W / 260W |      0MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     Off  | 00000000:82:00.0 Off |                  Off |
| 34%   49C    P0    73W / 260W |      0MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     Off  | 00000000:83:00.0 Off |                  Off |
| 33%   50C    P0    73W / 260W |      0MiB / 48571MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Add your user to the docker group so that docker can be run without sudo.

command refs

sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run -t -i --privileged nvidia/cuda bash

sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda

History

20190821: created.

Copyright

Post author: kezunlin
Post link: https://kezunlin.me/post/6b505d27/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 3.0 unless stating additionally.

Posted 2019-11-29 08:10 by kezunlin
