Setup Tensorflow GPU on CentOS 7


First create a user and install the required software:

useradd -m -s /bin/bash -G wheel leo
passwd leo
su - leo
sudo yum install git wget

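Do not run the following block. The original plan was to use asdf + Python as the TensorFlow runtime, but building Python with asdf on CentOS 7 failed because of missing OpenSSL packages, and it still failed after installing them, so I dropped this approach: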

sudo yum install git zlib-devel sqlite-devel openssl openssl-libs \
  openssl-devel libffi \
  readline-devel ncurses-devel
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.7.8
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
asdf plugin-add python
asdf install python 3.7.7

Installing TensorFlow with conda instead worked. The detailed steps follow.

Install GPU Driver

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/415.27/NVIDIA-Linux-x86_64-415.27.run
chmod 755 NVIDIA-Linux-x86_64-415.27.run
sudo ./NVIDIA-Linux-x86_64-415.27.run

The installer reported that a newer driver was already installed and aborted. Check that the driver works (smi stands for System Management Interface):

$ nvidia-smi
Wed Jun  3 07:18:08 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:01:00.0 Off |                  N/A |
| 28%   43C    P0    25W / 215W |      0MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

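The output contains two tables: the first is mostly summary information, the second shows what is using the GPU right now. Both are covered below.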

Check GPU Memory Capacity

$ pip install gpustat   # or: conda install -c conda-forge gpustat
$ gpustat
centos7                    Wed Jun  3 08:34:47 2020  440.33.01
[0] GeForce RTX 2070 SUPER | 50'C,   0 % |     0 /  7979 MB |

This matches the Memory-Usage column reported by nvidia-smi above: both show 7979 MB.

Query only the memory section:

$ nvidia-smi -q -d memory

==============NVSMI LOG==============

Timestamp                           : Wed Jun  3 07:50:00 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                       : 7979 MiB
        Used                        : 0 MiB
        Free                        : 7979 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB


$ lspci | grep NVIDIA    # the device ID is `01:00.0`
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e84 (rev a1)
...

$ lspci -v -s 01:00.0 | grep Memory
        Memory at de000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]

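The FB memory reported by the first command is the device memory capacity, again 7979 MiB, i.e. the card's nominal 8 GB. BAR1 memory is 256 MiB; it is the window through which the CPU or other devices can map and access device memory, and it matches the 256M prefetchable region reported by lspci. See the relevant sections of man nvidia-smi for the exact meaning.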
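
The relation between the 7979 MiB figure and the nominal 8 GB can be checked with a little arithmetic. The sketch below is only an illustration; the helper name `mib_to_bytes` is made up for this post.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
GIB = 1024 ** 3  # bytes in one gibibyte
GB = 1000 ** 3   # bytes in one (decimal) gigabyte


def mib_to_bytes(mib):
    """Convert a MiB value, as printed by nvidia-smi, to bytes."""
    return mib * MIB


fb_total_bytes = mib_to_bytes(7979)
print(fb_total_bytes)                  # 8366587904
print(round(fb_total_bytes / GIB, 2))  # 7.79
print(round(fb_total_bytes / GB, 2))   # 8.37
```

So the usable framebuffer is a little under 8 GiB, which is consistent with a card sold as 8 GB.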

Monitor GPU Memory Usage in Real Time

Refresh the status every two seconds:

nvidia-smi -l 2        # by -l option
gpustat -cp -i 2       # by -i option
watch -n 2 nvidia-smi  # by -n option of watch

Note: if the command run by watch produces colored output, add the -c option to watch so that the color codes are interpreted correctly.
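
The same two-second polling can be done from Python when the numbers need to go into a script rather than onto the screen. This is a minimal sketch, assuming `nvidia-smi` is on PATH and every GPU reports plain integers for both fields; the function names are made up for illustration.

```python
import subprocess
import time

# Ask nvidia-smi for used/total memory only, as bare numbers in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line):
    """Turn a line such as '0, 7979' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory(run=subprocess.check_output):
    """Return one (used_mib, total_mib) tuple per GPU."""
    output = run(QUERY, text=True)
    return [parse_memory_line(line) for line in output.splitlines() if line.strip()]


def monitor(interval=2.0, rounds=3, run=subprocess.check_output, sleep=time.sleep):
    """Print memory usage `rounds` times, waiting `interval` seconds in between."""
    for _ in range(rounds):
        for index, (used, total) in enumerate(read_gpu_memory(run)):
            print(f"GPU {index}: {used} / {total} MiB")
        sleep(interval)
```

Calling `monitor()` on the GPU machine prints three samples two seconds apart. The `run` and `sleep` parameters exist only so the loop can be exercised without a GPU.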

You can choose the output fields and format (the result is still printed to the screen):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
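
CSV output like this is easy to post-process. The sketch below parses it in Python, assuming the command was run with `--format=csv,noheader,nounits` so that each line is just `pid, process_name, used_memory`. The sample rows are hypothetical, not captured from this machine.

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    apps = []
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used_mib = row
        apps.append({"pid": int(pid), "process_name": name, "used_mib": int(used_mib)})
    return apps


# Hypothetical sample output, for illustration only.
sample = "4321, python, 7465\n5678, /usr/bin/ipython, 120\n"

# List the processes, largest GPU memory consumer first.
for app in sorted(parse_compute_apps(sample), key=lambda a: a["used_mib"], reverse=True):
    print(app)
```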

See which processes are using the GPU: lsof /dev/nvidia*

Install CUDA 10.2 and CUPTI

According to CUDA Compatibility, CUDA 10.2 supports the driver version installed on this server: 440.33.
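
A quick way to compare a driver version string against that 440.33 figure is to compare the dotted parts numerically. This is just a sketch; the function names are made up, and the default minimum is the value quoted above.

```python
def version_tuple(version):
    """Split a dotted version such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_is_new_enough(installed, minimum="440.33"):
    """True if the installed driver is at least the given minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print(driver_is_new_enough("440.33.01"))  # True  (the driver shown by nvidia-smi)
print(driver_is_new_enough("415.27"))     # False (the driver downloaded earlier)
```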

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run
# make sure CUPTI is selected in the installation list (under the CUDA command line tools menu)


#: add '/usr/local/cuda-10.2/bin' to PATH
#: (with `sudo echo ... > file` the redirection runs as the normal user, so use `sudo tee`)
echo 'export PATH=$PATH:/usr/local/cuda-10.2/bin' | sudo tee /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64' | sudo tee -a /etc/profile.d/cuda.sh
source /etc/profile.d/cuda.sh

#: add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf.d/ and run ldconfig as root
echo '/usr/local/cuda-10.2/lib64' | sudo tee -a /etc/ld.so.conf.d/cuda-10-2.conf
sudo ldconfig

Verify Installation:

If the following path exists, CUDA is most likely installed:

$ whereis cuda
cuda: /usr/local/cuda
$ ls -l $(whereis cuda | awk -F': ' '{print $2}')   # strip the leading "cuda:" label
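
For a slightly stricter check than looking at the directory, the paths used in the steps above can be tested one by one. This is a minimal sketch; the list of expected paths is my own choice (`bin/nvcc`, `lib64` and the CUPTI directory), not an official checklist.

```python
import os

# Paths, relative to the CUDA root, that the setup steps above rely on.
EXPECTED = ("bin/nvcc", "lib64", "extras/CUPTI/lib64")


def check_cuda_install(root="/usr/local/cuda"):
    """Map each expected path to True/False depending on whether it exists."""
    return {rel: os.path.exists(os.path.join(root, rel)) for rel in EXPECTED}


for rel, present in check_cuda_install().items():
    print(f"{'ok     ' if present else 'MISSING'} {rel}")
```

A `MISSING` line for `extras/CUPTI/lib64` would suggest CUPTI was not selected in the installer.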

Install cuDNN SDK

Register as an NVIDIA developer and download the following files:

  • cuDNN Runtime Library for RedHat/Centos 7.3 (RPM)
  • cuDNN Developer Library for RedHat/Centos 7.3 (RPM)
  • cuDNN Code Samples and User Guide for RedHat/Centos 7.3 (RPM)

Install these files:

sudo rpm -ivh libcudnn7-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-devel-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-doc-7.6.5.33-1.cuda10.2.x86_64.rpm

Verify Installation:

$ cat $(whereis cudnn | awk -F': ' '{print $2}') | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
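
The same three macros can be read from Python, which is handy when a script needs the cuDNN version as a number. This is a sketch under the assumption that the header uses the `#define CUDNN_MAJOR n` layout shown above; `parse_cudnn_version` is a made-up helper name, and the sample text is the grep output from this post.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(rf"^#define\s+CUDNN_{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"CUDNN_{name} not found")
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]


sample = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

major, minor, patch = parse_cudnn_version(sample)
# Same formula as the CUDNN_VERSION macro in the header.
print(major * 1000 + minor * 100 + patch)  # 7605
```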

Install MiniConda, TensorFlow and run demo

conda create -n tfgpu
conda activate tfgpu
conda install tensorflow-gpu ipython

ipython

import tensorflow as tf

#: GPU verification

tf.test.is_built_with_cuda()  # True

#: at least one GPU working
tf.test.is_gpu_available()    # True

#: the first GPU name, where operations will run
tf.test.gpu_device_name()  # '/device:GPU:0'

#: print the list of all available GPU devices
tf.config.experimental.list_physical_devices('GPU')
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

from tensorflow.python.client import device_lib
res = device_lib.list_local_devices()
len(res)

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
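
The `device_lib` check in the session above only looks at the number of devices. A small helper can pull out just the GPU entries. This is a sketch: `gpu_device_names` and `local_gpu_names` are names made up for this post, and the stand-in objects below only imitate the `name` and `device_type` attributes of the real device entries.

```python
from types import SimpleNamespace


def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_gpu_names():
    """Query TensorFlow for local devices and keep only the GPUs."""
    # Imported lazily so gpu_device_names() can be used without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())


# Stand-in objects for illustration; on the real machine call local_gpu_names().
fake_devices = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:GPU:0", device_type="GPU"),
]
print(gpu_device_names(fake_devices))  # ['/device:GPU:0']
```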


Published

May 19, 2020

Last Updated

Jun 3, 2020

