First create the user and install the basic packages:
useradd -m -s /bin/bash -G wheel leo
passwd leo
su - leo
sudo yum install git wget
Do not run this block. The original plan was to use asdf + Python as the TensorFlow runtime, but compiling Python through asdf on CentOS 7 failed for lack of OpenSSL support, and it still failed after installing the packages below, so this approach was abandoned:
sudo yum install git zlib-devel sqlite-devel openssl openssl-libs \
    openssl-devel libffi \
    readline-devel ncurses-devel
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.7.8
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
asdf plugin-add python
asdf install python 3.7.7
Installing TensorFlow with conda instead worked; the detailed steps follow.
Install GPU Driver
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/415.27/NVIDIA-Linux-x86_64-415.27.run
chmod 755 NVIDIA-Linux-x86_64-415.27.run
sudo ./NVIDIA-Linux-x86_64-415.27.run
The installer reported that a newer driver was already installed and aborted, so verify the existing driver directly (smi stands for System Management Interface):
$ nvidia-smi
Wed Jun 3 07:18:08 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 207... Off | 00000000:01:00.0 Off | N/A |
| 28% 43C P0 25W / 215W | 0MiB / 7979MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The output contains two tables: the first is mostly summary information about the device, the second shows real-time usage. Both are examined below.
Check GPU Memory Capacity
$ pip install gpustat # or: conda install -c conda-forge gpustat
$ gpustat
centos7 Wed Jun 3 08:34:47 2020 440.33.01
[0] GeForce RTX 2070 SUPER | 50'C, 0 % | 0 / 7979 MB |
This agrees with the Memory-Usage column reported by nvidia-smi above: both show 7979 MiB.
Query just the memory section:
$ nvidia-smi -q -d memory
==============NVSMI LOG==============
Timestamp : Wed Jun 3 07:50:00 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total : 7979 MiB
        Used  : 0 MiB
        Free  : 7979 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used  : 2 MiB
        Free  : 254 MiB
$ lspci | grep NVIDIA # the device ID is `01:00.0`
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e84 (rev a1)
...
$ lspci -v -s 01:00.0 | grep Memory
Memory at de000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
The FB memory reported by the previous command is the device memory capacity, again 7979 MiB, i.e. 8 GB. BAR1 memory is 256 MiB; it is the region that the CPU or other devices can map and access directly, and it matches the lspci output above. See the relevant sections of man nvidia-smi for the precise definitions.
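If you want these numbers from Python rather than the command line, the same NVML counters that nvidia-smi reads are exposed by the pynvml package. A minimal sketch, assuming pip install pynvml in the current environment (field names as in nvidia-ml-py; all values are bytes):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # first (and only) GPU
fb = pynvml.nvmlDeviceGetMemoryInfo(handle)         # FB memory usage
bar1 = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle)   # BAR1 memory usage
print('FB   total=%d MiB used=%d MiB' % (fb.total >> 20, fb.used >> 20))
print('BAR1 total=%d MiB used=%d MiB' % (bar1.bar1Total >> 20, bar1.bar1Used >> 20))
pynvml.nvmlShutdown()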
Monitor GPU Memory Usage in Real Time
Refresh the monitoring output every two seconds:
nvidia-smi -l 2 # by -l option
gpustat -cp -i 2 # by -i option
watch -n 2 nvidia-smi # by -n option of watch
Note: if the command run under watch produces colored output, add the -c option to watch so that it interprets the color escape codes correctly.
You can also choose which fields to print and in what format (still written to the screen):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
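The same --query-*/--format interface is handy for scripted monitoring. A small polling sketch, assuming nvidia-smi is on PATH and Python 3; it prints memory usage and utilization every two seconds, much like nvidia-smi -l 2:

import subprocess, time

query = ['nvidia-smi',
         '--query-gpu=index,memory.used,memory.total,utilization.gpu',
         '--format=csv,noheader,nounits']
while True:
    out = subprocess.check_output(query, universal_newlines=True)
    for line in out.strip().splitlines():
        idx, used, total, util = [f.strip() for f in line.split(',')]
        print('GPU %s: %s/%s MiB, util %s%%' % (idx, used, total, util))
    time.sleep(2)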
To see which processes are using the GPU: lsof /dev/nvidia*
Install CUDA 10.2 and CUPTI
According to the CUDA Compatibility table, CUDA 10.2 supports the driver version already running on the 49 server (440.33):
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run
# make sure CUPTI is in the installation list (under the CUDA command-line tools menu)
#: add /usr/local/cuda-10.2/bin to PATH
echo 'export PATH=$PATH:/usr/local/cuda-10.2/bin' | sudo tee /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64' | sudo tee -a /etc/profile.d/cuda.sh
source /etc/profile.d/cuda.sh
#: add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf.d/ and run ldconfig as root
echo '/usr/local/cuda-10.2/lib64' | sudo tee /etc/ld.so.conf.d/cuda-10-2.conf
sudo ldconfig
Verify Installation:
If the following files are present, the toolkit is basically installed:
$ whereis cuda
cuda: /usr/local/cuda
$ ls -l /usr/local/cuda/
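If the toolkit binaries were selected, nvcc --version should also report release 10.2 at this point. As an extra check from Python, the CUDA runtime library can be loaded directly; a sketch, assuming the ldconfig step above made libcudart.so.10.2 resolvable:

import ctypes

cudart = ctypes.CDLL('libcudart.so.10.2')
ver = ctypes.c_int(0)
assert cudart.cudaRuntimeGetVersion(ctypes.byref(ver)) == 0   # 0 == cudaSuccess
print(ver.value)   # 10020 corresponds to CUDA 10.2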
Install cuDNN SDK
Register as an NVIDIA developer and download the following files:
- cuDNN Runtime Library for RedHat/Centos 7.3 (RPM)
- cuDNN Developer Library for RedHat/Centos 7.3 (RPM)
- cuDNN Code Samples and User Guide for RedHat/Centos 7.3 (RPM)
Install these files:
sudo rpm -ivh libcudnn7-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-devel-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-doc-7.6.5.33-1.cuda10.2.x86_64.rpm
Verify Installation:
$ cat $(whereis cudnn | awk -F': ' '{print $2}') | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
Install Miniconda, TensorFlow and Run a Demo
conda create -n tfgpu
conda activate tfgpu
conda install tensorflow-gpu ipython
ipython
import tensorflow as tf
#: GPU verification
tf.test.is_built_with_cuda() # True
#: at least one GPU working
tf.test.is_gpu_available() # True
#: the first GPU name, where operations will run
tf.test.gpu_device_name() # '/device:GPU:0'
#: print the list of all available GPU devices
tf.config.experimental.list_physical_devices('GPU')
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
#: enumerate all local devices; the list should include the GPU
from tensorflow.python.client import device_lib
res = device_lib.list_local_devices()
len(res) # 2 when one CPU and one GPU are visible
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
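To confirm the demo really runs on the GPU, TensorFlow can log where each op is placed. A short sketch, assuming the TensorFlow 2.x build that conda installs; run it in a fresh session before building the model:

import tensorflow as tf
tf.debugging.set_log_device_placement(True)   # log device assignment for every op

with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)                       # the log should mention /device:GPU:0
print(b)

Another option is to keep gpustat -i 2 running in a second terminal while model.fit executes and watch the memory usage and utilization climb.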