First create the user and install the basic packages:
useradd -m -s /bin/bash -G wheel leo
passwd leo
su - leo
sudo yum install git wget
Do not run this block. The original plan was to use asdf + Python as the TensorFlow runtime, but compiling Python through asdf on CentOS 7 failed for lack of OpenSSL support, and it still failed after installing the packages below, so this approach was abandoned:
sudo yum install git zlib-devel sqlite-devel openssl openssl-libs \
    openssl-devel libffi \
    readline-devel ncurses-devel
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.7.8
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
asdf plugin-add python
asdf install python 3.7.7
Installing TensorFlow with conda instead worked; the detailed steps follow.
Install GPU Driver
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/415.27/NVIDIA-Linux-x86_64-415.27.run
chmod 755 NVIDIA-Linux-x86_64-415.27.run
sudo ./NVIDIA-Linux-x86_64-415.27.run
The installer reported that a newer driver was already installed and aborted, so verify the existing driver directly (smi stands for System Management Interface):
$ nvidia-smi
Wed Jun 3 07:18:08 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 207... Off | 00000000:01:00.0 Off | N/A |
| 28% 43C P0 25W / 215W | 0MiB / 7979MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The output contains two tables: the first is mostly summary information about the device, the second shows real-time usage. Both are examined below.
Check GPU Memory Capacity
$ pip install gpustat # or: conda install -c conda-forge gpustat
$ gpustat
centos7 Wed Jun 3 08:34:47 2020 440.33.01
[0] GeForce RTX 2070 SUPER | 50'C, 0 % | 0 / 7979 MB |
This agrees with the Memory-Usage column reported by nvidia-smi above: both show 7979 MiB.
Query just the memory section:
$ nvidia-smi -q -d memory
==============NVSMI LOG==============
Timestamp : Wed Jun 3 07:50:00 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total : 7979 MiB
        Used  : 0 MiB
        Free  : 7979 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used  : 2 MiB
        Free  : 254 MiB
$ lspci | grep NVIDIA # the device ID is `01:00.0`
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e84 (rev a1)
...
$ lspci -v -s 01:00.0 | grep Memory
Memory at de000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
The FB memory reported by the previous command is the device memory capacity, again 7979 MiB, i.e. 8 GB. BAR1 memory is 256 MiB; it is the region that the CPU or other devices can map and access directly, and it matches the lspci output above. See the relevant sections of man nvidia-smi for the precise definitions.
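If you want these numbers from Python rather than the command line, the same NVML counters that nvidia-smi reads are exposed by the pynvml package. A minimal sketch, assuming pip install pynvml in the current environment (field names as in nvidia-ml-py; all values are bytes):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # first (and only) GPU
fb = pynvml.nvmlDeviceGetMemoryInfo(handle)         # FB memory usage
bar1 = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle)   # BAR1 memory usage
print('FB   total=%d MiB used=%d MiB' % (fb.total >> 20, fb.used >> 20))
print('BAR1 total=%d MiB used=%d MiB' % (bar1.bar1Total >> 20, bar1.bar1Used >> 20))
pynvml.nvmlShutdown()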
Monitor GPU Memory Usage in Real Time
Refresh the monitoring output every two seconds:
nvidia-smi -l 2 # by -l option
gpustat -cp -i 2 # by -i option
watch -n 2 nvidia-smi # by -n option of watch
Note: if the command run under watch produces colored output, add the -c option to watch so that it interprets the color escape codes correctly.
You can also choose which fields to print and in what format (still written to the screen):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
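The same --query-*/--format interface is handy for scripted monitoring. A small polling sketch, assuming nvidia-smi is on PATH and Python 3; it prints memory usage and utilization every two seconds, much like nvidia-smi -l 2:

import subprocess, time

query = ['nvidia-smi',
         '--query-gpu=index,memory.used,memory.total,utilization.gpu',
         '--format=csv,noheader,nounits']
while True:
    out = subprocess.check_output(query, universal_newlines=True)
    for line in out.strip().splitlines():
        idx, used, total, util = [f.strip() for f in line.split(',')]
        print('GPU %s: %s/%s MiB, util %s%%' % (idx, used, total, util))
    time.sleep(2)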
To see which processes are using the GPU: lsof /dev/nvidia*
Install CUDA 10.2 and CUPTI
According to the CUDA Compatibility table, CUDA 10.2 supports the driver version already running on the 49 server (440.33):
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run
# make sure CUPTI is in the installation list (under the CUDA command-line tools menu)
#: add /usr/local/cuda-10.2/bin to PATH
echo 'export PATH=$PATH:/usr/local/cuda-10.2/bin' | sudo tee /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64' | sudo tee -a /etc/profile.d/cuda.sh
source /etc/profile.d/cuda.sh
#: add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf.d/ and run ldconfig as root
echo '/usr/local/cuda-10.2/lib64' | sudo tee /etc/ld.so.conf.d/cuda-10-2.conf
sudo ldconfig
Verify Installation:
If the following files are present, the toolkit is basically installed:
$ whereis cuda
cuda: /usr/local/cuda
$ ls -l /usr/local/cuda/
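If the toolkit binaries were selected, nvcc --version should also report release 10.2 at this point. As an extra check from Python, the CUDA runtime library can be loaded directly; a sketch, assuming the ldconfig step above made libcudart.so.10.2 resolvable:

import ctypes

cudart = ctypes.CDLL('libcudart.so.10.2')
ver = ctypes.c_int(0)
assert cudart.cudaRuntimeGetVersion(ctypes.byref(ver)) == 0   # 0 == cudaSuccess
print(ver.value)   # 10020 corresponds to CUDA 10.2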
Install cuDNN SDK
Register as an NVIDIA developer and download the following files:
- cuDNN Runtime Library for RedHat/Centos 7.3 (RPM)
- cuDNN Developer Library for RedHat/Centos 7.3 (RPM)
- cuDNN Code Samples and User Guide for RedHat/Centos 7.3 (RPM)
Install these files:
sudo rpm -ivh libcudnn7-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-devel-7.6.5.33-1.cuda10.2.x86_64.rpm
sudo rpm -ivh libcudnn7-doc-7.6.5.33-1.cuda10.2.x86_64.rpm
Verify Installation:
$ cat $(whereis cudnn | awk -F': ' '{print $2}') | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
Install Miniconda, TensorFlow and Run a Demo
conda create -n tfgpu
conda activate tfgpu
conda install tensorflow-gpu ipython
ipython
import tensorflow as tf
#: GPU verification
tf.test.is_built_with_cuda() # True
#: at least one GPU working
tf.test.is_gpu_available() # True
#: the first GPU name, where operations will run
tf.test.gpu_device_name() # '/device:GPU:0'
#: print the list of all available GPU devices
tf.config.experimental.list_physical_devices('GPU')
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
#: enumerate all local devices; the list should include the GPU
from tensorflow.python.client import device_lib
res = device_lib.list_local_devices()
len(res) # 2 when one CPU and one GPU are visible
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
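To confirm the demo really runs on the GPU, TensorFlow can log where each op is placed. A short sketch, assuming the TensorFlow 2.x build that conda installs; run it in a fresh session before building the model:

import tensorflow as tf
tf.debugging.set_log_device_placement(True)   # log device assignment for every op

with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)                       # the log should mention /device:GPU:0
print(b)

Another option is to keep gpustat -i 2 running in a second terminal while model.fit executes and watch the memory usage and utilization climb.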