GPU computation with Ubuntu 14.04 + Theano + OpenCL + libgpuarray

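The previous post showed how to use Theano with logistic regression for Kaggle's handwritten digit recognition task, and ended by noting that computing on the CPU was far too slow. After that experiment I went through the Theano documentation and learned that Theano officially supports GPU computation only through CUDA, not OpenCL; in other words, Theano officially supports only NVIDIA cards. The reason is that CUDA and OpenCL are two separate GPU computing platforms: CUDA runs only on NVIDIA cards, while OpenCL is a vendor-neutral standard that cards from other vendors can use as well (please look up the detailed differences yourself). Unfortunately my laptop has an Intel integrated GPU and an entry-level AMD discrete card. Theano does offer unofficial OpenCL support through libgpuarray, so I spent a great deal of time trying to install libgpuarray.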

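libgpuarray supports Debian 6, Ubuntu 14.04, Mac OS X 10.11 and Windows 7. Online I could find only two posts describing a successful libgpuarray installation, both on Mac OS X. Here are the links for later readers: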
https://www.robberphex.com/2016/05/521
http://codechina.org/2016/04/how-to-install-theano-on-mac-os-x-ei-caption-with-opencl-support/

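My original OS was Windows 7. I spent almost all my spare time in June on installing libgpuarray there, hit countless pitfalls, and in the end still did not succeed. For later readers, here is what a libgpuarray installation on Windows 7 requires: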

The latest AMD graphics driver (see the AMD website)
AMD APP SDK, which provides OpenCL
CMake >= 3.0 (cmake)
g++, which can usually be installed through MinGW or TDM-GCC
Visual Studio
clBLAS (clblas)
libcheck

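In July I installed Ubuntu 14.04 alongside Windows 7 as a dual boot and tried to get Theano + OpenCL GPU computation working on Ubuntu. In the end libgpuarray more or less installed successfully, but I still cannot compute on the AMD card; the specific problem is described at the end of this post. The whole process follows.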

Installing the Ubuntu 14.04 dual boot

I followed this post for the Windows 7 / Ubuntu 14.04 dual-boot installation: http://m.blog.csdn.net/article/details?id=43987599 (it is straightforward, so I won't go into detail here).

Installing the AMD graphics driver

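This is where I got stuck at first. I broke the AMD driver installation several times, and a broken installation means you cannot reach the graphical desktop after a reboot. The only option then is to repair things from a tty or from the initramfs prompt, which was too hard for a first-time Linux user like me. Often the system was still unusable after the repair and I had to reinstall; I reinstalled the system seven or eight times in total. Here is a simple and quick way to install the driver (at least it worked for me on the first try).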

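Right after installing Ubuntu 14.04, the first thing to do is switch drivers. Open Additional Drivers, shown in the figure below. The system initially uses the open-source driver; select the proprietary fglrx driver, click the "Apply Changes" button, then wait for the installation to finish and reboot.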

[Figure: the Additional Drivers dialog (附加驱动.png)]

After rebooting, open a terminal and run fglrxinfo. It prints the graphics card information, as shown below:

marcovaldo@marcovaldong:~$ fglrxinfo
display: :0  screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7400M Series
OpenGL version string: 4.5.13399 Compatibility Profile Context 15.201.1151

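Next, run fgl_glxgears in the terminal. A test window (a rotating cube) pops up, which confirms that the driver was installed successfully. I also found these two good guides to installing the driver, for reference: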
http://forum.ubuntu.org.cn/viewtopic.php?t=445434
http://www.tuicool.com/articles/6N3e2ir
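
The fglrxinfo check can also be scripted. What follows is only a minimal sketch: the function names parse_fglrxinfo and driver_looks_ok are made up for illustration, and treating an AMD vendor string as proof that fglrx is active is an assumption, not something the driver documents.

```python
import subprocess


def parse_fglrxinfo(text):
    """Collect the 'OpenGL ...' fields of fglrxinfo output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("OpenGL"):
            info[key.strip()] = value.strip()
    return info


def driver_looks_ok(info):
    # Assumption: the proprietary driver reports AMD as the OpenGL vendor.
    return "Advanced Micro Devices" in info.get("OpenGL vendor string", "")


if __name__ == "__main__":
    try:
        out = subprocess.check_output(["fglrxinfo"]).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run fglrxinfo: %s" % exc)
    else:
        info = parse_fglrxinfo(out)
        print(info)
        print("fglrx active" if driver_looks_ok(info) else "fglrx not detected")
```

The parsing is kept separate from the subprocess call so it can be tried on saved output, such as the transcript above, on a machine without the driver.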

Installing the AMD APP SDK

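Download the SDK from the AMD website (pay attention to the OS and to 32-bit vs 64-bit). I downloaded the 64-bit Linux build of AMD APP SDK 3.0. Extracting the archive produces a .sh file; run it from the terminal: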

sudo sh AMD-APP-SDK-v3.0.130.136-GA-linux64.sh

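By default the AMD APP SDK is installed under /opt/. Running clinfo in the terminal now prints the OpenCL platform and compute device information. Here is the output on my laptop: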

marcovaldo@marcovaldong:~$ clinfo
Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.0 AMD-APP (1800.11)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 2
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 AMD Radeon HD 7400M Series
  Device Topology:				 PCI[ B#1, D#0, F#0 ]
  Max compute units:				 2
  Max work items dimensions:			 3
    Max work items[0]:				 256
    Max work items[1]:				 256
    Max work items[2]:				 256
  Max work group size:				 256
  Preferred vector width char:			 16
  Preferred vector width short:			 8
  Preferred vector width int:			 4
  Preferred vector width long:			 2
  Preferred vector width float:			 4
  Preferred vector width double:		 0
  Native vector width char:			 16
  Native vector width short:			 8
  Native vector width int:			 4
  Native vector width long:			 2
  Native vector width float:			 4
  Native vector width double:			 0
  Max clock frequency:				 700Mhz
  Address bits:					 32
  Max memory allocation:			 134217728
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 16
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 2048
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 None
  Cache line size:				 0
  Cache size:					 0
  Global memory size:				 536870912
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Max pipe arguments:				 0
  Max pipe active reservations:			 0
  Max pipe packet size:				 0
  Max global variable size:			 0
  Max global variable preferred total size:	 0
  Max read/write image args:			 0
  Max on device events:				 0
  Queue on device max size:			 0
  Max on device queues:				 0
  Queue on device preferred size:		 0
  SVM capabilities:				 
    Coarse grain buffer:			 No
    Fine grain buffer:				 No
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 No
    Profiling :					 No
  Platform ID:					 0x7f98e6833430
  Name:						 Caicos
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 1.2 
  Driver version:				 1800.11
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 AMD-APP (1800.11)
  Extensions:					 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_image2d_from_buffer_read_only cl_khr_spir cl_khr_gl_event 
  Device Type:					 CL_DEVICE_TYPE_CPU
  Vendor ID:					 1002h
  Board name:					 
  Max compute units:				 4
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 1024
  Preferred vector width char:			 16
  Preferred vector width short:			 8
  Preferred vector width int:			 4
  Preferred vector width long:			 2
  Preferred vector width float:			 8
  Preferred vector width double:		 4
  Native vector width char:			 16
  Native vector width short:			 8
  Native vector width int:			 4
  Native vector width long:			 2
  Native vector width float:			 8
  Native vector width double:			 4
  Max clock frequency:				 2299Mhz
  Address bits:					 64
  Max memory allocation:			 2147483648
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 64
  Max image 2D width:				 8192
  Max image 2D height:				 8192
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 16
  Max size of kernel argument:			 4096
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 32768
  Global memory size:				 6161788928
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Global
  Local memory size:				 32768
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 2147483648
  Max global variable size:			 1879048192
  Max global variable preferred total size:	 1879048192
  Max read/write image args:			 64
  Max on device events:				 0
  Queue on device max size:			 0
  Max on device queues:				 0
  Queue on device preferred size:		 0
  SVM capabilities:				 
    Coarse grain buffer:			 No
    Fine grain buffer:				 No
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 1
  Error correction support:			 0
  Unified memory for Host and Device:		 1
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 Yes
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 No
    Profiling :					 No
  Platform ID:					 0x7f98e6833430
  Name:						 Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz
  Vendor:					 GenuineIntel
  Device OpenCL C version:			 OpenCL C 1.2 
  Driver version:				 1800.11 (sse2,avx)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 AMD-APP (1800.11)
  Extensions:					 cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event
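
The clinfo dump is long, and the part that matters for picking a device is just the type and name of each entry. Here is a rough sketch that pulls those out; list_devices is a hypothetical helper, and it assumes the field labels ("Device Type", "Name") look exactly like the ones in the output above.

```python
import subprocess


def list_devices(clinfo_text):
    """Return one {'type': ..., 'name': ...} dict per device, in clinfo order."""
    devices = []
    current = None
    for line in clinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Device Type":
            # Each device section starts with its type, e.g. CL_DEVICE_TYPE_GPU.
            current = {"type": value.replace("CL_DEVICE_TYPE_", "")}
            devices.append(current)
        elif key == "Name" and current is not None:
            # Exact match, so "Platform Name" and "Board name" are ignored.
            current["name"] = value
    return devices


if __name__ == "__main__":
    try:
        text = subprocess.check_output(["clinfo"]).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run clinfo: %s" % exc)
    else:
        for index, device in enumerate(list_devices(text)):
            print(index, device["type"], device.get("name", "?"))
```

On the output above this would list the Caicos GPU first and the Core i3 CPU second, both under the single AMD platform.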

You also need to add the following environment variables to /root/.bashrc:

# AMD APP SDK
export AMDAPPSDKROOT="/opt/AMDAPPSDK-3.0"
export AMDAPPSDKSAMPLESROOT="/opt/AMDAPPSDK-3.0/"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/opt/AMDAPP/lib/x86_64":"/opt/AMDAPP/lib/x86"
export ATISTREAMSDKROOT=$AMDAPPSDKROOT
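
A typo in these exports is easy to miss, so a quick sanity check from a fresh shell can help. This is only a sketch: check_sdk_env is a made-up helper, and it merely verifies that the variables are set and that the directories they name exist.

```python
import os


def check_sdk_env(env):
    """Return a list of human-readable problems found in the given environment."""
    problems = []
    root = env.get("AMDAPPSDKROOT")
    if not root:
        problems.append("AMDAPPSDKROOT is not set")
    elif not os.path.isdir(root):
        problems.append("AMDAPPSDKROOT points to a missing directory: %s" % root)
    # Every non-empty LD_LIBRARY_PATH entry should be an existing directory.
    for entry in env.get("LD_LIBRARY_PATH", "").split(":"):
        if entry and not os.path.isdir(entry):
            problems.append("LD_LIBRARY_PATH entry does not exist: %s" % entry)
    return problems


for problem in check_sdk_env(os.environ):
    print(problem)
```

An empty result only means the paths exist; it does not prove that the OpenCL libraries inside them are the right ones.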

At this point the AMD APP SDK is installed. Here are the posts I consulted:
https://www.blackmoreops.com/2013/11/22/install-amd-app-sdk-kali-linux/
http://blog.csdn.net/vblittleboy/article/details/8979288

Upgrading Python

Ubuntu 14.04 ships with Python 2.7.6. I upgraded it to 2.7.11 by running these three commands in the terminal:

sudo add-apt-repository ppa:fkrull/deadsnakes-python2.7
sudo apt-get update  
sudo apt-get upgrade
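
To confirm that the upgrade took effect, the interpreter can report its own version. A tiny sketch; meets_minimum is a hypothetical helper, and the 2.7.11 threshold simply mirrors the version mentioned above.

```python
import sys


def meets_minimum(version, minimum=(2, 7, 11)):
    """Compare the (major, minor, micro) part of a version tuple to a minimum."""
    return tuple(version[:3]) >= minimum


print("%s -> %s" % (sys.version.split()[0], meets_minimum(sys.version_info)))
```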

Installing libgpuarray

To keep installation errors from affecting the system-wide Python environment, we use a Python virtual environment.

sudo apt-get install python-virtualenv
sudo apt-get install python-pip
virtualenv venv
source venv/bin/activate

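We are now inside a Python virtual environment named venv, and all the steps below are carried out in it. First install the packages that Theano and libgpuarray depend on; see the official libgpuarray documentation for the exact requirements.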

pip install numpy
pip install Cython
pip install Scipy

Installing scipy may fail with an error; the following link explains how to fix it:
http://stackoverflow.com/questions/11114225/installing-scipy-and-numpy-using-pip
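
Before moving on it is worth confirming that the three packages actually import inside venv. A minimal sketch, with installed_versions as an illustrative helper name; it reads each module's __version__ attribute when there is one.

```python
import importlib


def installed_versions(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


report = installed_versions(["numpy", "Cython", "scipy"])
for name in sorted(report):
    print("%-8s %s" % (name, report[name] or "NOT INSTALLED"))
```

Run it with the virtual environment activated, otherwise it reports on the system Python instead.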

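Next, install Theano. Note that the stable Theano release 0.8.2 is out of sync with libgpuarray and raises an error when used, as described at the end of this post. I installed Theano 0.9.0dev here: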

pip install git+https://github.com/Theano/Theano.git
# I actually used robberphex's CSDN mirror; many thanks to him
# pip install git+https://code.csdn.net/u010096836/theano.git

libcheck is also needed, so install it:

sudo apt-get install check

Now install libgpuarray:

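git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
mkdir Build
cd Build
# CMakeLists.txt is in the parent directory, so point cmake at ".."
cmake .. -DCMAKE_INSTALL_PREFIX=../venv/ -DCMAKE_BUILD_TYPE=Release
make install 
export LIBRARY_PATH=$LIBRARY_PATH:$PWD/../venv/lib
export CPATH=$CPATH:$PWD/../venv/
# setup.py is in the repository root, so go back up before building pygpu
cd ..
python setup.py build
python setup.py install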
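
Once the build finishes, pygpu can be exercised on its own, without Theano in the way. This is a hedged sketch: try_init_gpu is a made-up wrapper, the device string "opencl0:0" is the one used later in this post, and reading a devname attribute from the returned context is an assumption, which is why the code falls back to the device string.

```python
def try_init_gpu(device="opencl0:0"):
    """Try to create a pygpu context and return (success, message)."""
    try:
        import pygpu
    except ImportError as exc:
        return False, "pygpu is not importable: %s" % exc
    try:
        context = pygpu.init(device)
    except Exception as exc:  # any backend failure is reported, not raised
        return False, "pygpu.init(%r) failed: %s" % (device, exc)
    # Assumption: the context exposes the device name as 'devname'.
    return True, "initialised %s" % getattr(context, "devname", device)


ok, message = try_init_gpu()
print(message)
```

If this step already fails, the problem lies in the libgpuarray/pygpu installation or the OpenCL driver rather than in Theano.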

Now let's test it. The Theano documentation provides a test program; save it as test.py:

from theano import function, config, shared, tensor, sandbox
import numpy
import time
 
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000
 
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

First, with just Theano on the CPU, the result is:

(venv)marcovaldo@marcovaldong:~/desktop$ python test.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 7.7898850441 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

Then with THEANO_FLAGS=mode=FAST_RUN,floatX=float32, first without and then with device=cpu:

(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,floatX=float32 python test.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.86811089516 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu
(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python test.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.84727883339 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu
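
The same flags can be set from inside a script instead of on the command line, because Theano reads THEANO_FLAGS when it is imported. A small sketch; theano_flags is an illustrative helper, and the import is left commented out so the snippet runs even where Theano is not installed.

```python
import os


def theano_flags(**options):
    """Join keyword options into the comma-separated THEANO_FLAGS format."""
    return ",".join("%s=%s" % (key, options[key]) for key in sorted(options))


# Must be assigned before Theano is imported for the flags to take effect.
os.environ["THEANO_FLAGS"] = theano_flags(
    mode="FAST_RUN", device="cpu", floatX="float32"
)
print(os.environ["THEANO_FLAGS"])
# import theano  # uncomment once Theano is installed
```

Switching device to "opencl0:0" here is equivalent to the command-line form used for the OpenCL run.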

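Using OpenCL, however, fails with the error below. I could not find a working fix online and hope someone who has run into this can point me in the right direction: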

(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,device=opencl0:0,floatX=float32 python test.py
ERROR (theano.sandbox.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
  File "/home/marcovaldo/myvenv/venv/local/lib/python2.7/site-packages/theano/sandbox/gpuarray/__init__.py", line 96, in <module>
    init_dev(config.device)
  File "/home/marcovaldo/myvenv/venv/local/lib/python2.7/site-packages/theano/sandbox/gpuarray/__init__.py", line 47, in init_dev
    "Make sure Theano and libgpuarray/pygpu "
RuntimeError: ('Wrong major API version for gpuarray:', -9997, 'Make sure Theano and libgpuarray/pygpu are in sync.')
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.86138486862 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

At this point, if you do not see either of the errors below, your libgpuarray installation should be fine.

RuntimeError: ('Wrong major API version for gpuarray:', -9997, 'Make sure Theano and libgpuarray/pygpu are in sync.')
RuntimeError: ('Wrong major API version for gpuarray:', -9998, 'Make sure Theano and libgpuarray/pygpu are in sync.')

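When I find time I will translate the official libgpuarray installation guide for later readers.

Today's deep learning tools all officially support NVIDIA cards, which leaves AMD cards at a real disadvantage. I hope the various deep learning tools will soon offer APIs that support AMD cards.

Finally, thanks to robberphex and Tinyfool, whose blogs gave me the ideas I needed.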
