GPU memory is limited. Even the NVIDIA TITAN X, an expensive card, has only 12 GB of memory, and sometimes 12 GB is not sufficient; lower-end GPU models have even less. If a numerical value fits in one byte, it can be stored in GPU memory as an 8-bit integer, which uses the memory far more efficiently. This article describes a method to do so.

GPUs are widely used in machine learning programs, especially for deep learning. Because GPU programming is troublesome, a good way to use GPUs when writing neural network programs is through Theano, a mathematical expression language embedded in Python.

When images are used as training data, GPU memory consumption grows quickly if each pixel is represented by a 32-bit floating-point number, because each number occupies 4 bytes. To reduce memory consumption, 16-bit floating-point numbers are supported as a predefined data type by systems such as NVIDIA cuDNN ver. 2. However, this method still consumes two bytes per number, and systems such as Theano do not support 16-bit floats. A pixel with 256 shades can be represented by an 8-bit integer, so if the data can be packed into GPU memory in this form, four times as much data can be stored.
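To make the saving concrete, here is a small NumPy sketch (the data set size is hypothetical, chosen to match MNIST-sized images) comparing the memory footprint of the same pixel data stored as 32-bit floats and as 8-bit integers:

```python
import numpy

# Hypothetical data set: 50,000 grayscale images of 28x28 pixels.
n_images, width, height = 50000, 28, 28

as_float32 = numpy.zeros((n_images, width * height), dtype='float32')
as_int8 = numpy.zeros((n_images, width * height), dtype='int8')

print(as_float32.nbytes)  # 156800000 bytes (~150 MB)
print(as_int8.nbytes)     # 39200000 bytes (~37 MB), one quarter
```

The same ratio applies on the GPU: an int8 array holds four times as many pixels in the same amount of memory as a float32 array.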

Several Theano programs for deep learning are described in the Deep Learning Tutorial. This tutorial contains a program and an explanation of logistic regression. The program contains a function called load_data(), which reads a data set into GPU memory and generates a 32-bit floating-point array. By slightly modifying this function, an 8-bit integer array can be generated instead. The function that generates the (32-bit floating-point) array, which is included in load_data(), is as follows.

    def shared_dataset(data_xy, borrow=True):
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX),
                                 borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX),
                                 borrow=borrow)
        return shared_x, T.cast(shared_y, 'int32')  # "T" means theano.tensor

This can be modified as follows, so that shared_x becomes an 8-bit integer array. It is converted to a 32-bit floating-point array just before use (otherwise, Theano outputs error messages). The values of the array must, of course, be normalized into the range −128 to 127; for example, if the original values are uint8, 128 must be subtracted.

    def shared_dataset(data_xy, borrow=True):
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype='int8'),
                                 borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y, dtype='int32'),
                                 borrow=borrow)
        return T.cast(shared_x, 'float32'), shared_y
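The required range shift can be checked outside Theano. Below is a minimal NumPy sketch (the variable names are illustrative) that moves uint8 pixel values into the signed int8 range by subtracting 128, and then casts back to float32, as the T.cast() in the modified function does on the GPU:

```python
import numpy

# Hypothetical uint8 pixel values in 0..255.
pixels = numpy.array([0, 1, 127, 128, 254, 255], dtype='uint8')

# Subtract 128 so the values fit the signed int8 range -128..127.
# (Cast to int16 first so the subtraction does not wrap around.)
packed = (pixels.astype('int16') - 128).astype('int8')
print(packed)  # [-128 -127   -1    0  126  127]

# Just before use, the int8 data is cast back to float32,
# mirroring T.cast(shared_x, 'float32').
restored = packed.astype('float32')
print(restored.dtype)  # float32
```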

Because shared_x is now a mathematical expression rather than a shared variable, the get_value() function can no longer be used on it, so the modification shown below is required.

    train_set_x, train_set_y = datasets[0]
    # datasets[0] is a return value of load_data() (i.e., shared_x)
    ...
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    ...

This program becomes as follows.

    train_set_x, train_set_y = datasets[0]
    ...
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_y.eval().shape[0] / batch_size
    ...

In this program, not only is get_value(borrow=True) replaced by eval(), but train_set_x is also replaced by train_set_y. Since train_set_x and train_set_y have the same number of rows, and train_set_y is still a plain shared variable, evaluating it avoids the cast applied to train_set_x; the purpose of this replacement is to reduce the overhead.

Floating-point data can be normalized so that the mean value becomes 0; however, this is difficult for integers, so it is better to subtract the mean value just before use. In addition, it is better to adjust the range to be between −1 and 1. The input variable is named **x** in this tutorial, so **x** is to be replaced by **((x - T.mean(x)) / 128.0)**. For example, in the program of logistic regression, the line below is to be replaced.

    classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)

This must be replaced by the following line.

    classifier = LogisticRegression(input=((x - T.mean(x)) / 128.0), n_in=28 * 28, n_out=10)
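The effect of this expression can be checked numerically. The following NumPy sketch (with hypothetical pixel values) applies the same normalization, ((x - mean(x)) / 128.0), to int8 data that has already been shifted by −128 as described above:

```python
import numpy

# Hypothetical int8 pixel values, already shifted into -128..127.
x = numpy.array([-128, -64, 0, 64, 127], dtype='int8').astype('float32')

normalized = (x - x.mean()) / 128.0
print(normalized)         # mean-centered values, roughly within -1..1
print(normalized.mean())  # approximately 0 (up to float32 rounding)
```

The subtraction centers the data at 0 and the division by 128.0 brings it into the range between −1 and 1, which is what the replacement expression computes symbolically on the GPU.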

In the modified program, a certain amount of extra computation is required every time the contents of the array are extracted; however, the overhead is less than 10% when the computation is large-scale, such as that of a convolutional neural network.