Having gone through some machine learning basics earlier, I spent the last couple of days on a few small experiments: using Theano to tackle Kaggle's introductory problem, handwritten digit recognition.

Theano is a Python library that lets us define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays, which makes it an important building block for implementing deep learning frameworks. The Deep Learning Tutorials show how to use Theano to assemble the deep learning networks we need for practical problems. Here I skip the basics of Theano (readers who need them can read the Theano basic tutorial) and directly adapt the Logistic Regression, Multilayer Perceptron, and Deep Convolutional Network examples from the Deep Learning Tutorials to Kaggle's introductory problem, digit recognition. The accuracy is not high, but I hope this offers a little help to those who come after.

This post first covers solving the digit recognition problem with logistic regression. The data of the Kaggle digit recognizer competition comes from MNIST, except that the 70000 examples are split into a train set and a test set: the train set contains 42000 examples, each with a label; the test set contains 28000 examples without labels. We therefore train a model on the 42000 train examples and then use it to label the 28000 test examples. This post uses a scheme similar to 10-fold cross validation: the train set is divided into ten folds, the last fold serves as the test set during model training, and each training run picks one of the first nine folds as the validation set and the remaining eight folds as the training set. This yields nine parameter settings for the same model. At prediction time all nine parameter settings take part: for one input example we predict with each of the nine parameter settings, obtain nine predicted values, and take the label that appears most often among them as the final prediction for that example.
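
To make the voting step concrete, here is a tiny NumPy sketch of combining nine per-model predictions by majority vote (a toy example with five inputs and made-up predictions; the actual implementation in predict_cross further below uses a dictionary of counts instead of bincount):

import numpy

# toy example: predictions of 9 models (rows) for 5 input examples (columns), values made up
predicted_labels = numpy.array([
    [3, 7, 1, 0, 9],
    [3, 7, 1, 0, 4],
    [3, 7, 2, 0, 9],
    [3, 1, 1, 0, 9],
    [5, 7, 1, 0, 9],
    [3, 7, 1, 6, 9],
    [3, 7, 1, 0, 9],
    [3, 7, 7, 0, 9],
    [8, 7, 1, 0, 9],
])

# for every column (example) count how often each digit 0-9 was predicted
# and keep the most frequent one as the combined prediction
combined = [numpy.bincount(predicted_labels[:, i], minlength=10).argmax()
            for i in range(predicted_labels.shape[1])]
print(combined)  # [3, 7, 1, 0, 9]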

Experiment Code

First, the function that reads the training data. It takes two parameters: file, the file containing the training data, and partion, which selects the part of the train set to use as the validation set. (The listing below shows a fixed 3/5 / 3/10 / 1/10 split and only takes file; the fold-based variant read_cross_train, which also takes partion, is sketched after the listing.)

from __future__ import print_function

import csv
import os
import pickle
import sys
import timeit

import numpy
import theano
import theano.tensor as T


def read_train(file):
    rawData = []
    reader = csv.reader(open(file))
    for line in reader:
        rawData.append(line)
    rawData.pop(0)  # the first line is the title row
    data = numpy.array(rawData).astype(numpy.int32)
    # the whole dataset now sits in ``data``;
    # split it into train/validation/test parts
    train_x = []; train_y = []
    valid_x = []; valid_y = []
    test_x = []; test_y = []
    for i in range(25200):
        train_x.append(data[i][1:785])
        train_y.append(data[i][0])
    for i in range(25200, 37800):
        valid_x.append(data[i][1:785])
        valid_y.append(data[i][0])
    for i in range(37800, 42000):
        test_x.append(data[i][1:785])
        test_y.append(data[i][0])
    del data

    def shared_dataset(data_xy, borrow=True):
        """ Function that loads the dataset into shared variables

        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code is run on GPU).
        Since copying data into the GPU is slow, copying a minibatch every
        time it is needed (the default behaviour if the data is not in a
        shared variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x), borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y), borrow=borrow)
        # When storing data on the GPU it has to be stored as floats
        # therefore we will store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use labels as index, and if they are
        # floats it doesn't make sense) therefore instead of returning
        # ``shared_y`` we will have to cast it to int. This little hack
        # lets us get around this issue
        return shared_x, T.cast(shared_y, 'int32')

    train_x, train_y = shared_dataset((train_x, train_y))
    valid_x, valid_y = shared_dataset((valid_x, valid_y))
    test_x, test_y = shared_dataset((test_x, test_y))
    rval = [(train_x, train_y), (valid_x, valid_y),
            (test_x, test_y)]
    return rval
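
The training function shown later loads its data via read_cross_train(data_path, partion), whose listing does not appear in this post. As a reference, here is a minimal sketch consistent with the splitting scheme described at the beginning (the last tenth of train.csv as the test set, fold number partion as the validation set, and the remaining eight folds as the training set); it is a reconstruction for illustration, not the original code:

def read_cross_train(file, partion):
    # sketch/reconstruction: one possible realisation of the fold-based split,
    # not the listing actually used in my experiments
    rawData = []
    reader = csv.reader(open(file))
    for line in reader:
        rawData.append(line)
    rawData.pop(0)  # skip the title row
    data = numpy.array(rawData).astype(numpy.int32)

    fold = len(data) // 10  # 4200 examples per fold
    test_rows = data[9 * fold:]                              # last fold -> test set
    valid_rows = data[partion * fold:(partion + 1) * fold]   # fold ``partion`` -> validation set
    train_rows = numpy.vstack([data[i * fold:(i + 1) * fold]
                               for i in range(9) if i != partion])  # remaining eight folds

    def to_shared(rows):
        x = theano.shared(numpy.asarray(rows[:, 1:785]), borrow=True)
        y = theano.shared(numpy.asarray(rows[:, 0]), borrow=True)
        return x, T.cast(y, 'int32')

    return [to_shared(train_rows), to_shared(valid_rows), to_shared(test_rows)]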

Next is the reading function for the test set; its structure is similar to the previous function.

def read_test(file):
    rawData = []
    reader = csv.reader(open(file))
    for line in reader:
        rawData.append(line)
    rawData.pop(0)  # the first line is the title row
    data = numpy.array(rawData).astype(numpy.int32)
    test_set_x = []
    for i in range(len(data)):
        test_set_x.append(data[i])
    del data
    shared_data = theano.shared(numpy.asarray(test_set_x), borrow=True)
    return shared_data

Below is the logistic regression class, taken unchanged from the Logistic Regression chapter of the Deep Learning Tutorials.

class LogisticRegression(object):
    """Multi-class logistic regression class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie
        """
        # start-snippet-1
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(value=numpy.zeros((n_in, n_out),
            dtype=theano.config.floatX), name='W', borrow=True)
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(value=numpy.zeros((n_out,),
            dtype=theano.config.floatX), name='b', borrow=True)

        # symbolic expression for computing the matrix of class-membership
        # probabilities, where:
        # W is a matrix where column-k represents the separation hyperplane
        # for class-k
        # x is a matrix where row-j represents input training sample-j
        # b is a vector where element-k represents the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        # end-snippet-1

        # parameters of the model
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that the learning rate
        is less dependent on the batch size
        """
        # start-snippet-2
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # number of examples (call it n) in the minibatch.
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0, 1, 2, ..., n-1]. T.log(self.p_y_given_x) is a matrix of
        # Log-Probabilities (call it LP) with one row per example and
        # one column per class. LP[T.arange(y.shape[0]), y] is a vector
        # v containing [LP[0, y[0]], LP[1, y[1]], LP[2, y[2]], ...,
        # LP[n-1, y[n-1]]] and T.mean(LP[T.arange(y.shape[0]), y]) is
        # the mean (across minibatch examples) of the elements in v, i.e.,
        # the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
        # end-snippet-2

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch; zero one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """
        # check if y has same dimension as y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
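
As a quick sanity check of the fancy indexing inside negative_log_likelihood, the following plain-NumPy snippet (with made-up class probabilities for a minibatch of three examples) shows what LP[arange(n), y] picks out:

import numpy

# made-up class probabilities for 3 examples over 4 classes
p_y_given_x = numpy.array([[0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.5, 0.2, 0.1],
                           [0.1, 0.1, 0.2, 0.6]])
y = numpy.array([0, 1, 3])  # correct label of each example

LP = numpy.log(p_y_given_x)          # matrix of log-probabilities
v = LP[numpy.arange(y.shape[0]), y]  # [LP[0, 0], LP[1, 1], LP[2, 3]]
nll = -v.mean()                      # mean negative log-likelihood over the minibatch
print(nll)                           # about 0.52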

Next is the model-training function, adapted from the training code in the Logistic Regression chapter of the Deep Learning Tutorials. It takes one extra parameter, partion, which specifies which part of the train set is used as the validation set for this training run.

The choice of parameters is discussed in the Conclusions section.

def sgd_optimization_mnist(learning_rate=0.3, n_epochs=1000,
                           data_path=r'E:\Lab\digitrecognizer\train.csv',
                           save_path=r'E:\Lab\digitrecognizer\my_best_model.pkl',
                           partion=8, batch_size=800):
    # load the dataset
    datasets = read_cross_train(data_path, partion)
    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size  # 25200 // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size  # 12600 // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size    # 4200 // batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print('... building the model')

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch

    # generate symbolic variables for input (x and y represent a minibatch)
    x = T.imatrix('x')  # data, presented as rasterized images
    y = T.ivector('y')  # labels, presented as 1D vector of [int] labels

    # construct the logistic regression class
    # Each MNIST image has size 28*28
    classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)

    # the cost we minimize during training is the negative log likelihood of
    # the model in symbolic format
    cost = classifier.negative_log_likelihood(y)

    # compiling a Theano function that computes the mistakes that are made by
    # the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    # compute the gradient of cost with respect to theta = (W, b)
    g_W = T.grad(cost=cost, wrt=classifier.W)
    g_b = T.grad(cost=cost, wrt=classifier.b)

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs.
    updates = [(classifier.W, classifier.W - learning_rate * g_W),
               (classifier.b, classifier.b - learning_rate * g_b)]

    # compiling a Theano function `train_model` that returns the cost, but at
    # the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    ###############
    # TRAIN MODEL #
    ###############
    print('... training the model')
    # early-stopping parameters
    patience = 5000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience // 2)
    # go through this many minibatches before checking the network
    # on the validation set; in this case we check every epoch

    best_validation_loss = numpy.inf
    best_test_loss = numpy.inf
    test_score = 0.
    start_time = timeit.default_timer()

    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        for minibatch_index in range(n_train_batches):
            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i in range(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' % (
                    epoch, minibatch_index + 1, n_train_batches, this_validation_loss * 100.))
                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    best_validation_loss = this_validation_loss
                    # test it on the test set
                    test_losses = [test_model(i) for i in range(n_test_batches)]
                    test_score = numpy.mean(test_losses)
                    print(('    epoch %i, minibatch %i/%i, test error of best model %f %%')
                          % (epoch, minibatch_index + 1, n_train_batches, test_score * 100.))
                    # save the best model
                    with open(save_path, 'wb') as f:
                        pickle.dump(classifier, f)
            if patience < iter:
                done_looping = True
                break

    end_time = timeit.default_timer()
    print(('Optimization complete with best validation score of %f %%, '
           'with test performance %f %%')
          % (best_validation_loss * 100., test_score * 100.))
    print('The code ran for %d epochs, with %f epochs/sec' % (
        epoch, 1. * epoch / (end_time - start_time)))
    print(('The code for file ' +
           os.path.split(__file__)[1] +
           ' ran for %.1fs' % (end_time - start_time)), file=sys.stderr)

Finally, the prediction function. It takes two parameters: data_path, the file containing the test data, and has_label, which indicates whether the test data contains labels. If labels are present, the function computes the error rate; otherwise it writes the predictions to the file answer.csv.

def predict_cross(has_label=1, data_path=r'E:\Lab\digitrecognizer\train.csv'):
    if has_label == 1:
        # load the labeled set (here the whole train.csv serves as a labeled test set;
        # the reader is expected to return the full (x, y) pair as shared variables)
        test_set_x, test_set_y = read_train(data_path)
        test_set_x = test_set_x.get_value()
        test_set_y = test_set_y.get_value()
        # load a different model from each pkl file
        classifiers = []
        for i in range(9):
            model_path = r'E:\Lab\digitrecognizer\model' + str(i) + '.pkl'
            classifier = pickle.load(open(model_path, 'rb'))
            classifiers.append(classifier)
        # compile 9 predictor functions with different parameters
        models = []
        for i in range(9):
            model = theano.function(inputs=[classifiers[i].input], outputs=classifiers[i].y_pred)
            models.append(model)
        # make predictions with each of the 9 models
        predicted_labels = []
        for i in range(9):
            predicted = models[i](test_set_x[:42000])
            predicted_labels.append(predicted)
        predicted_label = []
        for i in range(42000):
            dictionary = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
            for j in range(9):
                dictionary[str(predicted_labels[j][i])] += 1
            # take the label predicted by the largest number of the 9 models
            # as the final prediction for this example
            label_number = max(dictionary.values())
            for key in dictionary:
                if dictionary[key] == label_number:
                    predicted_label.append(int(key))
                    break
        err = 0.
        for i in range(42000):
            print("The %d th example's predict is %d, and it's target value is %d." % (i, predicted_label[i], test_set_y[i]))
            if predicted_label[i] != test_set_y[i]:
                err += 1
        print("The error rate of the combined model on the 42000 examples is %f %%." % (err / 420))
        errorate = []  # e.g. [8.875, 8.800, 8.925, 8.975, 8.375, 8.300, 8.925, 9.225, 8.925]
        # compute the error rate of each of the 9 models on the whole train set
        for i in range(9):
            err = 0.
            for j in range(42000):
                if predicted_labels[i][j] != test_set_y[j]:
                    err += 1
            errorate.append(err / 420)
        print("The mean error rate of 9 different models is %f" % numpy.mean(errorate))
        print(errorate)
    else:
        # load the unlabeled test set
        test_set_x = read_test(data_path)
        test_set_x = test_set_x.get_value()
        # load a different model from each pkl file
        classifiers = []
        for i in range(9):
            model_path = r'E:\Lab\digitrecognizer\model' + str(i) + '.pkl'
            classifier = pickle.load(open(model_path, 'rb'))
            classifiers.append(classifier)
        # compile 9 predictor functions with different parameters
        models = []
        for i in range(9):
            model = theano.function(inputs=[classifiers[i].input], outputs=classifiers[i].y_pred)
            models.append(model)
        # make predictions with each of the 9 models
        predicted_labels = []
        for i in range(9):
            predicted = models[i](test_set_x[:28000])
            predicted_labels.append(predicted)
        predicted_label = []
        for i in range(28000):
            dictionary = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
            for j in range(9):
                dictionary[str(predicted_labels[j][i])] += 1
            # take the label predicted by the largest number of the 9 models
            # as the final prediction for this example
            label_number = max(dictionary.values())
            for key in dictionary:
                if dictionary[key] == label_number:
                    predicted_label.append(int(key))
                    break
        print(predicted_label)
        print(len(predicted_label))
        saveResult(predicted_label, r'E:\Lab\digitrecognizer\answer.csv')


def saveResult(result, file):
    with open(file, 'wb') as f:
        ob = csv.writer(f)
        ob.writerow(["ImageId", "Label"])
        ids = range(1, len(result) + 1)
        ob.writerows(zip(ids, result))


def predict(file1, file2):
    # load the saved model
    classifier = pickle.load(open(file1, 'rb'))
    # compile a predictor function
    predict_model = theano.function(inputs=[classifier.input], outputs=classifier.y_pred)
    # make prediction
    data_path = r'E:\Lab\digitrecognizer\test.csv'
    test_set_x = read_test(data_path)
    test_set_x = test_set_x.get_value()
    predicted_values = predict_model(test_set_x[:28000])
    # save the prediction results
    saveResult(predicted_values, file2)

Last comes the main function, shown below; it can of course be adjusted to fit one's actual needs.

if __name__ == '__main__':
    start_time = timeit.default_timer()
    # train 9 models (one per choice of validation fold)
    for i in range(9):
        save_model_path = r'E:\Lab\digitrecognizer\my_model' + str(i) + '.pkl'
        sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                               data_path=r'E:\Lab\digitrecognizer\train.csv',
                               save_path=save_model_path, partion=i, batch_size=800)
    # evaluate the combined model on the train set
    # predict_cross(has_label=1, data_path=r'E:\Lab\digitrecognizer\train.csv')
    # predict on the test set and save the final result to answer.csv
    # predict_cross(has_label=0, data_path=r'E:\Lab\digitrecognizer\test.csv')
    # predict on the test set with each individual model and save its result
    for i in range(9):
        model_path = r'E:\Lab\digitrecognizer\model' + str(i) + '.pkl'
        save_path = r'E:\Lab\digitrecognizer\answer' + str(i) + '.csv'
        predict(model_path, save_path)
    # then submit the ten sets of results to kaggle and compare the accuracies
    end_time = timeit.default_timer()
    print("Time consumption is %f sec." % (end_time - start_time))

Conclusions

For the choice of parameters we ran several comparison tests (using the first 3/5 of train.csv as the training set, the last 1/10 as the test set, and the middle 3/10 as the validation set) and settled on the parameters given above; one set of comparison runs is listed below. The test machine dates from 2012, with an i3-2350M CPU at 2.30 GHz, 6 GB of RAM, and no GPU (the machine is rather weak, nothing to be done about that). While writing this post I also tried a few settings with batch_size above 1000; once those runs finish, I will update the numbers if any of them achieves a lower error rate.

batch_size  learning_rate  epochs  time (s)  epochs/sec  valid_err (%)  test_err (%)
 300        0.13           126     2207      0.057        9.786          9.904
 500        0.13           270     2736      0.097       10.064          9.700
 800        0.13           162     1186      0.137        9.767          9.425
 800        0.3            394     2922      0.135        9.98           9.875
 800        0.03           202     1369      0.147       10.083         10.075
1000        0.13           201     1205      0.167        9.792          9.475
1400        0.13           408     2281      0.1789       9.6111         9.500
1600        0.13           334     1590      0.2101       9.089          8.4375
3200        0.13           714     1974      0.3622       9.083          8.5313
4200        0.13           834     2529      0.3297       9.2540         9.0714

Looking only at the MNIST dataset and the logistic regression model, the table above suggests a few conclusions: the larger the batch_size, the faster the computation (more epochs per second), and the accuracy also improves slightly; a larger learning rate means more epochs before early stopping kicks in, i.e. more passes over the whole train set.

From the experiments, the individual error rates of the 9 models on train.csv are [7.783333333333333, 7.804761904761905, 7.457142857142857, 7.554761904761905, 7.997619047619048, 7.642857142857143, 7.571428571428571, 7.742857142857143, 7.164285714285715] (in %), while combining the 9 models for prediction gives an error rate of 7.097619 % on train.csv, 0.538 percentage points lower than the average error rate of the individual models. So the 10-fold cross-validation scheme does improve the model, though the effect is not dramatic.

We then combined the 9 models to predict the data in test.csv; the submission scored an accuracy of 0.91429 on Kaggle, as the screenshot below shows. Predicting test.csv with each of the 9 models individually yields accuracies of [0.91057, 0.90543, 0.90986, 0.91371, 0.91214] so far (not all the results are in yet, since Kaggle only allows 5 submissions per day). The 10-fold cross-validation approach thus gains roughly 0.4 percentage points of accuracy; possibly because the experimental setup is not ideal, the gain is smaller than hoped.

(Screenshot: Kaggle submission result showing an accuracy of 0.91429.)

One more impression: Theano on this machine is really slow. Training a single logistic regression model takes about 20 minutes, the MLP later takes at least 80-odd minutes, and the convolutional network takes even longer. Once the MLP and convolutional network experiments are written up, I plan to get started with libgpuarray; running the computation on the GPU is said to be much faster than on the CPU.