In earlier posts I used theano to tackle the Kaggle handwritten digit recognition task, first with logistic regression and then with a multilayer perceptron. This post adds a convolutional neural network and summarizes all three approaches together. The next step is to keep studying theano and other deep learning tools systematically and to implement more models and optimization methods myself.

Logistic Regression

First, logistic regression. Here I used a scheme similar to ten-fold cross validation: the data in train.csv is split into ten parts, with the last part serving as the test set during model training. In each training run, one of the first nine parts is chosen as the validation set and the remaining eight as the training set, which yields nine sets of parameters for the same model. At prediction time, all nine parameter sets take part: for each test example we predict with each of the nine parameter sets, obtaining nine predicted values, and the label that appears most often among them becomes the final prediction for that example.
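
The voting rule itself fits in a few lines. Here is a minimal numpy sketch of it (the function name and the shape convention are mine, for illustration; the full implementation, using a plain dictionary, appears later in predict_cross):

import numpy

def majority_vote(predictions):
    """predictions: int array of shape (n_models, n_examples), one row of
    predicted labels per trained model (illustrative helper, not part of
    the code below)."""
    # numpy.bincount counts occurrences of each label 0..9; argmax picks
    # the most frequent one for each example
    return numpy.array([numpy.bincount(col, minlength=10).argmax()
                        for col in predictions.T])

# example: three of five models say 7, so the ensemble answers 7
print(majority_vote(numpy.array([[7], [1], [7], [7], [2]])))  # -> [7]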

Here is the data-reading code.

# imports used throughout this post (not shown in the original listings)
from __future__ import print_function
import csv
import os
import sys
import timeit
import pickle
import numpy
import pandas
import theano
import theano.tensor as T
from theano.tensor.signal import pool
from theano.tensor.nnet import conv2d

def read_cross_train(file, partion):
    rawData = []
    file = csv.reader(open(file))
    for line in file:
        rawData.append(line)
    rawData.pop(0)  # remove the first line, which is the header line
    data = numpy.array(rawData).astype(numpy.int32)
    # the data is now stored in `data`; next, split the whole dataset into
    # train/validation/test parts, where the validation set is the partion-th
    # tenth of the dataset and the test set is the last tenth
    train_x = []; train_y = []
    valid_x = []; valid_y = []
    test_x = []; test_y = []
    # validation set
    for i in range(partion * 4200, (partion + 1) * 4200):
        valid_x.append(data[i][1:785])
        valid_y.append(data[i][0])
    data = numpy.delete(data, [i for i in range(partion * 4200, (partion + 1) * 4200)], 0)
    # train set
    for i in range(0, 33600):
        train_x.append(data[i][1:785])
        train_y.append(data[i][0])
    # test set
    for i in range(33600, 37800):
        test_x.append(data[i][1:785])
        test_y.append(data[i][0])
    del data
    train_x, train_y = shared_dataset((train_x, train_y))
    valid_x, valid_y = shared_dataset((valid_x, valid_y))
    test_x, test_y = shared_dataset((test_x, test_y))
    rval = [(train_x, train_y), (valid_x, valid_y),
            (test_x, test_y)]
    return rval

def read_test(file):
    rawData = []
    file = csv.reader(open(file))
    for line in file:
        rawData.append(line)
    rawData.pop(0)  # the first line is the header
    data = numpy.array(rawData).astype(numpy.int32)
    test_set_x = []
    for i in range(len(data)):
        test_set_x.append(data[i])
    del data
    shared_data = theano.shared(numpy.asarray(test_set_x), borrow=True)
    return shared_data

Next is the logistic regression class, which comes unchanged from the Logistic Regression chapter of the Deep Learning Tutorials.
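
For reference, the model this class implements: class-membership probabilities come from a softmax over affine scores, and training minimizes the mean negative log-likelihood,

P(Y = k | x; W, b) = exp(W_k·x + b_k) / Σ_j exp(W_j·x + b_j)

NLL(W, b) = -(1/n) Σ_{i=1..n} log P(Y = y_i | x_i; W, b)

which correspond directly to p_y_given_x and negative_log_likelihood in the code below.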

class LogisticRegression(object):
    """Multi-class logistic regression class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
        """Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)
        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie
        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie
        """
        # start-snippet-1
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(value=numpy.zeros((n_in, n_out),
                               dtype=theano.config.floatX), name='W', borrow=True)
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(value=numpy.zeros((n_out,),
                               dtype=theano.config.floatX), name='b', borrow=True)
        # symbolic expression for computing the matrix of class-membership
        # probabilities, where:
        # W is a matrix where column-k represents the separation hyperplane
        # for class-k
        # x is a matrix where row-j represents input training sample-j
        # b is a vector where element-k represents the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
        # symbolic description of how to compute the prediction as the class
        # whose probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        # end-snippet-1
        # parameters of the model
        self.params = [self.W, self.b]
        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        """
        Return the mean of the negative log-likelihood of the prediction of
        this model under a given target distribution.

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that the learning rate
        is less dependent on the batch size
        """
        # start-snippet-2
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # the number of examples (call it n) in the minibatch.
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0, 1, 2, ..., n-1]. T.log(self.p_y_given_x) is a matrix of
        # log-probabilities (call it LP) with one row per example and
        # one column per class. LP[T.arange(y.shape[0]), y] is a vector
        # v containing [LP[0, y[0]], LP[1, y[1]], LP[2, y[2]], ...,
        # LP[n-1, y[n-1]]] and T.mean(LP[T.arange(y.shape[0]), y]) is
        # the mean (across minibatch examples) of the elements in v, i.e.,
        # the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
        # end-snippet-2

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch; zero-one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """
        # check if y has the same dimension as y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()

def shared_dataset(data_xy, borrow=True):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    it is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x), borrow=borrow)
    shared_y = theano.shared(numpy.asarray(data_y), borrow=borrow)
    # When storing data on the GPU it has to be stored as floats,
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as indices, and if they are
    # floats it doesn't make sense), therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue.
    return shared_x, T.cast(shared_y, 'int32')

Next comes the model-training function, adapted from the training function in the Logistic Regression chapter of the Deep Learning Tutorials. It takes one extra parameter, partion, which specifies which part of train.csv serves as the validation set for this run.

The choice of parameter values is discussed in the conclusions section.

def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           data_path=r'train.csv',
                           save_path=r'my_best_model.pkl', partion=8, batch_size=800):
    # load dataset
    # datasets = read_raw_train(data_path)
    datasets = read_cross_train(data_path, partion)
    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size  # 33600 // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size  # 4200 // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size    # 4200 // batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print('...building the model')
    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    # generate symbolic variables for input (x and y represent a minibatch)
    x = T.imatrix('x')  # data, presented as rasterized images
    y = T.ivector('y')  # labels, presented as a 1D vector of [int] labels
    # construct the logistic regression class
    # Each MNIST image has size 28*28
    classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
    # the cost we minimize during training is the negative log likelihood of
    # the model in symbolic format
    cost = classifier.negative_log_likelihood(y)
    # compiling a Theano function that computes the mistakes that are made by
    # the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # compute the gradient of cost with respect to theta = (W, b)
    g_W = T.grad(cost=cost, wrt=classifier.W)
    g_b = T.grad(cost=cost, wrt=classifier.b)
    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs.
    updates = [(classifier.W, classifier.W - learning_rate * g_W),
               (classifier.b, classifier.b - learning_rate * g_b)]
    # compiling a Theano function `train_model` that returns the cost and at
    # the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    ###############
    # TRAIN MODEL #
    ###############
    print('... training the model')
    patience = 5000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience // 2)
    # go through this many minibatches before checking the network on the
    # validation set; in this case we check every epoch
    best_validation_loss = numpy.inf
    best_test_loss = numpy.inf
    test_score = 0.
    start_time = timeit.default_timer()
    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        for minibatch_index in range(n_train_batches):
            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i in range(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' % (
                    epoch, minibatch_index + 1, n_train_batches, this_validation_loss * 100.))
                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * \
                            improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    best_validation_loss = this_validation_loss
                    # test it on the test set
                    test_losses = [test_model(i)
                                   for i in range(n_test_batches)]
                    test_score = numpy.mean(test_losses)
                    print(
                        (' epoch %i, minibatch %i/%i, test error of best model %f %%')
                        % (epoch, minibatch_index + 1, n_train_batches, test_score * 100.))
                    # save the best model
                    with open(save_path, 'wb') as f:
                        pickle.dump(classifier, f)
            if patience < iter:
                done_looping = True
                break
    end_time = timeit.default_timer()
    print(
        ('Optimization complete with best validation score of %f %%, with test performance %f %%')
        % (best_validation_loss * 100., test_score * 100.))
    print('The code ran for %d epochs, with %f epochs/sec' % (
        epoch, 1. * epoch / (end_time - start_time)))
    print(('The code for file ' +
           os.path.split(__file__)[1] +
           ' ran for %.1fs' % (end_time - start_time)), file=sys.stderr)

Below is the prediction function. It takes two parameters: data_path gives the file containing the test set, and has_label indicates whether the test set is labeled. If it is labeled, the function computes the error rate; if not, it saves the predictions to the file answer.csv.

def predict_cross(has_label=1, data_path=r'train.csv'):
    if has_label == 1:
        # load test set
        # data_path = r'E:\Lab\digitrecognizer\train.csv'
        # read_train (not shown) loads the full labeled train.csv, analogous
        # to read_cross_train without the split
        test_set_x, test_set_y = read_train(data_path)
        test_set_x = test_set_x.get_value()
        test_set_y = test_set_y.get_value()
        # load a different model from each pkl file
        classifiers = []
        for i in range(9):
            model_path = 'model' + str(i) + '.pkl'
            classifier = pickle.load(open(model_path, 'rb'))
            classifiers.append(classifier)
        # compile 9 predictor functions with different params
        models = []
        for i in range(9):
            model = theano.function(inputs=[classifiers[i].input], outputs=classifiers[i].y_pred)
            models.append(model)
        # make predictions
        predicted_labels = []
        for i in range(9):
            predicted = models[i](test_set_x[:42000])
            predicted_labels.append(predicted)
        predicted_label = []
        for i in range(42000):
            dictionary = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
            for j in range(9):
                dictionary[str(predicted_labels[j][i])] += 1
            # find the label predicted most often by the 9 models and use it
            # as the output for this example
            label_number = max(dictionary.values())
            for key in dictionary:
                if dictionary[key] == label_number:
                    predicted_label.append(int(key))
                    break
        err = 0.
        for i in range(42000):
            # print("The %d th example's prediction is %d, and its target value is %d." % (i, predicted_label[i], test_set_y[i]))
            if predicted_label[i] != test_set_y[i]:
                err += 1
        print("The error rate of the combined model on the 42000 training examples is %f%%." % (err / 420))
        errorate = []  # [8.875, 8.800, 8.925, 8.975, 8.375, 8.300, 8.925, 9.225, 8.925]
        # compute the error rate of each of the nine models on the whole training set
        for i in range(9):
            err = 0.
            for j in range(42000):
                if predicted_labels[i][j] != test_set_y[j]:
                    err += 1
            errorate.append(err / 420)
        print("The mean error rate of 9 different models is %f" % numpy.mean(errorate))
        print(errorate)
    else:
        # load test set
        test_set_x = read_test(data_path)
        test_set_x = test_set_x.get_value()
        # load a different model from each pkl file
        classifiers = []
        for i in range(9):
            model_path = 'model' + str(i) + '.pkl'
            classifier = pickle.load(open(model_path, 'rb'))
            classifiers.append(classifier)
        # compile 9 predictor functions with different params
        models = []
        for i in range(9):
            model = theano.function(inputs=[classifiers[i].input], outputs=classifiers[i].y_pred)
            models.append(model)
        # make predictions
        predicted_labels = []
        for i in range(9):
            predicted = models[i](test_set_x[:28000])
            predicted_labels.append(predicted)
        predicted_label = []
        for i in range(28000):
            dictionary = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}
            for j in range(9):
                dictionary[str(predicted_labels[j][i])] += 1
            # find the label predicted most often by the 9 models and use it
            # as the output for this example
            label_number = max(dictionary.values())
            for key in dictionary:
                if dictionary[key] == label_number:
                    predicted_label.append(int(key))
                    break
        # print(predicted_label)
        # print(len(predicted_label))
        saveResult(predicted_label, r'answer.csv')

def saveResult(result, file):
    with open(file, 'w', newline='') as f:
        ob = csv.writer(f)
        ob.writerow(["ImageId", "Label"])
        ids = range(1, len(result) + 1)
        ob.writerows(zip(ids, result))

def predict(file1, file2):
    # load the saved model
    classifier = pickle.load(open(file1, 'rb'))
    # compile a predictor function
    predict_model = theano.function(inputs=[classifier.input], outputs=classifier.y_pred)
    # make prediction
    data_path = r'test.csv'
    test_set_x = read_test(data_path)
    test_set_x = test_set_x.get_value()
    predicted_values = predict_model(test_set_x[:28000])
    # save the prediction result
    saveResult(predicted_values, file2)

Finally, the main function, shown below; it can of course be adjusted to actual needs.

if __name__ == '__main__':
    start_time = timeit.default_timer()
    # train 9 models
    for i in range(9):
        save_model_path = r'model' + str(i) + '.pkl'
        sgd_optimization_mnist(learning_rate=0.13, n_epochs=300,
                               data_path=r'train.csv',
                               save_path=save_model_path, partion=i, batch_size=1600)
    # evaluate the combined model on the train set
    predict_cross(has_label=1, data_path=r'train.csv')
    # predict on the test set and save the final result to answer.csv
    predict_cross(has_label=0, data_path=r'test.csv')
    # predict on the test set with each model individually and save the results
    for i in range(9):
        model_path = r'model' + str(i) + '.pkl'
        save_path = r'answer' + str(i) + '.csv'
        predict(model_path, save_path)
    # then submit the ten sets of results to kaggle separately and compare accuracies
    # sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000, data_path=r'E:\Lab\digitrecognizer\train.csv', save_path=r'E:\Lab\digitrecognizer\my_best_model.pkl', partion=8, batch_size=1400)
    end_time = timeit.default_timer()
    print("Time consumption is %f sec." % (end_time - start_time))

Conclusions

For the choice of parameters we ran several tests (using the first 3/5 of train.csv as the training set, the last 1/10 as the test set, and the middle 3/10 as the validation set) and settled on the values above; a set of comparison runs is given below. The test machine is from 2012: an i3-2350M CPU at 2.30 GHz, 6 GB of RAM, and no GPU (the machine is too weak, nothing to be done about it). While writing this post I also tried a few batch_size settings above 1000; once those runs finish, I will update the numbers if any gives a lower error rate.

batch_size  learning_rate  epochs  seconds  epochs/sec  valid_err (%)  test_err (%)
300         0.13           126     2207     0.057       9.786          9.904
500         0.13           270     2736     0.097       10.064         9.700
800         0.13           162     1186     0.137       9.767          9.425
800         0.3            394     2922     0.135       9.98           9.875
800         0.03           202     1369     0.147       10.083         10.075
1000        0.13           201     1205     0.167       9.792          9.475
1400        0.13           408     2281     0.1789      9.6111         9.500
1600        0.13           334     1590     0.2101      9.089          8.4375
3200        0.13           714     1974     0.3622      9.083          8.5313
4200        0.13           834     2529     0.3297      9.2540         9.0714

For the MNIST dataset and the logistic regression model specifically, the table supports a few observations: the larger the batch_size, the faster the computation (more epochs per second), with accuracy slowly improving as well; and a larger learning rate leads to more epochs, that is, more passes over the whole training set.

In the experiments, the nine models' individual error rates on train.csv came out to [8.352380952380953, 7.071428571428571, 7.304761904761905, 6.809523809523809, 7.311904761904762, 7.604761904761904, 7.607142857142857, 7.609523809523809, 7.228571428571429], a mean error rate of 7.4333333333333336, while predicting with all nine models combined gives an error rate of 7.433333 on train.csv. In other words, the voting scheme shows no clear benefit here.

We then combined the nine models to predict the data in test.csv; the accuracy after submitting to kaggle was 0.91429. Predicting test.csv with each model individually gave accuracies of [0.91057, 0.90543, 0.90986, 0.91371, 0.91214] (not all numbers are in yet, since kaggle only allows five submissions per day). So the ten-fold-style scheme improved accuracy by about 0.4%, less than expected; perhaps the experimental setup was not quite right.

Also, my old machine had no NVIDIA card, so everything ran on the CPU, and training one logistic regression model took over twenty minutes. After switching to a GTX 960M the speedup is obvious: the whole program above runs in about 700 seconds, more than ten times faster.

MLP

Here I no longer use the data-reading functions above but read the csv files with pandas instead; the function is below. I used the first eight tenths of train.csv as the training set, the ninth tenth as the validation set, and the last tenth as the test set.

def load_data(path):
    print('...loading data')
    train_df = pandas.DataFrame.from_csv(path+'train.csv', index_col=False).fillna(0).astype(int)
    test_df = pandas.DataFrame.from_csv(path+'test.csv', index_col=False).fillna(0).astype(int)

    def shared_dataset(data_xy, borrow=True):
        """ Function that loads the dataset into shared variables

        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code is run on GPU).
        Since copying data into the GPU is slow, copying a minibatch every
        time it is needed (the default behaviour if the data is not in a
        shared variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX), borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX), borrow=borrow)
        # When storing data on the GPU it has to be stored as floats,
        # therefore we will store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use labels as indices, and if they are
        # floats it doesn't make sense), therefore instead of returning
        # ``shared_y`` we will have to cast it to int. This little hack
        # lets us get around this issue.
        return shared_x, T.cast(shared_y, 'int32')

    # 8:1:1 split of the 42000 labeled rows, as described above; use
    # 0:42000 for all three slices to train on the whole set instead
    train_set = [train_df.values[0:33600, 1:]/255.0, train_df.values[0:33600, 0]]
    valid_set = [train_df.values[33600:37800, 1:]/255.0, train_df.values[33600:37800, 0]]
    test_set = [train_df.values[37800:42000, 1:]/255.0, train_df.values[37800:42000, 0]]
    predict_set = test_df.values/255.0
    train_set_x, train_set_y = shared_dataset(train_set, borrow=True)
    valid_set_x, valid_set_y = shared_dataset(valid_set, borrow=True)
    test_set_x, test_set_y = shared_dataset(test_set, borrow=True)
    predict_set = theano.shared(numpy.asarray(predict_set, dtype=theano.config.floatX), borrow=True)
    datasets = [(train_set_x, train_set_y), (valid_set_x, valid_set_y), (test_set_x, test_set_y), predict_set]
    return datasets

The following class defines the hidden layer of the multilayer perceptron.

class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of an MLP: units are fully connected and have
        a sigmoidal activation function. The weight matrix W is of shape
        (n_in, n_out) and the bias vector b is of shape (n_out,).

        NOTE: The nonlinearity used here is tanh.
        Hidden unit activation is given by: tanh(dot(input, W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights
        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)
        :type n_in: int
        :param n_in: dimensionality of input
        :type n_out: int
        :param n_out: number of hidden units
        :param W: optional initial weight matrix
        :param b: optional initial bias vector
        :type activation: theano.Op or function
        :param activation: nonlinearity to be applied in the hidden layer
        """
        self.input = input
        # end-snippet-1
        # `W` is initialized with `W_values`, which is uniformly sampled
        # from -sqrt(6./(n_in+n_hidden)) to sqrt(6./(n_in+n_hidden))
        # for a tanh activation function.
        # The output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU.
        # Note: optimal initialization of weights is dependent on the
        # activation function used (among other things).
        # For example, results presented in [Xavier10] suggest that you
        # should use 4 times larger initial weights for sigmoid
        # compared to tanh.
        # We have no info for other functions, so we use the same as tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4
            W = theano.shared(value=W_values, name='W', borrow=True)
        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)
        self.W = W
        self.b = b
        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]

Next is the multilayer perceptron class:

class MLP(object):
    """Multi-layer perceptron class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one or more layers of hidden units with nonlinear activations.
    Intermediate layers usually have tanh or the sigmoid function as their
    activation (defined here by a `HiddenLayer` class), while the top layer
    is a softmax layer (defined here by a `LogisticRegression` class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights
        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)
        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie
        :type n_hidden: int
        :param n_hidden: number of hidden units
        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie
        """
        # Since we are dealing with a one-hidden-layer MLP, this will
        # translate into a HiddenLayer with a tanh activation function
        # connected to the LogisticRegression layer; the activation function
        # can be replaced by sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng, input=input, n_in=n_in, n_out=n_hidden, activation=T.tanh
        )
        # The logistic regression layer gets as input the hidden units of
        # the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output, n_in=n_hidden, n_out=n_out
        )
        # end-snippet-2 start-snippet-3
        # L1 norm; one regularization option is to enforce the L1 norm to be small
        self.L1 = (abs(self.hiddenLayer.W).sum() + abs(self.logRegressionLayer.W).sum())
        # square of the L2 norm; one regularization option is to enforce the
        # square of the L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum() + (self.logRegressionLayer.W ** 2).sum())
        # the negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        # self.negative_log_likelihood = (self.logRegressionLayer.negative_log_likelihood)
        # the same holds for the function computing the number of errors
        # self.errors = self.logRegressionLayer.errors
        # the parameters of the model are the parameters of the two layers
        # it is made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
        # end-snippet-3
        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        return -T.mean(T.log(self.logRegressionLayer.p_y_given_x)[T.arange(y.shape[0]), y])

    def errors(self, y):
        if y.ndim != self.logRegressionLayer.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.logRegressionLayer.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.logRegressionLayer.y_pred, y))
        else:
            raise NotImplementedError()

    def __getstate__(self):
        return self.__dict__

Then the training function for the multilayer perceptron.

def test_mlp(learning_rate=0.0005, L1_reg=0.00, L2_reg=0.0001, n_epochs=200,
             path=r'', batch_size=20, n_hidden=500):
    """
    Demonstrate stochastic gradient descent optimization for a multilayer
    perceptron, on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic gradient)
    :type L1_reg: float
    :param L1_reg: L1-norm's weight when added to the cost (see regularization)
    :type L2_reg: float
    :param L2_reg: L2-norm's weight when added to the cost (see regularization)
    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer
    :type path: string
    :param path: directory containing train.csv and test.csv
    :param batch_size:
    :param n_hidden:
    :return:
    """
    datasets = load_data(path)
    # datasets = read_raw_train(dataset)
    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print('...building the model')
    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as a 1D vector of [int] labels
    rng = numpy.random.RandomState(1234)
    # construct the MLP class
    classifier = MLP(rng=rng, input=x, n_in=28 * 28, n_hidden=n_hidden, n_out=10)
    # start-snippet-4
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y) + L1_reg * classifier.L1 + L2_reg * classifier.L2_sqr
    )
    # end-snippet-4
    # compiling a Theano function that computes the mistakes that are made
    # by the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # start-snippet-5
    # compute the gradient of cost with respect to theta (stored in params);
    # the resulting gradients will be stored in the list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]
    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs;
    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of the same size, where
    # each element is a pair formed from the two lists:
    # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]
    # compiling a Theano function `train_model` that returns the cost and at
    # the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,  # RMSprop(gparams, classifier.params, learning_rate),
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # end-snippet-5

    ###############
    # TRAIN MODEL #
    ###############
    print('...training')
    # early-stopping parameters
    patience = 10000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is considered significant
    validation_frequency = min(n_train_batches, patience // 2)
    # go through this many minibatches before checking the network
    # on the validation set; in this case we check every epoch
    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = timeit.default_timer()
    epoch = 0
    done_looping = False
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        # NOTE: reassigning learning_rate here has no effect on the already
        # compiled train_model; see the shared-variable sketch after this listing
        if epoch > 60:
            learning_rate = 0.001
        if epoch > 300:
            learning_rate = 0.0005
        for minibatch_index in range(n_train_batches):
            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i in range(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' %
                      (epoch, minibatch_index + 1, n_train_batches, this_validation_loss * 100.))
                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    best_validation_loss = this_validation_loss
                    best_iter = iter
                    # test it on the test set
                    test_losses = [test_model(i) for i in range(n_test_batches)]
                    test_score = numpy.mean(test_losses)
                    print((' epoch %i, minibatch %i/%i, test error of best model '
                           '%f %%') % (epoch, minibatch_index + 1, n_train_batches, test_score * 100.))
                    # save the best model
                    with open('mymlp_best_model.pkl', 'wb') as f:
                        pickle.dump(classifier, f)
            if patience <= iter:
                done_looping = True
                break
    end_time = timeit.default_timer()
    print(('Optimization complete. Best validation score of %f %% '
           'obtained at iteration %i, with test performance %f %%') %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))
    print(('The code for file ' + os.path.split(__file__)[1] +
           ' ran for %.2fm' % ((end_time - start_time) / 60.)), file=sys.stderr)
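
One caveat in the loop above: reassigning the Python variable learning_rate after epoch 60 or 300 does nothing, because the compiled train_model baked the original float into updates. A sketch of how the schedule could be made to actually take effect, by promoting the learning rate to a shared variable (this is my own fix, not part of the original post):

# inside test_mlp, replace the updates definition with a shared learning rate
lr = theano.shared(numpy.asarray(learning_rate, dtype=theano.config.floatX), name='lr')
updates = [
    (param, param - lr * gparam)
    for param, gparam in zip(classifier.params, gparams)
]
# ...compile train_model as before, then inside the training loop:
if epoch > 60:
    lr.set_value(numpy.asarray(0.001, dtype=theano.config.floatX))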

Finally, the prediction function and the main function.

def predict_kaggle(file1, file2):
    # load the saved model
    classifier = pickle.load(open(file1, 'rb'))
    # compile a predictor function
    predict_model = theano.function(inputs=[classifier.input], outputs=classifier.logRegressionLayer.y_pred,
                                    allow_input_downcast=True)
    # make prediction
    datasets = load_data(r'')
    test_set_x = datasets[3]
    test_set_x = test_set_x.get_value()
    predicted_values = predict_model(test_set_x[:28000])
    # save the prediction result
    saveResult(predicted_values, file2)

if __name__ == '__main__':
    file1 = r'mymlp_best_model.pkl'
    file2 = r'answer_myMLP.csv'
    test_mlp(learning_rate=0.1, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000, path=r'', batch_size=20, n_hidden=500)
    predict_kaggle(file1, file2)

The accuracy after submission was 0.97486. Using the whole of train.csv as the training set raises the final accuracy to 0.97843, though at that point the model is overfitting.

LeNet5

We implemented the Kaggle digit recognizer with logistic regression and an MLP above; now we implement it with LeNet. The data is read the same way as before, but here I did not know how to use the saved model to predict the unlabeled data: the earlier logistic regression and MLP models are each pickled as a single object with an input attribute, so they are easy to reuse for prediction, whereas this model has several layers. I will update the post once I have written a proper prediction function; for now the prediction step sits at the end of the training function (a possible workaround is sketched near the end of this post).

LeNet is in essence a convolutional neural network (CNN), whose principles were covered in earlier posts; see my notes on lecture five of Neural Networks for Machine Learning, or the Convolutional Neural Networks (LeNet) chapter of the Deep Learning Tutorials. Below is LeNet's convolution + pooling layer class:

class LeNetConvPoolLayer(object):
    """Pool layer of a convolutional network """

    def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)):
        """
        Allocate a LeNetConvPoolLayer with shared variable internal parameters.

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights
        :type input: theano.tensor.dtensor4
        :param input: symbolic image tensor, of shape image_shape
        :type filter_shape: tuple or list of length 4
        :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)
        :type image_shape: tuple or list of length 4
        :param image_shape: (batch size, num input feature maps,
                             image height, image width)
        :type poolsize: tuple or list of length 2
        :param poolsize: the downsampling (pooling) factor (#rows, #cols)
        """
        assert image_shape[1] == filter_shape[1]
        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        fan_in = numpy.prod(filter_shape[1:])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        # pooling size
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) //
                   numpy.prod(poolsize))
        # initialize weights with random weights
        W_bound = numpy.sqrt(6. / (fan_in + fan_out))
        self.W = theano.shared(
            numpy.asarray(
                rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                dtype=theano.config.floatX
            ),
            borrow=True
        )
        # the bias is a 1D tensor -- one bias per output feature map
        b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
        self.b = theano.shared(value=b_values, borrow=True)
        # convolve input feature maps with filters
        conv_out = conv2d(
            input=input,
            filters=self.W,
            filter_shape=filter_shape,
            input_shape=image_shape
        )
        # pool each feature map individually, using maxpooling
        pooled_out = pool.pool_2d(
            input=conv_out,
            ds=poolsize,
            ignore_border=True
        )
        # add the bias term. Since the bias is a vector (1D array), we first
        # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
        # thus be broadcasted across mini-batches and feature map
        # width & height
        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
        # store parameters of this layer
        self.params = [self.W, self.b]
        # keep track of model input
        self.input = input

Below is the body of the LeNet training function. The model has four layers. layer0 first convolves the 28×28 input down to (28-5+1)×(28-5+1) = 24×24 feature maps, then applies 2×2 max pooling, so its output is 12×12. layer1 convolves the 12×12 input down to (12-5+1)×(12-5+1) = 8×8 and applies 2×2 pooling, giving a 4×4 output. layer2 is the hidden layer and layer3 is the logistic regression layer.
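
As a quick sanity check of the sizes just quoted, a small sketch (the helper names are mine, for illustration only):

def conv_output(size, filter_size):
    return size - filter_size + 1          # 'valid' convolution

def pool_output(size, pool_size):
    return size // pool_size               # non-overlapping max pooling

s = pool_output(conv_output(28, 5), 2)     # layer0: 28 -> 24 -> 12
print(s)                                   # 12
s = pool_output(conv_output(s, 5), 2)      # layer1: 12 -> 8 -> 4
print(s)                                   # 4, so layer2 sees nkerns[1] * 4 * 4 inputs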

def test_CNN(learning_rate=0.03, n_epochs=300,
             path=r'', batch_size=20, patience=10000, mu=0.9):
    datasets = load_data(path)
    # datasets = read_raw_train(dataset)
    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]
    predict_set = datasets[3]
    nkerns = [20, 50]
    rng = numpy.random.RandomState(23455)  # missing from the original listing; 23455 is the tutorial's seed
    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] // batch_size
    n_predict_batches = predict_set.get_value(borrow=True).shape[0] // batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print('... building the model')
    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as a 1D vector of [int] labels
    # Reshape matrix of rasterized images of shape (batch_size, 28 * 28)
    # to a 4D tensor, compatible with our LeNetConvPoolLayer
    # (28, 28) is the size of MNIST images.
    layer0_input = x.reshape((batch_size, 1, 28, 28))
    # Construct the first convolutional pooling layer:
    # filtering reduces the image size to (28-5+1, 28-5+1) = (24, 24)
    # maxpooling reduces this further to (24/2, 24/2) = (12, 12)
    # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12)
    layer0 = LeNetConvPoolLayer(
        rng,
        input=layer0_input,
        image_shape=(batch_size, 1, 28, 28),
        filter_shape=(nkerns[0], 1, 5, 5),
        poolsize=(2, 2)
    )
    # Construct the second convolutional pooling layer:
    # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8)
    # maxpooling reduces this further to (8/2, 8/2) = (4, 4)
    # 4D output tensor is thus of shape (batch_size, nkerns[1], 4, 4)
    layer1 = LeNetConvPoolLayer(
        rng,
        input=layer0.output,
        image_shape=(batch_size, nkerns[0], 12, 12),
        filter_shape=(nkerns[1], nkerns[0], 5, 5),
        poolsize=(2, 2)
    )
    # the HiddenLayer being fully-connected, it operates on 2D matrices of
    # shape (batch_size, num_pixels) (i.e. a matrix of rasterized images).
    # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
    # or (500, 50 * 4 * 4) = (500, 800) with the default values.
    layer2_input = layer1.output.flatten(2)
    # construct a fully-connected sigmoidal layer
    layer2 = HiddenLayer(
        rng,
        input=layer2_input,
        n_in=nkerns[1] * 4 * 4,
        n_out=500,
        activation=T.tanh
    )
    # classify the values of the fully-connected sigmoidal layer
    layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10)
    classifier = [layer0, layer1, layer2, layer3]
    # the cost we minimize during training is the NLL of the model
    cost = layer3.negative_log_likelihood(y)
    # create a function to compute the mistakes that are made by the model
    test_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    validate_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # create a list of all model parameters to be fit by gradient descent
    params = layer3.params + layer2.params + layer1.params + layer0.params
    # create a list of gradients for all model parameters
    grads = T.grad(cost, params)
    # train_model is a function that updates the model parameters by SGD.
    # Since this model has many parameters, it would be tedious to manually
    # create an update rule for each model parameter. We thus create the
    # updates list by automatically looping over all (params[i], grads[i]) pairs.
    updates = [
        (param_i, param_i - learning_rate * grad_i)  # param_i - mu * param_i + (1 + mu) * (mu * param_i - learning_rate * grad_i))
        for param_i, grad_i in zip(params, grads)
    ]

    def RMSprop(gparams, params, learning_rate, rho=0.9, epsilon=1e-6):
        """
        param rho: the fraction of the previous gradient contribution we keep
        """
        updates = []
        for p, g in zip(params, gparams):
            acc = theano.shared(p.get_value() * 0.)
            acc_new = rho * acc + (1 - rho) * g ** 2
            gradient_scaling = T.sqrt(acc_new + epsilon)
            g = g / gradient_scaling
            updates.append((acc, acc_new))
            updates.append((p, p - learning_rate * g))
        return updates

    train_model = theano.function(
        [index],
        cost,
        updates=updates,  # RMSprop(grads, params, learning_rate, rho=mu, epsilon=1e-6),
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    ###############
    # TRAIN MODEL #
    ###############
    print('...training')
    # early-stopping parameters
    # patience = 10000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is considered significant
    validation_frequency = min(n_train_batches, patience // 2)
    # go through this many minibatches before checking the network
    # on the validation set; in this case we check every epoch
    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = timeit.default_timer()
    epoch = 0
    done_looping = False
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        for minibatch_index in range(n_train_batches):
            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i in range(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' %
                      (epoch, minibatch_index + 1, n_train_batches, this_validation_loss * 100.))
                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)
                    best_validation_loss = this_validation_loss
                    best_iter = iter
                    # test it on the test set
                    test_losses = [test_model(i) for i in range(n_test_batches)]
                    test_score = numpy.mean(test_losses)
                    print((' epoch %i, minibatch %i/%i, test error of best model '
                           '%f %%') % (epoch, minibatch_index + 1, n_train_batches, test_score * 100.))
                    # save the best model
                    with open('kaggle_CNN_best_model.pkl', 'wb') as f:
                        pickle.dump(classifier, f)
            if patience <= iter:
                done_looping = True
                break
    end_time = timeit.default_timer()
    print(('Optimization complete. Best validation score of %f %% '
           'obtained at iteration %i, with test performance %f %%') %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))
    print(('The code for file ' + os.path.split(__file__)[1] +
           ' ran for %.2fm' % ((end_time - start_time) / 60.)), file=sys.stderr)
    # batch_size = 20
    print('... Predicting.')
    predict_model = theano.function(
        [index],
        outputs=layer3.y_pred,
        givens={
            x: predict_set[index * batch_size: (index + 1) * batch_size]
        }
    )
    answer = []
    for minibatch_index in range(n_predict_batches):
        # print(predict_model(minibatch_index))
        minibatch_answer = predict_model(minibatch_index)
        for i in range(batch_size):
            answer.append(minibatch_answer[i])
    print(len(answer))
    print('... saving predict answer.')
    saveResult(answer, 'lenet.csv')

Note again that I did not know how to use the saved model to predict the unlabeled data, so there is no separate prediction function here; prediction happens at the end of the training function instead.
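
A sketch of one possible standalone prediction function, as a workaround (this is my own idea, not from the original post): rebuild the forward pass on a fresh symbolic input, reusing the unpickled weights of the four layers. It assumes the layer list saved as kaggle_CNN_best_model.pkl above, the same Theano version (pool_2d taking ds), and that batch_size matches the training value, since the reshape fixes the batch dimension.

import pickle
import theano
import theano.tensor as T
from theano.tensor.signal import pool
from theano.tensor.nnet import conv2d

def predict_lenet(model_path='kaggle_CNN_best_model.pkl', batch_size=20):
    with open(model_path, 'rb') as f:
        layer0, layer1, layer2, layer3 = pickle.load(f)
    x = T.matrix('x')
    inp = x.reshape((batch_size, 1, 28, 28))
    # conv + 2x2 max pooling + tanh, exactly as in LeNetConvPoolLayer
    h0 = T.tanh(pool.pool_2d(conv2d(inp, layer0.W), ds=(2, 2), ignore_border=True)
                + layer0.b.dimshuffle('x', 0, 'x', 'x'))
    h1 = T.tanh(pool.pool_2d(conv2d(h0, layer1.W), ds=(2, 2), ignore_border=True)
                + layer1.b.dimshuffle('x', 0, 'x', 'x'))
    # hidden layer (tanh) and softmax layer, as in HiddenLayer / LogisticRegression
    h2 = T.tanh(T.dot(h1.flatten(2), layer2.W) + layer2.b)
    y_pred = T.argmax(T.nnet.softmax(T.dot(h2, layer3.W) + layer3.b), axis=1)
    predict_model = theano.function([x], y_pred)

    data = load_data(r'')[3].get_value()  # the 28000 unlabeled rows
    answer = []
    for i in range(data.shape[0] // batch_size):
        answer.extend(predict_model(data[i * batch_size:(i + 1) * batch_size]))
    saveResult(answer, 'lenet.csv')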

Kaggle's digit recognizer provides 42000 labeled training examples, which I split 18:1:1 into training/validation/test sets. With a learning rate of 0.03, a batch size of 20, and 80 epochs, the resulting accuracy was 0.99086, ranking 135/1037. When I used all 42000 examples for training, the error rate reached 0 by the 24th epoch, i.e., the model had completely overfit; using that model to predict the 28000 unlabeled examples gave an accuracy of 0.99129 after submission, ranking 118/1037.

The next step is to keep studying theano and other deep learning tools systematically and to implement more models and optimization methods myself. Also, setting up the environment on the new machine took over ten days of repeated failures; when I have time I will write that up in a separate post for future reference.