It's a little embarrassing to admit: since coming back to school from my leave of absence I haven't studied anything systematically, nor done much coding. Over the past month I picked up some TensorFlow basics here and there, enough to put together a small demo, so I used Titanic for practice. The final accuracy was only 0.77990, which suggests the feature extraction still isn't good enough. The full code is on my Github.

Titanic is a binary classification problem; see Titanic: Machine Learning from Disaster | Kaggle for the full description. A quick look at the data: Pclass is the cabin class, Sex is the passenger's sex, Age is the age, SibSp and Parch are the number of siblings/spouses and the number of parents/children aboard (together these two numbers tell you how many relatives a passenger had on the ship), Fare is the ticket price, and Embarked is the port of embarkation (one of three: Cherbourg, Queenstown, and Southampton). These are all obvious features, so I used them directly.
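As a quick sanity check before any feature engineering, a minimal sketch (assuming Kaggle's train.csv sits in the working directory):

import pandas as pd

# Load Kaggle's training file and look at the obvious feature columns.
train = pd.read_csv('./train.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
print(train[features].head())
# Count missing values per column: Age, Cabin and Embarked are incomplete.
print(train.isnull().sum())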

The data is incomplete, so the first job is to fill in the gaps. First, two rows are missing Embarked. Passengers in the same class paying the same fare should have embarked at the same port; the passenger with PassengerId 62 is in first class with a fare of 80, so I looked at first-class fares for the three embarkation ports and found that the median fare for passengers embarking at Cherbourg is 80. I therefore filled the missing values with C. A large number of passengers are also missing Age; here I computed age statistics for each combination of cabin class and sex, and filled each missing age with the median age of passengers of the same class and sex. Cabin is also missing for many rows and doesn't seem fillable, but comparing survival rates with and without a Cabin value shows that passengers with a Cabin survived at a much higher rate, so I turned "Cabin present or not" into a feature. Ticket doesn't seem useful for now, so I simply dropped it; it could be processed further later. From Name I kept only the title (the common ones are Mr., Mrs., Miss., Master., etc.) and used it as a feature. Finally, I one-hot encoded the categorical columns: Pclass has three levels, written as (1, 0, 0), (0, 1, 0) and (0, 0, 1); Sex becomes (1, 0) and (0, 1); and Embarked becomes (1, 0, 0), (0, 1, 0) and (0, 0, 1).
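A minimal pandas sketch of these cleaning steps (the repo's actual code may differ; the title-extraction regex in particular is my own guess):

# Continuing from the train DataFrame loaded above.

# Embarked: the two missing rows are first class with fare 80, matching the
# median first-class fare out of Cherbourg, so fill them with 'C'.
train['Embarked'] = train['Embarked'].fillna('C')

# Age: fill with the median age of passengers of the same Pclass and Sex.
train['Age'] = train.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda ages: ages.fillna(ages.median()))

# Cabin: keep only whether the field is present at all.
train['HasCabin'] = train['Cabin'].notnull().astype(int)

# Name: keep only the title (Mr., Mrs., Miss., Master., ...).
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# One-hot encode the categorical columns.
train = pd.get_dummies(train, columns=['Pclass', 'Sex', 'Embarked'])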

Putting it all together, each row yields 15 features; I used pandas to process the training and test sets. I then split the original training set: the first 700 rows as the training set and the remaining 191 as the validation set. The three sets are saved as train_set.csv, validation_set.csv and testFeature.csv.
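Sketched under the same assumptions, the split-and-save step looks roughly like this:

# Keep the first 700 rows for training and the rest for validation.
train.iloc[:700].to_csv('./train_set.csv', index=False)
train.iloc[700:].to_csv('./validation_set.csv', index=False)
# The test set goes through the same feature processing, then:
# test.to_csv('./testFeature.csv', index=False)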

Now for the code. I started with logistic regression. One problem I ran into: with my first version, the accuracy never changed during training and the model simply wasn't converging. Some searching turned up the cause: a badly chosen cost function. With the naive cost -tf.reduce_sum(y_ * tf.log(y)), as soon as the softmax output for the true class reaches exactly 0, log(0) makes the loss infinite, the gradients turn into NaNs, and the parameters never get corrected. Clipping y (or, better, computing the cross-entropy from the raw logits) fixes this. After training, the model's best accuracy on the validation set was 0.8901, and the submitted test-set accuracy was 0.74641. I had actually made quite a few earlier submissions without recording them, but this should be close to the best logistic regression can do here.
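A minimal NumPy sketch of that failure mode (hypothetical values, purely illustrative):

import numpy as np

y_true = np.array([0.0, 1.0])   # one-hot label
y_pred = np.array([1.0, 0.0])   # a fully saturated softmax output

# Naive cross-entropy: log(0) = -inf, so the loss (and its gradient) blows up.
naive = -np.sum(y_true * np.log(y_pred))
# Clipping keeps the loss finite, so training can keep moving.
clipped = -np.sum(y_true * np.log(np.clip(y_pred, 1e-5, 1.0)))
print(naive, clipped)  # inf vs. roughly 11.5

The full logistic-regression training code: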

import pandas as pd
import tensorflow as tf

# shuffle(), getValidation() and savePredict() are data helpers defined
# elsewhere in the repo.

def LogisticTrain(batch_size=50, epochs=100, rate=0.003, savepath='./logisticPredict.csv'):
    testData = pd.read_csv('./testFeature.csv')
    x = tf.placeholder(tf.float32, [None, 15], name="data")
    y_ = tf.placeholder(tf.float32, [None, 2], name="label")
    W = tf.Variable(tf.truncated_normal([15, 2], stddev=0.1), name="W")
    b = tf.Variable(tf.truncated_normal([2], stddev=0.1), name="b")
    logits = tf.matmul(x, W) + b
    y = tf.nn.softmax(logits, name="predict_label")
    # The naive cost diverges once a softmax output hits 0 (log(0) = -inf):
    # cross_entropy = - tf.reduce_sum(y_ * tf.log(y))
    # cross_entropy = - tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-5, 1.0)))
    # softmax_cross_entropy_with_logits expects the raw logits, not the
    # softmax output, so it gets `logits` here rather than `y`.
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_))
    train_step = tf.train.GradientDescentOptimizer(learning_rate=rate).minimize(cross_entropy)
    predict_step = tf.argmax(y, 1)
    # Build the accuracy ops once, outside the training loop, instead of
    # adding new nodes to the graph on every mini-batch.
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        best = 0
        for epoch in range(1, epochs + 1):
            print("epoch of logistic training: ", epoch)
            train_X, train_Y = shuffle()
            validation_X, validation_Y = getValidation()
            for i in range(0, 700, batch_size):
                sess.run(train_step, feed_dict={x: train_X[i:i + batch_size], y_: train_Y[i:i + batch_size]})
                print("mini_batch", i, "~", i + batch_size, "of epoch", epoch)
                print("accuracy on train set: {}".format(sess.run(accuracy, feed_dict={x: train_X, y_: train_Y})))
                validation_accuracy = sess.run(accuracy, feed_dict={x: validation_X, y_: validation_Y})
                print("accuracy on validation set: {}, the current best accuracy: {}".format(validation_accuracy, best))
                if best < validation_accuracy:
                    best = validation_accuracy
                    savePredict(sess.run(predict_step, feed_dict={x: testData}), savepath=savepath)
        print("The best accuracy: ", best)

Next is a plain neural network: three fully connected hidden layers (256, 128 and 64 units) with ReLU activations and L2 regularization on the weights. The code follows.

def NNtrain(batch_size=50, epochs=100, rate=0.003):
    testData = pd.read_csv('./testFeature.csv')
    x = tf.placeholder(tf.float32, [None, 15], name="data")
    y_ = tf.placeholder(tf.float32, [None, 2], name='label')
    with tf.name_scope("layer_in"):
        W1 = tf.Variable(tf.truncated_normal([15, 256], stddev=0.1), name="W1")
        b1 = tf.Variable(tf.truncated_normal([256], stddev=0.1), name="b1")
        hidden1 = tf.nn.relu(tf.matmul(x, W1) + b1, name="hidden1")
    with tf.name_scope("layer_hidden_1"):
        W2 = tf.Variable(tf.truncated_normal([256, 128], stddev=0.1), name="W2")
        b2 = tf.Variable(tf.truncated_normal([128], stddev=0.1), name="b2")
        hidden2 = tf.nn.relu(tf.matmul(hidden1, W2) + b2, name="hidden2")
    with tf.name_scope("layer_hidden_2"):
        W3 = tf.Variable(tf.truncated_normal([128, 64], stddev=0.1), name="W3")
        b3 = tf.Variable(tf.truncated_normal([64], stddev=0.1), name="b3")
        hidden3 = tf.nn.relu(tf.matmul(hidden2, W3) + b3, name="hidden3")
    with tf.name_scope("layer_out"):
        W4 = tf.Variable(tf.truncated_normal([64, 2], stddev=0.1), name="W4")
        b4 = tf.Variable(tf.truncated_normal([2], stddev=0.1), name="b4")
        logits = tf.matmul(hidden3, W4) + b4
        y = tf.nn.softmax(logits, name="y_out")
    with tf.name_scope("cost"):
        # L2-regularize the weights (not the biases).
        weights = [v for v in tf.trainable_variables() if 'b' not in v.name]
        lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in weights]) * 0.05
        # As in the logistic model, the cross-entropy is computed from the
        # raw logits, not from the softmax output.
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_)) + lossL2
    with tf.name_scope("train"):
        train_step = tf.train.GradientDescentOptimizer(learning_rate=rate).minimize(cost)
    with tf.name_scope("predict"):
        predict_step = tf.argmax(y, 1)
    with tf.name_scope("save_params"):
        saver = tf.train.Saver()
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        best = 0
        for epoch in range(1, epochs + 1):
            print("epoch of neural network training: ", epoch)
            train_X, train_Y = shuffle()
            validation_X, validation_Y = getValidation()
            for i in range(0, 700, batch_size):
                sess.run(train_step, feed_dict={x: train_X[i:i + batch_size], y_: train_Y[i:i + batch_size]})
                print("mini_batch", i, "~", i + batch_size, "of epoch", epoch)
                print("accuracy on train set: {}".format(sess.run(accuracy, feed_dict={x: train_X, y_: train_Y})))
                validation_accuracy = sess.run(accuracy, feed_dict={x: validation_X, y_: validation_Y})
                print("accuracy on validation set: {}, the current best accuracy: {}".format(validation_accuracy, best))
                if best < validation_accuracy:
                    best = validation_accuracy
                    # saver.save(sess, "./save.ckpt")
                    savePredict(sess.run(predict_step, feed_dict={x: testData}))
        print("The best accuracy: ", best)

Finally the CNN, code below. Since the input is only a 15-dimensional vector, the network first expands it to 400 units with a fully connected layer and reshapes the result into a 20x20 single-channel "image", which then goes through two convolution/pooling stages. This scored 0.77512; lowering rate to 0.001 raised it to 0.77990.
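The code calls weight_variable, bias_variable, conv2d and max_pool_2x2, helpers that this post doesn't show; a sketch of what they presumably look like, modeled on the standard TensorFlow MNIST tutorial (naming the variables 'W' and 'b' so the L2 filter in the cost scope can still tell weights from biases):

def weight_variable(shape):
    # Small-stddev initialization, as in the other layers.
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1), name='W')

def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape), name='b')

def conv2d(x, W):
    # Stride 1 with zero padding, so the spatial size is unchanged.
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # 2x2 max pooling halves each spatial dimension: 20x20 -> 10x10 -> 5x5,
    # which is where the 5 * 5 * 64 flattened size below comes from.
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

With those helpers in place, the training function: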

def CNNTrain(batch_size=50, epochs=100, rate=0.003, filepath='./predict.csv'):
    '''
    Expand the 15 input features to 400 units with a fully connected layer,
    reshape them into a 20x20 single-channel "image", apply two conv/pool
    stages, then classify with two fully connected layers.
    '''
    testData = pd.read_csv('./testFeature.csv')
    x = tf.placeholder(tf.float32, [None, 15], name="data")
    y_ = tf.placeholder(tf.float32, [None, 2], name='label')
    with tf.name_scope("layer_in"):
        W_in = tf.Variable(tf.truncated_normal([15, 400], stddev=0.1), name="W1")
        b_in = tf.Variable(tf.truncated_normal([400], stddev=0.1), name="b1")
        hidden_in = tf.nn.relu(tf.matmul(x, W_in) + b_in, name="hidden1")
        # Reshape the 400 activations into a 20x20 "image" for the conv layers.
        hidden_in_image = tf.reshape(hidden_in, [-1, 20, 20, 1], name="hidden1_image")
    with tf.name_scope("conv_1"):
        W_conv1 = weight_variable([5, 5, 1, 32])
        b_conv1 = bias_variable([32])
        h_conv1 = tf.nn.relu(conv2d(hidden_in_image, W_conv1) + b_conv1)
        h_pool1 = max_pool_2x2(h_conv1)   # 20x20 -> 10x10
    with tf.name_scope("conv_2"):
        W_conv2 = weight_variable([5, 5, 32, 64])
        b_conv2 = bias_variable([64])
        h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
        h_pool2 = max_pool_2x2(h_conv2)   # 10x10 -> 5x5
    with tf.name_scope("layer_hidden_1"):
        W1 = weight_variable([5 * 5 * 64, 128])
        b1 = bias_variable([128])
        h_pool2_flat = tf.reshape(h_pool2, [-1, 5 * 5 * 64])
        hidden1 = tf.nn.relu(tf.matmul(h_pool2_flat, W1) + b1)
    with tf.name_scope("layer_hidden_2"):
        W2 = weight_variable([128, 64])
        b2 = bias_variable([64])
        hidden2 = tf.nn.relu(tf.matmul(hidden1, W2) + b2)
    with tf.name_scope("layer_out"):
        W_out = weight_variable([64, 2])
        b_out = bias_variable([2])
        logits = tf.matmul(hidden2, W_out) + b_out
        hidden_out = tf.nn.softmax(logits, name="output")
    with tf.name_scope("cost"):
        # L2-regularize the weights (not the biases).
        weights = [v for v in tf.trainable_variables() if 'b' not in v.name]
        lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in weights]) * 0.05
        # The cross-entropy is computed from the raw logits, not the softmax output.
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_)) + lossL2
    with tf.name_scope("train"):
        train_step = tf.train.GradientDescentOptimizer(learning_rate=rate).minimize(cost)
    with tf.name_scope("predict"):
        predict_step = tf.argmax(hidden_out, 1)
    with tf.name_scope("save_params"):
        saver = tf.train.Saver()
    correct_prediction = tf.equal(tf.argmax(hidden_out, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        best = 0
        for epoch in range(1, epochs + 1):
            print("epoch: ", epoch)
            train_X, train_Y = shuffle()
            validation_X, validation_Y = getValidation()
            for i in range(0, 700, batch_size):
                sess.run(train_step, feed_dict={x: train_X[i:i + batch_size], y_: train_Y[i:i + batch_size]})
                print("mini_batch", i, "~", i + batch_size, "of epoch", epoch)
                print("accuracy on train set: {}".format(sess.run(accuracy, feed_dict={x: train_X, y_: train_Y})))
                validation_accuracy = sess.run(accuracy, feed_dict={x: validation_X, y_: validation_Y})
                print("accuracy on validation set: {}, the current best accuracy: {}".format(validation_accuracy, best))
                if best < validation_accuracy:
                    best = validation_accuracy
                    # saver.save(sess, "./save.ckpt")
                    savePredict(sess.run(predict_step, feed_dict={x: testData}), filepath=filepath)
        print("The best accuracy: ", best)

That's about it for now; I'll update this post when I've learned new models or found better approaches.