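I have been studying machine learning for almost a month now. My main materials are Li Hang's Statistical Learning Methods (《统计学习方法》), Peter Harrington's Machine Learning in Action, Andrew Ng's Machine Learning, and Hsuan-Tien Lin's Machine Learning Foundations from National Taiwan University.

Watching the videos without taking notes works poorly for me, so starting from week 6 of Andrew Ng's Machine Learning I am going to keep notes consistently.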

Evaluating a Learning Algorithm

Deciding What to Try Next

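We continue with the house price prediction problem from the earlier lectures. Suppose we have implemented regularized linear regression to predict house prices and obtained the model parameters, but when we test the hypothesis on a new set of data the error is unacceptably large. What should we do to improve the model?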

Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features
Try decreasing the regularization parameter $\lambda$
Try increasing the regularization parameter $\lambda$

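These are the adjustments we commonly make to a model. It is usually unclear, though, which adjustment fits which situation. Many people pick an unsuitable one, spend a great deal of time, and still do not end up with a good model. Below we try to find the rules behind the choice.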

6_1

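The course introduces a new concept, the machine learning diagnostic. A diagnostic is a test we run to learn whether a learning algorithm is working well, and whose result tells us how to improve its performance in a targeted way. Diagnostics can take a lot of time to implement, but the time is well spent: they give an accurate picture of how the algorithm behaves, so that adjustments are based on evidence rather than guesswork.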

6_2

Evaluating a Hypothesis

When we train the parameters on a training set, we adjust them so that the cost function on the training set becomes small enough. The resulting hypothesis is not necessarily a good one, though: it may overfit, as the figure below shows.

6_3

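The hypothesis in the figure above is a simple polynomial in a single feature, so we can judge how good it is by plotting it. When the problem involves many features (such as those listed on the right of the figure), plotting becomes very hard, and we need another way to evaluate the hypothesis.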

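The standard way to evaluate a hypothesis is as follows. Suppose we have a data set like the one in the figure below (real data sets are of course much larger). We split it into two parts: the first part is the usual training set used to fit the hypothesis, and the second part is the test set used to check it. The split is typically 70/30 (or with even more going to the training set). Below, m denotes the number of training examples and $m_{test}$ the number of test examples. Note that the original data may be ordered in some systematic way, so it is best to shuffle it randomly before splitting.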
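The shuffle-then-split step can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `shuffle_split`, the `seed` argument and the toy data are my own assumptions, not something from the course.

```python
import numpy as np

def shuffle_split(X, y, train_frac=0.7, seed=0):
    """Shuffle the examples, then split them into a training part and a test part."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))            # random ordering of the row indices
    m = int(round(train_frac * len(X)))       # number of training examples
    train_idx, test_idx = perm[:m], perm[m:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data sorted by feature value, so an unshuffled split would be biased.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

X_train, y_train, X_test, y_test = shuffle_split(X, y)
print(X_train.shape, X_test.shape)
```

Using one permutation to index both `X` and `y` keeps every example paired with its own label.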

6_4

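Taking linear regression as an example, the typical training and testing procedure is as follows. First, train the model on the training set obtained from the split to get the parameters $\theta$; training usually means minimizing the cost function $J(\theta)$ with a method such as gradient descent. Then compute the test set error $J_{test}(\theta)$:
$$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x_{test}^{(i)})-y_{test}^{(i)})^2$$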
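As a small worked example of the squared error test cost, here is a sketch in NumPy. The helper name `squared_error_cost` and the three test points are made up for illustration, and I assume the design matrix already contains a leading column of ones for $\theta_0$.

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2); X has a leading column of ones."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Three hypothetical test examples and an already trained theta.
X_test = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
y_test = np.array([2.0, 4.0, 7.0])
theta = np.array([0.0, 2.0])          # h(x) = 2x

# Predictions are 2, 4, 6, so only the last example contributes an error of 1.
j_test = squared_error_cost(theta, X_test, y_test)
print(j_test)  # 1 / (2 * 3), roughly 0.1667
```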

In the formula above, the test set error for linear regression is measured with the squared error metric. For logistic regression, the formula becomes:
$$J_{test}(\theta)=-\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}\left[y_{test}^{(i)}\log h_{\theta}(x_{test}^{(i)}) + (1-y_{test}^{(i)})\log\left(1-h_{\theta}(x_{test}^{(i)})\right)\right]$$
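The logistic regression test cost can be computed the same way. The sketch below is an assumption-laden illustration: the names `sigmoid` and `logistic_cost` and the clipping constant `eps` (added only to avoid `log(0)`) are mine, not the course's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """Average cross-entropy of h(x) = sigmoid(X @ theta) against labels y in {0, 1}."""
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)   # keep log() finite
    return float(-np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)))

# Hypothetical test set; with theta = 0 every prediction is 0.5.
X_test = np.array([[1.0, -1.0],
                   [1.0, 0.5],
                   [1.0, 2.0]])
y_test = np.array([0.0, 1.0, 1.0])

j_test = logistic_cost(np.zeros(2), X_test, y_test)
print(j_test)  # log(2), roughly 0.6931, whatever the labels are
```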

Model Selection and Train/Validation/Test Sets

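Building on the previous part, we introduce one more data set, the cross validation set. We now split the original data set 60/20/20 into a training set, a cross validation set and a test set, and write m, $m_{cv}$ and $m_{test}$ for their sizes. The split is shown in the figure below: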

6_5

The error formulas are adjusted accordingly:
$$J_{train}(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

$$J_{cv}(\theta)=\frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x_{cv}^{(i)})-y_{cv}^{(i)})^2$$

$$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x_{test}^{(i)})-y_{test}^{(i)})^2$$

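Now we turn to model selection: choosing a hypothesis that fits the data "just right", neither underfitting nor overfitting.
As the figure below shows, we list several candidate hypotheses. We train each of them on the training set and obtain its parameters, denoted $\theta^{(1)}$, $\theta^{(2)}$, ..., $\theta^{(10)}$. Next, we use these parameters to compute each hypothesis's error on the cross validation set: $J_{cv}(\theta^{(1)})$, $J_{cv}(\theta^{(2)})$, ..., $J_{cv}(\theta^{(10)})$. We pick the hypothesis with the smallest $J_{cv}(\theta)$ as our final choice and compute its error on the test set.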
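The selection procedure above can be sketched as follows. I assume here that the ten candidates are polynomials of degree 1 to 10 in one feature, and I use synthetic data and `np.polyfit` (least squares) in place of gradient descent; all names and the data are illustrative only.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared error cost 1/(2m) * sum((h(x) - y)^2) for a fitted polynomial."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2)) / 2.0

# Synthetic data (an assumption): a cubic signal plus noise, already in random order.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(scale=0.1, size=100)

# 60/20/20 split into training, cross validation and test sets.
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Fit every candidate on the training set only.
fits = {d: np.polyfit(x_train, y_train, d) for d in range(1, 11)}
j_train = {d: half_mse(c, x_train, y_train) for d, c in fits.items()}
j_cv = {d: half_mse(c, x_cv, y_cv) for d, c in fits.items()}

# Choose the degree with the lowest cross validation error, then report test error.
best_d = min(j_cv, key=j_cv.get)
j_test = half_mse(fits[best_d], x_test, y_test)
print(best_d, j_test)
```

The test set is touched only once, after the choice has been made, so `j_test` is not biased by the selection itself.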

6_6

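Of course, there are far more candidate hypotheses than the 10 shown in the figure above. This naturally raises the question of how to narrow down the set of candidates, which we leave for later.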

Bias vs. Variance

Diagnosing Bias vs. Variance

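When a trained hypothesis performs worse than expected, the cause is usually one of two problems: high bias or high variance. In other words, the hypothesis is usually either underfitting or overfitting. Only by correctly identifying which problem we have can we fix and improve the hypothesis effectively.

We write d for the degree of the polynomial in the hypothesis; for example, $h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4$ has $d=4$. The figure below shows hypotheses of different degrees fitted to the same data set: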

6_7

We plot how $J_{train}(\theta)$ and $J_{cv}(\theta)$ change with d; see the figures below.

6_8

6_9

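The plots show that the smaller d is, the simpler the hypothesis and the more likely it is to underfit; the larger d is, the more complex the hypothesis and the more likely it is to overfit. Put differently, with a high bias problem both $J_{train}(\theta)$ and $J_{cv}(\theta)$ are high; with a high variance problem $J_{train}(\theta)$ is low while $J_{cv}(\theta)$ is high.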
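This rule of thumb can be written down as a tiny helper. It is only a rough sketch: the function `diagnose` and the threshold `acceptable_error` are hypothetical, and what counts as "high" has to be decided per problem.

```python
def diagnose(j_train, j_cv, acceptable_error):
    """Rough bias/variance label from training and cross validation error.

    acceptable_error is a hypothetical, problem-specific target chosen by the user.
    """
    if j_train > acceptable_error:
        return "high bias"        # even the training error is too high
    if j_cv > acceptable_error:
        return "high variance"    # fits the training set but does not generalize
    return "ok"

print(diagnose(j_train=0.90, j_cv=1.00, acceptable_error=0.2))  # high bias
print(diagnose(j_train=0.05, j_cv=0.80, acceptable_error=0.2))  # high variance
print(diagnose(j_train=0.05, j_cv=0.10, acceptable_error=0.2))  # ok
```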

Regularization and Bias/Variance

Next we look at how the regularization coefficient $\lambda$ affects bias and variance. The cost function now becomes (n is the number of features):
$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
As the figure below shows, for a given hypothesis we try 12 different values of $\lambda$. For each value we minimize the cost function to obtain parameters $\theta^{(1)}$, $\theta^{(2)}$, ..., $\theta^{(12)}$, and then compute $J_{cv}(\theta^{(1)})$, $J_{cv}(\theta^{(2)})$, ..., $J_{cv}(\theta^{(12)})$. We take the parameters $\theta$ with the smallest $J_{cv}$ as our final hypothesis. Finally, we compute the error of the selected hypothesis on the test set to see how it performs.
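Here is a sketch of this $\lambda$ sweep. Several things are my own assumptions: the grid of 12 values (0 followed by 0.01 doubled ten times), the degree-8 polynomial features, the synthetic data, and solving the regularized problem in closed form instead of running gradient descent. As in the $J_{cv}$ and $J_{test}$ formulas, the errors are computed without the penalty term.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized squared error cost in closed form."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                      # theta_0 is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(theta, X, y):
    """Unregularized squared error 1/(2m) * sum((X @ theta - y)^2)."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

# Synthetic data (an assumption), already in random order, split 60/20/20.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=100)
X = poly_features(x, 8)
X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_test, y_test = X[80:], y[80:]

lambdas = [0.0] + [0.01 * 2**k for k in range(11)]   # 12 candidate values
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
j_train = [cost(t, X_train, y_train) for t in thetas]
j_cv = [cost(t, X_cv, y_cv) for t in thetas]

best = int(np.argmin(j_cv))                  # index of the lowest J_cv
best_lambda = lambdas[best]
j_test = cost(thetas[best], X_test, y_test)
print(best_lambda, j_test)
```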

6_15

That is how we choose the regularization coefficient $\lambda$. Now let us see how $J_{train}(\theta)$ and $J_{cv}(\theta)$ change with $\lambda$. The plot is shown below:

6_16

Some supplementary material from outside the course.

I looked up the well-known Pattern Recognition and Machine Learning and found its introduction to bias-variance, excerpted below:

It is instructive to consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off.

As we shall see, there is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance. The model with the optimal predictive capability is the one that leads to the best balance between bias and variance.

6_10

The top row corresponds to a large value of the regularization coefficient $\lambda$ that gives low variance (because the red curves in the left plot look similar) but high bias (because the two curves in the right plot are very different). Conversely, on the bottom row, for which $\lambda$ is small, there is large variance (shown by the high variability between the red curves in the left plot) but low bias (shown by the good fit between the average model fit and the original sinusoidal function).

We see that small values of $\lambda$ allow the model to become finely tuned to the noise on each individual data set leading to large variance. Conversely, a large value of $\lambda$ pulls the weight parameters towards zero leading to large bias.

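I also found a slide deck from the Biointelligence Lab at Seoul National University online (link to the slides); screenshots of the few slides that introduce the bias-variance trade-off are pasted below.

Here is also an excerpt from an answer found on Zhihu. The full answer is orange princle's reply to the question "机器学习中的Bias(偏差)，Error(误差)，和Variance(方差)有什么区别和联系？" (What are the differences and connections between Bias, Error and Variance in machine learning?):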

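First, $Error = Bias + Variance$.

Error reflects the accuracy of the model as a whole. Bias reflects the error between the model's output on the samples and the true values, i.e. the accuracy of the model itself. Variance reflects the error between each individual output of the model and the expected output of the model, i.e. the stability of the model.

Take a shooting experiment as an example. The goal is to hit the 10 ring, but the shot actually lands on the 7 ring, so the Error is 3. There can be two reasons for landing on the 7 ring. One is that the aim was off, for instance the shooter was actually aiming at the 9 ring rather than the 10 ring. The other is that the gun itself is unstable: the aim was at the 9 ring, but the shot landed on the 7 ring. In this single shot, the Bias is 1, which reflects the gap between the model's expectation and the true target, and the error caused by Variance is 2: although the aim was at the 9 ring, the model's lack of stability created a gap between the actual result and the model's expectation.

In a real system, low Bias and low Variance usually cannot both be had. Lowering the Bias of a model tends to raise its Variance to some degree, and vice versa. The root cause is that we are always trying to estimate unlimited real data from a limited training sample. When we trust the data more and neglect our prior knowledge about the model, we do our best to make the model accurate on the training sample, which reduces the Bias. But a model learned this way is likely to lose some generalization ability and overfit, which hurts its performance on real data and increases its uncertainty. Conversely, if we trust our prior knowledge about the model more and add more constraints to the model during learning, we reduce its Variance and make it more stable, but its Bias grows. The trade-off between Bias and Variance is one of the basic themes of machine learning, and its shadow can be found in almost every machine learning model.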
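The identity quoted in the answer is an informal shorthand. For squared error the usual decomposition is slightly more detailed. As a sketch, assume the data follow $y=f(x)+\epsilon$ with $E[\epsilon]=0$ and $Var(\epsilon)=\sigma^2$, and write $\hat{f}$ for the model learned from a random training set (these symbols are my own notation, not the answer's). Then, taking the expectation over training sets and noise:

$$E\left[(y-\hat{f}(x))^2\right]=\left(E[\hat{f}(x)]-f(x)\right)^2+E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right]+\sigma^2$$

The first term is the squared bias, the second is the variance, and $\sigma^2$ is noise that no model can remove.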

Learning Curves

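Examining the learning curve is a good way to evaluate a hypothesis. Given a hypothesis, we look at how the errors $J_{train}(\theta)$ and $J_{cv}(\theta)$ change with the training set size m. As the figure below shows, for $h_{\theta}(x)=\theta_0+\theta_1x+\theta_2x^2$, $J_{train}(\theta)$ gradually increases and $J_{cv}(\theta)$ gradually decreases as m grows. This is easy to understand. When m is very small, the hypothesis can fit the training set almost exactly, so $J_{train}(\theta)$ is small; at this point the hypothesis is usually overfitting, so $J_{cv}(\theta)$ is high. As m grows, the hypothesis learns more from the training set and predicts unseen data better, so $J_{cv}(\theta)$ decreases while $J_{train}(\theta)$ increases. Once m reaches a certain size, the hypothesis can no longer fit every training example, so $J_{train}(\theta)$ becomes large and the hypothesis starts to show an underfitting problem, while $J_{cv}(\theta)$ stays at a certain level instead of continuing to fall.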
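A learning curve for the quadratic hypothesis can be computed with a simple loop. The data below are synthetic and all names are illustrative assumptions. I fit with `np.polyfit`, and start at m = 3 because a quadratic passes exactly through three points.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared error cost 1/(2m) * sum((h(x) - y)^2) for a fitted polynomial."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2)) / 2.0

# Synthetic data (an assumption): a sine signal that a quadratic cannot fully capture.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=120)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=120)
x_train, y_train = x[:80], y[:80]
x_cv, y_cv = x[80:], y[80:]

sizes = list(range(3, 81))
j_train, j_cv = [], []
for m in sizes:
    coeffs = np.polyfit(x_train[:m], y_train[:m], 2)            # train on the first m examples
    j_train.append(half_mse(coeffs, x_train[:m], y_train[:m]))  # error on those m examples
    j_cv.append(half_mse(coeffs, x_cv, y_cv))                   # error on the full CV set

print(j_train[0], j_train[-1])
print(j_cv[0], j_cv[-1])
```

Note that $J_{train}$ is measured only on the m examples used for fitting, while $J_{cv}$ always uses the whole cross validation set.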

6_17

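Now let us see how $J_{train}(\theta)$ and $J_{cv}(\theta)$ change with the training set size m when there is a high bias problem or a high variance problem. The plots are in the two figures below. The first figure shows that with a high bias problem, enlarging the training set is unlikely to improve the hypothesis; the second shows that with a high variance problem, enlarging the training set is likely to be a good way to improve it.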

6_18

6_19

Deciding What to Do Next Revisited

Now that we understand bias and variance, we know which problem each of the measures listed at the start of this post is meant to fix.

Get more training examples -> fix high variance
Try smaller sets of features -> fix high variance
Try getting additional features -> fix high bias
Try adding polynomial features -> fix high bias
Try decreasing $\lambda$ -> fix high bias
Try increasing $\lambda$ -> fix high variance

Reference

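The content of these notes comes from Andrew Ng's Machine Learning course on Coursera; see the link.

While writing I also referred to the following:

The CSDN blog of Rachel____Zhang; see the link
Pattern Recognition and Machine Learning, pp. 147-152
orange princle's answer to the Zhihu question "机器学习中的Bias(偏差)，Error(误差)，和Variance(方差)有什么区别和联系？"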

Machine Learning Week 6 Notes, Part 1: Evaluating a Learning Algorithm and Bias-Variance