Would either L1 or L2 regularisation lower the MSE on the training and test data?

Artificial Intelligence linear-regression mean-squared-error l2-regularization l1-regularization

2021-11-03 09:57:32

Consider linear regression. The mean squared error (MSE) is 120.5 for the training dataset. We've reached the minimum for the training data.

Is it possible that by applying Lasso (L1 regularization) we would get a lower MSE for the training data? Would it get lower for the test data? Would this also hold for ridge regression (L2 regularization)?

1 Answer

The answer is largely the same whether we consider $\ell_1$ or $\ell_2$ regularisation, so I will just speak generally about regularisation.

Mean square error for training data

Given some training data $\{(x_i, y_i)\}_{i = 1}^n$ , a linear regression line $Y = aX + b$ fit using the least squares method looks for coefficients that minimise the sum of squares, i.e. they are the minimisers given by

$\mathrm{arg\,min}_{a, b} \sum_{i = 1}^n \left(y_i - (ax_i + b)\right)^2.$

This gives the same coefficients as minimising the mean square error

$\mathrm{MSE}\left((x_1, y_1), \dots, (x_n, y_n)\right) = \frac{1}{n} \sum_{i = 1}^n \left(y_i - (ax_i + b)\right)^2.$

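So, by definition, the least-squares coefficients $(a, b)$ minimise the MSE on the training data. A regularised fit minimises a different objective (the sum of squares plus a penalty on the coefficients), so its training MSE can never be lower than the unregularised one: it is equal only if the penalty leaves the fitted line unchanged, and higher otherwise.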
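As a rough illustration of this point, here is a minimal sketch in plain Python for the single-feature line $Y = aX + b$. Everything in it is an assumption made for the example rather than something taken from the question: the synthetic data, the penalty strength `lam`, the helper names `fit_line` and `mse`, and the convention of penalising only the slope $a$ and not the intercept $b$. For one feature, both penalised fits have simple closed forms (shrinking the slope for L2, soft-thresholding it for L1), so no library is needed.

```python
import random

def fit_line(xs, ys, lam=0.0, penalty="none"):
    """Fit y = a*x + b by least squares with an optional penalty on the slope a.

    penalty="l2" minimises sum of squares + lam * a**2   (ridge)
    penalty="l1" minimises sum of squares + lam * abs(a) (lasso)
    The intercept b is left unpenalised (an assumption of this sketch).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if penalty == "l2":
        # Ridge shrinks the slope towards zero.
        a = sxy / (sxx + lam)
    elif penalty == "l1":
        # Lasso soft-thresholds the slope; it becomes exactly 0 for large lam.
        shrunk = max(abs(sxy) - lam / 2.0, 0.0)
        a = (shrunk if sxy >= 0 else -shrunk) / sxx
    else:
        # Ordinary least squares.
        a = sxy / sxx
    # For any fixed slope, this intercept minimises the sum of squares.
    b = y_bar - a * x_bar
    return a, b

def mse(xs, ys, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data: a noisy line.
random.seed(0)
train_x = [random.uniform(0, 10) for _ in range(30)]
train_y = [3.0 * x + 5.0 + random.gauss(0, 4) for x in train_x]

ols_mse = mse(train_x, train_y, *fit_line(train_x, train_y))
ridge_mse = mse(train_x, train_y, *fit_line(train_x, train_y, lam=50.0, penalty="l2"))
lasso_mse = mse(train_x, train_y, *fit_line(train_x, train_y, lam=50.0, penalty="l1"))

print("training MSE, least squares:", ols_mse)
print("training MSE, ridge (L2):   ", ridge_mse)
print("training MSE, lasso (L1):   ", lasso_mse)
```

Whatever data and `lam` you plug in, the two regularised training MSEs should come out no smaller than the least-squares one, since the least-squares line is the minimiser of the training MSE.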

Generalisation performance

The main point of regularisation is to prevent overfitting on the data and improve the generalisation performance (i.e. on the test set).

With an appropriate parameter for regularisation, you may obtain a smaller MSE on the test set. This depends on your dataset and the parameters you choose: strong regularisation may lead to underfitting, whereas weak regularisation might not make much difference to the coefficients that you fit.
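To see how the test MSE can depend on the regularisation strength, one option is to sweep over a few values and compare. The sketch below does this for ridge regression on the single-feature line, in plain Python. The data-generating process (a weak signal with heavy noise and only 8 training points), the grid `lambdas`, and the helper names `ridge_line`, `mse` and `sample` are all assumptions chosen for illustration; whether any value beats the unregularised fit on the test set depends on the data, so no particular outcome is claimed here.

```python
import random

def ridge_line(xs, ys, lam):
    """Slope and intercept minimising sum of squares + lam * a**2.

    The intercept is left unpenalised (an assumption of this sketch).
    lam = 0 gives the ordinary least-squares line.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / (sxx + lam)
    return a, y_bar - a * x_bar

def mse(xs, ys, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng):
    # Hypothetical data-generating process: weak signal, heavy noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 2) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = sample(8, rng)      # small training set
test_x, test_y = sample(2000, rng)     # large held-out test set

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in lambdas:
    a, b = ridge_line(train_x, train_y, lam)
    results.append((lam, mse(train_x, train_y, a, b), mse(test_x, test_y, a, b)))

for lam, train_mse, test_mse in results:
    print(f"lambda={lam:>6}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training MSE column should never decrease as `lam` grows, while the test MSE column has no such guarantee in either direction. In practice the regularisation parameter is usually picked on a separate validation set or by cross-validation rather than on the test set itself.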
