2023-04-06 15:51 | Views: 172 | Comments: 0 | Recommendations: 0

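I. Regularization

1. Understanding regularization

When a model's complexity far exceeds that of the data, overfitting occurs: the model fits the training data too closely and its generalization ability degrades. Data augmentation, dimensionality reduction, and regularization are common ways to prevent this.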

$$\arg\min_{\omega} \left( \mathcal{L}(\omega) + \lambda \Omega(\omega) \right)$$

In this expression, the first term is the loss function and the second is the regularization term.
From a mathematical point of view, take linear regression as an example. Its loss function is:

$$\mathcal{L}(\omega) = \sum_{i=1}^{N} \| \omega^T x_i - y_i \|^2$$

Setting the gradient to zero gives the closed-form solution (writing the weight vector $\omega$ as $W$ in matrix form):

$$W = (X^T X)^{-1} X^T Y$$

Obtaining this solution requires inverting $X^T X$.
For $X_{N \times P}$ with $x_i \in \mathbb{R}^P$, where $N$ is the number of samples and $P$ is the sample dimension, when $P \gg N$ the matrix $X^T X$ has rank at most $N < P$ and is therefore not invertible; in practice this shows up as overfitting.
Constraining the problem with $L_2$ regularization gives the objective:

$$J(\omega) = \sum_{i=1}^{N} \| \omega^T x_i - y_i \|^2 + \lambda \omega^T \omega$$

Differentiating:

$$\frac{\partial J(\omega)}{\partial \omega} = 2 (X^T X + \lambda I) W - 2 X^T Y$$

Setting the derivative to zero and solving:

$$W = (X^T X + \lambda I)^{-1} X^T Y$$

$X^T X$ is positive semidefinite and $\lambda I$ (with $\lambda > 0$) is a positive definite diagonal matrix, so $(X^T X + \lambda I)$ is positive definite and therefore always invertible. This explains mathematically why regularization helps.
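The invertibility argument can be checked numerically. The following is a minimal NumPy sketch, assuming random data with $P > N$ and an arbitrary $\lambda = 0.1$; the sizes, the seed, and the helper name `ridge_closed_form` are illustrative choices, not part of the original post.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve W = (X^T X + lam * I)^{-1} X^T Y without forming the inverse explicitly."""
    P = X.shape[1]
    A = X.T @ X + lam * np.eye(P)
    return np.linalg.solve(A, X.T @ Y)

rng = np.random.default_rng(0)
N, P = 5, 20                       # fewer samples than features (assumed sizes)
X = rng.standard_normal((N, P))
Y = rng.standard_normal((N, 1))
lam = 0.1                          # assumed regularization strength

gram = X.T @ X
rank_gram = np.linalg.matrix_rank(gram)   # at most N, so X^T X is singular here
min_eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(P)).min()  # about lam or larger

W = ridge_closed_form(X, Y, lam)
grad = 2 * (gram + lam * np.eye(P)) @ W - 2 * X.T @ Y  # gradient of J at the solution
print(rank_gram, P, min_eig_ridge, np.abs(grad).max())
```

The rank of $X^T X$ stays below $P$, while the smallest eigenvalue of $X^T X + \lambda I$ is bounded away from zero, so the regularized system has a unique solution.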

2. L1 regularization (Lasso Regression) and L2 regularization (Ridge Regression)

In a neural network, the parameters $\theta$ include each layer's weights $\omega$ and biases $b$. Usually only the weights $\omega$ are penalized, because each weight governs the interaction between two variables, whereas a bias $b$ controls only a single variable, so leaving it unregularized does not introduce much variance. To reduce the size of the hyperparameter search space, all layers share the same weight decay.
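As a rough illustration of penalizing weights but not biases with one shared coefficient, here is a minimal sketch. The parameter dictionary, the `"W"`/`"b"` naming convention, and the helper `l2_penalty` are assumptions made for this example only.

```python
import numpy as np

def l2_penalty(params, weight_decay):
    """Sum of squared weights scaled by one shared coefficient; biases are skipped.

    Assumes (hypothetically) that weight names start with "W" and bias names with "b".
    """
    return weight_decay * sum(
        np.sum(value ** 2) for name, value in params.items() if name.startswith("W")
    )

params = {
    "W1": np.array([[1.0, -2.0], [0.5, 0.0]]),
    "b1": np.array([10.0, 10.0]),   # large biases: they must not affect the penalty
    "W2": np.array([[3.0, 1.0]]),
    "b2": np.array([-7.0]),
}

penalty = l2_penalty(params, weight_decay=0.01)
print(penalty)
```

The penalty would be added to the data loss before backpropagation; changing the bias values leaves it unchanged.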

$L_1$ and $L_2$ regularization can be viewed as introducing a penalty term into the loss function. First, consider the $L_1$ and $L_2$ loss functions.

The $L_1$ loss, also called least absolute deviations (LAD) or least absolute errors (LAE), minimizes the sum of the absolute differences between the target values $y_i$ and the estimates $f(x_i)$:

$$L = \sum_{i=1}^{n} | y_i - f(x_i) |$$

The $L_2$ loss, also called least squares error (LSE), minimizes the sum of the squared differences between the target values $y_i$ and the estimates $f(x_i)$:

$$L = \sum_{i=1}^{n} ( y_i - f(x_i) )^2$$
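The two losses can be compared on a small example. This is a minimal sketch with made-up targets and predictions; the function names `l1_loss` and `l2_loss` are illustrative.

```python
import numpy as np

def l1_loss(y, y_hat):
    """Sum of absolute differences between targets and estimates."""
    return np.sum(np.abs(y - y_hat))

def l2_loss(y, y_hat):
    """Sum of squared differences between targets and estimates."""
    return np.sum((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 14.0])   # the last prediction is a deliberate outlier

print(l1_loss(y, y_hat))   # 0.5 + 0 + 1 + 10 = 11.5
print(l2_loss(y, y_hat))   # 0.25 + 0 + 1 + 100 = 101.25
```

The single outlier contributes 10 to the $L_1$ loss but 100 to the $L_2$ loss, which is why the squared loss is more sensitive to outliers.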

3. Sparsity of L1 regularization

TODO: this needs figures, which are a hassle to draw. To be updated when there is time.
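In the meantime, the sparsity effect can be hinted at with a one-dimensional example. This is not the figure the author planned, only a sketch under simple assumptions: for a single coordinate with unregularized optimum $a$, the $L_1$-penalized problem $\min_w \frac{1}{2}(w - a)^2 + \lambda |w|$ is solved by soft-thresholding, while the $L_2$-penalized problem $\min_w \frac{1}{2}(w - a)^2 + \frac{\lambda}{2} w^2$ is solved by uniform shrinkage. The function names and values are illustrative.

```python
import numpy as np

def l1_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2 (uniform shrinkage)."""
    return a / (1.0 + lam)

a = np.array([3.0, -0.4, 0.2, -1.5])   # per-coordinate unregularized optima
w_l1 = l1_shrink(a, lam=0.5)
w_l2 = l2_shrink(a, lam=0.5)

print(w_l1, np.count_nonzero(w_l1))
print(w_l2, np.count_nonzero(w_l2))
```

Coordinates whose magnitude is below $\lambda$ become exactly zero under the $L_1$ penalty, whereas the $L_2$ penalty only scales every coordinate toward zero.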

4. Dropout

Dropout provides a very simple approach to regularization: during training, each neuron is kept active with probability $p$ (a hyperparameter) and is otherwise set to 0. See the CS231n notes.
Note that for an input $x$ to a given neuron, the expected value after Dropout is $E = p x + (1 - p) \times 0 = p x$. To obtain the same expected output at test time as at training time, the activations must be rescaled, either at training time or at prediction time. The advantage of rescaling at training time is that the prediction code stays the same whether or not Dropout is used.

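"""
Inverted dropout: the recommended implementation.
Drop units and rescale at training time; the test-time forward pass is unchanged.
"""
import numpy as np

p = 0.5  # probability of keeping a neuron active; higher p = weaker dropout

# Network parameters. The layer sizes below are illustrative assumptions,
# not values from the original post.
W1, b1 = 0.01 * np.random.randn(8, 4), np.zeros((8, 1))
W2, b2 = 0.01 * np.random.randn(8, 8), np.zeros((8, 1))
W3, b3 = 0.01 * np.random.randn(3, 8), np.zeros((3, 1))

def train_step(X):
    # Forward pass of a 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask; /p rescales so the expectation is unchanged
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask; /p rescales so the expectation is unchanged
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3
    # Backward pass: compute gradients... (omitted)
    # Parameter update... (omitted)
    return out

def predict(X):
    # Forward pass without masks; acts as an ensemble of the dropped-out sub-networks
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no rescaling needed at prediction time
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out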
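The effect of dividing the mask by $p$ can be checked empirically. A minimal sketch, assuming a constant activation of 2.0 repeated many times and a fixed seed (both arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                           # keep probability
x = np.full(200_000, 2.0)         # a constant activation, repeated many times

mask = rng.random(x.shape) < p
plain = x * mask                  # dropout without rescaling: mean is close to p * x
inverted = x * mask / p           # inverted dropout: mean is close to x

print(plain.mean(), inverted.mean())
```

Without rescaling, the average activation drops to roughly $p x$; with the $/p$ factor it stays near $x$, so the test-time forward pass needs no correction.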

5. Early Stopping

I like to call this the lazy person's go-to regularization method: simple and effective.
Late in training, the loss on the training set often keeps decreasing while the loss on the validation set starts to rise, which indicates that the model has begun to overfit. At that point, simply keep the model with the lowest validation loss.
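The bookkeeping can be sketched as follows. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common convention assumed here, not something the text above specifies, and the validation losses are made-up numbers.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, best_loss), stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # A real training loop would also save a model checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Hypothetical per-epoch validation losses: falling, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.40]
best_epoch, best_loss = early_stopping(val_losses, patience=2)
print(best_epoch, best_loss)
```

With `patience=2`, training stops after epoch 5 and the checkpoint from epoch 3 is kept; the later value 0.40 is never reached, which is the usual trade-off of a small patience.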

6. Multi-task learning (Multi-task)

Multi-task learning improves generalization by combining several tasks, and can be seen as imposing a soft constraint on the parameters. When part of a model is shared across multiple tasks, that part is constrained toward better values, which improves generalization provided the shared tasks are reasonably related.
There is not much more to say here: most current end-to-end (E2E) models use multi-task learning, and it is also common in BEV perception.
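One common way to share part of a model is a shared trunk with one head per task, trained on a weighted sum of the task losses. The sketch below assumes two regression tasks, random data, and equal loss weights; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))              # 16 samples, 4 features
y_a = rng.standard_normal((16, 1))            # targets for hypothetical task A
y_b = rng.standard_normal((16, 1))            # targets for hypothetical task B

W_shared = 0.1 * rng.standard_normal((4, 8))  # trunk shared by both tasks
W_a = 0.1 * rng.standard_normal((8, 1))       # task-specific heads
W_b = 0.1 * rng.standard_normal((8, 1))

def multitask_loss(X, y_a, y_b, weight_a=1.0, weight_b=1.0):
    """Weighted sum of two mean-squared-error losses computed on a shared representation."""
    H = np.maximum(0, X @ W_shared)           # shared representation (ReLU)
    loss_a = np.mean((H @ W_a - y_a) ** 2)
    loss_b = np.mean((H @ W_b - y_b) ** 2)
    return weight_a * loss_a + weight_b * loss_b, loss_a, loss_b

total, loss_a, loss_b = multitask_loss(X, y_a, y_b)
print(total, loss_a, loss_b)
```

Because both loss terms depend on `W_shared`, its gradient receives contributions from both tasks, which is where the constraint on the shared part comes from.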

7. Data augmentation, adding noise, etc.

Data augmentation is a step that almost every Dataset and Dataloader implementation goes through. Note that for classification tasks, transformations that change the class must not be used: for example, a rotation that turns a 6 into a 9. For object detection tasks, when applying rotation or scaling, the bounding boxes must be transformed along with the image.
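To illustrate keeping boxes in sync with the image, here is a minimal sketch using a horizontal flip, chosen because its box update is simple; rotation and scaling need the analogous coordinate transform. The box format `[x_min, y_min, x_max, y_max]` with an exclusive maximum and the helper name `hflip_with_boxes` are assumptions for this example.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Flip an (H, W) or (H, W, C) image left-right and update boxes given as
    rows of [x_min, y_min, x_max, y_max] in pixel coordinates (max exclusive)."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    new_boxes = boxes.astype(float).copy()
    new_boxes[:, 0] = width - boxes[:, 2]   # new x_min comes from the old x_max
    new_boxes[:, 2] = width - boxes[:, 0]   # new x_max comes from the old x_min
    return flipped, new_boxes

image = np.zeros((4, 10))
image[1:3, 1:4] = 1.0                       # object occupying rows 1..2, columns 1..3
boxes = np.array([[1, 1, 4, 3]])

flipped, new_boxes = hflip_with_boxes(image, boxes)
print(new_boxes)
```

After the flip, the object sits in columns 6 to 8 and the updated box covers exactly that region; flipping the image without updating the box would leave the label pointing at empty background.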

Author: Abyss_J

Link: https://www.cnblogs.com/abyss-130/p/17293042.html
