Ten Must-Knows for Deep Learning with TensorFlow 2
Drawing on my years of experience developing deep learning algorithms, I have put together the following ten must-knows.
Reference links are included, and some items come with code.
Sharing beats keeping things to myself, so I hope this is useful to readers.
When I find the time I will come back and go into the details behind these deep learning tricks.
If you have a technical problem you would like solved on a paid basis, you can also reach me by email or QQ.
Email and QQ ID: gaozhihan@vip.qq.com
Of course there are certainly other "must-knows" beyond these ten;
feel free to share more in the comments. This is just a tentative list of ten, so please don't take it too literally.
Focus on the underlying ideas, and keep in mind that some of them do not apply in every scenario.
1. Data Echoing
[1907.05550] Faster Neural Network Training with Data Echoing
import tensorflow as tf


def data_echoing(factor):
    # Wrap each (image, label) pair in a tiny dataset and repeat it `factor` times.
    return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)
Effect:
After the dataset is loaded, each batch is fed to the model several times (repeated before or after augmentation), which amortizes the data-loading cost.
This is equivalent to letting the model see the same batch n times, or see n augmented variants of the same samples; a usage sketch follows.
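A minimal usage sketch, assuming echoing is applied with flat_map right after loading and before a (hypothetical) augment_fn, so the repeated copies are still augmented independently; `images`, `labels`, and `augment_fn` are placeholders:

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))  # `images`/`labels` are placeholders
    .flat_map(data_echoing(factor=2))                      # each example now appears twice
    .map(augment_fn)                                       # hypothetical augmentation function
    .shuffle(1024)
    .batch(32)
)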
2. AMP (Automatic Mixed Precision)
Using mixed precision and XLA to accelerate training in bert4keras - 科学空间|Scientific Spaces
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
Effect:
Reduces GPU memory usage and speeds up training by rewriting parts of the network as equivalent lower-precision (float16) computation, which lowers the compute cost.
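On TF 2.4+ the Keras mixed-precision policy API is an alternative way to enable AMP; a minimal sketch, assuming a GPU with float16-capable Tensor Cores:

import tensorflow as tf

# Layers compute in float16 while variables stay in float32 for numerical stability.
# When compiling with model.fit, Keras wraps the optimizer with loss scaling automatically.
tf.keras.mixed_precision.set_global_policy('mixed_float16')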
3. Memory-Efficient Optimizers
3.1 [1804.04235] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
mesh/optimize.py at master · tensorflow/mesh · GitHub
3.2 [1901.11150] Memory-Efficient Adaptive Optimization
google-research/sm3 at master · google-research/google-research (github.com)
Effect:
Saves GPU memory and speeds up training,
mainly by factorizing the second-moment (squared-gradient) accumulator into much smaller statistics, so far less optimizer state has to be stored. A sketch of the idea follows.
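A minimal sketch of the factored second-moment idea behind Adafactor and SM3 (not a full optimizer; the function name and hyper-parameters are illustrative): for a weight matrix of shape (n, m), only a row statistic of size n and a column statistic of size m are stored, instead of the full n x m accumulator. Here `row_acc` and `col_acc` are assumed to be tf.Variables of shape (n,) and (m,).

import tensorflow as tf

def factored_rms(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
    # Exponential moving averages of per-row / per-column means of the squared gradient.
    g2 = tf.square(grad) + eps
    row_acc.assign(beta2 * row_acc + (1.0 - beta2) * tf.reduce_mean(g2, axis=1))
    col_acc.assign(beta2 * col_acc + (1.0 - beta2) * tf.reduce_mean(g2, axis=0))
    # Rank-1 reconstruction of the full second-moment estimate: V ~= r c^T / mean(r).
    v_hat = tf.einsum('i,j->ij', row_acc, col_acc) / tf.reduce_mean(row_acc)
    # Precondition the gradient the way Adam/RMSProp would with the full accumulator.
    return grad * tf.math.rsqrt(v_hat)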
4. Weight Standardization (Normalization)
[2102.06171] High-Performance Large-Scale Image Recognition Without Normalization
deepmind-research/nfnets at master · deepmind/deepmind-research · GitHub
import numpy as np
import tensorflow as tf


class WSConv2D(tf.keras.layers.Conv2D):
    """Conv2D whose kernel is standardized (zero mean, scaled variance) on every call."""

    def __init__(self, *args, **kwargs):
        super(WSConv2D, self).__init__(
            kernel_initializer=tf.keras.initializers.VarianceScaling(
                scale=1.0, mode='fan_in', distribution='untruncated_normal',
            ),
            use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),
            *args, **kwargs
        )
        # Learnable per-filter gain applied after standardization.
        self.gain = self.add_weight(
            name='gain',
            shape=(self.filters,),
            initializer="ones",
            trainable=True,
            dtype=self.dtype
        )

    def standardize_weight(self, eps):
        mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
        fan_in = np.prod(self.kernel.shape[:-1])
        # Manually fused normalization, eq. to (w - mean) * gain / sqrt(N * var)
        scale = tf.math.rsqrt(
            tf.math.maximum(var * fan_in, tf.convert_to_tensor(eps, dtype=self.dtype))
        ) * self.gain
        shift = mean * scale
        return self.kernel * scale - shift

    def call(self, inputs):
        eps = 1e-4
        weight = self.standardize_weight(eps)
        out = tf.nn.conv2d(
            inputs, weight,
            strides=self.strides,
            padding=self.padding.upper(),
            dilations=self.dilation_rate
        )
        return out if self.bias is None else tf.nn.bias_add(out, self.bias)
Effect:
Standardizing (or normalizing) the kernel acts as a prior constraint on the weights, which speeds up training convergence. A usage sketch follows.
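A minimal usage sketch, treating WSConv2D as a drop-in replacement for tf.keras.layers.Conv2D (the model itself is illustrative). Note that the overridden call above does not apply Conv2D's activation argument, so activations are added as separate layers:

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    WSConv2D(filters=16, kernel_size=3),
    tf.keras.layers.ReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])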
5. Adaptive Gradient Clipping
deepmind-research/agc_optax.py at master · deepmind/deepmind-research · GitHub
def unitwise_norm(x):
    if len(tf.squeeze(x).shape) <= 1:  # Scalars and vectors
        axis = None
        keepdims = False
    elif len(x.shape) in [2, 3]:  # Linear layers of shape IO
        axis = 0
        keepdims = True
    elif len(x.shape) == 4:  # Conv kernels of shape HWIO
        axis = [0, 1, 2]
        keepdims = True
    else:
        raise ValueError(f'Got a parameter with shape not in [1, 2, 3, 4]! {x}')
    square_sum = tf.reduce_sum(tf.square(x), axis, keepdims=keepdims)
    return tf.sqrt(square_sum)


def gradient_clipping(grad, var):
    clipping = 0.01
    max_norm = tf.maximum(unitwise_norm(var), 1e-3) * clipping
    grad_norm = unitwise_norm(grad)
    trigger = (grad_norm > max_norm)
    clipped_grad = (max_norm / tf.maximum(grad_norm, 1e-6))
    return grad * tf.where(trigger, clipped_grad, tf.ones_like(clipped_grad))
Effect:
Prevents exploding gradients and stabilizes training. Each gradient is clipped according to the ratio between its norm and the corresponding parameter's norm, which effectively bounds the per-unit update size (see the training-step sketch below).
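A minimal sketch of applying gradient_clipping inside a custom training step; `model`, `optimizer`, and `loss_fn` are assumed to be defined elsewhere:

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip each gradient relative to the norm of its own parameter tensor.
    grads = [gradient_clipping(g, v) for g, v in zip(grads, model.trainable_variables)]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss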
6. recompute_grad (gradient checkpointing)
[1604.06174] Training Deep Nets with Sublinear Memory Cost
google-research/recompute_grad.py at master · google-research/google-research (github.com)
bojone/keras_recompute: saving memory by recomputing for keras (github.com)
Effect:
Saves GPU memory by discarding intermediate activations during the forward pass and recomputing them during backpropagation, trading extra compute for memory.
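A minimal sketch using TensorFlow's built-in tf.recompute_grad; the RecomputedBlock layer is illustrative, and the wrapped sub-network's activations are recomputed on the backward pass instead of being kept in memory:

import tensorflow as tf

class RecomputedBlock(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense1 = tf.keras.layers.Dense(units, activation='relu')
        self.dense2 = tf.keras.layers.Dense(units, activation='relu')

    def call(self, inputs):
        # Activations inside `block` are not stored; they are recomputed during backprop.
        block = tf.recompute_grad(lambda x: self.dense2(self.dense1(x)))
        return block(inputs)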
7. Normalization
[2003.05569] Extended Batch Normalization (arxiv.org)
import tensorflow as tf
from keras.layers.normalization.batch_normalization import BatchNormalizationBase


class ExtendedBatchNormalization(BatchNormalizationBase):
    def __init__(self,
                 axis=-1,
                 momentum=0.99,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 moving_mean_initializer='zeros',
                 moving_variance_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 renorm=False,
                 renorm_clipping=None,
                 renorm_momentum=0.99,
                 trainable=True,
                 name=None,
                 **kwargs):
        # Currently we only support aggregating over the global batch size.
        super(ExtendedBatchNormalization, self).__init__(
            axis=axis,
            momentum=momentum,
            epsilon=epsilon,
            center=center,
            scale=scale,
            beta_initializer=beta_initializer,
            gamma_initializer=gamma_initializer,
            moving_mean_initializer=moving_mean_initializer,
            moving_variance_initializer=moving_variance_initializer,
            beta_regularizer=beta_regularizer,
            gamma_regularizer=gamma_regularizer,
            beta_constraint=beta_constraint,
            gamma_constraint=gamma_constraint,
            renorm=renorm,
            renorm_clipping=renorm_clipping,
            renorm_momentum=renorm_momentum,
            fused=False,
            trainable=trainable,
            virtual_batch_size=None,
            name=name,
            **kwargs)

    def _calculate_mean_and_var(self, x, axes, keep_dims):
        with tf.keras.backend.name_scope('moments'):
            y = tf.cast(x, tf.float32) if x.dtype == tf.float16 else x
            replica_ctx = tf.distribute.get_replica_context()
            if replica_ctx:
                # Aggregate sums across replicas to get global-batch statistics.
                local_sum = tf.math.reduce_sum(y, axis=axes, keepdims=True)
                local_squared_sum = tf.math.reduce_sum(tf.math.square(y), axis=axes, keepdims=True)
                batch_size = tf.cast(tf.shape(y)[0], tf.float32)
                y_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, local_sum)
                y_squared_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, local_squared_sum)
                global_batch_size = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, batch_size)
                axes_vals = [(tf.shape(y))[i] for i in range(1, len(axes))]
                multiplier = tf.cast(tf.reduce_prod(axes_vals), tf.float32)
                multiplier = multiplier * global_batch_size
                mean = y_sum / multiplier
                y_squared_mean = y_squared_sum / multiplier
                # var = E(x^2) - E(x)^2
                variance = y_squared_mean - tf.math.square(mean)
            else:
                # Compute true mean while keeping the dims for proper broadcasting.
                mean = tf.math.reduce_mean(y, axes, keepdims=True, name='mean')
                variance = tf.math.reduce_mean(
                    tf.math.squared_difference(y, tf.stop_gradient(mean)),
                    axes, keepdims=True, name='variance')
            if not keep_dims:
                mean = tf.squeeze(mean, axes)
                variance = tf.squeeze(variance, axes)
            # Pool the variance across channels into a single scalar.
            variance = tf.math.reduce_mean(variance)
            if x.dtype == tf.float16:
                return (tf.cast(mean, tf.float16), tf.cast(variance, tf.float16))
            else:
                return mean, variance
Effect:
A simple but effective tweak to Batch Normalization: the mean is still computed per channel, while the variance is pooled across all channels (the final reduce_mean over the per-channel variances above), which gives more stable statistics when the batch size is small.
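A minimal drop-in usage sketch, mirroring how tf.keras.layers.BatchNormalization is normally placed (`inputs` is a placeholder tensor):

x = tf.keras.layers.Conv2D(64, 3, padding='same', use_bias=False)(inputs)
x = ExtendedBatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)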
8. Learning-Rate Schedules
[1506.01186] Cyclical Learning Rates for Training Neural Networks (arxiv.org)
Effect:
A recommended learning-rate schedule (cyclical learning rates); in certain settings it generalizes better than a fixed or monotonically decayed rate. A sketch follows.
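A minimal sketch of the triangular cyclical schedule from the paper, written as a Keras LearningRateSchedule (the class name and hyper-parameter values are illustrative):

import tensorflow as tf

class TriangularCLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-4, max_lr=1e-2, step_size=2000):
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size  # half a cycle, measured in optimizer steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
        x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
        # Ramps linearly from base_lr up to max_lr and back down every 2 * step_size steps.
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.SGD(learning_rate=TriangularCLR(), momentum=0.9)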
9. Re-parameterization
https://zhuanlan.zhihu.com/p/361090497
Effect:
Improves generalization by training several sets of parameters (parallel branches) at the same time and then merging their weights into a single operator, so inference cost does not increase. A merging sketch follows.
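A minimal sketch of the weight-merging step for one common case (structural re-parameterization, RepVGG-style): a 3x3 conv branch and a parallel 1x1 conv branch trained jointly can be folded into a single 3x3 kernel for inference, because conv(x, k3) + conv(x, k1) == conv(x, k3 + pad(k1)) under 'same' padding and equal strides. BatchNorm folding and biases are omitted, and the function name is illustrative.

import tensorflow as tf

def merge_3x3_and_1x1(kernel_3x3, kernel_1x1):
    # Kernels use the HWIO layout; zero-pad the 1x1 kernel to 3x3 spatial support.
    padded_1x1 = tf.pad(kernel_1x1, [[1, 1], [1, 1], [0, 0], [0, 0]])
    return kernel_3x3 + padded_1x1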
10. Long-Tailed Learning
[2110.04596] Deep Long-Tailed Learning: A Survey (arxiv.org)
Jorwnpay/A-Long-Tailed-Survey: a Chinese translation of Deep Long-Tailed Learning: A Survey (github.com)
Effect:
Addresses long-tailed class imbalance; the techniques surveyed can speed up convergence, improve generalization, and stabilize training. One simple remedy is sketched below.
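A minimal sketch of one common long-tail remedy covered by the survey, logit adjustment: the log class prior is added to the logits inside the training loss, which pushes the model to produce larger raw logits for rare classes. `class_counts` is a hypothetical per-class sample-count array, and the function name is illustrative.

import tensorflow as tf

def logit_adjusted_loss(class_counts, tau=1.0):
    counts = tf.cast(class_counts, tf.float32)
    prior = counts / tf.reduce_sum(counts)
    adjustment = tau * tf.math.log(prior)  # most negative for the rarest classes

    def loss_fn(y_true_onehot, logits):
        # Cross-entropy on prior-adjusted logits; use the raw logits at inference time.
        return tf.nn.softmax_cross_entropy_with_logits(
            labels=y_true_onehot, logits=logits + adjustment)
    return loss_fn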
Reposted from https://www.cnblogs.com/cpuimage/p/16427268.html