$$\lim_{x\to a}[f(x)+g(x)]=\lim_{x\to a}f(x)+\lim_{x\to a}g(x)$$
$$\lim_{x\to a}[f(x)-g(x)]=\lim_{x\to a}f(x)-\lim_{x\to a}g(x)$$
$$\lim_{x\to a}[cf(x)]=c\lim_{x\to a}f(x)$$
$$\lim_{x\to a}[f(x)g(x)]=\lim_{x\to a}f(x)\cdot\lim_{x\to a}g(x)$$
$$\lim_{x\to a}\frac{f(x)}{g(x)}=\frac{\lim_{x\to a}f(x)}{\lim_{x\to a}g(x)}\quad\text{if }\lim_{x\to a}g(x)\neq 0$$
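As a quick sanity check of these laws, here is a small sketch using SymPy (SymPy itself and the particular choices of $f$ and $g$ are my own illustrative assumptions):

```python
# Verify the sum and product limit laws at a point with SymPy.
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x) / x        # lim_{x->0} f = 1
g = sp.cos(x)            # lim_{x->0} g = 1
a = 0

lhs = sp.limit(f + g, x, a)                    # limit of the sum
rhs = sp.limit(f, x, a) + sp.limit(g, x, a)    # sum of the limits
assert lhs == rhs == 2

lhs = sp.limit(f * g, x, a)                    # limit of the product
rhs = sp.limit(f, x, a) * sp.limit(g, x, a)    # product of the limits
assert lhs == rhs == 1
```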
[Def] Partial Derivative: For a real-valued function of multiple variables $f:\mathbb{R}^n\mapsto\mathbb{R},\ x\mapsto f(x),\ x=(x_1,x_2,\dots,x_n)$, the partial derivative of $f$ with respect to $x_i$ is defined as
$$\frac{\partial f}{\partial x_i}:=\lim_{h\to 0}\frac{f(x_1,\dots,x_i+h,\dots,x_n)-f(x_1,\dots,x_i,\dots,x_n)}{h}.$$
The gradient of a real-valued function $f:\mathbb{R}^n\mapsto\mathbb{R},\ x\mapsto f(x),\ x\in\mathbb{R}^{n\times 1}=[x_1,x_2,\dots,x_n]^T$ with respect to the column vector $x$ is defined as a row vector of partial derivatives:
$$\nabla_x f=\frac{df}{dx}:=\left[\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right]\in\mathbb{R}^{1\times n}.$$
Note that the gradient collects the component-wise partial derivatives into a vector.
It is not uncommon to define the gradient as a column vector, following the convention that vectors are generally column vectors.
The reason we define the gradient as a row vector is twofold: (1) it generalizes consistently to vector-valued functions $f:\mathbb{R}^n\mapsto\mathbb{R}^m$, where the gradient becomes a matrix; (2) we can apply the multivariate chain rule immediately, without paying attention to the dimensions of the gradient.
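For instance, a minimal numeric sketch of the row-vector gradient via central differences (the helper name `numerical_gradient` and the example $f$ are illustrative choices, not any particular library's API):

```python
# Numerical gradient as a (1, n) row vector via central differences.
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Return df/dx as a row vector of shape (1, n)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[0, i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]    # f(x1, x2) = x1^2 + 3*x1*x2
print(numerical_gradient(f, [1.0, 2.0]))     # analytic: [2*x1 + 3*x2, 3*x1] = [8, 3]
```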
Hessian: For a real-valued function $f:\mathbb{R}^n\mapsto\mathbb{R},\ x\mapsto f(x)$, the Hessian is defined as the second derivative of $f$ with respect to $x$:
$$H\in\mathbb{R}^{n\times n},\quad H_{i,j}=\frac{\partial^2 f}{\partial x_i\partial x_j}.$$
It is a symmetric matrix (provided the second partial derivatives are continuous).
If $f:\mathbb{R}^n\mapsto\mathbb{R}^m,\ x\mapsto f(x)$, then the Jacobian is
$$J\in\mathbb{R}^{m\times n},\quad J_{i,j}=\frac{\partial f_i}{\partial x_j},$$
and the Hessian
$$H\in\mathbb{R}^{m\times n\times n},\quad H_{i,j,k}=\frac{\partial^2 f_i}{\partial x_j\partial x_k}$$
is an $(m\times n\times n)$-tensor.
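A companion sketch for the $(m\times n)$ Jacobian shape (again, `numerical_jacobian` and the example $f$ are illustrative helpers):

```python
# Numerical Jacobian of a vector-valued f: R^n -> R^m via central differences.
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Return J with J[i, j] = d f_i / d x_j, shape (m, n)."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])  # R^2 -> R^3
print(numerical_jacobian(f, [1.0, 2.0]).shape)                  # (3, 2)
```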
In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode.
The backpropagation algorithm used in machine learning is a special case of "automatic differentiation" from numerical analysis.
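To make this concrete, here is a toy scalar reverse-mode autodiff sketch; the `Var` class is a deliberately minimal illustration (not a real library API), and it propagates seeds along every path rather than using a topological order:

```python
class Var:
    """Toy reverse-mode autodiff node (illustrative only)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Propagate the upstream derivative along every path to the leaves;
        # gradients accumulate with +=, so shared inputs sum over paths.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(2.0), Var(3.0)
z = x * y + x            # z = x*y + x  =>  dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)    # 4.0 2.0
```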
[Theorem] Chain rule for multivariate composite functions:
For a real-valued function $f(x):\mathbb{R}^n\mapsto\mathbb{R},\ x=[x_1,x_2,\dots,x_n]$, and $x(u):\mathbb{R}^m\mapsto\mathbb{R}^n,\ x(u)=[x_1(u),x_2(u),\dots,x_n(u)],\ u=[u_1,\dots,u_m]$, then
$$\frac{\partial f}{\partial u_j}=\sum_{i=1}^{n}\frac{\partial f}{\partial x_i}\frac{\partial x_i}{\partial u_j},\qquad \frac{\partial f}{\partial \vec{u}}=\frac{\partial f}{\partial \vec{x}}\frac{\partial \vec{x}}{\partial \vec{u}},$$
where $\frac{\partial f}{\partial \vec{u}}\in\mathbb{R}^{1\times m}$ is the row-vector gradient of $f$ w.r.t. $\vec{u}$, and $\frac{\partial \vec{x}}{\partial \vec{u}}\in\mathbb{R}^{n\times m}$ is the Jacobian matrix of $\vec{x}$ w.r.t. $\vec{u}$.
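A numeric check of the chain rule, assuming the `numerical_gradient` and `numerical_jacobian` helpers sketched above are in scope (the functions $f$ and $x(u)$ are arbitrary examples):

```python
# Verify df/du = (df/dx)(dx/du) numerically.
import numpy as np

f = lambda x: x[0] ** 2 + x[1]                      # f: R^2 -> R
x_of_u = lambda u: np.array([u[0] * u[1], u[0]])    # x: R^2 -> R^2

u0 = np.array([1.5, -0.5])
# Direct gradient of the composition f(x(u)) w.r.t. u:
direct = numerical_gradient(lambda u: f(x_of_u(u)), u0)
# Chain rule: (1 x 2) row gradient times (2 x 2) Jacobian:
chained = numerical_gradient(f, x_of_u(u0)) @ numerical_jacobian(x_of_u, u0)
print(np.allclose(direct, chained))                 # True
```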
Total differential: for a real-valued function $f(x)$ of independent variables $x_1,\dots,x_n$,
$$df=\nabla_x f\cdot dx=\sum_i\frac{\partial f}{\partial x_i}dx_i.$$
[Def] Directional derivative of a multivariate function: For a function $f(\vec{x}):\mathbb{R}^n\mapsto\mathbb{R}$ and a constant unit vector (direction) $\vec{u}$, the directional derivative of $f(\vec{x})$ in the direction of $\vec{u}$ is
$$D_u f(x):=\lim_{h\to 0}\frac{f(x+hu)-f(x)}{h}.$$
Partial derivatives are then a special case of directional derivatives: taking $u=[0,\dots,0,1,0,\dots,0]$ (the $i$-th standard basis vector) recovers $\partial f/\partial x_i$.
[Theorem] For $f(x)$ and a unit vector $u$, the directional derivative
$$D_u f(x)=\nabla_x f\cdot u.$$
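A quick numeric check of this theorem, again assuming `numerical_gradient` from above is in scope ($f$, $x_0$, and $u$ are illustrative choices):

```python
# Compare the limit definition of D_u f with the dot product grad(f) . u.
import numpy as np

f = lambda x: x[0] ** 2 * x[1]                 # example function
x0 = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0                 # a unit vector

h = 1e-6
from_limit = (f(x0 + h * u) - f(x0)) / h       # definition of D_u f
from_dot = (numerical_gradient(f, x0) @ u).item()
print(from_limit, from_dot)                    # both approx 3.2 = [4, 1] . [0.6, 0.8]
```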
Series: the sum of infinitely many terms:
$$\sum_{n=1}^{\infty}a_n:=\lim_{N\to\infty}\sum_{n=1}^{N}a_n$$
The natural constant $e$:
$$e=\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^n=\sum_{n=0}^{\infty}\frac{1}{n!}$$
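Both characterizations can be compared numerically (the cut-offs $n=10^6$ and 20 series terms are arbitrary choices):

```python
# Compare the limit form and the series form of e against math.e.
import math

n = 10 ** 6
limit_form = (1 + 1 / n) ** n                              # (1 + 1/n)^n
series_form = sum(1 / math.factorial(k) for k in range(20))
print(limit_form, series_form, math.e)                     # the series converges much faster
```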
Geometric series:
$$\sum_{n=0}^{\infty}ar^n=a+ar+ar^2+\cdots$$
is convergent if $|r|<1$, and the sum is
$$\sum_{n=0}^{\infty}ar^n=\frac{a}{1-r},\quad |r|<1.$$
If $|r|\geq 1$, the geometric series is divergent.
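A quick numeric illustration of the closed form ($a=2$, $r=0.5$ are arbitrary choices):

```python
# Partial sums of a geometric series vs. the closed form a / (1 - r), |r| < 1.
a, r = 2.0, 0.5
partial_sum = sum(a * r ** n for n in range(50))
print(partial_sum, a / (1 - r))                # both 4.0 (to machine precision)
```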
Taylor series:
For a real-valued function $f:\mathbb{R}\mapsto\mathbb{R},\ x\mapsto f(x)$, the Taylor series at $x_0$ is defined as
$$f(x)=\sum_{k=0}^{\infty}\frac{D_x^k f(x_0)}{k!}(x-x_0)^k,$$
where $D_x^k f(x_0)$ is the $k$-th derivative of $f$ with respect to $x$, evaluated at $x_0$.
Multivariate Taylor series: For a real-valued function $f:\mathbb{R}^n\mapsto\mathbb{R},\ x\mapsto f(x),\ x\in\mathbb{R}^n$, the multivariate Taylor series at $x_0\in\mathbb{R}^n$ is defined as
$$f(x)=\sum_{k=0}^{\infty}\frac{D_x^k f(x_0)}{k!}(x-x_0)^{\otimes k},$$
where $D_x^k f(x_0)$ is the $k$-th (total) derivative of $f$ with respect to $x$, evaluated at $x_0$, and $(x-x_0)^{\otimes k}$ is the $k$-fold outer product of the vector $(x-x_0)$ with itself. Note that $D_x^k f(x)$ and $(x-x_0)^{\otimes k}$ are both $k$-th order tensors, i.e. $D_x^k f(x),(x-x_0)^{\otimes k}\in\mathbb{R}^{n\times\dots\times n}$ ($k$ times).
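As an illustration, truncating the series at $k=2$ gives the familiar quadratic approximation $f(x)\approx f(x_0)+\nabla_x f(x_0)\,(x-x_0)+\frac{1}{2}(x-x_0)^T H (x-x_0)$. A numeric sketch follows, assuming `numerical_gradient` from above is in scope; `numerical_hessian` and the example $f$ are illustrative:

```python
# Second-order Taylor approximation around x0 using numeric gradient and Hessian.
import numpy as np

f = lambda x: np.exp(x[0]) * np.sin(x[1])

def numerical_hessian(f, x, h=1e-4):
    """H[i, j] = d^2 f / dx_i dx_j via central differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

x0 = np.array([0.0, 1.0])
d = np.array([0.1, -0.05])                     # a small displacement x - x0
grad = numerical_gradient(f, x0)               # (1, n) row vector, from above
H = numerical_hessian(f, x0)
taylor2 = f(x0) + (grad @ d).item() + 0.5 * d @ H @ d
print(taylor2, f(x0 + d))                      # close for small d
```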