Deep Learning Week 4 Notes

1. DAG Networks

If \((a_1,\dots,a_Q) = \phi(b_1,\dots,b_R)\), we use the notation:

\[\begin{align} \left[\frac{\partial a}{\partial b}\right] &= J_{\phi}^T = \begin{pmatrix} \frac{\partial a_1}{\partial b_1} & \cdots & \frac{\partial a_Q}{\partial b_1}\\ \vdots & \ddots & \vdots\\ \frac{\partial a_1}{\partial b_R} & \cdots & \frac{\partial a_Q}{\partial b_R} \end{pmatrix} \end{align} \]
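As a quick sanity check of this convention, the sketch below (the map \(\phi\) is a made-up example, not from the lecture) computes the Jacobian with torch.autograd.functional.jacobian and transposes it:

import torch

# phi: R^3 -> R^2, a hypothetical example: a_1 = b_1*b_2, a_2 = b_2 + b_3
def phi(b):
    return torch.stack((b[0] * b[1], b[1] + b[2]))

b = torch.tensor([ 1., 2., 3. ])
# jacobian returns J_phi of shape (Q, R), with J[q, r] = da_q / db_r
J = torch.autograd.functional.jacobian(phi, b)
# its transpose, of shape (R, Q), is the [da/db] of the notation above
print(J.t())  # tensor([[2., 0.], [1., 1.], [0., 1.]])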

Siamese Networks

\(\text{Network 1 and network 2 use }\textbf{weight sharing}.\text{ The reason behind }\textbf{weight sharing }\text{is that }\textbf{if a processing is good for one element of the pair, it is also good for the other element.}\)

\(\text{The two elements go through the same processing before being recombined by a final processing }\Phi,\text{ which can be a distance measure.}\)
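A minimal sketch of such an architecture (the layer sizes and the choice of \(\Phi\) as a squared Euclidean distance are illustrative assumptions, not the lecture's model):

import torch
from torch import nn

class Siamese(nn.Module):
    def __init__(self):
        super().__init__()
        # the single module f is applied to both inputs: weight sharing
        self.f = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 8))

    def forward(self, x1, x2):
        z1, z2 = self.f(x1), self.f(x2)
        # Phi: here a squared Euclidean distance between the two embeddings
        return (z1 - z2).pow(2).sum(dim = 1)

x1, x2 = torch.randn(5, 10), torch.randn(5, 10)
print(Siamese()(x1, x2).size())  # torch.Size([5])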

2. Autograd

requires_grad:

>>> x = torch.tensor([ 1., 2. ])
>>> y = torch.tensor([ 4., 5. ])
>>> z = torch.tensor([ 7., 3. ])
>>> x.requires_grad
False
>>> (x + y).requires_grad
False
>>> z.requires_grad = True
>>> (x + z).requires_grad
True

\(\textbf{Only tensors of floating point dtype can require gradients.}\)

>>> x = torch.tensor([1., 10.])
>>> x.requires_grad = True
>>> x = torch.tensor([1, 10])
>>> x.requires_grad = True
Traceback (most recent call last):
/.../
RuntimeError: only Tensors of floating point dtype can require gradients

torch.autograd.grad(outputs, inputs) computes the gradients of outputs with respect to inputs:

>>> t = torch.tensor([1., 2., 4.]).requires_grad_()
>>> u = torch.tensor([10., 20.]).requires_grad_()
>>> a = t.pow(2).sum() + u.log().sum()
>>> torch.autograd.grad(a, (t, u))
(tensor([2., 4., 8.]), tensor([0.1000, 0.0500]))

\[a = \sum_i t_i^2+\sum_i \log{u_i} \]

Tensor.backward(): accumulates gradients in the grad field of tensors which are not the results of operations (i.e. \(\textbf{leaves}\)):

>>> x = torch.tensor([ -3., 2., 5. ]).requires_grad_()
>>> u = x.pow(3).sum()
>>> x.grad
>>> u.backward()
>>> x.grad
tensor([27., 12., 75.])

\(\large\textbf{Note: }\)Tensor.backward() accumulates the gradients, so they have to be reset to \(0\) before the next call.
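For instance (a small check, assuming torch has been imported), calling backward() twice without resetting grad sums the two gradients:

>>> x = torch.tensor([ 1., 2. ]).requires_grad_()
>>> x.pow(2).sum().backward()
>>> x.grad
tensor([2., 4.])
>>> x.pow(2).sum().backward()
>>> x.grad
tensor([4., 8.])
>>> x.grad = None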

Autograd graph

>>> x = torch.tensor([ 1.0, -2.0, 3.0, -4.0 ]).requires_grad_()
>>> a = x.abs()
>>> s = a.sum()
>>> s
tensor(10., grad_fn=<SumBackward0>)
>>> s.grad_fn.next_functions
((<AbsBackward object at 0x7ffb2b1462b0>, 0),)
>>> s.grad_fn.next_functions[0][0].next_functions
((<AccumulateGrad object at 0x7ffb2b146278>, 0),)

\(\Large\textbf{For the details, read: }\)Lecture

torch.no_grad(): switches off the autograd machinery, and can be used for operations such as parameter updates.

# x, y (the training set) and eta (the learning rate) are assumed to be defined
w = torch.empty(10, 784).normal_(0, 1e-3).requires_grad_()
b = torch.empty(10).normal_(0, 1e-3).requires_grad_()

for k in range(10001):
    y_hat = x @ w.t() + b
    loss = (y_hat - y).pow(2).mean()
    w.grad, b.grad = None, None
    loss.backward()

    with torch.no_grad():
        w -= eta * w.grad
        b -= eta * b.grad

detach(): creates a tensor which shares the data, but does \(\textbf{not require gradient computation}\), and is \(\textbf{not connected to the current graph}\).
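A small check (assuming torch has been imported): the detached tensor does not require grad, but modifying it in place also modifies the original, since the data is shared:

>>> x = torch.tensor([ 1., 2. ]).requires_grad_()
>>> y = x.detach()
>>> y.requires_grad
False
>>> y[0] = 7.
>>> x
tensor([7., 2.], requires_grad=True)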

\(\Large\textbf{Comparison:}\)

a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))

    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
print(a, b)

tensor(0.3333, requires_grad=True) tensor(-0.3333, requires_grad=True)

\[\begin{align} l &= (a-1)^2+(b+1)^2+(a-b)^2\\ \frac{dl}{da} &= 2(a-1)+2(a-b) = 4a-2b-2\\ \frac{dl}{db}&= 2(b+1)-2(a-b)=4b-2a+2 \end{align} \]

\(\text{Let }\frac{dl}{da}=\frac{dl}{db}=0,\text{ we get: } a=1/3, b=-1/3\)

a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a.detach() - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))

    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
print(a, b)

tensor(1.0000, requires_grad=True) tensor(-8.2480e-08, requires_grad=True)

\(\textbf{This time:}\)

\[\begin{align} \frac{dl}{da}&=2(a-1)=0\\ \frac{dl}{db}&=2(b+1)-2(a-b)=0 \end{align} \]

\(\large\textbf{By default, autograd frees the computational graph once it has been used.}\)

>>> x = torch.tensor([1.]).requires_grad_()
>>> z = 1/x
>>> torch.autograd.grad(z, x)
(tensor([-1.]),)
>>> torch.autograd.grad(z * z, x)
Traceback (most recent call last):
/.../
RuntimeError: Trying to backward through the graph a second time, but
the buffers have already been freed.

We can use retain_graph = True to keep it:

>>> x = torch.tensor([1.]).requires_grad_()
>>> z = 1/x
>>> torch.autograd.grad(z, x, retain_graph = True)
(tensor([-1.]),)
>>> torch.autograd.grad(z * z, x)
(tensor([-2.]),)

Higher-order derivatives

\[\begin{align} \psi(x_1,x_2)&=\log{x_1}+x_2^2\\ ||\nabla\psi||_2^2&=(1/x_1)^2+(2x_2)^2\\ \nabla||\nabla\psi||_2^2&=(-2/x_1^3,8x_2) \end{align} \]

To differentiate a quantity that is itself built from gradients, we use create_graph = True:

>>> x = torch.tensor([2., 3.]).requires_grad_()
>>> psi = x[0].log() + x[1].pow(2)
>>> g, = torch.autograd.grad(psi, x, create_graph = True)
>>> torch.autograd.grad(g.pow(2).sum(), x)
(tensor([-0.2500, 24.0000]),)

3. PyTorch modules and batch processing

nn.Linear(in_features, out_features, bias = True):
\(\mathbb{R}^C\rightarrow\mathbb{R}^D\)

Input: \(N\times C\); Output: \(N\times D\)

>>> f = nn.Linear(in_features = 10, out_features = 4)
>>> for n, p in f.named_parameters(): print(n, p.size())
...
weight torch.Size([4, 10])
bias torch.Size([4])
>>> x = torch.randn(523, 10)
>>> y = f(x)
>>> y.size()
torch.Size([523, 4])

nn.MSELoss():

>>> f = nn.MSELoss()
>>> x = torch.tensor([[ 3. ]])
>>> y = torch.tensor([[ 0. ]])
>>> f(x, y)
tensor(9.)
>>> x = torch.tensor([[ 3., 0., 0., 0. ]])
>>> y = torch.tensor([[ 0., 0., 0., 0. ]])
>>> f(x, y)
tensor(2.2500)

\(\large\textbf{Note: }\text{Criteria (loss functions) do not accept a target with }\)requires_grad = True:

>>> import torch
>>> f = nn.MSELoss()
>>> x = torch.tensor([ 3., 2. ]).requires_grad_()
>>> y = torch.tensor([ 0., -2. ]).requires_grad_()
>>> f(x, y)
Traceback (most recent call last):
/.../
AssertionError: nn criterions don't compute the gradient w.r.t.
targets - please mark these tensors as not requiring gradients

4. Convolutions

\(\text{Formally, in 1d, given a signal}\)

\[\begin{align} x = (x_1,...,x_W) \end{align} \]

\(\text{and a }\textbf{convolution kernel/filter }\text{of width }w:\)

\[\begin{align} u = (u_1,...,u_w) \end{align} \]

\(\text{the convolution }x\circledast u\text{ is a vector of size }W-w+1:\)

\[\begin{align} (x\circledast u)_i &=\sum_{j=1}^w u_jx_{j+i-1}\\ &=(x_i,...,x_{i+w-1})\cdot u \end{align} \]
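A quick numerical check of this formula (the signal and kernel values below are made up); note that F.conv1d computes exactly this cross-correlation, so the two results match:

import torch
import torch.nn.functional as F

x = torch.tensor([ 1., 2., 3., 4., 5. ])   # W = 5
u = torch.tensor([ 1., 0., -1. ])          # w = 3
# direct application of the formula: (x conv u)_i = (x_i, ..., x_{i+w-1}) . u
manual = torch.stack([ (x[i:i+3] * u).sum() for i in range(5 - 3 + 1) ])
print(manual)                                                 # tensor([-2., -2., -2.])
print(F.conv1d(x.view(1, 1, -1), u.view(1, 1, -1)).view(-1))  # tensor([-2., -2., -2.])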

\(\textbf{Most usual form: processing a 3d tensor (a multi-channel 2d signal).}\)
\(\text{In this case:}\)

input: (C, H, W)
kernel: (C, h, w)
output: (H-h+1, W-w+1)

\(\Large\textbf{For pictures: see }\)lecture

F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
weight: \(D\times C\times h\times w\)
input: \(N\times C\times H\times W\)
output: \(N\times D\times (H-h+1)\times (W-w+1)\)

>>> weight = torch.randn(5, 4, 2, 3)
>>> bias = torch.randn(5)
>>> input = torch.randn(117, 4, 10, 3)
>>> output = F.conv2d(input, weight, bias)
>>> output.size()
torch.Size([117, 5, 9, 1])

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
\(\text{Example:}\)

>>> f = nn.Conv2d(in_channels = 4, out_channels = 5, kernel_size = (2, 3))
>>> for n, p in f.named_parameters(): print(n, p.size())
...
weight torch.Size([5, 4, 2, 3])
bias torch.Size([5])
>>> x = torch.randn(117, 4, 10, 3)
>>> y = f(x)
>>> y.size()
torch.Size([117, 5, 9, 1])

Padding, Stride, Dilation

\(\text{Refer to }\)Lecture

  • \(H_{out} = \lfloor \frac{H_{in}+2\times \text{padding}[0]-\text{dilation}[0]\times(\text{kernel\_size}[0]-1)-1}{\text{stride}[0]} +1\rfloor \)
  • \(W_{out} = \lfloor \frac{W_{in}+2\times \text{padding}[1]-\text{dilation}[1]\times(\text{kernel\_size}[1]-1)-1}{\text{stride}[1]} +1\rfloor \)
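A quick numerical check of these formulas with made-up values (kernel_size = 3, stride = 2, padding = 1, dilation = 1 on a 10×10 input):

import torch
from torch import nn

f = nn.Conv2d(1, 1, kernel_size = 3, stride = 2, padding = 1, dilation = 1)
x = torch.randn(1, 1, 10, 10)
# floor((10 + 2*1 - 1*(3-1) - 1)/2 + 1) = floor(5.5) = 5
print(f(x).size())  # torch.Size([1, 1, 5, 5])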

Pooling

\(\text{Refer to }\)Lecture
\(\large\text{Contrary to convolution, pooling is applied independently on each channel. The output has as many channels as the input.}\)

\(\textbf{Pooling provides invariance to any permutation inside each of the pooling cells.}\)
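For instance (a small made-up illustration), permuting the values inside a 2×2 pooling cell leaves the max-pooled output unchanged:

import torch
import torch.nn.functional as F

x = torch.tensor([[[[ 1., 2. ], [ 3., 4. ]]]])  # a single 2x2 cell
y = torch.tensor([[[[ 4., 3. ], [ 2., 1. ]]]])  # same values, permuted
print(F.max_pool2d(x, 2))  # tensor([[[[4.]]]])
print(F.max_pool2d(y, 2))  # tensor([[[[4.]]]])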
