Deep Learning Week4 Notes
1. DAG Networks
$\text{If }(a_1,\dots,a_Q) = \phi(b_1,\dots,b_R),\text{ we use the notation:} $
Siamese Networks
\(\text{The two networks (network 1 and network 2) }\textbf{share their weights}.\text{ The reason behind }\textbf{weight sharing }\text{is that }\textbf{if a processing is good for one element of the pair, it is also good for the other element.}\)
\(\text{The two elements go through the same processing before being recombined by a final processing }\Phi,\text{ which can be a distance measure.}\)
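A minimal sketch of the weight sharing described above, assuming PyTorch. The class name `Siamese`, the layer sizes, and the choice of the Euclidean distance for \(\Phi\) are illustrative assumptions, not taken from the lecture.

```python
import torch
from torch import nn

class Siamese(nn.Module):
    # Hypothetical example: one branch is applied to both elements of the pair
    def __init__(self, dim_in=8, dim_hidden=16):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.ReLU(),
            nn.Linear(dim_hidden, dim_hidden),
        )

    def forward(self, x1, x2):
        # The same module, hence the same weights, processes both inputs
        h1, h2 = self.branch(x1), self.branch(x2)
        # Phi is assumed here to be the Euclidean distance between the two outputs
        return (h1 - h2).pow(2).sum(dim=1).sqrt()

model = Siamese()
x1, x2 = torch.randn(5, 8), torch.randn(5, 8)
d = model(x1, x2)
print(d.size())  # one distance per pair in the batch
```

Because there is only one `branch`, the model holds a single set of parameters (two weight matrices and two biases), and feeding the same input twice yields a distance of zero.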
2. Autograd
requires_grad
:
>>> x = torch.tensor([ 1., 2. ])
>>> y = torch.tensor([ 4., 5. ])
>>> z = torch.tensor([ 7., 3. ])
>>> x.requires_grad
False
>>> (x + y).requires_grad
False
>>> z.requires_grad = True
>>> (x + z).requires_grad
True
\(\textbf{Only tensors of floating point dtype can require gradients.}\)
>>> x = torch.tensor([1., 10.])
>>> x.requires_grad = True
>>> x = torch.tensor([1, 10])
>>> x.requires_grad = True
Traceback (most recent call last):
/.../
RuntimeError: only Tensors of floating point dtype can require gradients
torch.autograd.grad(outputs, inputs)
computes the gradients of outputs
with respect to inputs
:
>>> t = torch.tensor([1., 2., 4.]).requires_grad_()
>>> u = torch.tensor([10., 20.]).requires_grad_()
>>> a = t.pow(2).sum() + u.log().sum()
>>> torch.autograd.grad(a, (t, u))
(tensor([2., 4., 8.]), tensor([0.1000, 0.0500]))
Tensor.backward()
: accumulates gradients in the grad
field of tensors which are not the results of operations (i.e. \(\textbf{leaves}\)):
>>> x = torch.tensor([ -3., 2., 5. ]).requires_grad_()
>>> u = x.pow(3).sum()
>>> x.grad
>>> u.backward()
>>> x.grad
tensor([27., 12., 75.])
\(\large\textbf{Note: }\)Tensor.backward()
accumulates the gradients, so they have to be set to \(0\) before calling it.
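The accumulation can be observed directly. The sketch below reuses the tensor from the example above and calls `backward()` several times; the variable names `first`, `second` and `third` are only for illustration.

```python
import torch

x = torch.tensor([-3., 2., 5.]).requires_grad_()

x.pow(3).sum().backward()
first = x.grad.clone()       # 3 * x^2

x.pow(3).sum().backward()    # the new gradient is added to x.grad
second = x.grad.clone()

x.grad.zero_()               # reset in place before the next backward()
x.pow(3).sum().backward()
third = x.grad.clone()

print(first)   # tensor([27., 12., 75.])
print(second)  # tensor([ 54.,  24., 150.])
print(third)   # tensor([27., 12., 75.])
```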
Autograd graph
>>> x = torch.tensor([ 1.0, -2.0, 3.0, -4.0 ]).requires_grad_()
>>> a = x.abs()
>>> s = a.sum()
>>> s
tensor(10., grad_fn=<SumBackward0>)
>>> s.grad_fn.next_functions
((<AbsBackward object at 0x7ffb2b1462b0>, 0),)
>>> s.grad_fn.next_functions[0][0].next_functions
((<AccumulateGrad object at 0x7ffb2b146278>, 0),)
\(\Large\textbf{For the details, read: }\)Lecture
torch.no_grad()
: switches off the autograd machinery, and can be used for operations such as parameter updates.
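# x, y (inputs and targets) and eta (learning rate) are assumed to be defined beforehand
w = torch.empty(10, 784).normal_(0, 1e-3).requires_grad_()
b = torch.empty(10).normal_(0, 1e-3).requires_grad_()
for k in range(10001):
    y_hat = x @ w.t() + b
    loss = (y_hat - y).pow(2).mean()
    # Reset the gradients, since backward() accumulates them
    w.grad, b.grad = None, None
    loss.backward()
    # The parameter update itself must not be tracked by autograd
    with torch.no_grad():
        w -= eta * w.grad
        b -= eta * b.grad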
detach()
: creates a tensor which shares the data, but does \(\textbf{not require gradient computation}\), and is \(\textbf{not connected to the current graph}\).
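A small sketch of these three properties, assuming a recent PyTorch. The tensor values are arbitrary, and `data_ptr()` is used here only as one way to show that the storage is shared.

```python
import torch

a = torch.tensor([1., 2.]).requires_grad_()
d = a.detach()

print(d.requires_grad)               # False
print(d.grad_fn)                     # None: d is not connected to any graph
print(d.data_ptr() == a.data_ptr())  # True: same underlying memory

# In a * d, the factor d is treated as a constant, so the gradient
# with respect to a is d itself, not 2 * a.
g, = torch.autograd.grad((a * d).sum(), a)
print(g)  # tensor([1., 2.])
```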
\(\Large\textbf{Comparison:}\)
# eta (learning rate) is assumed to be defined beforehand
a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
print(a, b)
tensor(0.3333, requires_grad=True) tensor(-0.3333, requires_grad=True)
\(\text{Setting }\frac{\partial l}{\partial a}=\frac{\partial l}{\partial b}=0\text{ gives } a=1/3,\ b=-1/3.\)
# Same loop, but a is detached in the last term
a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a.detach() - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
print(a, b)
tensor(1.0000, requires_grad=True) tensor(-8.2480e-08, requires_grad=True)
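\(\textbf{In this case, }\text{the detached }a\text{ in the last term is treated as a constant, so }\frac{\partial l}{\partial a}=2(a-1)\text{ and }\frac{\partial l}{\partial b}=2(b+1)-2(a-b)\text{. Setting both to }0\text{ gives }a=1,\ b=0.\)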
\(\large\textbf{By default, autograd deletes the computational graph when it's used.}\)
>>> x = torch.tensor([1.]).requires_grad_()
>>> z = 1/x
>>> torch.autograd.grad(z, x)
(tensor([-1.]),)
>>> torch.autograd.grad(z * z, x)
Traceback (most recent call last):
/.../
RuntimeError: Trying to backward through the graph a second time, but
the buffers have already been freed.
We can pass retain_graph = True to keep the graph:
>>> x = torch.tensor([1.]).requires_grad_()
>>> z = 1/x
>>> torch.autograd.grad(z, x, retain_graph = True)
(tensor([-1.]),)
>>> torch.autograd.grad(z * z, x)
(tensor([-2.]),)
Higher-order derivatives
To compute them, we use create_graph = True:
>>> x = torch.tensor([2., 3.]).requires_grad_()
>>> psi = x[0].log() + x[1].pow(2)
>>> g, = torch.autograd.grad(psi, x, create_graph = True)
>>> torch.autograd.grad(g.pow(2).sum(), x)
(tensor([-0.2500, 24.0000]),)
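The result above can be checked by hand. A short derivation, writing \(x=(x_0,x_1)\):

\(\psi(x)=\log x_0+x_1^2,\qquad g=\nabla\psi(x)=\left(\frac{1}{x_0},\ 2x_1\right)\)

\(\|g\|^2=\frac{1}{x_0^2}+4x_1^2,\qquad \nabla\|g\|^2=\left(-\frac{2}{x_0^3},\ 8x_1\right)\)

\(\text{At }x=(2,3)\text{ this gives }\left(-\frac{2}{8},\ 24\right)=(-0.25,\ 24).\)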
3. PyTorch modules and batch processing
nn.Linear(in_features, out_features, bias = True)
:
\(\mathbb{R}^C\rightarrow\mathbb{R}^D\)
Input: \(N\times C\); Output: \(N\times D\)
>>> f = nn.Linear(in_features = 10, out_features = 4)
>>> for n, p in f.named_parameters(): print(n, p.size())
...
weight torch.Size([4, 10])
bias torch.Size([4])
>>> x = torch.randn(523, 10)
>>> y = f(x)
>>> y.size()
torch.Size([523, 4])
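As a sanity check, the module can be compared with the affine map written out by hand. This is a minimal sketch; the tolerance passed to `torch.allclose` is an arbitrary choice to absorb floating point rounding.

```python
import torch
from torch import nn

f = nn.Linear(in_features=10, out_features=4)
x = torch.randn(523, 10)

y = f(x)
# weight has size (out_features, in_features), hence the transpose
y_manual = x @ f.weight.t() + f.bias

print(y.size())
print(torch.allclose(y, y_manual, atol=1e-5))
```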
nn.MSELoss()
:
>>> f = nn.MSELoss()
>>> x = torch.tensor([[ 3. ]])
>>> y = torch.tensor([[ 0. ]])
>>> f(x, y)
tensor(9.)
>>> x = torch.tensor([[ 3., 0., 0., 0. ]])
>>> y = torch.tensor([[ 0., 0., 0., 0. ]])
>>> f(x, y)
tensor(2.2500)
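The value 2.25 comes from averaging the squared differences over all four components. A small sketch that makes this explicit and contrasts it with `reduction='sum'`:

```python
import torch
from torch import nn

x = torch.tensor([[3., 0., 0., 0.]])
y = torch.zeros(1, 4)

mean_loss = nn.MSELoss()(x, y)                # default: mean over all elements
manual = (x - y).pow(2).mean()                # 9 / 4
sum_loss = nn.MSELoss(reduction='sum')(x, y)  # no averaging

print(mean_loss, manual, sum_loss)  # tensor(2.2500) tensor(2.2500) tensor(9.)
```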
\(\large\textbf{Note: }\text{In the PyTorch version used for these notes, criteria (loss functions) do not accept a target with }\)requires_grad = True
:
>>> import torch
>>> from torch import nn
>>> f = nn.MSELoss()
>>> x = torch.tensor([ 3., 2. ]).requires_grad_()
>>> y = torch.tensor([ 0., -2. ]).requires_grad_()
>>> f(x, y)
Traceback (most recent call last):
/.../
AssertionError: nn criterions don't compute the gradient w.r.t.
targets - please mark these tensors as not requiring gradients
4. Convolutions
\(\text{Formally, in 1d, given an input }x=(x_1,\dots,x_W)\text{ of size }W\text{ and a}\)
\(\textbf{convolution kernel/filter }u=(u_1,\dots,u_w)\text{ of width }w,\)
\(\text{the convolution }x\circledast u\text{ is a vector of size }W-w+1.\)
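The definition above can be written out directly. A minimal pure-Python sketch, assuming the convention used by PyTorch in which the kernel is not flipped, i.e. \((x\circledast u)_i=\sum_{j=1}^{w}x_{i-1+j}\,u_j\). The function name `conv1d` and the sample values are illustrative.

```python
def conv1d(x, u):
    # Slide the kernel u over x and take the dot product at each position
    W, w = len(x), len(u)
    return [sum(x[i + j] * u[j] for j in range(w)) for i in range(W - w + 1)]

x = [1, 3, 0, 2, 5]
u = [1, -1]
result = conv1d(x, u)
print(result)  # [-2, 3, -2, -3]
```

The result has \(5-2+1=4\) entries, as expected from \(W-w+1\).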
\(\textbf{Most usual form: processing a 3d tensor (multi-channel 2d signal)}\)
\(\text{In this case:}\)
Input: (C,H,W)
Kernel: (C,h,w)
Output: (H-h+1,W-w+1)
\(\Large\textbf{For pictures: see }\)lecture
F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
weight
: \(D\times C\times h\times w\)
input
: \(N\times C\times H\times W\)
Output: \(N\times D\times (H-h+1)\times (W-w+1)\) (with the default stride, padding and dilation)
>>> weight = torch.randn(5, 4, 2, 3)
>>> bias = torch.randn(5)
>>> input = torch.randn(117, 4, 10, 3)
>>> output = F.conv2d(input, weight, bias)
>>> output.size()
torch.Size([117, 5, 9, 1])
torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
\(\text{Example:}\)
>>> f = nn.Conv2d(in_channels = 4, out_channels = 5, kernel_size = (2, 3))
>>> for n, p in f.named_parameters(): print(n, p.size())
...
weight torch.Size([5, 4, 2, 3])
bias torch.Size([5])
>>> x = torch.randn(117, 4, 10, 3)
>>> y = f(x)
>>> y.size()
torch.Size([117, 5, 9, 1])
Padding, Stride, Dilation
\(\text{Refer to }\)Lecture
- $H_{out} = \left\lfloor \frac{H_{in}+2\times \text{padding}[0]-\text{dilation}[0]\times(\text{kernel\_size}[0]-1)-1}{\text{stride}[0]} +1\right\rfloor $
- $W_{out} = \left\lfloor \frac{W_{in}+2\times \text{padding}[1]-\text{dilation}[1]\times(\text{kernel\_size}[1]-1)-1}{\text{stride}[1]} +1\right\rfloor $
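The two formulas can be evaluated with a small helper. This is a sketch in plain Python; the function name `conv_output_size` is hypothetical, and integer division plays the role of the floor.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    # floor((size_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Shapes from the conv2d examples in these notes: a 10 x 3 input and a 2 x 3 kernel
h_out = conv_output_size(10, 2)
w_out = conv_output_size(3, 3)
print(h_out, w_out)  # 9 1

# Effect of the other parameters on a size-10 input with a size-3 kernel
print(conv_output_size(10, 3, stride=2, padding=1))  # 5
print(conv_output_size(10, 3, dilation=2))           # 6
```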
Pooling
\(\text{Refer to }\)Lecture
\(\large\text{Contrary to convolution, pooling is applied independently on each channel, so the output has as many channels as the input.}\)
\(\textbf{Pooling provides invariance to any permutation inside one of the cells.}\)
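Both properties can be checked on a small tensor. The sketch below assumes max pooling with 2 x 2 cells via `F.max_pool2d`; the input values and the choice of which cell to permute are arbitrary.

```python
import torch
import torch.nn.functional as F

# One sample, two channels, 4 x 4 values
x = torch.arange(2 * 4 * 4, dtype=torch.float).view(1, 2, 4, 4)
y = F.max_pool2d(x, kernel_size=2)
print(y.size())  # torch.Size([1, 2, 2, 2]): the channel count is unchanged

# Permute the values inside the top-left cell of channel 0
x_perm = x.clone()
x_perm[0, 0, 0:2, 0:2] = x[0, 0, 0:2, 0:2].flip(dims=(0, 1))
y_perm = F.max_pool2d(x_perm, kernel_size=2)

print(torch.equal(y, y_perm))  # True: the maximum of each cell is unchanged
```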