Deep Learning Week 5 Notes
1. Cross-Entropy Loss
\(\textbf{MSE }\text{is justified in a Euclidean space, but it makes no sense in the classification context, because the class values do not have any topological structure.}\)
Generalize the logistic regression to \(C\) classes
\(\text{Logits: the model computes }f_1(x;w),\dots,f_C(x;w)\text{ and the posterior probabilities }\hat P(Y=y\mid X=x)=\dfrac{\exp f_y(x;w)}{\sum_k \exp f_k(x;w)}\)
\(\text{From which: }\log\hat P(Y=y\mid X=x)=f_y(x;w)-\log\sum_k \exp f_k(x;w)\)
\(\large\text{Ignoring the penalty on }w,\text{ maximum-likelihood estimation minimizes:}\)
\(\mathcal{L}(w)=-\dfrac{1}{N}\sum_{n=1}^{N}\log\hat P(Y=y_n\mid X=x_n)=\dfrac{1}{N}\sum_{n=1}^{N}\Big(\log\sum_k\exp f_k(x_n;w)-f_{y_n}(x_n;w)\Big)\)
\(\\\)
\(\large\textbf{Cross-Entropy: }H(p,q)=-\sum_{k}p(k)\log q(k)\)
\(\text{Rewrite the loss function as the average cross-entropy between the empirical class distribution }\delta_{y_n}\text{ and the predicted one:}\)
\(\mathcal{L}(w)=\dfrac{1}{N}\sum_{n=1}^{N}H\big(\delta_{y_n},\,\hat P(Y=\cdot\mid X=x_n)\big)\)
\(\text{PyTorch implements this loss, taking the logits directly as input, as }\)torch.nn.CrossEntropyLoss:
>>> f = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
>>> target = torch.tensor([0, 1])
>>> criterion = torch.nn.CrossEntropyLoss()
>>> criterion(f, target)
tensor(2.5141)
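\(\text{As a sanity check, computing the value by hand with the loss written above (the targets select }f_0\text{ of the first row and }f_1\text{ of the second):}\)
\(\mathcal{L}=\dfrac{1}{2}\Big(-\log\dfrac{e^{-1}}{e^{-1}+e^{-3}+e^{4}}-\log\dfrac{e^{3}}{e^{-3}+e^{3}+e^{-1}}\Big)\approx\dfrac{1}{2}\,(5.0078+0.0206)\approx 2.514\)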
\(\\\)
\(\textbf{The cross-entropy loss can be seen as the composition of a log-soft-max, which remaps the logits to log-probabilities, and the negative log-likelihood loss.}\)
\(\text{This can be done with }\)torch.nn.LogSoftmax and torch.nn.NLLLoss:
>>> f = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
>>> target = torch.tensor([0, 1])
>>> model = torch.nn.LogSoftmax(dim = 1)
>>> criterion = torch.nn.NLLLoss()
>>> criterion(model(f), target)
tensor(2.5141)
\(\large\text{Hence, if a network should compute log-probabilities, it may have a }\)nn.LogSoftmax
\(\large\text{ final layer, and be trained with }\)nn.NLLLoss.
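\(\text{For instance, a minimal sketch of such a network (the layer sizes are arbitrary, chosen only for illustration):}\)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 3),
    torch.nn.LogSoftmax(dim = 1),  # final layer outputs log-probabilities
)
criterion = torch.nn.NLLLoss()     # trained with the negative log-likelihood loss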
2. SGD
\(\textbf{Mini-batch:}\)
for e in range(nb_epochs):
    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input[b:b+batch_size])
        loss = criterion(output, train_target[b:b+batch_size])
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters(): p -= eta * p.grad
Momentum and moment estimation
\(\text{The plain mini-batch SGD update is }w_{t+1}=w_t-\eta\sum_{b=1}^{B}\nabla\ell_{n(t,b)}(w_t),\)
\(\text{where }n(t,b)\text{ is the index of the }b\text{-th sample of the mini-batch used at iteration }t\)
\(\\\)
\(\textbf{First improvement, the 'momentum':}\text{ keep a moving average of the update direction (see the code sketch after the list below):}\)
\(u_t=\gamma\,u_{t-1}+\eta\sum_{b=1}^{B}\nabla\ell_{n(t,b)}(w_t),\qquad w_{t+1}=w_t-u_t\)
\(\gamma =0:\text{ the same as plain SGD; a commonly used value is }\gamma=0.9\)
\(\textbf{Nice properties:}\)
- \(\text{ Can go through local barriers}\)
- \(\text{Accelerates if the gradient does not change much}\)
- \(\text{Dampens oscillations in narrow valleys}\)
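\(\text{A minimal sketch of the momentum update applied by hand, reusing the variables of the mini-batch loop above (}\gamma=0.9\text{ is an assumed, usual value):}\)
gamma = 0.9
u = [torch.zeros_like(p) for p in model.parameters()]  # one velocity buffer per parameter

for e in range(nb_epochs):
    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input[b:b+batch_size])
        loss = criterion(output, train_target[b:b+batch_size])
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, up in zip(model.parameters(), u):
                up.mul_(gamma).add_(eta * p.grad)  # u_t = gamma * u_{t-1} + eta * g_t
                p -= up                            # w_{t+1} = w_t - u_t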
\(\\\)
Adam
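\(\text{For reference, the standard Adam update, with }g_t\text{ the mini-batch gradient and }\beta_1,\beta_2,\epsilon\text{ its usual hyper-parameters, combines momentum with a per-coordinate rescaling by moment estimates:}\)
\(m_t=\beta_1 m_{t-1}+(1-\beta_1)\,g_t,\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)\,g_t\odot g_t\)
\(\hat m_t=\dfrac{m_t}{1-\beta_1^{t}},\qquad \hat v_t=\dfrac{v_t}{1-\beta_2^{t}},\qquad w_{t+1}=w_t-\eta\,\dfrac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}\)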
3. PyTorch Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr = eta)
for e in range(nb_epochs):
    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input[b:b+batch_size])
        loss = criterion(output, train_target[b:b+batch_size])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
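\(\text{The momentum of the previous section is available directly through the optimizer's }\)momentum\(\text{ argument:}\)
optimizer = torch.optim.SGD(model.parameters(), lr = eta, momentum = 0.9)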
\(\text{Adam:}\)
optimizer = torch.optim.Adam(model.parameters(), lr = eta)
4. Weight Initialization
\(\textbf{Goal:}\text{ control the scale of the activations and of the gradients across layers, so that:}\)
\(\large\textbf{Weights evolve at the same rate across layers during training,}\text{ and no layer reaches a saturation behavior before others.}\)
\(\text{Please refer to }\)Xavier and Kaiming Initialization
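\(\text{A minimal sketch of applying these schemes with }\)torch.nn.init\(\text{ (the layer sizes and the choice of nonlinearity are illustrative assumptions):}\)
import torch

def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        # Xavier (Glorot) initialization, suited to tanh/sigmoid activations
        torch.nn.init.xavier_normal_(m.weight)
        # For ReLU networks, Kaiming (He) initialization would be used instead:
        # torch.nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        torch.nn.init.zeros_(m.bias)

model = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 3))
model.apply(init_weights)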