
Deep Learning Week5 Notes

1. Cross-Entropy Loss

\(\textbf{MSE }\text{is justified in Euclidean space, but it makes no sense in the classification context, because the class values do not have any topological structure.}\)

Generalizing Logistic Regression

\(\text{With logits }f_k(x;w):\)

\[\begin{align} P(Y=y|X=x,W=w) = \frac{\exp{f_y(x;w)}}{\sum_k \exp{f_k(x;w)}} \end{align} \]
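
\(\text{For instance (a small sketch, not from the notes; the logit values are chosen to match the example further below), these probabilities can be computed directly in PyTorch:}\)

import torch

f = torch.tensor([-1., -3., 4.])           # logits f_k(x; w) for one sample
p = f.exp() / f.exp().sum()                # explicit soft-max, ≈ [0.0067, 0.0009, 0.9924]
# same result with the built-in: torch.softmax(f, dim=0)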

\(\text{From which:}\)

\[\begin{align} \log{\mu_W(w|D=d)}&=\log{\frac{\mu_D(d|W=w)\mu_W(w)}{\mu_D(d)}}\\ &=\log{\mu_D(d|W=w)}+\log{\mu_W(w)}-\log{Z}\\ &=\sum_n\log{\mu_D(x_n,y_n|W=w)}+\log{\mu_W(w)}-\log{Z}\\ &=\sum_n\log{P(Y=y_n|X=x_n,W=w)}+\log{\mu_W(w)}-\log{Z'}\\ &=\sum_n\log{\frac{\exp{f_{y_n}(x_n;w)}}{\sum_k \exp{f_k(x_n;w)}}}+\log{\mu_W(w)}-\log{Z'} \end{align} \]

\(\large\text{Ignoring the penalty on }w\text{, maximizing this posterior amounts to minimizing:}\)

\[\begin{align} L(w) = -\frac{1}{N}\sum_{n=1}^N\log{\left(\frac{\exp{f_{y_n}(x_n;w)}}{\sum_k \exp{f_k(x_n;w)}}\right)} \end{align} \]

\(\\\)

\(\large\textbf{Cross-Entropy:}\)

\[\begin{align} H(p,q) = -\sum_kp(k)\log{q(k)} \end{align} \]

\(\text{Rewriting each term of the loss function:}\)

\[\begin{align} -\log{\frac{\exp{f_{y_n}(x_n;w)}}{\sum_k \exp{f_k(x_n;w)}}}&=-\log{\hat{P}_w(Y=y_n|X=x_n)}\\ &= -\sum_k\delta_{y_n}(k)\log{\hat{P}_w(Y=k|X=x_n)}\\ &= H(\delta_{y_n},\hat{P}_w(Y|X=x_n)) \end{align} \]

\(\text{so that }L(w)=\frac{1}{N}\sum_{n=1}^N H(\delta_{y_n},\hat{P}_w(Y|X=x_n)).\)

torch.nn.CrossEntropyLoss:

>>> f = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
>>> target = torch.tensor([0, 1])
>>> criterion = torch.nn.CrossEntropyLoss()
>>> criterion(f, target)
tensor(2.5141)

\[-\frac{1}{2}(\log{\frac{e^{-1}}{e^{-1}+e^{-3}+e^4}}+\log{\frac{e^3}{e^{-3}+e^{3}+e^{-1}}}) = 2.5141 \]
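
\(\text{The same value can be checked with }\)torch.log_softmax\(\text{ (a minimal sketch, reusing the tensors above):}\)

import torch

f = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
target = torch.tensor([0, 1])

log_p = torch.log_softmax(f, dim=1)               # log-probabilities per row
loss = -log_p[torch.arange(2), target].mean()     # average NLL of the correct classes
# loss is tensor(2.5141), matching CrossEntropyLoss above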

\(\\\)
\(\textbf{Cross-entropy loss can be seen as the composition of the log-soft-max mapping}\)

\[\alpha_i \rightarrow \log{\frac{\exp{\alpha_i}}{{\sum_k\exp{\alpha_k}}}} \]

\(\text{with the negative log-likelihood loss. This can be done with }\)torch.nn.LogSoftmax and torch.nn.NLLLoss:

>>> f = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
>>> target = torch.tensor([0, 1])
>>> model = torch.nn.LogSoftmax(dim = 1)
>>> criterion = torch.nn.NLLLoss()
>>> criterion(model(f), target)
tensor(2.5141)

\(\Large\text{Hence, if a network should compute log-probabilities, it may have a }\)nn.LogSoftmax\(\Large\text{ final layer, and be trained with }\)nn.NLLLoss
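
\(\text{A minimal sketch of such a model (the layer sizes are arbitrary, for illustration only):}\)

import torch
from torch import nn

# Toy classifier producing log-probabilities (the sizes 10 -> 32 -> 3 are hypothetical)
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
    nn.LogSoftmax(dim=1),
)
criterion = nn.NLLLoss()            # expects log-probabilities and class indices

x = torch.randn(5, 10)
y = torch.randint(0, 3, (5,))
loss = criterion(model(x), y)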

2. SGD

\(\textbf{Mini-batch:}\)

for e in range(nb_epochs):
    for b in range(0, train_input.size(0), batch_size):
        # forward pass on the current mini-batch
        output = model(train_input[b:b+batch_size])
        loss = criterion(output, train_target[b:b+batch_size])
        # reset the accumulated gradients and run the backward pass
        model.zero_grad()
        loss.backward()

        # manual gradient step, outside of autograd
        with torch.no_grad():
            for p in model.parameters(): p -= eta * p.grad

Momentum and moment estimation

\[\begin{align} w_{t+1} &= w_t-\eta g_t\\ g_t &= \sum_{b=1}^B\nabla l_{n(t,b)}(w_t) \end{align} \]

\(n(t,b)\text{ is the index of the }b\text{-th sample of the mini-batch used at iteration }t.\)

\(\\\)
\(\textbf{A first improvement is the use of a 'momentum' term:}\)

\[\begin{align} u_t &= \gamma u_{t-1}+\eta g_t\\ w_{t+1} &= w_t - u_t \end{align} \]

\(\gamma =0:\text{ the same as SGD}\)
\(\textbf{Nice properties:}\)

  • \(\text{ Can go through local barriers}\)
  • \(\text{Accelerates if the gradient does not change much:}\)

\[\begin{align} u = \gamma u+\eta g\Rightarrow u = \frac{\eta}{1-\gamma}g \end{align} \]

  • \(\text{Dampens oscillations in narrow valleys}\)
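
\(\text{A minimal sketch of the momentum update above on a toy quadratic (not from the notes); in practice }\)torch.optim.SGD\(\text{ provides a }\)momentum\(\text{ argument implementing a similar update.}\)

import torch

# Toy example: minimize l(w) = w^2 with the update u <- gamma*u + eta*g, w <- w - u
gamma, eta = 0.9, 0.1
w = torch.tensor(5.0)
u = torch.zeros(())

for t in range(100):
    g = 2 * w                 # gradient of l(w) = w^2
    u = gamma * u + eta * g
    w = w - u                 # w oscillates but converges towards the minimum at 0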

\(\\\)

Adam

\[\begin{align} m_t&=\beta_1 m_{t-1}+(1-\beta_1)g_t\\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}\\ v_t&= \beta_2v_{t-1}+(1-\beta_2)g_t^2\\ \hat{v}_t&= \frac{v_t}{1-\beta_2^t}\\ w_{t+1}&=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t \end{align} \]
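
\(\text{A minimal sketch of these updates on the same toy quadratic (the }\beta\text{ values are the common defaults); in practice one simply uses }\)torch.optim.Adam:

import torch

beta1, beta2, eta, eps = 0.9, 0.999, 1e-3, 1e-8
w = torch.tensor(5.0)
m = torch.zeros(())
v = torch.zeros(())

for t in range(1, 1001):
    g = 2 * w                                  # gradient of l(w) = w^2
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2         # second-moment estimate
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta / (v_hat.sqrt() + eps) * m_hat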

3. PyTorch Optimizer

optimizer = torch.optim.SGD(model.parameters(), lr = eta)

for e in range(nb_epochs):
    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input[b:b+batch_size])
        loss = criterion(output, train_target[b:b+batch_size])

        optimizer.zero_grad()   # reset the accumulated gradients
        loss.backward()         # backward pass
        optimizer.step()        # parameter update done by the optimizer

\(\text{Adam:}\)
optimizer = torch.optim.Adam(model.parameters(), lr = eta)

4. Weight Initialization

\(\textbf{Goal:}\text{ controlling the variance of the gradients with respect to the parameters:}\)

\[\begin{align} Var(\frac{\partial L}{\partial w_{i,j}^{(l)}}),Var(\frac{\partial L}{\partial b_i^{(l)}}) \end{align} \]

\(\large\textbf{Weights should evolve at the same rate across layers during training,}\text{ and no layer should reach a saturation behavior before the others.}\)

\(\text{Please refer to }\)Xavier and Kaiming Initialization
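
\(\text{As an illustration (a sketch, not from the notes), these schemes are available in }\)torch.nn.init:

import torch
from torch import nn

layer = nn.Linear(256, 128)

# Xavier (Glorot) initialization, suited to tanh/sigmoid activations
nn.init.xavier_normal_(layer.weight)

# Kaiming (He) initialization, suited to ReLU activations
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

nn.init.zeros_(layer.bias)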
