Deep Learning Week9 Notes

1. Looking at parameters

Hidden units of a perceptron

A one-hidden-layer fully connected network \(\mathbb{R}^2\rightarrow \mathbb{R}^2\):

from torch import nn

nb_hidden = 20

model = nn.Sequential(
        nn.Linear(2, nb_hidden),
        nn.ReLU(),
        nn.Linear(nb_hidden, 2)
        )

To visit the parameters \((w, b)\) of each hidden unit:

for k in range(model[0].weight.size(0)):
    w = model[0].weight[k]
    b = model[0].bias[k]
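
Since each hidden unit computes \(\text{ReLU}(w\cdot x+b)\), in \(\mathbb{R}^2\) its active region is the half-plane where \(w\cdot x+b>0\). As a minimal sketch (assuming matplotlib is available; the plotting range is illustrative), the boundary lines of all hidden units can be drawn as follows:

import torch
import matplotlib.pyplot as plt

with torch.no_grad():
    xs = torch.linspace(-2, 2, 100)
    for k in range(model[0].weight.size(0)):
        w = model[0].weight[k]
        b = model[0].bias[k]
        # Boundary w[0]*x + w[1]*y + b = 0, i.e. y = -(w[0]*x + b) / w[1]
        if w[1].abs() > 1e-6:
            plt.plot(xs.numpy(), (-(w[0] * xs + b) / w[1]).numpy())

plt.xlim(-2, 2); plt.ylim(-2, 2)
plt.savefig('hidden_units.png')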

2. Looking at activations

Given data points in high dimension:

\[\mathcal{D} = \{ x_n\in\mathbb{R}^D,n=1,...,N \} \]

the objective of data visualization is to find a set of corresponding low-dimensional points

\[\mathcal{Y} = \{y_n\in\mathbb{R}^C,n=1,...,N \} \]

\(\large\text{t-Distributed Stochastic Neighbor Embedding (t-SNE):}\) optimizes the \(y_i\)s with SGD so that the distributions of distances to the close neighbors of each point are preserved.

It matches, in the \(D_{KL}\) sense, two distance-dependent distributions: a \(\textbf{Gaussian}\) in the original space, and a \(\textbf{Student t-distribution}\) in the low-dimensional one.
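
Concretely (the standard t-SNE formulation, spelled out here for completeness): with \(p_{ij}\) the symmetrized Gaussian similarities computed in the original space and \(q_{ij}\) the Student t similarities in the embedding,

\[q_{ij} = \frac{\left(1+\|y_i-y_j\|^2\right)^{-1}}{\sum_{k\neq l}\left(1+\|y_k-y_l\|^2\right)^{-1}}, \qquad C = D_{KL}(P\,\|\,Q) = \sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}} \]

and SGD minimizes \(C\) with respect to the \(y_i\)s.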

\(\text{Code:}\)

import torch
from sklearn.manifold import TSNE

# x is the array of the original high-dimension points
x_np = x.numpy()
y_np = TSNE(n_components = 2, perplexity = 50).fit_transform(x_np)
# y is the array of corresponding low-dimension points
y = torch.from_numpy(y_np)

n_components specifies the embedding dimension and perplexity states how many points are considered neighbors of each point.
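
As a hedged usage sketch (the model from Section 1 and the tensors train_input and train_targets are assumed to exist): feed the inputs through the hidden layer, embed the resulting activations with t-SNE, and scatter-plot them colored by label.

import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

with torch.no_grad():
    # Activations after the hidden ReLU layer, of size N x nb_hidden
    activations = model[:2](train_input)

y_np = TSNE(n_components = 2, perplexity = 50).fit_transform(activations.numpy())
plt.scatter(y_np[:, 0], y_np[:, 1], c = train_targets.numpy(), s = 5)
plt.savefig('tsne_activations.png')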

3. Visualizing the processing in the input

Saliency maps

A simple approach is to compute the gradient of an output with respect to the input:

\[\nabla_{|x}f_c(x;w) \]

\(\text{Code}\)

input.requires_grad_()
output = model(input)
# Gradient of the logit of class c with respect to the input
grad_input, = torch.autograd.grad(output[0, c], input)

Smilkov et al. (2017) proposed to smooth the gradient with respect to the input image by averaging over slightly perturbed versions of the latter.

\[\tilde{\nabla}_{\mid x} f_{y}(x ; w)=\frac{1}{N} \sum_{n=1}^{N} \nabla_{\mid x} f_{y}\left(x+\epsilon_{n} ; w\right) \]

where \(\epsilon_{1}, \ldots, \epsilon_{N}\) are i.i.d. samples from \(\mathcal{N}\left(0, \sigma^{2} I\right)\), and \(\sigma\) is a fraction of the gap \(\Delta\) between the maximum and the minimum of the pixel values.

\(\text{Code}\)

std = std_fraction * (img.max() - img.min())
acc_grad = img.new_zeros(img.size())

for q in range(nb_smooth): # This should be done with mini-batches ...
    noisy_input = img + img.new(img.size()).normal_(0, std)
    noisy_input.requires_grad_()
    output = model(noisy_input)
    
    grad_input, = torch.autograd.grad(output[0, c], noisy_input)
    acc_grad += grad_input

acc_grad = acc_grad.abs().sum(1) # sum across channels

  • std_fraction is typically between \(0.1\) and \(0.25\).
  • Remember that the new_* methods create tensors with the same dtype and device as the source tensor.
  • .sum(1) sums across the RGB channels, so we go from a tensor of size \(1 × 3 × 224 × 224\) to a tensor of size \(1 × 224 × 224\), which can be represented as a gray-scale image. Here, the \(1\) is for a mini-batch of one sample.
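
To turn the accumulated gradient into a viewable image, one can normalize it and save it, for instance with torchvision (a minimal sketch, not part of the original notes):

import torchvision

# Normalize to [0, 1] and save the 1 x 224 x 224 map as a gray-scale image
saliency = acc_grad / acc_grad.max()
torchvision.utils.save_image(saliency, 'smoothgrad.png')

Dividing acc_grad by nb_smooth is not needed here, since the normalization by the maximum removes the overall scale.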

Deconvolution and guided back-propagation

For the ReLU function, the forward pass is:

\[x = \max(0,s) \]

and the standard backward pass is:

\[\frac{\partial \ell}{\partial s} = \mathbf{1}_{\left\{s>0\right\}} \frac{\partial \ell}{\partial x} \]

\(\large\textbf{Deconvolution:}\)

\[\frac{\partial \ell}{\partial s} = \mathbf{1}_{\left\{\frac{\partial \ell}{\partial x}>0\right\}} \frac{\partial \ell}{\partial x} \]

This quantity is positive for units whose output has a positive contribution to the response, kills the others, and is not modulated by the pre-layer activation \(s\).

\(\large\textbf{Guided back-propagation:}\)

\[\frac{\partial \ell}{\partial s} = \mathbf{1}_{\{s>0\}} \mathbf{1}_{\left\{\frac{\partial \ell}{\partial x}>0\right\}} \frac{\partial \ell}{\partial x} \]

aims at the best of both worlds: discarding structures which would not contribute positively to the final response, and discarding structures which are not already present.

Hook

>>> x = torch.tensor([ 1.23, -4.56 ])
>>> m = nn.ReLU()
>>> m(x)
tensor([ 1.2300, 0.0000])
>>> def my_hook(m, input, output):
... print(str(m) + ' got ' + str(input[0].size()))
...
>>> handle = m.register_forward_hook(my_hook)
>>> m(x)
ReLU() got torch.Size([2])
tensor([ 1.2300, 0.0000])
>>> handle.remove()
>>> m(x)
tensor([ 1.2300, 0.0000])

Using hooks, we can implement the deconvolution as follows:

def relu_backward_deconv_hook(module, grad_input, grad_output):
    # Propagate only the positive part of the downstream gradient
    # (the trailing comma makes the returned value a one-element tuple)
    return F.relu(grad_output[0]),

def equip_model_deconv(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_backward_hook(relu_backward_deconv_hook)

def grad_view(model, image_name):
    to_tensor = transforms.ToTensor()
    img = to_tensor(PIL.Image.open(image_name))
    img = 0.5 + 0.5 * (img - img.mean()) / img.std() # standardize, re-center around 0.5

    model.to(device) # device is assumed to be defined, e.g. torch.device('cuda')
    img = img.to(device)

    input = img.view(1, img.size(0), img.size(1), img.size(2)).requires_grad_() # add a batch dimension
    output = model(input)
    
    result, = torch.autograd.grad(output.max(), input)
    result = result / result.max() + 0.5
    return result

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
equip_model_deconv(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-deconv.png')

\(\text{Hooks for Guided back-propagation:}\)

def relu_forward_gbackprop_hook(module, input, output):
    # Keep the ReLU's pre-activation s for the backward hook
    module.input_kept = input[0]

def relu_backward_gbackprop_hook(module, grad_input, grad_output):
    # 1_{s>0} * 1_{dl/dx>0} * dl/dx
    return F.relu(grad_output[0]) * F.relu(module.input_kept).sign(),

def equip_model_gbackprop(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_forward_hook(relu_forward_gbackprop_hook)
            m.register_backward_hook(relu_backward_gbackprop_hook)
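
Analogously to the deconvolution case above, the guided back-propagation map can then be produced with the same grad_view helper (a usage sketch mirroring the earlier code):

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
equip_model_gbackprop(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-gbackprop.png')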

Grad-CAM

\(\text{Gradient-weighted Class Activation Mapping (Grad-CAM)}\) visualizes the importance of the input sub-parts according to the activations in a specific layer.

Formally, let \(k \in \{1,...,C \}\) be a channel number, \(A^k\in \mathbb{R}^{H\times W}\) the output feature map \(k\) of the selected layer, \(c\) a class number, and \(y^c\) the network's logit for that class.

The channel's weight:

\[\alpha_k^c = \frac{1}{HW}\sum_{i=1}^H\sum_{j=1}^W\frac{\partial y^c}{\partial A_{i,j}^k} \]

The final localization map is:

\[L_{\text{Grad-CAM}}^c = \text{ReLU}\left(\sum_{k=1}^C\alpha_k^cA^k\right) \]

\(\text{Code:}\)

def hook_store_A(module, input, output):
    # Store the feature maps A computed during the forward pass
    module.A = output[0]

def hook_store_dydA(module, grad_input, grad_output):
    # Store dy^c/dA computed during the backward pass
    module.dydA = grad_output[0]

model = torchvision.models.vgg19(pretrained = True)
model.eval()

layer = model.features[35] # Last ReLU of the conv layers
layer.register_forward_hook(hook_store_A)
layer.register_backward_hook(hook_store_dydA)

Load an image and turn it into a mini-batch of one sample:

to_tensor = torchvision.transforms.ToTensor()
input = to_tensor(PIL.Image.open('example_images/elephant_hippo.png')).unsqueeze(0)

Compute:

output = model(input)

c = 386 # African elephant
output[0, c].backward()

alpha = layer.dydA.mean((2, 3), keepdim = True)
L = torch.relu((alpha * layer.A).sum(1, keepdim = True))

mean((2, 3), keepdim = True) computes the mean over the height and width of the feature map. So we go from a tensor of size \(1 × C × H × W\) to a tensor of size \(1 × C × 1 × 1\) (here \(C = 512\) for the last convolutional layer of VGG19). The last two “\(1\)”s are preserved by keepdim = True.

Save it as a resized colored heat-map:

from matplotlib import cm
import numpy
import torch.nn.functional as F

L = L / L.max()
L = F.interpolate(L, size = (input.size(2), input.size(3)),
                    mode = 'bilinear', align_corners = False)

l = L.view(L.size(2), L.size(3)).detach().numpy()
# gist_earth maps values in [0, 1] to RGBA colors
PIL.Image.fromarray(numpy.uint8(cm.gist_earth(l) * 255)).save('result.png')

gist_earth is a color map with orange color for high values, blue for low ones, and green for intermediate ones.
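
A possible follow-up (not in the original notes) is to overlay the heat-map on the input image, e.g. by blending the two with PIL; a minimal sketch reusing the tensors from above:

# Blend the original image with the colored heat-map (50/50 mix is an arbitrary choice)
to_pil = torchvision.transforms.ToPILImage()
img_pil = to_pil(input.squeeze(0))
heat_pil = PIL.Image.fromarray(numpy.uint8(cm.gist_earth(l) * 255)).convert('RGB')
PIL.Image.blend(img_pil, heat_pil, alpha = 0.5).save('overlay.png')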
