Deep Learning Week 8 Notes
1. Computer Vision Tasks
- Error rate: \(P(f(X)\neq Y)\)
- Accuracy: \(P(f(X)=Y)\)
- \(\textbf{Balanced error rate (BER)}\): \(\frac{1}{C}\sum_{y=1}^CP(f(X)\neq Y|Y=y)\)
- In the two-class case, we can define the \(\textbf{True Positive (TP)}\) rate as \(P(f(X)=1 \mid Y=1)\) and the \(\textbf{False Positive (FP)}\) rate as \(P(f(X)=1 \mid Y=0)\).
- Ideal algorithm: \(\textbf{TP}\rightarrow 1\), \(\textbf{FP}\rightarrow 0\)
- \(\textbf{Example:}\)
- \(\textbf{Cancer detection}\): low threshold to get a high \(TP\) rate, at the cost of a high \(FP\) rate.
- \(\textbf{Image Retrieval}\): high threshold to get a low \(FP\) rate, at the cost of a low \(TP\) rate.
\(\\\)
\(\textbf{ROC}:\) The ROC curve shows the true positive rate as a function of the false positive rate. Each point of the ROC corresponds to a \(\textbf{threshold}\) (not shown on the curve), the value above
which a sample is predicted to be of class 1:
- \(\textbf{High threshold:}\) the true positive rate is low, but so is the false positive rate
- \(\textbf{Low threshold:}\) the true positive rate is high, but so is the false positive rate
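\(\text{Sketch:}\) a minimal illustration of how these rates can be computed from raw scores (the `scores` and `labels` tensors here are hypothetical, not from the lecture):

import torch

# Hypothetical data: one score per sample, binary labels
scores = torch.rand(1000)
labels = (torch.rand(1000) < 0.3).long()

# Each threshold gives one (FP rate, TP rate) point of the ROC curve
for t in [0.25, 0.5, 0.75]:
    pred = (scores > t).long()
    tp = ((pred == 1) & (labels == 1)).sum().item() / (labels == 1).sum().item()
    fp = ((pred == 1) & (labels == 0)).sum().item() / (labels == 0).sum().item()
    print(f'threshold {t:.2f}: TP rate {tp:.2f}, FP rate {fp:.2f}')

# Balanced error rate at a fixed threshold: average of the per-class error rates
pred = (scores > 0.5).long()
ber = sum(((pred != labels) & (labels == y)).sum().item()
          / (labels == y).sum().item() for y in (0, 1)) / 2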
Object Detection
Predicted bounding box \(\hat{B}\), annotated bounding box \(B\). The prediction is considered correct if the \(\textbf{Intersection over Union}\) (IoU)
\(\text{IoU}(B,\hat{B})=\frac{\text{area}(B\cap\hat{B})}{\text{area}(B\cup\hat{B})}\)
is large enough.
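\(\text{Sketch:}\) a minimal IoU computation, assuming axis-aligned boxes given as \((x_{min}, y_{min}, x_{max}, y_{max})\) corners (an assumed format, not from the lecture):

def iou(b, b_hat):
    # Corners of the intersection rectangle
    x1, y1 = max(b[0], b_hat[0]), max(b[1], b_hat[1])
    x2, y2 = min(b[2], b_hat[2]), min(b[3], b_hat[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (b[2] - b[0]) * (b[3] - b[1])
    area_hat = (b_hat[2] - b_hat[0]) * (b_hat[3] - b_hat[1])
    return inter / (area + area_hat - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4 / 28 ≈ 0.14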
Image segmentation
consists of labeling individual pixels with the class of the object they belong to, and may also involve predicting the instance they belong to.
\(\textbf{Segmentation Accuracy (SA)}\) for class \(c\) is defined as
\(SA_c = \frac{N(f(X)=c \,\wedge\, Y=c)}{N(f(X)=c \,\vee\, Y=c)}\),
where \(N(\cdot)\) counts the number of pixels satisfying the condition, i.e. the intersection over union of the predicted and annotated pixel sets of class \(c\).
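\(\text{Sketch:}\) this definition translates directly to code (assuming `pred` and `target` are integer label maps of the same shape):

def segmentation_accuracy(pred, target, c):
    # Per-class intersection over union of the pixel sets labeled c
    inter = ((pred == c) & (target == c)).sum()
    union = ((pred == c) | (target == c)).sum()
    return inter.float() / union.float()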
2. Networks for image classification
Standard model: the LeNet family. These networks share a common structure: several convolutional layers acting as a feature extractor, followed by fully connected layers acting as the classifier.
For example, \(\textbf{AlexNet}\):
import torchvision
alexnet = torchvision.models.alexnet()
\(\textbf{LeNet5}\): \(10\) classes, input \(1\times 28\times 28\):
(features): Sequential (
  (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (1): ReLU (inplace)
  (2): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (4): ReLU (inplace)
  (5): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
)
(classifier): Sequential (
  (0): Linear (256 -> 120)
  (1): ReLU (inplace)
  (2): Linear (120 -> 84)
  (3): ReLU (inplace)
  (4): Linear (84 -> 10)
)
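\(\text{Sketch:}\) how this printout maps to actual PyTorch code (my own reconstruction from the structure above, not the lecture's code):

import torch
from torch import nn

lenet5 = nn.Sequential(
    # features: 1×28×28 -> 16×4×4
    nn.Conv2d(1, 6, kernel_size = 5), nn.ReLU(inplace = True),
    nn.MaxPool2d(kernel_size = 2, stride = 2),
    nn.Conv2d(6, 16, kernel_size = 5), nn.ReLU(inplace = True),
    nn.MaxPool2d(kernel_size = 2, stride = 2),
    # classifier: 16 * 4 * 4 = 256 -> 10 classes
    nn.Flatten(),
    nn.Linear(256, 120), nn.ReLU(inplace = True),
    nn.Linear(120, 84), nn.ReLU(inplace = True),
    nn.Linear(84, 10),
)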
\(\textbf{AlexNet}\): \(1,000\) classes, input \(3\times 224\times 224\).
(features): Sequential (
  (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (1): ReLU (inplace)
  (2): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
  (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): ReLU (inplace)
  (5): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
  (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU (inplace)
  (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU (inplace)
  (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU (inplace)
  (12): MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1))
)
(classifier): Sequential (
  (0): Dropout (p = 0.5)
  (1): Linear (9216 -> 4096)
  (2): ReLU (inplace)
  (3): Dropout (p = 0.5)
  (4): Linear (4096 -> 4096)
  (5): ReLU (inplace)
  (6): Linear (4096 -> 1000)
)
\(\text{Data Augmentation}\) to reduce over-fitting:
- crop a \(224\times 224\) image at a random position in the original \(256\times 256\), and randomly reflect it horizontally
- color transformation using \(\textbf{PCA}\) of the color distribution
At test time, the prediction is averaged over five crops (the four corners and the center) and their horizontal reflections.
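\(\text{Sketch:}\) a minimal illustration of this test-time averaging (my own code, not the lecture's; it assumes a recent torchvision whose functional transforms `five_crop` and `hflip` accept tensors):

import torch
from torchvision.transforms import functional as TF

def predict_with_crops(model, img):  # img: a 3×256×256 tensor
    crops = TF.five_crop(img, 224)                       # four corners + center
    crops = list(crops) + [TF.hflip(c) for c in crops]   # add horizontal reflections
    batch = torch.stack(crops)                           # 10×3×224×224
    with torch.no_grad():
        return model(batch).mean(0)                      # average the 10 predictions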
\(\text{Example:}\) using a pre-trained model for an image-classification problem.
import PIL, torch, torchvision
# Load and normalize the image
to_tensor = torchvision.transforms.ToTensor()
img = to_tensor(PIL.Image.open('../example_images/blacklab.jpg'))
img = img.unsqueeze(0) # (batch_size, C, H, W)
img = 0.5 + 0.5 * (img - img.mean()) / img.std()
# Load and evaluate the network
alexnet = torchvision.models.alexnet(pretrained = True)
alexnet.eval() # switch to eval mode, which disables dropout
output = alexnet(img) # (1, 1000)
# Prints the classes
scores, indexes = output.view(-1).sort(descending = True)
class_names = eval(open('imagenet1000_clsid_to_human.txt', 'r').read())
for k in range(12):
    print(f'#{k+1} {scores[k].item():.02f} {class_names[indexes[k].item()]}')
Fully convolutional networks
See Lecture-P\(14\), StackExchange
Standard convolutional networks reshape the tensor \(x^{(l)}\) produced by the convolutional layers into a \(1d\) tensor before feeding it to the fully connected layers that compose the classifier of the model.
\(\textbf{Conversely}\), we can replace the fully connected layers by convolutional layers whose filters are as big as their input tensors.
\(\text{Code:}\)
def convolutionize(layers, input_size):
    result_layers = []
    x = torch.zeros((1, ) + input_size)
    for m in layers:
        if isinstance(m, torch.nn.Linear):
            # Replace the linear layer with a convolution whose kernel
            # covers the full spatial extent of its input tensor
            n = torch.nn.Conv2d(in_channels = x.size(1),
                                out_channels = m.weight.size(0),
                                kernel_size = (x.size(2), x.size(3)))
            with torch.no_grad():
                # Copy the linear layer's parameters into the convolution
                n.weight.view(-1).copy_(m.weight.view(-1))
                n.bias.view(-1).copy_(m.bias.view(-1))
            m = n
        result_layers.append(m)
        x = m(x)  # keep track of the tensor size after each layer
    return result_layers
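For instance (a hedged usage sketch; it assumes an AlexNet whose classifier directly follows the \(256\times 6\times 6\) feature map, without an intermediate adaptive pooling):

alexnet = torchvision.models.alexnet(pretrained = True)
layers = list(alexnet.features) + list(alexnet.classifier)
model = torch.nn.Sequential(*convolutionize(layers, (3, 224, 224)))
# The network now accepts larger images and outputs a spatial map of
# class scores, e.g. (1, 1000, 4, 4) for a 3×320×320 input
y = model(torch.zeros(1, 3, 320, 320))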
3. Networks for object detection
While image classification aims at predicting the class of the main object in the image, object detection aims at predicting both the classes and the locations of all the visible objects.
\(\large\text{Overfeat}:\) adds a regression part to predict the object's bounding box (see Lecture-P3).
In the single-object case, the convolutional layers are \(\textbf{frozen}\), and the localization layers are trained with an \(L_2\) loss.
For multiple boxes, using class-specific localization layers did not provide better results than having a \(\textbf{single one shared}\) across classes.
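\(\text{Sketch:}\) a minimal single-object localization head of this style (my own illustration; the layer sizes are assumptions, not Overfeat's actual configuration):

import torch, torchvision
from torch import nn

backbone = torchvision.models.alexnet(pretrained = True).features
for p in backbone.parameters():
    p.requires_grad = False  # the convolutional layers are frozen

# Regression head predicting the 4 bounding-box coordinates
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 1024), nn.ReLU(inplace = True),
    nn.Linear(1024, 4),
)
criterion = nn.MSELoss()  # L2 loss on the box coordinates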
\(\\\)
\(\large\text{One of the most famous algorithms: } \textbf{YOLO (You Only Look Once)}\). It goes back to a classical architecture: a series of convolutional layers followed by a few fully connected layers.
In detail, it uses leaky ReLU, and its convolutional layers make use of \(1 \times 1\) bottleneck filters (Lin et al., 2013) to control the memory footprint and computational cost.
Illustration: Lecture-P8
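For instance, a \(1\times 1\) convolution cheaply reduces the number of channels before a more expensive \(3\times 3\) convolution (a generic sketch; the channel counts are illustrative, not YOLO's actual ones):

from torch import nn

bottleneck = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size = 1),  # 1×1 bottleneck: channel reduction
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size = 3, padding = 1),
    nn.LeakyReLU(0.1),
)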
During training, YOLO makes the assumption that any of the \(S^2\) cells contains at most a single object.
For every image, define cell index \(i=1,\dots,S^2\), predicted box index \(j=1,\dots,B\), and class index \(c = 1,\dots,C\):
- \(1_i^{obj}\) is \(1\) if there is an object in cell \(i\) and \(0\) otherwise.
- \(1_{i,j}^{obj}\) is \(1\) if there is an object in cell \(i\) and predicted box \(j\) is the most fitting one, \(0\) otherwise.
- \(p_{i,c}\) is \(1\) if there is an object of class \(c\) in cell \(i\), and \(0\) otherwise.
- \(x_i,y_i,w_i,h_i\): the annotated object bounding box (defined only when \(1_i^{obj}=1\))
- \(c_{i,j}\): the IoU between predicted box \(j\) of cell \(i\) and the ground-truth box.
then minimize:
\(\lambda_{coord}\sum_{i=1}^{S^2}\sum_{j=1}^{B}1_{i,j}^{obj}\left[(x_i-\hat{x}_{i,j})^2+(y_i-\hat{y}_{i,j})^2+\left(\sqrt{w_i}-\sqrt{\hat{w}_{i,j}}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_{i,j}}\right)^2\right]\)
\(+\sum_{i=1}^{S^2}\sum_{j=1}^{B}1_{i,j}^{obj}\left(c_{i,j}-\hat{c}_{i,j}\right)^2+\lambda_{noobj}\sum_{i=1}^{S^2}\sum_{j=1}^{B}\left(1-1_{i,j}^{obj}\right)\hat{c}_{i,j}^2\)
\(+\sum_{i=1}^{S^2}1_i^{obj}\sum_{c=1}^{C}\left(p_{i,c}-\hat{p}_{i,c}\right)^2\)
The first part of the loss minimizes the localization error of the detection. The square root is used to reduce the weight of the height and width errors (\(w_i, h_i\)) relative to the box location (\(x_i, y_i\)).
The second part of the loss drives the confidence \(\hat{c}_{i,j}\) of a detection to reflect the intersection over union \(c_{i,j}\) of that bounding box with the ground truth. When there is no object, we want the confidence \(\hat{c}_{i,j}\) of that bounding box to be low; this term is weighted by \(\lambda_{noobj}\).
The last part of the loss is for the class scores. Note that while the natural choice would be cross-entropy, a quadratic error is used here.
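\(\text{Sketch:}\) a simplified version of this loss in code (my own illustration; the tensor shapes are assumptions): `obj` is the \((S^2, B)\) indicator \(1_{i,j}^{obj}\), `xywh` the \((S^2, 4)\) annotated boxes, `xywh_hat` the \((S^2, B, 4)\) predictions, `c`/`c_hat` the \((S^2, B)\) target IoUs and predicted confidences, and `p`/`p_hat` the \((S^2, C)\) class targets and predictions.

def yolo_loss(obj, xywh, xywh_hat, c, c_hat, p, p_hat,
              l_coord = 5.0, l_noobj = 0.5):
    # Localization: squared error on (x, y) and on sqrt of (w, h)
    xy_err = ((xywh[:, None, :2] - xywh_hat[..., :2]) ** 2).sum(-1)
    wh_err = ((xywh[:, None, 2:].sqrt() - xywh_hat[..., 2:].sqrt()) ** 2).sum(-1)
    localization = l_coord * (obj * (xy_err + wh_err)).sum()
    # Confidence: match the IoU where there is an object, push to 0 elsewhere
    confidence = (obj * (c - c_hat) ** 2).sum() \
               + l_noobj * ((1 - obj) * c_hat ** 2).sum()
    # Class scores: quadratic error over the cells that contain an object
    obj_i = obj.max(dim = 1, keepdim = True).values  # 1_i^{obj}
    classification = (obj_i * (p - p_hat) ** 2).sum()
    return localization + confidence + classification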
\(\large\textbf{Tricks for training:}\) Lecture-P13
\(\large\textbf{Summary: how 'one shot' detection is achieved}\)
- networks trained on image classification capture localization information
- regression layers can be attached to classification-trained networks
- object localization does not have to be class-specific
4. Networks for semantic segmentation
The deep-learning approach re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional.
A “\(\textbf{background}\)” class is added for pixels that do not belong to any of the defined objects, which avoids forcing the network to make an inconsistent choice.
Since segmentation aims at classifying the individual pixels, the final tensor should have the same spatial size as the input image. Since the activation maps have been reduced by pooling operations, their size has to be increased back.
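One common way to increase the map size back is a transposed convolution (a minimal sketch; the channel counts, e.g. \(21\) classes for \(20\) objects plus background, are illustrative assumptions):

import torch
from torch import nn

# Upsample a 28×28 activation map by a factor 8 back to 224×224 class scores
up = nn.ConvTranspose2d(256, 21, kernel_size = 16, stride = 8, padding = 4)
x = torch.zeros(1, 256, 28, 28)
print(up(x).size())  # torch.Size([1, 21, 224, 224])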
5. DataLoader
torch.utils.data.DataLoader
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transforms = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(mean = (0.1302,), std = (0.3069,))
    ]
)

train_loader = DataLoader(
    datasets.MNIST(root = data_dir, train = True, download = True,
                   transform = train_transforms),
    batch_size = 100,
    num_workers = 4,
    shuffle = True,
    pin_memory = torch.cuda.is_available()
)
num_workers
: is the number of parallel worker processes used to load and prepare the mini-batches.
pin_memory
: is useful when training on a GPU: it allocates the samples in page-locked memory, which speeds up transfers between CPU and GPU.
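With `pin_memory = True`, the host-to-GPU copies can additionally be made asynchronous (a minimal sketch; `device` is assumed to be a CUDA device):

for input, targets in train_loader:
    # non_blocking = True overlaps the copy with computation
    input = input.to(device, non_blocking = True)
    targets = targets.to(device, non_blocking = True)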
\(\large\text{Example:}\)
import os, torch, torchvision
from torch import nn, utils
from torch.nn import functional as F
from torchvision import datasets

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data_dir = os.environ.get('PYTORCH_DATA_DIR') or './data/cifar10/'
num_workers = 4
batch_size = 64

transform = torchvision.transforms.ToTensor()

train_set = datasets.CIFAR10(root = data_dir, train = True,
                             download = True, transform = transform)
train_loader = utils.data.DataLoader(train_set, batch_size = batch_size,
                                     shuffle = True, num_workers = num_workers)

test_set = datasets.CIFAR10(root = data_dir, train = False,
                            download = True, transform = transform)
test_loader = utils.data.DataLoader(test_set, batch_size = batch_size,
                                    shuffle = False, num_workers = num_workers)
class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x  # residual (skip) connection
        y = F.relu(y)
        return y
class Monster(nn.Module):
    def __init__(self, nb_blocks, nb_channels):
        super().__init__()
        # Re-use the first convolutional layer of a pre-trained AlexNet
        alexnet = torchvision.models.alexnet(pretrained = True)
        self.features = nn.Sequential(alexnet.features[0], nn.ReLU(inplace = True))
        # Probe the feature map size with a dummy forward pass
        dummy = self.features(torch.zeros(1, 3, 32, 32)).size()
        alexnet_nb_channels = dummy[1]
        alexnet_map_size = tuple(dummy[2:4])
        self.conv = nn.Conv2d(alexnet_nb_channels, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            *(ResBlock(nb_channels, kernel_size = 3) for _ in range(nb_blocks))
        )
        self.avg = nn.AvgPool2d(kernel_size = alexnet_map_size)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = self.features(x)
        x = F.relu(self.conv(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
nb_epochs = 50
nb_blocks, nb_channels = 8, 64

model, criterion = Monster(nb_blocks, nb_channels), nn.CrossEntropyLoss()
model.to(device)
criterion.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-2)

for e in range(nb_epochs):
    # Freeze the features during the first half of the epochs
    for p in model.features.parameters():
        p.requires_grad = e >= nb_epochs // 2
    acc_loss = 0.0
    for input, targets in iter(train_loader):
        input, targets = input.to(device), targets.to(device)
        output = model(input)
        loss = criterion(output, targets)
        acc_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(e, acc_loss)
nb_test_errors, nb_test_samples = 0, 0
model.eval()

with torch.no_grad():
    for input, targets in iter(test_loader):
        input, targets = input.to(device), targets.to(device)
        output = model(input)
        wta = torch.argmax(output, 1).view(-1)  # winner-take-all prediction
        for i in range(targets.size(0)):
            nb_test_samples += 1
            if wta[i] != targets[i]: nb_test_errors += 1

test_error = 100 * nb_test_errors / nb_test_samples
print(f'test_error {test_error:.02f}% ({nb_test_errors}/{nb_test_samples})')