Xidian University Artificial Intelligence Lab Course 01
01 Image Classification on the CIFAR-10 Dataset with PyTorch on the HPC Platform
- Author: Kuka
- Date: 2024-03-27
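Logging In to the HPC Platform
Open the platform's URL to reach the HPC platform, then log in with your account and password.
After logging in, you will see the interface shown in the figure above.
Configuring the conda Environment
Starting conda
Right-click and choose to open a terminal.
In the terminal, run the command below. (Note: if conda cannot be used, the most likely cause is that this command was not run first.)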
source /apps/software/anaconda3/etc/profile.d/conda.sh
Then check the conda version:
conda -V
![[Pasted image 20240327112310.png]]
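Creating a Virtual Environment with conda
Run the command below to create a new environment named myenv with Python 3.8:
# the name after -n is the name of your own virtual environment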
conda create -n myenv python=3.8
Activating the conda Virtual Environment
conda activate myenv
![[Pasted image 20240327112743.png]]
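When (myenv) appears at the start of the prompt, the environment has been activated successfully.
Installing PyTorch in the Virtual Environment
Here is a fairly convenient way to install PyTorch.
First, download torch and torchvision from the two links below.
Click each link to download the file to your local machine.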
![[Pasted image 20240327134223.png]]
On the HPC platform, create a folder named project, then create a folder named cifar inside it.
![[Pasted image 20240327135500.png]]
![[Pasted image 20240327135514.png]]
Under "My Data" on the HPC platform, open the cifar folder you just created and upload the torch and torchvision whl files.
![[Pasted image 20240327134359.png]]
On the Linux desktop, open the folder to see the files you just uploaded.
![[Pasted image 20240327135630.png]]
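Right-click here to open a terminal, then install torch and torchvision with the command below. (Remember to activate the conda virtual environment created earlier first.)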
![[Pasted image 20240327135925.png]]
Verifying the Installation
Create a test.py file in the cifar folder created earlier:
touch test.py
Write the following into test.py:
import torch
import torchvision
print(torch.cuda.is_available())
Then save the file.
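As a slightly more informative variant of this check, the sketch below also reports which versions were installed. The helper name check_packages is hypothetical; it relies only on the standard importlib module and on the `__version__` attribute that torch and torchvision expose.

```python
import importlib


def check_packages(names=("torch", "torchvision")):
    """Map each package name to its version string, or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    versions = check_packages()
    for name, version in versions.items():
        print(name, version if version is not None else "NOT INSTALLED")
    # Only query CUDA when torch itself could be imported.
    if versions["torch"] is not None:
        import torch
        print("CUDA available:", torch.cuda.is_available())
```

If a package shows up as NOT INSTALLED, the whl file was probably installed into a different environment than the one activated in the job script.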
Next, write a job submission script. Run the command below in the terminal to create the file:
touch submit.sh
Open submit.sh and write the following:
#!/bin/bash
#JSUB -q AI205009
#JSUB -gpgpu 1
#JSUB -e logs/error.%J
#JSUB -o logs/output.%J
#JSUB -n 1
#JSUB -J test_torch
source /apps/software/anaconda3/etc/profile.d/conda.sh
conda activate myenv
python test.py > log.txt
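After saving, create an empty folder named logs to hold the job output files. It must exist before the job is submitted, because submit.sh writes error.%J and output.%J into it.
After completing the steps above, the folder should look like the figure below.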
![[Pasted image 20240327141950.png]]
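The preparation steps above can be double-checked with a short script before submitting. This is a minimal sketch, assuming it is run from inside the cifar folder; the helper name prepare_job_dir is hypothetical and not part of the platform's tooling.

```python
from pathlib import Path


def prepare_job_dir(job_dir=".", required=("test.py", "submit.sh")):
    """Create logs/ if needed and return the required files that are missing."""
    job_dir = Path(job_dir)
    # submit.sh points the job's error and output files at logs/, so it must exist.
    (job_dir / "logs").mkdir(parents=True, exist_ok=True)
    return [name for name in required if not (job_dir / name).is_file()]


if __name__ == "__main__":
    missing = prepare_job_dir()
    print("missing files:", missing if missing else "none")
```

An empty result means the folder contains everything the submission script refers to.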
Right-click to open a terminal and submit the job with the command below:
jsub < submit.sh
![[Pasted image 20240327142059.png]]
In the figure, 9163 is the ID of the job just submitted.
Then use the jjobs command to check the job status:
jjobs 9163
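After the job finishes, a log.txt file appears in the directory. If it contains True, the installation is verified.
The logs directory holds the output of past jobs; runtime error messages can be found there.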
![[Pasted image 20240327142455.png]]
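Reading log.txt by eye is enough for a single job, but the same check can be scripted. This is a minimal sketch, assuming test.py prints the CUDA flag as its last line of output; the function name cuda_check_passed is hypothetical.

```python
from pathlib import Path


def cuda_check_passed(log_path="log.txt"):
    """Return True if the last non-empty line of the log is exactly 'True'."""
    path = Path(log_path)
    if not path.is_file():
        return False  # the job has not produced any output yet
    lines = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "True"


if __name__ == "__main__":
    print("installation verified:", cuda_check_passed())
```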
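Training a CIFAR Classification Network on the HPC Platform
Write the CIFAR training code and upload it. The HPC platform already provides the CIFAR-10 dataset, so it can be loaded directly with download=False. Here transform and batch_size are defined as in the reference code in the appendix.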
trainset=torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10',train=True,download=False, transform=transform)
trainloader=torch.utils.data.DataLoader(trainset,batch_size=batch_size,shuffle=True, num_workers=8)
testset=torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10',train=False,download=False, transform=transform)
testloader=torch.utils.data.DataLoader(testset,batch_size=batch_size,shuffle=False, num_workers=8)
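To get a feel for how many iterations these loaders produce, the arithmetic can be sketched without PyTorch. CIFAR-10 has 50,000 training images and 10,000 test images, and a DataLoader with the default drop_last=False yields one final partial batch. The helper name batches_per_epoch is hypothetical, and batch_size = 128 is the value used in the appendix reference code.

```python
import math

# CIFAR-10 ships 50,000 training images and 10,000 test images.
N_TRAIN, N_TEST = 50_000, 10_000


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of mini-batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    batch_size = 128  # value used in the appendix reference code
    print(batches_per_epoch(N_TRAIN, batch_size))  # 391
    print(batches_per_epoch(N_TEST, batch_size))   # 79
```

With 391 training batches per epoch, a script that logs the loss every 16 batches prints 24 loss lines per epoch.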
Once cifar_classifier.py is written, submit it the same way as before. In the submission script, replace test.py with the script to run, cifar_classifier.py.
![[Pasted image 20240327145242.png]]
Modify submit.sh:
#!/bin/bash
#JSUB -q AI205009
#JSUB -gpgpu 1
#JSUB -e logs/error.%J
#JSUB -o logs/output.%J
#JSUB -n 1
#JSUB -J test_torch
source /apps/software/anaconda3/etc/profile.d/conda.sh
conda activate myenv
python cifar_classifier.py > log.txt
Then submit the program with the jsub command.
![[Pasted image 20240327145416.png]]
The job status can be viewed under "My Jobs" on the HPC platform home page.
![[Pasted image 20240327145518.png]]
![[Pasted image 20240327145536.png]]
After the job completes, log.txt contains the program output.
![[Pasted image 20240327145628.png]]
![[Pasted image 20240327145650.png]]
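Runtime Errors
If the code has an error, the job status will show EXIT. Use the error message to trace and locate the problem.
In the logs directory, the error file contains the error output; use the message to fix the code. (The logs directory must be created manually before submitting the job.)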
![[Pasted image 20240327145911.png]]
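When several jobs have been submitted, the logs directory fills up with error files, and the most recent one is usually the one of interest. The sketch below picks the newest error.* file by modification time and shows its last lines. It assumes the error.%J naming from submit.sh; the function name latest_error_log is hypothetical.

```python
from pathlib import Path


def latest_error_log(logs_dir="logs", tail=20):
    """Return (path, last `tail` lines) of the newest error.* file, or None."""
    candidates = sorted(Path(logs_dir).glob("error.*"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        return None
    newest = candidates[-1]
    lines = newest.read_text(errors="replace").splitlines()
    return newest, lines[-tail:]


if __name__ == "__main__":
    result = latest_error_log()
    if result is None:
        print("no error.* files found in logs/")
    else:
        path, lines = result
        print(f"--- {path} ---")
        print("\n".join(lines))
```

A Python traceback ends with the exception type and message, so the last few lines of the file are usually the right place to start.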
Appendix
Reference code for cifar_classifier.py
# This reference code is adapted from code written by instructor Qi Fei
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Convert images to tensors in [0, 1], then normalize each channel to [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 128
n_epochs = 2
n_batches = 16  # report the average loss every n_batches mini-batches

# The dataset is already available on the platform, so download=False
trainset = torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10', train=True,
                                        download=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=8)
testset = torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10', train=False,
                                       download=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=8)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


# Network definition
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 32, 5)
        self.fc = nn.Linear(32 * 5 * 5, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = self.fc(x)
        return x


mynet = MyNet()
print(mynet)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(mynet.parameters(), lr=0.001, momentum=0.9)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
mynet.to(device)

for epoch in range(n_epochs):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = mynet(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if (i+1) % n_batches == 0:  # print every n_batches mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / n_batches:.3f}')
            running_loss = 0.0

# Save the trained weights, then reload them into a fresh model on the CPU
model_path = 'cifar_net.pth'
torch.save(mynet.state_dict(), model_path)
mynet = MyNet()
mynet.load_state_dict(torch.load(model_path))

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # calculate outputs by running images through the network
        outputs = mynet(images)
        # the class with the highest score is what we choose as prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the test set: %3.1f %%' % (100*correct/total))

# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}
# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = mynet(images)
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))
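The input size 32 * 5 * 5 of the fully connected layer in MyNet follows from the layer shapes. The sketch below reproduces that arithmetic with the standard output-size formula for unpadded, undilated convolution and pooling windows. The helper names conv_out and mynet_flat_features are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def mynet_flat_features(image_size=32, channels=32):
    """Number of features entering self.fc for a square input image."""
    size = conv_out(image_size, kernel=5)       # conv1: 32 -> 28
    size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
    size = conv_out(size, kernel=5)             # conv2: 14 -> 10
    size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5
    return channels * size * size


if __name__ == "__main__":
    print(mynet_flat_features())  # 800, i.e. 32 * 5 * 5
```

If the kernel sizes or the number of channels in MyNet are changed, the first argument of nn.Linear has to be updated to match.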