Xidian University Artificial Intelligence Lab Course 01

01 Image classification on the CIFAR-10 dataset with PyTorch on the HPC platform

  • Author: Kuka
  • Date: 2024-03-27

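Logging in to the HPC platform

HPC platform URL: https://xdhpcai.xidian.edu.cn/

Open the URL and log in with your account and password.

After logging in, you should see the interface shown in the figure above.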

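Setting up the conda environment

Starting conda

Right-click and choose "Open Terminal".

In the terminal, run the command below. (Note: if the conda command cannot be used, the most likely cause is that this command was not run first.)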

source /apps/software/anaconda3/etc/profile.d/conda.sh

Then check the conda version:

conda -V

![[Pasted image 20240327112310.png]]

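Creating a virtual environment with conda

Run the command below to create a new environment named myenv with Python 3.8. The wheel files installed later are built for Python 3.8 (the cp38 tag in their file names), so the version matters.

# the name after -n is the name of your own virtual environment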
conda create -n myenv python=3.8

Activating the conda virtual environment

conda activate myenv

![[Pasted image 20240327112743.png]]
When (myenv) appears at the start of the prompt, the environment has been activated successfully.

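Installing PyTorch in the virtual environment

Here is a fairly convenient way to install PyTorch.
First download torch and torchvision from the two links below.

torchvision download link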
https://download.pytorch.org/whl/cu111/torchvision-0.9.1%2Bcu111-cp38-cp38-linux_x86_64.whl#sha256=563b02056f4bbacaf868340ba9a161e5da55ff9649e4f0ecee763809b0870d30

torch download link
https://download.pytorch.org/whl/cu111/torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl#sha256=aaf4d030bcf80903e06a5cd4c98c33eab9be131c96948cc8f8548c421c4ce1e3

Click each link to download the file to your local machine.
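The file names in the two links encode which Python interpreter and CUDA build each wheel targets, which is why the environment was created with Python 3.8. Below is a minimal sketch that splits a wheel file name into its tags. It assumes the simple five-field layout `name-version-python-abi-platform.whl` with no optional build tag, and `parse_wheel_name` is a made-up helper name, not part of pip.

```python
from urllib.parse import unquote


def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated tags.

    Assumption: five fields, no optional build tag.
    """
    stem = unquote(filename)  # turns "%2B" back into "+"
    if stem.endswith(".whl"):
        stem = stem[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    # The part after "+" is a local version label; here it names the CUDA build.
    base_version, _, local = version.partition("+")
    return {
        "name": name,
        "version": base_version,
        "local": local,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_name("torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl")
for key, value in info.items():
    print(f"{key}: {value}")
```

Here `cp38` means CPython 3.8 and `cu111` marks a build against CUDA 11.1, so a Python 3.9 environment would reject this file.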

![[Pasted image 20240327134223.png]]
On the HPC platform, create a folder named project, then create a cifar folder inside the project folder.
![[Pasted image 20240327135500.png]]
![[Pasted image 20240327135514.png]]


In "My Data" (我的数据) on the HPC platform, open the cifar folder you just created and upload the torch and torchvision .whl files.
![[Pasted image 20240327134359.png]]

On the Linux desktop, open the folder to see the files you just uploaded.
![[Pasted image 20240327135630.png]]
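Each download link ends with a `#sha256=...` fragment, which is the expected checksum of the file. Before installing, it can be worth confirming that the uploaded wheels were not truncated along the way. The sketch below uses only the standard library. `sha256_of` and `matches_link` are hypothetical helper names, and the wheel is assumed to sit in the current directory with `+` in its name (browsers usually decode `%2B` when saving).

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_link(path, url):
    """Compare a file's hash with the value after '#sha256=' in its URL."""
    expected = url.rsplit("#sha256=", 1)[-1].lower()
    return sha256_of(path) == expected


if __name__ == "__main__":
    wheel = "torch-1.8.1+cu111-cp38-cp38-linux_x86_64.whl"  # assumed local name
    url = (
        "https://download.pytorch.org/whl/cu111/"
        "torch-1.8.1%2Bcu111-cp38-cp38-linux_x86_64.whl"
        "#sha256=aaf4d030bcf80903e06a5cd4c98c33eab9be131c96948cc8f8548c421c4ce1e3"
    )
    if os.path.exists(wheel):
        print("checksum ok:", matches_link(wheel, url))
    else:
        print("wheel not found in the current directory")
```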

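Right-click here to open a terminal, then install torch and torchvision by running pip install on the two uploaded .whl files, as in the screenshot below. (Remember to activate the conda virtual environment created earlier first.)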
![[Pasted image 20240327135925.png]]

Verifying the installation

Create a test.py file in the cifar folder created earlier:

touch test.py

Write the following into test.py:

import torch
import torchvision
print(torch.cuda.is_available())

Then save the file.
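If the job later prints False, a little more detail helps narrow down the cause. The sketch below is an optional, more verbose variant of test.py. `describe_torch` is a hypothetical helper; it relies only on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.device_count` and `torch.cuda.get_device_name`.

```python
def describe_torch():
    """Collect basic facts about the PyTorch install without raising."""
    try:
        import torch
    except ImportError as exc:
        return {"installed": False, "error": str(exc)}

    info = {
        "installed": True,
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch().items():
    print(f"{key}: {value}")
```

Like test.py, this is meant to be run through a submitted job, on the assumption (suggested by the workflow in this tutorial) that a GPU is only visible inside a job that requested one.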

Next, write a job submission script. Run the command below in the terminal to create the file:

touch submit.sh

Open submit.sh and write the following into it:

#!/bin/bash
#JSUB -q AI205009
#JSUB -gpgpu 1
#JSUB -e logs/error.%J
#JSUB -o logs/output.%J
#JSUB -n 1
#JSUB -J test_torch

source /apps/software/anaconda3/etc/profile.d/conda.sh
conda activate myenv
python test.py > log.txt

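After saving, create an empty folder named logs to hold the job output files.
Once these steps are done, the folder should look like the screenshot below.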
![[Pasted image 20240327141950.png]]
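The two manual steps above (writing submit.sh and creating the logs folder) can also be scripted. The following is only a sketch: `prepare_job` is a made-up helper, and the `#JSUB` directives are copied verbatim from the script above rather than taken from the scheduler's documentation.

```python
import os

# Directives copied from the tutorial's submit.sh; only the job name,
# environment name and script name are filled in.
SUBMIT_TEMPLATE = """#!/bin/bash
#JSUB -q AI205009
#JSUB -gpgpu 1
#JSUB -e logs/error.%J
#JSUB -o logs/output.%J
#JSUB -n 1
#JSUB -J {job_name}

source /apps/software/anaconda3/etc/profile.d/conda.sh
conda activate {env_name}
python {script} > log.txt
"""


def prepare_job(workdir, script, job_name="test_torch", env_name="myenv"):
    """Create workdir/logs and write workdir/submit.sh for the given script."""
    os.makedirs(os.path.join(workdir, "logs"), exist_ok=True)
    submit_path = os.path.join(workdir, "submit.sh")
    with open(submit_path, "w", newline="\n") as f:
        f.write(
            SUBMIT_TEMPLATE.format(
                job_name=job_name, env_name=env_name, script=script
            )
        )
    return submit_path


if __name__ == "__main__":
    print("wrote", prepare_job(".", "test.py"))
```

Calling `prepare_job(".", "cifar_classifier.py")` later would regenerate the script for the training job, overwriting the existing submit.sh.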
Right-click to open a terminal and run the command below to submit the job:

jsub < submit.sh

![[Pasted image 20240327142059.png]]

In the screenshot, 9163 is the ID of the job that was just submitted.

Then use the jjobs command to check the job status:

jjobs 9163

Once the job has finished, a log.txt file appears in the directory. If its content is True, the installation has been verified.
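The same check can be done without opening the file by hand. This is a minimal sketch, assuming the job was submitted from the current directory so that log.txt is written there; `install_verified` is a hypothetical helper name.

```python
import os


def install_verified(log_path="log.txt"):
    """Return True only if the log exists and its last non-empty line is 'True'."""
    if not os.path.exists(log_path):
        return False  # job not finished yet, or it failed before printing
    with open(log_path, encoding="utf-8", errors="replace") as f:
        lines = [line.strip() for line in f if line.strip()]
    return bool(lines) and lines[-1] == "True"


if __name__ == "__main__":
    if install_verified():
        print("GPU install verified")
    else:
        print("not verified; check the files under logs/")
```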

The logs directory holds the output of past jobs; runtime error messages can be found there.
![[Pasted image 20240327142455.png]]

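Training a CIFAR classification network on the HPC platform

Write the CIFAR training code and upload it. The HPC platform already provides the CIFAR-10 dataset, so we can load it directly from its path with download=False: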

trainset=torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10',train=True,download=False, transform=transform)

trainloader=torch.utils.data.DataLoader(trainset,batch_size=batch_size,shuffle=True, num_workers=8)

testset=torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10',train=False,download=False, transform=transform)

testloader=torch.utils.data.DataLoader(testset,batch_size=batch_size,shuffle=False, num_workers=8)
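The `transform` passed to both datasets is defined in the reference code in the appendix as `ToTensor()` followed by `Normalize` with mean 0.5 and standard deviation 0.5 per channel. `ToTensor` scales 8-bit pixel values to the range [0, 1], and `Normalize` then computes (x - mean) / std, so the network sees inputs in [-1, 1]. The arithmetic is written out below in plain Python for a single pixel value; `normalize_pixel` is an illustrative name, not a torchvision function.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Mimic ToTensor followed by Normalize for one 8-bit pixel value."""
    scaled = value / 255.0        # ToTensor: [0, 255] -> [0, 1]
    return (scaled - mean) / std  # Normalize: (x - mean) / std


for v in (0, 128, 255):
    print(v, "->", round(normalize_pixel(v), 3))
```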

Once cifar_classifier.py is written, submit it the same way as before. This time, change test.py in the submission script to cifar_classifier.py, the script you want to run.

![[Pasted image 20240327145242.png]]

Modify submit.sh:

#!/bin/bash
#JSUB -q AI205009
#JSUB -gpgpu 1
#JSUB -e logs/error.%J
#JSUB -o logs/output.%J
#JSUB -n 1
#JSUB -J test_torch

source /apps/software/anaconda3/etc/profile.d/conda.sh
conda activate myenv
python cifar_classifier.py > log.txt

Then submit the program with the jsub command.
![[Pasted image 20240327145416.png]]

The job status can be viewed under "My Jobs" (我的作业) on the HPC platform home page.

![[Pasted image 20240327145518.png]]

![[Pasted image 20240327145536.png]]

After the job finishes, log.txt contains the program output.
![[Pasted image 20240327145628.png]]

![[Pasted image 20240327145650.png]]
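If you want the numbers rather than a screenshot, the per-class lines can be pulled out of log.txt with a regular expression. This sketch assumes the output format produced by the reference code in the appendix (`Accuracy for class <name> is: <value> %`); `parse_class_accuracy` is a hypothetical helper.

```python
import os
import re

# Matches lines such as "Accuracy for class car   is: 12.3 %".
LINE_RE = re.compile(r"Accuracy for class\s+(\S+)\s+is:\s+([0-9.]+)\s*%")


def parse_class_accuracy(text):
    """Return {class name: accuracy in percent} for every matching line."""
    return {name: float(value) for name, value in LINE_RE.findall(text)}


if __name__ == "__main__":
    if os.path.exists("log.txt"):
        with open("log.txt", encoding="utf-8", errors="replace") as f:
            for name, acc in parse_class_accuracy(f.read()).items():
                print(f"{name:>6s}: {acc:.1f}")
    else:
        print("log.txt not found")
```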

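When a run fails

If the code contains an error, the job status shows EXIT. In that case, use the error message to trace where the error occurred.
In the logs directory, the error file holds the error output; use its message to fix the code. (The logs directory must be created manually before submitting the job.)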
![[Pasted image 20240327145911.png]]
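To avoid hunting through the folder, the most recent error file can be located by modification time. This is a sketch under the assumption that the files are named logs/error.<job id>, as the `-e logs/error.%J` line in submit.sh suggests; `latest_error_log` is a made-up helper name.

```python
import glob
import os


def latest_error_log(logs_dir="logs"):
    """Return the most recently modified error.* file, or None if there is none."""
    candidates = glob.glob(os.path.join(logs_dir, "error.*"))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    path = latest_error_log()
    if path is None:
        print("no error files found")
    else:
        with open(path, encoding="utf-8", errors="replace") as f:
            tail = f.readlines()[-20:]  # the end of a traceback names the failing call
        print(path)
        print("".join(tail), end="")
```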

Appendix

Reference code for cifar_classifier.py

# This reference code is adapted from code written by instructor 齐飞 (Qi Fei)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np

transform = transforms.Compose(
[transforms.ToTensor(),
 transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 128
n_epochs = 2
n_batches = 16

trainset = torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10', train=True,
                                    download=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                      shuffle=True, num_workers=8)

testset = torchvision.datasets.CIFAR10(root='/apps/data/ai/cifar10', train=False,
                                   download=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                     shuffle=False, num_workers=8)

classes = ('plane', 'car', 'bird', 'cat',
       'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Network definition
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 32, 5)
        self.fc = nn.Linear(32 * 5 * 5, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = self.fc(x)
        return x
        
mynet = MyNet()
print(mynet)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(mynet.parameters(), lr=0.001, momentum=0.9)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

mynet.to(device)
for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()

        outputs = mynet(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # accumulate the loss and print its average every n_batches mini-batches
        running_loss += loss.item()
        if (i+1) % n_batches == 0:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / n_batches))
            running_loss = 0.0

model_path = 'cifar_net.pth'
torch.save(mynet.state_dict(), model_path)

mynet = MyNet()
mynet.load_state_dict(torch.load(model_path))

correct = 0
total = 0

with torch.no_grad():
    for data in testloader:
        images, labels = data
        # calculate outputs by running images through the network
        outputs = mynet(images)
        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test set: %3.1f %%' % (100*correct/total))


# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = mynet(images)
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                     accuracy))
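
The `32 * 5 * 5` input size of the final linear layer in `MyNet` follows from the spatial size of the feature maps: a 5x5 convolution without padding shrinks each side by 4, and each 2x2 max-pool halves it. Below is a short worked check in plain Python. `conv_out` and `pool_out` are illustrative helpers using the standard output-size formula, with stride 1 and no padding for the convolutions, as in the code above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output side length of a max-pool over a square input."""
    return (size - kernel) // stride + 1


side = 32                            # CIFAR-10 images are 32x32
side = pool_out(conv_out(side, 5))   # conv1: 32 -> 28, pool: 28 -> 14
side = pool_out(conv_out(side, 5))   # conv2: 14 -> 10, pool: 10 -> 5
features = 32 * side * side          # conv2 has 32 output channels
print(side, features)
```

If the kernel size or channel count of `conv2` is changed, the first argument of `nn.Linear` has to be updated to match.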


posted @ 2024-11-23 16:26  黑鹿kuka