[Hardware/Software Environment and Tool Installation and Use] Using edgeai-torchvision

Preface

I. Installing the edgeai-torchvision environment

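First, note the installation order: install torch in the virtual environment, then install torchvision, and build torchvision from this repository's source, because "the standard torchvision will not support all the features in this repository". My system CUDA version is 11.7, but edgeai-torchvision currently supports only up to CUDA 11.3, so I installed the PyTorch and torchvision versions built for CUDA 11.3. Following setup.sh, that means PyTorch 1.10.0 and torchvision 0.11.0; the other dependencies only need versions that work with these.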

However, this fails with the following error:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.

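I tried several approaches, and all of them failed. After reading the setup.py code closely, I realized that the source build of torchvision does not link against the virtual environment's CUDA but against the system CUDA version.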

edgeai-torchvision/torchvision/extension.py

def _check_cuda_version():
    """
    Make sure that CUDA versions match between the pytorch install and torchvision install
    """
    if not _HAS_OPS:
        return -1
    import torch
    _version = torch.ops.torchvision._cuda_version()
    if _version != -1 and torch.version.cuda is not None:
        tv_version = str(_version)
        if int(tv_version) < 10000:
            tv_major = int(tv_version[0])
            tv_minor = int(tv_version[2])
        else:
            tv_major = int(tv_version[0:2])
            tv_minor = int(tv_version[3])
        t_version = torch.version.cuda
        t_version = t_version.split('.')
        t_major = int(t_version[0])
        t_minor = int(t_version[1])
        if t_major != tv_major or t_minor != tv_minor:
            raise RuntimeError("Detected that PyTorch and torchvision were compiled with different CUDA versions. "
                               "PyTorch has CUDA Version={}.{} and torchvision has CUDA Version={}.{}. "
                               "Please reinstall the torchvision that matches your PyTorch install."
                               .format(t_major, t_minor, tv_major, tv_minor))
    return _version
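
To make the decoding in _check_cuda_version concrete, here is a minimal standalone sketch of the same string slicing. The helper name decode_cuda_version is hypothetical, and the sample integers assume the usual CUDA_VERSION encoding (major * 1000 + minor * 10), for example 11030 for CUDA 11.3.

```python
def decode_cuda_version(version: int) -> tuple:
    """Mirror the slicing used in torchvision's _check_cuda_version.

    Hypothetical helper for illustration; it is not part of torchvision.
    """
    text = str(version)
    if version < 10000:
        # Four digits, e.g. 9020: major is text[0], minor is text[2]
        major, minor = int(text[0]), int(text[2])
    else:
        # Five digits, e.g. 11030: major is text[0:2], minor is text[3]
        major, minor = int(text[0:2]), int(text[3])
    return major, minor


if __name__ == "__main__":
    for raw in (9020, 11030, 11070):
        print(raw, "->", "{}.{}".format(*decode_cuda_version(raw)))
```

With PyTorch built for CUDA 11.3 and torchvision built against 11.7, the two decoded pairs are (11, 3) and (11, 7), which is the mismatch reported in the error above.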

/home/xxx/miniconda3/envs/edgeaitv/lib/python3.8/site-packages/torch/utils/cpp_extension.py

def _check_cuda_version(self):
    if CUDA_HOME:
        nvcc = os.path.join(CUDA_HOME, 'bin', 'nvcc')
        cuda_version_str = subprocess.check_output([nvcc, '--version']).strip().decode(*SUBPROCESS_DECODE_ARGS)
        cuda_version = re.search(r'release (\d+[.]\d+)', cuda_version_str)
        if cuda_version is not None:
            cuda_str_version = cuda_version.group(1)
            cuda_ver = packaging.version.parse(cuda_str_version)
            torch_cuda_version = packaging.version.parse(torch.version.cuda)
            if cuda_ver != torch_cuda_version:
                # major/minor attributes are only available in setuptools>=49.6.0
                if getattr(cuda_ver, "major", float("nan")) != getattr(torch_cuda_version, "major", float("nan")):
                    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
                warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))

    else:
        raise RuntimeError(CUDA_NOT_FOUND_MESSAGE)

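These two pieces of source code show the main cause: when torchvision is built from source, the CUDA version is read from the nvcc under CUDA_HOME, so the system CUDA version has to match the CUDA version of the PyTorch in the virtual environment. The build-time check in cpp_extension.py only warns when the minor versions differ (11.7 vs 11.3), so the build itself succeeds and the mismatch surfaces later, in the check in extension.py. The system currently has CUDA 11.7, while building edgeai-torchvision requires CUDA 11.3 taken from the system, so CUDA 11.3 has to be installed as well, in a way that makes it easy to switch back to CUDA 11.7 later. For the installation steps, see the post "【软硬件环境及工具安装】nvidia驱动/CUDA版本关系及CUDA安装" (NVIDIA driver/CUDA version relationship and CUDA installation).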
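
The build-time check can be approximated without importing torch. The sketch below applies a close variant of the regular expression from cpp_extension.py to a sample nvcc --version line and classifies the result. The function names, the sample text, and the returned labels are assumptions made for illustration, and plain integer tuples stand in for packaging.version.

```python
import re

# Illustrative sample of the line that `nvcc --version` prints
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 11.7, V11.7.64"


def nvcc_release(nvcc_output: str):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)[.](\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def compare_cuda(nvcc_output: str, torch_cuda: str) -> str:
    """Classify system CUDA against the CUDA version PyTorch was built with.

    Returns 'ok', 'warn' (only the minor version differs), 'error' (the
    major version differs) or 'unparsed' (no release string was found).
    """
    system = nvcc_release(nvcc_output)
    if system is None:
        return "unparsed"
    wanted = tuple(int(part) for part in torch_cuda.split(".")[:2])
    if system == wanted:
        return "ok"
    if system[0] != wanted[0]:
        return "error"
    return "warn"


if __name__ == "__main__":
    # In a real environment the second argument would be torch.version.cuda
    print(compare_cuda(SAMPLE_NVCC_OUTPUT, "11.3"))
```

Running the same comparison after pointing CUDA_HOME at a CUDA 11.3 installation is a quick way to confirm the switch before rebuilding torchvision.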

Error 1:

    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.

This problem is tied to the NumPy version. Either fix the offending code or install a NumPy version that still provides the alias:

1) numpy.int was deprecated in NumPy 1.20 and removed in NumPy 1.24.
   Change it to numpy.int_, or just int.

2) pip3 install numpy==1.19
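
If you prefer option 1 and the alias appears in many places, the replacement can be scripted. This is a minimal sketch, not something the repository provides: replace_np_int is a hypothetical helper that rewrites np.int and numpy.int to int in a source string while leaving np.int32, np.int64, and np.int_ alone.

```python
import re

# Match np.int / numpy.int only when "int" ends the name, so np.int32,
# np.int64 and np.int_ are left untouched.
_NP_INT = re.compile(r"\b(?:np|numpy)\.int\b")


def replace_np_int(source: str) -> str:
    """Return source with the removed alias replaced by the builtin int."""
    return _NP_INT.sub("int", source)


if __name__ == "__main__":
    line = "idx = labels.astype(np.int); big = np.int64(7)"
    print(replace_np_int(line))
```

The pattern is purely textual, so it also rewrites matches inside comments and string literals; review the diff before committing the result.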

Error 2:

packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'

This is a setuptools version problem: the installed setuptools is too new.

Reference: "AttributeError: module 'distutils' has no attribute 'version'" (solution write-up)

# Use pip here. Do not run conda uninstall setuptools: conda resolves every package that depends on setuptools and removes them all, so answering y means reconfiguring the whole environment.
pip3 uninstall setuptools
pip3 install setuptools==59.5.0
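
Before downgrading, it can help to see which setuptools version is actually installed. The sketch below compares it with the pin used above. It assumes only that 59.5.0 is a version known to work; it does not establish which newer releases fail, and both helper names are hypothetical.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '59.5.0' into (59, 5, 0), stopping at non-numeric suffixes."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "post1" or "dev0"
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than_pin(installed: str, pin: str = "59.5.0") -> bool:
    """True if the installed version is newer than the pinned version."""
    return version_tuple(installed) > version_tuple(pin)


if __name__ == "__main__":
    try:
        installed = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        print("setuptools is not installed in this environment")
    else:
        print(installed, "newer than pin:", is_newer_than_pin(installed))
```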

II. Testing the environment

1. Image classification

Run the script directly:

sh run_edgeailite_classification.sh

Alternatively, run the command directly:

python ./references/edgeailite/scripts/train_classification_main.py --dataset_name cifar100_classification --model_name mobilenetv2_tv_x1 --data_path ./data/datasets/cifar100_classification --img_resize 32 --img_crop 32 --rand_scale 0.5 1.0

Error:

edgeai-torchvision/references/edgeailite/engine/train_classification.py", line 695, in validate
    progress_bar.set_postfix(Epoch='{}'.format(status_str))
TypeError: set_postfix() missing 1 required positional argument: 'postfix'

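The cause is that the call in the source does not match the function's signature: set_postfix requires one positional argument, postfix, but the code passes only the keyword Epoch. Change the call to: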

progress_bar.set_postfix('Epoch={}'.format(status_str))
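
The failure is easy to reproduce in isolation. In the sketch below, ProgressBar is a hypothetical stand-in whose set_postfix signature is assumed from the error message (one required positional argument plus optional keywords); it is not the repository's actual class.

```python
class ProgressBar:
    """Hypothetical stand-in for the progress bar used by the training script."""

    def __init__(self):
        self.postfix = ""

    def set_postfix(self, postfix, **kwargs):
        # `postfix` is required; extra keyword arguments are accepted but unused
        self.postfix = postfix


def try_keyword_call(bar, status):
    """Call set_postfix the way the original source does; return the error text."""
    try:
        bar.set_postfix(Epoch="{}".format(status))
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    bar = ProgressBar()
    print(try_keyword_call(bar, 3))                 # keyword-only call fails
    bar.set_postfix("Epoch={}".format(3))           # positional call works
    print(bar.postfix)
```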

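The script first trains the model, then runs quantization training starting from the trained model, and finally validates to estimate the accuracy of the quantized result. Working through it gives a basic understanding of the logic and overall flow of the classification pipeline.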

Each stage produces three files: the trained PyTorch model file, the converted ONNX model file, and the TorchScript model file.

2. Semantic segmentation

Adjust the configuration parameters to match your hardware and software environment, then run the script:

sh run_edgeailite_segmentation.sh

Error 1:

edgeai-torchvision/torchvision/edgeailite/xvision/datasets/cityscapes_plus.py", line 519, in cityscapes_segmentation
    train_split = CityscapesDataLoader(dataset_config, root, split_name, gt, transforms=transforms[0],
TypeError: __init__() got an unexpected keyword argument 'annotation_prefix'

Reference: "Python *args and **kwargs explained" (惊瑟's blog, CSDN)

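Replacing the failing line with a call that does not pass the annotation_prefix argument (see the earlier version of the code) solves the problem: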

Modelmaker integration v1 · TexasInstruments/edgeai-torchvision@f108240

Use:

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms)

to replace the original:

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms, annotation_prefix=args.annotation_prefix)
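
As an alternative to editing the call by hand, a caller can drop keyword arguments that the target does not accept. This is a generic sketch of that idea, not the fix used in this post or in the repository: call_with_supported_kwargs and LegacyLoader are hypothetical names introduced for illustration.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


class LegacyLoader:
    """Hypothetical loader whose __init__ has no annotation_prefix parameter."""

    def __init__(self, root, split, transforms=None):
        self.root = root
        self.split = split
        self.transforms = transforms


if __name__ == "__main__":
    loader = call_with_supported_kwargs(
        LegacyLoader, "./data", "train",
        transforms=None, annotation_prefix="instances",
    )
    print(loader.root, loader.split)
```

Silently discarding arguments can hide real configuration mistakes, so removing the argument at the call site, as done above, is the more transparent fix.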

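Understanding the statement above:

Reference: "Python __dict__ attribute explained" (星空778, 博客园)

In Python, __dict__ is a special attribute that holds an object's writable attributes as a dictionary. For a module such as xvision.datasets, __dict__ contains every function, class, and variable defined in or imported into that module.
The statement does the following:
xvision.datasets: a module that exposes several dataset classes and functions.
xvision.datasets.__dict__: a dictionary whose keys are the names defined in the xvision.datasets module (function names, class names, and so on) and whose values are the corresponding objects.
args.dataset_name: a string taken from the command-line arguments or a configuration file; it must match the name of a function or class in xvision.datasets.
xvision.datasets.__dict__[args.dataset_name]: looks up the object with that name in the dictionary. In the traceback above, the call ends up in the function cityscapes_segmentation.
Finally, the object is called with args.dataset_config, args.data_path, split=split_arg, and transforms=transforms. The call is expected to return a tuple of two dataset instances, which is unpacked into train_dataset and val_dataset, so the looked-up object acts as a factory function rather than a plain class constructor.
In short, __dict__ lets you access the members of a module, class, or instance dynamically, which is useful for plugin systems and dynamic loading. Overusing it makes code harder to read and maintain, because static analysis tools and IDEs cannot follow a name that is only known at run time. Where possible, prefer explicit imports and references.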
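
The lookup pattern described above can be tried on a toy module. Everything in this sketch is a stand-in: datasets imitates xvision.datasets, and toy_segmentation is a hypothetical factory that returns a (train, val) pair the way the training script expects.

```python
import types

# Hypothetical stand-in for the xvision.datasets module
datasets = types.ModuleType("datasets")


def toy_segmentation(dataset_config, root, split=None, transforms=None):
    """Factory returning a (train, val) pair of placeholder datasets."""
    train = {"root": root, "split": "train", "config": dataset_config}
    val = {"root": root, "split": "val", "config": dataset_config}
    return train, val


# Setting an attribute on a module adds an entry to its __dict__
datasets.toy_segmentation = toy_segmentation

dataset_name = "toy_segmentation"  # would come from args.dataset_name

# Dictionary lookup, as in the training script
train_dataset, val_dataset = datasets.__dict__[dataset_name](None, "./data")

# Equivalent lookup with getattr, which is the more common spelling
factory = getattr(datasets, dataset_name)

if __name__ == "__main__":
    print(train_dataset["split"], val_dataset["split"])
    print(factory is datasets.__dict__[dataset_name])
```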

Error 2:

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

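Cause (reference: "AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'", 软件测试大叔's blog, CSDN):

The ANTIALIAS constant was removed in Pillow 10.0.0. Use PIL.Image.LANCZOS or PIL.Image.Resampling.LANCZOS instead (this is exactly the same algorithm that ANTIALIAS referred to; it is just no longer accessible under the name ANTIALIAS). Alternatively, downgrade to an older Pillow version: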

import PIL
print(PIL.__version__)

pip uninstall -y Pillow
pip install Pillow==9.5.0
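
If you would rather patch the code than pin Pillow, the filter can be resolved by name at run time. The sketch below is one possible approach: resolve_lanczos is a hypothetical helper, and the SimpleNamespace objects only imitate the attribute layout of PIL.Image so that the example runs without Pillow installed.

```python
from types import SimpleNamespace


def resolve_lanczos(image_module):
    """Return the LANCZOS filter from a PIL.Image-like module."""
    resampling = getattr(image_module, "Resampling", None)
    if resampling is not None:
        return resampling.LANCZOS      # Image.Resampling.LANCZOS
    if hasattr(image_module, "LANCZOS"):
        return image_module.LANCZOS    # Image.LANCZOS
    return image_module.ANTIALIAS      # legacy name, removed in Pillow 10.0.0


# Stand-ins for illustration; the string values are labels, not real filters
new_style = SimpleNamespace(
    Resampling=SimpleNamespace(LANCZOS="resampling"), LANCZOS="lanczos"
)
old_style = SimpleNamespace(LANCZOS="lanczos", ANTIALIAS="antialias")

if __name__ == "__main__":
    print(resolve_lanczos(new_style), resolve_lanczos(old_style))
```

With Pillow installed, passing the real module (from PIL import Image, then resolve_lanczos(Image)) gives a value that can be used wherever the code previously passed Image.ANTIALIAS.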

III. Design task

References

1. Version relationships between torch, torchvision, and CUDA for installation;

2. github_edgeai-torchvision；

3. github_torchvision；

End

posted on 2023-08-17 18:32 by 鹅要长大
