YOLOv7 Training Commands

1. Multi-GPU distributed training in the background:

# Train from scratch in the background. Don't forget the trailing &: combined with nohup, it is meant to keep the training job running on the remote server even after the session is closed.
nohup python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 16 --device 0,1 --sync-bn --batch-size 128 &
# Resume training in the background
nohup python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 16 --device 0,1 --sync-bn --batch-size 128 --resume &

# Redirect the output to a file
nohup python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 16 --device 0,1 --sync-bn --batch-size 128 --resume > train.txt &
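
The redirection above uses > alone, which names only standard output. As a minimal sketch of capturing both streams explicitly, the snippet below sends stdout and stderr of a stand-in command to one log file with 2>&1. The emit function and the temporary log file are hypothetical names used only for this illustration; they are not part of YOLOv7.

```shell
# Hypothetical stand-in for a program that writes to both streams.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Temporary file used as the log for this demo.
log_file="$(mktemp)"

# "> file" redirects stdout; "2>&1" then points stderr at the same file.
emit > "$log_file" 2>&1

# Count the captured lines: both messages should be in the log.
captured_lines=$(wc -l < "$log_file" | tr -d ' ')
echo "captured $captured_lines lines in $log_file"
```

Applied to the training command, this would mean writing > train.txt 2>&1 & at the end of the line.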

Reference: https://www.cnblogs.com/yunwangjun-python-520/p/10713564.html

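One problem I ran into: GPU memory was full and GPU-Util was at its maximum, but Pwr:Usage/Cap showed only idle power draw.

Fix: https://blog.csdn.net/bagba/article/details/113124482

I solved the problem using method 2 from that post.

Even with nohup, the training process died when I closed the terminal, with this error: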

torch.distributed.elastic.multiprocessing.api.SignalException: Process 15150 got signal: 1

So it looks like I have to use screen instead.
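
To read the error message, it helps to map the signal number to its name. The sketch below asks the shell for the name of signal 1, which is the hangup signal sent when the controlling terminal closes. Why nohup did not protect the process here is not confirmed in this post; one plausible explanation, stated as an assumption, is that the launcher installs its own handler for this signal and so overrides the "ignore" setting that nohup applies.

```shell
# Look up the name of the signal number reported in the error ("got signal: 1").
sig_name=$(kill -l 1)
echo "signal 1 is SIG${sig_name}"
```

Running the job inside screen avoids the problem differently: the screen session keeps its own terminal alive after the SSH connection drops, so no hangup is delivered to the training process.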

CUDA_VISIBLE_DEVICES=0 nohup python xx.py > out.log &
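CUDA_VISIBLE_DEVICES=0 must be placed before nohup, not after it. > is the redirection operator, and & runs the job in the background. 0 selects GPU 0; write 1,2 to use GPUs 1 and 2.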
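
The placement rule follows from how the shell treats a VAR=value prefix: it is applied to the environment of the single command that follows. If the assignment came after nohup, nohup would try to run a program literally named CUDA_VISIBLE_DEVICES=0. The sketch below shows the prefix form with a child sh process standing in for the training script.

```shell
# The VAR=value prefix sets the variable only for this one command.
seen_by_child=$(CUDA_VISIBLE_DEVICES=1,2 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
echo "child process sees: $seen_by_child"

# The calling shell is unaffected (prints <unset> unless it was already set).
echo "calling shell sees: ${CUDA_VISIBLE_DEVICES:-<unset>}"
```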

1) Install:

Ubuntu:

sudo apt-get install screen

CentOS:

sudo yum install screen

2) Create a screen session

screen -S name

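3) Create a screen session with logging enabled. This generates a screenlog.0 file in the current directory (or possibly the home directory), which records the commands and output from inside the session.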

screen -L -S session_name
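# With the usage below, no extra redirection is needed: nohup automatically creates a nohup.out file containing exactly what would have been printed to the terminal.
# Usage:
nohup bash xx.sh &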

Recommended.

4) List existing screen sessions

screen -ls

5) Detach from the current session

Ctrl + a, then d

6) Reattach to a session

screen -r name

7) Kill the current session

Shortcut: Ctrl + a, then k
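
After detaching and reattaching a few times it is easy to lose track of whether the current shell is inside a screen session. A minimal sketch of a guard, relying on the STY environment variable that screen sets inside its sessions; the function name in_screen is hypothetical.

```shell
# Return success when the shell is running inside a screen session.
# screen exports STY (the session name) to processes it starts.
in_screen() {
    [ -n "${STY:-}" ]
}

if in_screen; then
    echo "inside screen session: $STY"
else
    echo "not inside screen; start one with: screen -S name"
fi
```

Such a check could go at the top of a launch script so that a long training run is not started from a plain SSH shell by mistake.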

Other references:

How to fix the ABCD garbage characters that appear when scrolling with the arrow keys in Linux screen? - answer by 理心炼丹 - Zhihu
https://www.zhihu.com/question/310784716/answer/1431959121

https://zhuanlan.zhihu.com/p/107802400?from_voters_page=true

So the final solution is:

screen -S name
# Inside the terminal created by screen, use nohup and redirect the output to a file of your choice. You can also let screen write the log itself; see the links above.
nohup python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 16 --device 0,1 --sync-bn --batch-size 96 --resume > xx.txt &
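
In the commands in this post, --nproc_per_node always equals the number of GPUs listed in --device. As a small sketch of keeping the two in step, the helper below derives the process count from the device list and prints the resulting command without running it. The names count_devices, devices, and train_cmd are hypothetical; the flags themselves are copied from the command above.

```shell
# Count the comma-separated entries in a device list, e.g. "0,1" -> 2.
count_devices() {
    echo "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

devices="0,1"
nproc=$(count_devices "$devices")

# Assemble the command as a string; this only prints it, it does not launch training.
train_cmd="python -m torch.distributed.launch --nproc_per_node $nproc --master_port 9527 train.py --workers 16 --device $devices --sync-bn --batch-size 96 --resume"
echo "$train_cmd"
```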

2. Continuing training when the model has not converged

# Set --epochs to the target number, point --weights at last.pt, and do not add --resume
nohup python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --epochs 50 --weights runs/train/xx/weights/last.pt --workers 16 --device 0,1 --sync-bn --batch-size 96 > xxx.txt &

# Here I also turned off mosaic in the --hyp file by setting mosaic: 0  # image mosaic (probability), originally 1.0
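
Because this run starts from a checkpoint path typed by hand, a quick existence check before launching can save a failed start. A minimal sketch, using a temporary file as a stand-in for last.pt; the function name check_weights is hypothetical.

```shell
# Report whether the given weights file exists; non-zero status if it does not.
check_weights() {
    if [ -f "$1" ]; then
        echo "found weights: $1"
        return 0
    else
        echo "missing weights: $1" >&2
        return 1
    fi
}

# Demo with a temporary file standing in for runs/train/xx/weights/last.pt.
demo_weights="$(mktemp)"
check_weights "$demo_weights"
```

In practice the call could be chained in front of the launch line with &&, so that the training command runs only when the checkpoint is present.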

posted @ 2022-09-01 17:23 Zenith_Hugh
