A script for automatically queuing and running experiments on a server

"""
Queue experiment commands and start them on GPUs as the GPUs become free.
"""
import os
import sys
import time

cmd_queue = []  # declared but never used below

maximal_concurrent_process = 1

def check(path):
    # Debug helper: parse a saved copy of the nvidia-smi output and print,
    # for GPU ids 0-7, whether the GPU counts as available and how many
    # python processes are listed for it.
    with open(path) as f:
        content = f.readlines()
    new_contents = [line for line in content if "MiB" in line]
    for i in range(0, 8):
        print(i, is_GPU_available_do(new_contents, 8000, i))
        print(get_number_of_running_process(new_contents, i))

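def get_number_of_running_process(content, gpu_num=0):
    # Count the python processes listed for one GPU in the "MiB" lines of the
    # nvidia-smi output. Process rows start with the GPU index after the
    # leading "|".
    count = 0
    for line in content:
        fields = line.strip().lstrip("|").split()
        # Compare the whole first field so that GPU 1 does not also match GPU 10.
        if fields and fields[0] == str(gpu_num) and "python" in line:
            count = count + 1
    return count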

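def get_number_of_running_pythons():
    # Count processes whose command line contains "src/main.py"; one line is
    # subtracted for the grep process itself.
    cmd = "ps aux|grep src/main.py"
    content = os.popen(cmd).readlines()
    count = len(content)-1
    return count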


def is_GPU_available_do(content, capacity=12500, gpu_num=0, maximal_concurrent_process=2):
    # content holds the "MiB" lines of the nvidia-smi output. The per-GPU
    # memory rows come first, so content[gpu_num] is the row of that GPU.
    if gpu_num >= len(content):
        return False
    count = get_number_of_running_process(content, gpu_num)
    line = content[gpu_num]
    if line.find("MiB") >= 0:
        # Used memory sits between the second "|" and the first "MiB" after it.
        pos1 = line.find("|")
        pos1 = line.find("|", pos1 + 1)
        pos2 = line.find("MiB", pos1)
        usage = int(line[pos1 + 1:pos2].strip())
        if usage <= capacity and count < maximal_concurrent_process:
            return True
    return False

def is_GPU_available(capacity=12500, gpu_num=0, maximal_concurrent_process=2):
    cmd = "nvidia-smi|grep MiB"
    content = os.popen(cmd).readlines()
    return is_GPU_available_do(content, capacity, gpu_num, maximal_concurrent_process)

def read_commands(file_path):
    with open(file_path) as f:
        return f.readlines()

def write_cmd_back(cmds, file_path):
    with open(file_path, "w") as f:
        for cmd in cmds:
            f.write(cmd)
            if not cmd.endswith("\n"):
                f.write("\n")

if __name__ == "__main__":
    if (len(sys.argv)==1):
        check("../results/test_input")
        sys.exit(0)
    usage = "usage: cmd_path, max_process, max_memory, gpuid1, gpuid2, gpuid3 ...."
    print(usage)
    # Check the argument count first: at least one GPU id is required,
    # which means five entries in sys.argv.
    if len(sys.argv) < 5:
        print("usage: nohup python src/smart_runner.py ./cmds 1 15000 0 1 2 3&")
        sys.exit(1)
    cmd_path = sys.argv[1]
    maximal_concurrent_process = int(sys.argv[2])
    if maximal_concurrent_process > 4:
        print(usage)
        print("the max_process is the maximal number of experiments allowed to run on one GPU, you set it to be", maximal_concurrent_process, ". it is too large, set it to be a number <=4")
        sys.exit(1)
    maximal_usage = int(sys.argv[3])
    gpu_ids = [int(arg) for arg in sys.argv[4:]]
    print("The command path is", cmd_path)
    command_file_path = cmd_path
    while True:
        for i in range(0, len(gpu_ids)):
            total_python_runs = get_number_of_running_pythons()
            gpu = gpu_ids[i]
            if is_GPU_available(maximal_usage, gpu, maximal_concurrent_process):
                # A process that is about to start may not be using the GPU yet,
                # so nvidia-smi cannot see it; cap the total process count as well.
                if maximal_concurrent_process * len(gpu_ids) < total_python_runs:
                    print("the maximal allowed process is ", maximal_concurrent_process * len(gpu_ids), "current running src/main.py", total_python_runs, "so not run")
                    time.sleep(60)
                    continue
                cmds = read_commands(command_file_path)         # read the pending commands from the file
                if len(cmds) == 0:
                    continue
                cmd = cmds.pop(0)   # take the first line of the file
                write_cmd_back(cmds, command_file_path)        # save the remaining commands
                cmd = "CUDA_VISIBLE_DEVICES="+str(gpu) + " " + cmd
                print(cmd)
                os.system(cmd) # run the command; this waits unless the command ends with "&"
                time.sleep(30) # avoid starting too many tasks while one is still initializing
                if cmd.find("aq")>=0 or cmd.find("qgraph")>=0 or cmd.find("ar")>=0:
                    time.sleep(60)

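This is a smart runner script designed for a machine with multiple GPUs. It schedules commands automatically based on resource availability: the GPU memory currently in use and the number of running processes. Each part of the code is described briefly below.

Importing the required libraries:
- time pauses the process for a given period.
- os lets us run operating system commands.
- sys gives access to the command-line arguments passed to the script.
Global variables:
- cmd_queue is intended to hold the commands waiting to run, but the script never uses it; the commands are read from the command file instead.
- maximal_concurrent_process is the maximum number of processes allowed to run in parallel on a single GPU.
Core functions:
- check() reads a saved copy of the nvidia-smi output from a file and prints which GPUs are available.
- get_number_of_running_process() returns the number of Python processes that nvidia-smi currently lists for a given GPU.
- get_number_of_running_pythons() returns the number of running processes whose command line contains src/main.py, across all GPUs.
- is_GPU_available_do() and is_GPU_available() check that a given GPU's used memory is at or below the threshold and that its number of parallel processes is below the limit.
- read_commands() and write_cmd_back() are helper functions for reading and rewriting the command file.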
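The memory check in is_GPU_available_do() relies on string positions in one row of the nvidia-smi table. The sketch below isolates that step so it can be tried without a GPU. The function name parse_used_mib and the sample row are made up for illustration, and the row assumes the classic table layout of nvidia-smi; other driver versions may format it differently.

```python
def parse_used_mib(line):
    """Return the used memory in MiB from one nvidia-smi memory row."""
    first = line.find("|")                # leading border of the table
    second = line.find("|", first + 1)    # border before the memory column
    end = line.find("MiB", second)        # first "MiB" after that border
    return int(line[second + 1:end].strip())

# Hypothetical row, written by hand for this example.
sample = "| 30%   45C    P2    70W / 250W |   1234MiB / 11178MiB |      0%      Default |"
print(parse_used_mib(sample))  # 1234
```

With a threshold of 8000, this GPU would count as available as long as its process count is also below the limit.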
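Main logic:
- The script first checks how many arguments it received. With no arguments, it runs check() on ../results/test_input and prints the availability of GPUs 0 to 7 according to that file.
- The script expects the following arguments:
  - cmd_path: the path of the command file.
  - maximal_concurrent_process: the maximum number of processes allowed to run in parallel on a single GPU.
  - maximal_usage: the GPU memory threshold in MiB; a GPU counts as available only while its used memory is at or below this value.
  - The remaining arguments are the GPU IDs.
- In an infinite loop, the script checks whether each GPU is available. If it is, the script reads the top command from the command file, removes it from the file, and runs it with CUDA_VISIBLE_DEVICES set to that GPU. The script then pauses for a while so that it does not start too many tasks at once.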
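The "take the top command, then launch it on one GPU" step can be sketched on its own. Note that this is a variation and not the script's own code: it uses subprocess.Popen, which returns immediately, where the script uses os.system. The names pop_first_command and launch_on_gpu are hypothetical.

```python
import os
import subprocess

def pop_first_command(file_path):
    """Remove and return the first non-empty line of the command file, or None."""
    with open(file_path) as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    if not lines:
        return None
    first, rest = lines[0], lines[1:]
    with open(file_path, "w") as f:
        f.writelines(ln + "\n" for ln in rest)
    return first

def launch_on_gpu(cmd, gpu):
    """Start cmd through the shell with only the given GPU visible; do not wait."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return subprocess.Popen(cmd, shell=True, env=env)
```

Because Popen does not wait, the commands in the file would not need a trailing & with this approach. Like the original script, the sketch does not lock the file, so it assumes a single scheduler instance per command file.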

In short, the script aims to schedule and run commands automatically under limited resources while making sure that no GPU is overloaded.

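To run the smart_runner.py script, you need to meet a few prerequisites and pass the right arguments for your setup. The detailed steps follow.

Prerequisites:
- You need a machine running Linux, because the script uses os.popen to run Unix commands (ps, grep).
- Make sure a Python environment is available on the machine.
- The script reads the GPU state through nvidia-smi, which is installed together with the NVIDIA driver, so it must be on your PATH.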
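A quick way to confirm these prerequisites before starting a long run is a small check like the one below. It is only a sketch: the function name missing_prerequisites is hypothetical, and it checks that the tools are on PATH, not that a GPU is actually usable.

```python
import platform
import shutil

def missing_prerequisites():
    """Return a list of human-readable problems; empty if everything is found."""
    problems = []
    if platform.system() != "Linux":
        problems.append("not running on Linux")
    # The scheduler shells out to these three tools.
    for tool in ("nvidia-smi", "ps", "grep"):
        if shutil.which(tool) is None:
            problems.append(tool + " not found on PATH")
    return problems

print(missing_prerequisites())
```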
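Preparing the command file:
The script gets the commands to run from a command file. This should be a plain text file in which each line is one command. The script runs each command with os.system, which waits for the command to return, so a command should end with & if you want several experiments to run at the same time. The total process count comes from ps aux|grep src/main.py, so only commands that run src/main.py are counted against the limit.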
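If the experiments differ only in a few parameters, the command file can be generated instead of written by hand. The sketch below appends & to each line because os.system waits for a foreground command to finish. The helper name write_command_file, the --seed flag, and the log file names are placeholders, not part of the original script.

```python
import os
import tempfile

def write_command_file(path, commands):
    """Write one shell command per line, each ending with '&'."""
    with open(path, "w") as f:
        for cmd in commands:
            cmd = cmd.strip()
            if not cmd.endswith("&"):
                cmd += " &"
            f.write(cmd + "\n")

# Hypothetical experiment commands; the flag and log names are placeholders.
commands = ["python src/main.py --seed %d > log_%d.txt 2>&1" % (s, s) for s in range(3)]
path = os.path.join(tempfile.mkdtemp(), "cmds.txt")
write_command_file(path, commands)
print(open(path).read())
```

Each generated line runs src/main.py, so the scheduler's process count sees it.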
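Running the script:
In a terminal, change to the directory that contains the script and run:
```
python smart_runner.py [cmd_path] [max_process] [max_memory] [gpuid1] [gpuid2] ...
```
- cmd_path: the path of the file to read commands from.
- max_process: the maximum number of processes allowed to run in parallel on each GPU. The script rejects values above 4.
- max_memory: the GPU memory threshold in MiB. A new command is started on a GPU only while that GPU's used memory is at or below this value; it is not a per-process memory limit.
- gpuid1, gpuid2, ... : the IDs of the GPUs you want to use.
For example, if you have a command file named commands.txt and want to run commands on two GPUs (IDs 0 and 1), with at most 2 processes per GPU and new commands started only while a GPU uses no more than 15000 MiB, you can run:
```
python smart_runner.py commands.txt 2 15000 0 1
```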
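The script reads its arguments by position from sys.argv. As an alternative, the same four arguments could be declared with argparse, which produces the usage message and the type errors automatically. This is a sketch of that alternative, not what the script does; the function name parse_args is hypothetical, and the upper bound of 4 mirrors the check in the script.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="GPU experiment scheduler arguments")
    parser.add_argument("cmd_path", help="file with one command per line")
    parser.add_argument("max_process", type=int, choices=range(1, 5),
                        help="maximum parallel processes per GPU (1 to 4)")
    parser.add_argument("max_memory", type=int,
                        help="used-memory threshold in MiB for starting a command")
    parser.add_argument("gpu_ids", type=int, nargs="+", help="GPU ids to use")
    return parser.parse_args(argv)

args = parse_args(["commands.txt", "2", "15000", "0", "1"])
print(args.cmd_path, args.max_process, args.max_memory, args.gpu_ids)
# commands.txt 2 15000 [0, 1]
```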
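Other:
- If you only want to check a file named ../results/test_input without running any command, run python smart_runner.py with no arguments; the script checks that file automatically.

To keep the scheduler running in the background after you log out, start it with nohup. For example, on eight GPUs with at most 3 processes per GPU and a 16000 MiB threshold:
nohup python src/smart_runner.py ./cmds.txt 3 16000 0 1 2 3 4 5 6 7 &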

posted @ 2023-09-09 16:09 CharlesLC
