PyCharm: connecting to a remote server and debugging, including torch.distributed.launch

Do not repost this article without permission. WeChat (vx): 837007389

step1: Download PyCharm Professional


https://www.jetbrains.com/pycharm/download/other.html
First, you need PyCharm Professional with a valid license: the SSH interpreter and Deployment features used below are not available in the Community edition.

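step2 Set up automatic folder sync between the remote project and a local copy

My code lives on a remote server, so I need to sync it with a newly created local folder.
Create a folder named remote_0724_new and open it in PyCharm. When the current release, PyCharm 2023.1.4, opens an empty folder, it automatically creates a main.py file.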

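2.1 Tools -> Deployment -> Configuration

Click "+" in the top-left corner, choose SFTP, and give the server any name, for example "port_30975".

Enter the remote server's IP address, port, user name, and password, then test whether the connection succeeds.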
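If the connection test fails, it can help to first rule out a plain network problem. The following is a minimal sketch, not part of PyCharm: it only checks that a TCP connection to the SSH port can be opened. The function name can_connect is hypothetical, and the host and port in the usage note are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling can_connect("your.server.ip", 30975) with your own address and SSH port should return True before PyCharm's test can succeed. A True result says nothing about the user name or password; it only shows that the port is reachable.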

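2.2 Set up the synced folders

On the Mappings tab, set Local path to the local folder and Deployment path to the project directory on the server.

Excluded Paths, the rightmost tab, lists folders that should not be synced, such as data folders.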

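2.3 Download the code from the server to the local folder

This is a nested context menu, which is hard to capture in a screenshot. Right-click the project folder and, in the Deployment submenu, download the code from the server.

The File Transfer panel at the bottom shows the transfer log.

The code is now synced to the local folder.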

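2.4 Enable automatic upload: Tools -> Deployment -> Automatic Upload (Always)

With this enabled, any change you make locally in PyCharm is automatically applied on the server as well.

Browse Remote Host, further down in the same menu, opens a side panel showing the remote server's file tree; files can also be opened directly from that panel.

Test the sync yourself: edit a file in PyCharm and check that the change shows up on the remote server.

In the Terminal panel at the bottom, the drop-down arrow next to the tabs opens a terminal on the remote server, where you can inspect the file with vim to confirm that it was synced.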
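Instead of comparing files by eye in vim, the check can be done by comparing checksums. This is a small sketch under the assumption that Python is available on both machines; sha256_of is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run sha256_of("main.py") locally and again in the remote terminal (or use sha256sum main.py on the server). If the two digests are equal, the upload reached the server unchanged.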

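step3 Configure the interpreter

3.1 Ordinary scripts can be debugged directly

File -> Settings -> Project -> Python Interpreter -> Add Interpreter -> On SSH

Enter the host, port, user name, and password, click Next, and continue to step 4 of the wizard, where the project directory and Python are set.
I chose the second option, System Interpreter: my remote environment uses the system Python directly, not a conda virtual environment.
Set the path to Python; running which python on the remote server shows which one is in use.
For Sync folders, use the same local and remote folders configured earlier.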
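Once the interpreter is configured, a quick way to confirm that PyCharm really uses the remote Python is to run a tiny script through it. The sketch below is only an illustration; describe_interpreter is a hypothetical name.

```python
import importlib.util
import sys


def describe_interpreter() -> dict:
    """Collect the interpreter path, its version, and whether torch is installed."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "has_torch": importlib.util.find_spec("torch") is not None,
    }


print(describe_interpreter())
```

The executable entry should match the output of which python on the server, and has_torch should be True before moving on to the distributed training steps.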

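In principle you can now run and debug code on the remote server. My case is special, though: the code is PyTorch distributed training, and the command to run it is

python -m torch.distributed.launch --nproc_per_node=1 main.py

Clicking Run directly fails: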

/usr/bin/python /code_src_debug/main.py 
./logs/test

  File "/_src_debug/src/data.py", line 807, in compile_data
    train_sampler = DistributedSampler(traindata)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/distributed.py", line 65, in __init__
    num_replicas = dist.get_world_size()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Process finished with exit code 1

Pay close attention to the first line of the output, which is the command PyCharm actually ran:

/usr/bin/python /code_src_debug/main.py 
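The traceback above comes from DistributedSampler calling dist.get_world_size() when no process group exists, because main.py was started without the launcher. This article fixes that through the run configuration. As a different approach, not the one used here, a script can create a one-process group itself when it is started directly. The sketch below assumes the env:// initialization method and the "gloo" backend; ensure_dist_env and init_single_process_group are hypothetical helper names.

```python
import os


def ensure_dist_env(master_port: str = "29500") -> dict:
    """Fill in the variables the launcher would normally export.

    Existing values are left untouched, so a real launch is not affected.
    """
    defaults = {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": master_port,
        "RANK": "0",
        "WORLD_SIZE": "1",
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in defaults}


def init_single_process_group() -> None:
    """Create a one-process group so dist.get_world_size() stops failing."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    if dist.is_available() and not dist.is_initialized():
        ensure_dist_env()
        # Assumption: "gloo" is enough for single-process debugging.
        dist.init_process_group(backend="gloo", init_method="env://")
```

Calling init_single_process_group() before the DistributedSampler is built would let a plain python main.py run with a world size of 1. It changes the training script, which is why the rest of this article configures PyCharm instead.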

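3.2 PyTorch distributed training: runs, but cannot be debugged yet

The command PyCharm actually ran is missing the launcher part of python -m torch.distributed.launch --nproc_per_node=1 main.py, so add it:
Run -> Edit Configurations..., then enter -m torch.distributed.launch --nproc_per_node=1 in the Interpreter options field.

Click the Run button and the script now runs. This time the command PyCharm prints is: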

/usr/bin/python -m torch.distributed.launch --nproc_per_node=1 /code_src_debug/main.py 
./logs/test

3.3 PyTorch distributed training: debugging works

With the configuration from 3.2, starting a debug session fails:

/usr/bin/python -m torch.distributed.launch --nproc_per_node=1 /root/.pycharm_helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client localhost --port 60888 --file /code_src_debug/main.py  
Traceback (most recent call last):
  File "/root/.pycharm_helpers/pydev/pydevd.py", line 2016, in main
    setup = process_command_line(sys.argv)
  File "/root/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_command_line_handling.py", line 146, in process_command_line
    raise ValueError("Unexpected option: " + argv[i])
ValueError: Unexpected option: --local_rank=0
Usage:
	pydevd.py --port N [(--client hostname) | --server] --file executable [file_options]
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', '/root/.pycharm_helpers/pydev/pydevd.py', '--local_rank=0', '--multiprocess', '--qt-support=auto', '--client', 'localhost', '--port', '60888', '--file', '/code_src_debug/main.py  ']' returned non-zero exit status 1.

Process finished with exit code 1

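Analysis: in debug mode PyCharm issues a different command. The launcher is now handed pydevd.py as the script, so (as the CalledProcessError shows) it inserts --local_rank=0 right after the pydevd.py path, and pydevd rejects it as an unexpected option. The command was: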

/usr/bin/python -m torch.distributed.launch --nproc_per_node=1 /root/.pycharm_helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client localhost --port 60888 --file /code_src_debug/main.py  
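To make the failure concrete, the sketch below rebuilds the worker command in the same shape as the one printed in the CalledProcessError. It is an illustration based only on that log line, not the launcher's actual source; build_worker_cmd is a hypothetical name.

```python
import sys
from typing import List


def build_worker_cmd(script: str, script_args: List[str], local_rank: int) -> List[str]:
    """Mimic the worker command seen in the CalledProcessError."""
    # --local_rank lands right after the script path, ahead of every
    # argument that followed the script on the launcher's command line.
    return [sys.executable, "-u", script, "--local_rank=%d" % local_rank] + script_args


cmd = build_worker_cmd(
    "/root/.pycharm_helpers/pydev/pydevd.py",
    ["--multiprocess", "--qt-support=auto", "--client", "localhost",
     "--port", "60888", "--file", "/code_src_debug/main.py"],
    local_rank=0,
)
print(cmd[2:4])
```

Because --local_rank=0 appears before --file, pydevd parses it as one of its own options instead of passing it on to main.py. Everything after --file belongs to the debugged script, so the launcher has to be the file that pydevd runs, not the other way around.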

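After a lot of trial and error (the details are omitted here), it finally worked.

The debug command PyCharm prints is now as follows. pydevd.py comes first, and launch.py is the file being debugged (--file), with --nproc_per_node=1 main.py as its arguments: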

/usr/bin/python /root/.pycharm_helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client localhost --port 50009 --file /usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py --nproc_per_node=1 main.py 
Connected to pydev debugger (build 231.9225.15)
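The working command suggests a run configuration whose script is launch.py itself, with --nproc_per_node=1 main.py as parameters. The exact settings are not given above, so treat this as an inference from the printed command. To find the launch.py path on your own server rather than copying the Python 3.6 path shown, a sketch like this can be run with the remote interpreter; module_file is a hypothetical helper.

```python
import importlib.util
from typing import Optional


def module_file(name: str) -> Optional[str]:
    """Return the source file of an importable module, or None if it is missing."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (here: torch) is not installed.
        return None
    return spec.origin if spec is not None else None


print(module_file("torch.distributed.launch"))
```

Under that inference, Run -> Edit Configurations... would use the printed path as Script path, --nproc_per_node=1 main.py as Parameters, the remote project directory as Working directory, and an empty Interpreter options field.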

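Miscellaneous

Excluding files from sync in bulk

Entering patterns under Excluded Paths did not work in my tests; that field only accepts folders or individual files.
For example, the remote server kept generating large files such as core.3456. In that case, go to Tools -> Deployment -> Options... and append core* to "Exclude items by name":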

.svn;.cvs;.idea;.DS_Store;.git;.hg;*.hprof;*.pyc;core*
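Note that core* is a broad pattern. The sketch below previews which names the list would catch, assuming PyCharm matches each ';'-separated entry against the file or folder name with shell-style wildcards; that matching rule is an assumption, and is_excluded is a hypothetical helper.

```python
from fnmatch import fnmatchcase

EXCLUDE = ".svn;.cvs;.idea;.DS_Store;.git;.hg;*.hprof;*.pyc;core*"


def is_excluded(name: str, patterns: str = EXCLUDE) -> bool:
    """Return True if a file or folder name matches any ';'-separated pattern."""
    return any(fnmatchcase(name, p) for p in patterns.split(";") if p)


for name in ["core.3456", "main.py", "model.pyc", "core_utils.py"]:
    print(name, is_excluded(name))
```

Under that assumption a source file named core_utils.py would be excluded too. If your project has such files, a narrower pattern like core.* may be the safer choice.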

Debugging a torch.distributed.launch (DDP) project locally

python -m torch.distributed.launch --nproc_per_node=1 --master_port 12379 main.py
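The command above pins --master_port to 12379, presumably to avoid a clash with another job on the default port. If that port is also taken, a free one can be picked by asking the operating system. This is a minimal sketch; find_free_port is a hypothetical helper, and another process could still grab the port between the check and the launch.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


port = find_free_port()
print("python -m torch.distributed.launch --nproc_per_node=1 "
      "--master_port %d main.py" % port)
```

The printed line can be pasted into a terminal, or the port number can be used for --master_port in the PyCharm run configuration parameters.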

posted @ 2023-07-24 18:02  无左无右