
torch.distributed.launch, torchrun, and torch.distributed.run are incompatible with nohup

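Symptom:

After starting PyTorch distributed training with nohup, the SSH connection to the server dropped and the training run failed with the following error: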

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971878 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971879 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3971841 got signal: 1

The command that was run:

nohup ./my_train.sh   >log.log 2>&1   &

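The likely cause is that torch.distributed.launch, torchrun, and torch.distributed.run are incompatible with nohup. When the SSH connection drops or the terminal window is closed, the process receives SIGHUP (signal 1 in the log above). torch.distributed handles that signal itself and shuts down the workers, so nohup has no effect.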
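The mechanism can be reproduced without PyTorch. The sketch below is an illustration, not the launcher's actual code: it assumes nohup protects a command by marking SIGHUP as ignored before starting it, and that the launcher then registers its own SIGHUP handler, as the `_terminate_process_handler` frame in the traceback suggests. `SignalException` and `_terminate_handler` here are local stand-ins, and `simulate_nohup_then_launcher` is a hypothetical helper name. It is POSIX-only and must run in the main thread.

```python
import os
import signal
import time


class SignalException(Exception):
    """Local stand-in for the exception type named in the traceback."""


def _terminate_handler(signum, frame):
    # Mimics the behaviour visible in the traceback: raise on a death signal.
    raise SignalException(f"Process {os.getpid()} got signal: {signum}")


def simulate_nohup_then_launcher():
    """Send SIGHUP to this process twice and record what happened each time."""
    outcomes = []
    previous = signal.getsignal(signal.SIGHUP)
    try:
        # 1) Assumed nohup behaviour: SIGHUP is ignored, so nothing happens.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
        os.kill(os.getpid(), signal.SIGHUP)
        time.sleep(0.05)
        outcomes.append("ignored")

        # 2) Installing a handler afterwards replaces the "ignore" disposition.
        signal.signal(signal.SIGHUP, _terminate_handler)
        try:
            os.kill(os.getpid(), signal.SIGHUP)
            time.sleep(0.5)  # give the interpreter time to run the handler
            outcomes.append("no exception")
        except SignalException:
            outcomes.append("raised")
    finally:
        # Restore whatever disposition was active before the experiment.
        signal.signal(
            signal.SIGHUP,
            previous if previous is not None else signal.SIG_DFL,
        )
    return outcomes


if __name__ == "__main__":
    print(simulate_nohup_then_launcher())  # ['ignored', 'raised']
```

The second SIGHUP raises even though the first one was ignored, which matches the symptom: whichever disposition is set last wins, so the protection nohup set up is gone once a handler is registered.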

ref: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/6
