Large model debugging notes
Environment: Linux, CUDA 11.7
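1. RuntimeError: Distributed package doesn't have NCCL built in
Cause: the installed PyTorch is the CPU-only build. Install a GPU (CUDA-enabled) build instead, matching the CUDA version of the environment (11.7 here).
Reference: "RuntimeError: Distributed package doesn't have NCCL built in", reply #3 by bdabykov, distributed category, PyTorch Forums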
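To confirm the cause of the NCCL error before reinstalling, a small check like the one below can help. It is only a sketch: `describe_torch_build` and `pick_backend` are hypothetical helper names introduced here, and falling back to `gloo` is an assumed stopgap for CPU-only runs, not the fix described above (the fix is still installing a CUDA build).

```python
import importlib.util


def describe_torch_build():
    """Return a dict describing the installed PyTorch build (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.distributed as dist

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.cuda is None on a CPU-only build
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # NCCL is only compiled into CUDA-enabled builds
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


def pick_backend(info):
    """Assumed stopgap: use NCCL when present, otherwise fall back to gloo."""
    return "nccl" if info.get("nccl_available") else "gloo"


if __name__ == "__main__":
    info = describe_torch_build()
    for key, value in info.items():
        print(f"{key}: {value}")
    print("backend:", pick_backend(info))
```

If `cuda_build` prints `None`, the CPU-only build is installed, which matches the cause noted above.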
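2. NotImplementedError: Cannot copy out of meta tensor; no data!
Cause: not enough GPU memory. When the model does not fit, some weights are typically left as meta tensors (placeholders with a shape but no data), and copying them to a device fails with this error.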
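A rough way to check whether GPU memory is the problem is to compare the size of the weights with the available memory. The sketch below makes assumptions: `weights_gib` and `weights_fit` are hypothetical helpers, the 7B parameter count and the 24 GiB card are example figures only, and the estimate covers weights alone (activations, KV cache, and optimizer state need extra memory).

```python
GIB = 2 ** 30


def weights_gib(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def weights_fit(num_params, bytes_per_param, free_bytes):
    """True if the weights alone fit in the given free memory (no headroom)."""
    return num_params * bytes_per_param <= free_bytes


if __name__ == "__main__":
    # Example figures only: a 7B-parameter model in fp16 (2 bytes per parameter)
    params, width = 7_000_000_000, 2
    print(f"weights need about {weights_gib(params, width):.2f} GiB")

    # Assumed example: a card with 24 GiB free
    print("fits in 24 GiB:", weights_fit(params, width, 24 * GIB))
```

On a machine with a CUDA build of PyTorch, `torch.cuda.mem_get_info()` returns the free and total bytes for the current device, which can be passed in as `free_bytes`.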