
Notes on Using the Ascend 910

I. Compressing and Extracting Files

1. Compress a directory

tar -czvf UNITE-main.tar.gz ./UNITE-main/

2. Extract an archive

tar -xzvf UNITE-main.tar.gz
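The same two steps can also be scripted from Python with the standard-library tarfile module. This is a minimal sketch; the helper names compress_dir and extract_archive are illustrative and not part of the original notes.

```python
import tarfile
from pathlib import Path


def compress_dir(src_dir, archive_path):
    """Create a gzip-compressed tar archive, similar to `tar -czvf`."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the top-level directory name inside the archive
        tar.add(src, arcname=src.name)
    return Path(archive_path)


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive into dest_dir, similar to `tar -xzvf`."""
    # "r:*" lets tarfile detect the compression type on its own
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)
    return Path(dest_dir)
```

For example, compress_dir("./UNITE-main", "UNITE-main.tar.gz") followed by extract_archive("UNITE-main.tar.gz", "./restored") recreates the directory under ./restored.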

II. Changing CUDA Calls to NPU

data['label'] = data['label'].cuda()
data['instance'] = data['instance'].cuda()
data['image'] = data['image'].cuda()

becomes (torch_npu must be imported so that the .npu() method is available on tensors)

data['label'] = data['label'].npu()
data['instance'] = data['instance'].npu()
data['image'] = data['image'].npu()
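Rather than hard-coding .cuda() or .npu(), the device can be chosen once and passed to .to(). The sketch below assumes that torch_npu.npu.is_available() exists in the installed torch_npu version and that "npu:0" is an accepted device string; pick_device and move_batch are hypothetical helper names.

```python
def pick_device():
    """Return "npu:0" if an Ascend device is usable, else "cuda" or "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    try:
        import torch_npu  # importing registers the NPU backend with torch
        if torch_npu.npu.is_available():  # assumed to exist in this version
            return "npu:0"
    except ImportError:
        pass
    return "cuda" if torch.cuda.is_available() else "cpu"


def move_batch(data, device, keys=("label", "instance", "image")):
    """Move the listed entries of a batch dict to `device`."""
    for key in keys:
        data[key] = data[key].to(device)
    return data
```

With this, the three assignments above reduce to data = move_batch(data, pick_device()), and the same script runs unchanged on a CUDA machine.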

III. Configuring Environment Variables

1. Create env.sh

touch env.sh

2. Open env.sh

vi env.sh

3. Add the environment variables

# Configure the CANN environment variables
CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info'
if [ -f $CANN_INSTALL_PATH_CONF ]; then
    DEFAULT_CANN_INSTALL_PATH=$(cat $CANN_INSTALL_PATH_CONF | grep Install_Path | cut -d "=" -f 2)
else
    DEFAULT_CANN_INSTALL_PATH="/usr/local/Ascend/"
fi
CANN_INSTALL_PATH=${1:-${DEFAULT_CANN_INSTALL_PATH}}
if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then
    source ${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh
else
    source ${CANN_INSTALL_PATH}/nnae/set_env.sh
fi
# Add the dependency library paths
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/openblas/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib/aarch64_64-linux-gnu
# Custom environment variables
export HCCL_WHITELIST_DISABLE=1
# Logging
export ASCEND_SLOG_PRINT_TO_STDOUT=0 # print logs to stdout (optional)
export ASCEND_GLOBAL_LOG_LEVEL=3 # log level; common values: 1 = INFO, 3 = ERROR
export ASCEND_GLOBAL_EVENT_ENABLE=0 # event logging is disabled by default

Then save and quit by typing

:wq!

4. Load the environment

source env.sh
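To confirm from Python that the settings took effect, the exported variables can be read back, and the Install_Path lookup can be reproduced. This is a rough sketch; resolve_cann_path and report_env are illustrative names, and the variable list simply mirrors the exports in env.sh.

```python
import os


def resolve_cann_path(info_text=None, default="/usr/local/Ascend/"):
    """Mimic `grep Install_Path | cut -d "=" -f 2` on the install-info text.

    Unlike the shell version, this also falls back to `default` when the
    text contains no Install_Path line.
    """
    if info_text:
        for line in info_text.splitlines():
            if "Install_Path" in line and "=" in line:
                return line.split("=")[1].strip()
    return default


def report_env(env=None):
    """Return the values of the variables env.sh exports (None if unset)."""
    env = os.environ if env is None else env
    names = (
        "HCCL_WHITELIST_DISABLE",
        "ASCEND_SLOG_PRINT_TO_STDOUT",
        "ASCEND_GLOBAL_LOG_LEVEL",
        "ASCEND_GLOBAL_EVENT_ENABLE",
    )
    return {name: env.get(name) for name in names}
```

Calling report_env() in a Python process started from the shell where env.sh was sourced should show no None values.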

IV. RuntimeError: ACL stream synchronize failed, error code:507018

E39999: Inner Error, Please contact support engineer!
E39999  Aicpu kernel execute failed, device_id=0, stream_id=0, task_id=6394, fault op_name=ScatterElements[FUNC:GetError][FILE:stream.cc][LINE:1044]
        TraceBack (most recent call last):
        rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
        synchronize stream failed, runtime result = 507018[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
DEVICE[0] PID[41411]: 
EXCEPTION TASK:
  Exception info:TGID=2593324, model id=65535, stream id=0, stream phase=SCHEDULE, task id=742, task type=aicpu kernel, recently received task id=742, recently send task id=741, task phase=RUN
  Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
    Other info[0]:time=2023-10-12-11:22:01.273.951, function=proc_aicpu_task_done, line=972, error code=0x2a
EXCEPTION TASK:
  Exception info:TGID=2593324, model id=65535, stream id=0, stream phase=3, task id=6394, task type=aicpu kernel, recently received task id=6406, recently send task id=6393, task phase=RUN
  Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
    Other info[0]:time=2023-10-12-11:41:20.661.958, function=proc_aicpu_task_done, line=972, error code=0x2a
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/home/ma-user/work/SPADE-master/trainers/pix2pix_trainer.py", line 35, in run_generator_one_step
    g_losses, generated = self.pix2pix_model(data, mode='generator')
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs, **kwargs)
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ma-user/work/SPADE-master/models/pix2pix_model.py", line 43, in forward
    input_semantics, real_image = self.preprocess_input(data)
  File "/home/ma-user/work/SPADE-master/models/pix2pix_model.py", line 113, in preprocess_input
    data['label'] = data['label'].npu()
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch_npu/utils/device_guard.py", line 38, in wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch_npu/utils/tensor_methods.py", line 66, in _npu
    return torch_npu._C.npu(self, *args, **kwargs)
RuntimeError: ACL stream synchronize failed, error code:507018
THPModule_npu_shutdown success.

My guess is that this happens because mixed precision was not enabled.
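The Python traceback points at the .npu() call, but the device log names ScatterElements as the faulting operator. Since the failure is reported when the stream is synchronized, it was likely raised by an earlier scatter operation rather than by the copy itself. One cause worth ruling out (an assumption, not something these notes confirm) is a label value that falls outside the valid index range. A minimal pure-Python check, with find_bad_indices as a hypothetical helper name:

```python
def find_bad_indices(indices, num_classes):
    """Return the distinct index values outside [0, num_classes)."""
    bad = set()
    for value in indices:
        value = int(value)
        if not 0 <= value < num_classes:
            bad.add(value)
    return sorted(bad)
```

With a label tensor still on the CPU, find_bad_indices(data['label'].flatten().tolist(), num_classes) returns an empty list when every index is valid; num_classes here stands for the size of the dimension being scattered into.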

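V. Enabling Mixed Precision

1. Before building the network, import the AMP module from torch_npu

import time
import torch
import torch.nn as nn
import torch_npu
from torch.utils.data import DataLoader    # needed for the DataLoader below
from torch_npu.npu import amp    # import the AMP module

2. After defining the model and the optimizer, create the AMP GradScaler (CNN, device, train_data, batch_size, and epochs are assumed to be defined elsewhere)

model = CNN().to(device)
train_dataloader = DataLoader(train_data, batch_size=batch_size)    # define the DataLoader
loss_func = nn.CrossEntropyLoss().to(device)    # define the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # define the optimizer
scaler = amp.GradScaler()    # create the GradScaler after the model and optimizer

3. Add the AMP calls to the training loop to enable AMP

for epo in range(epochs):
    for imgs, labels in train_dataloader:
        imgs = imgs.to(device)
        labels = labels.to(device)
        with amp.autocast():
            outputs = model(imgs)    # forward pass
            loss = loss_func(outputs, labels)    # compute the loss
        optimizer.zero_grad()
        # scale the loss around backpropagation, then update the parameters
        scaler.scale(loss).backward()    # scale the loss and backpropagate
        scaler.step(optimizer)    # update the parameters (gradients are unscaled automatically)
        scaler.update()    # update the loss-scaling factor using dynamic loss scaling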
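To make the role of the scaler more concrete, here is a toy pure-Python model of dynamic loss scaling. It is not the torch_npu implementation: ToyLossScaler is a hypothetical class, it folds the step and update calls into one method, and its default values are modeled on torch.cuda.amp.GradScaler, so the torch_npu defaults may differ.

```python
import math


class ToyLossScaler:
    """Toy model of dynamic loss scaling (not the torch_npu class)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step is skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip this update and shrink the scale
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled
```

The idea is that multiplying the loss by a large factor keeps small float16 gradients from underflowing, while the backoff on inf or NaN keeps the factor from staying too large.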

VI. Unknown Error

E39999: Inner Error, Please contact support engineer!
E39999  An exception occurred during AICPU execution, stream_id:78, task_id:742, errcode:21008, msg:inner error[FUNC:ProcessAicpuErrorInfo][FILE:device_error_proc.cc][LINE:673]
        TraceBack (most recent call last):
        Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:task.cc][LINE:1068]
        Aicpu kernel execute failed, device_id=0, stream_id=78, task_id=742.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:774]
        Aicpu kernel execute failed, device_id=0, stream_id=78, task_id=742, fault op_name=ScatterElements[FUNC:GetError][FILE:stream.cc][LINE:1044]
        rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
        op[Minimum], The Minimum op dtype is not same, type1:DT_FLOAT16, type2:DT_FLOAT[FUNC:CheckTwoInputDtypeSame][FILE:util.cc][LINE:116]
        Verifying Minimum failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:135]
        Call InferShapeAndType for node:Minimum(Minimum) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:117]
        process pass InferShapePass on node:Minimum failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:530]
        build graph failed, graph id:894, ret:1343242270[FUNC:BuildModel][FILE:ge_generator.cc][LINE:1484]
        [Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 1343242270[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        [Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
        build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
DEVICE[0] PID[189368]: 
EXCEPTION TASK:
  Exception info:TGID=3114744, model id=65535, stream id=78, stream phase=SCHEDULE, task id=742, task type=aicpu kernel, recently received task id=742, recently send task id=741, task phase=RUN
  Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
    Other info[0]:time=2023-10-12-12:12:22.763.259, function=proc_aicpu_task_done, line=972, error code=0x2a
EXCEPTION TASK:
  Exception info:TGID=3114744, model id=65535, stream id=78, stream phase=3, task id=4347, task type=aicpu kernel, recently received task id=4354, recently send task id=4346, task phase=RUN
  Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
    Other info[0]:time=2023-10-12-12:13:57.997.757, function=proc_aicpu_task_done, line=972, error code=0x2a
Aborted (core dumped)
(py38) [ma-user SPADE-master]$
Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 268, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 91 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
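Logs like the one above are long, and the informative parts are few: the faulting operator (ScatterElements again), the dtype mismatch reported for Minimum (DT_FLOAT16 versus DT_FLOAT), and the error codes. A small regex-based sketch for pulling those fields out of a pasted log follows; summarize_ascend_log is a hypothetical helper, and the patterns are fitted only to the messages shown in these notes.

```python
import re


def summarize_ascend_log(text):
    """Extract fault operators, dtype mismatches, and error codes."""
    fault_ops = re.findall(r"fault op_name=(\w+)", text)
    mismatches = re.findall(
        r"op\[(\w+)\], The \w+ op dtype is not same, "
        r"type1:(\w+), type2:(\w+)",
        text,
    )
    error_codes = re.findall(r"error code[:=]\s*(\w+)", text)
    return {
        "fault_ops": sorted(set(fault_ops)),
        "dtype_mismatches": sorted(set(mismatches)),
        "error_codes": sorted(set(error_codes)),
    }
```

The Minimum mismatch suggests that, with autocast on, one operand reached the operator as float16 and the other as float32; whether that is the root cause is not established here.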

Reference 1
Reference 2: Ascend official website
