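A Case Study on the Cix P1 Chip: The Value of GPU (Vulkan) Acceleration for Deep Learning Inference on OpenHarmony
My colleagues and I have recently been looking into what Vulkan can do on OpenHarmony. We used benchncnn, the benchmark tool from ncnn, to compare CPU and GPU inference. The results and our analysis follow:
Test environment
Radxa Orion O6 development board + AMD RX 580 graphics card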
- https://docs.radxa.com/orion/o6/getting-started/introduction
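- The Radxa Orion O6 is a professional-grade Mini-ITX motherboard for AI computing and multimedia applications. It is built around the Cix P1 SoC from Cix Technology (model CD8180), supports up to 64GB of LPDDR5 memory, and delivers server-class performance in a compact form factor. The Orion O6 offers rich I/O, including four display outputs, dual 5GbE networking, and PCIe Gen4 expansion, which makes it well suited to AI development workstations, edge computing nodes, and high-performance personal computing.
- Most importantly for this test, it has a full-size PCIe x16 slot wired with 8 PCIe Gen4 lanes, so a discrete graphics card can be installed. We installed an RX 580.
OpenHarmony version: 5.0.0
- Vulkan version: 1.4.313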
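What is Vulkan
See https://blog.csdn.net/Interview_TC/article/details/149866464
At present the main graphics API used on OpenHarmony is OpenGL ES. Take the RK3568 as an example: it uses a Mali-series GPU, and if you need a Vulkan driver for it, we suggest the open-source implementation provided by Mesa 3D.
benchncnn CPU-only inference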
# ./benchncnn 10 1 0 -1 0
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = -1
cooling_down = 0
          squeezenet  min =   14.58  max =   14.76  avg =   14.68
     squeezenet_int8  min =    9.62  max =   10.14  avg =    9.94
           mobilenet  min =   26.55  max =   26.87  avg =   26.75
      mobilenet_int8  min =   12.81  max =   13.14  avg =   13.01
        mobilenet_v2  min =   16.31  max =   16.45  avg =   16.38
        mobilenet_v3  min =   12.52  max =   12.80  avg =   12.67
          shufflenet  min =    8.39  max =    8.64  avg =    8.53
       shufflenet_v2  min =    9.21  max =    9.43  avg =    9.33
             mnasnet  min =   16.63  max =   16.81  avg =   16.75
     proxylessnasnet  min =   18.66  max =   19.09  avg =   18.75
     efficientnet_b0  min =   26.62  max =   26.92  avg =   26.79
   efficientnetv2_b0  min =   31.94  max =   32.17  avg =   32.03
        regnety_400m  min =   20.36  max =   21.84  avg =   20.61
           blazeface  min =    2.42  max =    2.65  avg =    2.50
           googlenet  min =   51.36  max =   51.76  avg =   51.52
      googlenet_int8  min =   37.29  max =   37.65  avg =   37.42
            resnet18  min =   38.96  max =   40.10  avg =   39.12
       resnet18_int8  min =   31.86  max =   32.07  avg =   31.97
             alexnet  min =   33.00  max =   34.18  avg =   33.21
               vgg16  min =  193.38  max =  195.13  avg =  193.95
          vgg16_int8  min =  238.43  max =  241.54  avg =  239.21
            resnet50  min =  117.60  max =  119.09  avg =  117.81
       resnet50_int8  min =   66.19  max =   66.59  avg =   66.31
      squeezenet_ssd  min =   30.69  max =   31.07  avg =   30.82
 squeezenet_ssd_int8  min =   29.76  max =   30.41  avg =   29.99
       mobilenet_ssd  min =   53.97  max =   55.14  avg =   54.19
  mobilenet_ssd_int8  min =   25.82  max =   26.22  avg =   26.07
      mobilenet_yolo  min =  119.46  max =  120.76  avg =  119.71
  mobilenetv2_yolov3  min =   59.16  max =   59.45  avg =   59.31
         yolov4-tiny  min =   69.37  max =   69.81  avg =   69.54
           nanodet_m  min =   20.09  max =   20.32  avg =   20.14
    yolo-fastest-1.1  min =    7.96  max =    8.18  avg =    8.06
      yolo-fastestv2  min =    6.75  max =    6.95  avg =    6.85
  vision_transformer  min = 1068.13  max = 1069.96  avg = 1069.13
          FastestDet  min =    8.41  max =    8.66  avg =    8.57
benchncnn Vulkan inference
# ./benchncnn 10 1 0 0 0
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] fp16-p/s/u/a=1/1/1/0 int8-p/s/u/a=1/1/1/1
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] subgroup=64(64~64) ops=1/1/1/1/1/1/1/1/0/0
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] fp16-cm=0 int8-cm=0 bf16-cm=0 fp8-cm=0
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = 0
cooling_down = 0
          squeezenet  min =    2.26  max =    2.38  avg =    2.33
     squeezenet_int8  min =    9.65  max =    9.86  avg =    9.77
           mobilenet  min =    2.62  max =    2.77  avg =    2.68
      mobilenet_int8  min =   12.39  max =   12.47  avg =   12.42
        mobilenet_v2  min =    3.61  max =    3.84  avg =    3.72
        mobilenet_v3  min =    3.62  max =    3.79  avg =    3.70
          shufflenet  min =    1.94  max =    2.17  avg =    2.02
       shufflenet_v2  min =    2.74  max =    2.93  avg =    2.84
             mnasnet  min =    3.86  max =    4.22  avg =    3.99
     proxylessnasnet  min =    3.73  max =    3.96  avg =    3.86
     efficientnet_b0  min =    5.84  max =    6.68  avg =    6.15
   efficientnetv2_b0  min =   20.50  max =   22.40  avg =   21.91
        regnety_400m  min =    4.44  max =    5.01  avg =    4.68
           blazeface  min =    1.30  max =    1.67  avg =    1.48
           googlenet  min =    9.27  max =    9.81  avg =    9.51
      googlenet_int8  min =   38.14  max =   38.52  avg =   38.26
            resnet18  min =    4.48  max =    4.87  avg =    4.65
       resnet18_int8  min =   31.54  max =   32.87  avg =   31.81
             alexnet  min =    3.15  max =    3.53  avg =    3.26
               vgg16  min =   11.14  max =   11.54  avg =   11.46
          vgg16_int8  min =  238.41  max =  239.90  avg =  238.76
            resnet50  min =   10.11  max =   11.20  avg =   10.64
       resnet50_int8  min =   66.91  max =   67.03  avg =   66.95
      squeezenet_ssd  min =    7.90  max =    8.85  avg =    8.42
 squeezenet_ssd_int8  min =   28.88  max =   29.10  avg =   28.98
       mobilenet_ssd  min =    6.98  max =    8.32  avg =    7.37
  mobilenet_ssd_int8  min =   24.77  max =   24.98  avg =   24.88
      mobilenet_yolo  min =    6.73  max =    7.86  avg =    7.46
  mobilenetv2_yolov3  min =    9.55  max =   10.76  avg =   10.20
         yolov4-tiny  min =   11.57  max =   12.67  avg =   12.28
           nanodet_m  min =    5.73  max =    7.84  avg =    6.24
    yolo-fastest-1.1  min =    3.44  max =    3.85  avg =    3.58
      yolo-fastestv2  min =    3.90  max =    5.13  avg =    4.14
  vision_transformer  min =   77.86  max =   78.55  avg =   78.29
          FastestDet  min =    2.85  max =    3.24  avg =    2.97
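Conclusions
- Prefer the GPU (Vulkan): for most FP32 models, and especially medium and large ones (resnet, vgg, the yolo family, vision_transformer), enabling the GPU brings roughly an order-of-magnitude speedup. For example, the average time for resnet50 drops from 117.81 ms to 10.64 ms, and for vgg16 from 193.95 ms to 11.46 ms. In our test this was the most effective way to raise inference speed.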
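To put a number on the speedup for every model rather than a few hand-picked ones, the two benchncnn logs can be parsed and divided. The following is a minimal sketch, not part of ncnn: the function names `parse_benchncnn` and `speedups` are our own, and the regular expression assumes one result per line in the `name min = … max = … avg = …` layout shown above. Only a few lines of each log are embedded to keep the example short.

```python
import re

# One benchncnn result line is assumed to look like:
#   "  resnet50  min =  117.60  max =  119.09  avg =  117.81"
RESULT_RE = re.compile(
    r"^\s*(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)\s*$"
)


def parse_benchncnn(text):
    """Return {model_name: avg_ms}; lines that are not results are skipped."""
    results = {}
    for line in text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            results[match.group(1)] = float(match.group(4))
    return results


def speedups(cpu, gpu):
    """Return {model: cpu_avg / gpu_avg} for models present in both runs."""
    return {name: cpu[name] / gpu[name] for name in cpu if name in gpu}


# Excerpts copied from the two runs above (times in ms).
CPU_LOG = """\
num_threads = 1
squeezenet  min =   14.58  max =   14.76  avg =   14.68
blazeface  min =    2.42  max =    2.65  avg =    2.50
vgg16  min =  193.38  max =  195.13  avg =  193.95
resnet50  min =  117.60  max =  119.09  avg =  117.81
"""
GPU_LOG = """\
num_threads = 1
squeezenet  min =    2.26  max =    2.38  avg =    2.33
blazeface  min =    1.30  max =    1.67  avg =    1.48
vgg16  min =   11.14  max =   11.54  avg =   11.46
resnet50  min =   10.11  max =   11.20  avg =   10.64
"""

if __name__ == "__main__":
    ratios = speedups(parse_benchncnn(CPU_LOG), parse_benchncnn(GPU_LOG))
    # Largest speedup first.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:>12}  {ratio:5.1f}x")
```

Pasting the full logs into `CPU_LOG` and `GPU_LOG` gives the ratio for all 35 models, which makes it easy to see where the gain is large (vgg16, resnet50) and where it is small (blazeface).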
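When the CPU is the better fit:
- Running very small models (such as blazeface, 2.50 ms on the CPU vs 1.48 ms on the GPU). Here GPU launch overhead makes up a large share of the run time, so the advantage is small.
- When the system has no compatible GPU, or no Vulkan driver is available.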
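The first time a model runs on the GPU there may be a shader compilation cost, which makes the first inference noticeably slower. The min time in the benchmark usually reflects the best performance after warm-up.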
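The same warm-up effect can be kept out of your own measurements by discarding the first few runs before timing. The sketch below is an illustration under our own assumptions, not benchncnn's code: `bench` is a hypothetical helper, `fn` stands for whatever callable performs one inference, and `FakeModel` only simulates a one-time setup cost with `time.sleep`.

```python
import time


def bench(fn, loops=10, warmup=3, clock=time.perf_counter):
    """Time fn() `loops` times after `warmup` untimed calls; results in ms."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as shader compilation
    samples = []
    for _ in range(loops):
        start = clock()
        fn()
        samples.append((clock() - start) * 1000.0)
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": sum(samples) / len(samples),
    }


class FakeModel:
    """Stand-in for a GPU model: the first call pays a one-time setup cost."""

    def __init__(self):
        self.ready = False

    def __call__(self):
        if not self.ready:
            time.sleep(0.05)  # pretend first-run setup
            self.ready = True
        time.sleep(0.001)  # pretend steady-state inference


if __name__ == "__main__":
    print("with warm-up:   ", bench(FakeModel(), loops=10, warmup=1))
    print("without warm-up:", bench(FakeModel(), loops=10, warmup=0))
```

Without the warm-up call, the simulated setup cost lands in the first timed sample and inflates both max and avg, while min stays close to the steady-state value. That is why min is the more robust number to compare.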
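INT8 quantization: on the CPU, INT8 models are usually considerably faster than their FP32 counterparts (e.g. mobilenet_int8 at 13.01 ms vs mobilenet at 26.75 ms), although vgg16_int8 (239.21 ms) is actually slower than vgg16 (193.95 ms). With Vulkan on the RX 580, however, the INT8 models gain almost nothing: their times stay close to the CPU numbers (mobilenet_int8 at 12.42 ms), so on the GPU they end up much slower than the FP32 models (mobilenet at 2.68 ms). This pattern suggests that the INT8 paths are not being accelerated by the GPU in this setup.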
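The INT8 observation is easier to see as a ratio per backend. The snippet below is a small worked calculation using avg values copied from the two runs above. The helper name `int8_gain` and the table layout are our own choices for illustration.

```python
# avg latency in ms from the benchncnn runs above: model -> (fp32, int8)
CPU = {
    "squeezenet": (14.68, 9.94),
    "mobilenet": (26.75, 13.01),
    "resnet50": (117.81, 66.31),
    "vgg16": (193.95, 239.21),
}
GPU = {
    "squeezenet": (2.33, 9.77),
    "mobilenet": (2.68, 12.42),
    "resnet50": (10.64, 66.95),
    "vgg16": (11.46, 238.76),
}


def int8_gain(table):
    """Return {model: fp32_avg / int8_avg}; a value above 1 means INT8 is faster."""
    return {name: fp32 / int8 for name, (fp32, int8) in table.items()}


if __name__ == "__main__":
    cpu_gain = int8_gain(CPU)
    gpu_gain = int8_gain(GPU)
    print(f"{'model':>12}  {'CPU':>6}  {'GPU':>6}")
    for name in CPU:
        print(f"{name:>12}  {cpu_gain[name]:6.2f}  {gpu_gain[name]:6.2f}")
```

For these four pairs the GPU column is below 1 throughout, meaning the FP32 model is the faster choice whenever Vulkan is enabled, while on the CPU only vgg16 falls below 1.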