
TensorRT-LLM Advanced Usage

--multi_block_mode

In the decoding (generation) phase, each step produces one new token.

Normally, the attention work is split evenly across all SMs by batch sample and by attention head.

When batch_size * num_heads is small compared to the number of SMs, some SMs sit idle. With --multi_block_mode enabled, the input context (the KV sequence) is further partitioned, so the work that a single SM used to handle is spread across multiple SMs, keeping all SMs busy in parallel; see the sketch after the quotes below.

Additional evidence:

"we only use multi-block in generation phase (generating new token). In context phase, we have enough blocks to run in parallel and we don't need to use multi-block."
"take H100-SXM as an example, you have 132 SMs, and let us say the batch size is 1, num heads is 16, then normally we can split the sequence into (132/16 = 8) blocks to fully utilize all SMs, but if the sequence length is quite small like 1K, it might not worth 8 blocks per sequence (maybe fewer)."

Both the Meta (LLaMA) checkpoint format and the Hugging Face (HF) format are supported.

For the Meta LLaMA format, use --meta_ckpt_dir:

# Build LLaMA v3 70B TP=8 using Meta checkpoints directly.
python convert_checkpoint.py --meta_ckpt_dir ./tmp/llama/70B/ \
                             --output_dir ./tllm_checkpoint_8gpu_tp8 \
                             --dtype float16 \
                             --tp_size 8

For the Hugging Face format, use --model_dir:

# Build LLaMA v3 70B using 4-way tensor parallelism and 2-way pipeline parallelism.
python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                             --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
                             --dtype float16 \
                             --tp_size 4 \
                             --pp_size 2

Analyzing GPU memory usage during inference

Total memory = (Model size + KV cache size + Activation memory) / Parallelism

where

  • The model size is the number of parameters * the size of data type.
  • The KV cache size is the total number of tokens * the size of KV cache data type * the number of layers * the KV hidden dimension
  • The activation memory is determined by TRT engine, which can be a few GBs regardless of the degree of parallelism used
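Putting the breakdown above into a small helper (a sketch only; the function and parameter names and the decimal-GB convention of 1e9 bytes are assumptions, not TensorRT-LLM APIs). Activation memory is kept as a flat per-GPU term, matching the last bullet:

# Rough per-GPU memory estimate following the breakdown above.
# All sizes in decimal GB (1e9 bytes); parallelism = tp_size * pp_size.
def estimate_memory_per_gpu_gb(num_params, weight_bytes,
                               total_tokens, kv_cache_bytes,
                               num_layers, kv_hidden_dim,
                               activation_gb, parallelism):
    model_gb = num_params * weight_bytes / 1e9
    kv_cache_gb = total_tokens * kv_cache_bytes * num_layers * kv_hidden_dim / 1e9
    # Weights and KV cache are sharded across GPUs; the activation/scratch
    # memory allocated by the TRT engine is roughly a flat per-GPU cost.
    return (model_gb + kv_cache_gb) / parallelism + activation_gb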

For LLaMA v2 70B with FP16 weights and FP8 KV cache, the model size is 70B parameters * 2 bytes = 140GB. The KV cache size is 32K tokens * 1 byte * 80 layers * 2048 KV hidden dimension = ~5GB per 32K tokens. That gives about 145GB spread across 8 GPUs, i.e. ~18GB per GPU, plus a few GB of flat scratch/activation memory allocated by the TRT engine and the TRT-LLM runtime.

Note that the KV hidden dimension is the number of KV heads times the hidden dimension of each head. LLaMA v2 70B has a hidden dimension of 8192 and uses grouped-query attention, where 8 key heads and 8 value heads are associated with 64 query heads. Each head has a hidden dimension of 8192/64 = 128, so the total KV hidden dimension is 128 * 8 * 2 = 2048 (the factor of 2 accounts for K and V).

The total number of tokens is determined by beam width, batch size, and maximum sequence length.
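Reproducing that arithmetic in a short snippet (decimal GB, 1e9 bytes; the 32K token budget is taken directly from the example above rather than derived from a particular beam width, batch size, and maximum sequence length):

# LLaMA v2 70B, FP16 weights + FP8 KV cache, split across 8 GPUs.
num_params     = 70e9
weight_bytes   = 2                 # FP16
kv_cache_bytes = 1                 # FP8
num_layers     = 80
head_dim       = 8192 // 64        # hidden size / number of query heads = 128
kv_hidden_dim  = 8 * head_dim * 2  # 8 KV heads, *2 for K and V -> 2048
total_tokens   = 32 * 1024         # token budget from the example
parallelism    = 8

model_gb    = num_params * weight_bytes / 1e9                                   # 140 GB
kv_cache_gb = total_tokens * kv_cache_bytes * num_layers * kv_hidden_dim / 1e9  # ~5.4 GB
per_gpu_gb  = (model_gb + kv_cache_gb) / parallelism                            # ~18 GB
print(f"{model_gb:.0f} GB weights, {kv_cache_gb:.1f} GB KV cache, {per_gpu_gb:.1f} GB per GPU")
# Activation/scratch memory (a few GB per GPU, allocated by the TRT engine) comes on top.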
