
vLLM (4) - LLMEngine Part 1

Series

vLLM (1) - Qwen2 Inference & Deployment
vLLM (2) - Architecture Overview
vLLM (3) - Sequence & SequenceGroup
vLLM (4) - LLMEngine Part 1
vLLM (5) - LLMEngine Part 2


Contents

  • Series
  • Preface
  • 1. Class Diagram
  • 2. LLM
  • 3. LLMEngine
  • 4. GPUExecutor
  • 5. Worker
  • 6. ModelRunner
  • 7. CacheEngine
  • Summary


Preface

With the groundwork laid in the previous two posts, we finally get to dissect LLMEngine. As the figure below shows, LLMEngine consists of two main parts. The right-hand side contains important classes such as Worker, CacheEngine, and ModelRunner, which are used during LLMEngine initialization for work such as model loading and KV cache initialization; that is the focus of this post. The left-hand side contains Scheduler and BlockSpaceManager, which schedule user requests and manage GPU and CPU memory along the way; that part belongs to LLMEngine's generate phase and will be covered in later posts.
(Figure: the two halves of LLMEngine as described above; original image not reproduced here.)


1. Class Diagram

This post focuses on the initialization of LLMEngine. Because the call chain is fairly involved, I use a class diagram to show how the different classes relate to each other. The diagram only lists the attributes and methods relevant to this post, so that unrelated members do not get in the way. It is best read alongside the code in the following sections.

# Class diagram
+-------------------------+
|           LLM           |
+-------------------------+
| + llm_engine: LLMEngine |
+-------------------------+
            |
            v
+-------------------------+
|        LLMEngine        |
+-------------------------+
| + model_executor: GPUExecutor |  # executor; the name is a bit ambiguous, since the project also has a subdirectory called model_executor
| - _initialize_kv_caches()     |  # initialize the kv_caches
| + scheduler: Scheduler        |  # scheduler
| + output_processor            |  # output processor
+-------------------------+
            |
            v
+-------------------------+
|       GPUExecutor       |
+-------------------------+
| - _init_executor()      |        # initialize the executor
| + driver_worker: Worker |        # worker
|                         |
| + determine_num_available_blocks(): Tuple[int, int] |   # determine the available gpu blocks and cpu blocks
| + initialize_cache()    |        # initialize the cache: reserve the kv_cache memory with all-zero tensors
+-------------------------+
            |
            v
+-------------------------+
|          Worker         |
+-------------------------+
| + model_runner: ModelRunner |    # loads and runs the model
| + cache_engine: CacheEngine |    # initializes and updates the kv_cache
| + init_device()         |        # initialize the device (gpu)
| + load_model()          |        # load the model
+-------------------------+
            |                              |
            v                              v
+-------------------------+    +-------------------------+
|       ModelRunner       |    |       CacheEngine       |
+-------------------------+    +-------------------------+
| + load_model()          |    | + gpu_cache             |
| + profile_run()         |    | - _allocate_kv_cache(): List[torch.Tensor] |
| + capture_model()       |    | + get_cache_block_size(...): int           |
+-------------------------+    +-------------------------+

2. LLM

LLM is a class that generates text with a specified large language model, given prompts and sampling parameters. Its core component is self.llm_engine (an instance of LLMEngine), which does most of LLM's actual work.
Example usage of LLM is shown below. 1) Construct an LLM instance; its initialization creates llm_engine: LLMEngine (the focus of this post). 2) Handle requests with the generate() method, which schedules resources, serves user requests efficiently, and produces the output text (covered in later posts).

# See the Qwen2 inference post in this series for the full example
from vllm import LLM

llm = LLM(model=DEFAULT_CKPT_PATH)             # DEFAULT_CKPT_PATH is the model name or a local download directory
outputs = llm.generate(text, sampling_params)  # text is the input text, sampling_params are the sampling parameters
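The snippet above assumes sampling_params already exists; here is a minimal sketch of how it is typically constructed. The specific sampling values are illustrative, not taken from this series.

# Illustrative sketch: build SamplingParams before calling generate();
# the values below are examples only.
from vllm import LLM, SamplingParams

llm = LLM(model=DEFAULT_CKPT_PATH)               # as above; DEFAULT_CKPT_PATH is assumed to be defined
sampling_params = SamplingParams(
    temperature=0.7,                             # softmax temperature
    top_p=0.8,                                   # nucleus sampling threshold
    max_tokens=512,                              # cap on generated tokens per request
)
outputs = llm.generate(["Give me a short introduction to vLLM."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)                # each RequestOutput holds its generated completions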

3. LLMEngine

LLMEngine has two main parts: 1) model_executor and 2) scheduler. model_executor handles everything related to the model itself, such as device selection and model loading, while scheduler handles resource scheduling and is used heavily during inference.
Let's walk through the code to see what LLMEngine does during initialization:

  1. Create model_executor: the model executor is built from model_config and a series of other configs. For practitioners on a budget, vLLM typically runs on a single GPU, in which case model_executor is a GPUExecutor; if your hardware is Neuron or TPU, the corresponding executor is NeuronExecutor or TPUExecutor. Configs such as model_config are simply the input and default arguments split into several configuration objects by function, which we will not go into here;
  2. Initialize kv_caches: via self.model_executor (expanded in the next section), determine how much memory can be used for kv_caches and create tensors to occupy that memory. We already observed and analyzed this behavior in the "actual GPU memory usage" part of the Qwen2 inference & deployment post; have a look there if it is unclear;
  3. Build scheduler: resource scheduling mostly happens during inference;
  4. Others: e.g. creating output_processor; this is not the focus here.
# vllm/engine/llm_engine.py
class LLMEngine:

    def __init__(self, ...):
        # ...
        self.model_executor = executor_class(
            model_config=model_config,
            cache_config=cache_config,
            parallel_config=parallel_config,
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
            vision_language_config=vision_language_config,
            speculative_config=speculative_config,
            load_config=load_config,
        )    # 1) build model_executor from the input configs

        if not self.model_config.embedding_mode:
            self._initialize_kv_caches()   # 2) initialize the kv caches

        # 3) build the scheduler
        self.scheduler = Scheduler(scheduler_config, cache_config, lora_config)

        # 4) create the output processor, used when producing the final outputs
        # Create sequence output processor, e.g. for beam search or speculative decoding.
        self.output_processor = (
            SequenceGroupOutputProcessor.create_output_processor(
                self.scheduler_config,
                self.detokenizer,
                self.scheduler,
                self.seq_counter,
                self.get_tokenizer_for_seq,
                stop_checker=StopChecker(
                    self.scheduler_config.max_model_len,
                    self.get_tokenizer_for_seq,
                ),
            ))

    def _initialize_kv_caches(self) -> None:
        """Initialize the KV cache in the worker(s).

        The workers will determine the number of blocks in both the GPU cache
        and the swap CPU cache.
        """
        num_gpu_blocks, num_cpu_blocks = (
            self.model_executor.determine_num_available_blocks())

        if self.cache_config.num_gpu_blocks_override is not None:
            num_gpu_blocks_override = self.cache_config.num_gpu_blocks_override
            logger.info(
                "Overriding num_gpu_blocks=%d with "
                "num_gpu_blocks_override=%d", num_gpu_blocks,
                num_gpu_blocks_override)
            num_gpu_blocks = num_gpu_blocks_override

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
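For reference, the cache-related knobs referenced throughout this post (gpu_memory_utilization, swap_space, block_size, enforce_eager) are exposed as keyword arguments on LLM in recent vLLM versions and forwarded through EngineArgs into cache_config and model_config. The sketch below is written under that assumption, with illustrative values only; check your vLLM version.

# Sketch (assumed API surface): how the knobs used by _initialize_kv_caches()
# and the classes below can be set from user code. Values are illustrative.
from vllm import LLM

llm = LLM(
    model=DEFAULT_CKPT_PATH,         # as in the earlier example
    gpu_memory_utilization=0.9,      # fraction of GPU memory vLLM may use (default 0.9)
    swap_space=4,                    # CPU swap space in GiB -> cache_config.swap_space_bytes
    block_size=16,                   # tokens per KV-cache block
    enforce_eager=False,             # False allows CUDA graph capture (see ModelRunner below)
)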

4. GPUExecutor

What does model_executor (here GPUExecutor) do during initialization? GPUExecutor inherits from the base class ExecutorBase; its __init__() calls self._init_executor(), which does the following:

  1. Create the worker with self._create_worker(): the worker is actually created through WorkerWrapperBase, and different configurations map to different worker types. By default it is Worker; when you use speculative decoding it is SpecDecodeWorker (used well, speculative decoding can speed up decoding);
  2. The worker initializes the device: self.driver_worker.init_device();
  3. The worker loads the model: self.driver_worker.load_model().
    As mentioned earlier, after being created, GPUExecutor is also used to initialize the kv_caches, as shown in LLMEngine._initialize_kv_caches() in the previous section. Two GPUExecutor methods are involved:
  4. self.determine_num_available_blocks(): returns the numbers of currently available gpu_blocks and cpu_blocks. A "block" here means GPU/CPU memory is divided into chunks of a given block_size, each corresponding to a fixed amount of GPU/CPU memory;
  5. initialize_cache(): once num_gpu_blocks and num_cpu_blocks are known, i.e. how much GPU and CPU memory can be used for kv_caches, this method claims those resources and initializes the cache.
    This briefly covers what GPUExecutor does early on, but these operations essentially delegate to the worker it creates, which we turn to in the next section.
# vllm/executor/gpu_executor.py
class GPUExecutor(ExecutorBase):

    def _init_executor(self) -> None:
        """Initialize the worker and load the model."""
        assert self.parallel_config.world_size == 1, (
            "GPUExecutor only supports single GPU.")

        self.driver_worker = self._create_worker()   # create the worker
        self.driver_worker.init_device()             # initialize the device
        self.driver_worker.load_model()              # load the model

    def _create_worker(self,
                       local_rank: int = 0,
                       rank: int = 0,
                       distributed_init_method: Optional[str] = None):
        if self.speculative_config is None:
            worker_module_name = "vllm.worker.worker"
            worker_class_name = "Worker"
        else:
            worker_module_name = "vllm.spec_decode.spec_decode_worker"
            worker_class_name = "create_spec_worker"

        wrapper = WorkerWrapperBase(
            worker_module_name=worker_module_name,
            worker_class_name=worker_class_name,
        )
        wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
                                                      distributed_init_method))
        return wrapper.worker

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Determine the number of available KV blocks by invoking the
        underlying worker."""
        return self.driver_worker.determine_num_available_blocks()

    def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks) -> None:
        """Initialize the KV cache by invoking the underlying worker."""
        # NOTE: This is logged in the executor because there can be >1 worker
        # with other executors. We could log in the engine level, but work
        # remains to abstract away the device for non-GPU configurations.
        logger.info("# GPU blocks: %d, # CPU blocks: %d", num_gpu_blocks,
                    num_cpu_blocks)
        self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)

5. Worker

Worker carries two main responsibilities, the model and the cache, corresponding to the members model_runner and cache_engine respectively.

  1. self.model_runner: for text generation with a large model (the case covered here) it is a ModelRunner instance; with embedding_mode it is an EmbeddingModelRunner instance;
  2. self.cache_engine: a CacheEngine instance; self.initialize_cache() mainly initializes self.cache_engine, covered two sections below.
    As for self.determine_num_available_blocks(), it returns num_gpu_blocks and num_cpu_blocks, which are obtained as follows:
  • num_gpu_blocks: after clearing the CUDA cache, run one forward pass to profile the model's memory usage, then query the current CUDA device's free and total memory, from which the peak usage peak_memory can be computed. The memory available for kv_caches is then total_gpu_memory * self.cache_config.gpu_memory_utilization - peak_memory, where gpu_memory_utilization is the GPU memory utilization fraction (default 0.9). Since the cache is organized in blocks, dividing by cache_block_size gives num_gpu_blocks, where cache_block_size is the number of bytes per block (covered under CacheEngine); a small worked example follows this list;
  • num_cpu_blocks: the model never computes on the CPU, but KV data can be cached there and swapped back to the GPU when needed; the size of this memory is self.cache_config.swap_space_bytes, 4 GB by default.
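To make the block arithmetic concrete, here is a tiny sketch with made-up numbers (a Qwen2-7B-like configuration on a hypothetical 24 GiB card; none of these are measured values):

# Made-up numbers to illustrate determine_num_available_blocks(); real values
# depend on your GPU, model, and configuration.
GiB = 1024 ** 3
total_gpu_memory = 24 * GiB              # hypothetical 24 GiB card
peak_memory = 16 * GiB                   # weights + activations observed by profile_run()
gpu_memory_utilization = 0.9             # default
cache_block_size = 917_504               # bytes per block for a Qwen2-7B-like config (see CacheEngine)

kv_cache_memory = total_gpu_memory * gpu_memory_utilization - peak_memory
num_gpu_blocks = int(kv_cache_memory // cache_block_size)    # ~6553 blocks, ~104k cached tokens

swap_space_bytes = 4 * GiB               # default CPU swap space
num_cpu_blocks = int(swap_space_bytes // cache_block_size)   # ~4681 blocks
print(num_gpu_blocks, num_cpu_blocks)

The actual Worker implementation follows.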
# vllm/worker/worker.py
class Worker(WorkerBase):

    def __init__(self, ...):   # the arguments are configuration objects, omitted here
        # unrelated code omitted
        ModelRunnerClass = (EmbeddingModelRunner if
                            self.model_config.embedding_mode else ModelRunner)
        self.model_runner = ModelRunnerClass(
            model_config,
            parallel_config,
            scheduler_config,
            device_config,
            cache_config,
            load_config=load_config,
            lora_config=self.lora_config,
            kv_cache_dtype=self.cache_config.cache_dtype,
            is_driver_worker=is_driver_worker,
            vision_language_config=vision_language_config,
        )
        # Uninitialized cache engine. Will be initialized by
        # initialize_cache.
        self.cache_engine: CacheEngine
        # Initialize gpu_cache as embedding models don't initialize kv_caches
        self.gpu_cache: Optional[List[torch.tensor]] = None

    # ------------- parts called from GPUExecutor during initialization ------------- #
    def init_device(self) -> None:
        if self.device_config.device.type == "cuda":
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            torch.cuda.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            torch.cuda.empty_cache()
            self.init_gpu_memory = torch.cuda.mem_get_info()[0]
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.parallel_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank)
        # Set random seed.
        set_random_seed(self.model_config.seed)

    def load_model(self):
        self.model_runner.load_model()

    # ------------------------- model runner related ------------------------- #
    @torch.inference_mode()
    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Profiles the peak memory usage of the model to determine how many
        KV blocks may be allocated without OOMs.
        """
        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # NOTE(woosuk): Here we assume that the other processes using the same
        # GPU did not change their memory usage during the profiling.
        peak_memory = self.init_gpu_memory - free_gpu_memory
        assert peak_memory > 0, (
            "Error in memory profiling. This happens when the GPU memory was "
            "not properly cleaned up before initializing the vLLM instance.")

        cache_block_size = self.get_cache_block_size_bytes()
        num_gpu_blocks = int(
            (total_gpu_memory * self.cache_config.gpu_memory_utilization -
             peak_memory) // cache_block_size)
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

    # ------------------------------ cache related ------------------------------ #
    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        """Allocate GPU and CPU KV cache with the specified number of blocks.

        This also warms up the model, which may record CUDA graphs.
        """
        raise_if_cache_size_invalid(num_gpu_blocks,
                                    self.cache_config.block_size,
                                    self.model_config.max_model_len)

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self._init_cache_engine()
        self._warm_up_model()

    def _init_cache_engine(self):
        assert self.cache_config.num_gpu_blocks is not None
        self.cache_engine = CacheEngine(self.cache_config, self.model_config,
                                        self.parallel_config)
        self.gpu_cache = self.cache_engine.gpu_cache

    def _warm_up_model(self) -> None:
        if not self.model_config.enforce_eager:
            self.model_runner.capture_model(self.gpu_cache)
        # Reset the seed to ensure that the random state is not affected by
        # the model initialization and profiling.
        set_random_seed(self.model_config.seed)

6. ModelRunner

This section covers two ModelRunner methods: self.profile_run() and self.capture_model().

  1. self.profile_run() runs dummy inputs through the model to measure actual memory usage. The key call is self.execute_model(seqs, kv_caches), which means we need to prepare the input seqs and the cache kv_caches; kv_caches is used in the model's attention computation and starts out as all None. Detailed comments are given below.
class ModelRunner:
    # ...

    @torch.inference_mode()
    def profile_run(self) -> None:
        # top-k sampling, used to profile memory usage
        sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)
        # max number of tokens processed in one batch, typically 32k
        max_num_batched_tokens = self.scheduler_config.max_num_batched_tokens
        # max number of sequences, typically 256
        max_num_seqs = self.scheduler_config.max_num_seqs

        # for profiling we use max_num_seqs sequences whose total token count
        # equals max_num_batched_tokens
        seqs: List[SequenceGroupMetadata] = []
        model_config = self.model_config
        # lora: omitted
        # vlm: omitted

        for group_id in range(max_num_seqs):
            # spread all tokens evenly to get each sequence's length
            seq_len = (max_num_batched_tokens // max_num_seqs +
                       (group_id < max_num_batched_tokens % max_num_seqs))
            # SequenceData and SequenceGroupMetadata were covered in earlier posts
            # build the dummy input
            seq_data = SequenceData([0] * seq_len)
            dummy_multi_modal_data = None
            seq = SequenceGroupMetadata(
                request_id=str(group_id),
                is_prompt=True,
                seq_data={group_id: seq_data},
                sampling_params=sampling_params,
                block_tables=None,
                lora_request=dummy_lora_requests_per_seq[group_id]
                if dummy_lora_requests_per_seq else None,
                multi_modal_data=dummy_multi_modal_data,
            )
            seqs.append(seq)

        # build kv_caches; inference has not started yet, so initialize with None
        num_layers = self.model_config.get_num_layers(self.parallel_config)
        kv_caches = [None] * num_layers
        # run the model
        self.execute_model(seqs, kv_caches)
        # cuda synchronize
        torch.cuda.synchronize()
        return

    @torch.inference_mode()
    def execute_model(
        self,
        seq_group_metadata_list: Optional[List[SequenceGroupMetadata]],
        kv_caches: List[torch.Tensor],
    ) -> Optional[SamplerOutput]:
        # prepare the input tensors
        (input_tokens, input_positions, attn_metadata, sampling_metadata,
         lora_requests, lora_mapping, multi_modal_kwargs
         ) = self.prepare_input_tensors(seq_group_metadata_list)
        # lora: omitted

        # only use the cuda graph in the decode phase (it improves efficiency)
        prefill_meta = attn_metadata.prefill_metadata  # details not needed for now
        decode_meta = attn_metadata.decode_metadata
        if prefill_meta is None and decode_meta.use_cuda_graph:
            graph_batch_size = input_tokens.shape[0]
            model_executable = self.graph_runners[graph_batch_size]
        else:
            model_executable = self.model

        # run the model; models are defined in vllm/model_executor/models/,
        # here that is the qwen2.py file
        hidden_states = model_executable(
            input_ids=input_tokens,
            positions=input_positions,
            kv_caches=kv_caches,
            attn_metadata=attn_metadata,
            **multi_modal_kwargs,
        )

        # Compute the logits.
        logits = self.model.compute_logits(hidden_states, sampling_metadata)

        # Only perform sampling in the driver worker.
        if not self.is_driver_worker:
            return None

        # Sample the next token.
        output = self.model.sample(
            logits=logits,
            sampling_metadata=sampling_metadata,
        )
        return output
  2. self.capture_model() uses CUDA Graphs (only in the decode phase) to capture the model's execution so that the captured graph can be replayed during later inference, improving performance; brief comments are given in the code.
class ModelRunner:
    # ...

    @torch.inference_mode()
    def capture_model(self, kv_caches: List[torch.Tensor]) -> None:
        """Cuda graph capture a model. ...

        CUDA graphs are mainly used in the decode phase: for large batch sizes
        the speedup is limited, and because CUDA graphs require fixed-size
        tensors, supporting large or variable batch sizes would cost a lot of
        extra GPU memory.
        """
        # informational messages, worth reading once
        assert not self.model_config.enforce_eager
        logger.info("Capturing the model for CUDA graphs. This may lead to "
                    "unexpected consequences if the model is not static. To "
                    "run the model in eager mode, set 'enforce_eager=True' or "
                    "use '--enforce-eager' in the CLI.")
        logger.info("CUDA graphs can take additional 1~3 GiB memory per GPU. "
                    "If you are running out of memory, consider decreasing "
                    "`gpu_memory_utilization` or enforcing eager mode. "
                    "You can also reduce the `max_num_seqs` as needed "
                    "to decrease memory usage.")
        start_time = time.perf_counter()

        # dummy inputs
        max_batch_size = max(_BATCH_SIZES_TO_CAPTURE)
        input_tokens = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        input_positions = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        slot_mapping = torch.empty(max_batch_size, dtype=torch.long).cuda()
        slot_mapping.fill_(_PAD_SLOT_ID)
        seq_lens = torch.ones(max_batch_size, dtype=torch.int32).cuda()
        block_tables = torch.from_numpy(self.graph_block_tables).cuda()

        # buffer for the output hidden states; it is filled after the
        # first graph capture.
        hidden_states: Optional[torch.Tensor] = None

        # the batch sizes that need to be captured
        graph_batch_size = _get_graph_batch_size(
            self.scheduler_config.max_num_seqs)
        batch_size_capture_list = [
            bs for bs in _BATCH_SIZES_TO_CAPTURE if bs <= graph_batch_size
        ]

        # capture the CUDA graphs; graph_capture() is a context manager
        # (handles some parallelism details)
        with graph_capture() as graph_capture_context:
            # NOTE: Capturing the largest batch size first may help reduce the
            # memory usage of CUDA graph.
            for batch_size in reversed(batch_size_capture_list):
                # Create dummy attn_metadata.
                attn_metadata = self.attn_backend.make_metadata(
                    num_prefills=0,
                    num_prefill_tokens=0,
                    num_decode_tokens=batch_size,
                    slot_mapping=slot_mapping[:batch_size],
                    seq_lens=None,
                    seq_lens_tensor=seq_lens[:batch_size],
                    max_query_len=None,
                    max_prefill_seq_len=0,
                    max_decode_seq_len=self.max_seq_len_to_capture,
                    query_start_loc=None,
                    seq_start_loc=None,
                    context_lens_tensor=None,
                    block_tables=block_tables[:batch_size],
                    use_cuda_graph=True,
                )

                if self.lora_config:
                    lora_mapping = LoRAMapping(
                        [0] * batch_size,
                        [0] * batch_size,
                    )
                    self.set_active_loras(set(), lora_mapping)

                # create a CUDAGraphRunner and capture the model's execution
                # with its capture method
                graph_runner = CUDAGraphRunner(self.model)
                hidden_states = graph_runner.capture(
                    input_tokens[:batch_size],
                    input_positions[:batch_size],
                    hidden_states[:batch_size]
                    if hidden_states is not None else None,
                    kv_caches,
                    attn_metadata,
                    memory_pool=self.graph_memory_pool,
                    stream=graph_capture_context.stream,
                )
                self.graph_memory_pool = graph_runner.graph.pool()
                # store the graph_runner
                self.graph_runners[batch_size] = graph_runner

        end_time = time.perf_counter()
        elapsed_time = end_time - start_time
        # This usually takes < 10 seconds.
        logger.info("Graph capturing finished in %.0f secs.", elapsed_time)
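To get a feel for what CUDAGraphRunner does internally, here is a minimal sketch using PyTorch's public CUDA graph API (torch.cuda.CUDAGraph / torch.cuda.graph). It is a simplified illustration of the capture-then-replay idea with static buffers, not vLLM's actual implementation; the toy Linear model and sizes are assumptions.

# Minimal capture-then-replay sketch with plain PyTorch; vLLM's CUDAGraphRunner
# wraps the same idea around the model's forward pass with static buffers.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")     # fixed-size input buffer (like input_tokens above)

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):                           # records the kernels launched inside the block
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then re-launch the recorded kernels.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.sum())                              # static_output now holds the result for the new input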

7. CacheEngine

  1. self._allocate_kv_cache(): this method allocates (initializes) the kv_cache; its purpose is to claim the GPU and CPU resources up front;
  2. self.get_cache_block_size(): computes the number of bytes per block, i.e. the cache_block_size used in Worker above. 1) Each block holds the kv_caches of block_size tokens; 2) per token, k has num_heads * head_size * num_layers elements, and so does v; 3) from 1) and 2) we get the number of elements per block, and the data type then gives the number of bytes per block (see the comments in the code; a worked example follows this list);
  3. self.swap_in(), self.swap_out(), and self.copy(): these are not used during initialization, but briefly: when handling many user requests, resources must be reallocated. For example, if some data was previously cached on the CPU and the GPU now has free memory, self.swap_in() moves that data to the GPU for computation; conversely, when GPU memory is full, part of the cache on the GPU may be moved to the CPU to wait for a chance to come back, which is what self.swap_out() does.
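To make the numbers concrete, here is a worked example of get_cache_block_size() with assumed Qwen2-7B-like values (28 layers, 4 KV heads per rank, head_size 128, bf16); these are illustrative assumptions, not figures from the original post.

# Worked example of the block-size formula with assumed values.
block_size   = 16      # tokens per block (default)
num_kv_heads = 4       # KV heads per tensor-parallel rank (assumed, Qwen2-7B-like)
head_size    = 128     # dimension per head (assumed)
num_layers   = 28      # decoder layers (assumed)
dtype_size   = 2       # bytes per element for bf16

key_cache_block   = block_size * num_kv_heads * head_size      # k elements per layer per block
value_cache_block = key_cache_block                            # v is the same size
total_elements    = num_layers * (key_cache_block + value_cache_block)
cache_block_size  = dtype_size * total_elements

print(cache_block_size)            # 917504 bytes, i.e. 0.875 MiB per 16-token block

Dividing the KV-cache budget computed in the Worker section by this value yields num_gpu_blocks. The CacheEngine source follows.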
# vllm/worker/cache_engine.py
class CacheEngine:
    """Manages the KV cache.

    This class is responsible for initializing and managing the GPU and CPU KV
    caches. It also provides methods for performing KV cache operations, such
    as swapping and copying.
    """

    def __init__(
        self,
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> None:
        self.cache_config = cache_config   # the configs passed in
        self.model_config = model_config
        self.parallel_config = parallel_config

        self.head_size = model_config.get_head_size()   # dimension of each attention head
        self.num_layers = model_config.get_num_layers(parallel_config)  # layers per pipeline-parallel rank
        self.num_kv_heads = model_config.get_num_kv_heads(parallel_config)  # kv heads per tensor-parallel rank

        self.block_size = cache_config.block_size
        self.num_gpu_blocks = cache_config.num_gpu_blocks
        self.num_cpu_blocks = cache_config.num_cpu_blocks

        if cache_config.cache_dtype == "auto":
            self.dtype = model_config.dtype
        else:
            self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

        # Get attention backend.
        self.attn_backend = get_attn_backend(
            model_config.get_num_attention_heads(parallel_config),
            self.head_size,
            self.num_kv_heads,
            model_config.get_sliding_window(),
            model_config.dtype,
            cache_config.cache_dtype,
            self.block_size,
        )

        # Initialize the cache.
        self.gpu_cache = self._allocate_kv_cache(self.num_gpu_blocks, "cuda")
        self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")

    def _allocate_kv_cache(
        self,
        num_blocks: int,
        device: str,
    ) -> List[torch.Tensor]:
        """Allocates KV cache on the specified device."""
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads, self.head_size)  # compute the kv_cache shape
        pin_memory = is_pin_memory_available() if device == "cpu" else False
        kv_cache: List[torch.Tensor] = []
        for _ in range(self.num_layers):   # append the kv cache of each layer
            # null block in CpuGpuBlockAllocator requires at least that
            # block to be zeroed-out.
            # We zero-out everything for simplicity.
            kv_cache.append(
                torch.zeros(kv_cache_shape,
                            dtype=self.dtype,
                            pin_memory=pin_memory,
                            device=device))
        return kv_cache

    def swap_in(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.cpu_cache[i], self.gpu_cache[i],
                                          src_to_dst)

    def swap_out(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.gpu_cache[i], self.cpu_cache[i],
                                          src_to_dst)

    def copy(self, src_to_dsts: torch.Tensor) -> None:
        self.attn_backend.copy_blocks(self.gpu_cache, src_to_dsts)

    @staticmethod
    def get_cache_block_size(
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> int:
        head_size = model_config.get_head_size()
        num_heads = model_config.get_num_kv_heads(parallel_config)
        num_layers = model_config.get_num_layers(parallel_config)

        # block_size defaults to 16, i.e. each block stores the kv_caches of 16 tokens;
        # to get the size in bytes, account for both k and v, each with
        # num_heads * head_size * num_layers elements per token
        key_cache_block = cache_config.block_size * num_heads * head_size
        value_cache_block = key_cache_block
        total = num_layers * (key_cache_block + value_cache_block)
        if cache_config.cache_dtype == "auto":
            dtype = model_config.dtype
        else:
            dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
        dtype_size = get_dtype_size(dtype)   # dtype_size is 2 for bf16
        return dtype_size * total

Summary

This post covered the initialization of LLMEngine, touching on methods of GPUExecutor, Worker, ModelRunner, and CacheEngine, which should help in understanding what vLLM does before text generation starts (the initialization phase). The other important part of LLMEngine, the Scheduler, will be covered in later posts (the request-handling phase).
