
Deploying the glm-4-9b-chat model in code: Gradio and API modes

Preparing the environment

Before deploying the model, first estimate how much GPU memory it needs, then pick a compute server with enough VRAM.

VRAM estimation for deployment:

This is the model information for glm-4-9b-chat, which you can check in the configuration file on ModelScope.

Step 1: the memory occupied by the model weights at load time

The model has 9B parameters (1B = 10^9, one billion), and GLM-4 stores its weights in bfloat16 (16 bits = 2 bytes per parameter), so the weight memory is:

9 × 10^9 × 2 bytes = 18 × 10^9 bytes ≈ 16.76 GiB

Step 2: the memory the model needs during inference

This is mainly the size of the KV cache.

From the config we know the model has 40 hidden layers and a hidden size of 4096.

Assume an average context of 1k tokens and 10 concurrent users.

So: 40 (layers) × 4096 (hidden size) × 2 (two tensors, K and V) × 2 (bytes per bfloat16 value) × 1024 (1k tokens, adjust to your workload) × 10 (concurrent users, adjust to your workload)

40 × 4096 × 2 × 2 × 1024 × 10 bytes ≈ 6.25 GiB

Step 3: inference can produce memory fragmentation, and the number of concurrent requests may exceed 10, so add a 10–20% margin on top to keep inference from failing for lack of VRAM.

(16.76 + 6.25) × (1 + 10%) ≈ 25.31 GiB

So, assuming an average of 1k tokens and 10 concurrent users, we need a server with at least about 25.31 GiB of VRAM.
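To make this estimate easy to redo for a different token count or concurrency level, here is the same arithmetic as a small Python sketch (the constants are exactly the ones used above):

# Rough VRAM estimate for serving glm-4-9b-chat in bfloat16, same arithmetic as above
GiB = 1024 ** 3

params          = 9e9      # 9B parameters
bytes_per_param = 2        # bfloat16
layers          = 40       # from the model config
hidden_size     = 4096     # from the model config
avg_tokens      = 1024     # average context length (1k)
concurrency     = 10       # concurrent users
margin          = 0.10     # 10% head-room for fragmentation, bursts, etc.

weights  = params * bytes_per_param
kv_cache = layers * hidden_size * 2 * bytes_per_param * avg_tokens * concurrency

total = (weights + kv_cache) * (1 + margin)
print(f"weights  = {weights / GiB:.2f} GiB")    # about 16.76 GiB
print(f"kv cache = {kv_cache / GiB:.2f} GiB")   # about 6.25 GiB
print(f"total    = {total / GiB:.2f} GiB")      # about 25.31 GiB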

I am using a cloud GPU server with an RTX 4090 and 24 GB of VRAM. That is a bit below the estimate, but it is enough to play around with.
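(Optional) The requirements file below already includes bitsandbytes for INT4 loading. If 24 GB turns out to be too tight, one way to squeeze the model in is to load it quantized through transformers + bitsandbytes. This is only a sketch of that option, not something the rest of this article relies on:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = '/root/autodl-tmp/models/ZhipuAI/glm-4-9b-chat'

# 4-bit (NF4) quantized load; the weights shrink to roughly a quarter of the bfloat16 size
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)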

Installing dependencies and downloading the model files

Download the model files from ModelScope to the server:

glm-4-9b-chat

The model page offers several download options; here I use the Python SDK.

1. Upgrade pip first

python -m pip install --upgrade pip

2. Configure the Tsinghua PyPI mirror

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

3. Prepare a requirements file and install the dependencies. Note that version mismatches easily cause failures (I have already hit this pitfall):

torch>=2.3.0
torchvision>=0.18.0
# transformers==4.51.3  (this pinned version caused errors at first)
transformers>=4.48.0
huggingface-hub>=0.24.0
sentencepiece>=0.2.0
jinja2>=3.1.4
pydantic>=2.8.2
timm>=1.0.7
tiktoken>=0.7.0
accelerate>=0.32.1
sentence_transformers>=3.0.1
gradio>=4.38.1  # web demo
openai>=1.35.0  # openai demo
einops>=0.8.0
pillow>=10.4.0
sse-starlette>=2.1.2
bitsandbytes>=0.43.1  # INT4 loading
vllm==0.8.5  # the vllm version must not be too old either

4. Install the dependencies from requirements.txt

pip install -r requirements.txt

5. Install modelscope

pip install modelscope

6. Download the model files to the server, into the directory /root/autodl-tmp/models/

from modelscope import snapshot_download

model_dir = snapshot_download('ZhipuAI/glm-4-9b-chat',
                              cache_dir='/root/autodl-tmp/models',
                              revision='master')

7. Once the download finishes, check that the files in the directory match the listing on ModelScope, including the file sizes.
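A small Python snippet to list the downloaded files and their sizes (the path is the cache_dir from the previous step plus the model id):

from pathlib import Path

model_dir = Path('/root/autodl-tmp/models/ZhipuAI/glm-4-9b-chat')
total = 0
for f in sorted(model_dir.rglob('*')):
    if f.is_file():
        size = f.stat().st_size
        total += size
        print(f"{size / 1024**3:8.2f} GiB  {f.relative_to(model_dir)}")
print(f"total: {total / 1024**3:.2f} GiB")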

Using the model through Gradio

Code file: chatbot.py. Make sure the model path matches the directory you downloaded the model into.

# Import the required libraries and modules
import os
from pathlib import Path
from threading import Thread
from typing import Union

import gradio as gr
import torch
# from peft import AutoPeftModelForCausalLM, PeftModelForCausalLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    PreTrainedTokenizerFast,
    StoppingCriteria,
    StoppingCriteriaList,
    TextIteratorStreamer
)

# Type aliases for the model and tokenizer
# ModelType = Union[PreTrainedModel, PeftModelForCausalLM]
ModelType = PreTrainedModel
# TokenizerType = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
TokenizerType = PreTrainedTokenizer

# Model and tokenizer paths, read from environment variables with defaults
MODEL_PATH = os.environ.get('MODEL_PATH', '/root/autodl-tmp/models/ZhipuAI/glm-4-9b-chat')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)


def _resolve_path(path: Union[str, Path]) -> Path:
    """Resolve a (possibly relative) path and return it as an absolute Path."""
    return Path(path).expanduser().resolve()


def load_model_and_tokenizer(
        model_dir: Union[str, Path], trust_remote_code: bool = True
) -> tuple[ModelType, TokenizerType]:
    """Load the pretrained model and tokenizer and return them as a tuple."""
    model_dir = _resolve_path(model_dir)
    # If a PEFT adapter config exists, a PEFT model could be loaded here instead
    if (model_dir / 'adapter_config.json').exists():
        pass
        # model = AutoPeftModelForCausalLM.from_pretrained(
        #     model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        # )
        # tokenizer_dir = model.peft_config['default'].base_model_name_or_path
    else:
        # Load the plain model; device_map='auto' places weights on the available GPU/CPU
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        )
        # The tokenizer lives in the same directory as the model
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir, trust_remote_code=trust_remote_code, use_fast=False
    )
    return model, tokenizer


# Load the model and tokenizer instances
model, tokenizer = load_model_and_tokenizer(MODEL_PATH, trust_remote_code=True)


class StopOnTokens(StoppingCriteria):
    """Stopping criterion: stop generation once an end-of-sequence token appears."""

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # End-of-sequence token ids from the model config
        stop_ids = model.config.eos_token_id
        for stop_id in stop_ids:
            # If the last generated token is an EOS token, stop generating
            if input_ids[0][-1] == stop_id:
                return True
        return False


def parse_text(text):
    """Convert Markdown-style text (especially code blocks) to HTML for the chat view."""
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:  # start or end of a code block
            count += 1
            items = line.split('`')
            if count % 2 == 1:
                # items[-1] is the language tag of the code block
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f'<br></code></pre>'
        else:
            if i > 0:
                # Inside a code block, escape special characters
                if count % 2 == 1:
                    line = line.replace("`", "\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>" + line
    text = "".join(lines)
    return text


def predict(history, prompt, max_length, top_p, temperature):
    """Stream a model reply for the current chat history."""
    stop = StopOnTokens()
    messages = []
    # If a system prompt was provided, put it first
    if prompt:
        messages.append({"role": "system", "content": prompt})
    # history is a list of [user_msg, model_msg] pairs
    for idx, (user_msg, model_msg) in enumerate(history):
        # Skip the first history entry if it was used to set the prompt
        if prompt and idx == 0:
            continue
        # The last entry with an empty model reply is the new question
        if idx == len(history) - 1 and not model_msg:
            messages.append({"role": "user", "content": user_msg})
            break
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if model_msg:
            messages.append({"role": "assistant", "content": model_msg})

    # The tokenizer turns the message list into model inputs (PyTorch tensors)
    model_inputs = (
        tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt"
        )
        .to(next(model.parameters()).device)
    )
    # Streamer that yields text as it is generated; skip_prompt drops the input prompt,
    # skip_special_tokens drops special tokens such as [CLS], [SEP], [PAD]
    streamer = TextIteratorStreamer(
        tokenizer,
        timeout=60,  # stop waiting on the stream after 60 seconds
        skip_prompt=True,
        skip_special_tokens=True
    )
    generate_kwargs = {
        "input_ids": model_inputs,
        "streamer": streamer,
        "max_new_tokens": max_length,
        "do_sample": True,
        "top_p": top_p,
        "temperature": temperature,
        "stopping_criteria": StoppingCriteriaList([stop]),
        "repetition_penalty": 1.2,
        "eos_token_id": model.config.eos_token_id,
    }
    # Run generation in a background thread so tokens can be streamed
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()
    # Append each new token to the latest history entry and yield the updated history
    for new_token in streamer:
        if new_token:
            history[-1][1] += new_token
        yield history


# Build the web UI with Gradio
with gr.Blocks() as demo:  # gr.Blocks is the top-level layout container
    gr.HTML("""<h1 align="center">测试AI大模型应用</h1>""")
    chatbot = gr.Chatbot()  # chat history view
    with gr.Row():
        with gr.Column(scale=3):
            with gr.Column(scale=12):
                # Text box for the user's question
                user_input = gr.Textbox(show_label=False, placeholder="输入...", lines=10, container=False)
            with gr.Column(min_width=32, scale=1):
                submitBtn = gr.Button("提交")  # submit button
        with gr.Column(scale=1):
            # Text box for the system prompt
            prompt_input = gr.Textbox(show_label=False, placeholder="提示词", lines=10, container=False)
            pBtn = gr.Button("提示词设置")  # set-prompt button
        with gr.Column(scale=1):
            emptyBtn = gr.Button("清除聊天记录")  # clear-history button
            max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="最大长度", interactive=True)
            top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
            temperature = gr.Slider(0.01, 1, value=0.6, step=0.01, label="Temperature", interactive=True)

    # Append the user's question to the history and clear the input box
    def user_question(query, history):
        return "", history + [[parse_text(query), ""]]

    # Set the system prompt
    def set_prompt(prompt_text):
        return [[parse_text(prompt_text), "成功设置提示词"]]

    pBtn.click(set_prompt, inputs=[prompt_input], outputs=chatbot)
    submitBtn.click(user_question, [user_input, chatbot], [user_input, chatbot], queue=False).then(
        predict, [chatbot, prompt_input, max_length, top_p, temperature], chatbot
    )
    emptyBtn.click(lambda: (None, None), None, [chatbot, prompt_input], queue=False)

# Launch the Gradio demo
demo.queue()
demo.launch(server_name="127.0.0.1", server_port=6006, inbrowser=True, share=True)

Starting the chatbot file

python ./chatbot.py

If you see output like this, the app has started successfully.

Next, map the cloud server's port to your local machine and open http://localhost:6006 to interact with the chatbot page.
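A quick way to confirm the port forwarding works before opening the browser (assuming local port 6006 is forwarded to the server):

import requests

# A 200 status code means the Gradio page is reachable through the forwarded port
print(requests.get("http://localhost:6006").status_code)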

Deploying as an API with FastAPI and vLLM

Create an openai_server.py file; again, make sure the model path is correct.

import os
import subprocess

# text model: THUDM/glm-4-9b-chat
MODEL_PATH = os.environ.get('MODEL_PATH', '/root/autodl-tmp/models/ZhipuAI/glm-4-9b-chat')
# vision model: THUDM/glm-4v-9b
# MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4v-9b')

if '4v' in MODEL_PATH.lower():
    subprocess.run(["python", "glm4v_server.py", MODEL_PATH])
else:
    # Start a subprocess running the text-model server
    subprocess.run(["python", "glm_server.py", MODEL_PATH])

Create glm_server.py, which calls the vLLM inference engine to produce the model's responses.
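The file is fairly long, but its vLLM core boils down to the pattern below: build an AsyncLLMEngine once at startup, then stream results from engine.generate() for every request. This is only a simplified sketch of that pattern (the sampling values are illustrative); the full server file follows.

import asyncio
import time

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

MODEL_PATH = '/root/autodl-tmp/models/ZhipuAI/glm-4-9b-chat'

# Build the async engine once at startup (glm_server.py does this in __main__)
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model=MODEL_PATH,
    tensor_parallel_size=1,
    dtype="half",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    max_model_len=8192,
))

async def stream(prompt: str):
    sampling = SamplingParams(temperature=0.8, top_p=0.8, max_tokens=256,
                              stop_token_ids=[151329, 151336, 151338])
    # engine.generate() is an async generator; each item contains the text generated so far
    async for output in engine.generate(prompt=prompt, sampling_params=sampling,
                                        request_id=f"{time.time()}"):
        print(output.outputs[0].text)

asyncio.run(stream("你好,请介绍一下你自己。"))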

import time
from asyncio.log import logger
import re
import sys
import uvicorn
import gc
import json
import torch
import random
import string
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse

# Heartbeat (ping) interval for server-sent events
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

# Maximum context length for the model
MAX_MODEL_LENGTH = 8192


# Async lifespan manager: the code after `yield` runs when the app shuts down
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # the application runs while the generator is suspended here
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # free unused GPU memory
        torch.cuda.ipc_collect()   # release inter-process (IPC) resources


# Create the FastAPI app
app = FastAPI(lifespan=lifespan)

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# Generate a unique id
def generate_id(prefix: str, k=29) -> str:"""随机生成一个29长度的字符串,这个字符串有大,小写字母和数字组成prefix是ID的前缀,k 是 id的长度"""suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=k))return f"{prefix}{suffix}"class ModelCard(BaseModel):"""povo类: Web开发中的数据模型类。格式化请求或者是响应的数据"""id: str = ""object: str = "model"created: int = Field(default_factory=lambda: int(time.time()))owned_by: str = "owner"root: Optional[str] = Noneparent: Optional[str] = Nonepermission: Optional[list] = Noneclass ModelList(BaseModel):object: str = "list"data: List[ModelCard] = ["glm-4"]class FunctionCall(BaseModel):name: Optional[str] = Nonearguments: Optional[str] = Noneclass ChoiceDeltaToolCallFunction(BaseModel):name: Optional[str] = Nonearguments: Optional[str] = Noneclass UsageInfo(BaseModel):prompt_tokens: int = 0total_tokens: int = 0completion_tokens: Optional[int] = 0class ChatCompletionMessageToolCall(BaseModel):index: Optional[int] = 0id: Optional[str] = Nonefunction: FunctionCalltype: Optional[Literal["function"]] = 'function'class ChatMessage(BaseModel):# “function” 字段解释:# 使用较老的OpenAI API版本需要注意在这里添加 function 字段并在 process_messages函数中添加相应角色转换逻辑为 observationrole: Literal["user", "assistant", "system", "tool"]content: Optional[str] = Nonefunction_call: Optional[ChoiceDeltaToolCallFunction] = Nonetool_calls: Optional[List[ChatCompletionMessageToolCall]] = Noneclass DeltaMessage(BaseModel):role: Optional[Literal["user", "assistant", "system"]] = Nonecontent: Optional[str] = Nonefunction_call: Optional[ChoiceDeltaToolCallFunction] = Nonetool_calls: Optional[List[ChatCompletionMessageToolCall]] = Noneclass ChatCompletionResponseChoice(BaseModel):index: intmessage: ChatMessagefinish_reason: Literal["stop", "length", "tool_calls"]class ChatCompletionResponseStreamChoice(BaseModel):delta: DeltaMessagefinish_reason: Optional[Literal["stop", "length", "tool_calls"]]index: intclass ChatCompletionResponse(BaseModel):model: strid: Optional[str] = Field(default_factory=lambda: generate_id('chatcmpl-', 29))object: Literal["chat.completion", "chat.completion.chunk"]choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]created: Optional[int] = Field(default_factory=lambda: int(time.time()))system_fingerprint: Optional[str] = Field(default_factory=lambda: generate_id('fp_', 9))usage: Optional[UsageInfo] = Noneclass ChatCompletionRequest(BaseModel):model: strmessages: List[ChatMessage]temperature: Optional[float] = 0.8top_p: Optional[float] = 0.8max_tokens: Optional[int] = Nonestream: Optional[bool] = Falsetools: Optional[Union[dict, List[dict]]] = Nonetool_choice: Optional[Union[str, dict]] = Nonerepetition_penalty: Optional[float] = 1.1class InvalidScoreLogitsProcessor(LogitsProcessor):def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:if torch.isnan(scores).any() or torch.isinf(scores).any():scores.zero_()  # 设置为0scores[..., 5] = 5e4return scoresdef process_response(output: str, tools: dict | List[dict] = None, use_tool: bool = False) -> Union[str, dict]:"""处理大模型的中间响应数据。函数的主要功能是处理模型生成的输出,检查是否包含工具调用,并根据工具的名称和参数格式化输出。如果检测到有效的工具调用且启用了工具调用,函数将返回一个包含工具名称和参数的字典;否则,返回原始输出字符串。这种处理方式有助于在生成过程中集成和调用外部工具。"""# 第一步:分割输出lines = output.strip().split("\n")# 第二部:初始化变量arguments_json = None# 特殊工具集special_tools = ["cogview", "simple_browser"]  # glm大模型中内置的函数# 外部指定的工具集tools = {tool['function']['name'] for tool in tools} if tools else {}# 第三步:检查输出格式if len(lines) >= 2 and lines[1].startswith("{"):  # 存在工具调用function_name = lines[0].strip()  # 提取函数名称arguments = 
"\n".join(lines[1:]).strip()  # 提取函数的传入参数if function_name in tools or function_name in special_tools:try:arguments_json = json.loads(arguments)is_tool_call = Trueexcept json.JSONDecodeError:is_tool_call = function_name in special_tools# 第四步:处理工具的格式化if is_tool_call and use_tool:# 其他外部工具格式化content = {"name": function_name,"arguments": json.dumps(arguments_json if isinstance(arguments_json, dict) else arguments,ensure_ascii=False)}if function_name == "simple_browser":search_pattern = re.compile(r'search\("(.+?)"\s*,\s*recency_days\s*=\s*(\d+)\)')# 根据正则表达式提取:搜索关键词和最近的天数match = search_pattern.match(arguments)if match:content["arguments"] = json.dumps({"query": match.group(1),"recency_days": int(match.group(2))}, ensure_ascii=False)elif function_name == "cogview":content["arguments"] = json.dumps({"prompt": arguments}, ensure_ascii=False)return contentreturn output.strip()@torch.inference_mode()
async def generate_stream_glm4(params):messages = params["messages"]tools = params["tools"]tool_choice = params["tool_choice"]temperature = float(params.get("temperature", 1.0))repetition_penalty = float(params.get("repetition_penalty", 1.0))top_p = float(params.get("top_p", 1.0))max_new_tokens = int(params.get("max_tokens", 8192))# 消息处理messages = process_messages(messages, tools=tools, tool_choice=tool_choice)# 初始化输入inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)params_dict = {"n": 1,"best_of": 1,"presence_penalty": 1.0,"frequency_penalty": 0.0,"temperature": temperature,"top_p": top_p,"top_k": -1,"repetition_penalty": repetition_penalty,# "use_beam_search": False,# "length_penalty": 1,# "early_stopping": False,"stop_token_ids": [151329, 151336, 151338],"ignore_eos": False,"max_tokens": max_new_tokens,"logprobs": None,"prompt_logprobs": None,"skip_special_tokens": True,}sampling_params = SamplingParams(**params_dict)# 调用大模型async for output in engine.generate(prompt=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):output_len = len(output.outputs[0].token_ids)input_len = len(output.prompt_token_ids)ret = {"text": output.outputs[0].text,"usage": {"prompt_tokens": input_len,"completion_tokens": output_len,"total_tokens": output_len + input_len},"finish_reason": output.outputs[0].finish_reason,}yield retgc.collect()  # gc内存torch.cuda.empty_cache()  # 清空显存"""
request = ChatCompletionRequest(tools=[{"type": "function","function": {"name": "get_weather","description": "获取天气信息"}}],tool_choice={"type": "function", "function": {"name": "get_weather"}},messages=[{"role": "user", "content": "今天北京天气如何?"}]e   
)
"""
def process_messages(messages, tools=None, tool_choice="none"):_messages = messages  # 将输入的 messages 赋值给局部变量 _messages。processed_messages = []  # 用于存储处理后的消息列表。msg_has_sys = False  # 标记是否已经添加了系统消息。def filter_tools(tool_choice, tools):''''函数的作用是从给定的工具列表中筛选出与 tool_choice 中指定的函数名称相匹配的工具'''function_name = tool_choice.get('function', {}).get('name', None)if not function_name:return []filtered_tools = [tool for tool in toolsif tool.get('function', {}).get('name') == function_name]return filtered_toolsif tool_choice != "none":  # 用户有匹配的工具调用if isinstance(tool_choice, dict):tools = filter_tools(tool_choice, tools)if tools:  # 过滤之后的工具列表processed_messages.append({"role": "system","content": None,"tools": tools})msg_has_sys = Trueif isinstance(tool_choice, dict) and tools:processed_messages.append({"role": "assistant","metadata": tool_choice["function"]["name"],"content": ""})'''遍历每条消息,根据其角色 (role) 进行处理:如果角色是 "function" 或 "tool",则将其转换为观察消息。如果角色是 "assistant",并且有 tool_calls,则为每个工具调用生成一条助手消息。如果没有 tool_calls,则分割内容并为每个部分生成一条助手消息。对于其他角色(如 "user"),直接添加到 processed_messages。如果角色是 "system" 且已经添加过系统消息,则跳过这条消息。'''for m in _messages:role, content, func_call = m.role, m.content, m.function_calltool_calls = getattr(m, 'tool_calls', None)if role == "function":processed_messages.append({"role": "observation","content": content})elif role == "tool":processed_messages.append({"role": "observation","content": content,"function_call": True})elif role == "assistant":if tool_calls:for tool_call in tool_calls:processed_messages.append({"role": "assistant","metadata": tool_call.function.name,"content": tool_call.function.arguments})else:for response in content.split("\n"):if "\n" in response:metadata, sub_content = response.split("\n", maxsplit=1)else:metadata, sub_content = "", responseprocessed_messages.append({"role": role,"metadata": metadata,"content": sub_content.strip()})else:if role == "system" and msg_has_sys:msg_has_sys = Falsecontinueprocessed_messages.append({"role": role, "content": content})if not tools or tool_choice == "none":for m in _messages:if m.role == 'system':processed_messages.insert(0, {"role": m.role, "content": m.content})breakreturn processed_messages@app.get("/health")
async def health() -> Response:"""Health check."""return Response(status_code=200)@app.get("/v1/models", response_model=ModelList)
async def list_models():model_card = ModelCard(id="glm-4")return ModelList(data=[model_card])"""
post是:请求方法:POST,GET,PUT,DELETE等等;
/v1/chat/completions: 路由地址,接口的请求地址
"""@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):if len(request.messages) < 1 or request.messages[-1].role == "assistant":raise HTTPException(status_code=400, detail="Invalid request")gen_params = dict(messages=request.messages,temperature=request.temperature,top_p=request.top_p,max_tokens=request.max_tokens or 1024,echo=False,stream=request.stream,repetition_penalty=request.repetition_penalty,tools=request.tools,tool_choice=request.tool_choice,)logger.debug(f"==== request ====\n{gen_params}")# 第三步:流式传输处理if request.stream:  # 流式处理# 创建一个异步的生成器predict_stream_generator = predict_stream(request.model, gen_params)# 异步获取生成器的下一条数据output = await anext(predict_stream_generator)logger.debug(f"First result output:\n{output}")# 代码错误# if output:#     return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")function_call = Noneif output and request.tools:try:function_call = process_response(output, request.tools, use_tool=True)except:logger.warning("Failed to parse tool call")if isinstance(function_call, dict):function_call = ChoiceDeltaToolCallFunction(**function_call)generate = parse_output_text(request.model, output, function_call=function_call)return EventSourceResponse(generate, media_type="text/event-stream")else:return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")# 非流式处理response = ""async for response in generate_stream_glm4(gen_params):passif response["text"].startswith("\n"):response["text"] = response["text"][1:]response["text"] = response["text"].strip()usage = UsageInfo()function_call, finish_reason = None, "stop"tool_calls = Noneif request.tools:try:function_call = process_response(response["text"], request.tools, use_tool=True)except Exception as e:logger.warning(f"Failed to parse tool call: {e}")if isinstance(function_call, dict):finish_reason = "tool_calls"function_call_response = ChoiceDeltaToolCallFunction(**function_call)function_call_instance = FunctionCall(name=function_call_response.name,arguments=function_call_response.arguments)tool_calls = [ChatCompletionMessageToolCall(id=generate_id('call_', 24),function=function_call_instance,type="function")]message = ChatMessage(role="assistant",content=None if tool_calls else response["text"],function_call=None,tool_calls=tool_calls,)logger.debug(f"==== message ====\n{message}")choice_data = ChatCompletionResponseChoice(index=0,message=message,finish_reason=finish_reason,)task_usage = UsageInfo.model_validate(response["usage"])for usage_key, usage_value in task_usage.model_dump().items():setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)return ChatCompletionResponse(model=request.model,choices=[choice_data],object="chat.completion",usage=usage)async def predict_stream(model_id, gen_params):output = ""is_function_call = Falsehas_send_first_chunk = False  # 是否为第一个chunk。 流式输出:每次大模型输出一个chunkcreated_time = int(time.time())function_name = Noneresponse_id = generate_id('chatcmpl-', 29)system_fingerprint = generate_id('fp_', 9)tools = {tool['function']['name'] for tool in gen_params['tools']} if gen_params['tools'] else {}delta_text = ""  # 每个分块的文本内容(token)async for new_response in generate_stream_glm4(gen_params):decoded_unicode = new_response["text"]delta_text += decoded_unicode[len(output):]output = decoded_unicodelines = output.strip().split("\n")# 检查是否为工具# 这是一个简单的工具比较函数,不能保证拦截所有非工具输出的结果,比如参数未对齐等特殊情况。##TODO 如果你希望做更多处理,可以在这里进行逻辑完善。if not is_function_call and len(lines) >= 2:first_line = lines[0].strip()if first_line in tools:is_function_call = 
Truefunction_name = first_line  # 工具的名字delta_text = lines[1]  # 工具的参数# 工具调用返回if is_function_call:if not has_send_first_chunk:# 返回工具调用的第一个chunkfunction_call = {"name": function_name, "arguments": ""}tool_call = ChatCompletionMessageToolCall(index=0,id=generate_id('call_', 24),function=FunctionCall(**function_call),type="function")message = DeltaMessage(content=None,role="assistant",function_call=None,tool_calls=[tool_call])choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=None)chunk = ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield ""yield chunk.model_dump_json(exclude_unset=True)has_send_first_chunk = True# 工具调用后续的chunkfunction_call = {"name": None, "arguments": delta_text}delta_text = ""tool_call = ChatCompletionMessageToolCall(index=0,id=None,function=FunctionCall(**function_call),type="function")message = DeltaMessage(content=None,role=None,function_call=None,tool_calls=[tool_call])choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=None)chunk = ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield chunk.model_dump_json(exclude_unset=True)# 如果满足这个条件,说明用户可能希望使用工具,但目前还没有从模型输出中检测到具体的工具调用。elif (gen_params["tools"] and gen_params["tool_choice"] != "none") or is_function_call:continue# 常规返回else:finish_reason = new_response.get("finish_reason", None)# 常规返回的第一个chunkif not has_send_first_chunk:message = DeltaMessage(content="",role="assistant",function_call=None,)choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=finish_reason)chunk = ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield chunk.model_dump_json(exclude_unset=True)has_send_first_chunk = True# 常规返回的后续的chunkmessage = DeltaMessage(content=delta_text,role="assistant",function_call=None,)delta_text = ""choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=finish_reason)chunk = ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield chunk.model_dump_json(exclude_unset=True)# 工具调用需要额外返回一个字段以对齐 OpenAI 接口if is_function_call:yield ChatCompletionResponse(model=model_id,id=response_id,system_fingerprint=system_fingerprint,choices=[ChatCompletionResponseStreamChoice(index=0,delta=DeltaMessage(content=None,role=None,function_call=None,),finish_reason="tool_calls"  # 工具调用的chunk结束)],created=created_time,object="chat.completion.chunk",usage=None).model_dump_json(exclude_unset=True)elif delta_text != "":# 发出最后一个文本的chunkmessage = DeltaMessage(content="",role="assistant",function_call=None,)choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=None)chunk = ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield chunk.model_dump_json(exclude_unset=True)# 返回最后一个带有结束 标记的chunkfinish_reason = 'stop'message = DeltaMessage(content=delta_text,role="assistant",function_call=None,)delta_text = ""choice_data = ChatCompletionResponseStreamChoice(index=0,delta=message,finish_reason=finish_reason)chunk = 
ChatCompletionResponse(model=model_id,id=response_id,choices=[choice_data],created=created_time,system_fingerprint=system_fingerprint,object="chat.completion.chunk")yield chunk.model_dump_json(exclude_unset=True)yield '[DONE]'else:yield '[DONE]'async def parse_output_text(model_id: str, value: str, function_call: ChoiceDeltaToolCallFunction = None):delta = DeltaMessage(role="assistant", content=value)if function_call is not None:delta.function_call = function_callchoice_data = ChatCompletionResponseStreamChoice(index=0,delta=delta,finish_reason=None)chunk = ChatCompletionResponse(model=model_id,choices=[choice_data],object="chat.completion.chunk")yield "{}".format(chunk.model_dump_json(exclude_unset=True))yield '[DONE]'if __name__ == "__main__":MODEL_PATH = sys.argv[1]tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)engine_args = AsyncEngineArgs(model=MODEL_PATH,tokenizer=MODEL_PATH,# 如果你有多张显卡,可以在这里设置成你的显卡数量tensor_parallel_size=1,# dtype="bfloat16",dtype="half",trust_remote_code=True,# 占用显存的比例,请根据你的显卡显存大小设置合适的值,例如,如果你的显卡有80G,您只想使用24G,请按照24/80=0.3设置gpu_memory_utilization=0.9,enforce_eager=True,# worker_use_ray=False,disable_log_requests=True,max_model_len=MAX_MODEL_LENGTH,)engine = AsyncLLMEngine.from_engine_args(engine_args)# fastapi启动的代码uvicorn.run(app, host='0.0.0.0', port=6006, workers=1)

Run the openai_server file; output like this means it started successfully.
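Before building the agent test below, it is worth doing a quick sanity check against the OpenAI-compatible endpoints with the openai package (already in requirements.txt). Any non-empty api_key works because the server does not validate it:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:6006/v1/")

# The server also exposes GET /health and GET /v1/models
print(client.models.list())

resp = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "你好,请用一句话介绍一下自己。"}],
    temperature=0.8,
)
print(resp.choices[0].message.content)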

As before, map the port to your local machine, then write a file to test whether the local API can be called.


from typing import Optional

from langchain import hub
from langchain.agents import create_structured_chat_agent, AgentExecutor
from langchain_core.messages import HumanMessage
from langchain_core.tools import StructuredTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from pydantic import BaseModel, Field

# Connect to the local model served by glm_server.py
llm = ChatOpenAI(
    model='glm-4',
    temperature=1.0,
    api_key='无所谓key了',  # the server does not check the key
    base_url="http://127.0.0.1:6006/v1/"
)


# Structured type describing the tool's parameters
class AddStrLenParam(BaseModel):
    a: str = Field(description='第一个字符串参数')
    b: str = Field(description='第二个字符串参数')


# Tool function: takes two parameters and returns an int
def add_strlen(a: str, b: str) -> int:
    """分别计算两个字符串的长度,并且累加计算长度的和"""
    return len(a) + len(b)


# Wrap the function as a tool
tool = StructuredTool(
    name='add_strlen',
    description='计算两个字符串的长度并返回它们的和',
    args_schema=AddStrLenParam,
    func=add_strlen,
    return_direct=False
)
tools = [tool]

# Pull the prompt template from the LangChain hub
prompt = hub.pull('hwchase17/structured-chat-agent')

# Create the agent. It supports multi-argument tool calls, but it only decides
# whether to call a tool and packages the arguments; it does not run tools itself.
agent = create_structured_chat_agent(
    llm=llm,
    tools=tools,
    prompt=prompt
)

# The executor drives the agent and actually runs the tools; it also offers
# many options such as error handling and verbose logging.
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    # max_execution_time=5000,
    handle_parsing_errors=True,
    verbose=True
)

# agent = create_react_agent(llm, tools)
# print(agent.invoke({'messages': [HumanMessage(content='计算`呼和浩特` 和 `长风破浪` 两个字符串的长度和')]}))
print(agent_executor.invoke({'input': '`爱国者导弹拦截`的字符串长度加上`abc`字符串的长度是多少? langsmith是什么?'}))

Run this Python file on the server: it executes successfully and the agent is able to call the tool. Perfect.

Problems encountered

The problems were really just version pitfalls: with transformers and vllm versions that are too old, a method or attribute is missing from one of the classes, which produces the errors below.

(Upgrading transformers and vllm resolves it; the requirements.txt above already pins working versions, so you can use it with confidence!)
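If you hit similar errors, first check which versions are actually installed in the environment, for example:

from importlib.metadata import version

for pkg in ("torch", "transformers", "vllm", "gradio"):
    print(pkg, version(pkg))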

Could not create share link. Missing file: /root/.cache/huggingface/gradio/frpc/frpc_linux_amd64_v0.3.
Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps:
Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.3/frpc_linux_amd64
Rename the downloaded file to: frpc_linux_amd64_v0.3
Move the file to this location: /root/.cache/huggingface/gradio/frpc
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Exception in thread Thread-7 (generate):
Traceback (most recent call last):
File “/root/miniconda3/lib/python3.12/threading.py”, line 1073, in _bootstrap_inner
self.run()
File “/root/miniconda3/lib/python3.12/threading.py”, line 1010, in run
self._target(*self._args, **self._kwargs)
File “/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py”, line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py”, line 2634, in generate
result = self._sample(
^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py”, line 3621, in _sample
model_kwargs = self._update_model_kwargs_for_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py”, line 929, in _update_model_kwargs_for_generation
cache_name, cache = self._extract_past_from_model_output(outputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py”, line 1940, in getattr
raise AttributeError(
AttributeError: ‘ChatGLMForConditionalGeneration’ object has no attribute ‘_extract_past_from_model_output’
Traceback (most recent call last):
File “/root/miniconda3/lib/python3.12/site-packages/gradio/queueing.py”, line 626, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/route_utils.py”, line 350, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/blocks.py”, line 2240, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/blocks.py”, line 1759, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/utils.py”, line 762, in async_iteration
return await anext(iterator)
^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/utils.py”, line 753, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/anyio/to_thread.py”, line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/anyio/_backends/_asyncio.py”, line 2476, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/anyio/_backends/_asyncio.py”, line 967, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/utils.py”, line 736, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/site-packages/gradio/utils.py”, line 900, in gen_wrapper
response = next(iterator)
^^^^^^^^^^^^^^
File “/root/autodl-tmp/code/chatbot.py”, line 193, in predict
for new_token in streamer:
File “/root/miniconda3/lib/python3.12/site-packages/transformers/generation/streamers.py”, line 226, in next
value = self.text_queue.get(timeout=self.timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/root/miniconda3/lib/python3.12/queue.py”, line 179, in get
raise Empty
_queue.Empty
^CKeyboard interruption in main thread... closing server.
root@autodl-container-0138439568-74eddd1d:~/autodl-tmp/code# python chatbot.py
/root/autodl-tmp/code/chatbot.py:120: SyntaxWarning: invalid escape sequence '\`'
  line = line.replace("`", "\`")
Loading checkpoint shards: 100%|██████████████████| 10/10 [00:07<00:00, 1.39it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
/root/autodl-tmp/code/chatbot.py:206: UserWarning: You have not specified a value for the type parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
  chatbot = gr.Chatbot()  # chat history view
Running on local URL: http://127.0.0.1:6006