
A Complete Guide to GPT-5 Prompt Optimization: Migrate and Improve Quickly with the Prompt Optimizer to Build Faster, More Reliable Intelligent Applications

Part 1: Introduction and Setup

  • The GPT-5 series is the strongest model family available today, with step-change improvements across the board. GPT-5 is particularly strong at agentic task execution, coding, and steerability, making it a fit for curious everyday users and advanced researchers alike.
  • GPT-5 still benefits from classic prompting best practices. To make optimization and migration easier, we introduced the GPT-5 Prompt Optimizer in the Playground, which helps you improve existing prompts and migrate them to GPT-5 and other OpenAI models.

Prompt Optimizer Demo

  • In this cookbook, we show how to get started quickly with the Prompt Optimizer, solve your task with GPT-5, and demonstrate the measurable improvements that prompt optimization delivers.

Migrating and Optimizing Prompts

  • When working with large language models (LLMs), designing effective prompts is essential. The Prompt Optimizer's goal is to apply the best practices and formatting that work best for our models to your prompt, and to remove common prompt failure modes, such as (see the sketch after this list):
    • contradictory instructions within the prompt
    • missing or unclear output format
    • inconsistencies between the prompt and its few-shot examples
  • Beyond tuning for the target model, the Optimizer also factors in your target task and applies key guidance for scenarios such as agentic workflows, coding, and multimodality. Below, we highlight the optimizations through before-and-after comparisons.
  • A reminder: there is no one-size-fits-all prompt. Experiment thoroughly and iterate continuously to find what works best for your problem.
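As a small illustration of the first failure mode, here is a hypothetical example of a contradictory prompt and one consistent resolution (the strings are ours, not Optimizer output):

```python
# Hypothetical example of a contradictory prompt (not Optimizer output).
contradictory_prompt = (
    "Respond in JSON only.\n"
    "Explain your reasoning in a short paragraph before the answer."  # conflicts with JSON-only
)

# One consistent resolution: keep a single output contract.
consistent_prompt = (
    "Respond in JSON only, with keys 'reasoning' (one short paragraph) and 'answer'."
)
```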

Environment Setup

  • Make sure your OpenAI API key is set (environment variable OPENAI_API_KEY) and that you have access to GPT-5.

Code:

```python
import os

required = ('OPENAI_API_KEY',)
missing = [k for k in required if not os.getenv(k)]
print('OPENAI_API_KEY is set!' if not missing
      else 'Missing environment variable: ' + ', '.join(missing) + '. Please set them before running the workflow.')
```

Output: OPENAI_API_KEY is set!

Install Dependencies

Code:

```python
%pip install -r requirements.txt --quiet
```

Part 2: Coding and Analysis Example (Streaming Top-K Frequent Words)

Task Description

  • This example focuses on an area where the model has improved markedly: coding and analysis. We ask the model to generate a Python script that computes the exact Top-K most frequent tokens from a large text stream under a specific tokenization spec (a tiny worked example of the spec follows this list).
  • Tasks like this are highly sensitive to prompt quality. A poor prompt can steer the model toward the wrong algorithms and paths (e.g., approximate sketches vs. multi-pass/disk-assisted exact solutions), significantly affecting accuracy and runtime.
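To make the spec concrete, here is a minimal sketch of the tokenization and ordering rules on a toy input (illustrative only; it is not the streaming solution being evaluated):

```python
# Minimal illustration of the task spec: ASCII [a-z0-9]+ tokens, matched
# case-insensitively and lowercased per token; ties broken by (count desc, token asc).
import re
from collections import Counter

text = "A a b b b c1 C1 c1 -- d! d? e"
k = 3
tokens = (m.group(0).lower()
          for m in re.finditer(r"[a-z0-9]+", text, flags=re.ASCII | re.IGNORECASE))
counts = Counter(tokens)
top_k = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
print(top_k)  # [('b', 3), ('c1', 3), ('a', 2)]
```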

Evaluation Dimensions

  • Compile/execution success rate over 30 runs
  • Mean runtime (successful runs)
  • Mean peak memory (successful runs)
  • Exactness: the output must match the ground-truth Top-K exactly, with ties broken by count descending, then token ascending (a checker sketch follows this list)
  • Note: evaluated on an M4 Max MacBook Pro; adjust the constraints for your hardware as needed.
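A strict exactness check is simple in principle. Here is a hedged sketch of what it might look like (the cookbook's actual evaluator lives in scripts/topk_eval.py, which is not shown here):

```python
# Hedged sketch of a strict exactness check: the candidate must equal the
# ground truth element-for-element, including tie ordering.
def exact_match(candidate, text, k):
    import re
    from collections import Counter
    counts = Counter(m.group(0).lower()
                     for m in re.finditer(r"[a-z0-9]+", text, flags=re.ASCII | re.IGNORECASE))
    truth = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return list(candidate) == truth
```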

Our Baseline Prompt

  • Below is a typical starting prompt. It contains a few mildly self-contradictory requirements, along with vague or underspecified instructions. On reasoning-focused models like GPT-5, contradictory instructions degrade performance and increase latency, while vague instructions invite unwanted behavior.

Code:

```python
baseline_prompt = """
Write Python to solve the task on a MacBook Pro (M4 Max). Keep it fast and lightweight.
- Prefer the standard library; use external packages if they make things simpler.
- Stream input in one pass to keep memory low; reread or cache if that makes the solution clearer.
- Aim for exact results; approximate methods are fine when they don't change the outcome in practice.
- Avoid global state; expose a convenient global like top_k so it's easy to check.
- Keep comments minimal; add brief explanations where helpful.
- Sort results in a natural, human-friendly way; follow strict tie rules when applicable.
Output only a single self-contained Python script inside one Python code block, with all imports, ready to run.
"""
```

Why These Issues Matter

  • "Prefer the standard library" yet "use external packages if they make things simpler": this soft permission can push the model toward non-portable dependencies or heavier imports, changing performance, and even execution success, across environments.
  • Encouraging single-pass streaming but allowing "reread or cache if that makes the solution clearer": this ambiguity opens the door to multi-pass or in-memory caching approaches, violating the streaming constraint and changing the runtime and memory profile.
  • Demanding "exact" results while permitting "approximate methods when they don't change the outcome in practice": the model cannot reliably make that judgment, and may introduce sketches or heuristics that deviate subtly near the Top-K boundary, producing output that looks right but fails strict evaluation.
  • "Avoid global state" yet "expose a convenient global like top_k": this muddles the interface contract. Is the result a return value, or read from a global? The model may do both, adding side effects and hurting reproducibility.
  • Asking for both "minimal comments" and "brief explanations": the result is under-explained code, prose interleaved with logic, or even text leaking outside the required output format.
  • "Natural, human-friendly" sorting alongside strict tie rules: the two do not always agree. The model may take the convenient Counter.most_common path, which deviates from the canonical (-count, token) ordering on ties and causes subtle correctness failures (demonstrated in the snippet after the evaluator note below).

Summary: these softened constraints make the prompt look easy to satisfy while actually creating forks. The model may pick different branches at different times (stdlib vs. external dependencies, single pass vs. reread/cache, exact vs. approximate), producing variance in correctness, latency, and memory.

The evaluator is strict: tokenization is fixed as [a-z0-9]+ over lowercased text, and ordering is (-count, token). Any deviation is penalized on exactness, even if everything else looks reasonable.
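The tie-handling pitfall called out above is easy to demonstrate: Counter.most_common breaks ties by insertion order rather than by token ascending, so it can silently disagree with the required (-count, token) ordering:

```python
# Counter.most_common uses a stable sort on counts, so equal-count tokens keep
# insertion order; the spec instead requires ties sorted by token ascending.
from collections import Counter

counts = Counter({"zebra": 2, "apple": 2, "mango": 1})
print(counts.most_common(2))
# [('zebra', 2), ('apple', 2)]  <- insertion order on the tie
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2])
# [('apple', 2), ('zebra', 2)]  <- required ordering
```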

Generate 30 Baseline Scripts and Evaluate

Code:

```python
from scripts.gen_baseline import generate_baseline_topk

MODEL = "gpt-5"
N_RUNS = 30
CONCURRENCY = 10
OUTPUT_DIR = "results_topk_baseline"

USER_PROMPT = """
Task:
Given globals text (str) and k (int), produce the Top-K most frequent tokens.

Tokenization:
- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.
- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.

Output:
- Define top_k as a list of (token, count) tuples.
- Sort by count desc, then token asc.
- Length = min(k, number of unique tokens).

Notes:
- Run as-is with the provided globals; no file or network I/O.
"""

generate_baseline_topk(
    model=MODEL,
    n_runs=N_RUNS,
    concurrency=CONCURRENCY,
    output_dir=OUTPUT_DIR,
    dev_prompt=baseline_prompt,
    user_prompt=USER_PROMPT,
)
```

Evaluate the Generated Scripts (Baseline)

  • Benchmark each script in results_topk_baseline. This evaluation is intentionally heavyweight for larger datasets and may take several minutes.

Code:

```python
from scripts.topk_eval import evaluate_folder

evaluate_folder(
    folder_path="results_topk_baseline",
    k=500,
    scale_tokens=5_000_000,
    csv_path="run_results_topk_baseline.csv",
)
```
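For intuition, a per-script benchmark along the lines of what evaluate_folder reports might look like the sketch below. The helper name and structure are ours; the cookbook's actual evaluator is in scripts/topk_eval.py and is not shown in this post.

```python
# Hedged sketch: one way a per-script benchmark could measure runtime and peak memory.
import time, tracemalloc

def benchmark(script_source: str, text: str, k: int):
    globals_ns = {"text": text, "k": k}
    tracemalloc.start()
    t0 = time.perf_counter()
    exec(script_source, globals_ns)          # run the generated script with the task globals
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return globals_ns.get("top_k"), elapsed, peak / 1024  # result, seconds, peak KiB
```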

Part 3: Improving the Prompt with the Prompt Optimizer

  • Open the OpenAI Optimize Playground and paste your existing prompt into the Developer Message area.
  • Click Optimize to open the optimization panel. You can either:
    1. specify the changes you want reflected; or
    2. simply click Optimize again and let it rewrite the prompt according to best practices for the target model and task.
  • For this example, we start with the default optimization.

[Screenshot: Prompt Optimizer panel]

  • When it finishes, review the optimized result and the change notes (which include the modified snippets and the reasons for each change). You can explore interactively by expanding the comments or using the inline review mode.
  • We then add one more change: enforcing single-pass streaming. This is easy to do within the Prompt Optimizer's iterative flow.

[Screenshot: Prompt Optimizer with the single-pass streaming edit]

  • Once you are happy with the optimized version, save it as a Prompt Object from the top right of the optimizer. Using that object directly in API calls makes later iteration, version management, and reuse across applications easier.

[Screenshot: saving the optimized prompt as a Prompt Object]

Evaluating the Optimized Prompt

  • For clarity, the optimized prompt is pasted in full below; alternatively, you can pass just the prompt_id and a version number, as in the sketch that follows.
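A minimal sketch of referencing a saved Prompt Object by id (the id string below is a placeholder, not a real object; check the Prompt Object panel for your actual id and version):

```python
# Hedged sketch: calling the Responses API with a saved Prompt Object.
# The prompt id below is hypothetical.
from openai import OpenAI

client = OpenAI()
resp = client.responses.create(
    model="gpt-5",
    prompt={"id": "pmpt_YOUR_PROMPT_ID", "version": "2"},  # hypothetical id/version
    input="Given globals text (str) and k (int), produce the Top-K most frequent tokens.",
)
print(resp.output_text)
```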

Code:

````python
optimized_prompt = """
# Objective
Generate a single, self-contained Python script that exactly solves the specified task on a MacBook Pro (M4 Max).

# Hard requirements
- Use only Python stdlib. No approximate algorithms.
- Tokenization: ASCII [a-z0-9]+ on the original text; match case-insensitively and lowercase tokens individually. Do NOT call text.lower() on the full string.
- Exact Top-K semantics: sort by count desc, then token asc. No reliance on Counter.most_common tie behavior.
- Define `top_k` as a list of (token, count) tuples with length = min(k, number of unique tokens).
- When globals `text` (str) and `k` (int) exist, do not reassign them; set `top_k` from those globals. If you include a `__main__` demo, guard it to run only when globals are absent.
- No file I/O, stdin, or network access, except optionally printing `top_k` as the last line.

# Performance & memory constraints
- Do NOT materialize the entire token stream or any large intermediate list.
- Do NOT sort all unique (token, count) items unless k >= 0.3 * number_of_unique_tokens.
- When k < number_of_unique_tokens, compute Top-K using a bounded min-heap of size k over counts.items(), maintaining the correct tie-break (count desc, then token asc).
- Target peak additional memory beyond the counts dict to O(k). Avoid creating `items = sorted(counts.items(), ...)` for large unique sets.

# Guidance
- Build counts via a generator over re.finditer with re.ASCII | re.IGNORECASE; lowercase each matched token before counting.
- Prefer heapq.nsmallest(k, cnt.items(), key=lambda kv: (-kv[1], kv[0])) for exact selection without full sort; avoid heapq.nlargest.
- Do NOT wrap tokens in custom comparator classes (e.g., reverse-lex __lt__) or rely on tuple tricks for heap ordering.
- Keep comments minimal; include a brief complexity note (time and space).

# Output format
- Output only one Python code block; no text outside the block.

# Examples
```python
import re, heapq
from collections import Counter
from typing import List, Tuple, Iterable

_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE)

def _tokens(s: str) -> Iterable[str]:
    # Case-insensitive match; lowercase per token to avoid copying the whole string
    for m in _TOKEN.finditer(s):
        yield m.group(0).lower()

def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]:
    if k <= 0:
        return []
    cnt = Counter(_tokens(text))
    u = len(cnt)
    key = lambda kv: (-kv[1], kv[0])
    if k >= u:
        return sorted(cnt.items(), key=key)
    # Exact selection with bounded memory
    return heapq.nsmallest(k, cnt.items(), key=key)

# Compute from provided globals when available; demo only if missing and running as main
try:
    text; k  # type: ignore[name-defined]
except NameError:
    if __name__ == "__main__":
        demo_text = "A a b b b c1 C1 c1 -- d! d? e"
        demo_k = 3
        top_k = top_k_tokens(demo_text, demo_k)
        print(top_k)
else:
    top_k = top_k_tokens(text, k)  # type: ignore[name-defined]
# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k)
```
"""
````


Generate 30 Optimized Scripts and Evaluate

Code:
```python
from scripts.gen_optimized import generate_optimized_topk

MODEL = "gpt-5"
N_RUNS = 30
CONCURRENCY = 10
OUTPUT_DIR = "results_topk_optimized"

USER_PROMPT = """
Task:
Given globals text (str) and k (int), produce the Top-K most frequent tokens.

Tokenization:
- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.
- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.

Output:
- Define top_k as a list of (token, count) tuples.
- Sort by count desc, then token asc.
- Length = min(k, number of unique tokens).

Notes:
- Run as-is with the provided globals; no file or network I/O.
"""

generate_optimized_topk(
    model=MODEL,
    n_runs=N_RUNS,
    concurrency=CONCURRENCY,
    output_dir=OUTPUT_DIR,
    dev_prompt=optimized_prompt,
    user_prompt=USER_PROMPT,
)
```

Same Evaluation Flow (Optimized)

Code:

```python
from scripts.topk_eval import evaluate_folder

evaluate_folder(
    folder_path="results_topk_optimized",
    k=500,
    scale_tokens=5_000_000,
    csv_path="run_results_topk_optimized.csv",
)
```

Part 4: Adding Subjective Scoring with an LLM Judge

  • Beyond the quantitative metrics, we also evaluate more subjective dimensions such as code quality and task adherence. We prepared an example judge prompt, llm_as_judge.txt; a hedged sketch of what such a rubric might contain follows.
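The contents of llm_as_judge.txt are not reproduced in this post. Purely as an illustration of the pattern, a rubric-style judge prompt might look like this (wording is ours, not the cookbook's file):

```python
# Hypothetical rubric-style judge prompt (illustrative; not the cookbook's llm_as_judge.txt).
judge_prompt_sketch = """You are a strict code reviewer. Given a task description and a
candidate Python script, score it on two axes, each from 1 to 5:
- task_adherence: does the script follow every stated constraint?
- code_quality: clarity, idiomatic style, and appropriate comments.
Return JSON only: {"task_adherence": <int>, "code_quality": <int>, "rationale": "<one sentence>"}
"""
```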

Code:

```python
from scripts.llm_judge import judge_folder

# Run LLM-as-judge for baseline results
judge_folder(
    results_dir="results_topk_baseline",
    out_dir=None,  # auto-map to results_llm_as_judge_baseline
    model="gpt-5",
    system_prompt_path="llm_as_judge.txt",
    task_text=None,  # use default task description
    concurrency=6,
)

# Run LLM-as-judge for optimized results
judge_folder(
    results_dir="results_topk_optimized",
    out_dir=None,  # auto-map to results_llm_as_judge_optimized
    model="gpt-5",
    system_prompt_path="llm_as_judge.txt",
    task_text=None,
    concurrency=6,
)
```

Summarizing and Visualizing Results

  • Combine the quantitative results with the LLM-judge results for a summary view.

Code:

```python
from pathlib import Path
import importlib
import scripts.results_summarizer as rs
from IPython.display import Markdown, display

importlib.reload(rs)

fig = rs.render_charts(
    quant_baseline=Path("results_topk_baseline")/"run_results_topk_baseline.csv",
    quant_optimized=Path("results_topk_optimized")/"run_results_topk_optimized.csv",
    judge_baseline=Path("results_llm_as_judge_baseline")/"judgement_summary.csv",
    judge_optimized=Path("results_llm_as_judge_optimized")/"judgement_summary.csv",
    auto_display=True,
    close_after=True,
)
md = rs.build_markdown_summary(
    quant_baseline=Path("results_topk_baseline")/"run_results_topk_baseline.csv",
    quant_optimized=Path("results_topk_optimized")/"run_results_topk_optimized.csv",
    judge_baseline=Path("results_llm_as_judge_baseline")/"judgement_summary.csv",
    judge_optimized=Path("results_llm_as_judge_optimized")/"judgement_summary.csv",
)
display(Markdown(md))
print(md)
```

Output (figure and Markdown generated by the notebook): image generated by notebook; <IPython.core.display.Markdown object>

Prompt Optimization Results: Coding Task

  • Even at baseline, GPT-5 produces correct code, but the optimized prompt delivers better overall quality by tightening constraints and resolving ambiguities.

| Metric                    | Baseline | Optimized | Δ (Opt − Base) |
|---------------------------|---------:|----------:|---------------:|
| Mean time (s)             |    7.906 |     6.977 |         -0.929 |
| Peak memory (KB)          |   3626.3 |     577.5 |        -3048.8 |
| Exactness (%)             |    100.0 |     100.0 |            0.0 |
| Sort correctness (%)      |    100.0 |     100.0 |            0.0 |
| LLM task adherence (1–5)  |     4.40 |      4.90 |          +0.50 |
| Code quality (1–5)        |     4.73 |      4.90 |          +0.16 |

Part 5: Context and Retrieval (FailSafeQA Financial QA Simulation)

Background

  • Most production settings face imperfect queries and noisy context. The FailSafeQA benchmark deliberately perturbs queries (misspellings, incompleteness, out-of-domain phrasing) and contexts (missing, OCR-corrupted, irrelevant documents), and reports Robustness, Context Grounding, and Compliance: whether the model answers when signal is present, and whether it holds back and refuses when it is not. (A small illustration of the perturbation types follows.)
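Purely for intuition, here are hypothetical examples of the two perturbation families (the strings are ours, not drawn from the dataset):

```python
# Hypothetical illustrations of FailSafeQA-style perturbations (not dataset samples).
query_perturbations = {
    "misspelled": "What was teh company's net revnue in FY2023?",
    "incomplete": "net revenue 2023?",
    "out_of_domain_phrasing": "Roughly how much cash did they pull in last year?",
}
context_perturbations = [
    "missing context",        # no document provided
    "OCR-corrupted context",  # broken words, stray symbols
    "irrelevant document",    # wrong filing entirely
]
```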

[Figure: FailSafeQA benchmark overview]

Links

  • Paper (arXiv:2502.06329): Expect the Unexpected: FailSafe Long Context QA for Finance
  • Dataset (Hugging Face): https://huggingface.co/datasets/Writer/FailSafeQA
  • Authors/affiliation: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh (Writer.ai); see the arXiv page for the full author list

Evaluation Flow

  • Use the helper script to compare the baseline and optimized prompts.

Code:

```python
# Define the Baseline FailSafeQA system prompt here for reuse
baseline_prompt_fsqa = (
    "You are a finance QA assistant. Answer ONLY using the provided context.\n"
    "If the context is missing or irrelevant, politely refuse and state that you need the relevant document."
)
```
  • We use the Prompt Optimizer again to build a better-suited prompt. Following best practices for long-context QA, we should repeatedly remind the model to answer strictly from [Context] and to refuse when the context is insufficient. A single press of Optimize (with no extra instructions) already yields a well-structured optimized prompt, shown below.

[Screenshot: optimized FailSafeQA prompt in the Prompt Optimizer]

Code:

```python
optimized_fsqa_prompt = """You are a finance document QA assistant.

Behavioral priorities (in order):
1) Grounding: Use ONLY the text inside [Context]. Do NOT use outside knowledge or assumptions.
2) Evidence check: Before answering, verify that the answer text (numbers, entities, dates, phrasing) is explicitly present or directly entailed by [Context]. If not, refuse (see Refusal policy).
3) Robustness to query noise: The user question may contain misspellings, missing words, or non-financial phrasing. Infer intent using the context and answer if the meaning is clear and supported by the context.
4) OCR noise handling: The context may include OCR artifacts (repeated characters, stray symbols, broken words). Ignore junk characters and reconstruct meaning when the underlying sentence is still recoverable. Do not guess beyond what the context supports.

Refusal policy:
- If [Context] is empty or lacks the information to answer, reply with a brief refusal and guidance. Do NOT attempt a general-knowledge answer.
- If the question is unrelated to the content of [Context] (out of scope), reply with a brief refusal and guidance. Do NOT speculate.
- If the question is incomplete but the correct answer is unambiguous from [Context], infer the intent and answer exactly; do NOT refuse.

Answer style:
- Default to the **shortest exact answer** needed to satisfy the question (e.g., the precise number/string/date as written). Preserve units, signs, casing, currency symbols, commas, and parentheses from the context. Do NOT round numbers unless asked.
- If the user explicitly asks to “write”, “draft”, or “generate” content, you may produce multi-sentence or formatted text—but still source every factual claim strictly from [Context].
- If the question is ambiguous, state the needed clarification in one short sentence, then provide the best supported answer if possible.

Output format:
- If answerable from the context:
  FINAL: <exact answer here>
  (optional) EVIDENCE: "<very short quoted span from the context that contains the answer>"
- If refusing:
  FINAL: Insufficient information in the provided context to answer this question. Please upload the relevant document or refine your question to include the necessary details."""
```
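Because the optimized prompt pins down a FINAL/EVIDENCE output format, downstream scoring can parse answers mechanically. A minimal sketch (the helper name is ours, not part of the cookbook's scripts):

```python
# Hedged sketch: parse the FINAL/EVIDENCE format requested by the optimized prompt.
def parse_fsqa_answer(raw: str):
    final, evidence = None, None
    for line in raw.splitlines():
        if line.startswith("FINAL:"):
            final = line[len("FINAL:"):].strip()
        elif line.startswith("EVIDENCE:"):
            evidence = line[len("EVIDENCE:"):].strip().strip('"')
    return final, evidence

print(parse_fsqa_answer('FINAL: $12.4 million\nEVIDENCE: "net revenue of $12.4 million"'))
# ('$12.4 million', 'net revenue of $12.4 million')
```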
  • Run the evaluation (for demonstration we compare a single sample; you can also run the full evaluation, which takes a while).

Code:

```python
import importlib
import run_FailSafeQA
import pandas as pd
import matplotlib.pyplot as plt
from openai import OpenAI

# Ensure latest function signature is used after code edits
importlib.reload(run_FailSafeQA)
run_failsafeqa = run_FailSafeQA.run_failsafeqa

# Set idx to an integer for a quick single-example comparison; set to None for full run
idx = 0  # e.g., 0 for a single datapoint

# Helper functions:
class OpenAIAnswer:
    def __init__(self):
        self.client = OpenAI()

    def __call__(self, system_prompt: str, user_prompt: str, model: str) -> str:
        resp = self.client.responses.create(
            model=model,
            input=[
                {"role": "developer", "content": [{"type": "input_text", "text": system_prompt}]},
                {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
            ],
            text={"format": {"type": "text"}, "verbosity": "medium"},
            reasoning={"effort": "medium", "summary": "auto"},
            tools=[],
        )
        return resp.output_text

class OpenAIJudge:
    def __init__(self):
        self.client = OpenAI()

    def __call__(self, prompt: str, model: str) -> str:
        resp = self.client.responses.create(
            model=model,
            input=[{"role": "user", "content": [{"type": "input_text", "text": prompt}]}],
            text={"format": {"type": "text"}, "verbosity": "medium"},
            reasoning={"effort": "medium", "summary": "auto"},
            tools=[],
        )
        return resp.output_text

if idx is not None:
    # Single example mode (with detailed prompt/response logging)
    run_failsafeqa(
        out="results_failsafeqa_baseline.csv",
        system_prompt=baseline_prompt_fsqa,
        indices=[idx],
        log_prompts=True,
        log_chars=800,
        log_file="failsafeqa_debug.log",
    )
    run_failsafeqa(
        out="results_failsafeqa_optimized.csv",
        system_prompt=optimized_fsqa_prompt,
        indices=[idx],
        log_prompts=True,
        log_chars=800,
        log_file="failsafeqa_debug.log",
    )
    base_df = pd.read_csv("results_failsafeqa_baseline.csv")
    opt_df = pd.read_csv("results_failsafeqa_optimized.csv")
    b_one = base_df[base_df["idx"] == idx]
    o_one = opt_df[opt_df["idx"] == idx]
    comparison_df = pd.concat([b_one, o_one], ignore_index=True)
    # Keep only relevant columns
    comparison_df = comparison_df[["run", "kind", "rating", "compliance"]]
    # Display as table
    display(comparison_df)
else:
    # Full run mode
    run_failsafeqa(out="results_failsafeqa_baseline.csv", system_prompt=baseline_prompt_fsqa)
    run_failsafeqa(out="results_failsafeqa_optimized.csv", system_prompt=optimized_fsqa_prompt)
    base_df = pd.read_csv("results_failsafeqa_baseline.csv")
    opt_df = pd.read_csv("results_failsafeqa_optimized.csv")

    def per_kind_summary(df: pd.DataFrame) -> pd.DataFrame:
        out = df.groupby("kind").agg(
            mean_rating=("rating", lambda x: pd.to_numeric(x, errors="coerce").mean()),
            compliance_rate=("compliance", lambda x: pd.to_numeric(x, errors="coerce").fillna(0).mean()),
            count=("rating", "count"),
        )
        return out.round(3)

    base_summary = per_kind_summary(base_df)
    opt_summary = per_kind_summary(opt_df)
    summary = base_summary.join(opt_summary, lsuffix="_base", rsuffix="_opt").fillna("NA")
    print("Per-kind comparison (baseline vs optimized):")
    display(summary)

    # Plot compliance rate comparison per kind
    kinds = summary.index.tolist()
    x = range(len(kinds))
    base_vals = summary["compliance_rate_base"].astype(float).tolist()
    opt_vals = summary["compliance_rate_opt"].astype(float).tolist()
    fig, ax = plt.subplots(figsize=(10, 4))
    width = 0.35
    ax.bar([i - width/2 for i in x], base_vals, width=width, label="Baseline", color="#cbd5e1")
    ax.bar([i + width/2 for i in x], opt_vals, width=width, label="Optimized", color="#60a5fa")
    ax.set_xticks(list(x))
    ax.set_xticklabels(kinds, rotation=45, ha="right")
    ax.set_ylim(0, 1)
    ax.set_ylabel("Compliance rate")
    ax.set_title("FailSafeQA — Per-kind Compliance (Baseline vs Optimized)")
    ax.legend()
    plt.tight_layout()
    plt.show()

    # Overall metrics
    def overall(df: pd.DataFrame):
        return {
            "mean_rating": float(pd.to_numeric(df["rating"], errors="coerce").mean()),
            "mean_compliance": float(pd.to_numeric(df["compliance"], errors="coerce").fillna(0).mean()),
        }

    print("Overall — Baseline:", overall(base_df))
    print("Overall — Optimized:", overall(opt_df))

from IPython.display import Markdown, display

def build_markdown_summary_from_metrics(
    robust_base: float, ground_base: float,
    robust_opt: float, ground_opt: float,
    threshold: int = 6,
    src_base: str = "results_failsafeqa.csv",
    src_opt: str = "results_failsafeqa.csv",
) -> str:
    d_r = robust_opt - robust_base
    d_g = ground_opt - ground_base
    # Data rows
    rows = [
        ["Metric", "Baseline", "Optimized", "Δ (Opt − Base)"],
        ["Robustness (avg across datapoints)", f"{robust_base:.3f}", f"{robust_opt:.3f}", f"{d_r:+.3f}"],
        ["Context Grounding (avg across datapoints)", f"{ground_base:.3f}", f"{ground_opt:.3f}", f"{d_g:+.3f}"],
    ]
    # Calculate column widths for alignment
    col_widths = [max(len(str(row[i])) for row in rows) for i in range(len(rows[0]))]
    # Build table lines with padding
    lines = []
    for i, row in enumerate(rows):
        padded = [str(cell).ljust(col_widths[j]) for j, cell in enumerate(row)]
        lines.append("| " + " | ".join(padded) + " |")
        if i == 0:  # after header
            sep = ["-" * col_widths[j] for j in range(len(row))]
            lines.append("| " + " | ".join(sep) + " |")
    table = "\n".join(lines)
    return f"""
## FailSafeQA — Summary

**Compliance threshold:** ≥ {threshold}

{table}

_Source files:_ `{src_base}` · `{src_opt}`
""".strip()

# Usage
md = build_markdown_summary_from_metrics(
    robust_base=0.320, ground_base=0.800,
    robust_opt=0.540, ground_opt=0.950,
    threshold=6,
    src_base="results_failsafeqa.csv",
    src_opt="results_failsafeqa.csv",
)

# Notebook pretty
display(Markdown(md))
print(md)
```

Output:

<IPython.core.display.Markdown object>
## FailSafeQA — Summary

**Compliance threshold:** ≥ 6

| Metric                                     | Baseline | Optimized | Δ (Opt − Base) |
|--------------------------------------------|---------:|----------:|---------------:|
| Robustness (avg across datapoints)         |    0.320 |     0.540 |         +0.220 |
| Context Grounding (avg across datapoints)  |    0.800 |     0.950 |         +0.150 |

_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`

Interpretation

  • GPT-5-mini is extremely strong on this task, so even the baseline prompt achieves ratings of ≥ 4 in most cases.
  • But if you compare the perfect-score rate (6/6), the optimized prompt scores perfect marks significantly more often on both FailSafeQA dimensions, Robustness and Context Grounding. (A sketch of computing that rate follows.)
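To reproduce that comparison from the result CSVs, a hedged sketch (assuming the same "rating" column written by run_failsafeqa above, on a 0-6 scale):

```python
# Hedged sketch: perfect-score (6/6) rate per results file.
import pandas as pd

for label, path in [("baseline", "results_failsafeqa_baseline.csv"),
                    ("optimized", "results_failsafeqa_optimized.csv")]:
    ratings = pd.to_numeric(pd.read_csv(path)["rating"], errors="coerce")
    print(f"{label}: perfect-score rate = {(ratings == 6).mean():.3f}")
```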

Part 6: Conclusion

  • We are excited for you to try GPT-5 Prompt Optimization in the OpenAI Playground. GPT-5 offers state-of-the-art intelligence, and a solid prompt significantly improves its ability to reason reliably, follow constraints, and produce cleaner, higher-quality results.
