当前位置：首页 > news >正文

语言模型（LM）：n-gram模型原理与困惑度（Perplexity）计算详解

news 2025/8/8 15:54:35

文章目录

- 一、N-gram模型原理
- - 1.1 N-gram介绍和核心要点
  - 1.3 N-gram的核心思想
  - 1.4 核心目标：计算句子概率
  - 1.5 马尔可夫假设：简化条件概率
  - 1.6 模型构建：从语料库中计数
  - 1.7 数据稀疏性问题与平滑技术
- 二、困惑度详解
- - 2.1 困惑度解释
  - 2.2 为什么需要困惑度？
  - 2.3 困惑度的直观解释
- 三、实战计算：一个完整的例子
- - 3.1 步骤1：构建Bigram模型（使用加一平滑）
  - 3.2 步骤2：计算测试句子的对数似然
  - 3.3 步骤3：计算最终困惑度
- 四、用python实现n-gram统计的完整代码
- - 4.1 基本实现
  - 4.2 困惑度(Perplexity)计算
  - 4.3 高级n-gram模型实现

语言模型（Language Model, LM）是自然语言处理的核心技术之一，用于计算一个句子或序列的概率分布。N-gram模型 作为经典统计语言模型，通过捕捉局部上下文信息实现概率估计，而困惑度（Perplexity, PPL） 是其性能评估的关键指标。

一、N-gram模型原理

1.1 N-gram介绍和核心要点

简单来说：n-gram模型是一种基于统计的语言模型，通过计算词序列的概率来预测下一个词。 n-gram模型的优势包括：

简单高效: 计算复杂度低，易于实现
可解释性强: 概率计算过程直观明了
内存效率: 相比神经网络模型，内存占用较少

局限性包括：

稀疏性问题: 高阶n-gram会出现数据稀疏
上下文限制: 只能考虑固定长度的上下文
长距离依赖: 无法有效处理长距离词依赖关系

应用场景包括：

拼写检查: 检测和纠正拼写错误
机器翻译: 作为基础语言模型
语音识别: 提高识别准确率
文本生成: 生成连贯的文本序列

1.3 N-gram的核心思想

N-gram模型基于马尔可夫假设：当前词的出现仅依赖于前( n-1 )个词。例如：

Unigram（1-gram）：每个词独立出现，忽略上下文。
Bigram（2-gram）：当前词仅依赖前一个词（如“I like”→“apple”）。
Trigram（3-gram）：依赖前两个词（如“I like apples”→“but”）。

1.4 核心目标：计算句子概率

语言模型最核心的目标是计算一个句子出现的概率。给定一个句子 S = w₁ w₂ w₃ ... wₙ，我们想知道 P(S) 有多大。
根据概率的链式法则，我们可以将这个联合概率分解为一系列条件概率的乘积：
P(S) = P(w₁) * P(w₂|w₁) * P(w₃|w₁, w₂) * ... * P(wₙ|w₁, w₂, ..., wₙ₋₁)
这个公式理论上很完美，但在实践中几乎无法计算。为什么？

数据稀疏性：当上下文（w₁, w₂, ..., wₙ₋₁）很长时，这个特定的上下文组合在有限的训练语料中可能从未出现过。那么 P(wₙ|w₁, w₂, ..., wₙ₋₁) 的概率就是0，导致整个句子的概率也为0。
计算复杂度：为了计算每个词的概率，我们需要存储和查询所有可能的上下文，这在计算上是不可行的。

1.5 马尔可夫假设：简化条件概率

为了解决上述问题，我们引入了马尔可夫假设。这个假设认为，一个词的出现概率只依赖于它前面的 n-1 个词，而不是整个句子历史。
这个假设极大地简化了我们的模型。一个 n-gram 模型就是基于这个假设的。

Unigram (1-gram)：一元模型。假设每个词的出现都是独立的，只考虑词本身。
P(wₙ|w₁, ..., wₙ₋₁) ≈ P(wₙ)
Bigram (2-gram)：二元模型。假设一个词只依赖于它前面的一个词。
P(wₙ|w₁, ..., wₙ₋₁) ≈ P(wₙ|wₙ₋₁)
Trigram (3-gram)：三元模型。假设一个词只依赖于它前面的两个词。
P(wₙ|w₁, ..., wₙ₋₁) ≈ P(wₙ|wₙ₋₂, wₙ₋₁)
General n-gram：n元模型。假设一个词只依赖于它前面的 n-1 个词。
P(wₙ|w₁, ..., wₙ₋₁) ≈ P(wₙ|wₙ₋(n-1), ..., wₙ₋₁)

1.6 模型构建：从语料库中计数

基于马尔可夫假设，我们只需要在训练语料中统计 n-gram 的出现次数，就可以估算出它们的概率。
最大似然估计 是最简单直接的方法：
P(wₙ|wₙ₋(n-1), ..., wₙ₋₁) = Count(wₙ₋(n-1), ..., wₙ₋₁, wₙ) / Count(wₙ₋(n-1), ..., wₙ₋₁)

分子：统计整个 n-gram 序列在语料中出现的次数。
分母：统计其上下文（前 n-1 个词）序列出现的次数。
例子 (Bigram):
假设语料库中有 “I love deep learning” 和 “I love machine learning”。
P(love | I) = Count("I love") / Count("I") = 2 / 2 = 1.0
P(deep | love) = Count("love deep") / Count("love") = 1 / 2 = 0.5
P(machine | love) = Count("love machine") / Count("love") = 1 / 2 = 0.5

1.7 数据稀疏性问题与平滑技术

即使在Bigram模型中，我们也可能遇到 Count("wₙ₋₁, wₙ") = 0 的情况，这会导致概率为0。例如，在我们的语料中没有 “I eat”，那么 P(eat | I) 就是0。如果一个句子中包含这个未观测到的组合，整个句子的概率就变成了0。
为了解决这个问题，我们需要平滑技术。平滑的核心思想是：将一些概率从已观测到的 n-gram 中分配给未观测到的 n-gram。
常见的平滑技术包括：

加一平滑：最简单粗暴的方法，给所有 n-gram 的计数都加1。
古德-图灵平滑：更聪明的方法，它根据低频词的频率来调整概率。
Kneser-Ney平滑：目前效果最好的平滑技术之一，它考虑了历史信息，不仅关心一个词作为后缀出现的次数，还关心它作为前缀出现的次数。

二、困惑度详解

2.1 困惑度解释

数学定义: PP(W) = P(w₁,w₂,...,wₙ)^(-1/N)
直观理解: 模型对下一个词预测的"困惑程度"
评价标准: 值越低表示模型越好
典型范围: 通常在几十到几百之间

2.2 为什么需要困惑度？

我们已经有了计算句子概率的方法，但概率本身并不是一个直观的评估指标。

概率值非常小（一个20个词的句子，概率可能是 10^-20），难以比较。
句子越长，其概率值越小，这不代表它就越“不合理”。

我们需要一个标准化的指标，它能够：

与概率单调相关：概率越高，模型越好，这个指标也应该越小。
标准化句子长度：能够公平地比较长度不同的句子。

困惑度 就是这样一个指标。

2.3 困惑度的直观解释

困惑度的直观解释是：一个语言模型在预测下一个词时，平均面临的“选择困惑程度”。

低困惑度：模型对下一个词的预测非常自信，选择范围小。例如，在句子 “The cat is on the…” 后，模型预测 “mat” 的概率很高，困惑度就低。
高困惑度：模型对下一个词的预测非常不确定，选择范围大。例如，在句子 “The…” 后，模型预测 “a”, “an”, “this”, “cat”, “dog” 等词的概率差不多，困惑度就高。
一个困惑度为 K 的模型，在预测下一个词时，感觉就像要从 K 个等可能的选项中随机猜一个一样。

三、实战计算：一个完整的例子

假设我们有一个非常小的训练语料库：
"I am Sam. Sam I am. I do not like green eggs and ham."
我们要评估一个测试句子：S_test = "I am Sam"

3.1 步骤1：构建Bigram模型（使用加一平滑）

首先，我们需要统计语料库中的所有unigram和bigram。
Unigram计数:

Count("I") = 3
Count("am") = 2
Count("Sam") = 2
Count(".") = 3
Count("do") = 1
Count("not") = 1
Count("like") = 1
Count("green") = 1
Count("eggs") = 1
Count("and") = 1
Count("ham") = 1
Total Unigrams (V): 11个不同的词。
Bigram计数:
Count("I am") = 2
Count("am Sam") = 1
Count("Sam .") = 2
Count("Sam I") = 1
Count("am .") = 1
Count("I do") = 1
Count("do not") = 1
Count("not like") = 1
Count("like green") = 1
Count("green eggs") = 1
Count("eggs and") = 1
Count("and ham") = 1
Count("ham .") = 1
应用加一平滑计算Bigram概率:
平滑后的概率公式为：
P'(wᵢ|wᵢ₋₁) = (Count(wᵢ₋₁, wᵢ) + 1) / (Count(wᵢ₋₁) + V)
其中 V 是词汇表大小（11）。
P'(am | I) = (Count("I am") + 1) / (Count("I") + V) = (2 + 1) / (3 + 11) = 3 / 14
P'(Sam | am) = (Count("am Sam") + 1) / (Count("am") + V) = (1 + 1) / (2 + 11) = 2 / 13
P'(. | Sam) = (Count("Sam .") + 1) / (Count("Sam") + V) = (2 + 1) / (2 + 11) = 3 / 13
注意：测试句子的最后一个词 “Sam” 没有后续词，我们通常只计算到句尾前的词。

3.2 步骤2：计算测试句子的对数似然

测试句子 S_test = "I am Sam"，长度 N=2（我们计算两个bigram的概率）。
log PPL(S_test) = - (1/N) * [ log₂ P'(am | I) + log₂ P'(Sam | am) ]

log₂ P'(am | I) = log₂(3/14) ≈ log₂(0.214) ≈ -2.229
log₂ P'(Sam | am) = log₂(2/13) ≈ log₂(0.154) ≈ -2.700
log PPL(S_test) = - (1/2) * [ -2.229 + (-2.700) ]
log PPL(S_test) = - (1/2) * [ -4.929 ]
log PPL(S_test) = 2.4645

3.3 步骤3：计算最终困惑度

PPL(S_test) = 2^[log PPL(S_test)] = 2^2.4645 ≈ 5.52
结论：
对于测试句子 “I am Sam”，我们的加一平滑Bigram模型的困惑度约为 5.52。这意味着，在预测句子中的下一个词时，模型感觉平均要从大约5.5个等可能的选项中进行选择。
如果另一个模型在同一个测试集上计算出更低的困惑度（例如4.8），我们就认为这个新模型更好。

四、用python实现n-gram统计的完整代码

4.1 基本实现

import numpy as np
from collections import defaultdict, Counter
import mathclass NGramModel:def __init__(self, n=2):"""初始化n-gram模型Args:n (int): n-gram的阶数，n=1为unigram，n=2为bigram，n=3为trigram"""self.n = nself.ngrams = defaultdict(Counter)  # 存储n-gram计数self.context_counts = Counter()     # 存储上下文计数self.vocab = set()                  # 词汇表def tokenize(self, text):"""简单的分词方法Args:text (str): 输入文本Returns:list: 分词结果"""# 简单按空格分词，实际应用中可以使用更复杂的分词器return text.lower().split()def train(self, texts):"""训练n-gram模型Args:texts (list): 训练文本列表"""for text in texts:tokens = self.tokenize(text)# 添加开始和结束标记tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']# 统计n-gramfor i in range(len(tokens) - self.n + 1):ngram = tuple(tokens[i:i + self.n])context = ngram[:-1]  # 上下文word = ngram[-1]      # 目标词self.ngrams[context][word] += 1self.context_counts[context] += 1self.vocab.add(word)def probability(self, context, word):"""计算给定上下文条件下词的概率 P(word|context)Args:context (tuple): 上下文word (str): 目标词Returns:float: 概率值"""# 最大似然估计count_context_word = self.ngrams[context][word]count_context = self.context_counts[context]if count_context == 0:return 0.0return count_context_word / count_contextdef smoothed_probability(self, context, word, smoothing='add_one'):"""带平滑的概率计算Args:context (tuple): 上下文word (str): 目标词smoothing (str): 平滑方法 ('add_one' 或 'good_turing')Returns:float: 平滑后的概率值"""if smoothing == 'add_one':# Add-one (Laplace) 平滑count_context_word = self.ngrams[context][word]count_context = self.context_counts[context]vocab_size = len(self.vocab)return (count_context_word + 1) / (count_context + vocab_size)elif smoothing == 'good_turing':# Good-Turing平滑（简化版）count_context_word = self.ngrams[context][word]count_context = self.context_counts[context]if count_context == 0:# 如果上下文从未出现，返回一个很小的概率return 1e-10return count_context_word / count_contextelse:# 无平滑return self.probability(context, word)# 使用示例
def demo_ngram_model():# 训练数据training_texts = ["the cat sat on the mat","the dog sat on the log","cats and dogs are pets","the mat was wet","the dog ran fast"]# 创建bigram模型bigram_model = NGramModel(n=2)bigram_model.train(training_texts)# 测试概率计算context = ('the',)word = 'cat'prob = bigram_model.probability(context, word)smoothed_prob = bigram_model.smoothed_probability(context, word, 'add_one')print(f"训练文本: {training_texts}")print(f"上下文: {context}")print(f"目标词: {word}")print(f"无平滑概率: {prob:.4f}")print(f"Add-one平滑概率: {smoothed_prob:.4f}")# 显示一些n-gram统计print("\n部分n-gram统计:")for context, words in list(bigram_model.ngrams.items())[:5]:print(f"  {context}: {dict(words)}")demo_ngram_model()

4.2 困惑度(Perplexity)计算

困惑度是衡量语言模型性能的重要指标，值越低表示模型越好。

class PerplexityCalculator:def __init__(self, ngram_model):"""初始化困惑度计算器Args:ngram_model (NGramModel): 训练好的n-gram模型"""self.model = ngram_modeldef sentence_probability(self, sentence, smoothing='add_one'):"""计算句子的概率Args:sentence (str): 输入句子smoothing (str): 平滑方法Returns:float: 句子概率"""tokens = self.model.tokenize(sentence)tokens = ['<s>'] * (self.model.n - 1) + tokens + ['</s>']prob = 1.0for i in range(len(tokens) - self.model.n + 1):ngram = tuple(tokens[i:i + self.model.n])context = ngram[:-1]word = ngram[-1]word_prob = self.model.smoothed_probability(context, word, smoothing)# 避免概率为0if word_prob == 0:word_prob = 1e-10prob *= word_probreturn probdef perplexity(self, sentences, smoothing='add_one'):"""计算困惑度困惑度公式: PP(W) = P(w1,w2,...,wN)^(-1/N)其中N是句子中词的总数Args:sentences (list or str): 测试句子或句子列表smoothing (str): 平滑方法Returns:float: 困惑度值"""if isinstance(sentences, str):sentences = [sentences]total_log_prob = 0.0total_words = 0for sentence in sentences:tokens = self.model.tokenize(sentence)tokens = ['<s>'] * (self.model.n - 1) + tokens + ['</s>']sentence_length = len(tokens)# 计算句子的对数概率log_prob = 0.0for i in range(len(tokens) - self.model.n + 1):ngram = tuple(tokens[i:i + self.model.n])context = ngram[:-1]word = ngram[-1]word_prob = self.model.smoothed_probability(context, word, smoothing)if word_prob > 0:log_prob += math.log(word_prob)else:# 处理概率为0的情况log_prob += math.log(1e-10)total_log_prob += log_probtotal_words += sentence_length# 计算平均对数概率avg_log_prob = total_log_prob / total_words if total_words > 0 else 0# 计算困惑度perplexity = math.exp(-avg_log_prob)return perplexity# 完整示例
def complete_ngram_demo():# 训练数据training_data = ["the cat sat on the mat","the dog sat on the log","cats and dogs are pets","the mat was wet","the dog ran fast","a cat and a dog played","the pet sat on the mat","dogs and cats are animals"]# 测试数据test_sentences = ["the cat sat on the mat","the dog ran on the log","a dog and cat played","the unknown word appeared"]print("=== n-gram语言模型与困惑度计算示例 ===\n")# 测试不同阶数的n-gram模型for n in [1, 2, 3]:  # unigram, bigram, trigramprint(f"--- {n}-gram 模型 ---")# 训练模型model = NGramModel(n=n)model.train(training_data)# 计算困惑度calculator = PerplexityCalculator(model)perplexity = calculator.perplexity(test_sentences)print(f"模型阶数: {n}")print(f"困惑度: {perplexity:.4f}")# 显示测试句子的概率print("测试句子概率:")for sentence in test_sentences:prob = calculator.sentence_probability(sentence)print(f"  '{sentence}': {prob:.6f}")print()complete_ngram_demo()

4.3 高级n-gram模型实现

class AdvancedNGramModel(NGramModel):def __init__(self, n=2, smoothing_method='kneser_ney'):"""高级n-gram模型，支持多种平滑方法Args:n (int): n-gram阶数smoothing_method (str): 平滑方法"""super().__init__(n)self.smoothing_method = smoothing_methodself.discount = 0.75  # Kneser-Ney平滑的折扣参数def kneser_ney_probability(self, context, word):"""Kneser-Ney平滑概率计算Args:context (tuple): 上下文word (str): 目标词Returns:float: Kneser-Ney平滑概率"""# 高阶项: 续接概率continuation_count = sum(1 for contexts in self.ngrams.values() if word in contexts)total_continuations = sum(len(contexts) for contexts in self.ngrams.values())# 续接概率continuation_prob = continuation_count / total_continuations if total_continuations > 0 else 0# 回退项count_context_word = self.ngrams[context][word]count_context = self.context_counts[context]if count_context == 0:return continuation_prob# Kneser-Ney公式discounted_count = max(count_context_word - self.discount, 0)interpolation_weight = self.discount * len(self.ngrams[context]) / count_contextreturn discounted_count / count_context + interpolation_weight * continuation_probdef interpolated_probability(self, context, word, lambdas=None):"""插值平滑概率计算Args:context (tuple): 上下文word (str): 目标词lambdas (list): 插值权重Returns:float: 插值概率"""if lambdas is None:# 默认权重lambdas = [1.0/self.n] * self.nprob = 0.0# 从n-gram到unigram的插值for i in range(self.n):if i == 0:# unigram概率word_count = sum(self.ngrams[()][w] for w in self.vocab)if word_count > 0:unigram_prob = self.ngrams[()][word] / word_countelse:unigram_prob = 1.0 / len(self.vocab) if len(self.vocab) > 0 else 0prob += lambdas[i] * unigram_probelse:# i-gram概率current_context = context[-i:] if len(context) >= i else contextcount_context_word = self.ngrams[current_context][word]count_context = self.context_counts[current_context]if count_context > 0:ngram_prob = count_context_word / count_contextelse:ngram_prob = 0prob += lambdas[i] * ngram_probreturn prob# 性能比较示例
def compare_smoothing_methods():training_data = ["the cat sat on the mat","the dog sat on the log","cats and dogs are pets","the mat was wet","the dog ran fast"]test_sentences = ["the cat sat on the mat","a new sentence with unknown words"]print("=== 不同平滑方法性能比较 ===\n")# 创建bigram模型models = {'Add-One': NGramModel(n=2),'Kneser-Ney': AdvancedNGramModel(n=2),}for name, model in models.items():model.train(training_data)calculator = PerplexityCalculator(model)perplexity = calculator.perplexity(test_sentences)print(f"{name} 平滑方法的困惑度: {perplexity:.4f}")compare_smoothing_methods()