
Understanding SentencePiece Usage in One Article

news 2025/7/17 1:53:59

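1. What is SentencePiece?
2. SentencePiece basics
- 2.1 How SentencePiece works
- 2.2 Advantages of SentencePiece
3. Using SentencePiece
- 3.1 Installing SentencePiece
- 3.2 Training and loading a model
- 3.3 encode (frequently used)
- 3.4 decode (frequently used)
- 3.5 Setting extra options (rarely used)
4. Using the Trainer
5. Naming-case questions
6. Some misconceptions
Ref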

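1. What is SentencePiece?

In natural language processing (NLP), text preprocessing is a crucial step. Whether the task is machine translation, language modeling, or question answering, converting raw text into input that a model can consume is a key part of the pipeline. How the vocabulary is built and how the text is tokenized often directly affect model performance. SentencePiece is a tool developed by Google for building vocabularies and tokenizing text. It is particularly well suited to languages without explicit word boundaries, and it performs unsupervised text segmentation at the subword level.

SentencePiece is a subword tokenizer widely used in machine translation and text generation. Unlike traditional tokenization methods, it does not depend on a language's word structure and can directly process languages written without spaces (such as Chinese and Japanese). It supports two main algorithms, Byte-Pair Encoding (BPE) and the Unigram Language Model, and offers flexible vocabulary management while generating subword units.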

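2. SentencePiece basics

2.1 How SentencePiece works

The core idea of SentencePiece is to break text into subword units. It does not rely on a predefined vocabulary; instead it learns the subword units automatically from corpus statistics. It handles both space-delimited languages (such as English) and languages written without spaces (such as Chinese and Japanese). The benefit is that a word missing from the vocabulary can still be represented as a sequence of known subwords, and the vocabulary itself stays small, which mitigates the OOV (out-of-vocabulary) problem.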
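To make the "unknown word becomes known subwords" idea concrete, here is a minimal sketch of Viterbi segmentation over a toy vocabulary. The function name segment, the vocabulary, and the log-probabilities are all made up for illustration; this is not SentencePiece's actual implementation, only the general idea behind a unigram-style segmenter.

```python
import math

def segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces found in `log_probs`."""
    n = len(text)
    # best[i] = (score of the best segmentation of text[:i], start of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical vocabulary: three multi-character pieces plus every single character.
toy_log_probs = {"▁un": -3.0, "believ": -5.0, "able": -4.0}
toy_log_probs.update({ch: -8.0 for ch in set("▁unbelievable")})

# "unbelievable" is not a vocabulary entry, yet it is still segmented.
print(segment("▁unbelievable", toy_log_probs))  # ['▁un', 'believ', 'able']
```

Because every single character is in the toy vocabulary, any string built from those characters can be segmented; the multi-character pieces simply win whenever they score higher.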

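SentencePiece provides two main tokenization algorithms:

Byte-Pair Encoding (BPE): a greedy algorithm that builds the subword vocabulary step by step, repeatedly merging the most frequent pair of adjacent characters or subwords.
Unigram Language Model: a probabilistic model that uses a language model to choose the best subword segmentation. It starts from a large candidate subword vocabulary and gradually removes low-probability subwords, keeping the high-probability ones as the final vocabulary.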
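The BPE merge loop described above can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on a hypothetical word-frequency table rather than raw sentences, breaks ties by first occurrence, and skips the optimizations a real implementation would use. The name learn_bpe_merges and the sample counts are illustrative only.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} table."""
    # Every word starts out as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Made-up frequencies, chosen only to show the mechanics.
merges, corpus = learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each merge adds one new symbol to the vocabulary, which is why the number of merges gives direct control over the vocabulary size.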

📝 SentencePiece is implemented in C++ and exposes its Python interface through SWIG.

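2.2 Advantages of SentencePiece

Language independence: SentencePiece does not depend on a language's word structure and can be applied directly to languages written without spaces (such as Chinese).
Flexible subword units: with subword segmentation, the model can handle unseen words and avoid the OOV problem.
Controllable vocabulary size: setting the vocabulary size controls the number of subwords precisely, balancing model performance against storage cost.
Simpler preprocessing: traditional pipelines tokenize the text and build the vocabulary as separate steps, whereas SentencePiece combines the two and simplifies the workflow.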

3. Using SentencePiece

SentencePiece provides an easy-to-use Python API that is quick to integrate into a project. This chapter covers how to use it for text segmentation, vocabulary training, and subword encoding.

3.1 Installing SentencePiece

Before using SentencePiece, install the library with pip:

pip install sentencepiece

Once installed, the SentencePiece Python API is available for tokenization and vocabulary building.

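3.2 Training and loading a model

The first major function of SentencePiece is training a tokenization model. It learns the subword segmentation patterns in a corpus and produces a vocabulary that can be used for encoding later. Training consists of the following steps:

Prepare the text data to train on.
Train the model with SentencePieceTrainer.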

import sentencepiece as spm

# Assume a text file 'data.txt' containing the text we want to train on
spm.SentencePieceTrainer.train(input='data.txt', model_prefix='mymodel', vocab_size=8000)

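In the code above, train trains the model: input specifies the input file, model_prefix specifies the prefix of the output files, and vocab_size is the desired vocabulary size. Training produces two files:

mymodel.model: the tokenization model file, which contains the subword segmentation rules.
mymodel.vocab: the vocabulary file, which lists all generated subwords.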

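After training, the generated model can be loaded for segmentation and encoding. First, load the model file with SentencePieceProcessor.

import sentencepiece as spm

# Load the trained model
sp = spm.SentencePieceProcessor(model_file='mymodel.model')

Once loaded, the sp object is the tokenizer used for all text processing.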

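📝 The .model file is binary and the .vocab file is plain text. The .model file already contains the vocabulary, so the .vocab file does not need to be passed in when loading the model. The .vocab file exists only to help developers inspect the vocabulary.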
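Since the .vocab file is plain text with one piece and its score per line, separated by a tab, and the line position corresponding to the token id, it is easy to inspect by hand. The sketch below parses such lines; parse_vocab_lines is a hypothetical helper and the sample lines are invented for illustration, not taken from a real model.

```python
def parse_vocab_lines(lines):
    """Parse '.vocab'-style lines into (piece, score) tuples; list index = token id."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the piece itself is kept intact.
        piece, score = line.rsplit("\t", 1)
        entries.append((piece, float(score)))
    return entries

# Invented sample content in the same shape as a .vocab file.
sample_lines = ["<unk>\t0", "<s>\t0", "</s>\t0", "▁\t-1.4849", "i\t-2.31823"]
vocab = parse_vocab_lines(sample_lines)
piece_to_id = {piece: idx for idx, (piece, _) in enumerate(vocab)}

print(vocab[3])             # ('▁', -1.4849)
print(piece_to_id["</s>"])  # 2
```

For a real file, the same function could be fed the result of open('mymodel.vocab', encoding='utf-8'), assuming the file was produced as described above.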

The following examples use the LLaMA tokenizer.

Check the vocabulary size:

# All four calls return the same value
print(sp.vocab_size())
print(sp.get_piece_size())
print(sp.piece_size())
print(len(sp))
# All print 32000

Get the whole vocabulary:

for i in range(len(sp)):
    print(sp.id_to_piece(i))

Or run

spm_export_vocab --model=mymodel.model --output=mymodel.vocab

Besides the piece_to_id method, the id of a token can also be obtained through __getitem__:

print(sp['<unk>'])  # 0

Inspect the special tokens:

print(sp.unk_id())  # 0
print(sp.bos_id())  # 1
print(sp.eos_id())  # 2
print(sp.pad_id())  # -1, meaning no pad token is set

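3.3 encode (frequently used)

After loading the model, SentencePieceProcessor can tokenize text and restore it.

encode and decode are the two most frequently used methods: the former tokenizes text, and the latter restores the tokenized result to a string.

The signature of encode is as follows:

def encode(
    self,
    input: str,
    out_type: Type[Union[int, str]] = int,
    add_bos: bool = False,
    add_eos: bool = False,
) -> Union[List[int], List[str]]:

out_type determines whether the result is a List[int] or a List[str].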

text = "This is a test."
print(sp.encode(text))
print(sp.encode(text, out_type=str))
print(sp.encode(text, add_bos=True, add_eos=True))
print(sp.encode(text, add_bos=True, out_type=str))

The outputs are, respectively:

[910, 338, 263, 1243, 29889]
['▁This', '▁is', '▁a', '▁test', '.']
[1, 910, 338, 263, 1243, 29889, 2]
['<s>', '▁This', '▁is', '▁a', '▁test', '.']

Batches of text can be tokenized as well:

text1 = "This is a test."
text2 = "Another test sentence."
text3 = "SentencePiece is a useful tool for tokenization."
text4 = "Batch encoding is efficient."
texts = [text1, text2, text3, text4]

print(sp.encode(texts))
print('-' * 15)
print(sp.encode(texts, out_type=str))

Output:

[[910, 338, 263, 1243, 29889], [7280, 1243, 10541, 29889], [28048, 663, 29925, 347, 346, 338, 263, 5407, 5780, 363, 5993, 2133, 29889], [350, 905, 8025, 338, 8543, 29889]]
---------------
[['▁This', '▁is', '▁a', '▁test', '.'], ['▁Another', '▁test', '▁sentence', '.'], ['▁Sent', 'ence', 'P', 'ie', 'ce', '▁is', '▁a', '▁useful', '▁tool', '▁for', '▁token', 'ization', '.'], ['▁B', 'atch', '▁encoding', '▁is', '▁efficient', '.']]

If specifying out_type every time feels tedious, sp also provides the encode_as_ids and encode_as_pieces methods, which are equivalent to:

encode_as_ids = lambda input: encode(input, out_type=int)
encode_as_pieces = lambda input: encode(input, out_type=str)

3.4 decode (frequently used)

decode restores a tokenized result to a string. Its signature is as follows:

def decode(
    self,
    input: Union[List[int], List[str]],
) -> str:

decode detects the input type automatically, as shown here:

text = "This is a test."
list_int = sp.encode_as_ids(text)
list_str = sp.encode_as_pieces(text)

print(sp.decode(list_int))
print(sp.decode(list_str))

Both calls restore the original text correctly.
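To see why a list of pieces can be turned back into the original string, it helps to look at the ▁ (U+2581) marker that stands in for a space. The following is a simplified sketch of that idea, not the library's decode: detokenize is a hypothetical helper that ignores control symbols, byte fallback, and normalization, and it assumes the default behavior of adding a dummy leading space.

```python
def detokenize(pieces):
    """Join pieces and map the '▁' marker back to spaces (simplified sketch)."""
    text = "".join(pieces).replace("▁", " ")
    # Assumption: a dummy space was prepended during encoding, so drop it again.
    return text[1:] if text.startswith(" ") else text

print(detokenize(['▁This', '▁is', '▁a', '▁test', '.']))      # This is a test.
print(detokenize(['▁Sent', 'ence', 'P', 'ie', 'ce']))        # SentencePiece
```

Because whitespace is kept inside the pieces themselves, no language-specific rules are needed to decide where spaces go when joining.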

sp also provides decode_ids and decode_pieces, but they add little in practice: both simply delegate to decode, so you may as well call decode directly.

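3.5 Setting extra options (rarely used)

SentencePiece provides a number of options, for example:

# Append eos by default when encoding
sp.set_encode_extra_options("eos")
text = "This is a test."
print(sp.encode(text))  # [910, 338, 263, 1243, 29889, 2]

At this point even passing add_eos=False has no effect, so set_encode_extra_options takes the highest priority and should be used with care.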

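Common options are:

eos: append eos to the tokenization result by default.
bos: prepend bos to the tokenization result by default.
reverse: reverse the tokenization result by default.
unk: replace unknown tokens in the tokenization result with the unk piece.

Multiple options can also be combined with a colon: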

sp.set_encode_extra_options("bos:eos:reverse")
text = "This is a test."
print(sp.encode(text, out_type=str))
# ['</s>', '.', '▁test', '▁a', '▁is', '▁This', '<s>']
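The ordering in that output can be reproduced with a small pure-Python emulation. It is only a sketch: apply_extra_options is a hypothetical function, and the assumption that options are applied left to right is inferred from the output shown above rather than taken from the library's source.

```python
def apply_extra_options(pieces, options, bos="<s>", eos="</s>"):
    """Emulate 'bos', 'eos' and 'reverse' on a list of pieces (illustrative only)."""
    result = list(pieces)
    # Assumption: options are applied in the order they are written.
    for option in filter(None, options.split(":")):
        if option == "bos":
            result.insert(0, bos)
        elif option == "eos":
            result.append(eos)
        elif option == "reverse":
            result.reverse()
        else:
            raise ValueError(f"option not covered by this sketch: {option}")
    return result

pieces = ['▁This', '▁is', '▁a', '▁test', '.']
print(apply_extra_options(pieces, "bos:eos:reverse"))
# ['</s>', '.', '▁test', '▁a', '▁is', '▁This', '<s>']
```

Under this reading, bos and eos are attached first and the whole sequence is then reversed, which is why </s> ends up at the front.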

To clear the options that have been set, run

sp.set_encode_extra_options('')

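In addition, set_vocabulary restricts the tokenizer to a specified set of pieces, reset_vocabulary restores the full vocabulary, and load_vocabulary loads a vocabulary dynamically based on a frequency threshold. None of these three methods changes the vocabulary size of the original model.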

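4. Using the Trainer

So far we have only discussed the Processor, not the Trainer.

The parameters accepted by spm.SentencePieceTrainer.train can be listed by running

spm_train --help

The function signature is as follows (only the commonly used parameters are listed):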

def train(
    input: Optional[Union[str, List[str]]] = None,
    model_prefix: Optional[str] = None,
    vocab_size: int = 8000,
    model_type: str = "unigram",
    character_coverage: float = 0.9995,
    max_sentence_length: int = 4192,
    user_defined_symbols: Optional[Union[str, List[str]]] = None,
    unk_id: int = 0,
    bos_id: int = 1,
    eos_id: int = 2,
    pad_id: int = -1,
    unk_piece: str = '<unk>',
    bos_piece: str = '<s>',
    eos_piece: str = '</s>',
    pad_piece: str = '<pad>',
):

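input: one file or several files; several files are passed as a List[str]. Each line of a file should be one raw, unprocessed sentence.
model_prefix: after training, the two generated files are <model_prefix>.model and <model_prefix>.vocab.
vocab_size: the vocabulary size, 8000 by default. The size after training does not necessarily match this value exactly; hard_vocab_limit enforces it strictly (this option is in fact True by default).
model_type: which model to use. The choices are unigram (default), bpe, char, and word. With word, every sentence in the input file must already be pre-tokenized.
character_coverage: the fraction of characters the model covers. For languages with a large character set, such as Japanese or Chinese, 0.9995 is a good choice. For languages with a small character set, such as English, 1 is fine. The default is 0.9995.
max_sentence_length: a sentence whose length in bytes exceeds this value is skipped during training. The default is 4192.
user_defined_symbols: user-defined tokens. Note that this must not contain unk_piece, otherwise an error is raised. It may contain other special tokens such as bos_piece, in which case the two are merged into a single entry.
Special tokens: a special token can be disabled by setting its id to -1. Note that <unk> must exist. There is no <pad> token by default.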

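If bos_id is set to -1, user_defined_symbols[0] is placed at index 1 of the vocabulary (because bos_id was originally 1), and so on, filling all vacant positions from front to back. If bos_id is not set to -1 but user_defined_symbols contains bos_piece, the bos_piece entry in user_defined_symbols has no effect.

It follows that the first part of the .vocab file consists of the special tokens and user-defined symbols, while the rest consists of the subwords produced during training.

We can observe how the first part of the vocabulary is arranged: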

spm.SentencePieceTrainer.train(
    input='train.txt',
    model_prefix='m',
    vocab_size=16,
    user_defined_symbols=["<cls>", "<sep>", "<s>", "</s>", "<mask>"],
    unk_id=1,
    bos_id=3,
    eos_id=5,
)

Vocabulary:

<cls>	0
<unk>	0
<sep>	0
<s>	0
<mask>	0
</s>	0
▁	-1.4849
i	-2.31823
s	-2.31823
t	-2.31823
.	-3.31823
a	-3.31823
e	-3.31823
h	-3.31823
x	-3.31823
T	-3.31823
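The placement of the leading entries can be mimicked with a short sketch. It models the rule described above (special tokens keep their configured ids, user-defined symbols fill the remaining gaps from front to back, and a symbol that duplicates an enabled special token gets no slot of its own). layout_leading_slots is a hypothetical helper written for this article, not part of SentencePiece.

```python
from collections import deque

def layout_leading_slots(reserved, user_defined_symbols):
    """reserved: {id: piece} for the enabled special tokens. Returns pieces by id."""
    reserved_pieces = set(reserved.values())
    # Symbols that duplicate an enabled special token are merged into it.
    pending = deque(s for s in user_defined_symbols if s not in reserved_pieces)
    remaining = dict(reserved)
    slots, idx = [], 0
    while remaining or pending:
        if idx in remaining:
            slots.append(remaining.pop(idx))
        elif pending:
            slots.append(pending.popleft())
        else:
            raise ValueError(f"no symbol available to fill id {idx}")
        idx += 1
    return slots

# Same settings as the training call above: unk_id=1, bos_id=3, eos_id=5.
print(layout_leading_slots(
    {1: "<unk>", 3: "<s>", 5: "</s>"},
    ["<cls>", "<sep>", "<s>", "</s>", "<mask>"],
))
# ['<cls>', '<unk>', '<sep>', '<s>', '<mask>', '</s>']
```

The result matches the first six lines of the vocabulary listing, and with bos disabled the same rule puts the first user-defined symbol at index 1.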

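5. Naming-case questions

Careful readers may have noticed that the methods defined in __init__.py are named in CamelCase, so why can they be called with snake_case names in practice?

Part of the source code is shown below: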

def EncodeAsPieces(self, input, **kwargs):
    return self.Encode(input=input, out_type=str, **kwargs)

def EncodeAsIds(self, input, **kwargs):
    return self.Encode(input=input, out_type=int, **kwargs)

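This is because, around line 1020 of __init__.py (the exact line varies by version), SentencePiece uses the _add_snake_case function to inject snake_case aliases for the existing methods: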

def _add_snake_case(classname):
    # Add a snake_case alias for every CamelCase method of the class.
    snake_map = {}  # maps snake_case name -> original attribute
    # classname.__dict__ holds all attributes of the class, including methods.
    for k, v in classname.__dict__.items():
        # A name starting with an uppercase letter marks a CamelCase method.
        if re.match(r'^[A-Z]+', k):
            # Insert '_' before every uppercase letter except the first,
            # lowercase the result, then special-case 'n_best' -> 'nbest'.
            snake = re.sub(r'(?<!^)(?=[A-Z])', '_', k).lower().replace('n_best', 'nbest')
            snake_map[snake] = v
    # Attach each snake_case name to the class, pointing at the original method.
    for k, v in snake_map.items():
        setattr(classname, k, v)

_add_snake_case(SentencePieceProcessor)
_add_snake_case(SentencePieceTrainer)
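The name conversion itself can be tried in isolation. The sketch below pulls the same regular expression into a standalone helper; camel_to_snake and the Demo class are hypothetical names used only for this illustration.

```python
import re

def camel_to_snake(name):
    """Apply the same conversion as the regex above to a single method name."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower().replace('n_best', 'nbest')

print(camel_to_snake("EncodeAsPieces"))    # encode_as_pieces
print(camel_to_snake("NBestEncodeAsIds"))  # nbest_encode_as_ids

class Demo:
    def SayHello(self):
        return "hello"

# Inject a snake_case alias the same way, on a toy class.
for attr_name, attr in list(Demo.__dict__.items()):
    if re.match(r'^[A-Z]+', attr_name):
        setattr(Demo, camel_to_snake(attr_name), attr)

print(Demo().say_hello())  # hello
```

Both names refer to the same function object, so the snake_case form is purely an alias with no extra call overhead.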

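6. Some misconceptions

SentencePiece is not a tokenization algorithm. It is an implementation of several tokenization algorithms, with some optimizations on top. Other implementations exist as well, such as fastBPE and BlingFire.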

from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
print(tokenizer.pre_tokenizer.pre_tokenize_str("こんにちは世界"))
print(tokenizer.pre_tokenizer.pre_tokenize_str("Hello   world."))

Output:

[('▁こんにちは世界', (0, 7))]
[('▁Hello', (0, 5)), ('▁', (5, 6)), ('▁', (6, 7)), ('▁world.', (7, 14))]

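For languages written without spaces, such as Japanese, SentencePiece performs no pre-tokenization and treats the whole string as a single token.

💻 pre_tokenize_str is implemented at https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/pre_tokenizers.rs#L173

In the HuggingFace source code, the pre_tokenizer used for SentencePiece is Metaspace. Its core logic is as follows: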

import re

def pre_tokenize(text: str) -> list:
    if not text:
        return []
    # Mark every space with the meta symbol and make sure the text starts with one.
    text = text.replace(' ', '▁')
    if not text.startswith('▁'):
        text = '▁' + text
    tokens = re.findall(r'▁[^▁]*|▁', text)
    return tokens

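▁[^▁]*: matches a substring that starts with ▁ followed by zero or more non-▁ characters, that is, a word carrying a leading ▁ marker.
|▁: matches a single ▁ on its own, so each ▁ in a run of consecutive ▁ characters is also returned as a separate token.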
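A quick worked example shows the pattern in action on text that has already had its spaces replaced by ▁. The helper name find_tokens is made up for this sketch. One observation, offered as a side note rather than a statement about the HuggingFace implementation: because [^▁]* may match zero characters, the first alternative already matches a lone ▁, so on these inputs dropping the second alternative gives the same result.

```python
import re

def find_tokens(marked_text, pattern=r'▁[^▁]*|▁'):
    """Split text whose spaces were already replaced by the '▁' marker."""
    return re.findall(pattern, marked_text)

print(find_tokens("▁Hello▁▁▁world."))  # ['▁Hello', '▁', '▁', '▁world.']
print(find_tokens("▁こんにちは世界"))    # ['▁こんにちは世界']

# Same input with only the first alternative of the pattern.
print(find_tokens("▁Hello▁▁▁world.", pattern=r'▁[^▁]*'))
# ['▁Hello', '▁', '▁', '▁world.']
```

The first result lines up with the pre_tokenize_str output for "Hello   world." shown earlier, apart from the offsets.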

We can check its correctness by constructing a variety of edge-case inputs:

import re
from tokenizers.pre_tokenizers import Metaspace

def pre_tokenize_1(text: str) -> list:
    if not text:
        return []
    text = text.replace(' ', '▁')
    if not text.startswith('▁'):
        text = '▁' + text
    tokens = re.findall(r'▁[^▁]*|▁', text)
    return tokens

def pre_tokenize_2(text: str) -> list:
    global model
    res = model.pre_tokenize_str(text)
    return [token[0] for token in res]

model = Metaspace()

test_cases = [
    "hello world",
    "hello     world",
    "   leading spaces",
    "trailing spaces   ",
    "multiple   spaces in   between",
    "▁▁hello▁world",
    "▁hello     world▁▁",
    "▁hello▁world▁",
    "",
    " ",
    "▁",
    "▁ ",
    " ▁",
    "    ",
    "▁▁▁▁",
    "hello▁",
    "▁▁hello▁▁world▁▁",
    "a",
    "▁▁",
    "hello, world!",
    "▁▁@hello▁#world$",
    "▁▁(hello)▁[world]",
    "1 2 3▁▁4 5",
    "▁▁h3llo▁w0rld123▁",
    "你好 世界",
    "▁▁こんにちは▁世界",
    "Привет▁мир",
    "안녕하세요▁세계",
    "👋🏽▁🌍",
    "▁▁👋🏽▁🌍",
    "Hello▁123▁!▁你好▁世界",
    "▁▁H3llo▁▁123▁▁你好▁世界▁▁",
    "▁▁▁▁▁▁",
    "     ▁▁    ▁▁",
    "▁a▁b▁c▁d▁e▁f▁g▁",
    "▁▁▁▁▁abc",
    "▁▁▁▁▁▁▁",
    "你好世界▁▁▁▁▁",
    "▁▁▁123▁456▁789▁",
    "This▁▁is▁▁▁▁an▁▁▁▁▁▁▁example",
    "▁▁▁▁▁▁🌟🌟🌟▁🌟▁",
]

def compare_and_validate_tokenizers(test_cases, verbose=False):
    for i, text in enumerate(test_cases):
        tokens_1 = pre_tokenize_1(text)
        tokens_2 = pre_tokenize_2(text)
        if verbose:
            print(f"Test case {i+1}: '{text}'")
            print(f"pre_tokenize_1: {tokens_1}")
            print(f"pre_tokenize_2: {tokens_2}")
        if tokens_1 != tokens_2:
            if verbose:
                print("❌ Results differ!\n")
                print(f"Difference: \n pre_tokenize_1: {tokens_1}\n pre_tokenize_2: {tokens_2}\n")
            raise AssertionError(f"Test case {i+1} failed: '{text}'\n pre_tokenize_1: {tokens_1}\n pre_tokenize_2: {tokens_2}")
        elif verbose:
            print("✅ Both methods produce the same result.\n")
    print("All test cases passed!")

compare_and_validate_tokenizers(test_cases, verbose=True)