
NMT - Building Bilingual Probabilistic Dictionaries

news 2025/8/11 2:32:42

Contents

- 1. Installing dependencies
  - mosesdecoder
  - Installing mgiza++
- 2. Data preprocessing
- 3. Training

This article follows: How to train your Bicleaner
https://github.com/bitextor/bicleaner/wiki/How-to-train-your-Bicleaner

1. Installing dependencies

The process mainly depends on two tools:

mosesdecoder
mgiza++

mosesdecoder

GitHub: https://github.com/moses-smt/mosesdecoder
Official instructions: http://www2.statmt.org/moses/?n=Development.GetStarted
The official guide covers installation on Windows, macOS, and several Ubuntu versions; Ubuntu is used as the example here.

Step 1: Install the prerequisites

sudo apt-get install [package name]

Packages:

   g++ git subversion automake libtool zlib1g-dev libicu-dev libboost-all-dev libbz2-dev liblzma-dev python-dev graphviz imagemagick make cmake libgoogle-perftools-dev (for tcmalloc) autoconf doxygen

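Step 2: Build

Run this from the root of the mosesdecoder source tree:

./bjam -j4

If you installed boost manually, you can also point the build at its location.
boost installation tutorial: https://blog.csdn.net/lovechris00/article/details/125423796

./bjam --with-boost=~/workspace/temp/boost_1_64_0 -j8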

Step 3: Verify the installation

cd ~/mosesdecoder
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models
# Run the decoder on the sample input
~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

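Output like the following means the installation succeeded.
The log shows the input sentence being decoded (Translating: das ist ein kleines haus); the translation itself is written to the file out.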

Defined parameters (per moses.ini or switch):
config: phrase-model/moses.ini
...
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
**The ARPA file is missing <unk>.  Substituting log10 probability -100.000.
**************************************************************************************************
FeatureFunction: LM start: 0 end: 0
line=Distortion
...
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Created input-output object : [0.685] seconds
Translating: das ist ein kleines haus 
...
Name:moses	VmPeak:193088 kB	VmRSS:11404 kB	RSSMax:37844 kB	user:0.684	sys:0.008	CPU:0.692	real:0.692

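The source tree used in the steps above is obtained by cloning the repository (do this before running bjam):

git clone https://github.com/moses-smt/mosesdecoder.git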

Installing mgiza++

A Linux environment is used as the example here.

# Install libboost (mgiza++ needs it to compile)
sudo apt-get install -y cmake libboost-all-dev

# Download and build mgiza
git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake . && make && make install
cp scripts/merge_alignment.py bin/

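2. Data preprocessing

The guide referenced above does this step in shell; it mainly tokenizes and lowercases the text.
Here I do the same in Python.
Assume you have two files: raw.zh and raw.en.

Processing Chinese
jieba is used for word segmentation here.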
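import sys

import jieba


def process(file_path):
    # Segment each line with jieba, lowercase it, and write it to <file_path>_low.txt
    save_path = file_path + '_low.txt'
    print('\n-- start : ', file_path)
    # Open the output once in write mode so a re-run does not append duplicates
    with open(file_path, encoding='utf-8') as fin, \
            open(save_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            # Drop whitespace-only tokens so words are separated by single spaces
            zh_toks = [tok for tok in jieba.cut(line.strip()) if tok.strip()]
            zh_text = ' '.join(zh_toks).lower()
            fout.write(zh_text + '\n')
    print('-- end : ', file_path, save_path)


if __name__ == '__main__':
    file_path = sys.argv[1]
    print('-- ', file_path)
    process(file_path)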

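Processing English

import sys

import nltk


def process(file_path):
    # Tokenize each line with NLTK, lowercase it, and write it to <file_path>_low.txt
    # Note: word_tokenize needs the NLTK tokenizer data to be downloaded beforehand
    save_path = file_path + '_low.txt'
    print('\n-- start : ', file_path)
    # Open the output once in write mode so a re-run does not append duplicates
    with open(file_path, encoding='utf-8') as fin, \
            open(save_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            en_toks = nltk.word_tokenize(line.strip())
            en_text = ' '.join(en_toks).lower()
            fout.write(en_text + '\n')
    print('-- end : ', file_path, save_path)


if __name__ == '__main__':
    file_path = sys.argv[1]
    print('-- ', file_path)
    process(file_path)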

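After processing, rename the two files so that the language code is the suffix; assume the results are named clean.zh and clean.en.
Apart from the language suffix, the two names must be identical, because later steps rely on the shared prefix.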
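Word alignment expects the two files to be parallel, so line N of one file is the translation of line N of the other. The following is a minimal sketch of a sanity check to run before training. The function name check_parallel is hypothetical, and treating empty lines as errors is an assumption made here, not a rule stated in the guide above.

```python
from pathlib import Path


def check_parallel(stem, src_lang, tgt_lang):
    """Return the shared line count of <stem>.<src_lang> and <stem>.<tgt_lang>.

    Raises ValueError if the line counts differ or if any line is empty.
    """
    counts = {}
    for lang in (src_lang, tgt_lang):
        # File names follow the "shared prefix + language suffix" convention
        path = Path(f"{stem}.{lang}")
        with path.open(encoding="utf-8") as fh:
            lines = fh.read().splitlines()
        for number, line in enumerate(lines, start=1):
            if not line.strip():
                raise ValueError(f"{path}: line {number} is empty")
        counts[lang] = len(lines)
    if counts[src_lang] != counts[tgt_lang]:
        raise ValueError(f"line counts differ: {counts}")
    return counts[src_lang]
```

For the files in this article the call would be check_parallel("clean", "zh", "en"), run from the directory that contains clean.zh and clean.en.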

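3. Training

Training uses the train-model.perl script from mosesdecoder.
The mgiza bin directory must be passed in with --external-bin-dir.
--root-dir: the root directory that holds the data files (Moses also writes its output under it)
--corpus: the file-name prefix of the corpus; here it is clean
-e, -f: the language suffixes of the two files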

/home/xx/mosesdecoder/scripts/training/train-model.perl \
--alignment grow-diag-final-and \
--root-dir /home/xx/data/230303 \
--corpus clean -e en -f zh \
--mgiza --mgiza-cpus=16 --parallel --first-step 1 --last-step 4 \
--external-bin-dir /home/xx/scode/mgiza/mgizapp/bin
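Stopping at step 4 leaves the lexical translation tables as the final output, and these are the probabilistic dictionaries. The sketch below loads such a table into a nested dictionary. It assumes the tables are the files model/lex.e2f and model/lex.f2e under the root directory, and that each line holds three whitespace-separated fields: two words and a probability. Check your own output before relying on the column order. The names load_lexicon, best_match and min_prob are hypothetical, and the sample lines are made up for illustration.

```python
from collections import defaultdict


def load_lexicon(lines, min_prob=0.0):
    """Build {first_word: {second_word: prob}} from 'w1 w2 prob' lines."""
    table = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first, second, prob = fields[0], fields[1], float(fields[2])
        # Keep only entries at or above the (hypothetical) pruning threshold
        if prob >= min_prob:
            table[first][second] = prob
    return dict(table)


def best_match(table, word):
    """Return the highest-probability partner of word, or None if absent."""
    candidates = table.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Made-up sample lines, for illustration only (not real training output)
sample = [
    "house 房子 0.62",
    "house 家 0.21",
    "small 小 0.80",
]
lexicon = load_lexicon(sample, min_prob=0.1)
print(best_match(lexicon, "house"))
```

To read a real table, pass an open file object instead of the sample list, for example load_lexicon(open(path, encoding="utf-8")), where path points at one of the generated tables.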