
Word Alignment - MGIZA++

Contents

    • About MGIZA++
      • giza-py
    • Installing MGIZA++
    • Command Reference
      • mkcls
      • d4norm
      • hmmnorm
      • plain2snt
      • snt2cooc
      • snt2coocrmp
      • snt2plain
      • symal
      • mgiza
        • general parameters:
        • No. of iterations:
        • parameter for various heuristics in GIZA++ for efficient training:
        • parameters for describing the type and amount of output:
        • parameters describing input files:
        • smoothing parameters:
        • parameters modifying the models:
        • parameters modifying the EM-algorithm:


About MGIZA++

A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training, and incremental training.

  • Github: https://github.com/moses-smt/mgiza

MGIZA++ is a multi-threaded extension of GIZA++. When using it, you can specify how many processors to use according to your machine.

PGIZA++ is a related tool that runs GIZA++ on distributed machines, built on a MapReduce-style framework.


giza-py

https://github.com/sillsdev/giza-py
giza-py is a simple, Python-based, command-line runner for MGIZA++, a popular tool for building word alignment models.


Reference (in Chinese): on parallelizing model training in Moses
https://www.52nlp.cn/the-issue-of-parallel-in-moses-model-training


Installing MGIZA++

1. Download the repo: https://github.com/moses-smt/mgiza

2. In a terminal, change into the mgizapp directory and run:

cmake . 
make
make install
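
If the tools should be installed to a specific location, the standard CMAKE_INSTALL_PREFIX option can be passed when configuring; a sketch (the path is just an example):

cmake -DCMAKE_INSTALL_PREFIX=/opt/mgiza .
make
make install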

The following executables will be available in the bin directory:

  • hmmnorm
  • mkcls
  • snt2cooc
  • snt2plain
  • d4norm
  • mgiza
  • plain2snt
  • snt2coocrmp
  • symal

Command Reference

mkcls

mkcls - a program for making word classes. Usage:

mkcls [-nnum] [-ptrain] [-Vfile] opt

  • -c: number of classes to generate (see the example below)
  • -V: output classes (Default: no file)
  • -n: number of optimization runs (Default: 1); larger number => better results
  • -p: filename of training corpus (Default: ‘train’)

Example:

mkcls -c80 -n10 -pin -Vout opt

(generates 80 classes for the corpus ‘in’ and writes the classes in ‘out’)
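In a full GIZA++/MGIZA++ training pipeline, word classes are usually built for both sides of the parallel corpus, since the HMM and Model 4 distortion models depend on them. A minimal sketch with hypothetical file names (the *.vcb.classes suffix follows the naming convention used by the Moses training scripts):

mkcls -n10 -pcorpus.src -Vcorpus.src.vcb.classes opt
mkcls -n10 -pcorpus.tgt -Vcorpus.tgt.vcb.classes opt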

Literature:
Franz Josef Och: „Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung“. Studienarbeit, Universität Erlangen-Nürnberg, Germany, 1995.


d4norm

d4norm vcb1 vcb2 outputFile baseFile [additional1 ]…


hmmnorm

hmmnorm vcb1 vcb2 outputFile baseFile [additional1 ]…


plain2snt

Converts plain text into GIZA++ snt-format.

plain2snt txt1 txt2 [txt3 txt4 -weight w -vcb1 output1.vcb -vcb2 output2.vcb -snt1 output1_output2.snt -snt2 output2_output1.snt]
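
In the minimal case only the two plain-text files (one sentence per line) are given and the remaining arguments are left at their defaults; with hypothetical corpus names:

plain2snt corpus.src corpus.tgt

This should write corpus.src.vcb, corpus.tgt.vcb, corpus.src_corpus.tgt.snt, and corpus.tgt_corpus.src.snt, matching the defaults implied by the -vcb1/-vcb2/-snt1/-snt2 options above.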


snt2cooc

Generates a word co-occurrence file from a corpus in GIZA++ snt-format.

snt2cooc output vcb1 vcb2 snt12
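
Following the argument order above, a sketch with hypothetical file names (continuing the plain2snt example):

snt2cooc src_tgt.cooc corpus.src.vcb corpus.tgt.vcb corpus.src_corpus.tgt.snt

The resulting co-occurrence file is later handed to mgiza (the Moses scripts do this via a -CoocurrenceFile flag; see the pipeline sketch in the mgiza section below).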


snt2coocrmp

Like snt2cooc, generates a word co-occurrence file from a corpus in GIZA++ snt-format (a variant implementation).

snt2coocrmp output vcb1 vcb2 snt12


snt2plain

Converts GIZA++ snt-format into plain text.

snt2plain vcb1 vcb2 snt12 output_prefix [ -counts ]
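
For example, to recover plain text from the snt file produced earlier (hypothetical names; the output files are written under the given prefix):

snt2plain corpus.src.vcb corpus.tgt.vcb corpus.src_corpus.tgt.snt restored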


symal

symal [-i=] [-o=] -a=[u|i|g] -d=[yes|no] -b=[yes|no] -f=[yes|no]
The input file (or stdin) must be in .bal format (see the script giza2bal.pl).
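
A typical invocation with hypothetical file names; the flag values come from the usage line above, and -a=g -d=yes -f=yes -b=yes corresponds to the grow-diag-final-and symmetrization heuristic familiar from Moses:

symal -i=aligned.bal -o=aligned.grow-diag-final-and -a=g -d=yes -f=yes -b=yes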


mgiza

Usage:

mgiza <config_file> [options]


Options (these override parameters set in the config file):

  • --v: print verbose messages. Warning: these are not very descriptive or systematic.
  • --NODUMPS: do not write any files to disk (this overrides the dump-frequency options).
  • --h[elp]: print this help
  • --p: use pegging when generating alignments for Model 3 training (default: no pegging)
  • --st: use a fixed distribution for the fertility parameters when transferring from Model 2 to Model 3 (default: complicated estimation)
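
Putting the tools together, a minimal end-to-end sketch with hypothetical file names. The -s/-t/-c/-o/-ncpus parameters are documented below; -CoocurrenceFile is the flag the Moses training scripts use to pass in the snt2cooc output and is assumed here rather than taken from the parameter list:

mkcls -n10 -pcorpus.src -Vcorpus.src.vcb.classes opt
mkcls -n10 -pcorpus.tgt -Vcorpus.tgt.vcb.classes opt
plain2snt corpus.src corpus.tgt
snt2cooc src_tgt.cooc corpus.src.vcb corpus.tgt.vcb corpus.src_corpus.tgt.snt
mgiza -s corpus.src.vcb -t corpus.tgt.vcb -c corpus.src_corpus.tgt.snt -CoocurrenceFile src_tgt.cooc -o src_tgt -ncpus 4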

general parameters:

-------------------
ml = 101 (maximum sentence length)


No. of iterations:

-------------------
hmmiterations = 5 (number of iterations for the HMM model)
model1iterations = 5 (number of iterations for Model 1)
model2iterations = 0 (number of iterations for Model 2)
model3iterations = 5 (number of iterations for Model 3)
model4iterations = 5 (number of iterations for Model 4)
model5iterations = 0 (number of iterations for Model 5)
model6iterations = 0 (number of iterations for Model 6)


parameter for various heuristics in GIZA++ for efficient training:

------------------------------------------------------------------
countincreasecutoff = 1e-06 (Counts increment cutoff threshold)
countincreasecutoffal = 1e-05 (Counts increment cutoff threshold for alignments in training of fertility models)
mincountincrease = 1e-07 (minimal count increase)
peggedcutoff = 0.03 (relative cutoff probability for alignment-centers in pegging)
probcutoff = 1e-07 (Probability cutoff threshold for lexicon probabilities)
probsmooth = 1e-07 (probability smoothing (floor) value )


parameters for describing the type and amount of output:

-----------------------------------------------------------
compactalignmentformat = 0 (0: detailed alignment format; 1: compact alignment format)
countoutputprefix = (prefix for output counts)
dumpcount = 0 (whether to dump counts in addition to the final output)
dumpcountusingwordstring = 0 (in the count table, print the actual word or just its id? default: id)
hmmdumpfrequency = 0 (dump frequency of HMM)
l = (log file name)
log = 0 (0: no logfile; 1: logfile)
model1dumpfrequency = 0 (dump frequency of Model 1)
model2dumpfrequency = 0 (dump frequency of Model 2)
model345dumpfrequency = 0 (dump frequency of Model 3/4/5)
nbestalignments = 0 (for printing the n best alignments)
nodumps = 0 (1: do not write any files)
o = (output file prefix)
onlyaldumps = 0 (1: write only the alignment dumps)
outputpath = (output path)
transferdumpfrequency = 0 (output: dump of transfer from Model 2 to 3)
verbose = 0 (0: not verbose; 1: verbose)
verbosesentence = -10 (number of the sentence for which detailed information should be printed (negative: no output))


parameters describing input files:

----------------------------------
c = (training corpus file name)
d = (dictionary file name)
previousa = (The a-table of previous step)
previousd = (The d-table of previous step)
previousd4 = (The d4-table of previous step)
previousd42 = (The d4-table (2) of previous step)
previoushmm = (The hmm-table of previous step)
previousn = (The n-table of previous step)
previousp0 = (The P0 previous step)
previoust = (The t-table of previous step)
restart = 0 (restart training from a given level:
    0: normal training, starting from Model 1
    1: Model 1
    2: Model 2 init (use Model 1 output and train Model 2)
    3: Model 2 (use Model 2 output and train Model 2)
    4: HMM init (use Model 1 output and train HMM)
    5: HMM (use Model 2 output and train HMM)
    6: HMM (use HMM output and train HMM)
    7: Model 3 init (use HMM output and train Model 3)
    8: Model 3 init (use Model 2 output and train Model 3)
    9: Model 3
    10: Model 4 init (use Model 3 output and train Model 4)
    11: Model 4 and on)
s = (source vocabulary file name)
sourcevocabularyclasses = (source vocabulary classes file name)
t = (target vocabulary file name)
targetvocabularyclasses = (target vocabulary classes file name)
tc = (test corpus file name)
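
All of these (and the other parameters in this reference) can also be written into the <config_file> passed to mgiza, one "parameter value" pair per line. A minimal sketch with hypothetical paths:

s corpus.src.vcb
t corpus.tgt.vcb
c corpus.src_corpus.tgt.snt
o src_tgt
ncpus 4
model1iterations 5
hmmiterations 5
model3iterations 3
model4iterations 3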


smoothing parameters:

---------------------
emalsmooth = 0.2 (f-b-trn: smoothing factor for HMM alignment model (can be ignored by -emSmoothHMM))
model23smoothfactor = 0 (smoothing parameter for IBM-2/3 (interpolation with constant))
model4smoothfactor = 0.2 (smoothing parameter for alignment probabilities in Model 4)
model5smoothfactor = 0.1 (smoothing parameter for distortion probabilities in Model 5 (linear interpolation with constant))
nsmooth = 64 (smoothing for fertility parameters (good value: 64): weight for wordlength-dependent fertility parameters)
nsmoothgeneral = 0 (smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters)


parameters modifying the models:

--------------------------------
compactadtable = 1 (1: only 3-dimensional alignment table for IBM-2 and IBM-3)
deficientdistortionforemptyword = 0 (0: IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of the empty word is deficient; 2: distortion model of the empty word is deficient (differently); setting this parameter also helps to avoid aligning too many words with the empty word during IBM-3 and IBM-4 training)
depm4 = 76 (d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
depm5 = 68 (d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
emalignmentdependencies = 2 (lextrain: dependencies in the HMM alignment model. &1: sentence length; &2: previous class; &4: previous position; &8: French position; &16: French class)
emprobforempty = 0.4 (f-b-trn: probability for empty word)


parameters modifying the EM-algorithm:

--------------------------------------
m5p0 = -1 (fixed value for parameter p_0 in IBM-5 (if negative then it is determined in training))
manlexfactor1 = 0 ()
manlexfactor2 = 0 ()
manlexmaxmultiplicity = 20 ()
maxfertility = 10 (maximal fertility for fertility models)
ncpus = 0 (number of threads to run; use 0 to use all available CPUs)
p0 = -1 (fixed value for parameter p_0 in IBM-3/4 (if negative then it is determined in training))
pegging = 0 (0: no pegging; 1: do pegging)

