Kaldi Data Preparation

Link: GitHub - nessessence/Kaldi_ASR_Tutorial: speech recognition using Kaldi framework

Let's start by formatting the data. We will randomly split the wave files into test and train sets (set the ratio as you want). Create a directory data, and then two subdirectories, train and test, inside it.
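
A minimal sketch of such a split, assuming the recordings sit in one flat directory; the /mnt/data/waves path, the fixed seed, and the 90/10 ratio are placeholders, not values from the tutorial:

```python
import os
import random

WAV_DIR = "/mnt/data/waves"   # placeholder: wherever your recordings live
random.seed(0)                # fixed seed so the split is reproducible

wavs = sorted(f for f in os.listdir(WAV_DIR) if f.endswith(".wav"))
random.shuffle(wavs)

split = int(0.9 * len(wavs))  # 90% train / 10% test; pick any ratio you like
subsets = {"train": wavs[:split], "test": wavs[split:]}

# Create data/train and data/test; the metadata files described below go here.
for name, files in subsets.items():
    os.makedirs(os.path.join("data", name), exist_ok=True)
    print(name, len(files), "files")
```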

Now, for each dataset (train, test), we need to generate these files representing our raw data, i.e. the audio and the transcripts (a short sketch that writes the main ones follows the list).

  • text
    • Essentially, transcripts.
    • One utterance per line: <utt_id> <transcript>
      • e.g. Aaron-20080318-kdl_b0019 HIS SLIM HANDS GRIPPED THE EDGES OF THE TABLE
    • We will use filenames without extensions as utt_ids for now.
    • Although the recordings are in Hebrew, we will use the English words YES and NO to avoid complicating the problem.
  • wav.scp
    • Maps unique file ids to the wave files.
    • <file_id> <wave filename with path OR command to get wave file>
      • e.g. Aaron-20080318-kdl_b0019 /mnt/data/VF_Main_16kHz/Aaron-20080318-kdl/wav/b0019.wav
    • Again, we can use file names as file_ids.
  • utt2spk
    • For each utterance, mark which speaker spoke it.
    • <utt_id> <speaker_id>
      • e.g. Aaron-20080318-kdl_b0019 Aaron
    • Since we have only one speaker in this example, let's use "global" as the speaker_id.
  • spk2utt
    • Simply utt2spk inverted: <speaker_id> <all utterances of that speaker>
  • full_vocab: a list of all the vocabulary words in the training-data text (this file will be used for making the dictionary).
  • (optional) segments: not used for this data.
    • Contains utterance segmentation/alignment information for each recording.
    • Only required when a file contains multiple utterances, which is not the case here.
  • (optional) reco2file_and_channel: not used for this data.
    • Only required when the audio was recorded in dual channels in a conversational setup.
  • (optional) spk2gender: not used for this data.
    • Map from speakers to their gender information.
    • Used in vocal tract length normalization.
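
To make the format concrete, here is a minimal sketch that writes text, wav.scp, and utt2spk for each subset. It reuses the placeholder WAV_DIR and the subsets split from the sketch above; get_transcript is a hypothetical stand-in for however you look up each recording's transcript (a prompts file, the filename itself, etc.), and the single "global" speaker follows the simplification above:

```python
import os

WAV_DIR = "/mnt/data/waves"          # same placeholder path as in the split sketch
subsets = {"train": [], "test": []}  # fill with the wave filenames from the split above

def get_transcript(utt_id):
    # Hypothetical helper: return the transcript string for this utterance,
    # e.g. by looking it up in a prompts file keyed on utt_id.
    raise NotImplementedError

for name, files in subsets.items():
    out_dir = os.path.join("data", name)
    with open(os.path.join(out_dir, "text"), "w") as text_f, \
         open(os.path.join(out_dir, "wav.scp"), "w") as scp_f, \
         open(os.path.join(out_dir, "utt2spk"), "w") as u2s_f:
        for wav in sorted(files):                      # Kaldi expects sorted entries
            utt_id = os.path.splitext(wav)[0]          # filename without extension
            text_f.write(f"{utt_id} {get_transcript(utt_id)}\n")
            scp_f.write(f"{utt_id} {os.path.join(WAV_DIR, wav)}\n")
            u2s_f.write(f"{utt_id} global\n")          # single speaker "global"
```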

Our task is to generate these files. You can use the python notebook preparation_data.ipynb, but if this is your first time with Kaldi, I encourage you to write your own script, because it will improve your understanding of the Kaldi data format. Note: you can generate the "spk2utt" file using the Kaldi utility: utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
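
For full_vocab, a small sketch that collects every distinct word from the transcript column of data/train/text (assuming that file has already been written as above):

```python
# Collect all distinct words from the transcript column of data/train/text.
vocab = set()
with open("data/train/text") as f:
    for line in f:
        fields = line.strip().split()
        vocab.update(fields[1:])        # fields[0] is the utt_id
with open("data/train/full_vocab", "w") as f:
    for word in sorted(vocab):
        f.write(word + "\n")
```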
