
Kaldi Data preparation

news 2025/6/25 13:14:15

Link: GitHub - nessessence/Kaldi_ASR_Tutorial: speech recognition using Kaldi framework

Let's start by formatting the data. We will randomly split the wave files into train and test datasets (set the ratio as you like). Create a directory named data, and then two subdirectories inside it, train and test.
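The random split can be sketched in a few lines of Python. This is only an illustration: the function names, the 20% test ratio, and the fixed seed are assumptions, not something the tutorial prescribes.

```python
import random
from pathlib import Path


def split_wavs(wav_paths, test_ratio=0.2, seed=0):
    """Shuffle the paths reproducibly and return (train, test) lists.

    test_ratio and seed are illustrative defaults; pick your own.
    """
    paths = sorted(wav_paths)            # sort first so a given seed is stable
    random.Random(seed).shuffle(paths)   # local RNG, global state untouched
    n_test = int(len(paths) * test_ratio)
    return paths[n_test:], paths[:n_test]


def make_data_dirs(root="data"):
    """Create data/train and data/test (no error if they already exist)."""
    for name in ("train", "test"):
        Path(root, name).mkdir(parents=True, exist_ok=True)
```

For example, calling make_data_dirs() and then split_wavs on the list of your .wav paths gives the two lists of files to describe in data/train and data/test.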

Now, for each dataset (train, test), we need to generate these files representing our raw data - the audio and the transcripts.

text
- Essentially, transcripts.
- An utterance per line, <utt_id> <transcript>
  - e.g. Aaron-20080318-kdl_b0019 HIS SLIM HANDS GRIPPED THE EDGES OF THE TABLE
- We will use filenames without extensions as utt_ids for now.
- Although recordings are in Hebrew, we will use English words, YES and NO, to avoid complicating the problem.
wav.scp
- Indexing files to unique ids.
- <file_id> <wave filename with path OR command to get wave file>
  - e.g. Aaron-20080318-kdl_b0019 /mnt/data/VF_Main_16kHz/Aaron-20080318-kdl/wav/b0019.wav
- Again, we can use file names as file_ids.
utt2spk
- For each utterance, mark which speaker spoke it.
- <utt_id> <speaker_id>
  - e.g. Aaron-20080318-kdl_b0019 Aaron
- Since we have only one speaker in this example, let's use "global" as speaker_id
spk2utt
- Simply the inverse index of utt2spk (<speaker_id> <all_their_utterances>)
full_vocab
- List of all the vocabulary in the text of the training data.
- This file will be used for making the dictionary.
(optional) segments: not used for this data.
- Contains utterance segmentation/alignment information for each recording.
- Only required when a file contains multiple utterances, which is not this case.
(optional) reco2file_and_channel: not used for this data.
- Only required when the audio was recorded in two channels, as in a conversational setup.
(optional) spk2gender: not used for this data.
- Map from speakers to their gender information.
- Used in vocal tract length normalization.
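The three required files described above (text, wav.scp, utt2spk) could be produced with a sketch like the following. It assumes you already have a dict mapping each utt_id to its transcript; that dict and the function names kaldi_lines and write_data_dir are hypothetical and not taken from the tutorial's notebook. The lines are sorted by utt_id because Kaldi's tools expect these files to be sorted by their first field.

```python
from pathlib import Path


def kaldi_lines(wav_paths, transcripts, speaker="global"):
    """Return the lines of text, wav.scp and utt2spk, sorted by utt_id.

    transcripts: dict utt_id -> transcript (hypothetical input format).
    speaker: single speaker_id used for every utterance, as in this example.
    """
    files = {"text": [], "wav.scp": [], "utt2spk": []}
    # The utt_id is the file name without its extension.
    for utt_id, path in sorted((Path(p).stem, str(p)) for p in wav_paths):
        files["text"].append(f"{utt_id} {transcripts[utt_id]}")
        files["wav.scp"].append(f"{utt_id} {path}")
        files["utt2spk"].append(f"{utt_id} {speaker}")
    return files


def write_data_dir(out_dir, files):
    """Write each list of lines to out_dir/<name>, one entry per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, lines in files.items():
        content = "".join(line + "\n" for line in lines)
        (out / name).write_text(content, encoding="utf-8")
```

A call such as write_data_dir("data/train", kaldi_lines(train_paths, transcripts)) would then fill the train directory, where train_paths and transcripts are your own data.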

Our task is to generate these files. You can use the Python notebook preparation_data.ipynb, but if this is your first time using Kaldi, I encourage you to write your own script, because it will improve your understanding of the Kaldi format.

Note: you can generate the spk2utt file with the Kaldi utility: utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
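If you prefer to write your own script, the two derived files can be sketched in Python as well. This is an illustrative approximation, not a port of the Perl utility: the function names are made up, speakers are emitted in sorted order, and writing full_vocab as one word per line is an assumption about its format.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert '<utt_id> <speaker_id>' lines into '<speaker_id> <utt_ids...>'."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        parts = line.split()
        if len(parts) != 2:      # skip blank or malformed lines
            continue
        utt_id, speaker = parts
        spk2utt[speaker].append(utt_id)
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]


def build_vocab(text_lines):
    """Collect the unique words of the transcripts in '<utt_id> <transcript>' lines."""
    vocab = set()
    for line in text_lines:
        vocab.update(line.split()[1:])   # drop the leading utt_id
    return sorted(vocab)
```

Both functions take and return lists of lines, so they can be applied to the contents of data/train/utt2spk and data/train/text and the results written out with one entry per line.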
