Kaldi Data preparation
Link: GitHub - nessessence/Kaldi_ASR_Tutorial: speech recognition using Kaldi framework
Let's start with formatting the data. We will randomly split the wave files into train and test datasets (set the ratio as you like). Create a directory named data, then two subdirectories, train and test, inside it.
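A minimal sketch of this split in Python, assuming all recordings sit as .wav files in a single directory. The function names, the 0.8 default ratio, and the fixed seed are illustrative choices, not part of the tutorial.

```python
import random
from pathlib import Path


def split_wavs(wav_dir, train_ratio=0.8, seed=0):
    """Randomly split the .wav files in wav_dir into (train, test) lists."""
    # Sort before shuffling so the same seed always gives the same split.
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    random.Random(seed).shuffle(wavs)
    n_train = int(len(wavs) * train_ratio)
    return sorted(wavs[:n_train]), sorted(wavs[n_train:])


def make_data_dirs(root="data"):
    """Create <root>/train and <root>/test if they do not exist yet."""
    for name in ("train", "test"):
        Path(root, name).mkdir(parents=True, exist_ok=True)
```

Calling split_wavs on the recordings directory and then make_data_dirs() gives two disjoint file lists and the data/train and data/test directories to fill.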
Now, for each dataset (train, test), we need to generate these files representing our raw data - the audio and the transcripts.
text
- Essentially, the transcripts.
- One utterance per line:
<utt_id> <transcript>
- e.g.
Aaron-20080318-kdl_b0019 HIS SLIM HANDS GRIPPED THE EDGES OF THE TABLE
- We will use filenames without extensions as utt_ids for now.
- Although recordings are in Hebrew, we will use English words, YES and NO, to avoid complicating the problem.
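As a sketch, the lines of text could be built from the utt_ids alone. This assumes each filename encodes its words as underscore-separated digits, with 0 for NO and 1 for YES; that naming convention is an assumption here and is not stated above, so adapt the mapping to your own files.

```python
def transcript_from_utt_id(utt_id):
    """Turn an id such as "0_1_1" into "NO YES YES" (assumed naming scheme)."""
    words = {"0": "NO", "1": "YES"}
    return " ".join(words[digit] for digit in utt_id.split("_"))


def text_lines(utt_ids):
    """Return "<utt_id> <transcript>" lines, sorted by utt_id."""
    return [f"{utt_id} {transcript_from_utt_id(utt_id)}" for utt_id in sorted(utt_ids)]
```

The lines are sorted by utt_id because Kaldi's tools expect these files to be sorted.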
wav.scp
- Maps each unique id to its audio file.
<file_id> <wave filename with path OR command to get wave file>
- e.g.
Aaron-20080318-kdl_b0019 /mnt/data/VF_Main_16kHz/Aaron-20080318-kdl/wav/b0019.wav
- Again, we can use file names as file_ids.
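A small sketch of building wav.scp lines, using the filename without its extension as the file_id. The helper name is hypothetical, and the paths are written exactly as they are passed in, so pass absolute paths if Kaldi will run from a different directory.

```python
from pathlib import Path


def wav_scp_lines(wav_paths):
    """Return "<file_id> <path>" lines, sorted by file_id."""
    # Path(p).stem is the filename without its extension, e.g. "b0019".
    ordered = sorted(wav_paths, key=lambda p: Path(p).stem)
    return [f"{Path(p).stem} {p}" for p in ordered]
```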
utt2spk
- For each utterance, mark which speaker spoke it.
<utt_id> <speaker_id>
- e.g.
Aaron-20080318-kdl_b0019 Aaron
- Since we have only one speaker in this example, let's use "global" as speaker_id
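With a single speaker, utt2spk reduces to pairing every utt_id with the same label. A sketch, with a hypothetical helper name:

```python
def utt2spk_lines(utt_ids, speaker_id="global"):
    """Return "<utt_id> <speaker_id>" lines, sorted by utt_id."""
    return [f"{utt_id} {speaker_id}" for utt_id in sorted(utt_ids)]
```

For a multi-speaker corpus you would instead look up the speaker for each utterance, for example from the filename prefix.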
spk2utt
- Simply the inverse index of utt2spk:
<speaker_id> <all_their_utterances>
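The inversion can be sketched in a few lines of Python. This is only an illustration of the mapping (the function name is made up): it groups utterance ids by speaker and joins them with spaces.

```python
from collections import defaultdict


def invert_utt2spk(utt2spk_lines):
    """Turn "<utt_id> <speaker_id>" lines into "<speaker_id> <utt_id> ..." lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt_id, speaker_id = line.split()
        spk2utt[speaker_id].append(utt_id)
    # One line per speaker, speakers and their utterances both sorted.
    return [
        f"{speaker_id} {' '.join(sorted(utts))}"
        for speaker_id, utts in sorted(spk2utt.items())
    ]
```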
full_vocab
- A list of all the vocabulary in the text of the training data. (This file will be used for making the dictionary.)
segments (optional)
- Not used for this data.
- Contains utterance segmentation/alignment information for each recording.
- Only required when a file contains multiple utterances, which is not the case here.
reco2file_and_channel (optional)
- Not used for this data.
- Only required when the audio was recorded in dual channels for a conversational setup.
spk2gender (optional)
- Not used for this data.
- Maps speakers to their gender information.
- Used in vocal tract length normalization.
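Of the files listed above, full_vocab can be derived entirely from text. A minimal sketch, assuming words are whitespace-separated and one vocabulary entry per line is wanted (the exact layout of full_vocab is not specified above):

```python
def full_vocab(text_lines):
    """Collect the sorted set of words that appear in "<utt_id> <transcript>" lines."""
    vocab = set()
    for line in text_lines:
        fields = line.split()
        vocab.update(fields[1:])  # skip the leading utt_id
    return sorted(vocab)
```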
Our task is to generate these files. You can use the Python notebook preparation_data.ipynb, but if this is your first time with Kaldi, I encourage you to write your own script, because doing so will improve your understanding of the Kaldi format. Note: you can generate the spk2utt file using the Kaldi utility: utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
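If you do write your own script, the last step is writing each list of lines into the dataset directory. A sketch with a hypothetical helper; it takes a mapping from file name (text, wav.scp, utt2spk, ...) to the lines for that file:

```python
from pathlib import Path


def write_data_files(data_dir, files):
    """Write each {file_name: lines} entry to data_dir, one line per entry."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, lines in files.items():
        content = "".join(line + "\n" for line in lines)
        (out_dir / name).write_text(content, encoding="utf-8")
```

For example, write_data_files("data/train", {"utt2spk": [...], "text": [...]}) would produce data/train/utt2spk and data/train/text.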