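Speech recognition challenge with a 100,000 RMB prize pool is under way | PhD students from the Institute of Automation explain a baseline built on KALDI, the industry's mainstream toolkit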

March 12, 2020, AINLP

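In December 2019, the Beijing Academy of Artificial Intelligence (BAAI, 智源), together with Magic Data (爱数智慧) and the data competition platform biendata, launched the "Zhiyuan-MagicSpeechNet Home Scenario Speech Dataset Challenge" (December 2019 to March 2020), with a total prize pool of 100,000 RMB. Participants use the audio provided by the competition, two-person conversations recorded in real home environments, to train and tune automatic speech recognition (ASR) models. The competition and the data are available through the links below.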

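The competition is now close to its halfway point. To help participants get familiar with the task, biendata invited team zs_jxy, which has long been near the top of the leaderboard (members: Zhang Shuai and Bai Ye, from the Jixianyuan (极限元) and Chinese Academy of Sciences Joint Laboratory of Intelligent Interaction, who focus on automatic speech recognition models and systems). The team analyzes the task, the data processing, the choice of model, and directions for improvement, shares a KALDI-based speech recognition baseline, and answers participants' questions in a live stream. The baseline is meant as a starting point for exploring solutions that work in real ASR deployments. The baseline and the live-stream replay are linked below.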

Competition page

Original baseline write-up

Live-stream replay: https://www.bilibili.com/video/av95030153/

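Baseline details

Overview
The task in this competition is recognizing conversational speech in everyday home environments. The dataset is the Zhiyuan MagicSpeechNet home-scenario Mandarin speech dataset, whose material comes from dozens of two-person conversations recorded in real environments. We provide a baseline system built on the KALDI speech recognition framework for reference. KALDI is currently among the most widely used speech recognition frameworks in industry, and it covers hidden Markov models, Gaussian mixture models, decision-tree clustering, deep neural networks, decoders, and weighted finite-state transducers. The accompanying archive contains the complete recipe and the necessary resource files. What follows is a walkthrough of the baseline. For installing KALDI, please refer to the blog posts available online.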

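The baseline is a Chain model trained with the LF-MMI objective on MFCC features. A GMM-HMM system built on context-dependent triphones provides the forced alignments, after which we train a CNN+TDNNF deep neural network. The language model is a 3-gram trained on the training-set transcripts. The character error rate on the online leaderboard is 0.502 (50.2%).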
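For readers new to the metric, the character error rate quoted above can be computed as an edit distance over characters, divided by the number of reference characters. The following is a minimal Python sketch, assuming the standard Levenshtein definition; the competition's actual scoring script is not described in this write-up, and the sentence pairs are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(pairs):
    """Corpus-level CER: total edit distance over total reference characters."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


# Toy (reference, hypothesis) pairs, invented for this example
pairs = [("今天天气很好", "今天天气好"), ("我们去吃饭", "我们去吃面")]
print(round(cer(pairs), 3))
```

Here one deletion and one substitution against 11 reference characters give 2/11. Errors are pooled over the whole corpus before dividing, so long utterances weigh more than short ones.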

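Data preparation
The data provided for the task are 16 kHz wav audio files. The corresponding annotation files are in JSON format and mark the segment boundaries, the speaker, the transcript, and so on. The JSON annotations for the test set give only the segment boundaries, without transcripts. We first convert everything into KALDI's data format, which at its most basic consists of five files: the four described below plus spk2utt, which is derived from utt2spk.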

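wav.scp gives the path of each wav file, format: wavid path/to/wav

segments gives the audio span of each utterance, format: uttid wavid start_time end_time

utt2spk gives the speaker of each utterance, format: uttid spkid

text holds the transcripts, format: uttid transcript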
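The four formats above can be pictured with a short Python sketch. This is not the recipe's local/format_data.py: the record fields (`uttid`, `wavid`, `start`, `end`, `text`), the IDs and the wav path are hypothetical stand-ins, since the real JSON schema is not shown here. Each utterance is given its own speaker label, which is the simplification this baseline uses.

```python
import os
import tempfile

# Hypothetical annotation records; the real JSON field names may differ.
records = [
    {"uttid": "conv1_0002", "wavid": "conv1", "start": 3.10, "end": 5.00, "text": "我们 去 吃饭"},
    {"uttid": "conv1_0001", "wavid": "conv1", "start": 0.50, "end": 2.75, "text": "今天 天气 很 好"},
]
wav_paths = {"conv1": "/path/to/conv1.wav"}  # placeholder path


def write_kaldi_dir(records, wav_paths, out_dir):
    """Write wav.scp, segments, utt2spk and text, each sorted by its first field."""
    os.makedirs(out_dir, exist_ok=True)
    recs = sorted(records, key=lambda r: r["uttid"])
    lines = {
        "wav.scp": [f"{w} {p}" for w, p in sorted(wav_paths.items())],
        "segments": [f"{r['uttid']} {r['wavid']} {r['start']:.2f} {r['end']:.2f}" for r in recs],
        # One speaker per utterance: the speaker ID is simply the utterance ID
        "utt2spk": [f"{r['uttid']} {r['uttid']}" for r in recs],
        "text": [f"{r['uttid']} {r['text']}" for r in recs],
    }
    for name, content in lines.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(content) + "\n")
    return lines


out_dir = tempfile.mkdtemp()
files = write_kaldi_dir(records, wav_paths, out_dir)
print(files["segments"][0])
```

Sorting by the first field matters because KALDI's validation scripts expect these files to be sorted consistently.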

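Because the speaker information supplied with the audio is incomplete, we simply assume that each utterance comes from its own speaker. We also normalize the text: punctuation is removed, the text is segmented into words, and so on. A Python script first generates the four files wav.scp, segments, utt2spk and text from the provided JSON annotation files; see local/format_data.py and local/format_testdata.py for details. Running bash data_prep.sh does all of this directly. Its main steps are: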

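# List the kernel directory; it should contain magic_comp.tar.gz
ls ../kernel/

# Extract the code archive into the current working directory
tar -zxvf ../kernel/magic_comp.tar.gz

# Change to the recipe directory
cd magic_comp/s5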

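# Build wav.scp, segments, utt2spk and text from the JSON annotations
python local/format_data.py $trans_dir/train $audio_dir/train "" data/train
# Keep the original speaker maps under a .spk suffix
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
mv data/train/utt2spk data/train/utt2spk.spk
mv data/train/spk2utt data/train/spk2utt.spk
# Treat every utterance as its own speaker: map each uttid to itself
awk '{print $1" "$1}' data/train/segments > data/train/utt2spk
# The map is the identity, so spk2utt has the same content
cp data/train/utt2spk data/train/spk2utt
utils/fix_data_dir.sh data/train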
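The text normalization mentioned above (removing punctuation and segmenting into words) can be illustrated as follows. This is only a sketch: the write-up does not say which segmenter the recipe uses, so greedy forward maximum matching is swapped in here as a simple stand-in, and the vocabulary is a toy set rather than the AISHELL lexicon.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def max_match(sentence, vocab, max_len=4):
    """Greedy forward maximum matching; unknown characters become one-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"今天", "天气", "很", "好"}  # toy vocabulary for illustration
raw = "今天，天气很好！"
clean = "".join(strip_punctuation(raw).split())  # also removes any whitespace
print(" ".join(max_match(clean, vocab)))
```

Segmenting against the same vocabulary as the pronunciation lexicon keeps the number of out-of-vocabulary words in the transcripts low, which is one reason a lexicon-driven segmenter is a common choice.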

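Preparing the pronunciation lexicon

We use the open-source AISHELL pronunciation lexicon in the common directory. The local/prepare_dict.sh script converts it into the format KALDI expects, and utils/prepare_lang.sh then builds L.fst.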

dictdir=data/local/dict
local/prepare_dict.sh common/aishell_lexicon.txt $dictdir || exit 1;
utils/prepare_lang.sh --position-dependent-phones false $dictdir \
    "<SPOKEN_NOISE>" data/local/lang data/lang || exit 1;

Preparing the language model

Next we train a language model on the training-set transcripts.

bash local/train_lms_srilm.sh $dictdir || exit 1;
utils/format_lm.sh data/lang data/local/lm/3gram.wb.gz \
    $dictdir/lexicon.txt data/lang_test || exit 1;
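To make concrete what a 3-gram model estimates, here is a small Python sketch that counts trigrams and computes unsmoothed maximum-likelihood probabilities. It is not what the recipe runs: the recipe calls SRILM, and the file name 3gram.wb.gz suggests Witten-Bell smoothing, which this sketch leaves out. The two-sentence corpus is invented.

```python
from collections import Counter


def ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories, padding with <s> and </s>."""
    grams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            grams[history + (padded[i],)] += 1
            histories[history] += 1
    return grams, histories


def mle_prob(word, history, grams, histories):
    """Unsmoothed maximum-likelihood estimate of P(word | history)."""
    if histories[history] == 0:
        return 0.0
    return grams[history + (word,)] / histories[history]


corpus = [["今天", "天气", "很", "好"], ["今天", "天气", "不", "好"]]
grams, histories = ngram_counts(corpus)
print(mle_prob("很", ("今天", "天气"), grams, histories))
```

An unsmoothed model assigns zero probability to any trigram it has not seen, which is why real toolkits add smoothing and back-off; with only the training transcripts as LM data, that matters a great deal.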

Training the GMM-HMM models

Next we extract MFCC features for each data directory.

mfccdir=mfcc
for x in train dev_Android  dev_IOS  dev_Recorder  dev_SPK059  dev_SPK060  test_Android  test_IOS  test_Recorder; do
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 30 --mfcc-config conf/mfcc.conf data/$x exp/make_mfcc/$x $mfccdir || exit 1;
  steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir || exit 1;
  utils/fix_data_dir.sh data/$x || exit 1;
done
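The cepstral mean and variance normalization (CMVN) statistics computed in the loop above are used to normalize each feature dimension. The sketch below shows the idea in plain Python on toy frames; KALDI itself stores accumulated statistics and applies them later, so treat this as an illustration of the arithmetic rather than of KALDI's implementation. The default here, mean normalization without variance scaling, mirrors the --norm-means=true --norm-vars=false options that appear in the training command later in this article.

```python
import math


def cmvn(frames, norm_vars=False):
    """Subtract the per-dimension mean; optionally scale to unit variance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    out = [[f[d] - means[d] for d in range(dim)] for f in frames]
    if norm_vars:
        # Guard against zero variance by falling back to a scale of 1.0
        stds = [math.sqrt(sum(row[d] ** 2 for row in out) / n) or 1.0 for d in range(dim)]
        out = [[row[d] / stds[d] for d in range(dim)] for row in out]
    return out


# Three toy 2-dimensional frames (the recipe's high-resolution MFCCs have 40 dimensions)
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(cmvn(frames))
```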

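The next step is to train GMM-HMM models that produce frame-level labels, in preparation for training the deep neural network acoustic model. We start with a monophone GMM-HMM model.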

steps/train_mono.sh --cmd "$train_cmd" --nj 30 \
  data/train data/lang exp/mono || exit 1;

We then use the trained monophone model for forced alignment and train a triphone model.

steps/align_si.sh --cmd "$train_cmd" --nj 30 \
  data/train data/lang exp/mono exp/mono_ali || exit 1;

steps/train_deltas.sh --cmd "$train_cmd" \
 2500 20000 data/train data/lang exp/mono_ali exp/tri1 || exit 1;

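Each GMM-HMM model trained so far is then used for forced alignment, and the alignments are used to train a more complex GMM-HMM model. We iterate this several times.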

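steps/align_si.sh --cmd "$train_cmd" --nj 30 \
  data/train data/lang exp/tri1 exp/tri1_ali || exit 1;

steps/train_deltas.sh --cmd "$train_cmd" \
 2500 20000 data/train data/lang exp/tri1_ali exp/tri2 || exit 1;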

steps/align_si.sh --cmd "$train_cmd" --nj 30 \
  data/train data/lang exp/tri2 exp/tri2_ali || exit 1;

steps/train_lda_mllt.sh --cmd "$train_cmd" \
 2500 20000 data/train data/lang exp/tri2_ali exp/tri3a || exit 1;

steps/align_fmllr.sh --cmd "$train_cmd" --nj 30 \
  data/train data/lang exp/tri3a exp/tri3a_ali || exit 1;

steps/train_sat.sh --cmd "$train_cmd" \
  2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a || exit 1;

steps/align_fmllr.sh --cmd "$train_cmd" --nj 10 \
  data/train data/lang exp/tri4a exp/tri4a_ali || exit 1;

steps/train_sat.sh --cmd "$train_cmd" \
  3500 100000 data/train data/lang exp/tri4a_ali exp/tri5a || exit 1;

tri5a is the final GMM-HMM model. We use it to decode the iOS development set and check performance.

utils/mkgraph.sh data/lang_test exp/tri5a exp/tri5a/graph || exit 1;
steps/decode_fmllr.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
   exp/tri5a/graph data/dev_IOS exp/tri5a/decode_dev_IOS || exit 1;

Finally, we run forced alignment with this model to prepare for DNN training.

steps/align_fmllr.sh --cmd "$train_cmd" --nj 10 \
  data/train data/lang exp/tri5a exp/tri5a_ali || exit 1;

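Training the DNN model

First we augment the data with speed perturbation, extract 40-dimensional MFCC features, and redo the forced alignment. local/chain/data_aug.sh applies the speed perturbation and aligns the perturbed features.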

if [ $stage -le 1 ]; then
  utils/data/perturb_data_dir_speed_3way.sh --always-include-prefix true \
    data/${train_set} data/${train_set}_sp

  mfccdir=mfcc_perturbed
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 50 \
    data/${train_set}_sp exp/make_mfcc/${train_set}_sp $mfccdir
  steps/compute_cmvn_stats.sh data/${train_set}_sp exp/make_mfcc/${train_set}_sp $mfccdir
  utils/fix_data_dir.sh data/${train_set}_sp
fi

if [ $stage -le 2 ] && $generate_alignments; then
    # obtain the alignment of the perturbed data
    steps/align_fmllr.sh --nj 100 --cmd "$train_cmd" \
      data/${train_set}_sp data/$langdir $gmmdir ${gmmdir}_ali_sp
fi
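Three-way speed perturbation triples the training data by adding a slowed-down and a sped-up copy of every utterance. The Python sketch below shows the bookkeeping this implies for the segments file. The factors 0.9, 1.0 and 1.1 and the "sp<factor>-" ID prefix follow my understanding of what KALDI's speed-perturbation scripts produce; treat both as assumptions rather than a description of the script's internals.

```python
def perturb_segments(segments, factors=(0.9, 1.0, 1.1)):
    """Return one copy of every segment per speed factor.

    Playing audio at `factor` times the original speed divides every
    timestamp by `factor`. The prefix keeps utterance IDs unique.
    """
    out = []
    for factor in factors:
        prefix = f"sp{factor}-"
        for utt, wav, start, end in segments:
            out.append((prefix + utt, prefix + wav,
                        round(start / factor, 2), round(end / factor, 2)))
    return out


segments = [("conv1_0001", "conv1", 0.9, 1.8)]  # toy segment
rows = perturb_segments(segments)
for row in rows:
    print(row)
```

Because the perturbed utterances have different durations from the originals, the existing alignments cannot be reused, which is why the recipe realigns the perturbed data.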

Extracting 40-dimensional MFCC features

if [ $stage -le 2 ]; then
  mfccdir=mfcc_hires
  for x in train_sp dev_Recorder dev_IOS dev_Android; do
    utils/copy_data_dir.sh data/$x data/${x}_hires
    steps/make_mfcc.sh --cmd "$train_cmd" --nj 50 --mfcc-config conf/mfcc_hires.conf \
      data/${x}_hires exp/make_mfcc/${x}_hires $mfccdir
    steps/compute_cmvn_stats.sh data/${x}_hires exp/make_mfcc/${x}_hires $mfccdir
    utils/fix_data_dir.sh data/${x}_hires
  done
fi

Then we start training the DNN model. The first step is to generate lattices.

if [ $stage -le 9 ]; then
  nj=$(cat $ali_dir/num_jobs) || exit 1;
  steps/align_fmllr_lats.sh --nj $nj --cmd "$train_cmd" data/$train_set \
    data/lang exp/tri5a exp/tri5a_ali_sp_lats
  rm exp/tri5a_ali_sp_lats/fsts.*.gz # save space
fi

Building the HMM topology for the chain model

if [ $stage -le 10 ]; then
  rm -rf $lang
  cp -r data/lang $lang
  silphonelist=$(cat $lang/phones/silence.csl) || exit 1;
  nonsilphonelist=$(cat $lang/phones/nonsilence.csl) || exit 1;
  steps/nnet3/chain/gen_topo.py $nonsilphonelist $silphonelist >$lang/topo
fi

Building the decision tree

if [ $stage -le 11 ]; then
  steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 \
      --context-opts "--context-width=2 --central-position=1" \
      --cmd "$train_cmd" 7000 data/$train_set $lang $ali_dir $treedir
fi

Configuring the neural network. Here we use a CNN+TDNNF architecture.

if [ $stage -le 12 ]; then
  echo "$0: creating neural net configs using the xconfig parser";

  num_targets=$(tree-info $treedir/tree |grep num-pdfs|awk '{print $2}')
  learning_rate_factor=$(echo "print (0.5/$xent_regularize)" | python2)

  cnn_opts="l2-regularize=0.01"
  tdnnf_first_opts="l2-regularize=0.01 dropout-proportion=0.0 bypass-scale=0.0"
  tdnnf_opts="l2-regularize=0.01 dropout-proportion=0.0 bypass-scale=0.66"
  linear_opts="l2-regularize=0.01 orthonormal-constraint=-1.0"
  prefinal_opts="l2-regularize=0.01"
  output_opts="l2-regularize=0.002"

  mkdir -p $dir/configs
  cat <<EOF > $dir/configs/network.xconfig
  input dim=40 name=input
  # this takes the MFCCs and generates filterbank coefficients.  The MFCCs
  # are more compressible so we prefer to dump the MFCCs to disk rather
  # than filterbanks.
  idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
  batchnorm-component name=idct-batchnorm input=idct
  conv-relu-batchnorm-layer name=cnn1 $cnn_opts height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=64
  conv-relu-batchnorm-layer name=cnn2 $cnn_opts height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=64
  conv-relu-batchnorm-layer name=cnn3 $cnn_opts height-in=40 height-out=20 height-subsample-out=2 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=128
  conv-relu-batchnorm-layer name=cnn4 $cnn_opts height-in=20 height-out=20 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=128
  conv-relu-batchnorm-layer name=cnn5 $cnn_opts height-in=20 height-out=10 height-subsample-out=2 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=256
  conv-relu-batchnorm-layer name=cnn6 $cnn_opts height-in=10 height-out=10  time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=256
  # the first TDNN-F layer has no bypass (since dims don't match), and a larger bottleneck so the
  # information bottleneck doesn't become a problem.  (we use time-stride=0 so no splicing, to
  # limit the num-parameters).
  tdnnf-layer name=tdnnf7 $tdnnf_first_opts dim=1536 bottleneck-dim=256 time-stride=0
  tdnnf-layer name=tdnnf8 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf9 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf10 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf11 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf12 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf13 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf14 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  tdnnf-layer name=tdnnf15 $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=3
  linear-component name=prefinal-l dim=256 $linear_opts
  ## adding the layers for chain branch
  prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts small-dim=256 big-dim=1536
  output-layer name=output include-log-softmax=false dim=$num_targets $output_opts
  # adding the layers for xent branch
  prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts small-dim=256 big-dim=1536
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts
EOF
  steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/
fi
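A quick way to sanity-check the convolutional part of the configuration above is to trace the feature height through the six CNN layers. The Python sketch below copies the height and filter values from the xconfig and assumes that a layer's flattened output dimension is height-out times num-filters-out. It also reproduces the learning_rate_factor arithmetic; the value of xent_regularize is set outside the excerpt, so the 0.1 used here is an assumption.

```python
# (name, height_in, height_out, num_filters_out) copied from the xconfig above
cnn_layers = [
    ("cnn1", 40, 40, 64),
    ("cnn2", 40, 40, 64),
    ("cnn3", 40, 20, 128),
    ("cnn4", 20, 20, 128),
    ("cnn5", 20, 10, 256),
    ("cnn6", 10, 10, 256),
]


def trace_dims(layers, input_height=40):
    """Check that heights chain correctly and return each layer's output dimension."""
    dims, height = {}, input_height
    for name, h_in, h_out, filters in layers:
        if h_in != height:
            raise ValueError(f"{name}: expects height {h_in}, got {height}")
        dims[name] = h_out * filters
        height = h_out
    return dims


dims = trace_dims(cnn_layers)
print(dims["cnn6"])  # flattened dimension fed to tdnnf7

xent_regularize = 0.1  # assumed value; the recipe sets it outside the excerpt
print(0.5 / xent_regularize)  # learning_rate_factor for the xent output layer
```

Every time the height is halved the number of filters doubles, so the flattened dimension stays at 2560 throughout. That differs from the 1536-dimensional TDNN-F layers, which matches the comment in the config that the first TDNN-F layer has no bypass because the dimensions do not match.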

Training the neural network

if [ $stage -le 13 ]; then

  steps/nnet3/chain/train.py --stage $train_stage \
    --cmd "$train_cmd" \
    --feat.cmvn-opts "--norm-means=true --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient 0.1 \
    --chain.l2-regularize 0.0 \
    --chain.apply-deriv-weights false \
    --chain.lm-opts="--num-extra-lm-states=2000" \
    --trainer.dropout-schedule $dropout_schedule \
    --trainer.add-option="--optimization.memory-compression-level=2" \
    --egs.dir "$common_egs_dir" \
    --egs.stage $get_egs_stage \
    --egs.opts "--frames-overlap-per-eg 0 --constrained false" \
    --egs.chunk-width $frames_per_eg \
    --trainer.num-chunk-per-minibatch 64 \
    --trainer.frames-per-iter 1500000 \
    --trainer.num-epochs 6 \
    --trainer.optimization.num-jobs-initial 4 \
    --trainer.optimization.num-jobs-final 4 \
    --trainer.optimization.initial-effective-lrate 0.00025 \
    --trainer.optimization.final-effective-lrate 0.000025 \
    --trainer.max-param-change 2.0 \
    --cleanup.remove-egs $remove_egs \
    --feat-dir data/${train_set}_hires \
    --tree-dir $treedir \
    --lat-dir exp/tri5a_ali_sp_lats \
    --dir $dir  || exit 1;

fi
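The initial and final effective learning rates in the command above define a decaying schedule. The sketch below shows an exponential interpolation between the two, scaled by the number of parallel jobs. This reflects my understanding of how KALDI's nnet3 training scripts set the rate, not a copy of their code, so treat the exact formula as an assumption.

```python
import math


def effective_lrate(progress, initial=0.00025, final=0.000025):
    """Exponential interpolation between the initial and final rates.

    `progress` is the fraction of the training data processed so far (0.0 to 1.0).
    """
    return initial * math.exp(progress * math.log(final / initial))


def actual_lrate(progress, num_jobs=4):
    """Rate used by each parallel job: the effective rate times the job count."""
    return num_jobs * effective_lrate(progress)


for progress in (0.0, 0.5, 1.0):
    print(progress, f"{effective_lrate(progress):.3e}", f"{actual_lrate(progress):.3e}")
```

With a tenfold gap between the two rates, the halfway rate is the geometric mean of the endpoints, not their average, so the rate falls quickly at first and more slowly later.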

Building the decoding graph

if [ $stage -le 14 ]; then
  utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test $dir $dir/graph_tg
fi

Decoding
After training finishes, we decode the iOS test set by running bash decode_test_chain.sh

decode_nj=6
graphaffx=tg
dir=exp/chain/cnn_tdnnf
graph_dir=$dir/graph_$graphaffx


for decode_set in test_IOS; do
    if [ ! -d data/${decode_set}_hires ]; then
        utils/copy_data_dir.sh data/${decode_set} data/${decode_set}_hires
        steps/make_mfcc.sh --cmd "$train_cmd" --nj 4 --mfcc-config conf/mfcc_hires.conf data/${decode_set}_hires exp/make_mfcc/${decode_set}_hires $mfccdir || exit 1;
        steps/compute_cmvn_stats.sh data/${decode_set}_hires exp/make_mfcc/${decode_set}_hires $mfccdir || exit 1;
        utils/fix_data_dir.sh data/${decode_set}_hires || exit 1;
    fi
    (
    steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 --skip-scoring true \
        --nj $decode_nj --cmd "$decode_cmd" \
        $graph_dir data/${decode_set}_hires \
        $dir/decode_${decode_set}_${graphaffx} || exit 1;
    ) || touch $dir/.error &
done
wait

Then call gen_submission.sh to extract the results from the lattices and convert them into the CSV format used for submission.

testset=test_IOS
decodedir=exp/chain/cnn_tdnnf/decode_test_IOS_tg
graphdir=exp/chain/cnn_tdnnf/graph_tg


bash local/get_hyp_from_lats.sh \
    $decodedir $graphdir
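The last conversion step, from KALDI-style hypothesis lines to a CSV file, might look like the Python sketch below. The column names `id` and `words` and the utterance IDs are hypothetical; the required submission header is not given in this write-up, so check the competition page before reusing it.

```python
import csv
import io


def hyps_to_csv(hyp_lines, header=("id", "words")):
    """Turn "uttid w1 w2 ..." lines into CSV text with one row per utterance."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for line in hyp_lines:
        parts = line.split()
        if not parts:
            continue
        uttid, words = parts[0], parts[1:]
        # Join the words without spaces, since the metric is character-based
        writer.writerow([uttid, "".join(words)])
    return buf.getvalue()


hyps = ["utt_0001 今天 天气 很 好", "utt_0002"]  # the second utterance decoded to nothing
print(hyps_to_csv(hyps))
```

Keeping a row for utterances with an empty hypothesis avoids missing IDs in the submission.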

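Competition data

The Zhiyuan MagicSpeechNet home-scenario Mandarin speech dataset is high-quality home-environment speech data of a kind that is currently scarce in the industry. It contains hundreds of hours of two-person conversations in real home environments. Each conversation was recorded on several platforms and has been fully transcribed and annotated. The competition data are split into a training set, a development set and a test set. The test set consists of the audio to be recognized, and each recording comes as three files, captured on an Android phone, an iOS phone and a voice recorder. To help participants segment each recording, the competition provides JSON files that mark the start and end time of each utterance. Participants use their models to recognize the conversations and submit the text for each uttid listed in the JSON.
Compared with similar multi-channel speech recognition competitions in China and abroad, the data have the following advantages in volume, scenarios and voice characteristics.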

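(1) Large amounts of conversational data. Speech recognition competitions in China mostly use read speech, whereas this competition uses real conversations. The data were recorded in fully realistic settings, with speakers talking freely about a chosen topic in a relaxed, unscripted way. Compared with similar international competitions built on conversational data, this one still has a considerable advantage in data volume. A realistic amount of overlapping speech also better reflects how hard speech recognition is in everyday home settings.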

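(2) Real and varied scenes. The dataset was collected in three real home environments, with speakers talking freely about a chosen topic in a relaxed, unscripted way. The different recording environments add diversity to the data and also make the competition harder.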

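(3) Close-talk combined with multi-platform far-field data. Each conversation has five synchronized recording channels: three far-field and two close-talk. The far-field channels were recorded with several models of Android phones, iPhones and voice recorders, which captures the characteristics of multi-platform recordings. The close-talk data were recorded with high-fidelity microphones kept 10 cm from the speaker's mouth.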

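(4) Rich, balanced voice characteristics. The speakers come from different regions of mainland China and speak Mandarin with some regional accent. They were drawn from different age groups, with a balanced gender split.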