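Overview: TiramisuASR implements several speech recognition and speech enhancement architectures, including CTC-based models, the Speech Enhancement Generative Adversarial Network (SEGAN), and RNN Transducer models (such as Conformer). These models can be converted to TFLite to reduce the memory and computation required for deployment.
GitHub repository: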
https://github.com/usimarit/TiramisuASR
Supported models:
CTCModel (End2end models using CTC Loss for training)
SEGAN (Refer to https://github.com/santi-pdp/segan), see examples/segan
Transducer Models (End2end models using RNNT Loss for training)
Conformer Transducer (Reference: https://arxiv.org/abs/2005.08100) See examples/conformer
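To make the CTC entry above more concrete, here is a minimal sketch of greedy CTC decoding: take the best label per frame, collapse consecutive repeats, then drop blanks. This is the textbook procedure, not TiramisuASR's own decoder (the project installs a separate ctc-decoders package). The function name `ctc_greedy_decode`, the blank index 0, and the toy scores are all assumptions made for illustration.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding over a list of per-frame score lists.

    `blank=0` is an assumed blank index; real models may place it elsewhere.
    """
    # Pick the highest-scoring label for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only if it differs from the previous frame's label
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy scores for 7 frames over 3 classes (class 0 is the assumed blank).
toy_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
    [0.80, 0.10, 0.10],
]
print(ctc_greedy_decode(toy_scores))  # [1, 1, 2]
```

The blank frame between the two runs of label 1 is what lets the same label appear twice in the output. Without it, the repeated frames would collapse into a single label.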
Requirements:
Ubuntu distribution (ctc-decoders and semetrics require some packages from apt)
Python 3.6+
TensorFlow 2.2+: pip install tensorflow
Setting up the environment and datasets
To run CTC models: ./scripts/install_ctc_decoders.sh
To run Transducer models: ./scripts/install_rnnt_loss.sh
To run SEGAN: ./scripts/install_semetrics.sh
Install TensorFlow: pip3 install tensorflow
Install the library: python3 setup.py install
Clean up (removes the contents of the /build folder): python3 setup.py clean --all
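Feature extraction
Feature extraction has two parts: speech feature extraction and text feature extraction.
Speech features are extracted from the signal using sample_rate, frame_ms, stride_ms, and num_feature_bins.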
Speech features have shape (B, T, num_feature_bins, num_channels), where B is the batch size and T is the number of time frames.
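The sketch below shows roughly how these parameters could determine the feature shape. It is only an illustration: the helper names, the example values (16 kHz, 25 ms frames, 10 ms stride, 80 bins), and the assumption that the signal is padded so that every stride yields one frame are not taken from the library, which may count frames differently.

```python
import math


def frame_params(sample_rate, frame_ms, stride_ms):
    """Convert the window length and hop from milliseconds to samples."""
    frame_length = int(round(sample_rate * frame_ms / 1000))
    frame_step = int(round(sample_rate * stride_ms / 1000))
    return frame_length, frame_step


def num_frames(num_samples, frame_step):
    """Number of frames T, assuming padding so each hop yields one frame."""
    return math.ceil(num_samples / frame_step)


def feature_shape(batch_size, num_samples, sample_rate, frame_ms, stride_ms,
                  num_feature_bins, num_channels=1):
    """Return the assumed (B, T, num_feature_bins, num_channels) shape."""
    _, frame_step = frame_params(sample_rate, frame_ms, stride_ms)
    t = num_frames(num_samples, frame_step)
    return (batch_size, t, num_feature_bins, num_channels)


# Illustrative values: a batch of 4 one-second clips sampled at 16 kHz.
print(frame_params(16000, 25, 10))                  # (400, 160)
print(feature_shape(4, 16000, 16000, 25, 10, 80))   # (4, 100, 80, 1)
```

With a 10 ms stride, one second of audio gives about 100 frames, so T grows linearly with clip duration while num_feature_bins stays fixed.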
Text features are read from tiramisu_asr.featurizers.english.txt.
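As a rough illustration of what a character-level text featurizer does, the sketch below maps transcript characters to integer indices and back. The vocabulary here is hard-coded and hypothetical. The actual contents and format of the vocabulary file are not described in this article, and the function names are invented for the example.

```python
def build_vocab(characters):
    """Map each character to an integer index, in the order given."""
    return {ch: idx for idx, ch in enumerate(characters)}


def encode(text, char_to_index):
    """Convert a transcript to indices, skipping characters not in the vocabulary."""
    return [char_to_index[ch] for ch in text.lower() if ch in char_to_index]


def decode(indices, index_to_char):
    """Convert a list of indices back to a string."""
    return "".join(index_to_char[i] for i in indices)


# Hypothetical English vocabulary: space, lowercase letters, apostrophe.
vocab_chars = list(" abcdefghijklmnopqrstuvwxyz'")
char_to_index = build_vocab(vocab_chars)
index_to_char = {idx: ch for ch, idx in char_to_index.items()}

ids = encode("Hello world", char_to_index)
print(ids)                         # [8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4]
print(decode(ids, index_to_char))  # hello world
```

A real CTC or Transducer setup would also reserve an index for the blank symbol. Where that index sits depends on the library and is not assumed here.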
Datasets
VIVOS: 15 hours
https://ailab.hcmus.edu.vn/vivos
InfoRe Technology 1: 25 hours, single person
https://files.huylenguyen.com/datasets/infore/25hours.zip
InfoRe Technology 2 (also used in VLSP2019): ~415 hours
https://files.huylenguyen.com/datasets/infore/audiobooks.zip