In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a lightweight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all baseline methods and setting a new state of the art on both datasets.
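The two ingredients named above, dense connections and Squeeze-and-Excitation (SE) gating, can be illustrated with a minimal NumPy sketch. It is not the DC-TCN architecture itself: the layer count, channel sizes, dilation rates, random weights, the placement of the SE gate after the dense block, and the helper names `temporal_conv`, `dense_block` and `squeeze_excite` are all assumptions chosen for illustration.

```python
import numpy as np


def temporal_conv(x, weight, dilation):
    """Dilated 1-D convolution over time with 'same' zero padding.

    x: (C_in, T) feature map; weight: (C_out, C_in, K) with odd K.
    """
    c_out, _, k = weight.shape
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for j in range(k):
        # Tap j reads the padded input shifted by j * dilation frames.
        out += weight[:, :, j] @ xp[:, j * dilation : j * dilation + t]
    return out


def dense_block(x, layers):
    """Dense connectivity: each layer sees the concatenation of the block
    input and the outputs of all preceding layers."""
    features = [x]
    for weight, dilation in layers:
        inp = np.concatenate(features, axis=0)
        features.append(np.maximum(temporal_conv(inp, weight, dilation), 0.0))
    return np.concatenate(features, axis=0)


def squeeze_excite(x, w_reduce, w_expand):
    """SE gating: pool over time, pass through a small bottleneck,
    then rescale every channel by a gate in (0, 1)."""
    squeezed = x.mean(axis=1)                       # (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)   # (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))
    return x * gate[:, None]


# Hypothetical sizes: 8 input channels, 29 frames, growth rate 4,
# three layers with dilations 1, 2 and 4, SE reduction ratio 4.
rng = np.random.default_rng(0)
c0, growth, frames = 8, 4, 29
x = rng.standard_normal((c0, frames))
layers = [
    (0.1 * rng.standard_normal((growth, c0 + i * growth, 3)), d)
    for i, d in enumerate([1, 2, 4])
]
dense = dense_block(x, layers)                      # (8 + 3 * 4, 29)
channels = dense.shape[0]
w_reduce = 0.1 * rng.standard_normal((channels // 4, channels))
w_expand = 0.1 * rng.standard_normal((channels, channels // 4))
out = squeeze_excite(dense, w_reduce, w_expand)
print(dense.shape, out.shape)
```

Because every layer's input is the concatenation of all earlier feature maps, features computed at different dilation rates remain available side by side to later layers, while the SE gate only reweights channels and leaves the tensor shape unchanged.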