Lipreading has witnessed substantial progress owing to the resurgence of neural networks. Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between current methodologies and the requirements for effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: firstly, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.5% and 46.6% respectively, using self-distillation. Secondly, we propose a series of architectural changes, including a novel Depthwise Separable Temporal Convolutional Network (DS-TCN) head, that slash the computational cost to a fraction of the (already quite efficient) original model. Thirdly, we show that knowledge distillation is a very effective tool for recovering the performance of the lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. Notably, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8.2x and 3.9x in computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.
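To make the two core ingredients concrete, the sketch below illustrates (a) a depthwise separable temporal convolution block, the building idea behind the DS-TCN head, and (b) a standard soft-label knowledge-distillation loss. This is a minimal illustration, not the authors' exact implementation; the class/function names, layer arrangement, and hyperparameters (kernel size, temperature, alpha) are assumptions chosen for clarity.

```python
# Minimal sketch (assumed, not the paper's exact code) of the two ideas named
# in the abstract: a depthwise separable temporal convolution and a standard
# knowledge-distillation objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthwiseSeparableTemporalConv(nn.Module):
    """Depthwise 1D convolution over time followed by a pointwise (1x1)
    convolution, which cuts parameters and FLOPs relative to a standard
    temporal convolution of the same kernel size."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        # Depthwise: one temporal filter per channel (groups=channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label knowledge distillation (Hinton et al.): weighted sum of
    cross-entropy on the ground truth and KL divergence to the teacher's
    temperature-softened predictions. Hyperparameters are illustrative."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```

The same distillation loss applies both to self-distillation (teacher and student share the same architecture) and to recovering accuracy in the lightweight student models.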