In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for audiovisual automatic speech recognition (AV-ASR) that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings, and ablations. Our final method achieves a word error rate (WER) of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.
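
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the challenge's official scoring script; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and any text normalization applied by the challenge is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Hypothetical helper for illustration: tokenizes on whitespace only and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100 when a hypothesis contains many extra words.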