Many people with some form of hearing loss rely on lipreading as their primary mode of day-to-day communication. However, finding resources to learn or improve one's lipreading skills can be challenging. This was further exacerbated during the COVID-19 pandemic due to restrictions on direct interactions with peers and speech therapists. Today, online MOOC platforms like Coursera and Udemy have become one of the most effective forms of training for many kinds of skill development. However, online lipreading resources are scarce, as creating such resources is an extensive process requiring months of manual effort to record hired actors. Because of this manual pipeline, such platforms are also limited in vocabulary, supported languages, accents, and speakers, and have a high usage cost. In this work, we investigate the possibility of replacing real human talking videos with synthetically generated videos. Synthetic data can easily incorporate larger vocabularies, variations in accent, and even local languages and many speakers. We propose an end-to-end automated pipeline to develop such a platform using state-of-the-art talking head video generator networks, text-to-speech models, and computer vision techniques. We then perform an extensive human evaluation using carefully designed lipreading exercises to validate the quality of our platform against existing lipreading platforms. Our studies concretely point toward the potential of our approach for developing a large-scale lipreading MOOC platform that can impact millions of people with hearing loss.