This paper describes an end-to-end (E2E) neural architecture for the audio rendering of small portions of display content on low-resource personal computing devices. It is intended to address the problem of accessibility for vision-impaired or vision-distracted users at the hardware level. Neural image-to-text (ITT) and text-to-speech (TTS) approaches are reviewed, and a new technique is introduced to integrate them in a way that is both efficient and back-propagatable, leading to a non-autoregressive E2E image-to-speech (ITS) neural network that is fast and trainable. Experimental results show that, compared with the non-E2E approach, the proposed E2E system is 29% faster and uses 19% fewer parameters, with a 2% reduction in phone accuracy. A future direction for addressing the accuracy gap is presented.
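The abstract gives no implementation details, so the sketch below is only a rough, hypothetical illustration of the kind of differentiable ITT-TTS coupling it describes: instead of a hard argmax over predicted phones (which would block gradients), soft phone posteriors are multiplied into an embedding table, letting the non-autoregressive acoustic decoder back-propagate into the image encoder. All module names, layer sizes, and shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyITS(nn.Module):
    """Hypothetical minimal E2E image-to-speech (ITS) sketch.

    ITT stage: an image strip is encoded into per-frame phone posteriors.
    Bridge: posteriors softly select phone embeddings (differentiable).
    TTS stage: a non-autoregressive decoder emits all mel frames in parallel.
    """

    def __init__(self, n_phones=64, d_model=128, n_mels=80):
        super().__init__()
        # ITT stage: strided convolutions turn the image into a frame sequence
        # (assumed layout, not taken from the paper).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_phones = nn.Linear(d_model, n_phones)
        # Differentiable bridge: soft posteriors times an embedding table,
        # so gradients flow from speech loss back into the image encoder.
        self.phone_emb = nn.Parameter(torch.randn(n_phones, d_model))
        # TTS stage: a stand-in for a non-autoregressive acoustic decoder.
        self.decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_mels),
        )

    def forward(self, image):                            # image: (B, 1, H, W)
        f = self.encoder(image)                          # (B, d, H', W')
        f = f.mean(dim=2).transpose(1, 2)                # pool height -> (B, T, d)
        post = torch.softmax(self.to_phones(f), dim=-1)  # (B, T, n_phones)
        soft = post @ self.phone_emb                     # soft "text" frames
        return self.decoder(soft)                        # (B, T, n_mels), parallel

model = TinyITS()
mels = model(torch.randn(2, 1, 32, 256))  # two 32x256 display strips
print(mels.shape)                         # torch.Size([2, 64, 80])
```

Because the decoder consumes the whole soft phone sequence at once rather than conditioning on previously generated frames, generation is non-autoregressive, which is the property the abstract credits for the speed gain.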