Social ambiance describes the context in which social interactions happen, and can be measured from speech audio by counting the number of concurrent speakers. This measurement has enabled various mental health tracking and human-centric IoT applications. While on-device Social Ambiance Measure (SAM) is highly desirable to ensure user privacy and thus facilitate wide adoption of these applications, the computational complexity of state-of-the-art SAM solutions powered by deep neural networks (DNNs) is at odds with the often constrained resources of mobile devices. Furthermore, only limited labeled data is available or practical to collect for SAM in clinical settings, due to various privacy constraints and the human effort required, which further limits the achievable accuracy of on-device SAM solutions. To this end, we propose a dedicated neural architecture search framework for Energy-efficient and Real-time SAM (ERSAM). Specifically, our ERSAM framework automatically searches for DNNs that push forward the achievable accuracy vs. hardware efficiency frontier of mobile SAM solutions. For example, ERSAM-delivered DNNs consume only 40 mW × 12 h of energy and require only 0.05 seconds of processing latency for a 5-second audio segment on a Pixel 3 phone, while achieving an error rate of only 14.3% on a social ambiance dataset generated from LibriSpeech. We expect that our ERSAM framework can pave the way for ubiquitous on-device SAM solutions, which are in growing demand.
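To put the quoted efficiency figures in more familiar units, the short sketch below converts them. It assumes that "40 mW × 12 h" denotes an average power draw of 40 mW sustained over 12 hours, which the abstract does not state explicitly; the helper names are illustrative and are not part of ERSAM.

```python
# Figures quoted in the abstract (Pixel 3 phone, 5-second audio segment).
AVG_POWER_W = 40e-3   # 40 mW, assumed here to be an average power draw
DURATION_H = 12.0     # 12 hours of operation
LATENCY_S = 0.05      # processing latency per segment
SEGMENT_S = 5.0       # audio segment length


def energy_watt_hours(power_w: float, hours: float) -> float:
    """Energy in watt-hours for a constant power draw over a duration."""
    return power_w * hours


def energy_joules(power_w: float, hours: float) -> float:
    """Energy in joules (1 h = 3600 s)."""
    return power_w * hours * 3600.0


def real_time_factor(latency_s: float, segment_s: float) -> float:
    """Processing time divided by audio duration; below 1 means faster than real time."""
    return latency_s / segment_s


if __name__ == "__main__":
    print(f"Energy: {energy_watt_hours(AVG_POWER_W, DURATION_H):.2f} Wh")
    print(f"Energy: {energy_joules(AVG_POWER_W, DURATION_H):.0f} J")
    print(f"Real-time factor: {real_time_factor(LATENCY_S, SEGMENT_S):.3f}")
```

Under that reading, the energy budget is about 0.48 Wh (roughly 1728 J), and the real-time factor is about 0.01, meaning each segment is processed roughly 100 times faster than its duration.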