This paper introduces a parallel, asynchronous Transformer framework designed for efficient and accurate multilingual lip synchronization in real-time video conferencing systems. The proposed architecture integrates translation, speech processing, and lip-synchronization modules within a pipeline-parallel design that enables concurrent module execution through message-queue-based decoupling, reducing end-to-end latency by a factor of up to 3.1 compared with sequential approaches. To enhance computational efficiency and throughput, the inference workflow of each module is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion. These optimizations yield substantial efficiency gains while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. Experimental results demonstrate that the proposed parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization. The modular, message-oriented design makes this work applicable to resource-constrained IoT communication scenarios, including telemedicine, multilingual kiosks, and remote assistance systems. Overall, this work advances the development of low-latency, resource-efficient multimodal communication frameworks for next-generation AIoT systems.
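The core latency win comes from the message-queue decoupling: each module consumes from an input queue and produces to an output queue, so segment k+1 can be translated while segment k is still being synthesized or rendered. A minimal sketch of that pipeline-parallel pattern is shown below; the three stage functions are hypothetical placeholders standing in for the paper's translation, speech-synthesis, and lip-synchronization modules, not the actual models.

```python
# Sketch of message-queue-based pipeline parallelism (stage functions are
# illustrative placeholders, not the paper's actual modules).
import queue
import threading

SENTINEL = None  # marks end of the input stream

def stage(fn, in_q, out_q):
    """Consume items from in_q, apply this stage's work fn, forward to out_q."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown to the next stage
            break
        out_q.put(fn(item))

# Placeholder per-module work; real modules would run model inference here.
translate  = lambda seg: f"translated({seg})"
synthesize = lambda txt: f"speech({txt})"
lipsync    = lambda aud: f"frames({aud})"

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(translate, q0, q1)),
    threading.Thread(target=stage, args=(synthesize, q1, q2)),
    threading.Thread(target=stage, args=(lipsync, q2, q3)),
]
for t in threads:
    t.start()

# Feed speech segments; the three stages now overlap in time instead of
# running strictly one after another for each segment.
for seg in ["seg1", "seg2", "seg3"]:
    q0.put(seg)
q0.put(SENTINEL)

results = []
while (out := q3.get()) is not SENTINEL:
    results.append(out)
for t in threads:
    t.join()
```

Because each stage is a single consumer on a FIFO queue, output order matches input order, which matters for keeping audio and rendered frames temporally aligned.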


