This paper introduces Bhasha-Rupantarika, a lightweight and efficient multilingual translation system tailored through algorithm-hardware codesign for resource-constrained settings. The method investigates model deployment at sub-byte precision levels (FP8, INT8, INT4, and FP4); experimental results indicate a 4.1x reduction in model size (FP4) and a 4.2x speedup in inference, corresponding to a throughput of 66 tokens/s (a 4.8x improvement). These results underscore the importance of ultra-low-precision quantization for real-time deployment on IoT devices with FPGA accelerators, achieving performance on par with expectations. Our evaluation covers bidirectional translation between Indian and international languages, demonstrating adaptability in low-resource linguistic contexts. The FPGA deployment achieved a 1.96x reduction in LUTs and a 1.65x reduction in FFs, yielding a 2.2x throughput improvement over OPU and a 4.6x improvement over HPTA. Overall, the evaluation presents a viable solution combining quantization-aware translation with hardware efficiency for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and dataset are publicly available for reproducibility, facilitating rapid integration and further development by researchers.
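To make the sub-byte quantization idea concrete, the following is a minimal, illustrative sketch of symmetric per-tensor INT4 weight quantization in NumPy; it is an assumption-based toy example (function names and the per-tensor scaling scheme are ours, not the paper's actual pipeline), showing why packing two 4-bit values per byte yields roughly the model-size reductions reported.

```python
import numpy as np

def quantize_int4_symmetric(weights: np.ndarray):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].

    Returns the quantized integers (stored in int8 containers) and the scale.
    """
    scale = float(np.max(np.abs(weights))) / 7.0  # map max magnitude to +/-7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from INT4 codes and the scale."""
    return q.astype(np.float32) * scale

# Compare storage: FP32 vs. packed INT4 (two 4-bit codes per byte).
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int4_symmetric(w)
fp32_bytes = w.nbytes
int4_bytes = q.size // 2  # after bit-packing, two values share one byte
```

Relative to an FP16 baseline this packing gives a ~4x footprint reduction, consistent in spirit with the FP4 figures reported above; the rounding error of this scheme is bounded by half the scale per weight.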