With the emergence of large model-based agents, widely adopted transformer architectures inevitably produce excessively long token embeddings for transmission, which can incur high bandwidth overhead, increased power consumption, and latency. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve the efficiency of token transmission, we design a two-stage training algorithm for large model-based token communication, consisting of cross-modal alignment followed by task-oriented fine-tuning. Meanwhile, tokens are compressed with a sliding window pooling operation to save communication resources. To balance the latency-performance trade-off introduced by compression, we formulate a weighted-sum optimization problem over latency and validation loss, and jointly optimize bandwidth, power allocation, and token length across users via an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under various bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across a range of signal-to-noise ratios than training without cross-modal alignment.
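The letter does not specify the pooling configuration, but the token compression step can be illustrated with a minimal sketch. The window size, stride, choice of average pooling, and the function name `sliding_window_pool` are illustrative assumptions, not details taken from the letter.

```python
import torch
import torch.nn.functional as F

def sliding_window_pool(tokens: torch.Tensor, window: int = 4, stride: int = 2) -> torch.Tensor:
    """Compress a token sequence by average pooling over a sliding window.

    tokens: (batch, seq_len, dim) token embeddings from the encoder.
    Returns a shorter sequence, reducing the number of tokens to transmit.
    """
    # avg_pool1d expects (batch, channels, length), so pool along the
    # sequence axis by treating the embedding dimension as channels.
    x = tokens.transpose(1, 2)                     # (batch, dim, seq_len)
    x = F.avg_pool1d(x, kernel_size=window, stride=stride)
    return x.transpose(1, 2)                       # (batch, new_len, dim)

# Example: 196 visual tokens compressed before transmission.
tokens = torch.randn(1, 196, 768)
compressed = sliding_window_pool(tokens, window=4, stride=2)
print(compressed.shape)  # torch.Size([1, 97, 768])
```

A larger window or stride shortens the transmitted sequence further, trading accuracy for lower latency, which is exactly the trade-off the optimization problem below captures.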
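One plausible form of the weighted-sum problem, under assumed notation (the letter's exact formulation is not given here): for $K$ users with bandwidth $B_k$, transmit power $p_k$, and compressed token length $\ell_k$,

\[
\min_{\{B_k,\, p_k,\, \ell_k\}_{k=1}^{K}} \;\; \lambda \sum_{k=1}^{K} T_k\!\left(B_k, p_k, \ell_k\right) + (1-\lambda)\, \mathcal{L}_{\mathrm{val}}\!\left(\ell_1,\dots,\ell_K\right)
\quad \text{s.t.} \;\; \sum_{k=1}^{K} B_k \le B_{\mathrm{tot}}, \quad \sum_{k=1}^{K} p_k \le P_{\mathrm{tot}},
\]

where $T_k$ is user $k$'s transmission latency, $\mathcal{L}_{\mathrm{val}}$ is the task validation loss, and $\lambda \in [0,1]$ weights latency against model performance. Alternating optimization would then fix $\{\ell_k\}$ while solving for $\{B_k, p_k\}$, and vice versa, until convergence; all symbols here are illustrative.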