Effective communication in multi-agent reinforcement learning (MARL) is critical to success but constrained by bandwidth, yet prior approaches have been limited to complex gating mechanisms that decide only \textit{whether} to communicate, not \textit{how precisely}. Learning to optimize message precision at the bit level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL that supports unbounded signals, turning it into a universal, plug-and-play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate \textit{how} agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state-of-the-art MARL algorithms, showing that it reduces bandwidth by more than an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the \enquote{Bitter Lesson} in MARL communication: a simple Transformer-based policy leveraging DDCL matches the performance of complex, specialized architectures, calling into question the necessity of bespoke communication designs.
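The abstract notes that discretizing messages breaks gradient flow. The paper's own DDCL mechanism is not reproduced here; as an illustrative sketch only, a common workaround for this problem is the straight-through estimator (STE), which quantizes on the forward pass but backpropagates as if the operation were the identity. The function name and level count below are hypothetical, chosen for illustration:

```python
import torch

def quantize_ste(x: torch.Tensor, num_levels: int = 8) -> torch.Tensor:
    """Uniformly quantize x (assumed in [0, 1]) to num_levels values.

    torch.round has zero gradient almost everywhere, so naive quantization
    blocks learning. The straight-through trick returns the quantized value
    in the forward pass while letting gradients flow through x unchanged.
    """
    scaled = x * (num_levels - 1)
    quantized = torch.round(scaled) / (num_levels - 1)
    # Forward: x + (quantized - x) == quantized.
    # Backward: the detached term has no gradient, so dy/dx = 1.
    return x + (quantized - x).detach()

# Gradients pass through despite the discrete forward output.
x = torch.rand(4, requires_grad=True)
y = quantize_ste(x)
y.sum().backward()
```

Fewer levels mean fewer bits per message dimension, which is the precision/bandwidth trade-off the abstract describes agents learning to modulate.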