Real-world wireless data are expensive to collect and often lack sufficient expert demonstrations, causing existing offline RL methods to overfit to suboptimal behaviors and exhibit unstable performance. To address this, we propose CORE, an offline RL framework designed specifically for wireless environments. CORE identifies latent expert trajectories in noisy datasets via behavior-embedding clustering, and trains a conditional variational autoencoder with a contrastive objective to separate expert from non-expert behaviors in latent space. From the learned representations, CORE constructs compensable rewards that reflect expert likelihood, effectively guiding policy learning under limited or imperfect supervision. More broadly, this work is among the first systematic explorations of offline RL in wireless networking, where adoption so far remains limited. Beyond introducing offline RL techniques to the domain, we analyze intrinsic characteristics of wireless data and develop a domain-aligned algorithm that explicitly accounts for their structural properties. Although offline RL is not yet an established methodology in the wireless community, our study aims to provide foundational insights and empirical evidence to support its broader adoption.
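The core reward-construction idea — cluster behavior embeddings, identify the expert-like mode, and add an expert-likelihood bonus to the dataset reward — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the synthetic embeddings stand in for CVAE latents, plain k-means stands in for the behavior-embedding clustering, the "highest mean return" criterion for picking the expert cluster is an assumption, and `beta` is a hypothetical weighting coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic behavior embeddings: expert-like trajectories
# cluster near one mode, noisy/suboptimal ones near another. These stand
# in for the latents a contrastively trained CVAE would produce.
expert_z = rng.normal(loc=+2.0, scale=0.5, size=(50, 8))
noisy_z = rng.normal(loc=-2.0, scale=0.5, size=(150, 8))
Z = np.vstack([expert_z, noisy_z])
base_rewards = np.concatenate([rng.normal(1.0, 0.1, 50),
                               rng.normal(0.2, 0.1, 150)])

def kmeans(Z, k=2, iters=50, seed=0):
    """Plain k-means over behavior embeddings (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(Z[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(Z)

# Assumption: treat the cluster with the higher mean dataset return as the
# latent "expert" mode (the paper's actual criterion may differ).
expert_cluster = max(range(2), key=lambda j: base_rewards[labels == j].mean())

# Expert-likelihood score: softmax over negative distances to the centroids,
# so trajectories near the expert mode score close to 1.
dists = np.linalg.norm(Z[:, None] - centers[None], axis=-1)
logits = -dists
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
likelihood = p[:, expert_cluster]

# Compensated reward: dataset reward plus an expert-likelihood bonus,
# nudging policy learning toward expert-like behavior.
beta = 0.5  # hypothetical weighting coefficient
compensated = base_rewards + beta * likelihood
```

In this toy setting the expert-like trajectories (first 50) receive likelihoods near 1 and thus larger compensated rewards, while the noisy majority receive little bonus — the intended effect of an expert-likelihood-shaped reward under mostly suboptimal data.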