We propose and analyze the product segmentation newsvendor problem, which generalizes the segmented sale of a class of perishable items. It is a new variant of the newsvendor problem in which a seller maximizes profit by choosing the inventory of the whole item under uncertain demand for its sub-items. Assuming that the means and covariance matrix of the stochastic demand are known but its distribution is not, we derive a closed-form robust ordering decision. However, robust approaches that always optimize against the worst-case demand scenario tend to be overly conservative, so traditional robust schemes perform unsatisfactorily. In this paper, we integrate robust and deep reinforcement learning (DRL) techniques and propose a new paradigm, termed robust learning, to make robust policies more attractive. Notably, we treat the robust decision as human domain knowledge and incorporate it into the DRL training process through a full-process human-machine collaborative mechanism comprising teaching experience, normative decision, and regularization return. Simulation results confirm that our approach effectively improves robust performance and can generalize to various problems that require robust but less conservative solutions. In addition, fewer training episodes, greater training stability, and interpretable behavior may facilitate the deployment of DRL algorithms in operational practice. Furthermore, the success of RLDQN on 1000-dimensional demand scenarios shows that the algorithm offers a path to solving complex operational problems through human-machine collaboration, and it may be significant for other complex operations management problems.
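As background for the moment-based robust ordering idea, the sketch below implements the classical single-item distribution-free newsvendor rule (Scarf's rule), which uses only the mean and standard deviation of demand. It is not the closed-form decision derived in this paper for the segmented, multi-sub-item setting; the function name, the zero-salvage assumption, and the numbers in the usage lines are illustrative assumptions.

```python
import math


def scarf_order_quantity(mean, std, price, cost):
    """Classical single-item max-min order quantity (Scarf's rule).

    Assumes zero salvage value and no shortage penalty beyond lost sales.
    Only the mean and standard deviation of demand are used; the demand
    distribution itself is treated as unknown.
    """
    if mean <= 0 or std < 0:
        raise ValueError("mean must be positive and std non-negative")
    if not (price > cost > 0):
        raise ValueError("requires price > cost > 0")

    # Profit margin relative to unit cost.
    margin_ratio = (price - cost) / cost

    # If demand is too variable relative to the margin, the worst-case
    # optimal decision is to order nothing.
    if margin_ratio < (std / mean) ** 2:
        return 0.0

    root = math.sqrt(margin_ratio)
    return mean + 0.5 * std * (root - 1.0 / root)


# Illustrative numbers only (not taken from the paper).
print(scarf_order_quantity(mean=100, std=20, price=5, cost=1))  # 115.0
print(scarf_order_quantity(mean=10, std=30, price=2, cost=1))   # 0.0
```

The second call shows the conservatism the abstract refers to: when variability is high relative to the margin, the worst-case rule orders nothing at all.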