The application of reinforcement learning to credit scoring creates a distinctive setting for the contextual logistic bandit, one that does not conform to the usual exploration-exploitation tradeoff but instead favors exploration-free algorithms. Given sufficient randomness in the pool of observable contexts, the agent can exploit the action with the highest estimated reward while still learning about the structure governing the environment. In such settings, greedy algorithms consistently outperform algorithms that explore efficiently, such as Thompson sampling. However, in a more pragmatic credit-scoring scenario, lenders can, to a degree, classify each borrower into a separate group, and learning the characteristics of one group conveys no information about another. Through extensive simulations, we show that Thompson sampling eventually dominates greedy algorithms given enough timesteps, and that the number of timesteps required grows with the complexity of the underlying features.
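To make the comparison concrete, below is a minimal sketch (not the paper's simulation code) of a contextual logistic bandit in which a greedy policy acts on per-arm MAP estimates while Thompson sampling draws parameters from a Laplace-approximated posterior, a standard approximation for logistic rewards. The arm count, feature dimension, horizon, prior strength, and helper names such as `map_fit` are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 3, 5, 1000                       # arms (borrower groups), feature dim, timesteps (illustrative)
THETA = rng.standard_normal((K, D))        # true per-arm parameters, unknown to the agent

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_fit(X, y, lam=1.0, iters=15):
    """MAP estimate and Hessian for logistic regression with a N(0, lam^-1 I) prior (Newton's method)."""
    theta = np.zeros(X.shape[1])
    H = lam * np.eye(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        g = X.T @ (p - y) + lam * theta                              # gradient of negative log posterior
        H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(H, g)                        # Newton step
    return theta, H

def run(policy):
    data = [([], []) for _ in range(K)]    # per-arm histories: (contexts, rewards)
    regret = 0.0
    for t in range(T):
        x = rng.standard_normal(D)         # observed borrower context
        if t < K:                          # pull each arm once to initialize
            a = t
        else:
            scores = np.empty(K)
            for k in range(K):
                theta_hat, H = map_fit(np.array(data[k][0]), np.array(data[k][1]))
                if policy == "thompson":   # sample from the Laplace posterior N(theta_hat, H^-1)
                    theta_hat = rng.multivariate_normal(theta_hat, np.linalg.inv(H))
                scores[k] = x @ theta_hat  # greedy uses the MAP estimate directly
            a = int(np.argmax(scores))
        r = rng.binomial(1, sigmoid(x @ THETA[a]))
        data[a][0].append(x)
        data[a][1].append(r)
        regret += sigmoid((THETA @ x).max()) - sigmoid(x @ THETA[a])
    return regret

print("greedy cumulative regret:  ", run("greedy"))
print("thompson cumulative regret:", run("thompson"))
```

Because each arm here has its own parameter vector, observations from one arm carry no information about another, mirroring the grouped-borrower setting in which exploration-free greedy play can get stuck on under-sampled arms.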