Continuous Time Bandits With Sampling Costs - 专知 (Zhuanzhi) Papers

Lower bound · Multi-armed bandit problem · Multi-armed bandit · Bandit · Mean

April 19, 2023

Continuous Time Bandits With Sampling Costs


Rahul Vaze, Manjesh K. Hanawal

We consider a continuous-time multi-arm bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a large reward and incurring a sampling cost that grows with the sampling frequency. The goal is to design a learning algorithm that minimizes regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-arm bandit problem (MAB); for example, even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2/\mu)$, where $\mu$ is the mean of the arm and $T$ is the time horizon. For the multiple-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2 \mu/\Delta^2)$, where $\mu$ now represents the mean of the best arm and $\Delta$ is the difference between the means of the best and the second-best arm. We then propose an algorithm that achieves the bound up to constant terms.
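To make the single-arm tradeoff concrete, here is a minimal Python sketch. The abstract does not specify the cost function, so the quadratic cost `cost_coeff * freq**2`, the Bernoulli reward model, and all function names below are illustrative assumptions, not the authors' model or algorithm. Under this assumed cost, the payoff rate `mu * freq - cost_coeff * freq**2` is maximized at `freq = mu / (2 * cost_coeff)`, which shows why the best sampling frequency depends on the unknown mean.

```python
import random


def payoff_rate(mu, freq, cost_coeff=1.0):
    """Expected payoff per unit time: reward mu*freq minus an ASSUMED quadratic cost."""
    return mu * freq - cost_coeff * freq ** 2


def oracle_frequency(mu, cost_coeff=1.0):
    """Frequency maximizing payoff_rate when mu is known (derivative set to zero)."""
    return mu / (2.0 * cost_coeff)


def plug_in_frequency(samples, cost_coeff=1.0):
    """Hypothetical learner: replace mu by the empirical mean of observed rewards."""
    mu_hat = sum(samples) / len(samples)
    return oracle_frequency(mu_hat, cost_coeff)


def regret_rate(mu, freq, cost_coeff=1.0):
    """Per-unit-time gap between the oracle payoff and the payoff at frequency freq."""
    best = payoff_rate(mu, oracle_frequency(mu, cost_coeff), cost_coeff)
    return best - payoff_rate(mu, freq, cost_coeff)


if __name__ == "__main__":
    rng = random.Random(0)
    true_mu = 0.6  # assumed Bernoulli mean, unknown to the learner
    for n in (10, 100, 1000):
        rewards = [1.0 if rng.random() < true_mu else 0.0 for _ in range(n)]
        freq = plug_in_frequency(rewards)
        print(n, round(freq, 4), round(regret_rate(true_mu, freq), 6))
```

In this toy model the regret rate equals `cost_coeff * (freq - oracle_frequency(mu))**2`, so the loss comes entirely from the error in estimating the mean. The sketch does not reproduce the paper's bounds; it only illustrates why estimation is needed even with a single arm.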

