We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is an "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions named ridge functions in the literature, we derive upper and lower bounds on the optimal burn-in cost, and in addition, on the entire learning trajectory during the burn-in period via differential equations. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.
翻译:与线性模型相比,在非线性模型中出现了两个奇怪的现象:首先,除了“学习阶段”和标准的估算或遗憾参数率标准参数,还有一个“燃烧期”由非线性函数确定固定成本;第二,实现最小的燃烧成本需要新的探索算法。对于非线性函数的特殊组合,在文献中称为脊函数的非线性函数,我们得出最佳燃烧成本的上限和下限,此外,在燃烧期间的整个学习轨迹上,通过差异方程。 特别是,两阶段算法首先发现良好的初始动作,然后将问题作为局部线性的最佳统计方法来处理。 相比之下,一些经典算法,如UCB和依赖回归法或手法的算法,是极不理想的。