We study the task of online learning in the presence of Massart noise. Instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\mathbf{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\mathbf{x}$ with unknown probability at most $\eta$. We study the fundamental class of $\gamma$-margin linear classifiers and present a computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$. Our mistake bound is qualitatively tight for efficient algorithms: it is known that even in the offline setting achieving classification error better than $\eta$ requires super-polynomial time in the SQ model. We extend our online learning model to a $k$-arm contextual bandit setting where the rewards -- instead of satisfying commonly used realizability assumptions -- are consistent (in expectation) with some linear ranking function with weight vector $\mathbf{w}^\ast$. Given a list of contexts $\mathbf{x}_1,\ldots \mathbf{x}_k$, if $\mathbf{w}^*\cdot \mathbf{x}_i > \mathbf{w}^* \cdot \mathbf{x}_j$, the expected reward of action $i$ must be larger than that of $j$ by at least $\Delta$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)~ \Delta T - o(T)$ bigger than choosing a random action at every round.
翻译:暂无翻译