Motivated by real-world machine learning applications, we consider a statistical classification task in which test samples arrive sequentially. The generating distributions are unknown, and only a set of empirically sampled training sequences is available to the decision maker. The decision maker must classify a test sequence that is known to be generated according to one of these distributions. In the binary case, the decision maker wishes to perform the classification with the minimum number of test samples, so at each step she either declares that hypothesis 1 is true, declares that hypothesis 2 is true, or requests an additional test sample. We propose a classifier and analyze its type-I and type-II error probabilities. We demonstrate the significant advantage of our sequential scheme over an existing non-sequential classifier proposed by Gutman. Finally, we extend our setup and results to the multi-class classification scenario and again demonstrate that the variable-length nature of the problem affords significant advantages: one can achieve the same set of exponents as in Gutman's fixed-length setting, but without the rejection option.
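The three-way decision at each step (declare hypothesis 1, declare hypothesis 2, or request another sample) can be illustrated with a minimal sketch. It is not the classifier proposed in the paper, whose test statistic and thresholds are not given in this abstract. The sketch assumes a finite alphabet, uses a generalized Jensen-Shannon divergence between the training and test empirical distributions as an assumed statistic, and all names (`sequential_classify`, `threshold`, `max_samples`) are hypothetical.

```python
import math
from collections import Counter


def empirical(seq, alphabet):
    """Empirical distribution (type) of seq over a finite alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]


def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gjs(p_train, p_test, alpha):
    """Generalized Jensen-Shannon divergence with weight alpha > 0 (assumed statistic)."""
    mix = [(alpha * a + b) / (1 + alpha) for a, b in zip(p_train, p_test)]
    return alpha * kl(p_train, mix) + kl(p_test, mix)


def sequential_classify(train1, train2, test_stream, alphabet,
                        threshold, max_samples=1000):
    """Return (label, samples_used); label is 1, 2, or None if undecided.

    Illustrative stopping rule (an assumption, not the paper's rule):
    stop once the test prefix is far from exactly one training sequence.
    """
    t1 = empirical(train1, alphabet)
    t2 = empirical(train2, alphabet)
    seen = []
    for y in test_stream:
        seen.append(y)  # one more test sample has been requested
        n = len(seen)
        q = empirical(seen, alphabet)
        s1 = n * gjs(t1, q, len(train1) / n)  # evidence against hypothesis 1
        s2 = n * gjs(t2, q, len(train2) / n)  # evidence against hypothesis 2
        if s2 >= threshold and s1 < threshold:
            return 1, n
        if s1 >= threshold and s2 < threshold:
            return 2, n
        if n >= max_samples:
            break
    return None, len(seen)
```

In this sketch a larger `threshold` trades more test samples for fewer errors, and the number of samples used is random because it depends on the observed data. That variable length is the feature the abstract contrasts with Gutman's fixed-length setting.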