We study the problem of learning a hierarchical tree representation of data from labeled samples, taken from an arbitrary (and possibly adversarial) distribution. Consider a collection of data tuples labeled according to their hierarchical structure. The smallest number of such tuples required in order to be able to accurately label subsequent tuples is of interest for data collection in machine learning. We present optimal sample complexity bounds for this problem in several learning settings, including (agnostic) PAC learning and online learning. Our results are based on tight bounds of the Natarajan and Littlestone dimensions of the associated problem. The corresponding tree classifiers can be constructed efficiently in near-linear time.
翻译:我们研究从任意(也可能是对抗性)分布中获取的标签样本数据分级树的问题;考虑根据等级结构收集标记的数据图象;为了能够准确标出随后的图象而需要的这种图象中数量最少的图象是机器学习中数据收集的兴趣所在;我们在若干学习环境中,包括(不可知的)PAC学习和在线学习中,为这一问题提供了最佳的样本复杂性界限;我们的结果基于相关问题的Natarajan和Littlestone层面的严格界限;相应的树分类器可以在近线时间有效建造。