Distribution regression refers to the supervised learning problem where labels are only available for groups of inputs instead of individual inputs. In this paper, we develop a rigorous mathematical framework for distribution regression where inputs are complex data streams. Leveraging properties of the expected signature and a recent signature kernel trick for sequential data from stochastic analysis, we introduce two new learning techniques, one feature-based and the other kernel-based. Each is suited to a different data regime in terms of the number of data streams and the dimensionality of the individual streams. We provide theoretical results on the universality of both approaches and demonstrate empirically their robustness to irregularly sampled multivariate time-series, achieving state-of-the-art performance on both synthetic and real-world examples from thermodynamics, mathematical finance and agricultural science.
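As a rough illustration of the feature-based approach mentioned above (a minimal sketch, not the paper's implementation), the code below computes the empirical expected signature of each labelled bag of streams, i.e. the average of the truncated path signatures of the streams it contains, and fits a linear model on the resulting feature vectors. The function names, the ridge regressor, the truncation depth and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def truncated_signature(path, depth):
    # Truncated signature of a piecewise-linear path given as an array of
    # shape (length, d); returns levels 1..depth flattened into one vector.
    d = path.shape[1]
    # identity element of the truncated tensor algebra: (1, 0, 0, ...)
    sig = [np.ones(())] + [np.zeros((d,) * k) for k in range(1, depth + 1)]
    for inc in np.diff(path, axis=0):
        # exp(inc) in the truncated tensor algebra: level-k term is inc^{(x)k} / k!
        exp_inc = [np.ones(())]
        term = np.ones(())
        for k in range(1, depth + 1):
            term = np.multiply.outer(term, inc) / k
            exp_inc.append(term)
        # Chen's identity: multiply the running signature by exp(inc)
        sig = [
            sum(np.multiply.outer(sig[i], exp_inc[k - i]) for i in range(k + 1))
            for k in range(depth + 1)
        ]
    return np.concatenate([lvl.ravel() for lvl in sig[1:]])

def expected_signature_features(bags, depth=3):
    # Empirical expected signature of each bag: average the truncated
    # signatures of the streams it contains, giving one feature per label.
    return np.stack([
        np.mean([truncated_signature(x, depth) for x in bag], axis=0)
        for bag in bags
    ])

# Toy usage: 50 labelled bags, each containing 30 two-dimensional streams.
rng = np.random.default_rng(0)
bags = [[rng.standard_normal((20, 2)).cumsum(axis=0) for _ in range(30)]
        for _ in range(50)]
labels = rng.standard_normal(50)
model = Ridge(alpha=1.0).fit(expected_signature_features(bags), labels)
```

The kernel-based variant described in the abstract would instead evaluate a signature kernel between streams, avoiding the explicit truncated feature map; in practice one would also typically rely on a dedicated signature library rather than the naive tensor-algebra computation sketched here.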