Integrated principal components analysis, or iPCA, is an unsupervised learning technique for grouped vector data recently defined by Tang and Allen. Like PCA, iPCA computes new axes that best explain the variance of the data, but iPCA is designed to handle corrupting influences of the elements within each group on one another, as in data about students at a school grouped into classrooms. Tang and Allen showed empirically that regularized iPCA finds useful features for such grouped data in practice. However, it is not yet known when unregularized iPCA generically exists. By contrast, PCA (which is a special case of iPCA) typically exists whenever the number of data points exceeds the dimension. We study this question and find that the answer is significantly more complicated than it is for PCA. Despite this complexity, we find simple sufficient conditions for a very useful case: when the groups are no more than half as large as the dimension and the total number of data points exceeds the dimension, iPCA generically exists. We also fully characterize the existence of iPCA in the case where all the groups are the same size. When the groups are not all the same size, however, we find that the group sizes for which iPCA generically exists are the integral points in a non-convex union of polyhedral cones. Nonetheless, we exhibit a polynomial time algorithm to decide whether iPCA generically exists (based on the affirmative answer to the saturation conjecture by Knutson and Tao), as well as a very simple randomized polynomial time algorithm.
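As an illustration, the simple sufficient condition stated in the abstract can be written as a short check. This is only a sketch of a literal reading of that sentence: the function name and arguments are hypothetical, "no more than half as large" is taken to mean `2 * size <= dimension`, and "exceeds" is taken to be strict. It is not the paper's polynomial time decision procedure, and a `False` result is inconclusive because the condition is sufficient but not necessary.

```python
from typing import Sequence


def satisfies_sufficient_condition(group_sizes: Sequence[int], dimension: int) -> bool:
    """Check the sufficient condition quoted in the abstract (hypothetical helper).

    Returns True when every group is at most half the dimension and the total
    number of data points is strictly greater than the dimension. False only
    means this particular condition does not apply; it does not show that
    iPCA fails to exist generically.
    """
    if dimension <= 0 or any(size <= 0 for size in group_sizes):
        raise ValueError("dimension and all group sizes must be positive")

    # Assumption: "no more than half as large as the dimension".
    groups_small_enough = all(2 * size <= dimension for size in group_sizes)
    # Assumption: "total number of data points exceeds the dimension" is strict.
    enough_points = sum(group_sizes) > dimension

    return groups_small_enough and enough_points
```

For example, with dimension 6, three groups of size 3 meet the condition, while two groups of size 4 do not, since each group is larger than half the dimension.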