Given well-shuffled data, can we determine whether the data items are statistically independent or dependent? Formally, we consider the problem of testing whether a set of exchangeable random variables is independent. We show that this is possible and develop tests that can confidently reject the null hypothesis that the data are independent and identically distributed (iid), and that have high power against (some) exchangeable distributions. We make no structural assumptions on the underlying sample space. One potential application is in deep learning, where data is often scraped from across the internet and duplicates abound, which can make the data non-iid and test-set evaluation prone to giving wrong answers.
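
For readers who want the testing problem written out, one standard way to state it is sketched below. The notation ($X_i$, $\mathcal{X}$, $\mu$, $\pi$, $Z$) is an assumption introduced here for illustration and is not taken from the paper itself.

```latex
% Exchangeability: the joint law is invariant under every permutation \pi of the indices.
(X_1, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, \dots, X_{\pi(n)})
  \quad \text{for all permutations } \pi \text{ of } \{1, \dots, n\}.

% Null hypothesis: the data are iid, i.e. the joint law is a product measure.
H_0 : \; (X_1, \dots, X_n) \sim \mu^{\otimes n}
  \quad \text{for some distribution } \mu \text{ on } \mathcal{X}.

% Alternative: the joint law is exchangeable but not a product measure.
H_1 : \; (X_1, \dots, X_n) \text{ is exchangeable and its law is not of the form } \mu^{\otimes n}.

% Illustrative non-iid exchangeable case (a duplicated item):
% draw Z \sim \mu with \mu non-degenerate and set X_1 = X_2 = Z.
% Then (X_1, X_2) \overset{d}{=} (X_2, X_1), but X_1 and X_2 are not independent.
```

Every iid sequence is exchangeable, so shuffling alone cannot separate the two hypotheses; the duplicated-item case shows how scraped data with repeats can be exchangeable and still violate independence.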