As machine learning becomes democratized in the era of Software 2.0, a serious bottleneck is acquiring enough data to ensure accurate and fair models. Recent techniques, including crowdsourcing, provide cost-effective ways to gather such data. However, simply acquiring as much data as possible is not necessarily an effective strategy for optimizing accuracy and fairness. For example, if an online app store has enough training data for certain slices of data (say, American customers) but not for others, obtaining more American customer data will only bias model training. Instead, we contend that one needs to acquire data selectively, and we propose Slice Tuner, which acquires possibly different amounts of data per slice such that model accuracy and fairness on all slices are optimized. This problem differs from labeling existing data (as in active learning or weak supervision) because the goal is to obtain the right amounts of new data. At its core, Slice Tuner maintains learning curves of slices that estimate the model accuracies given more data and uses convex optimization to find the best data acquisition strategy. The key challenges of estimating learning curves are that they may be inaccurate if there is not enough data, and that there may be dependencies among slices, where acquiring data for one slice influences the learning curves of others. We address these issues by iteratively and efficiently updating the learning curves as more data is acquired. We evaluate Slice Tuner on real datasets using crowdsourcing for data acquisition and show that Slice Tuner significantly outperforms baselines in terms of model accuracy and fairness, even when the learning curves cannot be reliably estimated.
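
To make the idea of per-slice learning curves and budgeted acquisition concrete, the following is a minimal sketch, not Slice Tuner itself. It assumes each slice's loss follows a power law, loss ≈ b · n^(-a), in the number of examples n (an assumption made for illustration; the abstract does not specify the curve family), assumes unit cost per example, and minimizes only the total predicted loss. It omits the fairness objective, the dependencies among slices, and the iterative curve updates. A greedy unit-by-unit allocation stands in for a convex solver, which is adequate here because the simplified objective is separable and convex. All function names and the toy data are hypothetical.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss ≈ b * n**(-a) in log-log space.

    Assumes a power-law learning curve (illustrative assumption).
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope, math.exp(mean_y - slope * mean_x)

def predicted_loss(curve, n):
    """Loss predicted by a fitted curve (a, b) at n examples."""
    a, b = curve
    return b * n ** (-a)

def allocate(curves, current_sizes, budget, step=1):
    """Greedily spend the budget where the predicted loss drops most.

    Simplification: unit cost per example, total loss only,
    no fairness term and no cross-slice influence.
    """
    extra = {name: 0 for name in curves}
    remaining = budget
    while remaining >= step:
        def gain(name):
            n = current_sizes[name] + extra[name]
            return (predicted_loss(curves[name], n)
                    - predicted_loss(curves[name], n + step))
        best = max(curves, key=gain)
        extra[best] += step
        remaining -= step
    return extra

# Toy data (hypothetical): both slices follow loss = 2 * n**-0.5,
# but slice "B" currently has far fewer examples than slice "A".
observations = {
    "A": ([100, 200, 400], [2 * n ** -0.5 for n in (100, 200, 400)]),
    "B": ([10, 20, 40], [2 * n ** -0.5 for n in (10, 20, 40)]),
}
curves = {name: fit_power_law(s, l) for name, (s, l) in observations.items()}
current_sizes = {"A": 400, "B": 40}
plan = allocate(curves, current_sizes, budget=100)
print(plan)
```

In this toy setting the whole budget goes to the data-poor slice, which mirrors the app-store example: adding more data to an already well-covered slice yields a smaller predicted improvement than adding it to an under-represented one.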