This paper presents a crowdsourcing-based method for quantifying the importance of a dataset's attributes in determining the outcome of a classification problem. This human-provided heuristic serves as the initial weight seed for machine learning models and guides them toward a better optimum during gradient descent. Skewed datasets, which over-represent items of certain classes while under-representing the rest, are common in practice and can lead to unforeseen issues such as learning a biased function or overfitting. Traditional data augmentation techniques in supervised learning include oversampling and training with synthetic data. We introduce an experimental approach to handling such imbalanced datasets by including humans in the training process: we ask humans to rank the importance of the dataset's features and, through rank aggregation, determine the model's initial weight bias. We show that collective human bias can allow ML models to learn insights about the true population rather than the biased sample. We use two rank aggregation methods, Kemeny-Young and a Markov chain aggregator, to quantify human opinion on feature importance. This work primarily tests the effectiveness of human knowledge on binary classification (popular vs. not popular) problems with two ML models: deep neural networks and support vector machines. The approach treats humans as weak learners and relies on aggregation to offset individual biases and domain unfamiliarity.
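To illustrate the Markov chain aggregation step, the sketch below implements an MC4-style aggregator (Dwork et al.'s majority-rule variant); the abstract does not specify which Markov chain construction the paper uses, so the transition rule, damping factor, and the direct use of the stationary distribution as initial weights are all assumptions for illustration.

```python
import numpy as np

def mc4_weights(rankings, damping=0.85, iters=200):
    """Aggregate feature rankings into one weight per feature.

    rankings: list of permutations of feature indices, best first,
              one per human ranker (hypothetical input format).
    Returns the stationary distribution of an MC4-style Markov chain,
    normalized to sum to 1, usable as an initial weight seed.
    """
    n = len(rankings[0])
    # Position of each feature in each ranking (lower = more important).
    pos = [{f: r.index(f) for f in r} for r in rankings]

    # MC4 transition rule: from feature i, pick a feature j uniformly;
    # move to j only if a majority of rankers place j above i.
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and sum(p[j] < p[i] for p in pos) > len(rankings) / 2:
                P[i, j] = 1.0 / n
        P[i, i] = 1.0 - P[i].sum()  # remaining mass stays on i

    # Damping (as in PageRank) keeps the chain ergodic.
    P = damping * P + (1.0 - damping) / n

    # Power iteration toward the stationary distribution.
    w = np.full(n, 1.0 / n)
    for _ in range(iters):
        w = w @ P
    return w / w.sum()
```

For example, with three rankers over three features, `mc4_weights([[0, 1, 2], [0, 2, 1], [1, 0, 2]])` assigns the largest weight to feature 0, which a majority of rankers place first; the Kemeny-Young aggregator mentioned above would instead search for the consensus ordering minimizing total pairwise disagreement.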