Representation learning approaches require large amounts of discriminative training data, which are unavailable in many domains, such as healthcare, smart cities, and education. In practice, people resort to crowdsourcing to obtain annotated labels. However, due to data privacy concerns, budget limitations, and a shortage of domain-specific annotators, the number of crowdsourced labels remains very limited. Moreover, because annotators differ in expertise, crowdsourced labels are often inconsistent. Thus, directly applying existing supervised representation learning (SRL) algorithms is prone to overfitting and may yield suboptimal solutions. In this paper, we propose \emph{NeuCrowd}, a unified framework for SRL from crowdsourced labels. The proposed framework (1) creates a sufficient number of high-quality \emph{n}-tuplet training samples by utilizing safety-aware sampling and robust anchor generation; and (2) trains a neural sampling network that adaptively learns to select effective samples for SRL networks. The proposed framework is evaluated on one synthetic and three real-world data sets. The results show that our approach outperforms a wide range of state-of-the-art baselines in terms of prediction accuracy and AUC. To encourage reproducible results, we make our code publicly available at \url{https://github.com/tal-ai/NeuCrowd_KAIS2021}.