Most digital data, both the data collected over past decades and the data now being produced with information technology, is unlabeled or lacks any description. Unlabeled data is relatively easy to acquire but expensive to label, even with the help of domain experts. Most recent work addresses this problem with active learning driven by uncertainty measures. Although most uncertainty-based selection strategies are very effective, they fail to take the informativeness of unlabeled instances into account and are prone to querying outliers. To address these challenges, we propose a hybrid approach that computes both the uncertainty and the informativeness of an instance and then automatically labels the selected instances using a budget annotator. To reduce the annotation cost, we employ state-of-the-art pre-trained models so as to avoid querying information already contained in those models. Our extensive experiments on several datasets demonstrate the efficacy of the proposed approach.
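The abstract does not specify how uncertainty and informativeness are computed or combined, so the following is only a minimal sketch under assumptions that are not taken from the paper. It assumes uncertainty is the Shannon entropy of a classifier's predicted class probabilities, informativeness is approximated by the mean cosine similarity of an instance to the rest of the unlabeled pool (a density weighting), and the two are combined by a product. The function names and the `beta` parameter are hypothetical. The sketch illustrates why a pure uncertainty criterion can pick an outlier while a hybrid score does not.

```python
import numpy as np


def entropy_uncertainty(probs):
    """Shannon entropy of each row of predicted class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def density_informativeness(features):
    """Mean cosine similarity of each instance to the other pool instances.

    Assumption: this density score stands in for "informativeness";
    the paper may define it differently.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity
    return sim.sum(axis=1) / max(len(features) - 1, 1)


def hybrid_scores(probs, features, beta=1.0):
    """Uncertainty weighted by density; beta (hypothetical) tunes the weight."""
    density = np.clip(density_informativeness(features), 0.0, None)
    return entropy_uncertainty(probs) * density ** beta


def select_queries(probs, features, k, beta=1.0):
    """Indices of the k pool instances with the highest hybrid score."""
    return np.argsort(-hybrid_scores(probs, features, beta))[:k]


# Toy pool: three clustered points and one outlier (index 3) on which
# the classifier is maximally uncertain.
pool_probs = np.array([[0.6, 0.4], [0.9, 0.1], [0.99, 0.01], [0.5, 0.5]])
pool_feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 0.0]])

pure_choice = int(np.argmax(entropy_uncertainty(pool_probs)))
hybrid_choice = select_queries(pool_probs, pool_feats, k=1)
print("pure uncertainty picks:", pure_choice)
print("hybrid score picks:", hybrid_choice.tolist())
```

In this toy pool the entropy criterion alone selects the isolated point, whereas the density weighting drives its score to zero and the query goes to an uncertain point inside the cluster. The budget annotator and the pre-trained models mentioned in the abstract are not modeled here.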