Contextual advertising gives advertisers the opportunity to target the context that is most relevant to their ads. However, its power cannot be fully utilized unless page content can be targeted using fine-grained categories, e.g., "coupe" vs. "hatchback" instead of "automotive" vs. "sport". The widely used advertising content taxonomy (IAB taxonomy) consists of 23 coarse-grained categories and 355 fine-grained categories. With such a large number of categories, it becomes very challenging either to collect training documents to build a supervised classification model, or to compose expert-written rules in a rule-based classification system. Moreover, in fine-grained classification, different categories often overlap or co-occur, which makes accurate classification harder. In this work, we propose wiki2cat, a method that tackles large-scale fine-grained text classification by tapping into the Wikipedia category graph. The categories in the IAB taxonomy are first mapped to category nodes in the graph. The labels are then propagated across the graph to obtain a list of labeled Wikipedia documents, from which text classifiers are induced. The method is well suited to large-scale classification problems since it does not require any manually labeled documents, hand-curated rules, or keywords. The proposed method is benchmarked against various learning-based and keyword-based baselines and yields competitive performance on both publicly available datasets and a new dataset containing more than 300 fine-grained categories.
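The mapping and propagation steps described above can be illustrated with a minimal sketch. The abstract does not specify the propagation rule, so the breadth-first traversal, the `max_depth` limit, and the `decay` weighting below are assumptions made for illustration, not the wiki2cat algorithm itself. The function name `propagate_labels`, the toy category names, and the page titles are likewise hypothetical and are not taken from Wikipedia or from the paper.

```python
from collections import deque


def propagate_labels(subcats, pages, seeds, max_depth=2, decay=0.5):
    """Propagate taxonomy labels from seed category nodes to documents.

    subcats: dict mapping a category node to its child category nodes
    pages:   dict mapping a category node to the documents filed under it
    seeds:   dict mapping a taxonomy label to its seed category nodes
    Returns a dict {document: {label: weight}}.

    Assumption (not from the paper): the weight of a label decays
    geometrically with the distance from the seed node.
    """
    doc_labels = {}
    for label, roots in seeds.items():
        visited = set(roots)
        queue = deque((root, 0) for root in roots)
        while queue:
            node, depth = queue.popleft()
            weight = decay ** depth
            # Label every document filed directly under this category,
            # keeping the strongest weight seen so far for this label.
            for doc in pages.get(node, ()):
                scores = doc_labels.setdefault(doc, {})
                scores[label] = max(scores.get(label, 0.0), weight)
            # Descend into subcategories until the depth limit is reached.
            if depth < max_depth:
                for child in subcats.get(node, ()):
                    if child not in visited:
                        visited.add(child)
                        queue.append((child, depth + 1))
    return doc_labels


# Toy category graph (hypothetical data for illustration only).
subcats = {
    "Category:Coupes": ["Category:Sports coupes"],
    "Category:Hatchbacks": ["Category:Hot hatches"],
}
pages = {
    "Category:Coupes": ["Audi TT"],
    "Category:Sports coupes": ["Toyota 86"],
    "Category:Hatchbacks": ["Volkswagen Golf"],
    "Category:Hot hatches": ["Volkswagen Golf", "Honda Civic Type R"],
}
# Each taxonomy label is mapped to one seed node in the graph.
seeds = {
    "coupe": ["Category:Coupes"],
    "hatchback": ["Category:Hatchbacks"],
}

labeled = propagate_labels(subcats, pages, seeds)
for doc, scores in sorted(labeled.items()):
    print(doc, scores)
```

The resulting weighted document labels could then serve as training data for any standard text classifier. The depth limit is one simple way to keep labels from drifting into loosely related categories. Whether wiki2cat uses a comparable safeguard is not stated in the abstract.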