Harnessing label semantics to extract higher performance under noisy label for Company to Industry matching

Assigning appropriate industry tag(s) to a company is a critical task in a financial institution, as it affects various parts of the financial machinery. Yet it remains a complex task. Typically, such industry tags are assigned by Subject Matter Experts (SMEs) after evaluating a company's business lines against the industry definitions. The task becomes even more challenging as companies continue to add new businesses and newer industry definitions are formed. Given the periodicity of the task, it is reasonable to assume that an Artificial Intelligence (AI) agent could be developed to carry it out efficiently. While this is an exciting prospect, the challenge arises from the need for historical patterns of such tag assignments (or labeling). Labeling is often considered the most expensive task in Machine Learning (ML) due to its dependency on SMEs and manual effort. Therefore, in an enterprise setup, an ML project often encounters noisy and dependent labels. Such labels create technical hindrances for ML models in producing robust tag assignments. We propose an ML pipeline that uses semantic similarity matching as an alternative to multi-label text classification, while making use of a Label Similarity Matrix and a minimum labeling strategy. We demonstrate that this pipeline achieves significant improvements over the noise and exhibits robust predictive capabilities.
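To make the idea of semantic similarity matching more concrete, the following is a minimal sketch and not the pipeline proposed here. The abstract does not specify the text encoder, the decision threshold, or how the Label Similarity Matrix is constructed or used, so everything below is an assumption: a toy bag-of-words vector stands in for a real sentence encoder, the industry definitions in `INDUSTRIES` are hypothetical, and the helpers `embed`, `cosine`, `label_similarity_matrix`, and `match_industries` are illustrative names only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical industry definitions, invented for illustration only.
INDUSTRIES = {
    "Banking": "banks deposits loans credit lending services",
    "Software": "software development cloud applications platforms",
    "Fintech": "software platforms for payments lending and banking services",
}


def label_similarity_matrix(definitions):
    """Pairwise similarity between industry definitions (one possible construction)."""
    names = list(definitions)
    vecs = [embed(definitions[n]) for n in names]
    return names, [[cosine(a, b) for b in vecs] for a in vecs]


def match_industries(description, definitions, threshold=0.2):
    """Rank industries by similarity to a company description.

    Every industry whose score reaches the (assumed) threshold is returned,
    so a company can receive several tags without a trained classifier.
    """
    doc = embed(description)
    scores = {name: cosine(doc, embed(text)) for name, text in definitions.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]


if __name__ == "__main__":
    names, matrix = label_similarity_matrix(INDUSTRIES)
    for name, row in zip(names, matrix):
        print(name, [round(x, 2) for x in row])
    print(match_industries("We build cloud software applications for enterprises", INDUSTRIES))
```

Because matching is done against the label text itself rather than against historical tag assignments, a new industry definition can be added to the dictionary without retraining. The pairwise matrix shows which labels overlap semantically, which is one plausible way to reason about dependent labels.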
