The categorization of mutual funds and exchange-traded funds (ETFs) has long helped financial analysts perform peer analysis for purposes ranging from competitor analysis to quantifying portfolio diversification. The categorization methodology usually relies on fund composition data in a structured format extracted from Form N-1A. Here, we initiate a study to learn the categorization system directly from the unstructured data in the forms using natural language processing (NLP). We pose this as a multi-class classification problem in which the only input is the investment strategy description as reported in the form and the target variable is the Lipper Global category. Using various NLP models, we show that the categorization system can indeed be learned with high accuracy. We discuss the implications and applications of our findings, as well as the limitations of existing pre-trained architectures when applied to learning fund categorization.
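To make the problem setup concrete, the following is a minimal sketch of text-to-category classification, assuming a simple bag-of-words multinomial Naive Bayes baseline. This is a stand-in for illustration, not one of the NLP models evaluated in the study. The strategy descriptions and the labels `EQUITY`, `BOND`, and `MONEY_MARKET` are hypothetical placeholders rather than real Form N-1A text or actual Lipper Global categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed default)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / n_docs)
            for t in tokens:
                num = self.word_counts[label][t] + self.alpha
                score += math.log(num / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical investment strategy descriptions (not real filings).
train_texts = [
    "The fund invests primarily in common stocks of large US companies",
    "The fund seeks capital growth by investing in equity securities of companies",
    "The fund invests in investment grade corporate and government bonds",
    "The fund seeks income by investing in fixed income debt securities and bonds",
    "The fund invests in short term money market instruments and treasury bills",
    "The fund seeks to preserve capital through high quality short term money market securities",
]
# Placeholder category labels standing in for the target variable.
train_labels = ["EQUITY", "EQUITY", "BOND", "BOND", "MONEY_MARKET", "MONEY_MARKET"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("The fund invests mainly in stocks of growing companies")
print(prediction)
```

The input is only the free-text description and the output is a single category label, which mirrors the multi-class formulation described above. A pre-trained language model would replace the word-count features and the Naive Bayes scoring while keeping the same input and target.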