Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BiLSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.
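The two pseudo-token solutions mentioned above can be illustrated with a minimal sketch. The function names, pseudo-token formats (e.g., `[X,XXX.X]`, `[NUM5]`), and the numeric-token regex below are illustrative assumptions, not the paper's exact implementation:

```python
import re

def shape_pseudo_token(token: str) -> str:
    """Keep the token's shape: replace each digit with 'X',
    preserving commas and decimal points, e.g. '9,323.0' -> '[X,XXX.X]'.
    (Illustrative format, not the paper's exact vocabulary.)"""
    return "[" + re.sub(r"\d", "X", token) + "]"

def magnitude_pseudo_token(token: str) -> str:
    """Keep only the token's magnitude: count its digits,
    e.g. '9,323.0' -> '[NUM5]'. (Illustrative format.)"""
    n_digits = sum(ch.isdigit() for ch in token)
    return f"[NUM{n_digits}]"

# Assumed definition of a "numeric expression": digits with optional
# commas and decimal points.
NUMERIC = re.compile(r"^\d[\d,.]*$")

def preprocess(tokens, mode="shape"):
    """Replace numeric tokens with pseudo-tokens so that BERT's subword
    tokenizer sees a single unfragmented unit instead of digit pieces."""
    replace = shape_pseudo_token if mode == "shape" else magnitude_pseudo_token
    return [replace(t) if NUMERIC.match(t) else t for t in tokens]
```

Because every numeric expression of the same shape (or magnitude) maps to the same pseudo-token, the model can treat these as dedicated vocabulary entries rather than fragmenting each number into subwords.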