Our goal is to build classification models from a combination of free text and structured data. To do this, we represent structured data as text sentences, called DataWords, such that similar data items are mapped to the same sentence. This makes it possible to model a mixture of text and structured data using only text-modeling algorithms. Several examples illustrate that text classification performance can be improved by first running extraction tools (named entity recognition), then converting their output to DataWords, and adding the DataWords to the original text before model building and classification. This approach also allows us to produce explanations for inferences in terms of both free text and structured data.
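The pipeline described above (extract, convert to DataWords, append to the original text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `GAZETTEER` lookup stands in for a real named entity recognizer, the `ENTITY_<TYPE>` and `field_low_to_high` token formats are hypothetical, and bucketing numbers by tens is just one arbitrary way to make similar data items share a DataWord.

```python
import re

# Hypothetical gazetteer standing in for a real named entity recognizer.
GAZETTEER = {
    "aspirin": "DRUG",
    "ibuprofen": "DRUG",
    "boston": "CITY",
    "chicago": "CITY",
}


def extract_entities(text):
    """Return (surface form, entity type) pairs found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]


def to_dataword(field, value):
    """Map one structured item to a single text token.

    Numeric values are bucketed by tens (an assumed scheme) so that
    similar items, such as ages 42 and 47, share the same DataWord.
    """
    if isinstance(value, (int, float)):
        low = int(value // 10) * 10
        return f"{field}_{low}_to_{low + 9}"
    return f"{field}_{str(value).strip().lower().replace(' ', '_')}"


def augment(text, record=None):
    """Append DataWords for extracted entities and structured fields."""
    datawords = [f"ENTITY_{etype}" for _, etype in extract_entities(text)]
    for field, value in (record or {}).items():
        datawords.append(to_dataword(field, value))
    return " ".join([text] + datawords)


augmented = augment("Patient took aspirin in Boston.", {"age": 47})
print(augmented)
```

The augmented string can then be passed to any text-only classifier. Because the DataWords are ordinary tokens, a model that reports influential words would surface them alongside the free-text terms, which is one way to read the claim about explanations.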