As the amount and variety of energetics research increases, machine-aware topic identification is necessary to streamline future research pipelines. An automatic topic identification process consists of creating document representations and performing classification. However, applying these processes to energetics research poses new challenges. First, energetics datasets contain many scientific terms that are necessary to understand the context of a document, and capturing them may require more complex document representations. Second, the predictions from classification must be understandable to, and trusted by, the chemists within the pipeline. In this work, we study the trade-off between prediction accuracy and interpretability by implementing three document embedding methods that vary in computational complexity. Alongside our accuracy results, we introduce local interpretable model-agnostic explanations (LIME) of each prediction to provide a localized understanding of that prediction and to validate classifier decisions with our team of energetics experts. This study was carried out on a novel labeled energetics dataset created and validated by our team of energetics experts.
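The pipeline described above (document representation, classification, then a per-prediction explanation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus and its two labels are invented, the bag-of-words representation and nearest-centroid classifier are stand-ins rather than any of the three embedding methods studied here, and the explanation step is a simple leave-one-word-out perturbation, which is a simplification in the spirit of LIME and not LIME itself (LIME fits a local surrogate model on many sampled perturbations).

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only (not the paper's dataset or labels).
TRAIN = [
    ("detonation velocity of the explosive was measured", "performance"),
    ("detonation pressure and velocity increase with density", "performance"),
    ("synthesis of the nitramine compound by nitration", "synthesis"),
    ("nitration route gives the compound in high yield", "synthesis"),
]


def embed(text):
    """Bag-of-words term counts: a deliberately simple document representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(examples):
    """Sum the term counts of each class into one centroid per label."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(embed(text))
    return centroids


def scores(centroids, text):
    """Similarity of the document to every class centroid."""
    vec = embed(text)
    return {label: cosine(vec, c) for label, c in centroids.items()}


def predict(centroids, text):
    """Return the label whose centroid is most similar to the document."""
    s = scores(centroids, text)
    return max(s, key=s.get)


def explain(centroids, text):
    """Rank words by how much removing each one shrinks the predicted-class margin."""
    words = text.lower().split()
    label = predict(centroids, text)

    def margin(tokens):
        s = scores(centroids, " ".join(tokens))
        best_other = max(v for k, v in s.items() if k != label)
        return s[label] - best_other

    base = margin(words)
    contrib = {}
    for w in set(words):
        kept = [x for x in words if x != w]
        # Positive value: the word supported the prediction; negative: it argued against it.
        contrib[w] = base - margin(kept)
    ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
    return label, ranked


if __name__ == "__main__":
    centroids = fit_centroids(TRAIN)
    label, ranked = explain(centroids, "measured detonation velocity of a new compound")
    print(label)
    for word, weight in ranked:
        print(f"{word:12s} {weight:+.3f}")
```

A ranked word list of this kind is what a domain expert would inspect to judge whether a prediction rests on chemically meaningful terms or on incidental vocabulary.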