In large technology companies, the need to manage and organize the technical documents created by engineers and managers has grown dramatically in recent years, driving demand for more scalable, accurate, and automated document classification. Prior studies have focused only on text, whereas technical documents often contain multimodal information. To leverage this multimodal information and improve classification performance, this paper presents a novel multimodal deep learning architecture, TechDoc, which utilizes three types of information: the natural language text and descriptive images within documents, and the associations among documents. The architecture synthesizes convolutional, recurrent, and graph neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model to classify documents according to the hierarchical International Patent Classification system. Our results show that TechDoc achieves higher classification accuracy than unimodal methods and other state-of-the-art benchmarks. The trained model can potentially be scaled to millions of real-world multimodal technical documents, which is useful for data and knowledge management in large technology companies and organizations.
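To make the described architecture concrete, the following is a minimal PyTorch sketch of a TechDoc-style multimodal classifier: a recurrent branch encodes document text, a convolutional branch encodes descriptive images, and a simple graph step aggregates features over document associations before classification. The fusion by concatenation, the mean-aggregation graph step, the layer sizes, and all class and parameter names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a TechDoc-style multimodal classifier (assumes PyTorch).
# Fusion strategy, layer sizes, and the simple graph step are illustrative
# assumptions, not the paper's reported architecture.
import torch
import torch.nn as nn


class MultimodalDocClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Text branch: embedding + GRU (recurrent network) over token sequences.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Image branch: small CNN over descriptive figures (assumed 3x64x64 inputs).
        self.image_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(16 * 8 * 8, hidden_dim), nn.ReLU(),
        )
        # Graph branch: one round of mean aggregation over linked documents,
        # standing in for the graph neural network component.
        self.graph_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        # Classifier over concatenated text, image, and graph features.
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, tokens, images, adjacency):
        # tokens: (N, L) token ids; images: (N, 3, 64, 64); adjacency: (N, N) doc links.
        _, h = self.text_rnn(self.embed(tokens))           # h: (1, N, hidden_dim)
        text_feat = h.squeeze(0)
        image_feat = self.image_cnn(images)
        fused = torch.cat([text_feat, image_feat], dim=1)  # per-document features
        # Row-normalize adjacency and average neighbor features (one GNN step).
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        graph_feat = torch.relu(self.graph_proj((adjacency / deg) @ fused))
        return self.classifier(torch.cat([text_feat, image_feat, graph_feat], dim=1))


if __name__ == "__main__":
    model = MultimodalDocClassifier(vocab_size=5000, num_classes=8)
    tokens = torch.randint(1, 5000, (4, 32))
    images = torch.rand(4, 3, 64, 64)
    adjacency = torch.eye(4)                        # toy document-association graph
    print(model(tokens, images, adjacency).shape)   # torch.Size([4, 8])
```

In such a design, all three branches are trained jointly with a single classification loss, which is one way to realize the integrated training process mentioned above; the hierarchical IPC labels could then be handled by per-level classification heads or by flattening the hierarchy.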