The Brazilian Supreme Court receives tens of thousands of cases each semester. Court employees spend thousands of hours executing the initial analysis and classification of those cases, taking effort away from later, more complex stages of the case management workflow. In this paper, we explore multimodal classification of documents from Brazil's Supreme Court. We train and evaluate our methods on a novel multimodal dataset of 6,510 lawsuits (339,478 pages) with manual annotation assigning each page to one of six classes. Each lawsuit is an ordered sequence of pages, each stored both as an image and as the corresponding text extracted through optical character recognition. We first train two unimodal classifiers: a ResNet pre-trained on ImageNet is fine-tuned on the images, and a convolutional network with filters of multiple kernel sizes is trained from scratch on the document texts. We use these networks as extractors of visual and textual features, which are then combined through our proposed Fusion Module. The Fusion Module can handle missing textual or visual input by using learned embeddings for missing data. Moreover, we experiment with bi-directional Long Short-Term Memory (biLSTM) networks and linear-chain conditional random fields (CRFs) to model the sequential nature of the pages. The multimodal approaches outperform both textual and visual classifiers, especially when leveraging the sequential nature of the pages.
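As a rough illustration of the fusion idea described above, the following is a minimal PyTorch-style sketch of a fusion module that concatenates per-page visual and textual feature vectors and substitutes a learned embedding whenever a modality is missing. All module names, dimensions, and layer choices here are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Combines per-page visual and textual features into class logits.

    When a modality is absent (e.g., OCR produced no text for a page),
    a learned embedding of matching dimensionality stands in for it.
    Dimensions and layers are illustrative assumptions.
    """

    def __init__(self, visual_dim=2048, text_dim=300, fused_dim=512, num_classes=6):
        super().__init__()
        # Learned stand-ins for missing modalities, updated by backprop.
        self.missing_visual = nn.Parameter(torch.zeros(visual_dim))
        self.missing_text = nn.Parameter(torch.zeros(text_dim))
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + text_dim, fused_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, visual_feats, text_feats):
        # visual_feats / text_feats: (batch, dim) tensors, or None if missing.
        batch = visual_feats.size(0) if visual_feats is not None else text_feats.size(0)
        if visual_feats is None:
            visual_feats = self.missing_visual.expand(batch, -1)
        if text_feats is None:
            text_feats = self.missing_text.expand(batch, -1)
        fused = self.fuse(torch.cat([visual_feats, text_feats], dim=-1))
        return self.classifier(fused)  # per-page logits over the six classes
```

In the paper's full pipeline, the fused per-page representations can additionally feed a biLSTM or linear-chain CRF to exploit the page ordering within a lawsuit; that sequence layer is omitted from this sketch.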