In this paper, we fill the research gap by applying state-of-the-art computer vision techniques to the data extraction stage of a data mining system. As shown in Fig. 1, this stage contains two subtasks: plot element detection and data conversion. To build a robust box detector, we comprehensively compare deep learning-based methods and identify one that detects boxes with high precision. To build a robust point detector, we adopt a fully convolutional network with a feature fusion module, which distinguishes closely spaced points better than traditional methods. The proposed system handles a variety of chart data without relying on heuristic assumptions. For data conversion, we translate the detected elements into data with semantic value. In the legend matching phase, we propose a network that measures feature similarity between legends and detected elements. Furthermore, we provide a baseline for the Harvesting Raw Tables from Infographics competition. We also identify several key factors that improve the performance of each stage. Experimental results demonstrate the effectiveness of the proposed system.
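The data conversion stage described above can be illustrated with a minimal sketch. It is not the paper's implementation: the linear axis mapping, the cosine similarity measure, the hand-written feature vectors, and the names `pixel_to_value`, `cosine_similarity`, and `match_legends` are all assumptions introduced for illustration. In the proposed system, the features compared during legend matching would come from a learned network, which is not reproduced here.

```python
import math


def pixel_to_value(pixel, tick_a, tick_b):
    """Map a pixel coordinate to a data value using two axis ticks.

    Each tick is a (pixel_position, data_value) pair.
    Assumption: the axis is linear (no log or categorical scale).
    """
    (p0, v0), (p1, v1) = tick_a, tick_b
    if p0 == p1:
        raise ValueError("ticks must have distinct pixel positions")
    return v0 + (pixel - p0) * (v1 - v0) / (p1 - p0)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_legends(element_features, legend_features):
    """Assign each detected element to the most similar legend entry.

    element_features: list of feature vectors, one per detected element.
    legend_features: dict mapping legend label -> feature vector.
    Assumption: features are precomputed placeholders, not network outputs.
    """
    matches = []
    for feat in element_features:
        best = max(
            legend_features,
            key=lambda label: cosine_similarity(feat, legend_features[label]),
        )
        matches.append(best)
    return matches


if __name__ == "__main__":
    # Hypothetical y-axis: pixel row 400 is value 0.0, pixel row 200 is value 10.0.
    print(pixel_to_value(300, (400, 0.0), (200, 10.0)))

    # Hypothetical 2-D features for two legend entries and two detected points.
    legends = {"series A": [1.0, 0.0], "series B": [0.0, 1.0]}
    points = [[0.9, 0.1], [0.2, 0.8]]
    print(match_legends(points, legends))
```

Picking the single most similar legend per element keeps the sketch short; a real system might instead enforce a one-to-one assignment or reject matches below a similarity threshold.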