TextBenDS: a generic Textual data Benchmark for Distributed Systems

Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for a range of analysis and retrieval tasks. The weights are usually computed during data preprocessing, because updating them to track every modification made to the dataset is costly. Furthermore, computation errors are introduced when only subsets of the dataset are analyzed. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of extracting top-k keywords and documents, it is customary to design benchmarks that compare weighting schemes under various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model, designed with a multidimensional approach, for storing text documents. We also propose aggregation queries of varying complexity and selectivity for constructing the term weighting schemes that are used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. For example, MongoDB proves to have the best overall performance, while Spark's execution time remains almost the same regardless of the weighting scheme.
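
To make the task concrete, the following is a minimal in-memory sketch of extracting top-k keywords with a term weighting scheme. It is not the benchmark's aggregation queries or data model. The abstract does not name a specific scheme, so TF-IDF is assumed here purely for illustration, and the function name `top_k_keywords` and the whitespace tokenizer are hypothetical.

```python
import math
from collections import Counter


def top_k_keywords(docs, k):
    """Return the k terms with the highest TF-IDF weight summed over all documents.

    Assumption: TF-IDF stands in for "a weighting scheme"; tokens are
    lowercased and split on whitespace.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)            # normalized term frequency
            idf = math.log(n_docs / df[term])   # inverse document frequency
            scores[term] += tf * idf

    # Sort by descending weight, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = ["spark spark hadoop", "mongodb hadoop", "mongodb mongodb hadoop"]
print(top_k_keywords(docs, 2))
```

Because the inverse document frequency depends on the whole collection, the weights change whenever documents are added, removed, or filtered. That dependency is why computing the weights once during preprocessing becomes inaccurate for subsets, and why computing them at query time is expensive.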
