Big data, i.e. the collection, storage, and processing of data at scale, has recently become feasible due to the arrival of clusters of commodity computers powered by application-level distributed parallel operating systems such as HDFS/Hadoop/Spark, and such infrastructures have revolutionized data mining at scale. To help data mining projects succeed more consistently, several methodologies were developed (e.g. CRISP-DM, SEMMA, KDD), but these do not account for (1) very large scales of processing, (2) dealing with textual (unstructured) data (i.e. Natural Language Processing (NLP, "text analytics")), and (3) non-technical considerations (e.g. legal, ethical, and project-managerial aspects). To address these shortcomings, a new methodology called "Data to Value" (D2V) is introduced, which is guided by a detailed catalog of questions in order to avoid a disconnect between a big data text analytics project team and its topic when facing the rather abstract box-and-arrow diagrams commonly associated with methodologies.