Background: Big data analysis requires the ability to process large datasets, which corporations typically hold in formats tuned to their own use. Only very recently has big data drawn the attention of low-budget corporate groups and academia, who typically lack the money and resources to buy expensive licenses for big data analysis platforms such as SAS. Corporations continue to work with the SAS data format largely because of organizational history and because their existing code has been built on it. Data providers therefore continue to supply data in SAS formats. An acute need has arisen from this gap: the data are in SAS format, but the coders lack SAS expertise or training, because the economic and inertial forces that shaped these two groups of people have been different. Method: We analyze these differences and the resulting need for SasCsvToolkit, which generates a CSV file from SAS-format data so that data scientists can apply their skills in other tools that process CSVs, such as R, SPSS, or even Microsoft Excel. It also converts CSV files to SAS format. In addition, SAS database programmers often struggle to find the right method for database search, exact match, substring match, except conditions, filters, unique values, table joins, and data mining; for these tasks the toolkit provides template scripts that can be modified and run from the command line. Results: The toolkit has been implemented on the SLURM scheduler platform as a `bag-of-tasks` algorithm for parallel and distributed workflows, and a serial version is also included.
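To illustrate the `bag-of-tasks` idea named in the Results, the following is a minimal Python sketch, not the toolkit's own code. It assumes that each task is independent (here, an exact-match filter on one CSV table) and that idle workers simply take whatever task remains in a shared bag. The names `filter_rows` and `run_bag_of_tasks`, the sample tables, and the use of threads are all hypothetical; on a SLURM cluster the workers would presumably be separate jobs or job-array tasks, and that mapping is not shown here.

```python
import csv
import io
import queue
import threading


def filter_rows(csv_text, column, value):
    """Hypothetical task: keep rows whose `column` exactly equals `value`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column] == value]


def run_bag_of_tasks(tasks, num_workers=4):
    """Run independent tasks from a shared bag; return results keyed by task id.

    `tasks` maps a task id to a (function, args) pair. Each worker keeps
    taking any remaining task until the bag is empty, so no task depends
    on another and the completion order does not matter.
    """
    bag = queue.Queue()
    for task_id, (func, args) in tasks.items():
        bag.put((task_id, func, args))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, func, args = bag.get_nowait()
            except queue.Empty:
                return  # bag is empty: this worker is done
            outcome = func(*args)
            with lock:  # protect the shared results dictionary
                results[task_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Two small in-memory tables stand in for converted CSV files (assumed data).
    tables = {
        "a.csv": "id,state\n1,NY\n2,CA\n",
        "b.csv": "id,state\n3,CA\n4,TX\n",
    }
    tasks = {name: (filter_rows, (text, "state", "CA"))
             for name, text in tables.items()}
    for name, rows in sorted(run_bag_of_tasks(tasks, num_workers=2).items()):
        print(name, rows)
```

Because the bag is filled before any worker starts, an empty bag reliably means all tasks have been claimed. Setting `num_workers=1` gives the serial behaviour with the same interface.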