Learning from raw data input, thus limiting the need for manual feature engineering, is one of the key components of many successful applications of machine learning methods. While machine learning problems are often formulated on data that naturally translate into a vector representation suitable for classifiers, there are data sources, for example in cybersecurity, that are naturally represented as diverse files with a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting such data to a vector (tensor) representation is generally done by manual feature engineering, which is laborious, lossy, and prone to human bias about the importance of particular features. Mill and JsonGrinder are a tandem of libraries that fully automate the conversion. Starting with an arbitrary set of JSON samples, they create a differentiable machine learning model capable of inference on further JSON samples in their raw form.
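To make the idea of automated conversion concrete, the following is a minimal Python sketch of the general approach: infer a schema from a set of JSON samples, then map any further sample to a fixed-length vector by following that schema. It does not use or mirror the Mill or JsonGrinder API; the names `infer_schema`, `width`, and `embed` are hypothetical, and the fixed mean pooling over arrays merely stands in for the learned, differentiable layers the abstract refers to.

```python
import json

def infer_schema(samples):
    """Merge parsed JSON values into one schema describing their shared structure."""
    kinds = {type(s) for s in samples if s is not None}
    if not kinds:
        return {"kind": "empty"}
    if kinds <= {dict}:
        keys = sorted({k for s in samples if isinstance(s, dict) for k in s})
        return {"kind": "dict", "children": {
            k: infer_schema([s[k] for s in samples if isinstance(s, dict) and k in s])
            for k in keys}}
    if kinds <= {list}:
        items = [x for s in samples if isinstance(s, list) for x in s]
        return {"kind": "list", "items": infer_schema(items)}
    if kinds <= {int, float}:
        return {"kind": "number"}
    # Strings, booleans, and mixed types are treated as categories (an assumption).
    return {"kind": "categorical",
            "vocab": sorted({str(s) for s in samples if s is not None})}

def width(schema):
    """Length of the vector that embed() produces for this schema."""
    kind = schema["kind"]
    if kind == "dict":
        return sum(width(c) for c in schema["children"].values())
    if kind == "list":
        return width(schema["items"]) + 1       # pooled items + item count
    if kind == "number":
        return 1
    if kind == "categorical":
        return len(schema["vocab"]) + 1         # one-hot + slot for unseen values
    return 0

def embed(schema, value):
    """Map one JSON value to a fixed-length list of floats by following the schema."""
    kind = schema["kind"]
    if kind == "dict":
        value = value if isinstance(value, dict) else {}
        out = []
        for key, child in schema["children"].items():
            out.extend(embed(child, value.get(key)))  # missing keys embed as zeros
        return out
    if kind == "list":
        items = value if isinstance(value, list) else []
        rows = [embed(schema["items"], x) for x in items]
        if rows:
            mean = [sum(col) / len(rows) for col in zip(*rows)]
        else:
            mean = [0.0] * width(schema["items"])
        return mean + [float(len(rows))]
    if kind == "number":
        is_num = isinstance(value, (int, float)) and not isinstance(value, bool)
        return [float(value)] if is_num else [0.0]
    if kind == "categorical":
        vec = [0.0] * (len(schema["vocab"]) + 1)
        if value is not None:
            key = str(value)
            vocab = schema["vocab"]
            vec[vocab.index(key) if key in vocab else len(vocab)] = 1.0
        return vec
    return []

# Illustrative samples (invented for this sketch, not taken from any dataset).
samples = [json.loads(s) for s in (
    '{"name": "a.exe", "size": 10, "imports": [{"dll": "k32"}, {"dll": "u32"}]}',
    '{"name": "b.exe", "size": 30, "imports": [{"dll": "k32"}]}',
)]
schema = infer_schema(samples)
vectors = [embed(schema, s) for s in samples]
unseen = embed(schema, json.loads('{"name": "c.exe", "imports": []}'))
```

Every sample, including one with missing keys, an empty array, or an unseen category, maps to a vector of the same length, which is what allows a downstream classifier to consume raw JSON. A trainable version would replace the one-hot encodings and mean pooling with parameterized layers so that gradients flow through the whole hierarchy.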