When modeling time series data, we often need to augment the existing data records to improve model accuracy. In this work, we describe a number of techniques for extracting dynamic information about the current state of a large scientific workflow; these techniques could be generalized to other types of applications. The specific task to be modeled is the time needed to transfer a file from an experimental facility to a data center. The key idea of our approach is to find recent data transfer events that match the current event in some way. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared with similar models that use only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
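The matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Transfer` record, its field names, the choice of matching on source and destination, and the one-hour window are all assumptions made for illustration. The sketch looks up recent, already completed transfers that share recorded properties with the current event and summarizes them as dynamic features that could be appended to the static ones.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Transfer:
    """Hypothetical transfer record; all field names are illustrative."""
    start: float                        # start time, in seconds
    source: str                         # e.g. the experimental facility
    dest: str                           # e.g. the data center
    size_bytes: int
    duration_s: Optional[float] = None  # unknown for the event being predicted


def recent_match_features(current, history, window_s=3600.0):
    """Summarize recent past transfers that match the current event.

    A past transfer matches if it has the same source and destination,
    started within `window_s` seconds before the current event, and had
    already finished when the current event started (so no future
    information leaks into the features).
    """
    matches = [
        t for t in history
        if t.duration_s is not None and t.duration_s > 0
        and t.source == current.source
        and t.dest == current.dest
        and 0 < current.start - t.start <= window_s
        and t.start + t.duration_s <= current.start
    ]
    if not matches:
        return {"n_recent": 0, "mean_throughput": None}
    throughput = mean(t.size_bytes / t.duration_s for t in matches)
    return {"n_recent": len(matches), "mean_throughput": throughput}


# Usage: two matching transfers, one to a different destination.
history = [
    Transfer(start=0.0, source="A", dest="B", size_bytes=1000, duration_s=10.0),
    Transfer(start=100.0, source="A", dest="B", size_bytes=2000, duration_s=10.0),
    Transfer(start=100.0, source="A", dest="C", size_bytes=500, duration_s=5.0),
]
current = Transfer(start=200.0, source="A", dest="B", size_bytes=1500)
features = recent_match_features(current, history)
```

Requiring matched transfers to have completed before the current one starts is a design choice in this sketch; it keeps the derived features restricted to information that would be available at prediction time.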