Machine learning models are increasingly being adopted in many applications. The quality of these models depends critically on the data on which they are trained, and augmenting that data with external data gives us the opportunity to create better models. However, the massive number of datasets available on the Web makes it challenging to find data suitable for augmentation. In this demo, we present our ongoing efforts to develop a dataset search engine tailored for data augmentation. Our prototype, named Auctus, automatically discovers datasets on the Web and, unlike existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries. Auctus is already being used in a real deployment environment to improve the performance of ML models. The demonstration will include various real-world data augmentation examples, and visitors will be able to interact with the system.
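To make the two query types concrete, the following minimal sketch shows what join and union augmentation mean for a small table. It is an illustration only and does not use or represent the Auctus API; the tables, column names, and the helper functions `join_augment` and `union_augment` are all hypothetical.

```python
# Toy illustration of join vs. union augmentation.
# All data and function names are hypothetical; this is not the Auctus API.

# Training table supplied by the user.
base = [
    {"zip": "10001", "price": 850},
    {"zip": "10002", "price": 620},
]

# A discovered dataset that shares the "zip" column (join candidate).
income = [
    {"zip": "10001", "median_income": 72000},
    {"zip": "10002", "median_income": 41000},
]

# A discovered dataset with the same columns as `base` (union candidate).
more_rows = [
    {"zip": "10003", "price": 700},
]


def join_augment(left, right, key):
    """Left join: add the columns of `right` to rows of `left` sharing `key`."""
    index = {row[key]: row for row in right}
    # Values from `left` take precedence when a column exists in both tables.
    return [{**index.get(row[key], {}), **row} for row in left]


def union_augment(top, bottom):
    """Union: append rows of `bottom` whose columns match those of `top`."""
    columns = set(top[0])
    return top + [row for row in bottom if set(row) == columns]


joined = join_augment(base, income, "zip")   # same rows, one extra column
unioned = union_augment(base, more_rows)     # same columns, one extra row
```

A join adds new features (columns) to the existing examples, whereas a union adds new examples (rows) with the existing features; a search engine for augmentation needs to find candidates of both kinds.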