Synthetic data is essential for assessing clustering techniques: it complements and extends real data and allows more complete coverage of a given problem's space. Synthetic data generators, in turn, can create vast amounts of data, a crucial capability when real-world data is at a premium, while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present \textit{Clugen}, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. \textit{Clugen} is open source, 100\% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal produces rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
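To make the notion of a cluster "supported by a line segment" concrete, the following minimal NumPy sketch draws points along a segment and then displaces them laterally. It is an illustration only, not \textit{Clugen}'s algorithm or API: the function name `line_supported_cluster`, its parameters, and the choice of a uniform distribution along the segment and a Gaussian lateral displacement are all assumptions made for this example.

```python
import numpy as np


def line_supported_cluster(center, direction, length, lateral_std, num_points, rng):
    """Sample points scattered around a line segment in n dimensions.

    Hypothetical helper for illustration; not part of any Clugen package.
    """
    center = np.asarray(center, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)  # unit vector along the segment

    # Assumed distribution: positions uniform along the segment.
    t = rng.uniform(-length / 2, length / 2, size=num_points)
    on_line = center + t[:, None] * direction

    # Assumed distribution: Gaussian noise, with its component parallel to the
    # segment removed so that the displacement is purely lateral.
    noise = rng.normal(0.0, lateral_std, size=(num_points, center.size))
    noise -= (noise @ direction)[:, None] * direction

    return on_line + noise


rng = np.random.default_rng(42)
points = line_supported_cluster([0, 0, 0], [1, 1, 0], 10.0, 0.5, 200, rng)
```

Swapping the two sampling calls for other distributions changes the shape of the resulting cluster, which is the kind of flexibility the phrase "arbitrary distributions" refers to.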