In multi-user environments where data science and analysis are collaborative, multiple versions of the same dataset are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of the changes between versions has remained under-explored. In this work, we introduce \texttt{Explain-Da-V}, a framework that aims to explain the changes between two given dataset versions. \texttt{Explain-Da-V} generates \emph{explanations} that use \emph{data transformations} to account for these changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. Using an adapted existing benchmark and a newly created benchmark, we show empirically that \texttt{Explain-Da-V} generates better explanations than existing data transformation synthesis methods.
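To make the idea of explaining a version change through a data transformation concrete, the following is a minimal sketch in Python. It is not the method used by \texttt{Explain-Da-V}: the candidate set, the column names (`price`, `tax`, `qty`, `total`), the function `explain_new_column`, and the scoring rule (the fraction of rows a candidate reproduces exactly) are all illustrative assumptions. It also assumes that rows in the two versions are aligned and that the old version is non-empty.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]

# Hypothetical candidate transformations over the old version's columns.
# This tiny hand-written set is an assumption for illustration only.
CANDIDATES: Dict[str, Callable[[Row], float]] = {
    "price * 2": lambda r: r["price"] * 2,
    "price + tax": lambda r: r["price"] + r["tax"],
    "price * qty": lambda r: r["price"] * r["qty"],
}


def explain_new_column(
    old: List[Row], new: List[Row], target: str
) -> Tuple[Optional[str], float]:
    """Return the candidate that best reproduces new[target] from old rows.

    The score is the fraction of aligned rows reproduced exactly, an
    assumed stand-in for a validity-style measure.
    """
    best_desc: Optional[str] = None
    best_score = 0.0
    for desc, fn in CANDIDATES.items():
        matches = sum(fn(o) == n[target] for o, n in zip(old, new))
        score = matches / len(old)
        if score > best_score:
            best_desc, best_score = desc, score
    return best_desc, best_score


# Old version, and a new version that gained a "total" column.
old_version = [
    {"price": 10, "tax": 1, "qty": 2},
    {"price": 20, "tax": 2, "qty": 1},
    {"price": 30, "tax": 3, "qty": 3},
]
totals = [20, 20, 90]
new_version = [dict(row, total=t) for row, t in zip(old_version, totals)]

explanation, score = explain_new_column(old_version, new_version, "total")
print(explanation, score)
```

In this toy case the candidate `price * qty` reproduces every row of the new column, while `price * 2` matches only the first row, so the former would be preferred as the explanation under the assumed scoring rule.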