We consider the problem of identifying the units of measurement in a data column whose rows each contain a numeric value and a unit symbol, e.g., "5.2 l", "7 pints". In this setting we seek to identify the dimension of the column (e.g., volume) and to map the unit symbols to valid units (e.g., litre, pint) obtained from a knowledge graph. We present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns, and canonicalize their entries. We also present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.
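To make the task concrete, the following is a minimal sketch of the input and output involved: each cell is split into a value and a unit symbol, the symbol is looked up, and the column dimension is chosen by majority vote. It is a naive deterministic baseline and not PUC's probabilistic model. The `UNIT_TABLE` dictionary is a hypothetical stand-in for a knowledge graph, and the names `parse_cell` and `canonicalize_column` are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical toy lookup standing in for a knowledge graph:
# unit symbol -> (canonical unit name, dimension).
UNIT_TABLE = {
    "l": ("litre", "volume"),
    "litre": ("litre", "volume"),
    "ml": ("millilitre", "volume"),
    "pint": ("pint", "volume"),
    "pints": ("pint", "volume"),
    "m": ("metre", "length"),
    "kg": ("kilogram", "mass"),
}

# A number, optional whitespace, then a symbol that does not start
# with a digit, whitespace, or a dot.
CELL_PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([^\d\s.].*?)\s*$")


def parse_cell(cell):
    """Split a cell such as '5.2 l' into (5.2, 'l'); return None if it does not fit."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def canonicalize_column(cells):
    """Return (column dimension, list of (value, canonical unit) per row)."""
    parsed = [parse_cell(cell) for cell in cells]

    # Each recognised symbol votes for its dimension.
    votes = Counter()
    for item in parsed:
        if item is not None and item[1] in UNIT_TABLE:
            votes[UNIT_TABLE[item[1]][1]] += 1
    dimension = votes.most_common(1)[0][0] if votes else None

    rows = []
    for item in parsed:
        if item is None:
            rows.append(None)  # cell could not be parsed at all
            continue
        value, symbol = item
        unit, unit_dimension = UNIT_TABLE.get(symbol, (None, None))
        # Keep the unit only if it agrees with the column dimension.
        rows.append((value, unit if unit_dimension == dimension else None))
    return dimension, rows


dimension, rows = canonicalize_column(["5.2 l", "7 pints", "3 m"])
print(dimension)  # volume
print(rows)       # [(5.2, 'litre'), (7.0, 'pint'), (3.0, None)]
```

In this sketch the symbol "m" is left unresolved because it disagrees with the majority dimension. Handling such ambiguous or noisy symbols is where a fixed lookup falls short.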