Automatically translating the text metadata of Building Automation Systems is a crucial first step towards deploying smart building applications efficiently. The vocabulary used to describe building metadata appears small compared to that of general natural languages, but each term has multiple commonly used abbreviations. Conventional machine learning techniques are inefficient here because they must learn many different forms of the same word, which requires large amounts of training data. Standard techniques such as tokenisation are also difficult to apply, since a single input token is commonly associated with multiple output tags, something traditional sequence labelling models do not allow. Finite State Transducers can model sequence-to-sequence tasks in which the input and output sequences have different lengths, and they can be combined with language models to ensure that a valid output sequence is generated. We perform a preliminary analysis of the use of transducer-based language models to parse and normalise building point metadata.
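To make the length-mismatch point concrete, the following is a minimal sketch of a dictionary-style transducer, not the method analysed in this work. The lexicon entries, the function names `build_transducer` and `transduce`, and the point name `SATSP` are all hypothetical and chosen only for illustration. In place of a real language model, the sketch uses a simple stand-in preference for the path that uses the fewest lexicon entries.

```python
from functools import lru_cache

# Hypothetical lexicon: abbreviation -> normalised tags (illustrative only).
LEXICON = {
    "SA": ("supply", "air"),
    "SAT": ("supply", "air", "temperature"),
    "T": ("temperature",),
    "TEMP": ("temperature",),
    "SP": ("setpoint",),
    "STPT": ("setpoint",),
}


def build_transducer(lexicon):
    """Build a trie-shaped transducer.

    Each arc consumes one input character and emits nothing;
    each final state emits the tag tuple of one lexicon entry.
    """
    arcs = [{}]     # arcs[state][char] -> next state
    emits = [None]  # emits[state] -> tuple of tags if the state is final
    for abbreviation, tags in lexicon.items():
        state = 0
        for char in abbreviation:
            if char not in arcs[state]:
                arcs.append({})
                emits.append(None)
                arcs[state][char] = len(arcs) - 1
            state = arcs[state][char]
        emits[state] = tags
    return arcs, emits


def transduce(text, arcs, emits):
    """Return the tags of the accepting path with the fewest lexicon entries.

    Returns None when no sequence of lexicon entries covers the whole input.
    The fewest-entries rule is an assumed stand-in for a language model score.
    """
    @lru_cache(maxsize=None)
    def best_from(start):
        if start == len(text):
            return (0, ())
        best = None
        state = 0
        for end in range(start, len(text)):
            state = arcs[state].get(text[end])
            if state is None:
                break  # no lexicon entry continues with this character
            if emits[state] is not None:
                rest = best_from(end + 1)
                if rest is not None:
                    candidate = (rest[0] + 1, emits[state] + rest[1])
                    if best is None or candidate[0] < best[0]:
                        best = candidate
        return best

    result = best_from(0)
    return None if result is None else result[1]


arcs, emits = build_transducer(LEXICON)
tags = transduce("SATSP", arcs, emits)
```

Here the five input characters of `SATSP` yield the four output tags `supply`, `air`, `temperature`, `setpoint`, so the output length is not tied to the number of input symbols. The input can also be segmented as `SA` + `T` + `SP`, which is the kind of ambiguity a language model over the output sequence would be expected to resolve.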