Mathematical notation makes up a large portion of STEM literature, yet finding semantic representations for formulae remains a challenging problem. Because mathematical notation is precise and its meaning can change significantly when only a few characters change, methods that work for natural text do not necessarily work well for mathematical expressions. In this work, we describe an approach for representing mathematical expressions in a continuous vector space. We use the encoder of a sequence-to-sequence architecture, trained on visually different but mathematically equivalent expressions, to generate vector representations (embeddings). We compare this approach with an autoencoder and show that the sequence-to-sequence encoder is better at capturing mathematical semantics. Finally, to expedite future projects, we publish a corpus of equivalent transcendental and algebraic expression pairs.
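The gap between visual and mathematical similarity can be illustrated with a minimal sketch. It is not the method described in this work: the helper names `edit_distance` and `numerically_equivalent` are hypothetical, the Python-style expression strings are an assumed input format, and equivalence is only approximated by sampling random values of `x`. The sketch compares a character-level edit distance with a numerical equivalence check.

```python
import math
import random


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Functions the expression strings may call (an assumption of this sketch).
_SAFE_NAMES = {"sin": math.sin, "cos": math.cos, "exp": math.exp}


def numerically_equivalent(e1: str, e2: str, trials: int = 50,
                           tol: float = 1e-9, seed: int = 0) -> bool:
    """Heuristic check: do e1 and e2 agree at randomly sampled x?

    Agreement on samples is evidence of equivalence, not a proof.
    eval() is used for brevity; only pass trusted strings.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        env = dict(_SAFE_NAMES, x=rng.uniform(-2.0, 2.0))
        v1 = eval(e1, {"__builtins__": {}}, env)
        v2 = eval(e2, {"__builtins__": {}}, env)
        if not math.isclose(v1, v2, rel_tol=tol, abs_tol=tol):
            return False
    return True


pairs = [
    ("x**2", "x**3"),                  # one character apart, different meaning
    ("(x+1)**2", "x**2+2*x+1"),        # visually different, same meaning
    ("sin(x)**2 + cos(x)**2", "1"),    # visually different, same meaning
]
for left, right in pairs:
    print(left, "|", right, "|",
          edit_distance(left, right),
          numerically_equivalent(left, right))
```

A string-based similarity measure ranks the first pair as closest even though it is the only non-equivalent one, which is the behaviour an embedding trained on equivalent expression pairs is meant to avoid.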