Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a large amount of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on-the-fly, without an embedding matrix, using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-size embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket individual tokens together into shared embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient than standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant incurs a negligible performance degradation (0.4\% on GLUE) while using only 99.1K parameters to represent the embeddings, compared to the 12.3-38M parameters of state-of-the-art models.
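To make the bucketing idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a hashed embedding layer: token ids are hashed into a fixed number of buckets, so the embedding table's size is independent of the vocabulary. The class name HashEmbedding, the use of MD5 as the hash, and the parameter values are all illustrative assumptions; HashFormers' actual hashing functions and variants may differ.

    import hashlib

    import torch
    import torch.nn as nn


    class HashEmbedding(nn.Module):
        """Illustrative hashed embedding layer (assumed interface).

        Instead of one embedding row per vocabulary entry, every token id
        is hashed into one of `num_buckets` rows, so the parameter count
        is fixed regardless of vocabulary size.
        """

        def __init__(self, num_buckets: int, embedding_dim: int):
            super().__init__()
            self.num_buckets = num_buckets
            self.table = nn.Embedding(num_buckets, embedding_dim)

        def _bucket(self, token_id: int) -> int:
            # Cheap, deterministic hash of the token id; only illustrative,
            # the paper's exact hashing functions may differ.
            digest = hashlib.md5(str(token_id).encode()).hexdigest()
            return int(digest, 16) % self.num_buckets

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            buckets = torch.tensor(
                [[self._bucket(t.item()) for t in row] for row in token_ids],
                device=token_ids.device,
            )
            return self.table(buckets)


    # Usage: arbitrary token ids (even beyond any fixed vocabulary)
    # map onto a small, fixed-size embedding table.
    emb = HashEmbedding(num_buckets=1024, embedding_dim=128)
    ids = torch.tensor([[5, 70000, 123456789]])
    print(emb(ids).shape)  # torch.Size([1, 3, 128])

Under these assumptions, the embedding parameter count is num_buckets x embedding_dim and no longer grows with the vocabulary, which is the memory saving the abstract describes.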