Sequence labelling tasks such as Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call the Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (\texttt{SILICONE}). \texttt{SILICONE} is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles, a large corpus of spoken dialog containing over $2.3$ billion tokens. We demonstrate that hierarchical encoders achieve competitive results with consistently fewer parameters than state-of-the-art models, and we show their importance for both pre-training and fine-tuning.
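To make the hierarchical architecture concrete, the following is a minimal PyTorch sketch of a two-level transformer encoder for utterance-level sequence labelling: a token-level encoder contextualizes words within each utterance, and an utterance-level encoder contextualizes utterances across the dialog. The layer sizes, mean pooling, and linear classification head are illustrative assumptions, not the authors' exact configuration.
\begin{verbatim}
# Minimal sketch (not the authors' exact implementation) of a two-level
# hierarchical transformer encoder for utterance-level sequence labelling.
# Hyper-parameters, mean pooling, and the linear head are assumptions.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4,
                 n_layers=2, n_labels=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Low-level encoder: contextualizes tokens within each utterance.
        self.token_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # High-level encoder: contextualizes utterances across the dialog.
        self.utterance_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # One label (e.g., dialog act or emotion) per utterance.
        self.classifier = nn.Linear(d_model, n_labels)

    def forward(self, dialog_tokens):
        # dialog_tokens: (n_utterances, max_len) token ids of one dialog.
        token_states = self.token_encoder(self.embed(dialog_tokens))
        utt_vectors = token_states.mean(dim=1)  # pool tokens per utterance
        context = self.utterance_encoder(utt_vectors.unsqueeze(0))
        return self.classifier(context.squeeze(0))  # (n_utterances, n_labels)
\end{verbatim}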