We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests. The dataset contains 10,438 unique sentences covering 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models for cloze-test reading comprehension apply BERT with random initialization to learn embedding representations. Idioms and fixed expressions differ, however, in that the literal meaning of a phrase may or may not be consistent with its contextual meaning. We therefore explore different ways of combining static and contextual representations to build a stronger baseline model. Experiments show that combining definition and random initialization better supports cloze-test model performance for idioms, whether they are evaluated independently or mixed with fixed expressions. For fixed expressions with no special meaning, static embedding with random initialization is sufficient for the cloze-test model.
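To make the cloze-style format concrete, the following is a minimal sketch of how one multiple-choice item could be assembled from a sentence, a target phrase, and a set of distractors, together with a simple accuracy measure. It is an illustration only and not the InDEX construction pipeline: the names `ClozeItem`, `make_cloze_item`, and `accuracy` are hypothetical, the example sentence and distractors are invented for this sketch rather than taken from the dataset, and the 15 distractor types mentioned above are not modeled.

```python
import random
from dataclasses import dataclass

BLANK = "____"  # placeholder marker; an assumption, not the dataset's format


@dataclass
class ClozeItem:
    text: str          # sentence with the target phrase blanked out
    options: list      # candidate phrases: the answer plus distractors
    answer_index: int  # position of the correct phrase in `options`


def make_cloze_item(sentence, target, distractors, seed=0):
    """Blank out the first occurrence of `target` and shuffle the options."""
    if target not in sentence:
        raise ValueError("target phrase not found in sentence")
    text = sentence.replace(target, BLANK, 1)
    options = [target] + list(distractors)
    # A seeded generator keeps the option order reproducible.
    random.Random(seed).shuffle(options)
    return ClozeItem(text, options, options.index(target))


def accuracy(items, predictions):
    """Fraction of items whose predicted option index is the answer."""
    correct = sum(p == it.answer_index for it, p in zip(items, predictions))
    return correct / len(items)


# Hypothetical example: "kambing hitam" (literally "black goat") is an
# Indonesian idiom meaning "scapegoat"; the distractors are other idioms.
item = make_cloze_item(
    "Dia selalu dijadikan kambing hitam atas kesalahan orang lain.",
    "kambing hitam",
    ["buah tangan", "kepala batu", "tangan kanan"],
)
print(item.text)
print(item.options, item.answer_index)
```

A model under evaluation would return one option index per item, and `accuracy` compares those indices against `answer_index`.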