In the presented study, we discover that so called "transition freedom" metric appears superior for unsupervised tokenization purposes, compared to statistical metrics such as mutual information and conditional probability, providing F-measure scores in range from 0.71 to 1.0 across explored corpora. We find that different languages require different derivatives of that metric (such as variance and "peak values") for successful tokenization. Larger training corpora does not necessarily effect in better tokenization quality, while compacting the models eliminating statistically weak evidence tends to improve performance. Proposed unsupervised tokenization technique provides quality better or comparable to lexicon-based one, depending on the language.
翻译:在提交的研究报告中,我们发现,所谓的“过渡自由”指标对于未受监督的象征性化目的而言,与诸如相互信息和有条件概率等统计指标相比,在所探索的公司之间提供0.71至1.0的F度量计,在相互探索的公司之间提供0.71至1.0之间的F度量计。 我们发现,不同的语言要求该指标的不同衍生物(如差异和“峰值 ” ) 才能成功象征性化。 大型培训公司不一定在提高象征性化质量方面产生效果,而将那些在统计上薄弱的证据压缩起来往往会改善绩效。 拟议的未经监督的象征性化技术根据语言提供质量更好或更可比于以词汇为基础的技术。