Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category labels for frequent UGT-specific phenomena. Experiments on the corpus showed that existing MA/LN methods perform poorly on non-general words and non-standard forms, indicating that the corpus is a challenging benchmark for further research on UGT.