A large fraction of the textual data available today contains various types of 'noise', such as OCR noise in digitized documents and noise arising from the informal writing style of users on microblogging sites. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that needs neither training data nor human intervention. The proposed algorithm is applicable to text in different languages, and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalized by the proposed algorithm yields better retrieval and stance detection performance than text normalized by several baseline methods. An implementation of our algorithm is available at https://github.com/ranarag/UnsupClean.