Hashtag segmentation, also known as hashtag decomposition, is a common step in preprocessing pipelines for social media datasets. It usually precedes tasks such as sentiment analysis and hate speech detection. For sentiment analysis in medium- and low-resource languages, previous research has shown that a multilingual approach relying on machine translation can be competitive with, or superior to, earlier approaches to the task. We develop a zero-shot hashtag segmentation framework and demonstrate how it can be used to improve the accuracy of multilingual sentiment analysis pipelines. Our zero-shot framework establishes a new state of the art on hashtag segmentation datasets, surpassing even previous approaches that relied on feature engineering and language models trained on in-domain data.