Detecting Text Formality: A Study of Text Classification Approaches

Formality is an important characteristic of text documents. Automatically detecting the formality level of a text is potentially beneficial for various natural language processing tasks, such as retrieving texts with a desired formality level, integration into language learning and document editing platforms, or evaluating the desired conversational tone in chatbots. Recently, two large-scale datasets with formality annotation were introduced for multiple languages. So far, they have been used primarily for training style transfer models, yet detecting text formality on its own may also be a useful application. This work proposes the first systematic study of formality detection methods based on current (and more classic) machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that BiLSTM-based models outperform transformer-based ones on the formality classification task. We release formality detection models for several languages that yield state-of-the-art results and have tested cross-lingual capabilities.
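The abstract does not spell out how the three experiment types differ, so the following is only a minimal sketch under a common reading of the terms: monolingual trains and tests on the same language, multilingual trains on all languages pooled together, and cross-lingual trains without any data from the test language. The function name `make_splits`, the data layout, and the toy sentences are hypothetical and are not taken from the paper or its datasets.

```python
from typing import Dict, List, Tuple

# One example is (text, label); label 1 = formal, 0 = informal (assumed encoding).
Example = Tuple[str, int]
Corpus = Dict[str, Dict[str, List[Example]]]  # language -> {"train": [...], "test": [...]}


def make_splits(data: Corpus, setup: str, target: str) -> Tuple[List[Example], List[Example]]:
    """Return (train, test) examples for one target language under a given setup."""
    test = list(data[target]["test"])
    if setup == "monolingual":
        # Train and test on the same language.
        train = list(data[target]["train"])
    elif setup == "multilingual":
        # Train on the pooled training data of every language.
        train = [ex for lang in sorted(data) for ex in data[lang]["train"]]
    elif setup == "cross-lingual":
        # Train only on languages other than the one being tested.
        train = [ex for lang in sorted(data) if lang != target for ex in data[lang]["train"]]
    else:
        raise ValueError(f"unknown setup: {setup!r}")
    return train, test


# Hypothetical toy corpus, for illustration only.
toy: Corpus = {
    "en": {
        "train": [("Dear Sir, I am writing to inquire about the position.", 1),
                  ("hey whats up", 0)],
        "test": [("Kind regards, and thank you for your time.", 1)],
    },
    "fr": {
        "train": [("Je vous prie d'agréer mes salutations distinguées.", 1)],
        "test": [("salut ça va", 0)],
    },
}

mono_train, mono_test = make_splits(toy, "monolingual", "en")
multi_train, multi_test = make_splits(toy, "multilingual", "en")
cross_train, cross_test = make_splits(toy, "cross-lingual", "en")
```

Under this reading, any classifier (a BiLSTM or a transformer, for instance) can be trained on the returned `train` list and scored on `test`, so the three setups differ only in which languages contribute training data.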

