The short message service (SMS) was introduced to mobile phone users a generation ago. SMS forms the world's oldest large-scale network, with billions of users, and therefore attracts a lot of fraud. Due to the convergence of mobile networks with the internet, SMS-based scams can potentially compromise the security of internet services as well. In this study, we present a new SMS scam dataset consisting of 153,551 SMS messages. This dataset, which we will release publicly for research purposes, represents the largest publicly available SMS scam dataset. We evaluate and compare the performance achieved by several established machine learning methods on the new dataset, ranging from shallow machine learning approaches to deep neural networks to syntactic and semantic feature models. We then study the existing models from an adversarial viewpoint by assessing their robustness against different levels of adversarial manipulation. This perspective consolidates the current state of the art in SMS spam filtering and highlights the limitations of existing approaches and the opportunities to improve them.