A Bytecode-based Approach for Smart Contract Classification

With the development of blockchain technologies, the number of smart contracts deployed on blockchain platforms is growing exponentially, which makes it difficult for users to find desired services by manual screening. The automatic classification of smart contracts can provide blockchain users with keyword-based contract searching and helps to manage smart contracts effectively. Current research on smart contract classification focuses on Natural Language Processing (NLP) solutions which are based on contract source code. However, more than 94% of smart contracts are not open-source, so the application scenarios of NLP methods are very limited. Meanwhile, NLP models are vulnerable to adversarial attacks. This paper proposes a classification model based on features from contract bytecode instead of source code to solve these problems. We also use feature selection and ensemble learning to optimize the model. Our experimental studies on over 3,300 real-world Ethereum smart contracts show that our model can classify smart contracts without source code and has better performance than baseline models. Our model also has good resistance to adversarial attacks compared with NLP-based models. In addition, our analysis reveals that account features used in many smart contract classification models have little effect on classification and can be excluded.
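To make the idea of bytecode-derived features concrete, here is a minimal sketch of one plausible feature extractor: it walks deployed EVM bytecode and counts opcode occurrences, skipping the immediate operand bytes that follow PUSH1 through PUSH32. The abstract does not specify the paper's actual feature set, so the opcode-frequency representation, the function names `opcode_counts` and `frequency_vector`, and the deliberately small `OPCODE_NAMES` table are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, intentionally partial opcode table (illustration only).
# A real extractor would cover the full EVM instruction set.
OPCODE_NAMES = {
    0x00: "STOP", 0x01: "ADD", 0x52: "MSTORE", 0x54: "SLOAD",
    0x55: "SSTORE", 0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST",
    0xF3: "RETURN", 0xFD: "REVERT",
}


def opcode_counts(bytecode_hex: str) -> Counter:
    """Count opcodes in hex-encoded EVM bytecode (assumed feature source)."""
    hex_str = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(hex_str)
    counts = Counter()
    i = 0
    while i < len(code):
        op = code[i]
        if 0x60 <= op <= 0x7F:
            # PUSH1..PUSH32 carry 1..32 bytes of immediate data; skip them
            # so operand bytes are not miscounted as instructions.
            n = op - 0x5F
            counts[f"PUSH{n}"] += 1
            i += 1 + n
        else:
            # Opcodes missing from the partial table get a generic label.
            counts[OPCODE_NAMES.get(op, f"OP_{op:02x}")] += 1
            i += 1
    return counts


def frequency_vector(counts: Counter, vocabulary: list) -> list:
    """Turn counts into relative frequencies over a fixed opcode vocabulary."""
    total = sum(counts.values())
    return [counts[name] / total if total else 0.0 for name in vocabulary]


# "6080604052" decodes as PUSH1 0x80, PUSH1 0x40, MSTORE.
example = opcode_counts("0x6080604052")
features = frequency_vector(example, ["PUSH1", "MSTORE", "SSTORE"])
```

Vectors of this kind could then be passed to any feature-selection step and ensemble classifier; which ones the paper uses is not stated in the abstract.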
