The Internet of Things (IoT) is an emerging concept that refers to the billions of physical items, or "things", that are connected to the Internet and that gather and exchange information with other devices and systems. However, IoT devices were not built with security in mind, which can lead to security vulnerabilities in multi-device systems. Traditionally, IoT issues have been investigated by surveying IoT developers and specialists. This technique, however, is not scalable, since surveying all IoT developers is not feasible. Another way to study IoT issues is to examine IoT developer discussions on major online development forums such as Stack Overflow (SO). However, finding discussions that are relevant to IoT issues is challenging, since they are frequently not categorized with IoT-related terms. In this paper, we present the "IoT Security Dataset", a domain-specific dataset of 7147 samples focused solely on IoT security discussions. As there are no automated tools to label these samples, we labeled them manually. We then employed multiple transformer models to automatically detect security discussions. Through rigorous investigation, we found that IoT security discussions are different from, and more complex than, traditional security discussions. We demonstrated a considerable performance loss (up to 44%) of transformer models on cross-domain datasets when transferring knowledge from the general-purpose dataset "Opiner", which supports this claim. We therefore built a domain-specific IoT security detector with an F1-Score of 0.69. We have made the dataset public in the hope that developers will learn more about IoT security discussions and that vendors will pay closer attention to product security.