The rapidly evolving Android malware ecosystem demands high-quality, real-time datasets as a foundation for effective detection and defense. As mobile devices are widely adopted across industrial systems, they have become a critical yet often overlooked attack surface in industrial cybersecurity. However, mainstream datasets widely used in academia and industry (e.g., Drebin) have two significant limitations: their labels rely heavily on VirusTotal's multi-engine aggregation results, which introduces substantial label noise, and their samples are outdated, which reduces their temporal relevance. Moreover, automated labeling tools (e.g., AVClass2) use suboptimal aggregation strategies, which compounds labeling errors and propagates inaccuracies throughout the research community.