Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand how well an attacker can identify homoglyphs, particularly ones that have not been previously spotted, and leverage them in attacks. We investigate a deep-learning model that uses embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which let security practitioners look up homoglyphs or normalize confusable string encodings more efficiently than pairwise information does. To measure clustering performance, we propose a metric (mBIOU) that builds on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. The source code and the list of predicted homoglyphs are available on GitHub: https://github.com/PerryXDeng/weaponizing_unicode
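To make the clustering metric more concrete, the sketch below computes the classic IOU between two character sets and then averages, over the reference classes, the best IOU that any predicted cluster attains. The abstract does not define mBIOU, so this "mean best IOU" reading is an assumption, and the names `iou` and `mean_best_iou` and the toy clusters are hypothetical, not taken from the released code.

```python
from typing import List, Set


def iou(a: Set[str], b: Set[str]) -> float:
    """Classic Intersection-Over-Union (Jaccard index) of two sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def mean_best_iou(reference: List[Set[str]], predicted: List[Set[str]]) -> float:
    """ASSUMED reading of a mean-best-IOU score: for each reference class,
    take the highest IOU achieved by any predicted cluster, then average."""
    if not reference:
        return 0.0
    total = 0.0
    for ref in reference:
        total += max((iou(ref, pred) for pred in predicted), default=0.0)
    return total / len(reference)


# Toy equivalence classes (illustrative only):
# Latin "a" / Cyrillic "а", and Latin "o" / Cyrillic "о" / Greek "ο".
reference = [{"a", "\u0430"}, {"o", "\u043e", "\u03bf"}]
# A hypothetical prediction that splits the second class in two.
predicted = [{"a", "\u0430"}, {"o", "\u043e"}, {"\u03bf"}]

score = mean_best_iou(reference, predicted)
print(f"mean best IOU = {score:.3f}")
```

In this toy case the first class is recovered exactly (IOU 1) and the second is matched best by a cluster covering two of its three characters (IOU 2/3), so the score penalizes the split without dropping to zero.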