Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution

When using a public communication channel -- whether formal or informal, such as commenting or posting on social media -- end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes the utmost precautions to anonymize their online presence -- using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting -- one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to establish their actual identity, right? Wrong. The content of their message -- necessarily open for public consumption -- exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss its counter-strategy, adversarial stylometry, and devise enhancements to that counter-strategy through Unicode steganography.
