Since the 1990s, keyword-based search engines have helped people locate relevant web content with a simple query, and more recent full-text search engines, used mainly for plagiarism detection, do the same for an uploaded article. However, these "free" or paid services store users' search queries and preferences for personal profiling and targeted advertising, and user-uploaded articles further benefit the service providers by expanding their databases. In short, privacy has not been an option in web search for decades. Here we demonstrate that a database or internet search that takes an entire article as the query can be carried out correctly without revealing users' sensitive queries, by combining an irreversible encoding scheme with an efficient FM-index search routine of the kind generally used in next-generation sequencing (NGS) of genomes. In our solution, Sapiens Aperio Veritas Engine (S.A.V.E.), every word in the query is encoded on the user's local machine into one of 12 "amino acids" (a.a.), and the encoded words together form a pseudo-biological sequence (PBS). For PBS-mediated plagiarism detection, users submit the locally encoded PBS to our cloud service, which locates identical duplicates in collected web content that has been encoded in the same way as the query. We find that PBSs longer than 12 a.a. return correct results with a false positive rate <0.8%. S.A.V.E. runs at a speed similar to Bowtie and is 4 orders of magnitude faster than BLAST. S.A.V.E., which functions in both regular and in-private search modes, provides a new option for efficient internet search and plagiarism detection in a compressed search space, without any chance of storing or revealing users' confidential content. We expect that future privacy-aware search engines can draw on the ideas proposed herein. S.A.V.E. is available at https://dyn.life.nthu.edu.tw/SAVE/
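The encode-then-match idea can be illustrated with a minimal Python sketch. It is not the S.A.V.E. implementation: the 12-letter alphabet, the hash-based word-to-symbol mapping, and the function names `encode_word`, `encode_text`, and `find_matches` are all assumptions made for illustration, and a plain substring scan stands in for the FM-index search.

```python
import hashlib
import re

# Hypothetical 12-symbol alphabet; the letters S.A.V.E. actually uses
# are not stated in the text above.
ALPHABET = "ACDEFGHIKLMN"


def encode_word(word: str) -> str:
    """Map a word to one of 12 symbols.

    Assumption: a hash modulo 12 stands in for the real encoding. Many
    words share each symbol, so the word cannot be recovered from it.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return ALPHABET[int.from_bytes(digest, "big") % len(ALPHABET)]


def encode_text(text: str) -> str:
    """Encode a text into a pseudo-biological sequence, one symbol per word."""
    return "".join(encode_word(w) for w in re.findall(r"\w+", text))


def find_matches(query_pbs: str, corpus_pbs: str) -> list[int]:
    """Return every word offset where the query PBS occurs in the corpus PBS.

    A linear substring scan is used here only for brevity; it is a
    stand-in for an FM-index lookup.
    """
    hits = []
    start = corpus_pbs.find(query_pbs)
    while start != -1:
        hits.append(start)
        start = corpus_pbs.find(query_pbs, start + 1)
    return hits


if __name__ == "__main__":
    document = "alpha beta gamma delta epsilon zeta eta theta"
    query = "gamma delta epsilon"
    # Only the encoded query would leave the user's machine.
    print(find_matches(encode_text(query), encode_text(document)))
```

In this sketch only `encode_text(query)` would be sent to the server, which compares it against content encoded the same way. A query of just a few words can match unrelated text by chance, which is consistent with the observation above that longer PBSs give more reliable results.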