Despite its constant evolution, similarity search research continues to face the same challenges stemming from the complexity of the data, such as the curse of dimensionality and computationally expensive distance functions. Various machine learning techniques have proven capable of replacing elaborate mathematical models with combinations of simple linear functions, often gaining speed and simplicity at the cost of formal guarantees on the accuracy and correctness of query results. The authors explore the potential of this research trend by presenting a lightweight solution to the complex problem of 3D protein structure search. The solution consists of three steps: (i) transformation of 3D protein structural information into very compact vectors, (ii) use of a probabilistic model to group these vectors and answer queries by returning a given number of similar objects, and (iii) a final filtering step that applies basic vector distance functions to refine the result.
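The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the embedding is a normalized histogram of pairwise distances, the "probabilistic model" is a softmax over distances to k-means centroids (a stand-in for whatever learned model the paper uses), and the final filter is plain Euclidean distance. All function names and parameters (`embed`, `build_index`, `search`, `bins`, `max_dist`, `n_candidates`) are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Basic vector distance, used for grouping and for the final filter."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def embed(coords, bins=8, max_dist=20.0):
    """Step (i), assumed embedding: normalized histogram of pairwise distances."""
    hist = [0.0] * bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = euclidean(coords[i], coords[j])
            # Distances beyond max_dist fall into the last bin.
            hist[min(int(d / max_dist * bins), bins - 1)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means, used here only to obtain bucket centroids."""
    centroids = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: euclidean(v, centroids[c]))
            groups[nearest].append(v)
        # Keep the old centroid when a group ends up empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def bucket_probabilities(vector, centroids):
    """Step (ii), stand-in model: softmax over negative centroid distances."""
    scores = [-euclidean(vector, c) for c in centroids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_index(vectors, k):
    """Assign every vector to its most probable bucket."""
    centroids = kmeans(vectors, k)
    buckets = [[] for _ in range(k)]
    for idx, v in enumerate(vectors):
        probs = bucket_probabilities(v, centroids)
        buckets[probs.index(max(probs))].append(idx)
    return centroids, buckets


def search(query, vectors, centroids, buckets, n_candidates, k):
    """Visit buckets by descending probability, then apply step (iii)."""
    probs = bucket_probabilities(query, centroids)
    order = sorted(range(len(centroids)), key=lambda c: -probs[c])
    candidates = []
    for c in order:
        candidates.extend(buckets[c])
        if len(candidates) >= n_candidates:
            break
    # Step (iii): refine the candidate set with a basic vector distance.
    candidates.sort(key=lambda i: euclidean(query, vectors[i]))
    return candidates[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic "structures": 30 random point clouds of 12 points each.
    structures = [
        [tuple(rng.uniform(0, 15) for _ in range(3)) for _ in range(12)]
        for _ in range(30)
    ]
    vectors = [embed(s) for s in structures]
    centroids, buckets = build_index(vectors, k=4)
    print(search(vectors[0], vectors, centroids, buckets, n_candidates=10, k=3))
```

As in the approach described above, the sketch trades guarantees for speed: only the most probable buckets are scanned, so a true nearest neighbor that sits in an unvisited bucket is missed. Raising `n_candidates` widens the scan and reduces that risk at the cost of more distance computations in the filtering step.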