Content-and-Structure (CAS) queries are a frequent type of query on semi-structured hierarchical data: they filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To obtain an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store the interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, the world's largest publicly available source code archive.
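To make the notion of a CAS query concrete, the following minimal sketch evaluates one by brute force over a small list of composite keys. It is not the RSCAS index and does not implement dynamic interleaving; it only illustrates the query semantics (a path predicate combined with a value range) that such an index is meant to answer without scanning every key. The pattern syntax (`**` for the descendant axis, `*` as a wildcard within one label), the function names, and the sample paths and values are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence, Tuple

# A composite key with a path dimension and a value dimension.
# Both the layout and the sample data are hypothetical.
Key = Tuple[str, int]

ITEMS: List[Key] = [
    ("/src/main.c", 120),
    ("/src/fs/ext4/inode.c", 4800),
    ("/src/fs/ext4/README", 300),
    ("/docs/intro.md", 4500),
    ("/src/net/tcp.c", 9000),
]


def _match_labels(pattern: Sequence[str], labels: Sequence[str]) -> bool:
    """Match path labels against pattern labels, one label at a time."""
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # Assumed descendant-axis semantics: skip zero or more labels.
        return any(_match_labels(rest, labels[i:]) for i in range(len(labels) + 1))
    # Any other label may contain '*' wildcards, handled by fnmatchcase.
    return (
        bool(labels)
        and fnmatchcase(labels[0], head)
        and _match_labels(rest, labels[1:])
    )


def match_path(pattern: str, path: str) -> bool:
    """Return True if the path satisfies the (assumed) pattern syntax."""
    return _match_labels(pattern.strip("/").split("/"), path.strip("/").split("/"))


def cas_query(items: Sequence[Key], pattern: str, low: int, high: int) -> List[Key]:
    """Brute-force CAS query: path predicate AND value in [low, high]."""
    return [
        (path, value)
        for path, value in items
        if low <= value <= high and match_path(pattern, path)
    ]


# All C files anywhere below /src whose value lies between 100 and 5000.
print(cas_query(ITEMS, "/src/**/*.c", 100, 5000))
```

The scan above inspects every key regardless of how selective the path or value predicate is. An index that considers only one dimension first would do well when that predicate is selective and poorly otherwise, which is the imbalance the abstract's robustness claim addresses.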