Distributed data analytics platforms such as Apache Spark enable cost-effective data processing and storage. These platforms allow users to distribute data across multiple nodes and to execute arbitrary code over this distributed data. However, such capabilities create new security and privacy challenges. First, user-submitted code may contain malicious logic designed to circumvent existing security checks. Second, fine-grained access control for different types of data (e.g., text and images) may not be feasible across all data storage options. To address these challenges, we provide a fine-grained access control framework tailored for distributed data analytics platforms and protected against evasion attacks by two distinct layers of defense. Access control is implemented by injecting access control logic into a submitted data analysis job at runtime. The proactive security layer uses state-of-the-art program analysis to detect potentially malicious user code. The reactive security layer consists of binary integrity checking, instrumentation-based runtime checks, and sandboxed execution. To the best of our knowledge, this is the first work that provides fine-grained attribute-based access control for distributed data analytics platforms using code rewriting and static program analysis. Furthermore, we evaluate the performance of our security system under different settings and show that the overhead introduced by the added security is low.
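
As a rough illustration of the idea of injecting attribute-based access control logic into a submitted job, the following minimal Python sketch wraps a user-supplied analysis function so that it only ever sees records the user is permitted to read. It is not the framework described above, which rewrites Spark job code; a plain function wrapper is used here as a stand-in. All names (`Record`, `User`, `is_permitted`, `inject_access_control`, `count_words`) and the policy itself are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    owner_dept: str   # attribute of the data (hypothetical)
    sensitivity: int  # attribute of the data (hypothetical)
    payload: str


@dataclass(frozen=True)
class User:
    dept: str       # attribute of the requesting user (hypothetical)
    clearance: int  # attribute of the requesting user (hypothetical)


def is_permitted(user: User, record: Record) -> bool:
    """Assumed example policy: same department and sufficient clearance."""
    return user.dept == record.owner_dept and user.clearance >= record.sensitivity


def inject_access_control(
    job: Callable[[Iterable[Record]], int], user: User
) -> Callable[[Iterable[Record]], int]:
    """Return a guarded version of `job` that filters records before the job runs."""

    def guarded_job(records: Iterable[Record]) -> int:
        # The policy check runs on every record before user code can observe it.
        permitted = (r for r in records if is_permitted(user, r))
        return job(permitted)

    return guarded_job


def count_words(records: Iterable[Record]) -> int:
    """Stand-in for user-submitted analysis code."""
    return sum(len(r.payload.split()) for r in records)


records = [
    Record("finance", 1, "quarterly revenue report"),
    Record("finance", 3, "draft merger terms"),
    Record("hr", 1, "staff directory"),
]
analyst = User(dept="finance", clearance=2)

guarded = inject_access_control(count_words, analyst)
print(guarded(records))      # 3: only the first record passes the policy
print(count_words(records))  # 8: the unguarded job sees every record
```

A wrapper like this only enforces the policy if user code cannot bypass it, for example by reading the underlying storage directly. That gap is what the proactive and reactive security layers described above are meant to close.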