Distributed data analytics platforms such as Apache Spark enable cost-effective data processing and storage. These platforms let users distribute data across multiple nodes and execute arbitrary code over that distributed data. However, such capabilities create new security and privacy challenges. First, user-submitted code may contain malicious logic designed to circumvent existing security checks. Second, providing fine-grained access control for different types of data (e.g., text and images) may not be feasible across different data storage options. To address these challenges, we provide a fine-grained access control framework tailored for distributed data analytics platforms, protected against evasion attacks by two distinct layers of defense. Access control is implemented by injecting access control logic at runtime into a submitted data analysis job. The proactive security layer uses state-of-the-art program analysis to detect potentially malicious user code. The reactive security layer consists of binary integrity checking, instrumentation-based runtime checks, and sandboxed execution. To the best of our knowledge, this is the first work that provides fine-grained attribute-based access control for distributed data analytics platforms using code rewriting and static program analysis. Furthermore, we evaluate the performance of our security system under different settings and show that the performance overhead due to the added security is low.
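The idea of injecting access control logic into a submitted job can be illustrated with a minimal sketch. It is not the framework's implementation: it uses plain Python rather than Spark, and the names `Policy`, `enforce`, and `inject_access_control` are hypothetical. It assumes a policy that combines a role attribute, a row-level predicate, and a set of fields to mask, and it wraps the user's job so that the job only ever sees filtered and masked records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    """Hypothetical attribute-based rule (an assumption for illustration)."""
    required_role: str                    # attribute the user must hold
    row_filter: Callable[[Record], bool]  # which records are visible
    masked_fields: FrozenSet[str]         # fields hidden from the job


def enforce(policy: Policy, user_attrs: Dict[str, List[str]],
            records: Iterable[Record]) -> Iterator[Record]:
    """Check the user's attributes, then filter rows and mask fields."""
    if policy.required_role not in user_attrs.get("roles", []):
        raise PermissionError("user lacks role: " + policy.required_role)
    return (
        {k: (None if k in policy.masked_fields else v) for k, v in rec.items()}
        for rec in records
        if policy.row_filter(rec)
    )


def inject_access_control(job, policy: Policy, user_attrs):
    """Return a rewritten job that reads its input through `enforce`."""
    def guarded_job(records: Iterable[Record]):
        return job(enforce(policy, user_attrs, records))
    return guarded_job


# Toy data and a user-submitted job that tries to read a sensitive field.
records = [
    {"patient": "a", "dept": "cardiology", "ssn": "111", "age": 40},
    {"patient": "b", "dept": "oncology", "ssn": "222", "age": 60},
    {"patient": "c", "dept": "cardiology", "ssn": "333", "age": 50},
]
policy = Policy("analyst", lambda r: r["dept"] == "cardiology",
                frozenset({"ssn"}))


def user_job(rows: Iterable[Record]):
    rows = list(rows)
    return {
        "count": len(rows),
        "ssns": [r["ssn"] for r in rows],
        "mean_age": sum(r["age"] for r in rows) / len(rows),
    }


guarded = inject_access_control(user_job, policy, {"roles": ["analyst"]})
result = guarded(records)
print(result)  # {'count': 2, 'ssns': [None, None], 'mean_age': 45.0}
```

In this sketch the user's code runs unchanged but never receives the oncology record or any `ssn` value. A wrapper like this does not stop code that bypasses the wrapped input, which is the kind of evasion the proactive and reactive layers described above are meant to address.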