Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will be problematic until after the fact. On the other hand, frameworks can trace everything and later keep only the interesting edge-case traces (tail sampling), but this imposes high overhead on the traced application and incurs enormous data ingestion costs. In this paper, we circumvent this trade-off for any edge case whose symptoms can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight, always-on distributed tracing system, Hindsight, which implements a retroactive sampling abstraction: instead of eagerly ingesting and processing traces, Hindsight lazily retrieves trace data only after symptoms of a problem are detected. Hindsight is analogous to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Developers using Hindsight receive the exact edge-case traces they desire without undue overhead or dependence on luck. Our evaluation shows that Hindsight scales to millions of requests per second, adds nanosecond-level overhead to generate trace data, handles GB/s of data per node, transparently integrates with existing distributed tracing systems, and successfully persists full, detailed traces in real-world use cases when edge-case problems are detected.
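To make the retroactive sampling idea concrete, the following is a minimal single-process sketch, not Hindsight's actual implementation or API. The names `RetroactiveBuffer`, `record`, `trigger`, and `handle_request`, the fixed trace capacity, and the 500 ms latency threshold are all hypothetical choices made for illustration. The sketch shows only the core behavior described above: trace data is always recorded cheaply in local memory, and it is persisted only if a symptom is detected before the data is evicted.

```python
from collections import OrderedDict


class RetroactiveBuffer:
    """Hypothetical per-node buffer (illustration only): keeps recent trace
    data in memory and persists a trace only when a trigger fires for it."""

    def __init__(self, capacity):
        self.capacity = capacity      # max number of traces held locally
        self.traces = OrderedDict()   # trace_id -> list of events, oldest first
        self.persisted = {}           # stand-in for a tracing backend

    def record(self, trace_id, event):
        # Always-on path: append to local memory, nothing is ingested yet.
        if trace_id not in self.traces:
            if len(self.traces) >= self.capacity:
                self.traces.popitem(last=False)  # evict the oldest trace
            self.traces[trace_id] = []
        self.traces[trace_id].append(event)

    def trigger(self, trace_id):
        # A symptom was detected after the fact: retrieve and persist lazily.
        events = self.traces.pop(trace_id, None)
        if events is None:
            return False              # data was already evicted
        self.persisted[trace_id] = events
        return True


def handle_request(buf, trace_id, latency_ms, slow_threshold_ms=500):
    """Record events for one request and trigger on high latency
    (the threshold is an assumed example of a detectable symptom)."""
    buf.record(trace_id, ("start", 0))
    buf.record(trace_id, ("end", latency_ms))
    if latency_ms > slow_threshold_ms:
        buf.trigger(trace_id)
```

In this sketch, a request that finishes quickly leaves its events in the buffer until they are overwritten, while a slow request has its full event list moved to `persisted`. A trigger that arrives after eviction returns `False`, which mirrors the dash-cam analogy: only footage still in the buffer can be saved.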