Benchmarking and evaluating deep learning models and systems necessitates a meticulous approach to ensure comprehensive assessment. In practical applications, it is paramount to consider both inference quality and inference time, particularly in critical contexts where stringent requirements demand that both metrics be satisfied simultaneously. Neglecting either aspect can result in severe and irreversible consequences, including loss of human life and property damage. Unfortunately, many studies lack a comprehensive consideration of these metrics and are often conducted under ideal or permissive conditions, leading to incomplete or non-intuitive evaluation methodologies. This study reveals that deep learning inference quality exhibits fluctuations, which further complicates benchmarking and evaluation. To better characterize this phenomenon, we introduce the concept of "tail quality", denoting the quality at the tail of the quality distribution. "Tail quality" offers a more objective evaluation, overcoming the limitations of conventional inference quality and inference time metrics in capturing the quality fluctuation phenomenon. To capture this phenomenon, this paper also proposes a pioneering evaluation framework for the comprehensive assessment and analysis of the various factors affecting inference time and quality. Leveraging this framework makes it possible to anticipate the potential distributions of inference time and inference quality, and thus to capture "tail quality", before deploying deep learning in practice. We validate the effectiveness of the evaluation framework through experiments on deep learning models for three different tasks across four systems. Furthermore, employing this evaluation framework, we conduct a preliminary analysis of several factors influencing inference quality and inference time.
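The notion of "tail quality" above can be read as a percentile statistic over the quality scores of repeated inference runs, analogous to tail latency. The sketch below is only an illustrative interpretation under that assumption: the function name `tail_quality`, the percentile formulation, and the simulated accuracy data are hypothetical, not the paper's actual definition.

```python
import random
import statistics

def tail_quality(qualities, tail_pct=1.0):
    """Hypothetical tail-quality metric: the quality score at the
    lower tail_pct-th percentile of the empirical quality distribution,
    i.e. the level that all but the worst tail_pct% of runs exceed."""
    ordered = sorted(qualities)  # ascending: worst runs first
    k = int(len(ordered) * tail_pct / 100.0)
    return ordered[k]

# Simulate per-run accuracies of one model evaluated repeatedly:
# most runs score 0.92, but ~5% of runs degrade to 0.70
# (a stand-in for the quality fluctuation the abstract describes).
random.seed(0)
runs = [0.92 if random.random() > 0.05 else 0.70 for _ in range(1000)]

avg = statistics.mean(runs)      # conventional average quality
tq = tail_quality(runs, 1.0)     # quality at the 1% lower tail
```

The point of the comparison is that `avg` hides the degraded runs, while `tq` surfaces the quality a deployment can actually count on in the worst cases.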