Deploying local large language models and vision-language models on edge devices requires balancing accuracy against constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware, including laptops, desktops, industrial controllers, and embedded systems, relies on central processing units (CPUs). Despite this, the computational laws governing CPU-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative CPU tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) the computational cost of language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and drops sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62% while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal CPU-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
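The abstract describes a measurement methodology based on continuous sampling of processor and memory usage with area-under-curve integration. The following is a minimal sketch of that idea, not the authors' actual instrumentation: it assumes the `psutil` and `numpy` libraries, a hypothetical `run_inference()` callable standing in for one language- or vision-language-model inference call, and an illustrative sampling interval.

```python
# Minimal sketch of sampling-plus-AUC measurement (illustrative only; the paper's
# own tooling is not shown here). Assumes psutil, numpy, and a user-supplied
# run_inference() callable.
import time
import threading
import psutil
import numpy as np

def measure_auc(run_inference, sample_interval=0.05):
    """Sample CPU and memory while run_inference() executes, then integrate
    each trace over time (trapezoidal rule) to get area-under-curve costs."""
    timestamps, cpu_trace, mem_trace = [], [], []
    proc = psutil.Process()
    done = threading.Event()

    def sampler():
        psutil.cpu_percent(interval=None)  # prime the counter; first call returns 0.0
        while not done.is_set():
            timestamps.append(time.monotonic())
            cpu_trace.append(psutil.cpu_percent(interval=None))  # system-wide CPU %
            mem_trace.append(proc.memory_info().rss / 2**20)     # resident set, MiB
            time.sleep(sample_interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = run_inference()  # e.g. one CPU-only LLM or VLM inference call
    done.set()
    t.join()

    ts = np.array(timestamps) - timestamps[0]
    cpu_auc = np.trapz(cpu_trace, ts)  # CPU load integrated over time (%·s)
    mem_auc = np.trapz(mem_trace, ts)  # memory footprint integrated over time (MiB·s)
    return result, cpu_auc, mem_auc
```

Under this sketch, sweeping `run_inference` over increasing token lengths (for language models) or image resolutions (for vision-language models) and plotting the returned AUC values is one way to reproduce the kind of scaling curves the abstract refers to.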