Large language models (LLMs) have shown remarkable abilities to generate code, however their ability to develop software for embedded systems, which requires cross-domain knowledge of hardware and software has not been studied. In this paper we develop an extensible, open source hardware-in-the-loop framework to systematically evaluate leading LLMs (GPT-3.5, GPT-4, PaLM 2) to assess their capabilities and limitations for embedded system development. We observe through our study that even when these tools fail to produce working code, they consistently generate helpful reasoning about embedded design tasks. We leverage this finding to study how human programmers interact with these tools, and develop an human-AI based software engineering workflow for building embedded systems. Our evaluation platform for verifying LLM generated programs uses sensor actuator pairs for physical evaluation. We compare all three models with N=450 experiments and find surprisingly that GPT-4 especially shows an exceptional level of cross-domain understanding and reasoning, in some cases generating fully correct programs from a single prompt. In N=50 trials, GPT-4 produces functional I2C interfaces 66% of the time. GPT-4 also produces register-level drivers, code for LoRa communication, and context-specific power optimizations for an nRF52 program resulting in over 740x current reduction to 12.2uA. We also characterize the models' limitations to develop a generalizable human-AI workflow for using LLMs in embedded system development. We evaluate our workflow with 15 users including novice and expert programmers. We find that our workflow improves productivity for all users and increases the success rate for building a LoRa environmental sensor from 25% to 100%, including for users with zero hardware or C/C++ experience.
翻译:暂无翻译