The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.
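To make the aggregation-and-distillation mechanism concrete, below is a minimal sketch under stated assumptions: community members export their chain-of-thought traces as portable (prompt, reasoning, answer) records, pool them, and fine-tune a small open student model on the full traces so it learns the intermediate reasoning as well as the final answers. The record format, the gpt2 placeholder model, and all hyperparameters are illustrative, not the paper's actual experimental setup.

```python
# A minimal sketch of community CoT distillation, assuming members export
# their (prompt, reasoning, answer) traces as portable records and pool them.
# The record format, placeholder model, and hyperparameters are illustrative.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical pooled export aggregated across community members.
community_traces = [
    {"prompt": "Q: If a train travels 60 km in 40 minutes, what is its speed?",
     "reasoning": "40 minutes is 2/3 of an hour; 60 km / (2/3 h) = 90 km/h.",
     "answer": "90 km/h"},
    # ... more records contributed by other members
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder student model
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def to_training_text(rec):
    # Train on the full trace so the student learns the intermediate
    # reasoning steps, not just the final answer.
    return f"{rec['prompt']}\n{rec['reasoning']}\n{rec['answer']}{tokenizer.eos_token}"

def collate(batch):
    enc = tokenizer([to_training_text(r) for r in batch], return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(community_traces, batch_size=4, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # a few passes over the pooled traces
    for batch in loader:
        loss = model(**batch).loss  # standard next-token cross-entropy
        loss.backward()
        optim.step()
        optim.zero_grad()
```

The design choice worth noting is that the distillation targets include the reasoning text itself, not only the answers; varying how much of that intermediate text is retained corresponds to the reasoning-granularity axis the paper investigates.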