Scholarly writing is a complex process that generally follows a methodical procedure for planning and producing compositions that are both rationally sound and creative. Recent work on large language models (LLMs) demonstrates considerable success in text generation and revision tasks; however, LLMs still struggle to provide the document-level structural and creative feedback that is crucial to academic writing. In this paper, we introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. We also provide ManuScript, an original dataset annotated with a simplified version of our taxonomy to show writer actions and the intentions behind them. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow and identify the distinct writer activities embedded within each higher-level process. ManuScript is intended to provide a complete picture of the scholarly writing process by capturing both the linearity and the non-linearity of writing trajectories, so that writing assistants can provide stronger feedback and suggestions at an end-to-end level. The collected writing trajectories can be viewed at https://minnesotanlp.github.io/REWARD_demo/
Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts