Step-Audio-EditX技术报告 (Step-Audio-EditX Technical Report) - 专知论文

会员服务 ·

0

间隔 · 语音合成 · 报告 · 语言特征 · 鲁棒 ·

Step-Audio-EditX Technical Report

翻译：Step-Audio-EditX技术报告

Chao Yan,Boyong Wu,Peng Yang,Pengfei Tan,Guoqiang Hu,Yuxin Zhang, Xiangyu, Zhang,Fei Tian,Xuerui Yang,Xiangyu Zhang,Daxin Jiang,Gang Yu

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities.Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

翻译：我们提出了Step-Audio-EditX，这是首个基于大型语言模型的开源音频模型，在情感、说话风格和副语言特征方面表现出卓越的表达性和迭代音频编辑能力，同时具备鲁棒的零样本文本到语音（TTS）功能。我们的核心创新在于仅利用大间隔合成数据，从而避免了基于嵌入的先验知识或辅助模块的需求。这种大间隔学习方法实现了跨语音的迭代控制和高表达性，并代表了对传统表征级解耦关注的根本性转变。评估结果表明，Step-Audio-EditX在情感编辑及其他细粒度控制任务上超越了MiniMax-2.6-hd和Doubao-Seed-TTS-2.0。

0

相关内容

【Google AI-Yi Tay】Transformer记忆为可微搜索索引”(DSI)

【Google AI-Yi Tay】Transformer记忆为可微搜索索引”(DSI)

专知会员服务

10+阅读 · 2022年3月4日

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

专知会员服务

29+阅读 · 2020年3月26日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【DeepMind】基于变换的大规模数据对抗视频预测，Transformation-based Adversarial Video Prediction on Large-Scale Data

【DeepMind】基于变换的大规模数据对抗视频预测，Transformation-based Adversarial Video Prediction on Large-Scale Data

专知会员服务

17+阅读 · 2020年3月9日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

【Tutorial】计算机视觉中的Transformer，98页ppt

【Tutorial】计算机视觉中的Transformer，98页ppt

专知

21+阅读 · 2021年10月25日

将Python用于NLP：Pattern 库简介

将Python用于NLP：Pattern 库简介

Python程序员

15+阅读 · 2019年6月7日

Auto-Keras与AutoML：入门指南

Auto-Keras与AutoML：入门指南

云栖社区

18+阅读 · 2019年2月9日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

Generative Adversarial Text to Image Synthesis论文解读

Generative Adversarial Text to Image Synthesis论文解读

统计学习与视觉计算组

13+阅读 · 2017年6月9日

反问题的数学建模、计算及应用

国家自然科学基金

4+阅读 · 2015年12月31日

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

高维数据下的模型平均方法

国家自然科学基金

6+阅读 · 2014年12月31日

复杂多元数据的半参数统计推断

国家自然科学基金

5+阅读 · 2014年12月31日

Sigma-Moe-Tiny Technical Report

Arxiv

0+阅读 · 12月18日

GLM-TTS Technical Report

Arxiv

0+阅读 · 12月16日

LongCat-Image Technical Report

Arxiv

0+阅读 · 12月8日

Qwen3-VL Technical Report

Arxiv

0+阅读 · 11月26日

EmbeddingGemma: Powerful and Lightweight Text Representations

Arxiv

0+阅读 · 11月1日

VIP会员

文章信息

相关主题

相关VIP内容

【Google AI-Yi Tay】Transformer记忆为可微搜索索引”(DSI)

【Google AI-Yi Tay】Transformer记忆为可微搜索索引”(DSI)

专知会员服务

10+阅读 · 2022年3月4日

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

专知会员服务

29+阅读 · 2020年3月26日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【DeepMind】基于变换的大规模数据对抗视频预测，Transformation-based Adversarial Video Prediction on Large-Scale Data

【DeepMind】基于变换的大规模数据对抗视频预测，Transformation-based Adversarial Video Prediction on Large-Scale Data

专知会员服务

17+阅读 · 2020年3月9日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

热门VIP内容

开通专知VIP会员享更多权益服务

前沿人工智能趋势报告（Frontier AI Trends Report）

【AAAI2026】善始则事半功倍：基于前缀优化的大语言模型推理强化学习

Andrej Karpathy：2025 年 LLM 年度回顾（2025 LLM Year in Review）

音退化问题：基于输入操控的鲁棒语音转换综述

相关资讯

【Tutorial】计算机视觉中的Transformer，98页ppt

【Tutorial】计算机视觉中的Transformer，98页ppt

专知

21+阅读 · 2021年10月25日

将Python用于NLP：Pattern 库简介

将Python用于NLP：Pattern 库简介

Python程序员

15+阅读 · 2019年6月7日

Auto-Keras与AutoML：入门指南

Auto-Keras与AutoML：入门指南

云栖社区

18+阅读 · 2019年2月9日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

Generative Adversarial Text to Image Synthesis论文解读

Generative Adversarial Text to Image Synthesis论文解读

统计学习与视觉计算组

13+阅读 · 2017年6月9日

相关论文

Sigma-Moe-Tiny Technical Report

Arxiv

0+阅读 · 12月18日

GLM-TTS Technical Report

Arxiv

0+阅读 · 12月16日

LongCat-Image Technical Report

Arxiv

0+阅读 · 12月8日

Qwen3-VL Technical Report

Arxiv

0+阅读 · 11月26日

EmbeddingGemma: Powerful and Lightweight Text Representations

Arxiv

0+阅读 · 11月1日

相关基金

反问题的数学建模、计算及应用

国家自然科学基金

4+阅读 · 2015年12月31日

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

高维数据下的模型平均方法

国家自然科学基金

6+阅读 · 2014年12月31日

复杂多元数据的半参数统计推断

国家自然科学基金

5+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员