TensorFlow Supports Unicode Encoding

December 22, 2018, Google Developers

By Laurence Moroney, Google TensorFlow team; Edward Loper, Google Research team

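TensorFlow now supports Unicode, a standard encoding system that can represent characters from almost every language. When processing natural language, it is important to understand how characters are encoded. In a language with a small character set, such as English, every character can be represented in ASCII. That approach is impractical for languages such as Chinese, which have thousands of characters. Even in English text, special characters such as emojis cannot be represented in ASCII.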

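The most common standard for defining characters and their encodings is Unicode, which supports nearly every language. In Unicode, each character is represented by a unique integer code point with a value between 0 and 0x10FFFF. A sequence of code points placed in order forms a Unicode string.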
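As a small illustration of code points, the sketch below uses only Python built-ins (`ord` and `chr`), not TensorFlow. The three sample characters are chosen here for illustration and do not come from the original post.

```python
# Three sample characters (illustrative choices): a Latin letter,
# a Chinese character, and an emoji written as an escape sequence.
samples = ["A", "语", "\U0001F60A"]

# ord() returns the integer code point of a single character.
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # Only code points below 128 fall inside the ASCII range.
    print(f"{ch!r}: U+{cp:04X} ({cp}), fits in ASCII: {cp < 128}")

# chr() is the inverse of ord(); joining the characters rebuilds the string.
round_trip = "".join(chr(cp) for cp in code_points)
```

Only the first character fits in ASCII; the other two need a larger code space, which is what Unicode provides.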

The Unicode tutorial colab shows how to represent Unicode strings in TensorFlow. There are two standard ways to represent a Unicode string in TensorFlow:

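As a vector of integers, where each position contains a single code point.
As a string, where a character encoding is used to encode the sequence of code points into the string. There are many character encodings; some of the most common are UTF-8 and UTF-16.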

The following code shows the string “语言处理” encoded as code points, as UTF-8, and as UTF-16, respectively.
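A minimal sketch of these three representations is given here. It is not necessarily the listing from the original post; it assumes TensorFlow 2.x with eager execution (so that `.numpy()` is available), and the variable names are illustrative.

```python
import tensorflow as tf

text = "语言处理"

# Representation 1: a vector of integer code points, one per character.
text_chars = tf.constant([ord(ch) for ch in text])

# Representation 2: a string scalar holding the encoded bytes.
# UTF-16-BE is the big-endian form of UTF-16, without a byte-order mark.
text_utf8 = tf.constant(text.encode("UTF-8"))
text_utf16be = tf.constant(text.encode("UTF-16-BE"))

print(text_chars.numpy())    # four integers, one per character
print(text_utf8.numpy())     # 12 bytes: three bytes per character
print(text_utf16be.numpy())  # 8 bytes: two bytes per character
```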

Of course, you may need to convert between these representations, and TensorFlow 1.13 adds functions to do so:

tf.strings.unicode_decode: converts a string scalar to a vector of code points (https://www.tensorflow.org/versions/r1.13/api_docs/python/tf/strings/unicode_decode)
tf.strings.unicode_encode: converts a vector of code points to a string scalar (https://www.tensorflow.org/versions/r1.13/api_docs/python/tf/strings/unicode_encode)
tf.strings.unicode_transcode: converts a string scalar to a different encoding (https://www.tensorflow.org/versions/r1.13/api_docs/python/tf/strings/unicode_transcode)
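The encoding and transcoding functions listed above can be sketched as follows. This assumes TensorFlow 2.x with eager execution; the variable names are illustrative, and the code point values are those of “语言处理”.

```python
import tensorflow as tf

# Code points for the four characters of "语言处理".
text_chars = tf.constant([35821, 35328, 22788, 29702], dtype=tf.int32)

# Code point vector -> UTF-8 string scalar.
text_utf8 = tf.strings.unicode_encode(text_chars, output_encoding="UTF-8")

# UTF-8 string scalar -> UTF-16-BE string scalar.
text_utf16be = tf.strings.unicode_transcode(
    text_utf8, input_encoding="UTF-8", output_encoding="UTF-16-BE")

print(text_utf8.numpy())
print(text_utf16be.numpy())
```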

So, to decode the UTF-8 string from the example above into a vector of code points, you could do the following:
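One way this decoding step might look is sketched below. It is not necessarily the listing from the original post, and it assumes TensorFlow 2.x with eager execution; the UTF-8 constant is rebuilt here so that the snippet runs on its own.

```python
import tensorflow as tf

# The UTF-8 encoded string scalar for "语言处理".
text_utf8 = tf.constant("语言处理".encode("UTF-8"))

# UTF-8 string scalar -> vector of integer code points.
decoded = tf.strings.unicode_decode(text_utf8, input_encoding="UTF-8")

print(decoded.numpy())
```

For a scalar input the result is an ordinary dense tensor with one entry per character.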

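When decoding a Tensor that contains multiple strings, the strings may have different lengths. unicode_decode returns the result as a RaggedTensor, where the length of the inner dimension varies with the number of characters in each string.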
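The ragged result can be illustrated with a small batch. This is a sketch assuming TensorFlow 2.x with eager execution; the three sample strings and the padding value of -1 are illustrative choices, not taken from the original post.

```python
import tensorflow as tf

# A batch of three UTF-8 strings with 5, 4, and 1 characters.
strings = ["hello", "语言处理", "\U0001F60A"]
batch_utf8 = tf.constant([s.encode("UTF-8") for s in strings])

# Decoding a vector of strings yields a RaggedTensor: one row per string,
# with as many code points in each row as the string has characters.
batch_chars = tf.strings.unicode_decode(batch_utf8, input_encoding="UTF-8")

row_lengths = batch_chars.row_lengths().numpy().tolist()
as_lists = batch_chars.to_list()

# If a dense tensor is needed, shorter rows can be padded (here with -1).
padded = batch_chars.to_tensor(default_value=-1)

print(row_lengths)
print(as_lists)
print(padded.numpy())
```

Padding with a value that is not a valid code point, such as -1, keeps padded positions distinguishable from real characters.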

To learn more about Unicode support in TensorFlow, check out the Unicode tutorial colab and browse the tf.strings documentation (https://www.tensorflow.org/tutorials/representation/unicode).
