100+ Pre-trained Chinese Word Vectors: Chinese-Word-Vectors

February 26, 2019, AINLP

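Yesterday's post covered the Tencent AI Lab word vectors ("Similar-word lookup: playing with the Tencent AI Lab Chinese word vectors"). Striking while the iron is hot, today I recommend a Chinese word vector project on GitHub: Chinese-Word-Vectors. The GitHub address is: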

https://github.com/Embedding/Chinese-Word-Vectors

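The project was released in mid-2018 alongside the ACL 2018 paper "Analogical Reasoning on Chinese Morphological and Semantic Relations", by authors from Beijing Normal University and Renmin University of China. It provides more than a hundred sets of Chinese word vectors pre-trained on Chinese corpora such as Baidu Encyclopedia, Chinese Wikipedia, People's Daily, Sogou News, Zhihu QA, and Weibo. Every set of vectors listed on the GitHub page has a Baidu Netdisk download link, so have a look if you are interested. The rest of this post is taken from the GitHub README.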

Chinese Word Vectors 中文词向量

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is shown in the paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word followed by its vector, with values separated by spaces. The first line holds the meta information: the first number is the number of words in the file and the second is the vector dimension.
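As a rough illustration of the text format described above, the following Python sketch reads such a file into a dictionary. The function name `load_dense_vectors` and the two-word sample are made up for this example; it assumes UTF-8 text and words that contain no spaces.

```python
import io

def load_dense_vectors(handle):
    """Read word vectors from an iterable of text lines.

    Returns (vectors, dim), where vectors maps word -> list of floats.
    """
    lines = iter(handle)
    header = next(lines).split()      # first line: "<word count> <dimension>"
    dim = int(header[1])
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue                  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

# Tiny in-memory sample (hypothetical values) standing in for a real file.
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
vectors, dim = load_dense_vectors(sample)
print(dim, vectors["北京"])
```

For a downloaded file, pass `open(path, encoding="utf-8")` instead of the sample. The header line matches the word2vec text format, so libraries that read that format, such as gensim's `KeyedVectors.load_word2vec_format` with `binary=False`, should also be able to load these files.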

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They use the same format as liblinear, where the number before the ":" is the dimension index and the number after the ":" is the value.
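A sparse line can be parsed along the following lines. This is only a sketch: it assumes each line starts with the word itself, followed by `index:value` pairs with no spaces around the colon, and the sample line is invented for illustration.

```python
def parse_sparse_line(line):
    """Parse 'word idx:value idx:value ...' into (word, {idx: value})."""
    parts = line.split()
    word = parts[0]                   # assumption: the first token is the word
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return word, features

# Hypothetical sample line, not taken from a real vector file.
word, features = parse_sparse_line("北京 3:0.52 17:1.8 204:0.07")
print(word, features)
```

Storing only the non-zero dimensions in a dictionary keeps memory use proportional to the number of features that actually occur with the word.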

Pre-trained Chinese Word Vectors

Basic Settings

Window Size	Dynamic Window	Sub-sampling	Low-Frequency Word	Iteration	Negative Sampling*
5	Yes	1e-5	10	5	5

*Only for SGNS.
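To make the "Dynamic Window" setting concrete, here is a minimal sketch of how word2vec-style training usually draws context pairs: for each target word the effective window is sampled uniformly between 1 and the maximum window size (5 above). The function name and the token list are illustrative only, and the sub-sampling and low-frequency filtering from the table are not shown.

```python
import random

def dynamic_window_pairs(tokens, max_window=5, rng=None):
    """Return (target, context) pairs using a randomly shrunk window."""
    rng = rng or random.Random(0)
    pairs = []
    for i, target in enumerate(tokens):
        window = rng.randint(1, max_window)   # inclusive on both ends
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                        # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["我", "爱", "北京", "天安门"]
pairs = dynamic_window_pairs(tokens, max_window=5, rng=random.Random(42))
print(len(pairs), pairs[:3])
```

Because nearer words fall inside the sampled window more often than distant ones, this scheme effectively gives closer context words more weight.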

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)
Corpus	Context Features
Corpus	Word	Word + Ngram	Word + Character	Word + Character + Ngram
Baidu Encyclopedia 百度百科	300d	300d	300d	300d
Wikipedia_zh 中文维基百科	300d	300d	300d	300d
People's Daily News 人民日报	300d	300d	300d	300d
Sogou News 搜狗新闻	300d	300d	300d	300d
Financial News 金融新闻	300d	300d	300d	300d
Zhihu_QA 知乎问答	300d	300d	300d	300d
Weibo 微博	300d	300d	300d	300d
Literature 文学作品	300d	300d	300d	300d
Complete Library in Four Sections 四库全书*	300d	300d	NAN	NAN
Mixed-large 综合	300d	300d	300d	300d

Positive Pointwise Mutual Information (PPMI)
Corpus	Context Features
Corpus	Word	Word + Ngram	Word + Character	Word + Character + Ngram
Baidu Encyclopedia 百度百科	Sparse	Sparse	Sparse	Sparse
Wikipedia_zh 中文维基百科	Sparse	Sparse	Sparse	Sparse
People's Daily News 人民日报	Sparse	Sparse	Sparse	Sparse
Sogou News 搜狗新闻	Sparse	Sparse	Sparse	Sparse
Financial News 金融新闻	Sparse	Sparse	Sparse	Sparse
Zhihu_QA 知乎问答	Sparse	Sparse	Sparse	Sparse
Weibo 微博	Sparse	Sparse	Sparse	Sparse
Literature 文学作品	Sparse	Sparse	Sparse	Sparse
Complete Library in Four Sections 四库全书*	Sparse	Sparse	NAN	NAN
Mixed-large 综合	Sparse	Sparse	Sparse	Sparse

*Character embeddings are provided, since most Hanzi are words in their own right in archaic Chinese.

Various Co-occurrence Information

We release word vectors based on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors can be found in the context vectors of the word-character co-occurrence type.

All vectors are trained by SGNS on Baidu Encyclopedia.

Feature	Co-occurrence Type	Target Word Vectors	Context Word Vectors
Word	Word → Word	300d	300d
Ngram	Word → Ngram (1-2)	300d	300d
	Word → Ngram (1-3)	300d	300d
	Ngram (1-2) → Ngram (1-2)	300d	300d
Character	Word → Character (1)	300d	300d
	Word → Character (1-2)	300d	300d
	Word → Character (1-4)	300d	300d
Radical	Radical	300d	300d
Position	Word → Word (left/right)	300d	300d
Position	Word → Word (distance)	300d	300d
Global	Word → Text	300d	300d
Syntactic Feature	Word → POS	300d	300d
Syntactic Feature	Word → Dependency	300d	300d

Representations

Existing word representation methods fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors through a shallow neural network; it is also called a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
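The PPMI weighting mentioned above can be sketched as follows, using the standard definition PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))). The co-occurrence counts are toy values invented for this example, and the natural logarithm is an arbitrary choice; the project's own training code may differ in details such as smoothing.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """Compute PPMI weights from a {(word, context): count} mapping."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), count in pair_counts.items():
        word_totals[word] += count
        context_totals[context] += count
    weights = {}
    for (word, context), count in pair_counts.items():
        if count == 0:
            continue
        # PMI = log( P(w, c) / (P(w) * P(c)) ), written with raw counts
        pmi = math.log(count * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:
            weights[(word, context)] = pmi    # only positive values are kept
    return weights

# Toy co-occurrence counts (hypothetical).
counts = {("北京", "首都"): 8, ("北京", "的"): 2, ("苹果", "水果"): 6, ("苹果", "的"): 4}
weights = ppmi(counts)
print(weights)
```

Pairs whose PMI is zero or negative are simply left out, which is what makes the resulting representation sparse.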

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, that is, they use words as context features (word feature). Inspired by the language modeling problem, we introduce ngram features into the context: both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). In Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
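The ngram and character features described above can be illustrated with two small helpers. The function names and inputs are made up for this sketch; how the extracted features are paired with target words during training is left to the training toolkit and is not shown here.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Character-level ngrams of a word, lengths min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

def word_ngrams(tokens, min_n=1, max_n=2):
    """Word-level ngrams of a segmented sentence, as tuples of tokens."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams

print(char_ngrams("天安门"))
print(word_ngrams(["我", "爱", "北京"]))
```

For a three-character word the 4-gram range contributes nothing, so the longest feature returned is the word itself.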

Besides word, ngram, and character, other features also have a substantial influence on the properties of word vectors. For example, using the entire text as a context feature can introduce more topic information into word vectors, and using dependency parses as context features can add syntactic constraints to word vectors. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags. Only the plain text is kept, and HanLP (v_1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC). The detailed corpus information is listed as follows:

Corpus	Size	Tokens	Vocabulary Size	Description
Baidu Encyclopedia 百度百科	4.1G	745M	5422K	Chinese Encyclopedia data from https://baike.baidu.com/
Wikipedia_zh 中文维基百科	1.3G	223M	2129K	Chinese Wikipedia data from https://dumps.wikimedia.org/
People's Daily News 人民日报	3.9G	668M	1664K	News data from People's Daily(1946-2017) http://data.people.com.cn/
Sogou News 搜狗新闻	3.7G	649M	1226K	News data provided by Sogou labs http://www.sogou.com/labs/
Financial News 金融新闻	6.2G	1055M	2785K	Financial news collected from multiple news websites
Zhihu_QA 知乎问答	2.1G	384M	1117K	Chinese QA data from https://www.zhihu.com/
Weibo 微博	0.73G	136M	850K	Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/download/weibo.7z
Literature 文学作品	0.93G	177M	702K	8599 modern Chinese literature works
Mixed-large 综合	22.6G	4037M	10653K	We build the large corpus by merging the above corpora.
Complete Library in Four Sections 四库全书	1.5G	714M	21.8K	The largest collection of texts in pre-modern China.

The figures above count all words, including low-frequency words.
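The tag-removal step mentioned in the preprocessing description can be approximated with the standard library alone. This is a deliberately naive, regex-based sketch, not the project's actual pipeline; word segmentation with HanLP and traditional-to-simplified conversion with OpenCC are separate steps and are not shown.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")      # anything that looks like a tag

def strip_markup(text):
    """Remove HTML/XML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_PATTERN.sub("", text)
    unescaped = html.unescape(no_tags)    # e.g. "&amp;" -> "&"
    return re.sub(r"\s+", " ", unescaped).strip()

cleaned = strip_markup("<p>北京是<b>中国</b>的首都 &amp; 政治中心</p>")
print(cleaned)
```

Tags are replaced with the empty string because Chinese text has no spaces between words; for space-delimited languages, substituting a space would be the safer choice.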

Toolkits

All word vectors are trained with the ngram2vec toolkit. The ngram2vec toolkit is a superset of the word2vec and fasttext toolkits, in which arbitrary context features and models are supported.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in many Chinese word embedding papers, it contains only three types of semantic questions and covers 134 Chinese words. In contrast, CA8 is designed specifically for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.
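For readers unfamiliar with analogy questions, a question "a is to b as c is to ?" is commonly answered with the vector-offset method: pick the word whose vector is closest, by cosine similarity, to b - a + c. The sketch below illustrates this with hand-made 3-dimensional toy vectors; it is not the project's evaluation code, and the exact scoring used by the toolkit is not stated here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the question words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = cosine(vec, target)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy vectors (hypothetical): the last dimension loosely encodes "is a capital".
toy = {
    "中国": [1.0, 0.0, 0.0], "北京": [1.0, 0.0, 1.0],
    "日本": [0.0, 1.0, 0.0], "东京": [0.0, 1.0, 1.0],
    "苹果": [0.5, 0.5, -1.0],
}
answer = solve_analogy(toy, "中国", "北京", "日本")
print(answer)
```

Accuracy on a benchmark is then the fraction of questions for which the returned word matches the expected answer.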

Evaluation Toolkit

We provide an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors.

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors.

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt
