
7. NLP: Classifying Names with a Character-Level RNN

Practical PyTorch: Classifying Names with a Character-Level RNN

This article is translated from spro/practical-pytorch. Original: https://github.com/spro/practical-pytorch/blob/master/char-rnn-classification/char-rnn-classification.ipynb Translation: fujie. Assistance: huaiwen.

Introduction

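We will build and train a basic character-level RNN to classify words. A character-level RNN reads a word as a series of characters. At each step it outputs a prediction and a "hidden state", and it feeds the previous hidden state into the next step. We take the final prediction as the output, i.e. the class the word belongs to. Specifically, we will train on a few thousand surnames from 18 languages of origin and predict, based on the spelling, which language a name comes from.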

Example

$ python predict.py Hinton
(-0.47) Scottish 
(-1.52) English 
(-3.57) Irish 

$ python predict.py Schmidhuber 
(-0.19) German 
(-2.48) Czech 
(-2.68) Dutch

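Recommended reading

This tutorial assumes you have at least installed PyTorch, know Python, and understand Tensors:

  • http://pytorch.org/ (installation instructions)
  • Deep Learning with PyTorch: A 60-minute Blitz (a general introduction to PyTorch)
  • jcjohnson's PyTorch examples (a more in-depth look at PyTorch)
  • Introduction to PyTorch for former Torchies (if you have used Lua Torch before)

It is also useful to know about RNNs and how they work:

  • The Unreasonable Effectiveness of Recurrent Neural Networks (shows a bunch of real-life examples)
  • Understanding LSTM Networks (specifically about LSTMs, but also a good general introduction to RNNs)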

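Preparing the data

The data/names directory contains 18 text files named "[Language].txt". Each file contains a bunch of names, one name per line, mostly romanized (but we still need to convert them from Unicode to ASCII).

We will end up with a dictionary of lists of names per language, {language: [names ...]}. The generic variables "category" and "line" (for language and name in our case) are used for later extensibility.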

import glob
all_filenames = glob.glob('../data/names/*.txt')
print(all_filenames)

['../data/names/Arabic.txt', '../data/names/Chinese.txt', '../data/names/Czech.txt', '../data/names/Dutch.txt', '../data/names/English.txt', '../data/names/French.txt', '../data/names/German.txt', '../data/names/Greek.txt', '../data/names/Irish.txt', '../data/names/Italian.txt', '../data/names/Japanese.txt', '../data/names/Korean.txt', '../data/names/Polish.txt', '../data/names/Portuguese.txt', '../data/names/Russian.txt', '../data/names/Scottish.txt', '../data/names/Spanish.txt', '../data/names/Vietnamese.txt']

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicode_to_ascii('Ślusàrski'))

Slusarski
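
To see why filtering on the 'Mn' category strips accents, it can help to look at what NFD normalization does to a single character. The following is a minimal sketch using only the standard library; the sample character is chosen purely for illustration.

```python
import unicodedata

# NFD splits a precomposed accented character into its base letter
# followed by one or more combining marks.
decomposed = unicodedata.normalize('NFD', '\u00e0')  # U+00E0 is 'a' with a grave accent

for ch in decomposed:
    # 'Ll' = lowercase letter, 'Mn' = nonspacing (combining) mark
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))

# Dropping every 'Mn' character leaves only the base letter.
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

The extra `c in all_letters` test in unicode_to_ascii then discards anything that is still outside the chosen alphabet after the accents are gone.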

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
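def readLines(filename):
    # Read as UTF-8 explicitly; the name files contain non-ASCII characters
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]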

for filename in all_filenames:
    category = filename.split('/')[-1].split('.')[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
print('n_categories =', n_categories)

n_categories = 18
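
The filename.split('/') call above assumes forward-slash path separators. As a hedged alternative (not part of the original notebook), pathlib can extract the category name in a platform-independent way. The helper name category_from_path is made up for this sketch.

```python
from pathlib import Path

def category_from_path(filename):
    # Hypothetical helper: '../data/names/Arabic.txt' -> 'Arabic'
    # Path.stem is the final path component without its suffix.
    return Path(filename).stem

print(category_from_path('../data/names/Arabic.txt'))
print(category_from_path('Scottish.txt'))
```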

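Now we have category_lines, a dictionary mapping each category (language) to a list of lines (names). We also keep track of all_categories (just a list of languages) and n_categories for later reference.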

print(category_lines['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']

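Turning names into Tensors

Now that we have all the names organized, we need to turn them into Tensors to make any use of them. To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To represent a word, we join a bunch of those into a 2D matrix <line_length x 1 x n_letters>.

That extra dimension of size 1 is there because PyTorch assumes everything is in batches; we are just using a batch size of 1 here.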

import torch

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letter_to_tensor(letter):
    tensor = torch.zeros(1, n_letters)
    letter_index = all_letters.find(letter)
    tensor[0][letter_index] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        tensor[li][0][letter_index] = 1
    return tensor

print(letter_to_tensor('J'))    

Columns 0 to 12 0 0 0 0 0 0 0 0 0 0 0 0 0

Columns 13 to 25 0 0 0 0 0 0 0 0 0 0 0 0 0

Columns 26 to 38 0 0 0 0 0 0 0 0 0 1 0 0 0

Columns 39 to 51 0 0 0 0 0 0 0 0 0 0 0 0 0

Columns 52 to 56 0 0 0 0 0 [torch.FloatTensor of size 1x57]

print(line_to_tensor('Jones').size())

torch.Size([5, 1, 57])
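
The same encoding can be reproduced without PyTorch, which makes the index arithmetic easier to inspect. This is a minimal sketch with plain lists; one_hot and line_to_one_hot are illustrative names, and the batch dimension of size 1 is left out.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 52 ASCII letters + 5 punctuation characters = 57

def one_hot(letter):
    # Plain-list stand-in for letter_to_tensor: all zeros except a 1
    # at the letter's position in all_letters.
    vector = [0] * n_letters
    vector[all_letters.index(letter)] = 1  # raises ValueError for unknown letters
    return vector

def line_to_one_hot(line):
    # One row per character: a <line_length x n_letters> nested list.
    return [one_hot(letter) for letter in line]

print(one_hot('J').index(1))     # position of the single 1
rows = line_to_one_hot('Jones')
print(len(rows), len(rows[0]))   # line_length, n_letters
```

One design difference worth noting: str.find, used in the tensor version, returns -1 for a character outside all_letters, which would silently set the last column. str.index raises an error instead, which is why the sketch uses it.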

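Creating the network

Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held hidden state and gradients, which are now handled entirely by the graph itself. This means you can implement an RNN in a very "pure" way, as regular feed-forward layers.

This RNN module (mostly copied from the PyTorch for Torch users tutorial) is just 2 linear layers that operate on the input and hidden state, with a LogSoftmax layer after the output.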

import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # normalize over the category dimension

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))
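
To make the data flow of forward() explicit, here is a rough pure-Python sketch of a single step with the same structure: concatenate input and hidden state, apply one linear layer to get the next hidden state, and apply another linear layer followed by log-softmax to get the output. The tiny layer sizes, the random initialisation and all helper names are assumptions for illustration only, not the notebook's implementation.

```python
import math
import random

def linear(weights, bias, x):
    # y = W x + b for a weight matrix stored as a list of rows
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def log_softmax(values):
    # Subtract the max first for numerical stability
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]

def rnn_step(x, hidden, i2h, i2o):
    # Mirrors RNN.forward: concatenate, then apply both linear layers
    combined = x + hidden
    new_hidden = linear(i2h[0], i2h[1], combined)
    output = log_softmax(linear(i2o[0], i2o[1], combined))
    return output, new_hidden

def random_layer(n_in, n_out, rng):
    # Toy initialisation, an assumption for this sketch only
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
input_size, hidden_size, output_size = 4, 3, 2  # tiny sizes for illustration
i2h = random_layer(input_size + hidden_size, hidden_size, rng)
i2o = random_layer(input_size + hidden_size, output_size, rng)

hidden = [0.0] * hidden_size
for x in ([1, 0, 0, 0], [0, 0, 1, 0]):  # two one-hot "letters"
    output, hidden = rnn_step(x, hidden, i2h, i2o)

print(output, hidden)
```

As in the module above, no nonlinearity is applied to the hidden state; the only nonlinearity is the log-softmax on the output.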

Manually testing the network

With our custom RNN class defined, we can create a new instance:

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

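To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We get back the output (the log-probability of each language) and the next hidden state (which we keep for the next step).

Remember that PyTorch modules operate on Variables rather than directly on Tensors. (Since PyTorch 0.4, Variable has been merged into Tensor, so the wrapper is no longer required, although it still works.)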

input = Variable(letter_to_tensor('A'))
hidden = rnn.init_hidden()

output, next_hidden = rnn(input, hidden)
print('output.size =', output.size())

output.size = torch.Size([1, 18])

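For the sake of efficiency we don't want to create a new Tensor for every step, so we will use line_to_tensor instead of letter_to_tensor and take slices of it. This could be optimized further by pre-computing batches of Tensors.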

input = Variable(line_to_tensor('Albert'))
hidden = Variable(torch.zeros(1, n_hidden))

output, next_hidden = rnn(input[0], hidden)
print(output)

Variable containing:

Columns 0 to 9 -2.8658 -2.8801 -2.7945 -2.9082 -2.8309 -2.9718 -2.9366 -2.9416 -2.7900 -2.8467

Columns 10 to 17 -2.9495 -2.9496 -2.8707 -2.8984 -2.8147 -2.9442 -2.9257 -2.9363 [torch.FloatTensor of size 1x18]

As you can see, the output is a <1 x n_categories> Tensor, where every item is the log-likelihood of that category (higher means more likely).
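
Because the last layer is LogSoftmax, the numbers are log-probabilities, and exponentiating them gives ordinary probabilities. The small worked example below uses the three scores from the "Hinton" example near the top, and also computes what a completely uninformed network would output for 18 classes.

```python
import math

# Log-probabilities copied from the "Hinton" example near the top
top3 = {'Scottish': -0.47, 'English': -1.52, 'Irish': -3.57}
probs = {name: math.exp(lp) for name, lp in top3.items()}
for name, p in probs.items():
    print('%s: %.3f' % (name, p))

# A network that spreads its belief evenly over 18 classes outputs log(1/18)
n_categories = 18
uniform_log_prob = math.log(1.0 / n_categories)
print('%.4f' % uniform_log_prob)
```

The untrained outputs printed above all sit close to log(1/18), roughly -2.89, which is what one would expect from a network that has not yet learned to prefer any language.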

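Preparing for training

Before going into training we should write a few helper functions. The first one interprets the output of the network, which we know to be a log-likelihood for each category. We can use Tensor.topk to get the index of the greatest value: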

def category_from_output(output):
    top_n, top_i = output.data.topk(1) # Tensor out of Variable with .data
    category_i = top_i[0][0]
    return all_categories[category_i], category_i

print(category_from_output(output))

('Irish', 8)
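
For readers who want to check the result by hand, the following sketch does the same top-k selection on plain Python lists, using the 18 values printed for 'Albert' above. The top_k helper is an illustrative stand-in, not the PyTorch API.

```python
all_categories = ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
                  'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
                  'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
                  'Vietnamese']

# Log-probabilities copied from the printed output for 'Albert' above
output = [-2.8658, -2.8801, -2.7945, -2.9082, -2.8309, -2.9718, -2.9366,
          -2.9416, -2.7900, -2.8467, -2.9495, -2.9496, -2.8707, -2.8984,
          -2.8147, -2.9442, -2.9257, -2.9363]

def top_k(values, k):
    # Indices of the k largest values, largest first
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [(values[i], i) for i in order[:k]]

for value, index in top_k(output, 3):
    print('(%.2f) %s' % (value, all_categories[index]))
```

The largest value, -2.7900, sits at index 8, which is 'Irish' in the alphabetical category list and matches the tuple printed above.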

We will also want a quick way to get a training example (a name and its language):

import random

def random_training_pair():
    category = random.choice(all_categories)
    line = random.choice(category_lines[category])
    category_tensor = Variable(torch.LongTensor([all_categories.index(category)]))
    line_tensor = Variable(line_to_tensor(line))
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = random_training_pair()
    print('category =', category, '/ line =', line)

category = Italian / line = Campana
category = Korean / line = Koo
category = Irish / line = Mochan
category = Japanese / line = Kitabatake
category = Vietnamese / line = an
category = Korean / line = Kwak
category = Portuguese / line = Campos
category = Vietnamese / line = Chung
category = Japanese / line = Ise
category = Dutch / line = Romijn

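Training the network

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it when it is wrong. For the loss function, nn.NLLLoss is appropriate, since the last layer of the RNN is nn.LogSoftmax.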

criterion = nn.NLLLoss()
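
For a single example, the negative log-likelihood loss is simply minus the log-probability that the model assigned to the correct class. The sketch below illustrates that definition with plain lists; nll_loss here is a hand-written stand-in for illustration, and the probability values are made up.

```python
import math

def nll_loss(log_probs, target_index):
    # Negative log-likelihood for one example: minus the log-probability
    # the model assigned to the correct class
    return -log_probs[target_index]

n_categories = 18
uniform = [math.log(1.0 / n_categories)] * n_categories
print('%.4f' % nll_loss(uniform, 8))    # loss of an uninformed guess

# Made-up distribution: 90% on class 0, the rest spread evenly
confident = [math.log(0.9)] + [math.log(0.1 / 17)] * 17
print('%.4f' % nll_loss(confident, 0))  # confident and correct: small loss
print('%.4f' % nll_loss(confident, 5))  # confident and wrong: large loss
```

An uninformed guess over 18 classes costs log(18), about 2.89, which gives a useful baseline when reading the loss values printed during training.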

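We will also create an "optimizer" which updates the parameters of our model according to their gradients. We will use the plain SGD algorithm with a low learning rate.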

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)
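
The comment about the learning rate can be illustrated with a toy problem. The sketch below applies the plain SGD rule, parameter minus learning rate times gradient, to the function f(w) = (w - 3)^2. The function, the step counts and the helper names are assumptions chosen for illustration; they are unrelated to the RNN's actual loss surface.

```python
def sgd_step(params, grads, lr):
    # Move every parameter a small step against its gradient
    return [p - lr * g for p, g in zip(params, grads)]

def run(lr, steps=50):
    # Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
    w = [0.0]
    for _ in range(steps):
        w = sgd_step(w, [2 * (w[0] - 3)], lr)
    return w[0]

print(run(0.1))    # converges towards the minimum at w = 3
print(run(1.1))    # too high: each step overshoots and the error grows
print(run(1e-5))   # too low: barely moves away from the start at w = 0
```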

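Each training iteration will:

  • Create input and target tensors
  • Create a zeroed initial hidden state
  • Read each letter in and keep the hidden state for the next letter
  • Compare the final output to the target
  • Back-propagate
  • Return the output and the loss
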
def train(category_tensor, line_tensor):
    rnn.zero_grad()
    hidden = rnn.init_hidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    optimizer.step()

    return output, loss.item()  # PyTorch 0.4+; older releases used loss.data[0]

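Now we just have to run that with a bunch of examples. Since the train function returns both the output and the loss, we can print its guesses and also keep track of the loss for plotting. Since there are thousands of examples, we print only every print_every steps, and record the average loss every plot_every steps.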

import time
import math

n_epochs = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def time_since(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for epoch in range(1, n_epochs + 1):
    # Get a random training input and target
    category, line, category_tensor, line_tensor = random_training_pair()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print epoch number, loss, name and guess
    if epoch % print_every == 0:
        guess, guess_i = category_from_output(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (epoch, epoch / n_epochs * 100, time_since(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if epoch % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

5000 5% (0m 7s) 2.7940 Neil / Chinese ✗ (Irish)
10000 10% (0m 14s) 2.7166 O'Kelly / English ✗ (Irish)
15000 15% (0m 23s) 1.1694 Vescovi / Italian ✓
20000 20% (0m 31s) 2.1433 Mikhailjants / Greek ✗ (Russian)
25000 25% (0m 40s) 2.0299 Planick / Russian ✗ (Czech)
30000 30% (0m 48s) 1.9862 Cabral / French ✗ (Portuguese)
35000 35% (0m 55s) 1.5634 Espina / Spanish ✓
40000 40% (1m 5s) 3.8602 MaxaB / Arabic ✗ (Czech)
45000 45% (1m 13s) 3.5599 Sandoval / Dutch ✗ (Spanish)
50000 50% (1m 20s) 1.3855 Brown / Scottish ✓
55000 55% (1m 27s) 1.6269 Reid / French ✗ (Scottish)
60000 60% (1m 35s) 0.4495 Kijek / Polish ✓
65000 65% (1m 43s) 1.0269 Young / Scottish ✓
70000 70% (1m 50s) 1.9761 Fischer / English ✗ (German)
75000 75% (1m 57s) 0.7915 Rudaski / Polish ✓
80000 80% (2m 5s) 1.7026 Farina / Portuguese ✗ (Italian)
85000 85% (2m 12s) 0.1878 Bakkarevich / Russian ✓
90000 90% (2m 19s) 0.1211 Pasternack / Polish ✓
95000 95% (2m 25s) 0.6084 Otani / Japanese ✓
100000 100% (2m 33s) 0.2713 Alesini / Italian ✓
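
The per-example losses in the log above jump around quite a bit (for instance 1.5634 at step 35000 followed by 3.8602 at step 40000), which is why the loop stores window averages in all_losses rather than individual values. The sketch below isolates that bookkeeping; windowed_means is an illustrative name and the loss values are made up.

```python
def windowed_means(losses, window):
    # Average consecutive, non-overlapping chunks of `window` losses,
    # mirroring how current_loss is reset every plot_every iterations
    means = []
    running = 0.0
    for step, loss in enumerate(losses, start=1):
        running += loss
        if step % window == 0:
            means.append(running / window)
            running = 0.0
    return means

# Made-up per-example losses, for illustration only
losses = [2.0, 4.0, 1.0, 3.0, 0.5, 1.5]
print(windowed_means(losses, 2))
```

Note that a trailing partial window is dropped, just as in the training loop, where an average is only appended when the step count is a multiple of plot_every.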

Plotting the results

Plotting the historical loss values stored in all_losses shows the network learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

plt.figure()
plt.plot(all_losses)


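Evaluating the results

To see how well the network performs on different categories, we will create a confusion matrix that indicates, for every actual language (rows), which language the network guesses (columns). Each row therefore shows how the samples of one category are distributed across the predicted categories.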

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.init_hidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = random_training_pair()
    output = evaluate(line_tensor)
    guess, guess_i = category_from_output(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()
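
The counting and row normalization can also be written with plain lists, which makes it easy to verify on a small example. This is a minimal sketch with made-up (actual, predicted) pairs; the function names are illustrative. Unlike the tensor code above, it leaves an all-zero row untouched instead of dividing by zero, a case that could in principle occur if a category is never sampled.

```python
def confusion_matrix(pairs, n_classes):
    # pairs: iterable of (actual_index, predicted_index)
    counts = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in pairs:
        counts[actual][predicted] += 1
    return counts

def normalize_rows(counts):
    # Divide each row by its sum; leave all-zero rows as zeros
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Made-up (actual, predicted) pairs for three classes
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 1), (1, 1)]
matrix = normalize_rows(confusion_matrix(pairs, 3))
for row in matrix:
    print(['%.2f' % v for v in row])
```

After normalization each non-empty row sums to 1, so the diagonal entry of a row can be read as the fraction of that category's samples that were classified correctly.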

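You can pick out bright spots off the main diagonal that show which languages the network guesses incorrectly, e.g. confusing Chinese with Korean, and Spanish with Italian. Judging from the plot, Greek is predicted very well (its diagonal cell is the brightest), while English is predicted poorly (possibly because its names overlap with those of many other European languages).

Running on user input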

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    output = evaluate(Variable(line_to_tensor(input_line)))

    # Get top N categories
    topv, topi = output.data.topk(n_predictions, 1, True)
    predictions = []

    for i in range(n_predictions):
        value = topv[0][i]
        category_index = topi[0][i]
        print('(%.2f) %s' % (value, all_categories[category_index]))
        predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')


> Dovesky
(-0.87) Czech
(-0.88) Russian
(-2.44) Polish

> Jackson
(-0.74) Scottish
(-2.03) English
(-2.21) Polish

> Satoshi
(-0.77) Arabic
(-1.35) Japanese
(-1.81) Polish

The final versions of the scripts in the Practical PyTorch repo split the above code into a few files:

  • data.py (loads files)
  • model.py (defines the RNN)
  • train.py (runs training)
  • predict.py (runs predict() with command line arguments)
  • server.py (serve prediction as a JSON API with bottle.py)

Run train.py to train and save the network.

Run predict.py with a name to view predictions:

$ python predict.py Hazaki 
(-0.42) Japanese 
(-1.39) Polish 
(-3.51) Czech

Run server.py and visit http://localhost:5533/Yourname to get JSON output of predictions.
