Hivemind: A Decentralized Deep Learning Training Framework for PyTorch


August 29, 2020 · Zhuanzhi (专知)

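[Overview] This article introduces Hivemind, a decentralized deep learning training framework for PyTorch that is designed to train one large Transformer across thousands of computers.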

GitHub repository:

https://github.com/learning-at-home/hivemind

Usage guide

Install the latest release of the package with pip:

pip install hivemind

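Hosting a server

A hivemind.Server hosts one or more experts (torch modules) for remote access. These experts hold most of the model parameters and do most of the computation. A server can be started from Python or from a shell script; here we use the shell. To host a server with the default experts, run the following in a shell: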

python scripts/run_server.py --expert_cls ffn --hidden_dim 512 --num_experts 5 --expert_pattern "expert.[0:5]" \
    --listen_on 0.0.0.0:1337 --dht_port 1338

This server accepts requests to its experts on port 1337 and starts a DHT peer on port 1338. In total it serves 5 feed-forward experts with ReLU and LayerNorm.
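To make the --expert_pattern argument more concrete, here is a minimal sketch that lists the expert UIDs a pattern could cover. It assumes that a block such as [0:5] is a half-open range like a Python slice, which matches the 5 experts and the names expert.1 and expert.4 used in this article. The helper expand_pattern is hypothetical and is not part of hivemind; how the server actually assigns UIDs within a range is not covered here.

```python
import re
from itertools import product
from typing import List


def expand_pattern(pattern: str) -> List[str]:
    """List every UID that a pattern such as 'expert.[0:5]' could cover.

    Assumption: '[a:b]' is a half-open range, like a Python slice.
    """
    # Split the pattern into literal text and '[start:stop]' blocks.
    parts = re.split(r"(\[\d+:\d+\])", pattern)
    choices = []
    for part in parts:
        match = re.fullmatch(r"\[(\d+):(\d+)\]", part)
        if match:
            start, stop = int(match.group(1)), int(match.group(2))
            choices.append([str(i) for i in range(start, stop)])
        else:
            choices.append([part])  # literal text has exactly one choice
    return ["".join(combo) for combo in product(*choices)]


print(expand_pattern("expert.[0:5]"))
```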

We can use the --initial_peers argument to create additional servers in the same decentralized network:

python scripts/run_server.py --expert_cls ffn --hidden_dim 512 --num_experts 10 --expert_pattern "expert.[5:250]" \
    --initial_peers localhost:1338

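Here and below, if you run on a different machine, replace localhost:1338 with the public IP address of the original server (e.g. 12.34.56.78:1338). Hivemind supports the IPv4 and IPv6 protocols and uses the same notation as gRPC.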
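As an illustration of the address notation, the sketch below splits an endpoint string into host and port. It assumes the common host:port form, with IPv6 literals written in square brackets (for example [::1]:1338). The function split_endpoint is hypothetical and only shows the notation; it is not how hivemind or gRPC parse addresses internally.

```python
from typing import Tuple


def split_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port)."""
    # The port is whatever follows the last colon.
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {endpoint!r}")
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]  # strip the brackets around an IPv6 literal
    elif ":" in host:
        raise ValueError(f"IPv6 addresses must be bracketed: {endpoint!r}")
    return host, int(port)


for peer in ["localhost:1338", "12.34.56.78:1338", "[::1]:1338"]:
    print(split_endpoint(peer))
```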

Running

First, run the following in a Python console:

import torch
import hivemind

dht = hivemind.DHT(initial_peers=["localhost:1338"], listen=False, start=True)
# note: listen=False means that your peer will operate in "client only" mode:
# it can send requests to other peers, but will not accept requests in return

expert1, expert4 = dht.get_experts(["expert.1", "expert.4"])
assert expert1 is not None and expert4 is not None, "server hasn't declared experts (yet?)"

dummy = torch.randn(3, 512)
out = expert1(dummy)  # forward pass
out.sum().backward()  # backward pass

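When called, expert1 submits a request to the corresponding server (the one you created above) and either returns the output tensor or raises an exception. During the backward pass, PyTorch submits backward requests to the experts as they appear in the computation graph.

By default, an expert automatically updates its parameters with one step of SGD after each backward pass. This lets you quickly run training with a mix of local and remote layers: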

# generate dummy data
x = torch.randn(3, 512)
y = 0.01 * x.sum(dim=-1, keepdim=True)

# local torch module
proj_out = torch.nn.Sequential(
    torch.nn.Linear(512, 3)
)
opt = torch.optim.SGD(proj_out.parameters(), lr=0.01)

for i in range(100):
    prediction = proj_out(expert1(expert4(x)))
    loss = torch.mean(abs(prediction - y))
    print(loss.item())
    opt.zero_grad()
    loss.backward()
    opt.step()
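The idea that an expert updates itself after every backward pass, without the caller holding an optimizer for it, can be illustrated with a toy example in plain Python. This is only a sketch under simplifying assumptions: ToyExpert is a hypothetical one-parameter model, and nothing here reflects how a hivemind server is actually implemented.

```python
class ToyExpert:
    """A one-parameter 'expert' computing y = w * x (hypothetical, not hivemind code)."""

    def __init__(self, w: float = 0.0, lr: float = 0.1):
        self.w = w
        self.lr = lr

    def forward(self, x: float) -> float:
        return self.w * x

    def backward(self, x: float, grad_output: float) -> float:
        grad_input = grad_output * self.w  # gradient handed back to the caller
        grad_w = grad_output * x           # gradient w.r.t. the expert's own weight
        self.w -= self.lr * grad_w         # one SGD step, applied immediately
        return grad_input


expert = ToyExpert()
x, target = 2.0, 3.0
losses = []
for _ in range(20):
    y = expert.forward(x)
    losses.append((y - target) ** 2)
    # The caller only sends d(loss)/dy; it never touches expert.w directly.
    expert.backward(x, grad_output=2 * (y - target))

print(losses[0], losses[-1])
```

The caller's loop contains no optimizer for the expert, yet the loss falls because each backward call also performs the update.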

Finally, you can try creating a mixture-of-experts layer:

import nest_asyncio;  nest_asyncio.apply()  # asyncio patch for jupyter. for now, we recommend using MoE from console

dmoe = hivemind.RemoteMixtureOfExperts(in_features=512, uid_prefix="expert", grid_size=(5,),
                                       dht=dht, k_best=2)

out = dmoe(torch.randn(3, 512))
out.sum().backward()
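For readers unfamiliar with mixture-of-experts layers, the sketch below shows the general idea behind an argument like k_best=2: keep the k highest-scoring experts and combine their outputs with softmax weights. How RemoteMixtureOfExperts scores and locates experts is not described in this article, so the scores, the stand-in experts, and the function top_k_mixture are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict


def top_k_mixture(x: float, scores: Dict[str, float],
                  experts: Dict[str, Callable[[float], float]], k: int) -> float:
    """Combine the k best-scoring experts, weighted by a softmax over their scores."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    top = max(scores[uid] for uid in best)
    # Subtract the maximum before exponentiating for numerical stability.
    exp = {uid: math.exp(scores[uid] - top) for uid in best}
    total = sum(exp.values())
    return sum(exp[uid] / total * experts[uid](x) for uid in best)


# Five stand-in experts named expert.0 to expert.4; expert.i just multiplies its input by i.
experts = {f"expert.{i}": (lambda x, i=i: i * x) for i in range(5)}
# Made-up gating scores; in a real layer these would come from a learned gating function.
scores = {"expert.0": 0.1, "expert.1": 2.0, "expert.2": -1.0, "expert.3": 2.0, "expert.4": 0.5}

out = top_k_mixture(1.0, scores, experts, k=2)
print(out)
```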