Pytorch-tutorials Study Notes (4)

Deep Learning with Sequence Data and Text

Some applications of RNNs:

  1. Document classification: classifying articles or classifying sentiment
  2. Sequence-to-sequence learning: machine translation
  3. Time-series forecasting: predicting future prices of a commodity given its prices on previous days

Working with text data

Like any other machine learning model, a deep learning model for text first needs the text to be converted into a numerical representation. This process is called vectorization.
Vectorization can be done in several different ways:

  1. Convert the text into words and represent each word as a vector
  2. Convert the text into characters and represent each character as a vector
  3. Create n-grams of words and represent them as vectors

Clearly, before text can be represented in vector form it has to be split into smaller units such as words or characters, which are called tokens. This process is called tokenization.

Tokenization

Converting text into words

For English text, Python's built-in split() method is enough for word-level tokenization; it splits the text into words based mainly on whitespace.

a = "Cao hong mei is my son"
a.split()

Chinese text, however, has no spaces between words, so the text is usually split into multi-character words rather than single characters. Here the jieba library is used for word segmentation.
Common Chinese word-segmentation algorithms: 1. string-matching methods such as forward maximum matching and backward maximum matching; 2. statistical / machine-learning methods such as HMM and CRF.
Several jieba segmentation modes:

  • jieba.cut takes three arguments: the string to segment, cut_all to control whether full mode is used, and HMM to control whether the HMM model is used
  • jieba.cut_for_search takes two arguments: the string to segment and whether to use the HMM model. This mode is intended for building inverted indexes in search engines and produces finer-grained segmentation.

Notes:

  1. jieba.cut and jieba.cut_for_search both return an iterable generator; use a for loop to get each segmented word (unicode)
  2. jieba.lcut and jieba.lcut_for_search return a list directly
  3. jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a custom tokenizer, which allows different dictionaries to be used at the same time
  4. The string to be segmented can be a unicode/UTF-8 string or a GBK string. Note: passing a GBK string directly is not recommended, because it may be unpredictably mis-decoded as UTF-8

    import jieba
    jieba.lcut("曹红梅是我儿子", cut_all=True)
    jieba.lcut("曹红梅是我儿子", cut_all=False)
    jieba.lcut_for_search("曹红梅是我儿子")

Converting text into characters

Converting text into characters is generally only needed for English; Chinese does not need it.
It is very simple: just call list().

a = "Cao hong mei is my son"
list(a)

N-gram representation

For English n-grams the nltk library can be used: pass the tokenized word sequence to ngrams and specify n.

from nltk import ngrams
a = "cao hong mei is my son"
list(ngrams(a.split(), 2))

Vectorization

Two popular vectorization methods: 1. one-hot encoding 2. word embedding

One-hot encoding

One-hot encoding assumes every token already has an integer label and maps each token to that label; a token's one-hot vector is then an all-zero vector with a 1 at the position of its label.
The example below uses sklearn's preprocessing module: first LabelEncoder, then OneHotEncoder.

from nltk import ngrams
from sklearn import preprocessing

a = "cao hong mei is my son"
b = list(ngrams(a.split(), 3))
c = [' '.join(w) for w in b]      # each element of b is a tuple, so join it into a string
Lencoder = preprocessing.LabelEncoder()
Lcoder = Lencoder.fit_transform(c)
enc = preprocessing.OneHotEncoder()
Lcoder = Lcoder.reshape(-1, 1)    # OneHotEncoder expects a 2D array
enc.fit_transform(Lcoder).toarray()

Word embedding

Word embedding is a very popular way of representing text in deep learning. It embeds the originally sparse one-hot vectors into a dense space, turning sparse vectors into dense vectors, and in doing so gives neighboring words similar representations.
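As a minimal illustrative sketch (the vocabulary size and embedding dimension below are arbitrary values, not from the original text), nn.Embedding maps integer word indices to dense vectors:

import torch
import torch.nn as nn

# hypothetical sizes, chosen only for illustration
vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)

# a batch of word indices, e.g. produced by a word2index lookup
word_ids = torch.LongTensor([[1, 3, 5]])   # shape [batch=1, seq_len=3]
dense = embedding(word_ids)                # shape [1, 3, 4]: one dense vector per token
print(dense.shape)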


Below is an introduction to torchtext, the library used here.
API:

  1. torchtext.data
  • torchtext.data.Example: represents one sample, data + label
  • torchtext.vocab.Vocab: vocabulary handling
  • torchtext.data.Dataset: the dataset class; __getitem__ returns an Example; used to build datasets
  • torchtext.data.Field: defines how a field (text field, label field) is processed, i.e. the preprocessing used when creating Examples and some of the processing done at batching time
  • torchtext.data.Iterator: iterator that generates batches
  2. torchtext.datasets: contains common datasets

    from torchtext.data import Field, Example, TabularDataset
    from torchtext.data import BucketIterator

Field: defines a field and the text-preprocessing methods applied to it
Example: represents one sample, usually "data + label"
TabularDataset: reads data from files and builds a Dataset; a Dataset is a collection of Example instances
BucketIterator: an iterator that generates batches; Iterator is similar, but BucketIterator is more powerful and supports sorting, dynamic padding, etc.

The TorchText data-preprocessing pipeline:

  1. Define how samples are processed: torchtext.data.Field

  2. Load the corpus (everything is still a string): torchtext.data.datasets

  • Inside the Datasets, torchtext turns the corpus into individual torchtext.data.Example instances
  • When a torchtext.data.Example is created, field.preprocess is called

  3. Build the vocabulary, which maps string tokens to indices: field.build_vocab()

  • The vocabulary handles: string token to index, index to string token, and string token to word vector

  4. Batch the processed data: torchtext.data.Iterator

  • Batches the data held in the Datasets
  • Includes padding operations so that all examples in a batch have the same length
  • This is where string tokens are converted to indices

Parameters of a Field object:
sequential: whether the data represents a sequence; if False, no tokenization is applied. Default: True
use_vocab: whether to use a Vocab object; if False, the data must already be numeric. Default: True
init_token: a token prepended to every example. Default: None
eos_token: a token appended to every example. Default: None
fix_length: pad or truncate every example to this length, padding with pad_token. Default: None
tensor_type: the tensor type the data is converted to. Default: torch.LongTensor
preprocessing: a pipeline applied after tokenization and before numericalization. Default: None
postprocessing: a pipeline applied after numericalization and before conversion to a tensor. Default: None
lower: whether to lowercase the text. Default: False
tokenize: the tokenization function. Default: str.split
pad_token: the token used for padding. Default: "<pad>"
unk_token: the token used for out-of-vocabulary words. Default: "<unk>"
pad_first: whether to pad at the beginning of the sequence. Default: False
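As a sketch of how these parameters are typically combined (using the legacy torchtext API assumed throughout this post; the concrete values mirror the IMDB example later on):

from torchtext import data

# text field: lowercased, tokenized with str.split, padded/truncated to 200 tokens
TEXT = data.Field(sequential=True, lower=True, fix_length=200, tokenize=str.split)
# label field: a single categorical value, so no tokenization
LABEL = data.Field(sequential=False)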

torchtext.TabularDataset inherits from PyTorch's Dataset and provides a method that downloads compressed data and decompresses it (zip, gz, and tgz are supported).
The splits method can read the training, validation, and test sets at the same time.
TabularDataset makes it easy to read CSV, TSV, or JSON files.

train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('Text', TEXT), ('Label', LABEL)])

After loading the data, the vocabulary can be built; pre-trained word vectors can be used when building it.

TEXT.build_vocab(train, vectors="glove.6B.100d")

Iterator: the bridge from torchtext to the model's input. It provides the usual processing steps such as shuffling and sorting, can change the batch size dynamically, and also has a splits method that produces iterators for the training, validation, and test sets at the same time.
Its parameters:
dataset: the loaded dataset
batch_size: the batch size
sort_key: the key used for sorting
train: whether this is a training set
repeat: whether to repeat the iterator across epochs
shuffle: whether to shuffle the data
sort: whether to sort the data
sort_within_batch: whether to sort within each batch
device: the device to use

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)

torchtext ships with pre-trained word-embedding vectors.
You can also train embeddings for your own task; below is an embedding layer.

import torch.nn as nn
from torch.nn import functional as F

class EmbNet(nn.Module):
    def __init__(self, emb_size, hidden_size1, hidden_size2=400):
        super().__init__()
        self.embedding = nn.Embedding(emb_size, hidden_size1)
        self.fc = nn.Linear(hidden_size2, 3)
    def forward(self, x):
        embeds = self.embedding(x).view(x.size(0), -1)
        out = self.fc(embeds)
        return F.log_softmax(out, dim=-1)

The training procedure:

from torch.nn import functional as F

def fit(epoch, model, data_loader, phase="training", volatile=False):
    if phase == "training":
        model.train()
    if phase == "validation":
        model.eval()
        volatile = True
    running_loss = 0.0
    running_correct = 0
    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.text, batch.label
        if is_cuda:
            text, target = text.cuda(), target.cuda()
        if phase == "training":
            optimizer.zero_grad()
        output = model(text)
        loss = F.nll_loss(output, target)
        running_loss += F.nll_loss(output, target, size_average=False).data[0]
        preds = output.data.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.data.view_as(preds)).cpu().sum()
        if phase == "training":
            loss.backward()
            optimizer.step()
    loss = running_loss / len(data_loader.dataset)
    accuracy = 100. * running_correct / len(data_loader.dataset)
    return loss, accuracy

Recurrent neural networks

The RNN architecture has already been introduced many times; the focus below is on the code.

A standard RNN

Defining an RNN network:

import torch
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return Variable(torch.zeros(1, self.hidden_size))

Design a loop that feeds the data in one step at a time:

def fit(model, text):
    hidden = model.initHidden()
    for i in range(len(text)):
        output, hidden = model(text[i], hidden)

RNN in the PyTorch library

The input of nn.RNN has the shape [seq_len, batch_size, input_size]. Note that this differs from the input layout used with CNNs, which is [batch_size, channel, W, H]: for a CNN the batch size is the first dimension, whereas for an RNN it is the second dimension. seq_len is the maximum length of a sentence. torch.nn.RNN() has a batch_first parameter that switches the module to a batch-first layout.
The built-in RNN iterates over seq_len, i.e. over the time steps, automatically.
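A small sketch (shapes chosen arbitrarily) of what batch_first=True changes:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=5, hidden_size=10, num_layers=2, batch_first=True)
x = torch.randn(3, 6, 5)    # [batch_size=3, seq_len=6, input_size=5] because batch_first=True
out, ht = rnn(x)
print(out.shape)            # torch.Size([3, 6, 10]) -> [batch, seq_len, hidden_size]
print(ht.shape)             # torch.Size([2, 3, 10]) -> [num_layers, batch, hidden_size] (unchanged)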

RNN parameters

RNN
- input_size: the feature dimension of the input
- hidden_size: the feature dimension of the output; unless something special is done, this is the size of out
- num_layers: the number of layers
- nonlinearity: the nonlinear activation function, "tanh" by default
- bias: whether to use a bias, enabled by default
- batch_first: the layout of the input data, False by default
- dropout: whether to apply dropout to the outputs of all layers except the last
- bidirectional: whether to use a bidirectional RNN, False by default

Example:

rnn_seq = nn.RNN(5, 10, 2)    # input feature dim 5, hidden feature dim 10, 2 layers
x = torch.randn(6, 3, 5)      # sentence length 6, batch 3, input feature dim 5
out, ht = rnn_seq(x)          # h0 can be left out; it defaults to an all-zero hidden state
# out, ht = rnn_seq(x, h0)

Here out has size (6, 3, 10) and ht has size (2, 3, 10): out has shape [seq_len, batch_size, hidden_size] and ht has shape [num_layers*num_directions, batch_size, hidden_size].

At a given time step the output and the hidden state of the last layer are the same, but out keeps the output of every time step (so its 0th dimension has size seq_len), whereas ht does not need to keep every time step and is simply overwritten at each step, so its 0th dimension only depends on num_layers and num_directions.
Note: out[-1] equals ht[-1].
For more detail on this point, see the following blog post:
https://blog.csdn.net/VioletHan7/article/details/81292275
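A minimal check of the relationship above, using the same sizes as the example (single-direction RNN):

import torch
import torch.nn as nn

rnn_seq = nn.RNN(5, 10, 2)
x = torch.randn(6, 3, 5)
out, ht = rnn_seq(x)
print(out.shape)                        # torch.Size([6, 3, 10])
print(ht.shape)                         # torch.Size([2, 3, 10])
# the last time step of out equals the final hidden state of the last layer
print(torch.allclose(out[-1], ht[-1]))  # True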

LSTM

The LSTM in the PyTorch library

# input dim 50, hidden dim 100, 2 layers
lstm_seq = nn.LSTM(50, 100, num_layers=2)
# input: seq_len=10, batch=3, input dim=50
lstm_input = torch.randn(10, 3, 50)
out, (h, c) = lstm_seq(lstm_input)    # use the default all-zero hidden state

out has size (seq_len, batch_size, hidden_size).
h and c both have size (num_layers*num_directions, batch_size, hidden_size).
Note: out[-1,:,:] equals h[-1,:,:].
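The same kind of shape check for the LSTM above (sizes as in the example):

import torch
import torch.nn as nn

lstm_seq = nn.LSTM(50, 100, num_layers=2)
lstm_input = torch.randn(10, 3, 50)
out, (h, c) = lstm_seq(lstm_input)
print(out.shape)                       # torch.Size([10, 3, 100])
print(h.shape, c.shape)                # torch.Size([2, 3, 100]) each
print(torch.allclose(out[-1], h[-1]))  # True: last output step == final hidden state of the last layer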

Example

  1. Prepare the data

from torchtext import data
import torchtext

TEXT = data.Field(lower=True, fix_length=200, batch_first=False)
LABEL = data.Field(sequential=False)
train, test = torchtext.datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train, vectors="glove.6B.50d", max_size=10000, min_freq=10)
LABEL.build_vocab(train)
  2. Create batches
    The data here will have shape [sequence length, batch size], in this case [200, 32].

train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=32)
train_iter.repeat = False
test_iter.repeat = False
  3. Create the network
    The embedding layer has size vocabulary_size × hidden_size (because the input is conceptually one-hot).
    Assuming hidden_size is 100, the output of the embedding layer has shape [200, 32, 100], where 100 is the embedding dimension.
    The LSTM takes the embedding layer's output together with two hidden variables. The hidden variables should be tensors of the same type as the embedding output, with shape [num_layers, batch_size, hidden_size]. Processing the sequence, the LSTM produces an output of shape [seq_len, batch_size, hidden_size]; the output of the last time step has shape [batch_size, hidden_size] and is passed to a linear layer that maps it onto the class outputs.

class IMDBRnn(nn.Module):
    def __init__(self, n_vocab, hidden_size, n_cat, bs, nl=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.bs = bs
        self.nl = nl
        self.e = nn.Embedding(n_vocab, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, nl)
        self.fc2 = nn.Linear(hidden_size, n_cat)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        bs = x.size()[1]
        if bs != self.bs:
            self.bs = bs
        e_out = self.e(x)
        h0 = c0 = Variable(e_out.data.new(*(self.nl, self.bs, self.hidden_size)).zero_())
        rnn_o, _ = self.rnn(e_out, (h0, c0))
        rnn_o = rnn_o[-1]
        fc = F.dropout(self.fc2(rnn_o), p=0.8)
        return self.softmax(fc)
  4. Train the model

import torch.optim as optim

n_vocab = len(TEXT.vocab.stoi)    # vocabulary size
n_hidden = 100                    # embedding / hidden size

model = IMDBRnn(n_vocab, n_hidden, 3, bs=32)
model = model.cuda()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

def fit(epoch, model, data_loader, phase='training', volatile=False):
    if phase == 'training':
        model.train()
    if phase == 'validation':
        model.eval()
        volatile = True
    running_loss = 0.0
    running_correct = 0
    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.text, batch.label
        if is_cuda:
            text, target = text.cuda(), target.cuda()
        if phase == 'training':
            optimizer.zero_grad()
        output = model(text)
        loss = F.nll_loss(output, target)
        running_loss += F.nll_loss(output, target, size_average=False).data[0]
        preds = output.data.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.data.view_as(preds)).cpu().sum()
        if phase == "training":
            loss.backward()
            optimizer.step()
    loss = running_loss / len(data_loader.dataset)
    accuracy = 100. * running_correct / len(data_loader.dataset)
    return loss, accuracy
train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []
for epoch in range(1, 5):
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase="training")
    val_epoch_loss, val_epoch_accuracy = fit(epoch, model, test_iter, phase='validation')
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)

GRU

gru_seq = nn.GRU(10, 20, 2)           # input dim 10, hidden dim 20, 2 layers
gru_input = torch.randn(3, 32, 10)    # seq_len, batch, input dim
out, h = gru_seq(gru_input)

Convolutional networks on sequence data

Compared with image data, text data lacks a channel dimension, so the convolution used here is Conv1d.

class IMDBCnn(nn.Module):
    def __init__(self, n_vocab, hidden_size, n_cat, bs=1, kernel_size=3, max_len=200):
        super().__init__()
        self.hidden_size = hidden_size
        self.bs = bs
        self.e = nn.Embedding(n_vocab, hidden_size)
        self.cnn = nn.Conv1d(max_len, hidden_size, kernel_size)
        self.avg = nn.AdaptiveAvgPool1d(10)
        self.fc = nn.Linear(1000, n_cat)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        bs = x.size()[0]
        if bs != self.bs:
            self.bs = bs
        e_out = self.e(x)
        cnn_o = self.cnn(e_out)
        cnn_avg = self.avg(cnn_o)
        cnn_avg = cnn_avg.view(self.bs, -1)
        fc = F.dropout(self.fc(cnn_avg), p=0.5)
        return self.softmax(fc)

train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []
for epoch in range(1, 5):
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase="training")
    val_epoch_loss, val_epoch_accuracy = fit(epoch, model, test_iter, phase='validation')
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)

DeepNLP in Practice

Skip-gram-Naive-Softmax

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import torch.nn.functional as F
import nltk
import random
import numpy as np
from collections import Counter

random.seed(1024)

corpus = list(nltk.corpus.gutenberg.sents('melville-moby_dick.txt'))[:100]    # sample some sentences

corpus = [[word.lower() for word in sent] for sent in corpus]

flatten = lambda l: [item for sublist in l for item in sublist]
word_count = Counter(flatten(corpus))
border = int(len(word_count) * 0.01)

# treat the most frequent and the rarest 1% of words as stopwords
stopwords = word_count.most_common()[:border] + list(reversed(word_count.most_common()))[:border]
stopwords = [s[0] for s in stopwords]

vocab = list(set(flatten(corpus)) - set(stopwords))
vocab.append('<UNK>')
word2index = {'<UNK>': 0}
for vo in vocab:
    if word2index.get(vo) is None:
        word2index[vo] = len(word2index)

index2word = {v: k for k, v in word2index.items()}

WINDOW_SIZE = 3
windows = flatten([list(nltk.ngrams(['<DUMMY>'] * WINDOW_SIZE + c + ['<DUMMY>'] * WINDOW_SIZE, WINDOW_SIZE * 2 + 1)) for c in corpus])

train_data = []
for window in windows:
    for i in range(WINDOW_SIZE * 2 + 1):
        if i == WINDOW_SIZE or window[i] == '<DUMMY>':
            continue
        train_data.append((window[WINDOW_SIZE], window[i]))
print(train_data[:WINDOW_SIZE * 2])

X_p = []
y_p = []
def prepare_word(word, word2index):
    return Variable(torch.LongTensor([word2index[word]]) if word2index.get(word) is not None else torch.LongTensor([word2index["<UNK>"]]))

for tr in train_data:
    X_p.append(prepare_word(tr[0], word2index).view(1, -1))
    y_p.append(prepare_word(tr[1], word2index).view(1, -1))

train_data = list(zip(X_p, y_p))

Model

class Skipgram(nn.Module):
    def __init__(self, vocab_size, projection_dim):
        super(Skipgram, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, projection_dim)
        self.embedding_u = nn.Embedding(vocab_size, projection_dim)

        self.embedding_v.weight.data.uniform_(-1, 1)    # init center-word embeddings
        self.embedding_u.weight.data.uniform_(0, 0)     # init context-word embeddings to zero

    def forward(self, center_words, target_words, other_words):
        center_embeds = self.embedding_v(center_words)    # B x 1 x D
        target_embeds = self.embedding_u(target_words)    # B x 1 x D
        outer_embeds = self.embedding_u(other_words)      # B x V x D

        scores = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)        # B x 1
        norm_scores = outer_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)    # B x V

        # negative log-likelihood of the full softmax over the vocabulary
        nll = -torch.mean(torch.log(torch.exp(scores) / torch.sum(torch.exp(norm_scores), 1).unsqueeze(1)))
        return nll

    def prediction(self, inputs):
        embeds = self.embedding_v(inputs)
        return embeds
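For reference, the nll computed in forward is the (naive) softmax skip-gram objective averaged over the batch, written here with center embedding v_c and context embeddings u:

$$\mathcal{L} = -\frac{1}{B}\sum_{(c,o)} \log \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$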

train

EMBEDDING_SIZE = 30
BATCH_SIZE = 256
EPOCH = 100
model = Skipgram(len(word2index), EMBEDDING_SIZE)
optimizer = optim.Adam(model.parameters(), lr=0.01)

def getBatch(batch_size, train_data):
    random.shuffle(train_data)
    s = 0
    e = batch_size
    while e < len(train_data):
        batch = train_data[s:e]
        tmp = e
        e = e + batch_size
        s = tmp
        yield batch

    if e > len(train_data):
        batch = train_data[s:]
        yield batch

def prepare_sequence(seq, word2index):
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return Variable(torch.LongTensor(idxs))

losses = []
for epoch in range(EPOCH):
    for i, batch in enumerate(getBatch(BATCH_SIZE, train_data)):
        inputs, targets = zip(*batch)

        inputs = torch.cat(inputs)      # B x 1
        targets = torch.cat(targets)    # B x 1
        # every word in the vocabulary serves as the softmax normalization term
        vocabs = prepare_sequence(list(vocab), word2index).expand(inputs.size(0), len(vocab))

        model.zero_grad()
        loss = model(inputs, targets, vocabs)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    if epoch % 10 == 0:
        print("Epoch: %d, mean_loss : %.02f" % (epoch, np.mean(losses)))
        losses = []

test

USE_CUDA = torch.cuda.is_available()

def word_similarity(target, vocab):
    if USE_CUDA:
        target_V = model.prediction(prepare_word(target, word2index))
    else:
        target_V = model.prediction(prepare_word(target, word2index))
    similarities = []
    for i in range(len(vocab)):
        if vocab[i] == target:
            continue
        if USE_CUDA:
            vector = model.prediction(prepare_word(list(vocab)[i], word2index))
        else:
            vector = model.prediction(prepare_word(list(vocab)[i], word2index))
        cosine_sim = F.cosine_similarity(target_V, vector).data.tolist()[0]
        similarities.append([vocab[i], cosine_sim])
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]    # sort by similarity

test = random.choice(list(vocab))
word_similarity(test, vocab)

Skip-gram with negative sampling

corpus = list(nltk.corpus.gutenberg.sents('melville-moby_dick.txt'))[:500]
corpus = [[word.lower() for word in sent] for sent in corpus]
flatten = lambda l: [item for sublist in l for item in sublist]    # flatten the nested list
# count the occurrences of each word
word_count = Counter(flatten(corpus))

MIN_COUNT = 3
exclude = []    # collect rare words
for w, c in word_count.items():
    if c < MIN_COUNT:
        exclude.append(w)

vocab = list(set(flatten(corpus)) - set(exclude))
# build word2index and index2word
word2index = {}
for vo in vocab:
    if word2index.get(vo) is None:
        word2index[vo] = len(word2index)
index2word = {v: k for k, v in word2index.items()}

# build window data
WINDOW_SIZE = 5
windows = flatten([list(nltk.ngrams(['<DUMMY>'] * WINDOW_SIZE + c + ['<DUMMY>'] * WINDOW_SIZE, WINDOW_SIZE * 2 + 1)) for c in corpus])

train_data = []
for window in windows:
    for i in range(WINDOW_SIZE * 2 + 1):
        if window[i] in exclude or window[WINDOW_SIZE] in exclude:
            continue
        if i == WINDOW_SIZE or window[i] == "<DUMMY>":
            continue
        train_data.append((window[WINDOW_SIZE], window[i]))

def prepare_word(word, word2index):
    return Variable(torch.LongTensor([word2index[word]]) if word2index.get(word) is not None else torch.LongTensor([word2index['<UNK>']]))

X_p = []
y_p = []
for tr in train_data:
    X_p.append(prepare_word(tr[0], word2index).view(1, -1))
    y_p.append(prepare_word(tr[1], word2index).view(1, -1))

train_data = list(zip(X_p, y_p))
Z = 0.001
num_total_words = sum([c for w, c in word_count.items() if w not in exclude])
unigram_table = []
for vo in vocab:
    # unigram distribution raised to the 3/4 power, as in the word2vec paper
    unigram_table.extend([vo] * int(((word_count[vo] / num_total_words) ** 0.75) / Z))

def prepare_sequence(seq, word2index):
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return Variable(torch.LongTensor(idxs))

Negative Sampling

def negative_sampling(targets, unigram_table, k):
    batch_size = targets.size(0)
    neg_samples = []
    for i in range(batch_size):
        nsample = []
        target_index = targets[i][0]
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            if word2index[neg] == target_index:
                continue
            nsample.append(neg)
        neg_samples.append(prepare_sequence(nsample, word2index).view(1, -1))
    return torch.cat(neg_samples)

Model

class SkipgramNegSampling(nn.Module):
    def __init__(self, vocab_size, projection_dim):
        super(SkipgramNegSampling, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, projection_dim)
        self.embedding_u = nn.Embedding(vocab_size, projection_dim)
        self.logsigmoid = nn.LogSigmoid()

        initrange = (2.0 / (vocab_size + projection_dim)) ** 0.5    # Xavier init
        self.embedding_v.weight.data.uniform_(-initrange, initrange)
        self.embedding_u.weight.data.uniform_(-0.0, 0.0)

    def forward(self, center_words, target_words, negative_words):
        center_embeds = self.embedding_v(center_words)    # B x 1 x D
        target_embeds = self.embedding_u(target_words)    # B x 1 x D

        neg_embeds = -self.embedding_u(negative_words)    # B x K x D

        positive_score = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)    # B x 1
        negative_score = torch.sum(self.logsigmoid(neg_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)), 1).view(negative_words.size(0), -1)

        loss = self.logsigmoid(positive_score) + negative_score
        return -torch.mean(loss)

    def prediction(self, inputs):
        embeds = self.embedding_v(inputs)
        return embeds
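The loss returned by forward corresponds to the standard negative-sampling objective (hence the leading minus sign), with center embedding v_c, positive context embedding u_o, and K negative samples u_{w_k}:

$$\mathcal{L} = -\Big(\log\sigma(u_o^{\top} v_c) + \sum_{k=1}^{K}\log\sigma(-u_{w_k}^{\top} v_c)\Big)$$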

Train

EMBEDDING_SIZE = 30
BATCH_SIZE = 256
EPOCH = 100
NEG = 10    # number of negative samples
losses = []
model = SkipgramNegSampling(len(word2index), EMBEDDING_SIZE)
if USE_CUDA:
    model = model.cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(EPOCH):
    for i, batch in enumerate(getBatch(BATCH_SIZE, train_data)):

        inputs, targets = zip(*batch)

        inputs = torch.cat(inputs)      # B x 1
        targets = torch.cat(targets)    # B x 1
        negs = negative_sampling(targets, unigram_table, NEG)
        model.zero_grad()

        loss = model(inputs, targets, negs)

        loss.backward()
        optimizer.step()

        losses.append(loss.data.tolist()[0])
    if epoch % 10 == 0:
        print("Epoch : %d, mean_loss : %.02f" % (epoch, np.mean(losses)))
        losses = []

Recurrent Neural Networks and Language Models

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import torch.nn.functional as F
import nltk
import random
import numpy as np
from collections import Counter, OrderedDict
from copy import deepcopy

USE_CUDA = torch.cuda.is_available()

flatten = lambda l: [item for sublist in l for item in sublist]

def prepare_sequence(seq, to_index):
    idxs = list(map(lambda w: to_index[w] if to_index.get(w) is not None else to_index['<unk>'], seq))
    return torch.LongTensor(idxs)

def prepare_ptb_dataset(filename, word2index=None):
    corpus = open(filename, 'r', encoding='utf-8').readlines()
    corpus = flatten([co.strip().split() + ['</s>'] for co in corpus])

    if word2index == None:
        vocab = list(set(corpus))
        word2index = {'<unk>': 0}
        for vo in vocab:
            if word2index.get(vo) is None:
                word2index[vo] = len(word2index)
    return prepare_sequence(corpus, word2index), word2index

def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    data = data.view(bsz, -1).contiguous()
    if USE_CUDA:
        data = data.cuda()
    return data

def getBatch(data, seq_length):
    for i in range(0, data.size(1) - seq_length, seq_length):
        inputs = Variable(data[:, i:i + seq_length])
        targets = Variable(data[:, (i + 1):(i + 1) + seq_length].contiguous())
        yield (inputs, targets)
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, n_layers=1, dropout_p=0.5):
        super(LanguageModel, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.LSTM(embedding_size, hidden_size, n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout_p)

    def init_weight(self):
        self.embed.weight = nn.init.xavier_uniform(self.embed.weight)
        self.linear.weight = nn.init.xavier_uniform(self.linear.weight)
        self.linear.bias.data.fill_(0)

    def init_hidden(self, batch_size):
        hidden = Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size))
        context = Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size))
        return (hidden.cuda(), context.cuda()) if USE_CUDA else (hidden, context)

    def detach_hidden(self, hiddens):
        return tuple([hidden.detach() for hidden in hiddens])

    def forward(self, inputs, hidden, is_training=False):
        embeds = self.embed(inputs)
        if is_training:
            embeds = self.dropout(embeds)
        out, hidden = self.rnn(embeds, hidden)
        return self.linear(out.contiguous().view(out.size(0) * out.size(1), -1)), hidden

Seq2Seq with attention

The encoder of a seq2seq network is an RNN that outputs a value for every word in the input sentence. For each input word the encoder outputs a vector and a hidden state; the hidden state, together with the next word, forms the input of the next step.

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, n_layers=1, bidirec=False):
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_size, embedding_size)

        if bidirec:
            self.n_direction = 2
            self.gru = nn.GRU(embedding_size, hidden_size, n_layers, batch_first=True, bidirectional=True)
        else:
            self.n_direction = 1
            self.gru = nn.GRU(embedding_size, hidden_size, n_layers, batch_first=True)

    def init_hidden(self, inputs):
        hidden = Variable(torch.zeros(self.n_layers * self.n_direction, inputs.size(0), self.hidden_size))    # inputs.size(0) = batch_size
        return hidden.cuda()

    def init_weight(self):
        self.embedding.weight = nn.init.xavier_uniform(self.embedding.weight)
        self.gru.weight_hh_l0 = nn.init.xavier_uniform(self.gru.weight_hh_l0)
        self.gru.weight_ih_l0 = nn.init.xavier_uniform(self.gru.weight_ih_l0)

    def forward(self, inputs, input_lengths):
        hidden = self.init_hidden(inputs)
        embedded = self.embedding(inputs)
        # sequences of unequal length are padded; input_lengths gives the real length of each sequence
        packed = pack_padded_sequence(embedded, input_lengths, batch_first=True)
        outputs, hidden = self.gru(packed, hidden)
        outputs, output_lengths = pad_packed_sequence(outputs, batch_first=True)
        if self.n_layers > 1:
            if self.n_direction == 2:
                hidden = hidden[-2:]
            else:
                hidden = hidden[-1]
        return outputs, torch.cat([h for h in hidden], 1).unsqueeze(1)

The decoder is another RNN that takes the encoder's output vector and produces a sequence of words to form the translation.

In the simplest seq2seq decoder, only the encoder's last output is used. This last output is sometimes called the context vector because it encodes the context of the whole sequence, and it is used as the decoder's initial hidden state. If only this context vector is passed between the encoder and the decoder, that single vector carries the burden of encoding the entire sentence. Attention lets the decoder network "focus" on different parts of the encoder's outputs at every step of its own output. First a set of attention weights is computed; these are multiplied by the encoder output vectors to form a weighted combination.

class Decoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, n_layers=1, dropout_p=0.1):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_size, embedding_size)
        self.dropout = nn.Dropout(dropout_p)
        self.gru = nn.GRU(embedding_size + hidden_size, hidden_size, n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, input_size)
        self.attn = nn.Linear(hidden_size, self.hidden_size)

    def init_hidden(self, inputs):
        hidden = Variable(torch.zeros(self.n_layers, inputs.size(0), self.hidden_size))
        return hidden.cuda()

    def init_weight(self):
        self.embedding.weight = nn.init.xavier_uniform(self.embedding.weight)
        self.gru.weight_hh_l0 = nn.init.xavier_uniform(self.gru.weight_hh_l0)
        self.gru.weight_ih_l0 = nn.init.xavier_uniform(self.gru.weight_ih_l0)

        self.linear.weight = nn.init.xavier_uniform(self.linear.weight)
        self.attn.weight = nn.init.xavier_uniform(self.attn.weight)

    def Attention(self, hidden, encoder_outputs, encoder_maskings):
        # attention computation; hidden is the decoder's hidden state
        '''
        hidden: 1, B, D
        encoder_outputs: B, T, D
        encoder_maskings: B, T
        '''
        hidden = hidden[0].unsqueeze(2)    # B, D, 1
        batch_size = encoder_outputs.size(0)
        max_len = encoder_outputs.size(1)
        energies = self.attn(encoder_outputs.contiguous().view(batch_size * max_len, -1))
        energies = energies.view(batch_size, max_len, -1)
        attn_energies = energies.bmm(hidden).squeeze(2)    # multiply the encoder states by the decoder hidden state

        alpha = F.softmax(attn_energies, 1)    # attention scores alpha
        alpha = alpha.unsqueeze(1)
        context = alpha.bmm(encoder_outputs)
        return context, alpha

    def forward(self, inputs, context, max_length, encoder_outputs, encoder_maskings=None, is_training=False):
        # inputs is the start token, context is the encoder's hidden output, encoder_outputs is the encoder's sequence output
        embedded = self.embedding(inputs)
        hidden = self.init_hidden(inputs)
        if is_training:
            embedded = self.dropout(embedded)
        decode = []

        for i in range(max_length):
            _, hidden = self.gru(torch.cat((embedded, context), 2), hidden)
            concated = torch.cat((hidden, context.transpose(0, 1)), 2)
            score = self.linear(concated.squeeze(0))
            softmaxed = F.log_softmax(score, 1)
            decode.append(softmaxed)
            decoded = softmaxed.max(1)[1]    # the predicted token
            # repeat the embedding step for the predicted token
            embedded = self.embedding(decoded).unsqueeze(1)
            if is_training:
                embedded = self.dropout(embedded)

            context, alpha = self.Attention(hidden, encoder_outputs, encoder_maskings)

        scores = torch.cat(decode, 1)
        return scores.view(inputs.size(0) * max_length, -1)

    def decode(self, context, encoder_outputs):
        start_decode = Variable(torch.LongTensor([[target2index['<s>']] * 1])).transpose(0, 1)
        embedded = self.embedding(start_decode)
        hidden = self.init_hidden(start_decode)

        decodes = []
        attentions = []
        decoded = embedded
        while decoded.data.tolist()[0] != target2index['</s>']:
            _, hidden = self.gru(torch.cat((embedded, context), 2), hidden)
            concated = torch.cat((hidden, context.transpose(0, 1)), 2)
            score = self.linear(concated.squeeze(0))
            softmaxed = F.log_softmax(score, 1)
            decodes.append(softmaxed)
            decoded = softmaxed.max(1)[1]
            embedded = self.embedding(decoded).unsqueeze(1)
            context, alpha = self.Attention(hidden, encoder_outputs, None)
            attentions.append(alpha.squeeze(1))

        return torch.cat(decodes).max(1)[1], torch.cat(attentions)


Title: Pytorch-tutorials Study Notes (4)

Author: Yif Du

Published: 2019-03-07 08:03

Last updated: 2019-03-12 14:03

Original link: http://yifdu.github.io/2019/03/07/Pytorch-tutorials-学习(四)/

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Please keep the original link and author when reposting.