由Yif翻译，仅供学习严禁任何商业用途

Chapter 6. Sequence Modeling for Natural Language Processing

序列是项目的有序集合。传统的机器学习假设数据点是独立的、相同分布的(IID)，但在许多情况下，如语言、语音和时间序列数据，一个数据项取决于它之前或之后的数据项。这种数据也称为序列数据。在人类语言中，顺序信息无处不在。例如，语音可以被看作是音素的基本单元序列。在像英语这样的语言中，句子中的单词不是随意的。他们可能会被它之前或之后的词所束缚。例如，在英语中，介词“of”后面可能跟着冠词“the”;例如，“The lion is the king of the jungle.”。例如，在许多语言中，包括英语，动词的数量必须与句子主语的数量一致。这里有一个例子:
The book is on the table
The books are on the table.
有时这些依赖项或约束可以是任意长的。例如:
The book that I got yesterday is on the table.
The books read by the second grade children are shelved in the lower rack.
简而言之，理解序列对于理解人类语言至关重要。在前几章中，我们介绍了前馈神经网络，如多层感知器(MLPs)和卷积神经网络(CNNs)，以及向量表示的能力。尽管使用这些技术可以完成大量的自然语言处理(NLP)任务，但正如我们将在本章以及第7章和第8章中学习的那样，它们并不能充分建模序列。

传统的方法，模型序列在NLP使用隐马尔科夫模型，条件随机场，和其他类型的概率图形模型，虽然没有讨论在这本书仍然是相关的。我们邀请您(Koller and Friedman, 2009)。

在深度学习中，建模序列涉及到维护隐藏的“状态信息”或隐藏状态。当序列中的每个条目被匹配时——例如，当一个句子中的每个单词被模型看到时——隐藏状态就会被更新。因此，隐藏状态(通常是一个向量)封装了到目前为止序列所看到的一切。这个隐藏的状态向量，也称为序列表示，可以根据我们要解决的任务以无数种方式在许多序列建模任务中使用，从对序列进行分类到预测序列。在本章中，我们将研究序列数据的分类，但是第7章将介绍如何使用序列模型来生成序列。

我们首先介绍最基本的神经网络序列模型:递归神经网络。在此基础上，给出了分类设置中递归神经网络的端到端实例。具体来说，您将看到一个基于字符的RNN来将姓氏分类到它们各自的国籍。姓氏示例表明序列模型可以捕获语言中的正字法(子词)模式。这个示例的开发方式使读者能够将模型应用于其他情况，包括建模文本序列，其中数据项是单词而不是字符。

Introduction to Recurrent Neural Networks

递归神经网络(RNNs)的目的是建立张量序列的模型。rnn和前馈网络一样，是一类模型。RNN家族中有几个不同的成员，但在本章中，我们只讨论最基本的形式，有时称为Elman RNN。递归网络(基本的Elman形式和第7章中概述的更复杂的形式)的目标是学习序列的表示。这是通过维护一个隐藏的状态向量来实现的，它捕获了序列的当前状态。隐藏状态向量由当前输入向量和前一个隐藏状态向量计算得到。这些关系如图6-1所示，图6-1显示了计算依赖项的函数(左)视图和“展开”(右)视图。

![pic1](pic1.png “图6 - 1。（左）Elman RNN的函数视图将递归关系显示为隐藏向量的反馈循环。（右）“展开”视图可以清楚地显示计算关系，因为每个时间步的隐藏向量依赖于该时间步的输入和前一个时间步的隐藏向量。

在每次步骤中使用相同的权重将输入转换为输出是参数共享的另一个例子。在第4章中，我们看到了CNNs如何跨空间共享参数。CNNs使用称为内核的参数来计算来自输入数据子区域的输出。卷积核在输入端平移，从每一个可能的位置计算输出，以学习平移不变性。与此相反，rnn通过依赖一个隐藏的状态向量来捕获序列的状态，从而使用相同的参数来计算每一步的输出。通过这种方式，rnn的目标是通过计算给定的隐藏状态向量和输入向量的任何输出来学习序列不变性。你可以想象一个RNN跨时间共享参数，一个CNN跨空间共享参数。

由于单词和句子可以是可变长度的，因此rnn或任何序列模型都应该能够处理可变长度的序列。一种可能的技术是人为地将序列限制在一个固定的长度。在本书中，我们使用另一种技术，称为掩蔽，通过利用序列长度的知识来处理可变长度序列。简而言之，屏蔽允许数据在某些输入不应计入梯度或最终输出时发出信号。PyTorch提供了处理称为packedsequence的可变长度序列的原语，这些序列从这些不太密集的序列中创建密集的张量。“例子:使用字符RNN对姓氏国籍进行分类”就是一个例子。

在这两个图中，输出与隐藏向量相同。这并不总是正确的，但是在Elman RNN的例子中，隐藏的向量是被预测的。”)

Implementing an Elman RNN

为了探究RNN的细节，让我们逐步了解Elman RNN的一个简单实现。PyTorch提供了许多有用的类和帮助函数来构建rnn。PyTorch RNN类实现了Elman RNN。在本章中，我们没有直接使用PyTorch的RNN类，而是使用RNNCell，它是对RNN的单个时间步的抽象，并以此构建RNN。我们这样做的目的是显式地向您展示RNN计算。示例6-1中显示的类ElmanRNN利用了RNNCell。RNNCell创建了“递归神经网络导论”中描述的输入隐藏和隐藏权重矩阵。对RNNCell的每次调用都接受一个输入向量矩阵和一个隐藏向量矩阵。它返回一个步骤产生的隐藏向量矩阵。

除了控制RNN中的输入和隐藏大小超参数外，还有一个布尔参数用于指定批处理维度是否位于第0维度。这个标志也出现在所有PyTorch RNNs实现中。当设为真时，RNN交换输入张量的第0维和第1维。

在ElmanRNN中，forward()方法循环遍历输入张量，以计算每个时间步长的隐藏状态向量。注意，有一个用于指定初始隐藏状态的选项，但如果没有提供，则使用所有0的默认隐藏状态向量。当ElmanRNN循环遍历输入向量的长度时，它计算一个新的隐藏状态。这些隐藏状态被聚合并最终堆积起来。在返回之前，将再次检查batch_first标志。如果为真，则输出隐藏向量进行排列，以便批处理再次位于第0维上。

ElmanRNN的输出是一个三维张量——对于批处理维度上的每个数据点和每个时间步长，都有一个隐藏状态向量。根据手头的任务，可以以几种不同的方式使用这些隐藏向量。您可以使用它们的一种方法是将每个时间步骤分类为一些离散的选项集。该方法是通过调整RNN权值来跟踪每一步预测的相关信息。另外，您可以使用最后一个向量来对整个序列进行分类。这意味着RNN权重将被调整以跟踪对最终分类重要的信息。在本章中，我们只看到分类设置，但在接下来的两章中，我们将更深入地讨论逐步预测。

Example 6-1. An implementation of the Elman RNN using PyTorch’s RNNCell

class ElmanRNN(nn.Module):
    """ an Elman RNN built using the RNNCell """
    def __init__(self, input_size, hidden_size, batch_first=False):
        """
        Args:
            input_size (int): size of the input vectors
            hidden_size (int): size of the hidden state vectors
            bathc_first (bool): whether the 0th dimension is batch
        """
        super(ElmanRNN, self).__init__()

        self.rnn_cell = nn.RNNCell(input_size, hidden_size)

        self.batch_first = batch_first
        self.hidden_size = hidden_size

    def _initialize_hidden(self, batch_size):
        return torch.zeros((batch_size, self.hidden_size))

    def forward(self, x_in, initial_hidden=None):
        """The forward pass of the ElmanRNN

        Args:
            x_in (torch.Tensor): an input data tensor.
                If self.batch_first: x_in.shape = (batch_size, seq_size, feat_size)
                Else: x_in.shape = (seq_size, batch_size, feat_size)
            initial_hidden (torch.Tensor): the initial hidden state for the RNN
        Returns:
            hiddens (torch.Tensor): The outputs of the RNN at each time step.
                If self.batch_first:
                   hiddens.shape = (batch_size, seq_size, hidden_size)
                Else: hiddens.shape = (seq_size, batch_size, hidden_size)
        """
        if self.batch_first:
            batch_size, seq_size, feat_size = x_in.size()
            x_in = x_in.permute(1, 0, 2)
        else:
            seq_size, batch_size, feat_size = x_in.size()

        hiddens = []

        if initial_hidden is None:
            initial_hidden = self._initialize_hidden(batch_size)
            initial_hidden = initial_hidden.to(x_in.device)

        hidden_t = initial_hidden

        for t in range(seq_size):
            hidden_t = self.rnn_cell(x_in[t], hidden_t)
            hiddens.append(hidden_t)

        hiddens = torch.stack(hiddens)

        if self.batch_first:
            hiddens = hiddens.permute(1, 0, 2)

        return hiddens

Example: Classifying Surname Nationality using a Character RNN

现在我们已经概述了RNNs的基本属性，并逐步实现了ElmanRNN，现在让我们将它应用到任务中。我们将考虑的任务是第4章中的姓氏分类任务，在该任务中，字符序列(姓氏)被分类到起源的国籍。

The Surnames Dataset

本例中的数据集是姓氏数据集，前面在第4章中介绍过。每个数据点由姓氏和相应的国籍表示。我们将避免重复数据集的细节，但是您应该参考“姓氏数据集”来刷新关于数据集的一些关键点。

在本例中，就像“使用CNN对姓氏进行分类”一样，我们将每个姓氏视为字符序列。与往常一样，我们实现一个数据集类，如示例6-2所示，它返回向量化的姓氏和表示其国籍的整数。此外，返回的是序列的长度，它用于下游计算，以知道序列中的最终向量的位置。这是我们熟悉的步骤序列的一部分——实现数据集、向量化器和词汇表——在实际的训练开始之前。

Example 6-2. Implementing the Dataset for the Surname data

class SurnameDataset(Dataset):        
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's:
                features (x_data)
                label (y_target)
                feature length (x_length)
        """
        row = self._target_df.iloc[index]

        surname_vector, vec_length = \
            self._vectorizer.vectorize(row.surname, self._max_seq_length)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_data': surname_vector,
                'y_target': nationality_index,
                'x_length': vec_length}

The Vectorization Data Structures

向量化管道的第一阶段是将姓氏中的每个字符标记映射到唯一的整数。为了实现这一点，我们使用了SequenceVocabulary数据结构，这是我们在“示例:使用预先训练的嵌入式进行文档分类的传输学习”中首次介绍和描述的。回想一下，这个数据结构不仅将tweet中的单词映射到整数，而且还使用了四个特殊用途的令牌:UNK令牌、MASK令牌、BEGIN-SEQUENCE令牌和END-SEQUENCE令牌。前两个令牌对语言数据至关重要:UNK令牌用于输入中看不到的词汇表外令牌，而MASK令牌允许处理可变长度的输入。第二个标记为模型提供了句子边界特征，并分别作为前缀和追加到序列中。我们请您参阅“示例:使用预先训练的嵌入来进行文档分类的迁移学习”，以获得关于序列表的更长的描述。

整个向量化过程由SurnameVectorizer管理，它使用序列evocabulary来管理姓氏字符和整数之间的映射。示例6-3展示了它的实现，看起来应该非常熟悉。在“示例:使用预训练嵌入进行文档分类的迁移学习”中，我们研究了如何将新闻文章的标题分类到特定的类别中，而向量化管道几乎是相同的。

Example 6-3. A vectorizer for surnames

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""   
    def vectorize(self, surname, vector_length=-1):
        """
        Args:
            title (str): the string of characters
            vector_length (int): an argument for forcing the length of index vector
        """
        indices = [self.char_vocab.begin_seq_index]
        indices.extend(self.char_vocab.lookup_token(token)
                       for token in surname)
        indices.append(self.char_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices)

        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.char_vocab.mask_index

        return out_vector, len(indices)

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        char_vocab = SequenceVocabulary()
        nationality_vocab = Vocabulary()

        for index, row in surname_df.iterrows():
            for char in row.surname:
                char_vocab.add_token(char)
            nationality_vocab.add_token(row.nationality)

        return cls(char_vocab, nationality_vocab)

The SurnameClassifier Model

SurnameClassifier模型由嵌入层、ElmanRNN和线性层组成。我们假设模型的输入是在它们被SequenceVocabulary映射到整数之后作为一组整数表示的令牌。模型首先使用嵌入层嵌入整数。然后，利用RNN计算序列表示向量。这些向量表示姓氏中每个字符的隐藏状态。由于目标是对每个姓氏进行分类，因此提取每个姓氏中最终字符位置对应的向量。考虑这个向量的一种方法是，最后的向量是传递整个序列输入的结果，因此是姓氏的汇总向量。这些汇总向量通过线性层来计算预测向量。预测向量用于训练损失，或者我们可以应用softmax函数来创建姓氏的概率分布

模型的参数是:嵌入的大小，嵌入的数量(即类的数量，以及RNN的隐藏状态大小。其中两个参数——嵌入的数量和类的数量——由数据决定。其余的超参数是嵌入的大小和隐藏状态的大小。尽管这些模型可以具有任何价值，但通常最好从一些小的、可以快速训练以验证模型是否有效的东西开始。

Example 6-4. Implementing the Surname Classifier Model

class SurnameClassifier(nn.Module):
    """ An RNN to extract features & a MLP to classify """
    def __init__(self, embedding_size, num_embeddings, num_classes,
                 rnn_hidden_size, batch_first=True, padding_idx=0):
        """
        Args:
            embedding_size (int): The size of the character embeddings
            num_embeddings (int): The number of characters to embed
            num_classes (int): The size of the prediction vector
                Note: the number of nationalities
            rnn_hidden_size (int): The size of the RNN's hidden state
            batch_first (bool): Informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): The index for the tensor padding;
                see torch.nn.Embedding
        """
        super(SurnameClassifier, self).__init__()

        self.emb = nn.Embedding(num_embeddings=num_embeddings,
                                embedding_dim=embedding_size,
                                padding_idx=padding_idx)
        self.rnn = ElmanRNN(input_size=embedding_size,
                             hidden_size=rnn_hidden_size,
                             batch_first=batch_first)
        self.fc1 = nn.Linear(in_features=rnn_hidden_size,
                            out_features=rnn_hidden_size)
        self.fc2 = nn.Linear(in_features=rnn_hidden_size,
                            out_features=num_classes)

    def forward(self, x_in, x_lengths=None, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            x_lengths (torch.Tensor): the lengths of each sequence in the batch.
                They are used to find the final vector of each sequence
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
           out (torch.Tensor); `out.shape = (batch, num_classes)`
        """
        x_embedded = self.emb(x_in)
        y_out = self.rnn(x_embedded)

        if x_lengths is not None:
            y_out = column_gather(y_out, x_lengths)
        else:
            y_out = y_out[:, -1, :]

        y_out = F.dropout(y_out, 0.5)
        y_out = F.relu(self.fc1(y_out))
        y_out = F.dropout(y_out, 0.5)
        y_out = self.fc2(y_out)

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        return y_out

您将注意到，正向函数需要序列的长度。长度用于检索从RNN返回的带有名为column_gather函数的张量中每个序列的最终向量，如示例6-5所示。该函数迭代批处理行索引，并检索位于序列相应长度所指示位置的向量。
Example 6-5. Retrieving the final vector in each sequence using column_gather

def column_gather(y_out, x_lengths):
    '''Get a specific vector from each batch datapoint in `y_out`.

    Args:
        y_out (torch.FloatTensor, torch.cuda.FloatTensor)
            shape: (batch, sequence, feature)
        x_lengths (torch.LongTensor, torch.cuda.LongTensor)
            shape: (batch,)

    Returns:
        y_out (torch.FloatTensor, torch.cuda.FloatTensor)
            shape: (batch, feature)
    '''
    x_lengths = x_lengths.long().detach().cpu().numpy() - 1

    out = []
    for batch_index, column_index in enumerate(x_lengths):
        out.append(y_out[batch_index, column_index])

    return torch.stack(out)

The Training Routine and Results

训练程序遵循标准公式。对于单个批数据，应用模型并计算预测向量。利用横熵损失和地面真值来计算损失值。使用损失值和优化器，计算梯度并使用这些梯度更新模型的权重。对训练数据中的每批重复此操作。对验证数据进行类似的处理，但是将模型设置为eval()模式，以防止在验证数据上反向传播。相反，验证数据仅用于对模型的执行情况给出不那么偏颇的感觉。这个例程在特定的时期重复执行。代码见补充资料我们鼓励您使用超参数来了解影响性能的因素以及影响程度，并将结果制成表格。我们还将为该任务编写合适的基线模型作为练习，让您完成。在“SurnameClassifier模型”中实现的模型是通用的，并不局限于字符。模型中的嵌入层可以映射出离散项序列中的任意离散项;例如，一个句子是一系列的单词。我们鼓励您在其他序列分类任务(如句子分类)中使用示例6-6中的代码。
Example 6-6. Arguments to the RNN-based Surname Classifier

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch6/surname_classification",
    # Model hyper parameter
    char_embedding_size=100,
    rnn_hidden_size=64,
    # Training hyper parameter
    num_epochs=100,
    learning_rate=1e-3,
    batch_size=64,
    seed=1337,
    early_stopping_criteria=5,
    # ... Runtime options not shown for space
)

Summary

在本章中，您学习了用于对序列数据建模的递归神经网络，以及最简单的一种递归网络，即Elman RNNs。我们确定序列建模的目标是学习序列的表示(即向量)。根据任务的不同，可以以不同的方式使用这种学习过的表示。我们考虑了一个示例任务，涉及到将这种隐藏状态表示分类到许多类中的一个。姓氏分类任务展示了一个使用RNNs在子词级别捕获信息的示例。