由Yif翻译，仅供学习严禁任何商业用途

Chapter 7. Intermediate Sequence Modeling for Natural Language Processing

本章的目标是序列预测。序列预测任务要求我们对序列中的每一项进行标记。这类任务在自然语言处理(NLP)中很常见。一些例子包括语言建模(参见图7-1)，在语言建模中，我们在每个步骤中给定单词序列预测下一个单词;词性标注的一部分，对每个词的语法词性进行预测;命名实体识别，我们预测每个词是否属于一个命名实体，如人、位置、产品、组织;等等。有时，在NLP文献中，序列预测任务也被称为序列标记。

虽然理论上我们可以使用第六章中介绍的Elman递归神经网络(RNNs)进行序列预测任务，但是在实际应用中，它们并不能很好地捕捉到长期的依赖关系，并且表现不佳。在本章中，我们将花一些时间来理解为什么会这样，并学习一种新的RNN架构，称为门控网络。

我们还介绍了自然语言生成作为序列预测应用的任务，并探讨了输出序列在某种程度上受到约束的条件生成。

The Problem with Vanilla RNNs (or Elman RNNs)

尽管在第6章中讨论的普通RNN/Elman-RNN非常适合于建模序列，但它有两个问题使其不适用于许多任务:无法保留用于长期预测的信息，以及梯度稳定性。为了理解这两个问题，回想一下，在它们的核心，rnn在每个时间步上使用前一个时间步的隐藏状态向量和当前时间步上的输入向量计算一个隐藏状态向量。正是这种核心计算使得RNN如此强大，但也产生了大量的数值问题。

Elman RNNs的第一个问题是很难记住长期的信息。例如，在第6章的RNN中，在每次步骤中，我们仅仅更新隐藏的状态向量，而不管它是否有意义。因此，RNN无法控制隐藏状态中保留的值和丢弃的值，而这些值完全由输入决定。直觉上，这是说不通的。我们希望RNN通过某种方式来决定更新是可选的，还是发生了更新，以及状态向量的多少和哪些部分，等等。

Elman RNNs的第二个问题是，它们会导致梯度螺旋地失去控制，趋近于零或无穷大。不稳定的梯度，可以螺旋失控被称为消失梯度或爆炸梯度取决于方向梯度的绝对值正在收缩/增长。梯度绝对值非常大或非常小(小于1)都会使优化过程不稳定(Hochreiter et al.， 2001;Pascanu et al.， 2013)。

在一般的神经网络中，有解决这些梯度问题的方法，如使用校正的线性单元(relu)、梯度裁剪和小心初始化。但是没有一种解决方案能像门控技术那样可靠地工作。

Gating as a Solution to a Vanilla RNN’s Challenges

为了直观地理解门控，假设您添加了两个量，a和b，但是您想控制b放入和的多少。数学上，你可以把a + b的和改写成
a+λb
λ是一个值在0和1之间。如果λ= 0,没有贡献从b如果λ= 1,b完全贡献。这种方式看,你可以解释λ充当一个“开关”或“门”控制的b进入之和。这就是门控机制背后的直觉。现在，让我们重新访问Elman RNN，看看如何将门控与普通的RNN合并以进行条件更新。如果前面的隐藏状态是$h_{t−1}$和当前输入$x_t$, Elman RNN的周期性更新看起来像

$h_t=h_{t-1}+F(h_{t-1,x_t})$

其中F是RNN的递归计算。显然，这是一个无条件的和，并且有“Vanilla RNNs(或Elman RNNs)的问题”中描述的缺点。现在想象一下,替代常数,如果前面的例子的λ是一个函数之前的隐藏状态向量ht−1和当前输入xt,而且还产生所需的控制行为;也就是说，0到1之间的值。通过这个门控函数，我们的RNN更新方程如下:

$h_t=h_{t-1}+\lambda(h_{t-1},x_t)F(h_{t-1},x_t)$

现在就清楚函数λ控制当前输入的多少可以更新状态ht−1。进一步的函数λ是上下文相关的。这是所有门控网络的基本直觉。λ的函数通常是一个s形的函数,我们知道从第三章到产生一个值在0和1之间。

在长短期记忆的情况下,这个基本的直觉是扩展仔细将不仅条件更新,而且还故意忘记之前的隐藏状态$h_{t−1}$的值。这种“忘记”乘以发生前隐藏状态与另一个函数μ值ht−1,还产生值在0和1之间,取决于当前的输入:

$h_t=\mu (h_{t-1},x_t)h_{t-1}+\lambda(h_{t-1},x_t)F(h_{t-1},x_t)$

您可能已经猜到,μ是另一个控制功能。在实际的LSTM描述中，这变得很复杂，因为门函数是参数化的，导致对未初始化的操作的复杂序列。但是，在掌握了本节的直观知识之后，如果您想深入了解LSTM的更新机制，现在就可以了。我们推荐Christopher Olah的经典文章。在本书中，我们将不涉及这些内容，因为这些细节对于LSTMs在NLP应用程序中的应用和使用并不是必需的。

LSTM只是RNN的许多门控变体之一。另一种越来越流行的门控变量是门控循环单元(GRU;Chung et al.， 2015)。幸运的是，在PyTorch中，您可以简单地替换nn。RNN或神经网络。RNNCell nn。LSTM和神经网络。LSTMCell没有其他代码更改来切换到LSTM(为GRU做必要的修改)!

门控机制是“普通RNNs(或Elman RNNs)问题”中列举的问题的有效解决方案。它不仅可以控制更新，还可以控制梯度问题，使训练相对容易。不再赘述，我们将使用两个示例来展示这些封闭体系结构的实际应用。

Example: A Character-RNN for Generating Surnames

在本例中，我们将完成一个简单的序列预测任务:使用RNNs生成姓氏。在实践中，这意味着对于每个时间步骤，RNN都在计算姓氏中可能的字符集的概率分布。使用这些概率分布，我们可以优化网络来改进它的预测(假设我们知道应该预测哪些字符)，也可以使用它们来生成全新的姓氏!

虽然这个任务的数据集已经在前面的例子中使用过，看起来很熟悉，但是在构建用于序列预测的每个数据样本的方式上有一些不同。在描述了数据集和任务之后，概述了支持通过系统簿记进行序列预测的数据结构。

在描述了数据集、任务和支持数据结构之后，我们引入了两个生成姓氏的模型:无条件姓氏生成模型和条件姓氏生成模型。该模型在不了解姓氏的情况下，对姓氏序列进行预测。与此相反，条件模型利用特定的国籍嵌入作为RNN的初始隐藏状态，从而使模型对序列的预测产生偏差。

The SurnamesDataset

姓氏数据集是姓氏及其来源国的集合，最早出现在“带有多层感知器的姓氏分类”中。到目前为止，该数据集已经被用于一个分类任务——给出一个新的姓氏，正确地将姓氏来自哪个国家。然而，在本例中，我们将展示如何使用数据集来训练一个模型，该模型可以为字符序列分配概率并生成新的序列。

SurnamesDataset类与前几章基本相同:我们使用panda DataFrame加载数据集，并构造了一个向量化器，它将令牌封装为模型和手边任务所需的整数映射。为了适应任务的不同，修改了SurnamesDataset.__getitem()方法，以输出预测目标的整数序列，如示例7-1所示。该方法引用向量器来计算作为输入的整数序列(from_vector)和作为(to_vector)的整数序列。下一小节将描述向量化的实现。
Example 7-1. The SurnamesDataset.getitem__ for a sequence prediction task

class SurnameDataset(Dataset):
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """

        surname_df = pd.read_csv(surname_csv)
        return cls(surname_df, SurnameVectorizer.from_dataframe(surname_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point: (x_data, y_target, class_index)
        """
        row = self._target_df.iloc[index]

        from_vector, to_vector = \
            self._vectorizer.vectorize(row.surname, self._max_seq_length)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_data': from_vector,
                'y_target': to_vector,
                'class_index': nationality_index}

The Vectorization Data Structures

与前面的示例一样，有三种主要的数据结构将每个姓氏的字符序列转换为其向量化形式:SequenceVocabulary将单个令牌映射到整数，SurnameVectorizer协调整数映射，DataLoader将SurnameVectorizer的结果分组为小批。由于DataLoader实现及其使用在本例中保持不变，我们将跳过其实现细节。

SURNAMEVECTORIZER AND END-OF-SEQUENCE

对于序列预测任务，编写训练例程以期望在每个时间步骤中出现两个表示令牌观察和令牌目标的整数序列。通常，我们只想预测我们正在训练的序列，例如本例中的姓氏。这意味着我们只有一个令牌序列可以使用，并通过调整这个令牌序列来构造观察和目标。

为了将其转化为序列预测问题，使用SequenceVocabulary将每个令牌映射到其适当的索引。然后，序列的起始令牌索引begin_seq_index位于序列的开头，而序列的结束令牌索引end_seq_index位于序列的末尾。此时，每个数据点都是一系列索引，具有相同的第一个和最后一个索引。要创建训练例程所需的输入和输出序列，我们只需使用索引序列的两个切片:第一个切片包含除最后一个之外的所有令牌索引，第二个切片包含除第一个之外的所有令牌索引。当排列和配对在一起时，序列就是正确的输入-输出索引。

为了更明确，我们展示了SurnameVectorizer的代码。在示例7-2中向量化。第一步是将姓氏(字符串)映射到索引(表示这些字符的整数列表)。然后，用序列索引的开始和结束来包装索引:具体来说，begin_seq_index在索引之前，end_seq_index在索引之后。接下来，我们测试vector_length，它通常在运行时提供，但是代码的编写允许向量的任何长度。在训练期间，提供vector_length是很重要的，因为小批是由堆叠的向量表示构造的。如果向量的长度不同，它们不能堆放在一个矩阵中。在测试vector_length之后，创建两个向量:from_vector和to_vector。不包含最后一个索引的索引片放在from_vector中，不包含第一个索引的索引片放在to_vector中。每个向量的剩余位置都填充了mask_index。将序列填充(或填充)到右边是很重要的，因为空位置将改变输出向量，我们希望这些变化发生在序列被看到之后。

Example 7-2. The code for SurnameVectorizer.vectorize in a sequence prediction task

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""    
    def vectorize(self, surname, vector_length=-1):
        """Vectorize a surname into a vector of observations and targets

        Args:
            surname (str): the surname to be vectorized
            vector_length (int): an argument for forcing the length of index vector
        Returns:
            a tuple: (from_vector, to_vector)
                from_vector (numpy.ndarray): the observation vector
                to_vector (numpy.ndarray): the target prediction vector
        """
        indices = [self.char_vocab.begin_seq_index]
        indices.extend(self.char_vocab.lookup_token(token) for token in surname)
        indices.append(self.char_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices) - 1

        from_vector = np.zeros(vector_length, dtype=np.int64)         
        from_indices = indices[:-1]
        from_vector[:len(from_indices)] = from_indices
        from_vector[len(from_indices):] = self.char_vocab.mask_index

        to_vector = np.empty(vector_length, dtype=np.int64)
        to_indices = indices[1:]
        to_vector[:len(to_indices)] = to_indices
        to_vector[len(to_indices):] = self.char_vocab.mask_index

        return from_vector, to_vector

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surname dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        char_vocab = SequenceVocabulary()
        nationality_vocab = Vocabulary()

        for index, row in surname_df.iterrows():
            for char in row.surname:
                char_vocab.add_token(char)
            nationality_vocab.add_token(row.nationality)

        return cls(char_vocab, nationality_vocab)

From the ElmanRNN to the GRU

在实践中，从普通的RNN转换到门控变体是非常容易的。在以下模型中，虽然我们使用GRU代替普通的RNN，但是使用LSTM也同样容易。为了使用GRU，我们实例化了torch.nn。GRU模块使用与第六章ElmanRNN相同的参数。

Model 1: Unconditioned Surname Generation Model

第一个模型是无条件的:它在生成姓氏之前不观察国籍。在实践中，非条件意味着GRU的计算不偏向任何国籍。在下一个例子(例子7-3)中，通过初始隐藏向量引入计算偏差。在这个例子中，我们使用一个全为0的向量，这样初始的隐藏状态向量就不会影响计算。
通常，SurnameGenerationModel嵌入字符索引，使用GRU计算其顺序状态，并使用线性层计算令牌预测的概率。更明确地说，非条件SurnameGenerationModel从初始化嵌入层、GRU和线性层开始。
与第6章的序列模型相似，该模型输入了一个整数矩阵。我们使用一个PyTorch嵌入实例char_embed将整数转换为一个三维张量(每个批处理项的向量序列)。这个张量传递给GRU, GRU计算每个序列中每个位置的状态向量。

第六章的序列分类与本章序列预测的主要区别在于如何处理由RNN计算出的状态向量。在第6章中，我们为每个批处理索引检索一个向量，并使用这些向量执行预测。在这个例子中，我们将我们的三维张量重塑为一个二维张量(一个矩阵)，以便行维表示每个样本(批处理和序列索引)。利用这个矩阵和线性层，我们计算每个样本的预测向量。我们通过将矩阵重新构造成一个三维张量来完成计算。由于排序信息是通过整形操作保存的，所以每个批和序列索引仍然处于相同的位置。我们需要整形的原因是因为线性层需要一个矩阵作为输入。

Example 7-3. The unconditioned surname generation model

class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                 batch_first=True, padding_idx=0, dropout_p=0.5):
        """
        Args:
            char_embedding_size (int): The size of the character embeddings
            char_vocab_size (int): The number of characters to embed
            rnn_hidden_size (int): The size of the RNN's hidden state
            batch_first (bool): Informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): The index for the tensor padding;
                see torch.nn.Embedding
            dropout_p (float): the probability of zeroing activations using
                the dropout method.
        """
        super(SurnameGenerationModel, self).__init__()

        self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                     embedding_dim=char_embedding_size,
                                     padding_idx=padding_idx)
        self.rnn = nn.GRU(input_size=char_embedding_size,
                          hidden_size=rnn_hidden_size,
                          batch_first=batch_first)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=char_vocab_size)
        self._dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the model

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be False during training
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        x_embedded = self.char_emb(x_in)

        y_out, _ = self.rnn(x_embedded)

        batch_size, seq_size, feat_size = y_out.shape
        y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)

        y_out = self.fc(F.dropout(y_out, p=self._dropout_p))

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        new_feat_size = y_out.shape[-1]
        y_out = y_out.view(batch_size, seq_size, new_feat_size)

        return y_out

Model 2: Conditioned Surname Generation Model

第二个模型考虑了要生成的姓氏的国籍。在实践中，这意味着有某种机制允许模型对特定姓氏的行为进行偏差。在本例中，我们通过将每个国籍嵌入为隐藏状态大小的向量来参数化RNNs的初始隐藏状态。这意味着模型在调整模型参数的同时，也调整了嵌入矩阵中的值，从而使预测偏于对特定的国籍和姓氏的规律性更加敏感。例如，爱尔兰国籍向量偏向于起始序列“Mc”和“O”。

例7-3显示了条件模型之间的差异。具体地说，引入额外的嵌入来将国籍索引映射到与RNN的隐藏层相同大小的向量。然后，在正向函数中嵌入民族指标，作为RNN的初始隐含层简单传入。虽然这是对第一个模型的一个非常简单的修改，但是它对于让RNN根据生成的国籍改变其行为有着深远的影响。
Example 7-4. The conditioned surname generation model

class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, num_nationalities,
                 rnn_hidden_size, batch_first=True, padding_idx=0, dropout_p=0.5):
        # ...
        self.nation_embedding = nn.Embedding(embedding_dim=rnn_hidden_size,
                                             num_embeddings=num_nationalities)

    def forward(self, x_in, nationality_index, apply_softmax=False):
        # ...
        x_embedded = self.char_embedding(x_in)
        # hidden_size: (num_layers * num_directions, batch_size, rnn_hidden_size)
        nationality_embedded = self.nation_emb(nationality_index).unsqueeze(0)
        y_out, _ = self.rnn(x_embedded, nationality_embedded)
        # ...

Training Routine and Results

在本例中，我们介绍了用于生成姓氏的字符序列预测任务。虽然许多实现细节和训练例程与第6章的序列分类示例相似，但有几个主要区别。在这一节中，我们将重点讨论差异、使用的超参数和结果。
与前面的例子相比，计算这个例子中的损失需要两个更改，因为我们在序列中的每一步都要进行预测。首先，我们将三维张量重塑为二维张量(矩阵)以满足计算约束。其次，我们协调掩蔽索引，它允许可变长度序列与损失函数，使损失不使用掩蔽位置在其计算。

我们通过使用示例7-5中所示的代码片段来处理这两个问题——三维张量和可变长度序列。首先，预测和目标被标准化为损失函数期望的大小(预测是二维的，目标是一维的)。现在，每一行代表一个示例:按顺序执行一个时间步骤。然后，交叉熵损失用于ignore_index设置为mask_index。这将导致loss函数忽略与ignore_index匹配的目标中的任何位置。

Example 7-5. Handling three-dimensional tensors and sequence-wide loss computations

def normalize_sizes(y_pred, y_true):
    """Normalize tensor sizes

    Args:
        y_pred (torch.Tensor): the output of the model
            If a 3-dimensional tensor, reshapes to a matrix
        y_true (torch.Tensor): the target predictions
            If a matrix, reshapes to be a vector
    """
    if len(y_pred.size()) == 3:
        y_pred = y_pred.contiguous().view(-1, y_pred.size(2))
    if len(y_true.size()) == 2:
        y_true = y_true.contiguous().view(-1)
    return y_pred, y_true


def sequence_loss(y_pred, y_true, mask_index):
    y_pred, y_true = normalize_sizes(y_pred, y_true)
    return F.cross_entropy(y_pred, y_true, ignore_index=mask_index)

使用这种修正的损失计算，我们构造了一个训练例程，看起来与本书中的每个例子相似。它首先迭代训练数据集，每次只处理一小批数据。对于每个小批量，模型的输出是由输入计算出来的。因为我们在每一个时间步上执行预测，所以模型的输出是一个三维张量。使用前面描述的sequence_loss和优化器，可以计算模型预测的错误信号，并用于更新模型参数。

大多数模型超参数是由字符词汇表的大小决定的。这个大小是可以观察到的作为模型输入的离散令牌的数量，以及每次步骤输出分类中的类的数量。剩下的模型超参数是字符嵌入的大小和内部RNN隐藏状态的大小。示例7-6给出了这些超参数和培训选项。

Example 7-6. Hyperparameters for surname generation

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch7/model1_unconditioned_surname_generation",
    # or: save_dir="model_storage/ch7/model2_conditioned_surname_generation",
    # Model hyper parameters
    char_embedding_size=32,
    rnn_hidden_size=32,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    # Runtime options omitted for space
)

尽管预测的每个字符的准确性是模型性能的度量，但是在本例中，通过检查模型将生成的姓氏类型来进行定性评估会更好。为此，我们在forward()方法中步骤的修改版本上编写一个新的循环，以计算每个时间步骤的预测，并将这些预测用作下一个时间步骤的输入。我们将展示示例7-7中的代码。模型在每个时间步上的输出是一个预测向量，利用softmax函数将预测向量转换为概率分布。利用概率分布，我们利用火炬。多项式抽样函数，它以与索引的概率成比例的速率选择索引。抽样是一个每次产生不同输出的随机过程。
Example 7-7. Sampling from the unconditioned generation model

def sample_from_model(model, vectorizer, num_samples=1, sample_size=20,
                      temperature=1.0):
    """Sample a sequence of indices from the model

    Args:
        model (SurnameGenerationModel): the trained model
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        num_samples (int): the number of samples
        sample_size (int): the max length of the samples
        temperature (float): accentuates or flattens
            the distribution.
            0.0 < temperature < 1.0 will make it peakier.
            temperature > 1.0 will make it more uniform
    Returns:
        indices (torch.Tensor): the matrix of indices;
        shape = (num_samples, sample_size)
    """
    begin_seq_index = [vectorizer.char_vocab.begin_seq_index
                       for _ in range(num_samples)]
    begin_seq_index = torch.tensor(begin_seq_index,
                                   dtype=torch.int64).unsqueeze(dim=1)
    indices = [begin_seq_index]
    h_t = None

    for time_step in range(sample_size):
        x_t = indices[time_step]
        x_emb_t = model.char_emb(x_t)
        rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
        prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
        probability_vector = F.softmax(prediction_vector / temperature, dim=1)
        indices.append(torch.multinomial(probability_vector, num_samples=1))
    indices = torch.stack(indices).squeeze().permute(1, 0)
    return indices

您需要将采样的索引从sample_from_model()函数转换为用于人类可读输出的字符串。如示例7-8所示，要做到这一点，需要使用用于向量化姓氏的SequenceVocabulary。在创建字符串时，只使用序列结束索引之前的索引。这是假设模型能够了解姓氏应该在何时结束。

Example 7-8. Mapping sampled indices to surname strings

def decode_samples(sampled_indices, vectorizer):
    """Transform indices into the string form of a surname

    Args:
        sampled_indices (torch.Tensor): the inidces from `sample_from_model`
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    """
    decoded_surnames = []
    vocab = vectorizer.char_vocab

    for sample_index in range(sampled_indices.shape[0]):
        surname = ""
        for time_step in range(sampled_indices.shape[1]):
            sample_item = sampled_indices[sample_index, time_step].item()
            if sample_item == vocab.begin_seq_index:
                continue
            elif sample_item == vocab.end_seq_index:
                break
            else:
                surname += vocab.lookup_index(sample_item)
        decoded_surnames.append(surname)
    return decoded_surnames

使用这些函数，您可以检查模型的输出，如示例7-9所示，以了解模型是否正在学习生成合理的姓氏。从检查输出中我们可以学到什么?我们可以看到，尽管这些姓氏似乎遵循着几种形态模式，但这些姓氏显然并不是来自一个国家或另一个国家。一种可能是，学习姓氏的一般模型会混淆不同民族之间的性格分布。有条件的姓氏生成模型就是用来处理这种情况的。
Example 7-9. Sampling from the unconditioned model

Input[0]
samples = sample_from_model(unconditioned_model, vectorizer,
                            num_samples=10)
decode_samples(samples, vectorizer)
Output[0]
['Aqtaliby',
 'Yomaghev',
 'Mauasheev',
 'Unander',
 'Virrovo',
 'NInev',
 'Bukhumohe',
 'Burken',
 'Rati',
 'Jzirmar']

对于有条件的SurnameGenerationModel，我们修改sample_from_model()函数来接受国籍索引列表，而不是指定数量的样本。在例7-10中，修改后的函数使用带有国籍嵌入的国籍索引来构造GRU的初始隐藏状态。在此之后，采样过程与非条件模型完全相同。
Example 7-10. Sampling from a sequence model

def sample_from_model(model, vectorizer, nationalities, sample_size=20,
                      temperature=1.0):
    """Sample a sequence of indices from the model

    Args:
        model (SurnameGenerationModel): the trained model
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        nationalities (list): a list of integers representing nationalities
        sample_size (int): the max length of the samples
        temperature (float): accentuates or flattens
            the distribution.
            0.0 < temperature < 1.0 will make it peakier.
            temperature > 1.0 will make it more uniform
    Returns:
        indices (torch.Tensor): the matrix of indices;
        shape = (num_samples, sample_size)
    """
    num_samples = len(nationalities)
    begin_seq_index = [vectorizer.char_vocab.begin_seq_index
                       for _ in range(num_samples)]
    begin_seq_index = torch.tensor(begin_seq_index,
                                   dtype=torch.int64).unsqueeze(dim=1)
    indices = [begin_seq_index]
    nationality_indices = torch.tensor(nationalities,
                                       dtype=torch.int64).unsqueeze(dim=0)
    h_t = model.nation_emb(nationality_indices)

    for time_step in range(sample_size):
        x_t = indices[time_step]
        x_emb_t = model.char_emb(x_t)
        rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
        prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
        probability_vector = F.softmax(prediction_vector / temperature, dim=1)
        indices.append(torch.multinomial(probability_vector, num_samples=1))
    indices = torch.stack(indices).squeeze().permute(1, 0)
    return indices

用条件向量采样的有效性意味着我们对生成输出有影响。在示例7-11中，我们迭代国籍索引并从每个索引中取样。为了节省空间，我们只显示一些输出。从这些输出中，我们可以看到，该模型确实采用了姓氏拼写的一些模式。
Example 7-11. Sampling from a conditioned SurnameGenerationModel (not all outputs are shown)

Input[0]
for index in range(len(vectorizer.nationality_vocab)):
    nationality = vectorizer.nationality_vocab.lookup_index(index)

    print("Sampled for {}: ".format(nationality))

    sampled_indices = sample_from_model(model=conditioned_model,
                                        vectorizer=vectorizer,  
                                        nationalities=[index] * 3,
                                        temperature=0.7)

    for sampled_surname in decode_samples(sampled_indices,
                                          vectorizer):
        print("-  " + sampled_surname)
Output[0]
Sampled for Arabic:
-  Khatso
-  Salbwa
-  Gadi
Sampled for Chinese:
-  Lie
-  Puh
-  Pian
Sampled for German:
-  Lenger
-  Schanger
-  Schumper
Sampled for Irish:
-  Mcochin
-  Corran
-  O'Baintin
Sampled for Russian:
-  Mahghatsunkov
-  Juhin
-  Karkovin
Sampled for Vietnamese:
-  Lo
-  Tham
-  Tou

Tips and Tricks for Training Sequence Models

序列模型很难训练，而且在这个过程中会出现许多问题。在这里，我们总结了一些技巧和技巧，我们发现不仅在我们的工作中有用，而且也被其他人在文献报道。
1.如果可能，使用门控变量
门控体系结构通过解决非通配型的许多数值稳定性问题简化了训练。
2.如果可能，请选择GRUs而不是LSTMs
GRUs提供了与LSTMs几乎相同的性能，并且使用更少的参数和计算。幸运的是，从PyTorch的角度来看，除了简单地使用不同的模块类之外，在LSTM上使用GRU没有什么可做的。
3.使用Adam作为您的优化器
在第6章、第7章和第8章中，我们只使用Adam作为优化器，这是有充分理由的:它是可靠的，收敛速度更快。对于序列模型尤其如此。如果由于某些原因，您的模型没有与Adam收敛，那么在这种情况下，切换到随机梯度下降可能会有所帮助。
4.梯度剪裁
如果您注意到在应用这些章节中学习到的概念时出现了数字错误，请在训练过程中使用您的代码绘制梯度值。知道愤怒之后，剪掉任何异常值。这将确保更顺利的培训。在PyTorch中，有一个有用的实用程序clip_grad_norm可以为您完成此工作，如示例7-12所示。一般来说，你应该养成剪切渐变的习惯。
Example 7-12. Applying gradient clipping in PyTorch

# define your sequence model
model = ..
# define loss function
loss_function = ..

# training loop
for _ in ...:
   ...
   model.zero_grad()
   output, hidden = model(data, hidden)
   loss = loss_function(output, targets)
   loss.backward()
   torch.nn.utils.clip_grad_norm(model.parameters(), 0.25)
   ...

5.早期停止
对于序列模型，很容易过度拟合。我们建议您在评估错误(在开发集上测量的)开始出现时尽早停止培训过程。

在第8章中，我们继续讨论序列模型，使用序列到序列模型来预测和生成与输入长度不同的序列，并讨论序列模型的更多变体。