# Chapter 8. Advanced Sequence Modeling for Natural Language Processing

## Capturing More from a Sequence: Bidirectional Recurrent Models

The man who hunts ducks out on the weekends.

## Capturing More from a Sequence: Attention

One problem with the S2S model formulation introduced in “Sequence-to-Sequence Models, Encoder–Decoder Models, and Conditioned Generation” is that it crams the entire input sentence into a single vector (the “encoding”), φ, and uses that encoding to generate the output, as shown in Figure 8-7. Although this might work for very short sentences, for long sentences such models fail to capture the information in the entire input; see, for example, Bengio et al. (1994) and Le and Zuidema (2016). This is a limitation of using only the final hidden state as the encoding. Another problem with long inputs is that the gradients vanish when backpropagating through time over the full input, making training difficult.
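To make this limitation concrete, here is a minimal sketch (not the chapter's code) contrasting the single-vector encoding with an attention-style encoding that forms a weighted sum over all of the encoder's states. The shapes and the dot-product scoring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, hidden_size = 4, 12, 64

# All hidden states produced by an encoder RNN for one minibatch
encoder_states = torch.randn(batch_size, seq_len, hidden_size)

# Single-vector encoding phi: only the final hidden state.
# Everything the decoder learns about the source must fit in this one vector.
phi_final = encoder_states[:, -1, :]                               # (batch, hidden)

# Attention-style encoding: score every encoder state against a query
# (e.g., the decoder's current hidden state) and take the weighted sum.
query = torch.randn(batch_size, hidden_size)
scores = torch.bmm(encoder_states, query.unsqueeze(2)).squeeze(2)  # (batch, seq_len)
weights = F.softmax(scores, dim=1)                                 # one distribution per sequence
phi_attended = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # (batch, hidden)
```

Because the attended encoding can be recomputed with a different query at every decoding step, no single vector has to summarize the entire input sentence.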

### Attention in Deep Neural Networks

Vaswani et al.’s (2017) work on Transformer networks introduced multi-head attention, in which multiple attention vectors are used to track different regions of the input. They also popularized the notion of self-attention, a mechanism by which the model learns which regions of the input influence one another.
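As a rough illustration, the following sketch (simplified, single-head, and not the Transformer implementation) shows scaled dot-product self-attention: each position scores every other position and gathers a weighted combination of their values. Multi-head attention runs several such heads with different learned projections in parallel and concatenates their outputs.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_query, w_key, w_value):
    """Single-head scaled dot-product self-attention.

    x: (batch, seq_len, d_model); each weight matrix is (d_model, d_k).
    """
    queries = x @ w_query                     # (batch, seq_len, d_k)
    keys = x @ w_key                          # (batch, seq_len, d_k)
    values = x @ w_value                      # (batch, seq_len, d_k)
    scores = queries @ keys.transpose(1, 2) / math.sqrt(keys.size(-1))
    weights = F.softmax(scores, dim=-1)       # how much each position attends to every other
    return weights @ values                   # (batch, seq_len, d_k)

x = torch.randn(2, 5, 16)
w_q, w_k, w_v = [torch.randn(16, 8) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)        # (2, 5, 8)
```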

Note

When building new models and trying out new architectures, you should aim for a faster iteration cycle between making modeling choices and evaluating those choices.

### A Vectorization Pipeline for NMT

Example 8-1. Constructing the NMTVectorizer

Example 8-3. Generating minibatches for the NMT example
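A hedged sketch of such a batch generator, assuming the dataset yields dictionaries that include an `x_source_length` entry: the essential difference from earlier chapters is that each minibatch is re-sorted by descending source length so that the encoder can use packed sequences.

```python
from torch.utils.data import DataLoader

def generate_nmt_batches(dataset, batch_size, shuffle=True,
                         drop_last=True, device="cpu"):
    """Yield minibatch dicts whose rows are sorted by descending source length."""
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        # Sort every tensor in the batch by source length, longest first,
        # as required by pack_padded_sequence in the encoder.
        lengths = data_dict["x_source_length"]
        sorted_indices = lengths.argsort(descending=True).tolist()
        yield {name: tensor[sorted_indices].to(device)
               for name, tensor in data_dict.items()}
```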

### Encoding and Decoding in the NMT Model

Example 8-4. The NMTModel encapsulates and coordinates the encoder and decoder in a single forward method.
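In outline (a sketch with illustrative argument names, not the full listing), the forward method simply chains the two components: it encodes the source once, then decodes the target while attending over the encoder states.

```python
import torch.nn as nn

class NMTModel(nn.Module):
    """Coordinates an encoder and a decoder in a single forward pass (sketch)."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x_source, x_source_lengths, target_sequence):
        # Encode the source sentence into per-word states plus a final hidden state...
        encoder_states, final_hidden = self.encoder(x_source, x_source_lengths)
        # ...then decode the target, attending over the encoder states at each step.
        return self.decoder(encoder_states, final_hidden, target_sequence)
```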

THE ENCODER
Example 8-5. The encoder embeds the source words and extracts features with a bi-GRU
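A condensed sketch of such an encoder (illustrative sizes and names, assuming padding index 0 and a 1-D length tensor on the CPU): the source indices are embedded, packed by their true lengths, run through a bidirectional GRU, and the two directional final states are concatenated into one vector per sentence.

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class NMTEncoder(nn.Module):
    """Embed source words and extract features with a bidirectional GRU (sketch)."""

    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size):
        super().__init__()
        self.source_embedding = nn.Embedding(num_embeddings, embedding_size,
                                             padding_idx=0)
        self.birnn = nn.GRU(embedding_size, rnn_hidden_size,
                            bidirectional=True, batch_first=True)

    def forward(self, x_source, x_lengths):
        x_embedded = self.source_embedding(x_source)
        # Pack so the GRU skips padded positions; the batch must be sorted
        # by descending length (see the batch generator above).
        packed = pack_padded_sequence(x_embedded, x_lengths.cpu(),
                                      batch_first=True)
        birnn_out, birnn_h = self.birnn(packed)
        # birnn_h: (num_directions=2, batch, hidden) -> (batch, 2 * hidden)
        batch_size = birnn_h.size(1)
        birnn_h = birnn_h.permute(1, 0, 2).reshape(batch_size, -1)
        unpacked, _ = pad_packed_sequence(birnn_out, batch_first=True)
        return unpacked, birnn_h
```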

Example 8-7. The NMT Decoder constructs a target sentence from the encoded source sentence

A CLOSER LOOK AT ATTENTION

Example 8-8. Attention that does element-wise multiplication and summing more explicitly
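A sketch of this “verbose” formulation, assuming encoder states of shape (batch, num_vectors, vector_size) and a query of shape (batch, vector_size): the dot products, softmax, and weighted sum are spelled out with element-wise operations and broadcasting rather than batched matrix multiplies.

```python
import torch
import torch.nn.functional as F

def verbose_attention(encoder_state_vectors, query_vector):
    """Dot-product attention with every element-wise step written out."""
    batch_size, num_vectors, vector_size = encoder_state_vectors.size()
    # Score each encoder state: multiply element-wise by the query, then
    # sum over the feature dimension (i.e., a dot product per position).
    vector_scores = torch.sum(
        encoder_state_vectors * query_vector.view(batch_size, 1, vector_size),
        dim=2)
    vector_probabilities = F.softmax(vector_scores, dim=1)
    # Weight each encoder state by its probability and sum into one context vector.
    weighted_vectors = (encoder_state_vectors *
                        vector_probabilities.view(batch_size, num_vectors, 1))
    context_vectors = torch.sum(weighted_vectors, dim=1)
    return context_vectors, vector_probabilities
```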

LEARNING TO SEARCH AND SCHEDULED SAMPLING

Example 8-9. The decoder with a sampling procedure (in bold) built into the forward pass
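The core of scheduled sampling is a coin flip at each decoding step that decides whether the next input token is the ground-truth target or the model's own previous prediction. A minimal sketch, with hypothetical `decoder_step` and `classifier` callables standing in for the decoder's GRU cell and output layer:

```python
import torch
import torch.nn.functional as F

def decode_with_sampling(decoder_step, classifier, target_sequence,
                         initial_hidden, sample_probability=0.5):
    """One pass over the target, occasionally feeding back the model's own guess."""
    hidden = initial_hidden
    output_logits = []
    y_t = target_sequence[:, 0]             # BEGIN-OF-SEQUENCE tokens
    for t in range(target_sequence.size(1)):
        # With probability sample_probability, keep the model's previous
        # prediction as input; otherwise use the ground-truth token.
        use_sample = torch.rand(1).item() < sample_probability
        if not use_sample:
            y_t = target_sequence[:, t]
        step_output, hidden = decoder_step(y_t, hidden)
        logits = classifier(step_output)
        output_logits.append(logits)
        # The model's prediction, available as input for the next step.
        y_t = torch.argmax(F.softmax(logits, dim=1), dim=1)
    return torch.stack(output_logits, dim=1)
```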

## References

1. Yoshua Bengio, Patrice Simard, and Paolo Frasconi. (1994). “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks.

2. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2002). “BLEU: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of ACL.

3. Hal Daumé III, John Langford, and Daniel Marcu. (2009). “Search-Based Structured Prediction.” Machine Learning Journal.

4. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. (2015). “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.” In Proceedings of NIPS.

5. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. (2015). “Effective Approaches to Attention-Based Neural Machine Translation.” In Proceedings of EMNLP.

6. Phong Le and Willem Zuidema. (2016). “Quantifying the Vanishing Gradient and Long Distance Dependency Problem in Recursive Neural Networks and Recursive LSTMs.” In Proceedings of the 1st Workshop on Representation Learning for NLP.

7. Philipp Koehn and Rebecca Knowles. (2017). “Six Challenges for Neural Machine Translation.” In Proceedings of the First Workshop on Neural Machine Translation.

8. Graham Neubig. (2017). “Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial.” arXiv:1703.01619.

9. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention Is All You Need.” In Proceedings of NIPS.

In this chapter, we reserve the symbol ϕ for encodings.
This is not possible for streaming applications, but a large number of practical applications of NLP happen in a batch (non-streaming) context anyway.
Sentences like the one in this example are called “Garden Path Sentences.” Such sentences are more common than one would imagine; for example, newspaper headlines use such constructs regularly. See https://en.wikipedia.org/wiki/Garden_path_sentence.
Consider the two meanings of “duck”: (i) the bird (noun, quack quack) and (ii) to evade (verb).
The terminology of keys, values, and queries can be quite confusing for the beginner, but we introduce it here anyway because it has now become standard. It is worth reading this section (8.3.1) a few times until these concepts become clear. The “key, value, query” terminology arose because attention was initially thought of as a search task. For an extended review of these concepts and of attention in general, visit https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.
So much so that the original 2002 paper that proposed BLEU received a “Test of Time Award” in 2018.
For an example, see https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py.
SacreBLEU is the standard when it comes to machine translation evaluation.
We also include the cases in which these subject-verb pairs are contractions, such as “I’m”, “we’re”, and “he’s”.
This simply means that the model will be able to see the entire dataset 10 times faster. It doesn’t exactly follow that the convergence will happen in one-tenth the time, because it could be that the model needs to see this dataset for a smaller number of epochs, or some other confounding factor.
Sorting the sequences in order takes advantage of a low-level CUDA primitive for RNNs.
You should try to convince yourself of this by either visualizing the computations or drawing them out. As a hint, consider the single recurrent step: the input and last hidden are weighted and added together with the bias. If the input is all 0’s, what effect does the bias have on the output?
We utilize the describe function shown in section 1.4.
Starting from left to right on the sequence dimension, any position past the known length of the sequence is assumed to be masked.
The Vectorizer prepends the BEGIN-OF-SEQUENCE token to the sequence, so the first observation is always a special token indicating the boundary.
See section 7.3 of Graham Neubig’s tutorial for a discussion on connecting encoders and decoders in neural machine translation. See (Neubig, 2017).
We refer you to Luong, Pham, and Manning (2015), in which they outline three different scoring functions.
Each batch item is a sequence and the probabilities for each sequence sum to 1.
Broadcasting happens when a tensor has a dimension of size 1. Let this tensor be called Tensor A. When Tensor A is used in an element-wise mathematical operation (such as addition or subtraction) with another tensor called Tensor B, its shape (the number of elements on each dimension) should be identical except for the dimension with size 1. The operation of Tensor A on Tensor B is repeated for each position in Tensor B. If Tensor A has a shape (10, 1, 10) and Tensor B has a shape (10, 5, 10), A+B will repeat the addition of Tensor A for each of the five positions in Tensor B.
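A small PyTorch illustration of that example (same shapes as in the note):

```python
import torch

a = torch.randn(10, 1, 10)   # Tensor A: size 1 on the middle dimension
b = torch.randn(10, 5, 10)   # Tensor B

c = a + b                    # A is repeated across B's five middle positions
print(c.shape)               # torch.Size([10, 5, 10])

# Equivalent to explicitly expanding A before adding:
assert torch.equal(c, a.expand(10, 5, 10) + b)
```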
We refer you to two papers on this topic: “Search-based Structured Prediction” by Daumé, Langford, Marcu, and “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks” by Bengio, Vinyals, Jaitly, Shazeer (2015).
If you’re familiar with Monte Carlo sampling for optimization techniques such as Markov Chain Monte Carlo, you will recognize this pattern.
Primarily, this is because gradient descent and automatic differentiation form an elegant abstraction between model definitions and their optimization.
https://github.com/joosthub/nlpbook/chapters/chapter_8/example_8_5
We omit a plot for the first model because it attended only to the final state in the encoder RNN. As noted by Koehn and Knowles (2017), attention weights can be indicative of many different situations. We suspect the first model did not need to rely on attention as much because the information it needed was already encoded in the states of the encoder GRU.
