Paper Matrix

{"id":4,"url":"https://pm.philipcastiglione.com/papers/4.json","title":"Attention Is All You Need","read":true,"authors":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Illia Polosukhin","year":2017,"auto_summary":"The paper \"Attention Is All You Need\" introduces the Transformer, a novel neural network architecture for sequence transduction tasks, such as language translation, that relies entirely on attention mechanisms, eliminating the need for recurrent or convolutional networks. The Transformer architecture significantly enhances parallelization and reduces training time while achieving superior performance compared to previous models.\n\nKey Highlights:\n\n1. **Transformer Architecture**: The model consists of an encoder-decoder structure, where both the encoder and decoder are composed of layers of multi-head self-attention mechanisms and position-wise feed-forward networks. The encoder has six identical layers, each with a multi-head self-attention mechanism followed by a feed-forward network, and the decoder has an additional layer for encoder-decoder attention.\n\n2. **Attention Mechanisms**: The paper introduces scaled dot-product attention and multi-head attention. Scaled dot-product attention computes the attention weights by taking the dot product of queries and keys, scaling by the dimension of the keys, and applying a softmax function. Multi-head attention allows the model to focus on different parts of the input sequence simultaneously by using multiple attention mechanisms in parallel.\n\n3. **Positional Encoding**: Since the Transformer does not inherently understand the order of input sequences, positional encodings are added to the input embeddings to provide information about the position of tokens in the sequence. The authors use sinusoidal positional encodings, which allow the model to learn relative positions effectively.\n\n4. **Performance**: The Transformer achieves state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks, outperforming previous models by a significant margin. The big Transformer model achieves a BLEU score of 28.4 for English-to-German and 41.8 for English-to-French.\n\n5. **Training Efficiency**: The Transformer model is more computationally efficient than its predecessors, requiring less training time and resources. The base model can be trained in 12 hours on 8 P100 GPUs, and the big model in 3.5 days, demonstrating a substantial reduction in training cost compared to other models.\n\n6. **Generalization**: The Transformer also generalizes well to other tasks, such as English constituency parsing, achieving competitive results without task-specific tuning.\n\n7. **Advantages**: The main advantages of the Transformer include its ability to model long-range dependencies with shorter path lengths, improved parallelization, and reduced computational complexity, especially for tasks with shorter sequence lengths compared to the model's representation dimensionality.\n\nOverall, the Transformer represents a significant advancement in neural network architectures for sequence transduction, offering both improved performance and efficiency. 
Overall, the Transformer represents a significant advancement in neural network architectures for sequence transduction, offering both improved performance and efficiency. The paper's findings have had a profound impact on the field, influencing subsequent research and applications in natural language processing and beyond.

# Notes

At the time, recurrent neural networks (RNNs), in particular LSTMs (long short-term memory) and gated RNNs, along with convolutional sequence models, were the state of the art (SOTA) for sequence modeling and transduction problems such as language modeling and machine translation.

Parallelism when processing with RNNs is limited because each step's hidden state depends on the previous one, so tokens must be handled sequentially.

Attention mechanisms mitigate these limitations. Before this paper they had mostly been used together with RNNs to achieve better results.

This paper drops RNNs (and convolutional layers) altogether and introduces a new "Transformer" architecture that uses attention mechanisms only, in non-recurrent deep networks.

Encoder/decoder structures map an input sequence of symbol representations (tokens) to a sequence of continuous representations **z**. The decoder takes **z** and generates the output sequence one element at a time, auto-regressively: at each step it consumes the previously generated output elements as additional input.

Transformers use position-wise fully connected feed-forward layers in both the encoder and decoder.

[Figure: the Transformer model architecture, https://prod-files-secure.s3.us-west-2.amazonaws.com/573626e0-140e-457e-a372-d944792d247a/d3946d76-081f-4fc6-ba10-9430722916c3/image.png]

The encoder is the left stack (6 × layers).

The decoder is the right stack (6 × layers).

An attention function maps a query and a set of key-value pairs to an output (queries, keys, values, and outputs are all vectors). The output is computed as a weighted sum of the values, where the weight for each value comes from a compatibility function of the query with the corresponding key.

[Figure: scaled dot-product attention and multi-head attention, https://prod-files-secure.s3.us-west-2.amazonaws.com/573626e0-140e-457e-a372-d944792d247a/5b708a6b-ab58-45eb-8092-161ed13048b9/image.png]

Multi-head attention consists of several attention layers running in parallel.

Attention is used in 3 different ways in the model (encoder-decoder attention, encoder self-attention, and masked decoder self-attention). Some of the details are beyond my current understanding.
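Since the multi-head mechanism is the part I don't fully understand yet, here is a minimal NumPy sketch of how h parallel heads of dimension d_model/h could be wired together. The projection matrices are random placeholders standing in for the learned weights W_i^Q, W_i^K, W_i^V and W^O, and the loop structure and sizes are my own illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h=8, d_model=512, rng=np.random.default_rng(0)):
    """h parallel attention heads, each of dimension d_k = d_v = d_model / h.

    The random matrices below are placeholders for the learned projections."""
    d_k = d_model // h
    heads = []
    for _ in range(h):
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))    # scaled dot-product attention per head
        heads.append(weights @ V)
    W_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)
    return np.concatenate(heads, axis=-1) @ W_o      # concat heads, project back to d_model

x = np.random.randn(10, 512)          # 10 tokens, d_model = 512 as in the base model
print(multi_head_attention(x).shape)  # (10, 512)
```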
There are various other specifics of the model architecture.

They use some clever maths, sinusoidal positional encodings added to the embeddings, to let the model take sequence order into account.

Motivating this work is careful thinking about computational complexity and parallelism for the task at hand, specifically including long-range dependencies (e.g. between distant tokens).

They use some neat tricks with their optimizer (Adam with a warm-up-then-decay learning rate schedule) to make training runs efficient.

Some early tests on a non-translation task (English constituency parsing) show immediate success, suggesting the architecture is applicable to other kinds of problems.

# Questions

What are attention mechanisms? Encoder/decoder processes?

"Position-wise fully connected"?

> Multi-head attention consists of several attention layers

How many? Across all keys? Or some subset with high compatibility scores for Q? Something else? Eight, apparently. I don't fully understand this (a quick parameter-count check appears at the end of these notes). But,

> d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

# Takeaways

Part (most? all? unknown) of the win of transformers is computational, permitting greater scale. That's not to undersell it, though: efficiency is powerful in biological systems too.

At the time, this network got SOTA results in machine translation at up to roughly two orders of magnitude less training compute than previous best models on the hardest benchmark.

Attention also retains the ability to relate information across the sequence (including, via the positional encodings, its order).
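As a back-of-envelope check of the quoted point about head dimensions, the sketch below compares projection parameter counts for h = 8 heads of size d_model/h = 64 against a single head of full dimensionality. It counts only the Q/K/V and output projections, which is where the claimed equivalence lies; the variable names are mine, not the paper's.

```python
# Back-of-envelope check: h heads of dimension d_model/h use the same number of
# projection parameters (and roughly the same matmul FLOPs) as one full-size head.
d_model, h = 512, 8
d_k = d_v = d_model // h          # 64, as quoted above

# Q, K, V projections for all heads, plus the output projection W^O.
multi_head_params = h * 3 * d_model * d_k + (h * d_v) * d_model
single_head_params = 3 * d_model * d_model + d_model * d_model

print(multi_head_params, single_head_params)   # both 1048576 = 4 * d_model^2
```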