

BERT Is All the Rage but You Don't Understand the Transformer? This One Post Is Enough (original version; visualizing machine learning, neural networks, and deep learning)... 20201107


20211016

The scaling factor?

20211004

[NLP] A Detailed Explanation of How the Transformer Model Works - Zhihu

As used in the paper.

20210703

Comparing the GPT Model with the Transformer - znevegiveup1's blog, CSDN

Comparing the GPT model with the Transformer

GPT uses the Transformer's Decoder, while BERT uses the Transformer's Encoder. GPT uses the Masked Multi-Head Attention structure of the Decoder: when predicting the word u_i from [u1, u2, ..., u(i-1)], the words after u_i are masked out.

20210622

Is the result of Q × Kᵀ also batchsize × 64?? Each component of each row of V gets its own weight.

20210620

How Many Transformer Variants Are There? A Comprehensive Survey from Prof. Qiu Xipeng's Team at Fudan University - CSDN blog

A survey of Transformer variants

To build deeper models, a residual connection is placed around each module.

https://jalammar.github.io/illustrated-transformer/

The Illustrated Transformer

Discussions: Hacker News (65 points, 4 comments), Reddit r/MachineLearning (29 points, 3 comments)
Translations: Chinese (Simplified), Korean
Watch: MIT's Deep Learning State of the Art lecture referencing this post

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes. (Note: each word's embedding dimension is 512; the sequence length is a separate hyperparameter, discussed below.)

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.


The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.

Self-Attention at a High Level

Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

The animal didn't cross the street because it was too tired

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

The three matrices have no meaning when they are initialized; only after training do the learned parameter values take on real, physical meaning.

At this point each word can be seen as having four vectors: its embedding plus q, k, and v.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.

Dimensions: [batchsize × 512] times [512 × 64]?
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
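As a concrete illustration of this projection step, here is a minimal NumPy sketch. It assumes a toy two-word sentence, an embedding size of 512, and a q/k/v size of 64; the weight matrices are random stand-ins for the trained WQ/WK/WV, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 512, 64                     # embedding size and q/k/v size from the post
X = rng.normal(size=(2, d_model))          # 2 word embeddings (e.g. "Thinking", "Machines")

# Random stand-ins for the trained projection matrices WQ, WK, WV
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries, one 64-dim row per word
K = X @ W_K   # keys
V = X @ W_V   # values

print(Q.shape, K.shape, V.shape)           # (2, 64) (2, 64) (2, 64)
```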


What are the "query", "key", and "value" vectors?

X × W_Q = Q

X × W_K = K

X × W_V = V

At first, all three vectors can be seen as representations of the word vector itself.

key: like query, it can be seen as a representation of the word itself.

query · key: gives the weight (score) of another word relative to the current word; it is a scalar.

Each value vector (the word's representation) is then multiplied by the score from the previous step.

value is 64-dimensional; every one of its elements is multiplied by that score.

20210430


They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.
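To make the six steps concrete before moving to matrices, here is a hedged NumPy sketch that computes the self-attention output z1 for position #1 of a toy two-word sentence. The embeddings and projection matrices are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 512, 64

# Step 1: q/k/v vectors for a toy 2-word sentence (random stand-ins)
X = rng.normal(size=(2, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Steps 2-3: score q1 against every key, then divide by sqrt(d_k) = 8
scores = Q[0] @ K.T / np.sqrt(d_k)           # shape (2,): [q1·k1, q1·k2] / 8

# Step 4: softmax so the scores are positive and sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Steps 5-6: weight each value vector and sum them up
z1 = weights @ V                             # shape (64,): output for position #1
print(weights, z1.shape)
```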

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we've trained (WQ, WK, WV).

20210430

The dimensions of the weight matrix differ from the dimensions of the resulting matrix:

[batchsize × 512] × [512 × 64] = [batchsize × 64]

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

The self-attention calculation in matrix form. Note: [batchsize × 64] × [64 × batchsize] = [batchsize × batchsize]; each column of the second factor represents a word in 64 dimensions. Compressing 512 dimensions down to 64 reduces computation, while what the model actually stores are the weight matrices.

The dimensions of Z: [batchsize × batchsize] × [batchsize × 64] = [batchsize × 64].

So in the end each row (rather than each individual word?) is represented by a 64-dimensional vector.
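Under the same toy assumptions (random stand-in weights, and N as the sentence length rather than the batch size), the whole layer collapses into the single formula softmax(QKᵀ / √dk) V. A minimal NumPy sketch:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all positions at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each (N, 64)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (N, N): every word scored against every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # Z: (N, 64), one output row per word

rng = np.random.default_rng(2)
N, d_model, d_k = 3, 512, 64                         # N is the sentence length, not the batch size
X = rng.normal(size=(N, d_model))
Z = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(Z.shape)                                       # (3, 64)
```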

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

  1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. It would be useful if we're translating a sentence like "The animal didn't cross the street because it was too tired", we would want to know which word "it" refers to.

  2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

Each encoder and each decoder has eight heads: 8 × 64 = 512.

The encoder outputs [batchsize × 512], and each head's input is also [batchsize × 512]. Does a row of the matrix represent one input record rather than one word vector?? And how are all the word vectors merged together?

20210427


With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.


If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices

Question: does each row represent a single word or a whole batch record?

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place

Is the dimension of WO then 512 × 512?
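On the note above: in the paper each head projects to dv = 64, the 8 heads concatenate to 8 × 64 = 512 columns, and WO is (8 · 64) × 512 = 512 × 512, which brings the output back to the model dimension. A minimal NumPy sketch with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d_model, n_heads = 3, 512, 8
d_k = d_model // n_heads                          # 64 per head

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(N, d_model))                 # output of the layer below

# One set of random stand-in W_Q / W_K / W_V per head
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # each (N, 64)

Z_concat = np.concatenate(heads, axis=-1)         # (N, 512): the eight Z matrices side by side
W_O = rng.normal(size=(n_heads * d_k, d_model))   # (512, 512)
Z = Z_concat @ W_O                                # (N, 512), ready for the feed-forward layer
print(Z.shape)
```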

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

It seems each row represents one word's vector rather than one input record? One row is processed at a time?

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things can be harder to interpret:

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.


To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:


A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?

Is the positional encoding also a 512-dimensional vector?

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we'd add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We've color-coded them so the pattern is visible.

The vertical axis is each word position; the horizontal axis is the vector component.
A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
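Below is a minimal NumPy sketch of the sinusoidal encoding from section 3.5, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Note that this version interleaves sine and cosine as in the paper, whereas the figure above (and get_timing_signal_1d()) concatenates a sine half and a cosine half; the two layouts differ only by a permutation of the columns.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even indices: sine
    pe[:, 1::2] = np.cos(angles)                         # odd indices: cosine
    return pe

pe = positional_encoding(max_len=20, d_model=512)
print(pe.shape, pe.min(), pe.max())   # (20, 512), values in [-1, 1]
# The encoding is simply added to the word embeddings: X = embeddings + pe[:N]
```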

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

The residual connection skips over the sub-layer itself.
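A hedged sketch of this Add & Norm wrapper (post-norm, as in the original paper): the sub-layer's input is added to its output and the sum is layer-normalized. The gamma/beta scale and shift are shown at their initial values, and a random linear map stands in for the actual self-attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6, gamma=1.0, beta=0.0):
    """Normalize each position (row) to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Toy usage with a stand-in "sub-layer" (a random linear map in place of
# self-attention or the feed-forward network):
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 512))
W = rng.normal(size=(512, 512)) * 0.01
out = add_and_norm(X, lambda x: x @ W)
print(out.shape, out.mean(axis=-1)[:2])   # (3, 512), per-row means close to 0
```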

If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:



The Decoder Side

Translation tasks do use a decoder.

Classification tasks apparently have no decoder? They just end with a final fully connected layer, and BERT then acts essentially as a word embedding.

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

Eight sets? They are not merged into one? All of the decoders need them. Each set then gets its own weight, which amounts to weighting each Z once more.

Another level of abstraction on top: each Z assigns a different weight to each word, and assigning a different weight to each Z is like re-weighting each self-attention output. The decoder's input should be the word vectors of the target translation?

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer which helps the decoder focus on appropriate places in the input sequence:

Each step outputs one word? When the decoder produces the first word, its input should be the start-of-sentence token, offset by one position from the reference translation? Is the reversed-input trick from seq2seq used here?
After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to?-inf) before the softmax step in the self-attention calculation.
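A minimal sketch of that masking step: entries to the right of the current position are set to -inf before the softmax, so they receive exactly zero attention weight. The scores here are random stand-ins for QKᵀ / √dk on a toy four-token output.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 4                                        # toy output length
scores = rng.normal(size=(T, T))             # stand-in Q·K^T / sqrt(d_k) scores

# Mask future positions: row i may only attend to columns 0..i
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                  # the upper triangle is exactly 0
```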

The decoder builds its own Q matrix; K and V are taken over from the final encoder.

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
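A shape-level sketch of that layer under the same toy assumptions: the queries come from the decoder side (length T_dec), while the keys and values are projections of the encoder output (length T_enc), so the attention matrix is T_dec × T_enc.

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_k, T_enc, T_dec = 512, 64, 5, 3

enc_out = rng.normal(size=(T_enc, d_model))   # output of the top encoder
dec_in = rng.normal(size=(T_dec, d_model))    # output of the decoder's masked self-attention

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q = dec_in @ W_Q                              # queries from the decoder side
K, V = enc_out @ W_K, enc_out @ W_V           # keys/values from the encoder stack

scores = Q @ K.T / np.sqrt(d_k)               # (T_dec, T_enc): each output position vs. each input position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print((weights @ V).shape)                    # (3, 64)
```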

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
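A minimal sketch of this final step, assuming a toy vocabulary of 10,000 words and a random stand-in for the trained projection matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, vocab_size = 512, 10_000

decoder_output = rng.normal(size=(d_model,))            # vector from the top of the decoder stack
W_proj = rng.normal(size=(d_model, vocab_size)) * 0.02  # stand-in for the trained linear layer

logits = decoder_output @ W_proj                        # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # softmax: all positive, sums to 1

predicted_id = int(np.argmax(probs))                    # index of the output word for this step
print(predicted_id, float(probs[predicted_id]))
```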

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let's assume our output vocabulary only contains six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for 'end of sentence')).

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word "am" using the following vector:

Example: one-hot encoding of our output vocabulary

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

All the parameters are randomly initialized; the word embeddings are randomly initialized as well.
Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
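A small worked example of that comparison, using the six-word toy vocabulary from above: the target is a one-hot distribution for "thanks", the untrained model produces some arbitrary distribution, and the cross-entropy measures how far apart the two are.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

target = np.zeros(len(vocab))
target[vocab.index("thanks")] = 1.0                       # one-hot: the desired output is "thanks"

model_probs = np.array([0.2, 0.2, 0.3, 0.1, 0.1, 0.1])    # arbitrary untrained-model output

cross_entropy = -np.sum(target * np.log(model_probs))     # = -log p("thanks")
print(round(float(cross_entropy), 3))                     # about 2.303; training pushes this toward 0
```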

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.
The targeted probability distributions we'll train our model against in the training example for one sample sentence.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process. (Note: greedy decoding.)

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, 'I' and 'a' for example), then in the next step, run the model twice: once assuming the first output position was the word 'I', and another time assuming the first output position was the word 'a', and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called "beam search", where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
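Here is a hedged sketch of greedy decoding only (beam search is omitted for brevity). The decoder_step function is a hypothetical stub standing in for a real trained model; it just returns a random distribution over a toy vocabulary.

```python
import numpy as np

vocab = ["<bos>", "i", "am", "a", "student", "<eos>"]
rng = np.random.default_rng(8)

def decoder_step(prefix_ids):
    """Hypothetical stand-in for one decoder forward pass: returns a
    probability distribution over the vocabulary for the next word."""
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def greedy_decode(max_len=10):
    out = [vocab.index("<bos>")]
    for _ in range(max_len):
        next_id = int(np.argmax(decoder_step(out)))   # keep only the highest-probability word
        out.append(next_id)
        if vocab[next_id] == "<eos>":
            break
    return [vocab[i] for i in out[1:]]

print(greedy_decode())
```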

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser's talk walking through the model and its details
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
  • Explore the Tensor2Tensor repo.

Follow-up works:

  • Depthwise Separable Convolutions for Neural Machine Translation
  • One Model To Learn Them All
  • Discrete Autoencoders for Sequence Models
  • Generating Wikipedia by Summarizing Long Sequences
  • Image Transformer
  • Training Tips for the Transformer Model
  • Self-Attention with Relative Position Representations
  • Fast Decoding in Sequence Models using Discrete Latent Variables
  • Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Acknowledgements

Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018

Study notes (translated): Self-attention: each word is compared against the other words of the same sentence. There are N encoders and N decoders. Three vectors: q and k together build the weights, v represents the word itself, and q, k and v together produce the final vector; the values are randomly initialized and only acquire their meaning through training.

1. Each word has an embedding vector; multiplying the embedding by three weight matrices gives three vectors, q, k and v. With a single word, the word is [1, 512], the weight matrix is [512, 64], and the resulting q is [1, 64]; with N words, the input is [N, 512], the weight matrix is [512, 64], and the resulting Q is [N, 64]. The weight-matrix parameters are obtained through training. q stands for the current word and k for all the words; multiplying q by k gives every word's weight relative to the current word. Each v is then multiplied by its corresponding weight, which amplifies the contrast: high-weight words get higher, low-weight words get lower (multiplied by a very small number, perhaps ignored entirely?). If Q × Kᵀ is [2, 2] and V is [2, 3], then each column of V (the corresponding component of each word) is combined with the two elements of each row of QKᵀ, giving a [2, 3] result: the three final components for each of the two words.

2. In the actual code everything is done in matrix form, which processes many words in parallel.

3. Multiple heads: each head has its own set of Q/K/V parameters. Self-attention means K = V = Q: given an input sentence, every word computes attention against all the words in that sentence, in order to learn the dependencies between the words and capture the internal structure of the sentence. The paper motivates self-attention from three angles (per-layer complexity, parallelizability, and learning long-range dependencies) and compares its computational complexity with RNNs and CNNs. If the sequence length n is smaller than the representation dimension d, self-attention has an advantage in per-layer time complexity; when n is large, the authors also propose restricted self-attention, in which each word attends only to r nearby words instead of all the words. For parallelism, multi-head attention, like a CNN, does not depend on the previous time step and parallelizes well, which is better than an RNN. For long-range dependencies, since every word attends to every other word, the maximum path length between any two positions is 1 no matter how far apart they are, so long-range dependencies can be captured. What are multiple heads for? With only one head, the current word itself tends to get the largest weight; adding more heads lets other words that are closely related to the current word also receive large weights (extracting more features, from different angles; 20201106). Each head projects the input embeddings into a different representation subspace (that is each head's role). Each encoder has 8 heads; each head outputs an [N, 64] matrix, the 8 outputs are concatenated and multiplied by another weight matrix to finally give [N, 512], which is passed to the FFNN fully connected layer, and the fully connected layer feeds the next encoder. There are 6 or 12 encoders in total; with too few encoders, the model's capacity may not be enough for the amount of data.

4. The output of the last encoder is passed to all of the decoders (the data passed is not [N, 512] but K and V). The decoder contains an attention layer, similar to the one in seq2seq models, that helps it focus on the relevant parts of the sentence. Named-entity recognition can be seen by analogy with translation, with the labels playing the role of the target text. The decoder builds a query vector Q from its input word vectors; Q interacts with K and V to produce the translated output. During decoding, only the part of the output sequence that has already appeared is attended to; the not-yet-generated target words do not exist at prediction time and cannot be used. (Is the target sequence reversed during training, as in some seq2seq setups? Apparently not.) During training, the output is compared with the true labels and the parameters are updated through backpropagation.

The output of a lower decoder is fed into the decoder above it; does every decoder use the same set of K and V?

Decoding ends once the end-of-sequence symbol is produced.

The final linear layer outputs a vector the size of the vocabulary; after the softmax, the word (or label) with the highest probability is taken.

5. Other points: 5.1 The position of each word is encoded. 5.2 Residual layers are added as shortcut connections that bypass the self-attention or FFNN sub-layer.

BERT principles and implementation (Bilibili): https://www.ixigua.com/6889319326990795278/

Issues and notes on BERT:
  1. Long-range dependencies: XLNet.
  2. Large model size: knowledge distillation.
  3. Multi-task learning at the word and sentence levels.
  4. MLM (masked language model), at the word level: the product of the conditional probabilities of the current word given its context. A language model is the probability distribution corresponding to those conditional probabilities; masking extends this so that not only the preceding context but also the following context is used (source sequence and target sequence). To reduce computation, only 15% of the words are masked (predicted); only the positions carrying a mask token are learned, and words replaced by random words are not learned, which acts like regularization: it prevents overfitting, tolerates some erroneous information, and generalizes better.
  5. Next-sentence prediction: a binary yes/no prediction; the [CLS] token carries the YES/NO information (it is also a vector) about whether the current sentence and the next sentence are consecutive; label = IsNext.
  6. Difference from GPT: GPT does not include information from the following words, so it is unidirectional.
  7. Difference from ELMo: ELMo uses two models, one left-to-right and one right-to-left, but its feature extractor is an LSTM; the Transformer extracts features from long sequences better than an LSTM does.
  8. The Transformer consists of an encoder and a decoder; BERT uses only the encoder part. Transformer structure (Add: residual connection; Norm: layer normalization; Feed Forward: a feed-forward fully connected network). The chosen encoder layers are stacked N deep (8 or 16 units), with each layer's output serving as the next layer's input. The figures, the formulas, and the code all describe the same thing. Dividing by √64 keeps the values within a reasonable range, and the padded positions are given very small values so that they do not distort the softmax.

20210511 The mask mechanisms in the BERT family - Zhihu: several kinds of masking.
