Beyond Classification with Transformers and Hugging Face
In this post, I plan to explore aspects of cutting-edge NLP architectures like BERT/Transformers. I assume that readers are familiar with the Transformer architecture; to learn more, refer to Jay Alammar’s posts here and here. I am also going to explore a couple of BERT variations, the BERT-base and RoBERTa-base models, but these techniques can be very easily extended to more recent architectures, thanks to Hugging Face!
As I started diving into the world of Transformers, and eventually into BERT and its siblings, a common theme that I came across was the Hugging Face library (link). It reminds me of scikit-learn, which gives practitioners easy access to almost every algorithm through a consistent interface. The Hugging Face library has achieved the same kind of consistent and easy-to-use interface, but this time with deep learning based algorithms/architectures in the NLP world. We will dig into the architectures with the help of the interfaces provided by this library.
Exploratory Model Analysis (EMA)
One of the main components of any Machine Learning (ML) project is Exploratory Data Analysis (EDA). Every ML project I have been part of in the past years has started with my team and me doing EDA iteratively, which helped us understand and formulate a concrete problem statement. Figure 1 below demonstrates the typical ML process with an iterative EDA phase, which aims at answering questions about the data to help make decisions, typically about methods to leverage the data to solve specific business problems (say, via modeling).
With the advent of Deep Learning (DL), especially with the option of transfer learning, the exploration phase now extends beyond looking at the data. It also entails another cycle of exploration for models; let’s call that EMA (Exploratory Model Analysis). It involves understanding the model architectures, their pre-training process, the data and assumptions that went into the pre-training, the architectures’ limitations (e.g. input size, model bias, types of problems they cannot solve), and the extent to which they can be fine-tuned for a downstream task. In other words, analyze where they lie on the spectrum from re-training, to few-shot fine-tuning, to zero-shot (as they say in the GPT-3 world).
Figure 2: EDA + EMA: A typical Deep Learning flow with both EDA (Exploratory Data Analysis) and EMA (Exploratory Model Analysis) .. or should we just call it EDMA? :)

In this article, I would like to focus more on EMA on BERT (and the like), to understand what it can provide beyond fine-tuning for classification or Q&A. As I stated earlier, the Hugging Face library can provide us with the tools necessary to peek into a model and explore its various aspects. More specifically, I would like to use the library to answer the following questions:
How can I peek into the pre-trained model architecture and attempt to interpret the model results given the weights? If you have heard of attention weights and how they could be used to interpret these models, we will explore how to access them using the Hugging Face library, and also visualize them.
How can I access outputs from various layers of BERT-like models? What quality of output can I expect from these models if I had to go completely unsupervised? We will extract word and sentence vectors, and visualize them to analyze similarity. Further, we will examine the impact on the quality of these vectors and metrics when we fine-tune these models on a dataset from a completely different domain. To put it simply, if we did not have domain-specific datasets to fine-tune on, how do we move forward?!
These questions stem from two pain points: limited availability of labelled data, and interpretability. Most real-world projects I have worked on, unlike Kaggle competitions, do not come with a nicely labelled dataset, and the challenge is to justify the cost of creating labelled data. The second challenge, in some of those projects, is the ability to explain the model’s behavior, to hedge some flavor of risk.
Without further ado, let’s dive in! :)
Easy access to Attention weights: a step towards interpretation
All the transformer based architectures today are based on attention mechanisms. I found that understanding the basics of how attention works helped me explore how that could be used as a tool for interpretation. I plan to describe the layers at a high level in this post, and focus more on how to extract them using the Transformers library from Hugging Face. If you need to understand the concept of attention in depth, I would suggest you go through Jay Alammar’s blog (link provided earlier) or watch this playlist by Chris McCormick and Nick Ryan here.
The Hugging Face library provides us with a way to access the attention values across all attention heads in all hidden layers. In the BERT base model, we have 12 hidden layers, each with 12 attention heads. Each attention head has an attention weight matrix of size NxN (N is the number of tokens from the tokenization process). In other words, we have a total of 144 matrices (12x12), each of size NxN. The final embedding size of each token at every layer input or output is 768 (which comes from the 64-dimensional vectors of each attention head, i.e. 64x12 = 768). This will become clear as you move to Figure 4 below.
Figure 3 provides the architecture for an encoder layer. Figure 4 below drills into the attention block from Figure 3, and provides a simplified, high-level flow of one sentence through one attention layer of the BERT base model (ignoring the batch_size for simplicity). These diagrams hopefully provide clarity on what matrix will be returned when you set the output_attentions flag to true via the library.
Figure 3: A simplified diagram of the encoder stack. As you can see, we get a vector for each token after each encoder layer. The next diagram (Figure 4 below) drills into the attention block in one of these encoder blocks.

Figure 4: A simplified, high-level flow of one sentence (batch dim ignored for simplicity) through one self-attention layer of the BERT base model. The input matrix, Nx768 (N rows, one for each token, each embedded into 768 dimensions), flows through the attention layer (the box in the center). When we set output_attentions=True in the BertConfig, it returns the matrix ‘A’ for each attention head.

Note: I found tools here and here, which enable us to visualize attentions. These tools are either deprecated or do not implement all the latest architectures. Further, instead of leveraging well-maintained APIs like Hugging Face, one of them re-implements the architectures internally, which hampers the chance to run things on newer architectures.
Let’s quickly walk through the code (the full notebook can be found here). All the code here, except fine-tuning, can be run without a GPU.
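Since the embedded gist is not reproduced here, a minimal sketch of the walkthrough might look like the following (the checkpoint and variable names are illustrative assumptions; the exact return format depends on the transformers version):

```python
import torch
from transformers import BertConfig, BertModel, BertTokenizer

# Load a config that asks the model to return attention weights
config = BertConfig.from_pretrained('bert-base-uncased', output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', config=config)
model.eval()  # we only need a forward pass, no gradients

sentence = "The animal didn't cross the street because it was too tired"
inputs = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Depending on the library version, attentions are the `.attentions` attribute
# of the output object or the last element of the returned tuple.
attentions = outputs.attentions if hasattr(outputs, 'attentions') else outputs[-1]

print(len(attentions))             # 12 layers
print(attentions[0].shape)         # (batch, 12 heads, N, N)
attn_matrix = attentions[8][0, 9]  # layer 9, head 10 (0-indexed): an NxN matrix
```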
The code above creates a Bert config with output_attentions=True, and uses this config to initialize the BERT base model. We put the model into eval mode since we just care about doing a forward pass through the architecture for this task. The code then goes on to tokenize and do a forward pass. The shape of the output is based on the config passed, as described in the documentation here. The first two items are the last_hidden_state for the last layer and the pooled_output that can be used for fine-tuning. The next item is what we are interested in: the attentions. As you can see from the last statements, we can reach any layer and any attention head, each of which will give us an NxN matrix we are interested in.
We can quickly plot a heatmap for any of the 144 matrices like below
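A hedged reconstruction of the two helpers described below, continuing from the previous snippet (the exact signatures in the original notebook may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def get_attentions(attentions, layer, head):
    """Navigate to the given layer/attention head and return its NxN matrix."""
    return attentions[layer][0, head].detach().numpy()

def plt_attentions(attn_matrix, tokens, title=""):
    """Plot the attention matrix passed as a heatmap, labelled with the tokens."""
    fig, ax = plt.subplots(figsize=(8, 8))
    sns.heatmap(attn_matrix, xticklabels=tokens, yticklabels=tokens, cmap='viridis', ax=ax)
    ax.set_title(title)
    plt.show()

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
plt_attentions(get_attentions(attentions, layer=8, head=9), tokens, title="layer 9, head 10")
```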
The code above has two simple functions:
“get_attentions” navigates to the particular layer and attention head, and grabs the NxN matrix to be visualized
“plt_attentions” plots the matrix passed as a heat map
As part of my EDA in the full notebook here, I plotted all 144 heatmaps as a grid, and skimmed through them to spot some that had a good variation of attention weights. One of them in particular, shown in Figure 5 below, shows the relation between the words ‘it’ and ‘animal’ in the sentence “The animal didn’t cross the street because it was too tired”.
Figure 5: The attention heatmap for the sentence “The animal didn’t cross the street because it was too tired” from layer 9 and attention head 10. We can see that the word “it” has a large weight for “animal”.

As you can see, with some basic understanding of the architecture, the transformers library by Hugging Face makes it extremely easy to pull out raw weights from any attention head. Now that we have discussed how to pull out raw weights, let’s talk a bit about whether we should use them directly to interpret what the model has learnt. A recent paper, “Quantifying Attention Flow in Transformers”, discussed exactly this aspect. The authors state that “across layers of the Transformer, information originating from different tokens gets increasingly mixed”.
This means that reading too much into these weights to interpret how the model deconstructs the input text may not be very useful. They go on to devise a strategy to help interpret the impact of inputs on outputs. I won’t dive into the full paper here, but in short, they discuss building a Directed Acyclic Graph (DAG) on top of the architecture, which helps track paths and information flow between pairs of inputs and the hidden tokens. They discuss two approaches, “attention rollout” and “attention flow”, that can be used to interpret attention weights as the relative relevance of the input tokens. In simple terms, instead of looking at only the raw attentions in a particular layer, you should consider a weighted flow of information all the way from the input embedding to the particular hidden output. If you are interested to know more, you can also refer to this article, which explains the paper with examples.
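As an illustration of the rollout idea (not the authors’ reference implementation), a minimal sketch that averages the heads in each layer, accounts for the residual connection, and multiplies the attention matrices layer by layer could look like this, continuing from the `attentions` tuple extracted earlier:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: tuple of tensors, one per layer, each (batch, heads, N, N)."""
    rollout = None
    for layer_attn in attentions:
        attn = layer_attn[0].mean(dim=0).detach().numpy()  # average over heads -> NxN
        attn = attn + np.eye(attn.shape[0])                # account for the residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)     # re-normalize each row
        rollout = attn if rollout is None else attn @ rollout
    return rollout  # NxN: relevance of each input token for each output position

rollout = attention_rollout(attentions)
```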
In summary, to interpret the effect of specific inputs, instead of looking at only the raw attentions independently in each layer, we should take it a step further by using them to track the contribution all the way from the input embedding to specific outputs.
Access to word and sentence vectors: paths to similarity (and clustering, classification etc.)
As we discussed, it is quite easy to access the attention layers and the corresponding weights. The Hugging Face library also provides us with easy access to outputs from each layer. This allows us to generate word vectors, and potentially sentence vectors.
Word Vectors
Figure 6 below shows a few different ways we can extract word-level vectors. We could average/sum/concatenate the last few layers to get a vector. The hypothesis is that the initial layers (closer to the inputs) learn low-level features (like in a CNN), while the final few layers (closer to the output) have a much richer representation of the words. We could also just extract the last or the second-to-last layer as a word/token vector. There is no consensus on which one should be used; it really depends on the requirements of the downstream task, and it may not even matter much if the task we are trying to run downstream is simple enough.
Figure 6, Word Vectors: Ways we can extract vectors for each token. On the left, it shows how we could either average, sum or concatenate the last 4 layers to get one vector for Token-1. On the right, it shows how we could access a vector for Token-N from the last or second-to-last layer.

Word Similarity
Once we have word vectors, we are ready to use them in a downstream task like similarity. We could directly visualize them using dimensionality reduction techniques like PCA or t-SNE. Instead, I generally try to find distances between these vectors in the higher dimension first, and then use techniques like Multi-Dimensional Scaling (MDS; ref link) to visualize the distance matrix. Figure 7 below summarizes this approach (we can do this for sentences as well, as long as we can access sentence vectors directly; more on this later in the post):
Figure 7: Flow of how to visualize word vectors using cosine distance + MDS (Multi-Dimensional Scaling) for 4 words.

Implementation: Word Similarity
Let’s start looking through some code that implements the above flow. The full Colab notebook can be found here.
We will follow the same process as we used to initialize and visualize attention weights, except this time we use “output_hidden_states=True” while initializing the model.
We also switch to encode_plus instead of encode, which helps us add the CLS/SEP/PAD tokens with a lot less effort. We build some additional logic around it to get attention masks that mask the PADs and also the CLS/SEPs. Notice the return values in the function below.
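The gist itself is not reproduced here, so the following is a hedged reconstruction of such a tokenization helper (the function name, return structure, and the padding/truncation keyword arguments are assumptions based on a recent transformers version, not the original code):

```python
import torch

def tokenize_sentences(texts, tokenizer, max_len=32):
    input_ids, pad_masks, word_masks, tokenized = [], [], [], []
    for text in texts:
        enc = tokenizer.encode_plus(text, add_special_tokens=True, max_length=max_len,
                                    padding='max_length', truncation=True)
        ids = enc['input_ids']
        pad_mask = enc['attention_mask']  # 1 for real tokens, 0 for PAD
        # additionally zero out CLS/SEP so they can be excluded when averaging vectors
        word_mask = [0 if tok_id in (tokenizer.cls_token_id, tokenizer.sep_token_id) else m
                     for tok_id, m in zip(ids, pad_mask)]
        input_ids.append(ids)
        pad_masks.append(pad_mask)
        word_masks.append(word_mask)
        tokenized.append(tokenizer.convert_ids_to_tokens(ids))
    return (torch.tensor(input_ids), torch.tensor(pad_masks),
            torch.tensor(word_masks), tokenized)
```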
Let’s define the model and call it on the sample sentences.
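A sketch of that step might look like the following, continuing from the tokenization helper above (again, variable names and the exact output indexing are assumptions that depend on the transformers version):

```python
import torch
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', config=config)
model.eval()

texts = ["Joe took Alexandria out on a date.",
         "What is your date of birth?"]
input_ids, pad_masks, word_masks, tokenized_sents = tokenize_sentences(texts, tokenizer)

with torch.no_grad():
    outputs = model(input_ids, attention_mask=pad_masks)

# A tuple of 13 tensors (embedding layer + 12 encoder layers), each (batch, N, 768);
# its position in the output depends on the library version.
hidden_states = outputs.hidden_states if hasattr(outputs, 'hidden_states') else outputs[2]
```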
Let’s define a few helper functions:
get_vector: Extract vectors as needed (concat/sum across multiple layers as discussed in the diagrams above etc.)
plt_dists: Plot the distance matrix passed, by computing an MDS projection over it
eval_vecs: Tie get_vector and plt_dists together to get word vectors for sentences
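Hedged reconstructions of these three helpers are sketched below, following the cosine-distance + MDS flow from Figure 7 (the real notebook versions may differ in signature and detail):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

def get_vector(hidden_states, sent_idx, token_idx, mode='concat', n_layers=4):
    """Build one token vector from the last `n_layers` hidden layers."""
    layers = [hidden_states[-i][sent_idx, token_idx].numpy() for i in range(1, n_layers + 1)]
    if mode == 'concat':
        return np.concatenate(layers)      # 4 x 768 = 3072 dims
    if mode == 'sum':
        return np.sum(layers, axis=0)
    return np.mean(layers, axis=0)         # 'average'

def plt_dists(dist_matrix, labels):
    """Project a precomputed distance matrix to 2D with MDS and plot it."""
    coords = MDS(n_components=2, dissimilarity='precomputed',
                 random_state=0).fit_transform(dist_matrix)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y))
    plt.show()

def eval_vecs(hidden_states, tokenized_sents, mode='concat'):
    """Tie get_vector and plt_dists together for the word tokens of the sentences."""
    vecs, labels = [], []
    for s, tokens in enumerate(tokenized_sents):
        for t, token in enumerate(tokens):
            if token in ('[CLS]', '[SEP]', '[PAD]'):
                continue
            vecs.append(get_vector(hidden_states, s, t, mode=mode))
            labels.append(f"{token}_{s}")
    plt_dists(cosine_distances(np.array(vecs)), labels)
```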
That’s it! We are now ready to visualize these vectors in different configurations. Let’s look at one of them: concatenating the word vectors from the last 4 layers and visualizing them. We pass two sentences:
texts = ["Joe took Alexandria out on a date.",
"What is your date of birth?",
]
And then call the function to plot the similarity:
```python
MODE = 'concat'
eval_vecs(hidden_states, tokenized_sents, mode='concat')
```

Figure 8: As you can see, date_0 and date_1 are different vectors, and each is closer to the other words in its respective sentence. This allows us to use them based on the sentence they occur in (i.e. “contextualized”).

Sentence Vectors
The encoder-based models provide us with a couple of options to get sentence vectors from the architecture. Figure 9 shows us these options:
1. Use the [CLS] token output from the last layer directly. A quick note from the authors of BERT: this output is usually not a good summary of the semantic content of the input (source).
2. Average the token vectors across the sentence. The token vectors themselves, as we discussed above, could come from concatenating/averaging the last N layers or directly from a single layer; a minimal masked-mean sketch follows below.
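A minimal sketch of option 2 under the assumptions of the earlier snippets: the sentence vector is the masked average of the last-layer token vectors, using the word_masks built during tokenization to exclude PAD/CLS/SEP.

```python
def sentence_vector(hidden_states, word_masks, sent_idx):
    """Masked mean of the last-layer token vectors (PAD/CLS/SEP excluded via word_masks)."""
    last_layer = hidden_states[-1][sent_idx]           # (N, 768)
    mask = word_masks[sent_idx].unsqueeze(-1).float()  # (N, 1)
    return (last_layer * mask).sum(dim=0) / mask.sum()

sent_vecs = torch.stack([sentence_vector(hidden_states, word_masks, i)
                         for i in range(len(texts))])
```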
Figure 9: Sentence vectors could be extracted using the last-layer CLS token directly, OR could be averaged over all the tokens in the sentence, which in turn could come from the last layer, the second-to-last layer, or an average over a few layers, as we saw in Figure 6.

Sentence Similarity
Similarity of two sentences is very subjective. Two sentences could be very similar in one context, and could be treated as opposites in another. For example, two sentences could be called similar because they are talking about a certain topic, while discussing both positive and negative aspects of that topic. They would be considered similar if the focus is the common topic, but opposites if the focus is the polarity of the sentence. Most of these architectures are trained on independent objectives like Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), and on large, varied datasets. What we get out of the box in terms of similarity may or may not be relevant, depending on the task at hand.
Regardless, it is important to look at our options for measuring similarity. Figure 11 summarizes our options for sentence similarity in the flow diagram below.
Figure 11: Different ways of calculating sentence similarity using BERT-like models.

Options #1 and #2 above extract or try to create sentence vectors, which can then go through the same pipeline we built for word-vector similarity.
Option #3 tries to compute similarity between two sentences directly from word vectors, instead of attempting to create sentence vectors explicitly. This uses a special metric called Word Mover’s Distance (WMD).
Figure 10: WMD calculation: how the sentence on the left is “moved” to the sentence on the right.

You can read the original WMD paper here, but in short, it is based on EMD (Earth Mover’s Distance) and tries to move the words from one sentence to the other using their word vectors. An example directly from the paper is shown in Figure 10. There is a nice implementation of this here, and an awesome explanation here.
Implementation: Sentence Similarity
The code to run through the sentence similarity options is available as a Colab notebook here.
Most of the code is very similar to the word embeddings piece we discussed earlier, except for calculating the Word Mover’s Distance. The class below calculates WMD using the wmd-relax library; it needs access to an embedding lookup that yields a vector when a word is passed.
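The gist with that class is not reproduced here. Purely as an illustration of the underlying computation (not the author’s wmd-relax-based class), the sketch below solves the same transport problem directly with scipy: an earth mover’s distance between the two sentences’ uniform n-bow weights, with costs given by distances between their BERT token vectors.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.metrics.pairwise import euclidean_distances

def word_movers_distance(vecs_a, vecs_b):
    """vecs_a: (n, d) token vectors of sentence A; vecs_b: (m, d) of sentence B."""
    n, m = len(vecs_a), len(vecs_b)
    d_a = np.full(n, 1.0 / n)   # n-bow: uniform weight per token
    d_b = np.full(m, 1.0 / m)
    cost = euclidean_distances(vecs_a, vecs_b).ravel()  # c_ij, flattened row-major

    # constraints: sum_j T_ij = d_a[i], sum_i T_ij = d_b[j], T_ij >= 0
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([d_a, d_b])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method='highs')
    return res.fun  # minimal total "moving" cost between the two sentences
```

Here, vecs_a and vecs_b could be the per-token vectors produced by get_vector for the (non-special) tokens of the two sentences.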
In the original notebook, this logic is wrapped in a class defined to calculate WMD on vectors extracted from BERT. It needs an embedding lookup dictionary and internally calculates the n-bow, which is essentially a distribution of words based on their counts in the sentence.

The potential for transfer learning
I came across two levels of transfer learning for Transformer-based models:
- The default learning that comes baked into pre-trained models. Models like BERT/RoBERTa come pre-trained on large corpora, and give us a starting point.
- Fine-tuning the architecture for task-specific learning, which is typically how these architectures are used today (e.g. building classifiers/Q&A systems with a dataset from the domain).
The problems I work on generally do not have domain-specific labeled datasets. This led me to explore the third option:
What if I fine-tune the models, but on a dataset that comes from a different domain? If I make sure that the objective is similar, can I still transfer-learn, even though I know there will be a covariate shift? (Covariate shift is the change in the distribution of the input variables between the training data and the test/real-time data.)
To understand if this method of transfer learning works, let’s do a quick experiment. Let’s pick sentiment/polarity as the objective for our experimentation.
We take the pre-trained RoBERTa model, fine-tune the model on the training set from the IMDB 50K movie reviews, and then pick our evaluation dataset as follows:
2 sentences from the IMDB 50k movie reviews TEST dataset
2 sentences from Amazon fine food reviews, which is a dataset from a completely different domain. We will use this only for evaluation
We then extract and visualize sentence similarity on both the base pre-trained model (the RoBERTa architecture in this case) and the fine-tuned model of the same architecture.
The code to fine-tune BERT is available here.
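A hedged sketch of that fine-tuning step (RoBERTa for sentiment on the IMDB training split) is shown below; the dataset loading, hyperparameters, and paths here are placeholders, not the settings from the linked notebook.

```python
import torch
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenized review texts and 0/1 sentiment labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer, max_len=256):
        self.enc = tokenizer(texts, truncation=True, padding='max_length', max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# Placeholder: train_texts (list of review strings) and train_labels (list of 0/1 ints)
# are assumed to hold the IMDB 50K training split; loading them is omitted here.
train_ds = ReviewDataset(train_texts, train_labels, tokenizer)

args = TrainingArguments(output_dir='./roberta-imdb', num_train_epochs=2,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()

model.save_pretrained('./roberta-imdb')  # reload later to extract hidden states
```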
Comparison
Let’s do our first comparison by visualizing the cosine similarity between sentence vectors obtained by averaging word vectors, which are themselves averaged over the last four layers (I know this is a mouthful, but if you need clarification on any of these, feel free to scroll up to where we discussed each of them in detail).
Legend: The yellow circles are positive, while the pink circles are negative
Figure 11 below shows the visualization run on the pre-trained model:
The yellow circles are positive, while the pink circles are negative.

Figure 12 below shows the same plot as Figure 11 but zoomed in (notice the axes have a different range). This shows that while the distance between the vectors is really small, after zooming in we see that the model does attempt to place the sentences discussing the same subject (food) closer to each other, favoring the subject of discussion more than the polarity of the sentence (see the bottom two circles, one yellow and one pink).
Figure 12, Pre-trained model: After zooming in, we see that it does attempt to place the sentences discussing the subject (food) closer to each other, favoring the subject of discussion more than the polarity of the sentence. The yellow circles are positive, while the pink circles are negative.

Now let’s look at the same sentences passed through the fine-tuned model in Figure 13 below. Notice how the sentences are now grouped much more closely by polarity. As you can see, the model has moved its focus from the subject of discussion (food/movie) to polarity (positive vs. negative): the positive sentence from the movie review data is closer to the positive sentence from the food review dataset, as is the case with the negative examples. This is exactly what we needed! We wanted to use a dataset from a completely different domain, but with a similar labeling objective, to transfer-learn!
Figure 13, Fine-tuned model: As you can notice, it has moved its focus from the subject of discussion (food vs. movie) to polarity (positive vs. negative). The yellow circles are positive, while the pink circles are negative.

2. Let’s do our second round of comparison by visualizing the Word Mover’s Distance (WMD) calculated on word vectors averaged from the last four layers.
Figure 14 below shows similar characteristics to the models above. The subject of discussion (food) still seems to be the focus. One main difference here is that Word Mover’s Distance (WMD) is able to tease out the distances better than the earlier method (applying cosine distance to word vectors averaged across the sentence), even on the base model, i.e. there was no need to zoom in here. This shows that averaging word vectors may have unintended consequences for a downstream task.
Figure 14: The distance metric, Word Mover’s Distance (WMD), shows similar characteristics, but can help differentiate sentences better than averaging methods. The yellow circles are positive, while the pink circles are negative.

Figure 15 below shows the same WMD metric on the four sentences using the fine-tuned model, which again shows that we were able to “shift” the focus of the model towards polarity.
Figure 15: WMD metric on the four sentences using the fine-tuned model. The yellow circles are positive, while the pink circles are negative.

In summary, the ability to transfer-learn from unrelated domain datasets will open more avenues for projects that struggle with data today. The power of transformers can now enable companies to leverage previously unusable datasets, without having to spend time creating labelled datasets.
I hope this article was helpful. There are a few other topics, like data augmentation in NLP and model interpretation via techniques like LIME (Local Interpretable Model-agnostic Explanations), that interest me, and I plan to explore them in future posts. Until then, thanks for reading!
https://arxiv.org/pdf/1906.05714.pdf
Source: https://towardsdatascience.com/beyond-classification-with-transformers-and-hugging-face-d38c75f574fb