Contrastive Learning Paper Series: CPC (II) — Representation Learning with Contrastive Predictive Coding
0. Abstract
0.1 Sentence-by-Sentence Translation
While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence.
In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding.
The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models.
(Note: "the future" here seems to carry a particular technical meaning, but at this point in the paper it is not yet clear what.)
We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.
It also makes the model tractable by using negative sampling.
(Roughly: using negative samples keeps the whole model computationally tractable.)
While most prior work has focused on evaluating representations for a particular modality, we demonstrate that our approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
(In other words, earlier models only worked well in particular settings, whereas the method proposed here is validated across many different domains.)
0.2 Summary
- 1. This work studies unsupervised representation learning.
- 2. It centers on predicting something called "the future".
- 3. It uses negative examples, which makes the model tractable to train.
- 4. It is shown experimentally to work well across several domains.
1. Introduction
Paragraph 1 (acknowledges the progress of supervised feature learning and points out its shortcomings)
Learning high-level representations from labeled data with layered differentiable models in an end-to-end fashion is one of the biggest successes in artificial intelligence so far.
(This affirms how well supervised learning works for learning features.)
These techniques made manually specified features largely redundant and have greatly improved state-of-the-art in several real-world applications [1, 2, 3].
(Note: automatic feature extraction can now replace hand-crafted features in many applications.)
However, many challenges remain, such as data efficiency, robustness or generalization.
Paragraph 2 (unsupervised features are not specialized to a single domain, so they may be more robust)
Improving representation learning requires features that are less specialized towards solving a single supervised task.
For example, when pre-training a model to do image classification, the induced features transfer reasonably well to other image classification domains, but also lack certain information such as color or the ability to count that are irrelevant for classification but relevant for e.g. image captioning [4].
(The issue: when we train a network for one specific classification task, the features we extract are not comprehensive. We keep only what the current task needs, and many features that matter in other domains are discarded.)
Similarly, features that are useful to transcribe human speech may be less suited for speaker identification, or music genre prediction.
(i.e., features extracted with one method for one task may transfer poorly to other domains.)
Thus, unsupervised learning is an important stepping stone towards robust and generic representation learning.
(The author's reading: robustness suffers because task-specific training silently discards features; unsupervised learning never commits to a single task, so it avoids this problem.)
Paragraph 3 (there is not yet a good unsupervised learning method, nor a good way to evaluate one)
Despite its importance, unsupervised learning is yet to see a breakthrough similar to supervised learning: modeling high-level representations from raw observations remains elusive.
(Despite the importance argued above, we still have no effective way to learn high-level representations from raw observations without supervision.)
Further, it is not always clear what the ideal representation is and if it is possible that one can learn such a representation without additional supervision or specialization to a particular data modality.
Paragraph 4 (the predictive-coding ideas the method builds on)
One of the most common strategies for unsupervised learning has been to predict future, missing or contextual information.
This idea of predictive coding [5, 6] is one of the oldest techniques in signal processing for data compression.
In neuroscience, predictive coding theories suggest that the brain predicts observations at various levels of abstraction [7, 8].
Recent work in unsupervised learning has successfully used these ideas to learn word representations by predicting neighboring words [9].
For images, predicting color from grey-scale or the relative position of image patches has also been shown useful [10, 11].
We hypothesize that these approaches are fruitful partly because the context from which we predict related values are often conditionally dependent on the same shared high-level latent information.
And by casting this as a prediction problem, we automatically infer these features of interest to representation learning.
Paragraph 5 (introduces the proposed CPC)
In this paper we propose the following: first, we compress high dimensional data into a much more compact latent embedding space in which conditional predictions are easier to model.
Secondly, we use powerful autoregressive models in this latent space to make predictions many steps in the future.
之后,我們?cè)谶@一潛在空間中使用強(qiáng)大的自回歸模型對(duì)未來(lái)的許多步驟進(jìn)行預(yù)測(cè)。
Finally, we rely on Noise-Contrastive Estimation [12] for the loss function in similar ways that have been used for learning word embeddings in natural language models, allowing for the whole model to be trained end-to-end.
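The NCE-style loss referred to here is what the paper calls InfoNCE. As a rough sketch of the idea (my own NumPy toy, not the paper's implementation: the score matrix is assumed to come from the model's scoring function, and all names are mine):

```python
import numpy as np

def info_nce_loss(scores):
    """InfoNCE loss for one prediction step.

    scores: (batch, batch) matrix where scores[i, j] scores candidate
    future sample j against context i; the diagonal holds the positive
    (true future) pairs, and the off-diagonal entries act as negatives
    drawn from the rest of the batch.
    """
    # log-softmax over each row: classify which candidate is the true future
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # the loss is the negative log-probability of the positive (diagonal) entry
    return -np.mean(np.diag(log_probs))

# toy check: when positives score much higher than negatives, the loss is small
uniform = np.zeros((4, 4))          # model cannot tell candidates apart
sharp = np.eye(4) * 10.0            # model strongly prefers the true future
print(info_nce_loss(uniform))       # equals log(4): chance over 4 candidates
print(info_nce_loss(sharp))         # close to zero
```

With uniform scores the loss equals log N for N candidates (chance level), and it approaches zero as the model learns to rank the true future above the negatives, which is what lets the same objective double as a classifier over the batch.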
We apply the resulting model, Contrastive Predictive Coding (CPC), to widely different data modalities: images, speech, natural language and reinforcement learning, and show that the same mechanism learns interesting high-level information on each of these domains, outperforming other approaches.
1.2 Summary
The rough logic:
- 1. Supervised feature extraction already works well, but the resulting features still fall short in robustness and generalization.
- 2. The author suggests the cause: when training against task-specific labels, we extract only the information relevant to the current label domain and discard features useful elsewhere.
- 3. Unsupervised learning has no task-specific labels, hence no single-domain bias and no over-specialization toward one transfer target, so it avoids this problem; the author therefore argues that unsupervised training can yield more robust representations.
- 4. But unsupervised learning currently lacks both mature methods and mature evaluation protocols. (This may well have been true at the time of writing; the paper is contemporaneous with SimCLR and MoCo, but while those two were accepted smoothly, this one was repeatedly rejected, so by its later submissions contrastive learning as a whole had already matured.)
Based on the above, the author proposes their own method:
- 1. First, taking inspiration from the old predictive-coding technique in data compression (recovering neighboring content from one position), useful training signal can be obtained by predicting surrounding content.
- 2. The author therefore proposes a method built on predicting surrounding content.
The method in more detail:
- 1. First, compress all the data into a compact latent space; as I understand it, this should also improve training efficiency to some degree.
- 2. Then use the resulting representations to predict content before and after (presumably because, as argued above, this surrounding content shares underlying high-level information).
2 Contrastive Predicting Coding
We start this section by motivating and giving intuitions behind our approach.
Next, we introduce the architecture of Contrastive Predictive Coding (CPC).
After that we explain the loss function that is based on Noise-Contrastive Estimation.
Lastly, we discuss related work to CPC.
2.1 Motivation and Intuitions
2.1.1 Sentence-by-Sentence Translation
Paragraph 1 (features that span long ranges better reflect global information and are less affected by noise)
The main intuition behind our model is to learn the representations that encode the underlying shared information between different parts of the (high-dimensional) signal.
(i.e., the different parts of a high-dimensional signal share a lot of underlying information.)
At the same time it discards low-level information and noise that is more local.
In time series and high-dimensional modeling, approaches that use next step prediction exploit the local smoothness of the signal. When predicting further in the future, the amount of shared information becomes much lower, and the model needs to infer more global structure.
(Signals are locally smooth, so predicting nearby values is relatively easy; predicting values further in the future requires capturing more global structure.)
These ’slow features’ [13] that span many time steps are often more interesting (e.g., phonemes and intonation in speech, objects in images, or the story line in books.).
(i.e., information that spans long time ranges is more expressive of the global structure.)
Paragraph 2
One of the challenges of predicting high-dimensional data is that unimodal losses such as meansquared error and cross-entropy are not very useful, and powerful conditional generative models which need to reconstruct every detail in the data are usually required.
But these models are computationally intense, and waste capacity at modeling the complex relationships in the data x, often ignoring the context c.
For example, images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories).
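The arithmetic behind the "10 bits" figure, for contrast with the raw input:

```python
import math

# a class label over 1,024 categories carries log2(1024) = 10 bits,
# far less information than the raw pixels of even a small image
label_bits = math.log2(1024)
image_bits = 32 * 32 * 3 * 8  # a 32x32 RGB image at 8 bits per channel
print(label_bits, image_bits)  # 10.0 24576
```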
This suggests that modeling p(x|c) directly may not be optimal for the purpose of extracting shared information between x and c.
When predicting future information we instead encode the target x (future) and context c (present) into a compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals x and c defined as
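The formula that "defined as" points to did not survive the page extraction; equation (1) of the paper defines the mutual information between x and c as:

```latex
I(x; c) = \sum_{x,\, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}
```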
By maximizing the mutual information between the encoded representations (which is bounded by the MI between the input signals), we extract the underlying latent variables the inputs have in common.
2.1.2 Summary
Roughly: smooth, slowly varying information can be extracted across long time spans, i.e., the model picks out global features.
2.2 Contrastive Predictive Coding
Figure 1 shows the architecture of Contrastive Predictive Coding models.
Paragraph 1 (describes the network architecture)
First, a non-linear encoder genc maps the input sequence of observations xt to a sequence of latent representations zt = genc(xt), potentially with a lower temporal resolution.
Next, an autoregressive model gar summarizes all z≤t in the latent space and produces a context latent representation ct = gar(z≤t).
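A shape-level sketch of this two-stage pipeline (a NumPy toy: a linear map stands in for the non-linear encoder g_enc, and a minimal recurrent update stands in for the GRU used as g_ar; all dimensions here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions (hypothetical, not the paper's actual sizes)
T, x_dim, z_dim, c_dim = 20, 64, 32, 24

# g_enc: a linear map standing in for the non-linear encoder
W_enc = rng.normal(size=(x_dim, z_dim)) / np.sqrt(x_dim)

# g_ar: a minimal recurrent update c_t = tanh(W_z z_t + W_c c_{t-1}),
# standing in for the autoregressive model (a GRU in the paper)
W_z = rng.normal(size=(z_dim, c_dim)) / np.sqrt(z_dim)
W_c = rng.normal(size=(c_dim, c_dim)) / np.sqrt(c_dim)

x = rng.normal(size=(T, x_dim))   # input sequence x_t
z = np.tanh(x @ W_enc)            # z_t = g_enc(x_t)

c = np.zeros((T, c_dim))          # c_t = g_ar(z_<=t): summarizes the past
prev = np.zeros(c_dim)
for t in range(T):
    prev = np.tanh(z[t] @ W_z + prev @ W_c)
    c[t] = prev

print(z.shape, c.shape)  # (20, 32) (20, 24)
```

The key point the sketch illustrates is the split of roles: z_t depends only on the current input, while c_t aggregates everything up to time t and is what the model predicts future z's from.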
3. Experiments
Paragraph 1 (how the experiments are designed and carried out)
We present benchmarks on four different application domains: speech, images, natural language and reinforcement learning.
For every domain we train CPC models and probe what the representations contain with either a linear classification task or qualitative evaluations, and in reinforcement learning we measure how the auxiliary CPC loss speeds up learning of the agent.
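The "linear classification task" probe can be illustrated with a toy: freeze the learned features, fit only a linear classifier on top, and read off accuracy. Everything below is a stand-in (synthetic "frozen" features, closed-form ridge regression instead of a trained softmax classifier):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical frozen representations: two classes separated by a shift in 16-d
n, d = 200, 16
y = rng.integers(0, 2, size=n)
feats = rng.normal(size=(n, d)) + 3.0 * y[:, None]  # class-dependent shift

# linear probe: closed-form ridge regression onto one-hot targets,
# with a bias column appended so the classifier has an intercept
onehot = np.eye(2)[y]
X = np.hstack([feats, np.ones((n, 1))])
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d + 1), X.T @ onehot)

pred = (X @ W).argmax(axis=1)
acc = (pred == y).mean()
print(acc)  # near 1.0 when the features are linearly separable
```

If the frozen representations already separate the classes linearly, even this crude probe scores high; that is the sense in which linear-probe accuracy is used to measure what a representation contains.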
3.1 Audio (experiments on audio)
3.1.1 Sentence-by-Sentence Translation
For audio, we use a 100-hour subset of the publicly available LibriSpeech dataset [30].
Although the dataset does not provide labels other than the raw text, we obtained force-aligned phone sequences