Reinforcement Learning - Part 5
FAU Lecture Notes on Deep Learning
These are the lecture notes for FAU's YouTube Lecture "Deep Learning". This is a full transcript of the lecture video and the matching slides. We hope you enjoy these as much as the videos. Of course, this transcript was created largely automatically using deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!
導(dǎo)航 (Navigation)
Previous Lecture / Watch this Video / Top Level / Next Lecture
Breakout is pretty hard to learn. Image created using gifify. Source: YouTube.

Welcome back to deep learning! Today, we want to talk about deep reinforcement learning. So, I have a couple of slides for you. Of course, we want to build on the concepts that we've seen in reinforcement learning, but we talk about deep Q-learning today.
Image under CC BY 4.0 from the Deep Learning Lecture.

One of the very well-known examples is human-level control through deep reinforcement learning. In [4], this was done by Google DeepMind. They showed that a neural network is able to play Atari games. So, the idea here is to directly learn the action-value function using a deep network. The inputs are essentially the current and the three previous video frames from the game, and they are processed by a deep network that produces the best next action. So, the idea is to use this deep reinforcement learning framework to learn the best next controller movements. They use convolutional layers for the frame processing and then fully connected layers for the final decision-making.
Image under CC BY 4.0 from the Deep Learning Lecture.

Here, you see the main idea of the architecture. There are these convolutional layers and ReLUs that process the input frames. Then, you go into fully connected layers and again fully connected layers. Finally, you produce the output directly, and you can see that in Atari games this is a very limited set. You can either do no action, move in one of essentially eight directions, press the fire button, or combine one of the eight directions with the fire button. So that's all of the different things that you can do. It's a limited domain, and you can then train your system with that.
Image under CC BY 4.0 from the Deep Learning Lecture.

Well, it's a deep network that directly applies Q-learning. The state of the game is essentially the current plus the three previous frames as an image stack. So, you have a rather fuzzy way of incorporating memory and state. Then, you have 18 outputs that are associated with the different actions, and each output estimates the action value for the given input. You don't have a label and a cost function, but you update with respect to maximizing the future reward. There's a reward of +1 when the game score is increased and a reward of -1 when the game score is decreased. Otherwise, it's zero. They use an ε-greedy policy with ε decreasing to a low value during the training. They use a semi-gradient form of Q-learning to update the network weights w, and again they use mini-batches to accumulate the weight updates.
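As a rough sketch of what such an action-value network could look like, here is a minimal PyTorch example. The layer sizes loosely follow the DQN paper, but the class and function names are purely illustrative, and this is not the original DeepMind implementation.

```python
import random
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Action-value network: a stack of game frames in, one Q-value per action out."""

    def __init__(self, n_actions: int = 18, n_frames: int = 4):
        super().__init__()
        # Convolutional layers + ReLUs for the frame processing
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected layers for the final decision-making
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # assumes 84x84 input frames
            nn.Linear(512, n_actions),               # one output per joystick action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames))     # shape: (batch, n_actions)


def epsilon_greedy_action(q_net: AtariQNetwork, state: torch.Tensor, epsilon: float) -> int:
    """ε-greedy policy: explore with probability ε, otherwise take the greedy action."""
    n_actions = q_net.head[-1].out_features
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```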
Image under CC BY 4.0 from the Deep Learning Lecture.

So, they have this target network, and it's updated using the following rule (see slide). You can see that this is very close to what we have seen in the previous video. Again, you have the weights, and you update them with respect to the rewards. Now, the problem is, of course, that the γ-discounted maximum of the Q function is again a function of the weights. So, the maximization now depends on the very weights that you're trying to update, which means your target changes simultaneously with the weights that we want to learn. This can actually lead to oscillations or divergence of the weights. So, this is not very good. To solve the problem, they introduce a second, target network. Every C steps, they generate it by copying the weights of the action-value network to a duplicate network and keeping them fixed. So, you use the output q̄ of the target network as the target to stabilize the maximization. You don't use q̂, the function that you're trying to learn, but q̄, which is a kind of frozen version that you keep for a couple of iterations.
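The slide itself is not reproduced here, but the rule corresponds to the usual semi-gradient Q-learning update with the frozen target network plugged in; roughly:

$$w \leftarrow w + \alpha \big[ r + \gamma \max_{a'} \bar{q}(s', a') - \hat{q}(s, a, w) \big] \, \nabla_w \hat{q}(s, a, w)$$

where q̂ is the network being trained and q̄ is the copy whose weights are only refreshed every C steps.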
Image under CC BY 4.0 from the Deep Learning Lecture.

Another trick they have been using is experience replay. Here, the idea is to reduce the correlation between the updates. So, after performing an action a_t on the image stack and receiving the reward, you add this transition to the replay memory. You accumulate experiences in this replay memory, and then you update the network with samples drawn randomly from this memory instead of taking the most recent ones. This way, you can stabilize the training and at the same time not focus too much on one particular situation of the game. You try to keep in mind all of the different situations of the game, and this removes the dependence on the current weights and increases the stability. I have a small example for you.
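A minimal sketch of how replay memory and the target network could fit together, continuing the hypothetical AtariQNetwork from above. The buffer size, batch size, and Huber loss are my assumptions, not values from the lecture, and transitions are assumed to be stored as tensors.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Random draws break the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)


def dqn_update(q_net, target_net, memory, optimizer, batch_size=32, gamma=0.99):
    """One mini-batch update of q_net towards the frozen target network's estimate."""
    if len(memory.buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # q_hat(s, a, w): prediction of the network we are training
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target uses q_bar from the frozen copy, so it does not move with w
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every C steps one would then refresh the frozen copy, e.g. with `target_net.load_state_dict(q_net.state_dict())`.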
240 hours of training helps. Image created using gifify. Source: YouTube.

So, this is the Atari Breakout game, and you can see that the agent, in the beginning, is not performing very well. If you train it over several iterations, you can see that the game is played better. The system learns how to follow the ball with the paddle and is then able to reflect it. You can see that if you iterate and iterate, you could argue that at some point the reinforcement learning system also figures out the weaknesses of the game. In particular, one situation where you can score a really large number of points is if you manage to bring the ball behind the bricks and have it bounce around there. It will be reflected by the boundaries and not by the paddle, and it will generate a large score. This supports the claim that the system has learned a good strategy: it tries to knock out only the bricks on the left-hand side and then gets the ball into the region behind the other bricks.
Fast-forward of the game Lee Sedol vs. AlphaGo. Image created using gifify. Source: YouTube.

Of course, we need to talk about AlphaGo in this video. We want to look into some of the details of how it's actually implemented. You already heard about this one. It's from the paper "Mastering the game of Go with deep neural networks and tree search".
Image under CC BY 4.0 from the Deep Learning Lecture.

So, we already discussed that Go is a much harder problem than chess because it really has a large number of possible moves, and also a large number of possible states that can potentially emerge. The idea is that black plays against white for control over the board. It has simple rules but an extremely high number of possible moves and situations. Achieving the performance of human players was thought to be years away because of the high numerical complexity of the problem. So, we could brute-force chess, but for Go, people thought it would be impossible until we have much, much faster computers, orders of magnitude faster. Yet they could show that the system can really beat human Go experts. Go is a perfect information game: there is no hidden information and no chance. So theoretically, we could construct a full game tree and traverse it with min-max to find the best moves. The problem is the high number of legal moves. In chess, you have approximately 35 per position; in Go, there are around 250 different moves that you can choose from at each step.
Image under CC BY 4.0 from the Deep Learning Lecture.

Also, a game may involve many moves, approximately one hundred and fifty. This means that an exhaustive search is completely infeasible. Well, a search tree can, of course, be pruned if you have an accurate evaluation function. For chess, if you remember Deep Blue, this was already extremely complex and based on massive human input. For Go, the state of the art in 2002 was: "No simple yet reasonable evaluation will ever be found for Go." Well, in 2016 and 2017, AlphaGo beat Lee Sedol and Ke Jie, two of the world's strongest players. So, there is a way of solving this game.
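To get a feeling for the numbers (a rough back-of-the-envelope estimate, not a figure from the lecture): with a branching factor of about 250 and a game length of about 150 moves, the game tree contains on the order of

$$250^{150} \approx 10^{360}$$

positions, which is far beyond anything that could ever be enumerated by brute force.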
Image under CC BY 4.0 from the Deep Learning Lecture.

There were several very good ideas in this paper. It was developed by Silver et al., also at DeepMind, and it is a combination of multiple methods. They use, of course, deep neural networks. Then, they use Monte Carlo Tree Search, and they combine supervised learning and reinforcement learning. The first improvement compared to a full tree search was the Monte Carlo Tree Search. They use the networks to support efficient search through the tree.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, what's Monte Carlo Tree Search? Well, you expand your tree by looking into different possible future moves, and you focus on the moves that produce very valuable states. You expand these valuable states over a couple of moves into the future and again look at the value of the resulting states. So, you only look into a couple of valuable states and then expand over and over again for a couple of moves. Finally, you can find a situation where you probably have a much larger state value. So, you try to look a bit into the future and follow moves that are likely to produce a higher state value.
Image under CC BY 4.0 from the Deep Learning Lecture.

So, you start from the root node, which is the current state. Then, you iteratively extend the search tree to find the best future state. Here's the algorithm: you start at the root and traverse with the tree policy to a leaf node. Then, you expand and add one or more child nodes to the current leaf, probably the ones that have valuable states. Next, you simulate episodes from the current node or a child node, with actions chosen according to your rollout policy. So, you also need a policy in order to expand here. Then, you back up and propagate the received rewards backward through the tree. This allows you to find future states that have a large state value. You repeat that for a certain amount of time. Lastly, you stop and choose the action from the root node according to the accumulated statistics. In the next move, you have to start again with a new root node according to the action that your opponent has actually taken.
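The following is a minimal, single-agent UCT-style sketch of these four steps (selection, expansion, simulation, backup). The `game` interface with `legal_actions`, `step`, `is_terminal`, and `reward` is a hypothetical placeholder, and a real two-player Go implementation would additionally have to alternate perspectives, i.e. negate the reward at every other tree level.

```python
import math
import random

class Node:
    """One board state in the search tree, with visit statistics."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}     # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def uct_score(parent, child, c=1.4):
    """Tree policy: trade off exploitation (value) against exploration (rarely visited)."""
    if child.visits == 0:
        return float("inf")
    return child.value() + c * math.sqrt(math.log(parent.visits) / child.visits)


def mcts(root_state, game, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1. Selection: walk down with the tree policy until we hit a leaf
        while node.children and not game.is_terminal(node.state):
            node = max(node.children.values(), key=lambda ch: uct_score(node, ch))
        # 2. Expansion: add child nodes for the legal actions of the leaf
        if not game.is_terminal(node.state):
            for a in game.legal_actions(node.state):
                node.children[a] = Node(game.step(node.state, a), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: roll out with a (here: random) rollout policy until the end
        state = node.state
        while not game.is_terminal(state):
            state = game.step(state, random.choice(game.legal_actions(state)))
        reward = game.reward(state)
        # 4. Backup: propagate the outcome back towards the root
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Finally, pick the root action with the best accumulated statistics (most visits)
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```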
Image under CC BY 4.0 from the Deep Learning Lecture.

So, the tree policy governs to what extent successful paths are exploited and how frequently alternatives are examined. This is a typical exploration/exploitation trade-off. Well, the main problem here is, of course, that a plain Monte Carlo Tree Search is not accurate enough for Go. The idea in AlphaGo was to control the tree expansion with a neural network to find promising actions, and then to improve the value estimation with another neural network. This is much more efficient in terms of expansion and evaluation than exhaustively searching the tree.
Image under CC BY 4.0 from the Deep Learning Lecture.

How do they use these deep neural networks? They have three different networks. They have a policy network that suggests the next move in a leaf node for the expansion. Then, they have a value network that looks at the current board situation and essentially computes the chances of winning. Lastly, they have a rollout policy network that guides the rollout action selection. All of these networks are deep convolutional networks, and the input is the current board position plus additional pre-computed features.
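As a sketch of how these networks plug into the tree search (following the formulation in the AlphaGo paper, with many details omitted): during selection, actions inside the tree are chosen as

$$a_t = \operatorname*{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)},$$

where P(s, a) is the prior move probability from the policy network, N(s, a) is the visit count, and Q(s, a) is the mean action value, which blends the value network's prediction with the outcomes of the fast rollouts.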
他們?nèi)绾问褂眠@些深度神經(jīng)網(wǎng)絡(luò)? 他們有三個(gè)不同的網(wǎng)絡(luò)。 他們有一個(gè)策略網(wǎng)絡(luò),可為擴(kuò)展建議葉節(jié)點(diǎn)中的下一步行動(dòng)。 然后,他們就有了一個(gè)價(jià)值網(wǎng)絡(luò),可以查看當(dāng)前的董事會(huì)情況并從本質(zhì)上計(jì)算獲勝的機(jī)會(huì)。 最后,他們有一個(gè)指導(dǎo)政策網(wǎng)絡(luò),指導(dǎo)指導(dǎo)行動(dòng)選擇。 所有這些網(wǎng)絡(luò)都是深度卷積網(wǎng)絡(luò),輸入是當(dāng)前電路板位置和其他預(yù)先計(jì)算的功能。
Image under CC BY 4.0 from the Deep Learning Lecture.

So, here's the policy network. It has 13 convolutional layers and one output for each point on the Go board. A huge database of 30 million human expert moves was available. They start with supervised learning and train the network to predict the next move in human expert play. Then, they train this network further with reinforcement learning by playing against older versions of itself, with a reward for winning the game. Playing against older versions, of course, avoids correlation and instability. If you look at the training time, there were three weeks on 50 GPUs for the supervised part and one day for the reinforcement learning. So, there is actually quite a bit of supervised learning involved here, and not so much reinforcement learning.
Image under CC BY 4.0 from the Deep Learning Lecture.

Then, there's the value network. It has the same architecture as the policy network, but just one output node. The goal here is to predict the probability of winning the game. They train again on self-play games from reinforcement learning and use Monte Carlo policy evaluation for 30 million positions from these games. Training time was one week on 50 GPUs.
Image under CC BY 4.0 from the Deep Learning Lecture.

Then, they have the rollout policy network that is used to select the moves during the rollouts. The problem here, of course, is that the inference time of the full policy network is comparatively high, and the solution was to train a simpler linear network on a subset of the data that provides actions very quickly. This led to a speed-up of approximately a factor of one thousand compared to the policy network. So, if you work with this rollout policy network, you have a slimmer network, but it's much faster. You can do more simulations and collect more experience. That is why they use this rollout policy network.
Image under CC BY 4.0 from the Deep Learning Lecture.

Now, there was quite a bit of supervised learning involved here. So, let's have a look at AlphaGo Zero. AlphaGo Zero doesn't need human play anymore. The idea here is that you train solely with reinforcement learning and self-play. It uses a simpler Monte Carlo Tree Search without a rollout policy network. Also, for the self-play games, they introduced multi-task learning: the policy and value network share the initial layers. This then led to [3], and the extensions are also able to play chess and shogi. So, it's not just a system that can solve Go; with this, you can also play chess and shogi at an expert level. Okay, so this sums up what we've been doing in reinforcement learning. Of course, we could look at many other things here. However, there is just not enough time.
Image under CC BY 4.0 from the Deep Learning Lecture.

Next time in deep learning, we want to talk about algorithms that don't even have rewards. So, completely unsupervised training, and we also want to learn how to benefit from adversaries. We will see that there's a very cool concept out there called generative adversarial networks, which is able to generate all kinds of different images. It is a very cool concept that we'll talk about in one of the next videos. Then, we look into extensions for performing image processing tasks. So, we move more and more towards the applications.
Image under CC BY 4.0 from the Deep Learning Lecture.

Well, some comprehensive questions: What is a policy? What are value functions? Explain the exploitation versus exploration dilemma, and so on. If you're interested in reinforcement learning, I can definitely recommend having a look at the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto. It's really a great book, and you will learn in high detail about all the things that we could only scratch the surface of in these videos. So, you see that you can go much deeper into all of the details of reinforcement learning and also deep reinforcement learning. There's actually much more to say about this, but we can only remain at this level for the time being. Well, I also brought you the link and put it into the video description as well. So please enjoy this book, it's very good, and, of course, we have plenty of further references.
So, thank you very much for listening, and I hope that you can now understand at least a bit of what is happening in reinforcement learning and deep reinforcement learning, and what the main ideas are for learning to play games. Thank you very much for watching this video, and I hope to see you in the next one. Bye-bye!
If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.
Links
Link to Sutton's Reinforcement Learning in its 2018 draft, including deep Q-learning and AlphaGo details

Translated from: https://towardsdatascience.com/reinforcement-learning-part-5-70d10e0ca3d9
強(qiáng)化學(xué)習(xí)-動(dòng)態(tài)規(guī)劃
總結(jié)
以上是生活随笔為你收集整理的强化学习-动态规划_强化学习-第5部分的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 辍学的名人_辍学效果如此出色的5个观点
- 下一篇: 查看-增强会话_会话式人工智能-关键技术