Reinforcement Learning - Part 3


FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video and the matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!


Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture


Also, Mario isn’t save from Reinforcement Learning. Image created using gifify. Source: YouTube.同樣,馬里奧(Mario)也無法從強化學(xué)習(xí)中受益。 使用gifify創(chuàng)建的圖像 。 資料來源: YouTube 。

Welcome back to deep learning! So today, we want to go deeper into reinforcement learning. The concept that we want to explain today is policy iteration. It tells us how to construct better and better policies, i.e., strategies for winning games.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s have a look at the slides that I have here for you. So it’s the third part of our lecture and we want to talk about policy iteration. Now, before we had this action-value function that somehow could assess the value of an action. Of course, this now has also to depend on the state t. This is essentially our — you could say Oracle — that tries to predict the future reward g subscript t. It depends on following a certain policy that describes how to select the action and the resulting state. Now, we can also find an alternative formulation here. We introduce the state-value function. So, previously we had the action-value function that told us how valuable a certain action is. Now, we want to introduce the state-value function that tells us how valuable a certain state is. Here, you can see that it is formalized in a very similar way. Again, we have some expected value over our future reward. This is now, of course, dependent on the state. So, we kind of leave away the dependency on the action and we only focus on the state. You can now see that this is the expected value of the future reward with respect to the state. So, we want to marginalize the actions. We don’t care about what the influence of the action is. We just want to figure out what the value of a certain state is.


Image under CC BY 4.0 from the Deep Learning Lecture.

We can actually compute this. So, we can also do this for our grid example. If you recall, we had this simple game where A and B were essentially locations on the grid that would teleport you to A’ and B’. Once you arrive at A’ or B’, you get a reward: +10 for A’ and +5 for B’. Whenever you try to leave the board, you get a negative reward. Now, we can play this game and compute the state-value function. Of course, we can do this under the uniform random policy because we don’t have to know anything about the game. If we follow the uniform random policy, we can simply choose actions, play this game for a certain time, and then we are able to compute these state values according to the previous definition. You can see that the edge tiles, in particular the ones at the bottom, even have a negative value. Of course, they can have negative values: for the bottom corner tiles we find -1.9 and -2.0. At the corner tiles, there is a 50% likelihood that you will try to leave the grid, and in these two directions you will, of course, generate a negative reward. So, you can see that some states are much more valuable than others. If you look at the positions where A and B are located, they have a very high value: the tile with A has an expected future reward of 8.8 and the tile with B has an expected future reward of 5.3. So, these are really good states. You could say that with these state values, we have somehow learned something about our game. So, you could say “Okay, maybe we can use this.” We can now use greedy action selection on this state value. So, let’s define a policy that always selects the action leading to the state of highest value. If you do so, you have a new policy. If you play with this new policy, you will see that it is a better policy.
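
As a concrete illustration, here is a minimal Python sketch of this computation. It assumes the standard 5x5 layout of this gridworld from Sutton and Barto with a discount factor of 0.9; the exact tile positions and gamma are assumptions, not read off the slides, but the resulting values match the numbers quoted above (8.8 at A, 5.3 at B, negative values at the bottom corners).

```python
import numpy as np

# Minimal sketch of iterative policy evaluation for the grid example,
# assuming the standard 5x5 layout (Sutton & Barto) with gamma = 0.9.
GAMMA = 0.9
SIZE = 5
A, A_PRIME = (0, 1), (4, 1)   # assumed positions of the teleport tiles
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Return (next_state, reward) for one deterministic transition."""
    if state == A:
        return A_PRIME, 10.0   # any action on A teleports to A' with +10
    if state == B:
        return B_PRIME, 5.0    # any action on B teleports to B' with +5
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return state, -1.0     # trying to leave the board: stay put, reward -1
    return (r, c), 0.0

def evaluate_uniform_random_policy(tol=1e-4):
    """Sweep the Bellman expectation update until the values converge."""
    v = np.zeros((SIZE, SIZE))
    while True:
        v_new = np.zeros_like(v)
        for r in range(SIZE):
            for c in range(SIZE):
                for a in ACTIONS:                      # uniform policy: p(a|s) = 1/4
                    (nr, nc), reward = step((r, c), a)
                    v_new[r, c] += 0.25 * (reward + GAMMA * v[nr, nc])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

print(np.round(evaluate_uniform_random_policy(), 1))   # e.g. 8.8 at A, 5.3 at B
```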


Image under CC BY 4.0 from the Deep Learning Lecture.

So, we can now relate this to the action-value function that we used before. We introduced the state-value function in a very similar role. We can now see that we can introduce an action-value function, Q subscript policy of s and a, i.e., of the state and the action. This then basically also accounts for the transition probabilities. So, you can compute your Q policy of state and action as the expected value of the future reward given the state and the action. You can compute this in a similar way. Now, you get an expected future reward for every state and for every action.
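
In the same notation as above (again following the standard textbook definitions rather than the slide verbatim), the action-value function and its relation to the transition model can be sketched as:

```latex
% Action-value function under a policy \pi: expected return given state and action,
% expressed via the transition probabilities p(s', r | s, a).
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right]
             = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
```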


Image under CC BY 4.0 from the Deep Learning Lecture.

Are all of these value functions created equal? No. There can only be one optimal state-value function. We can show its existence without referring to a specific policy. The optimal state-value function is simply the maximum of the state-value function over all policies, i.e., the one produced by the best policy. So, the best policy will always produce the optimal state-value function. Now, we can also define the optimal action-value function. This can be related to our optimal state-value function: the optimal action-value function is given as the expected reward in the next step plus our discount factor times the optimal state-value function of the next state. So, if we know the optimal state-value function, then we can also derive the optimal action-value function. They are related to each other.
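
As a sketch in the same standard notation, the two optimal value functions and the relation just described are:

```latex
% Optimal state-value and action-value functions, and how they are related.
V^{*}(s) = \max_{\pi} V^{\pi}(s),
\qquad
Q^{*}(s, a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma\, V^{*}(S_{t+1}) \mid S_t = s,\; A_t = a \,\right]
```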


Image under CC BY 4.0 from the Deep Learning Lecture.

So, this was the state-value function for the uniform random policy. I can now show you the optimal V*, i.e., the optimal state-value function. You see that this has much higher values, of course, because we have been optimizing for it. You also observe that the optimal state-value function is strictly positive: since we are in a deterministic setting, the optimal policy can always steer towards the rewarding tiles without ever being forced off the board. So, a very important observation: in this deterministic setting, the optimal state-value function is strictly positive.


Image under CC BY 4.0 from the Deep Learning Lecture.

Now, we can also order policies. We have to determine what makes one policy better than another. We can order them with the following concept: a policy π is better than or equal to a policy π’ if and only if the state value of π is at least as high as the state value of π’ for every state in the state set. With this ordering, any policy that attains the optimal state-value function is an optimal policy. So, you see that there is only one optimal state-value function, but there might be more than one optimal policy: two or three different policies could result in the same optimal state-value function. If you know either the optimal state-value function or the optimal action-value function, then you can directly obtain an optimal policy by greedy action selection. So, if you know the optimal state values and you have complete knowledge about all the actions and their transitions, then you can always get an optimal policy by greedy action selection.
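
In symbols, a minimal sketch of this ordering and of greedy policy extraction (standard notation, not copied from the slides) looks like this:

```latex
% Partial ordering of policies and greedy extraction of an optimal policy.
\pi \geq \pi' \;\Longleftrightarrow\; V^{\pi}(s) \geq V^{\pi'}(s) \quad \forall s \in \mathcal{S}

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
           = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{*}(s') \,\bigr]
```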


Image under CC BY 4.0 from the Deep Learning Lecture.

So, let’s have a look at how this would then actually result in terms of policies. Now, greedy action selection on the optimum state-value function or the optimal action-value function would lead to the optimal policy. Well, you see here on the left inside is greedy action selection on the uniform random state-value function. So, what we’ve computed earlier in this video. You can, of course, choose your action in a way that you have the next state being a state of higher value and you end up with this kind of policy. Now, if you do the same thing on the optimal state value function, you can see that we essentially emerge with a very similar policy. You see a couple of differences. In fact, you don’t always have to move up like shown on the left-hand side. So, you can also move left or up on several occasions. You can actually choose the action at each of these squares that are indicated with multiple arrows with equal probability. So, if there’s an up and left arrow, you can choose either action and you would still have an optimal policy. So, this would be the optimal policy that is created by a greedy action selection on the optimal state value function.


Image under CC BY 4.0 from the Deep Learning Lecture.

Now, the big question is: “How can we compute optimal value functions?” We still have to determine this optimal state-value function and the optimal action-value function. In order to do this, there are the Bellman equations. They are essentially consistency conditions for value functions. Here is the example for the state-value function. You can see that you have to sum over all the different actions, weighted by your policy, because we want to marginalize out the influence of the actual action. Of course, depending on which action you choose, you end up in different states and generate different rewards. So, you also sum over the different next states and the respective rewards, and multiply their probability with the actual reward plus the discounted state-value function of the next state. In this way, you can determine the state-value function. You see that in this computation there is a dependency between the value of the current state and the value of the next state.
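
Reconstructed in standard notation (the slide’s exact layout is not reproduced here), this Bellman consistency condition reads:

```latex
% Bellman equation for the state-value function under a policy \pi.
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,
             \bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
```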


Image under CC BY 4.0 from the Deep Learning Lecture.

This means you can either write this up as a system of linear equations and actually solve it for small problems. But what is even better is that you can solve it iteratively by turning the Bellman equations into update rules. So, you see that we can generate a new value function k+1 for the current state if we simply apply the Bellman equation: we evaluate all of the different actions given the state, determine the next states and their rewards, and update according to our previous state-value function. Of course, we do this for all states s. Then, we have an updated state-value function. Okay, so this is an interesting observation: if we have some policy, we can actually run those updates.
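
As an update rule (again a sketch in standard notation), the iteration described above is:

```latex
% Iterative policy evaluation: sweep this update over all states s until convergence.
V_{k+1}(s) \;\leftarrow\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,
             \bigl[\, r + \gamma\, V_{k}(s') \,\bigr] \qquad \forall s \in \mathcal{S}
```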


Image under CC BY 4.0 from the Deep Learning Lecture.

This leads us to the concept of policy improvement. This policy iteration is what we actually want to talk about in this video. So, we can now use our state-value function to guide our search for good policies, and then we update the policy. If we use greedy action selection on an updated state-value function, then this also means that we simultaneously update our policy, because greedy action selection on the state values will generally result in different actions if the state values change. So, any change or update of the state values also implies an updated policy in the case of greedy action selection, because we have directly linked them together. This means that we can iterate the evaluation of a greedy policy on our state-value function, and we stop iterating once our policy stops changing. So, in this way we update the state values and, with the update of the state values, we immediately also update our policy. Is this actually guaranteed to work?
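
A minimal Python sketch of this policy iteration loop is shown below. It assumes a small, fully known MDP whose dynamics are given as a hypothetical tabular model transitions[s][a], a list of (probability, next_state, reward) triples; the names and the data structure are illustrative and not taken from the lecture.

```python
# Minimal sketch of policy iteration for a small, fully known MDP.
# `transitions[s][a]` is assumed to be a list of (prob, next_state, reward)
# triples -- a hypothetical tabular model, not code from the lecture.
def policy_iteration(transitions, gamma=0.9, tol=1e-6):
    states = list(transitions.keys())
    policy = {s: next(iter(transitions[s])) for s in states}  # arbitrary initial policy
    v = {s: 0.0 for s in states}

    def q(s, a):
        # Expected return of taking action a in s, then following the current values.
        return sum(p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a])

    while True:
        # 1) Policy evaluation: sweep the Bellman update until v matches the policy.
        while True:
            delta = 0.0
            for s in states:
                new_v = q(s, policy[s])
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < tol:
                break
        # 2) Policy improvement: greedy action selection on the updated state values.
        stable = True
        for s in states:
            best = max(transitions[s], key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # the policy stopped changing, so it is greedy w.r.t. its own values
            return policy, v
```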


Image under CC BY 4.0 from the Deep Learning Lecture.

Well, there’s the policy improvement theorem. If we consider changing a single action a subscript t and state s subscript t, following a policy. Then, in general, if we have a higher action-value function, the state value for all states s increases. This means that we have a better policy. So, the new policy is then a better policy. This would then also imply that we also get a better state value because we generate a higher future reward in all of the states. This means that also the state-value function must have been increased. If we only greedy select, then we will always produce a higher action value than the state value before the convergence. So, we iteratively updating the state value using greedy action selection is really a guaranteed concept here in order to improve our state values. We terminate if the policy no longer changes. One last remark: if we don’t loop over all the states in our state space for the policy evaluation but update the policy directly, this is then called value iteration. Okay. So, you have seen now in this video how we can use the state value function in order to describe the expected future reward of a specific state. We have seen that if we do greedy action selection on the state-value function, we can use this to generate better policies. If we follow a better policy, then also our state-value function will increase. So if we follow this concept, we end up in the concept of policy iteration. So with every update of the state value function where you find higher state values, you also find a better policy. This means that we can improve our policy step-by-step by the concept of policy iteration. Okay. So, this was a very first learning algorithm in the concept of reinforcement learning.


Image under CC BY 4.0 from the Deep Learning Lecture.

But of course, this is not everything. There are a couple of drawbacks, and we will talk about more concepts for actually improving our policies in the next video. There are a couple more of them, so we will present them and also talk a bit about the drawbacks of the different versions. I hope you liked this video, and we will talk a bit more about reinforcement learning in the next couple of videos. So, stay tuned and I hope to see you in the next video. Bye-bye!


Reinforcement Learning Super Mario Kart 64. Image created using gifify. Source: YouTube.

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and may be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.


Links

Link to Sutton’s Reinforcement Learning in its 2018 draft, including Deep Q-learning and AlphaGo details


Translated from: https://towardsdatascience.com/reinforcement-learning-part-3-711e31967398
