
Reinforcement Learning - Dynamic Programming (Part 4)

FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU's YouTube Lecture "Deep Learning". This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!


Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture


Sonic the Hedgehog has also been looked at with respect to reinforcement learning. Image created using gifify. Source: YouTube.

Welcome back to deep learning! Today we want to discuss a couple of reinforcement learning approaches other than the policy iteration concept that you saw in the previous video. So let's have a look at what I've got for you today: we will look at other solution methods.


Image under CC BY 4.0 from the Deep Learning Lecture.

You see that the policy and value iteration we discussed earlier require updated policies during learning in order to obtain better approximations of our optimal state-value function. These are therefore called on-policy algorithms, because you need a policy and this policy is being updated. Additionally, we assumed that the state transitions and the rewards are known, i.e. that the probability density functions producing the new states and the new rewards are known. If they are not, then you can't apply the previous concept. This is very important, and of course there are methods that relax this requirement. These methods mostly differ in how they perform the policy evaluation. So, let's look at a couple of those alternatives.
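To make the model-based setting concrete, here is a minimal sketch of a single policy-evaluation loop under a known model. The MDP representation (`P[s][a]` as a list of `(probability, next_state, reward)` tuples) and the `policy[s]` array of action probabilities are assumptions chosen for illustration, not part of the lecture material.

```python
import numpy as np

def evaluate_policy(P, policy, gamma=0.99, theta=1e-6):
    """Iterative policy evaluation for a tabular MDP with known dynamics.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    policy[s][a] the probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                # expected return of action a under the known dynamics
                v_new += pi_sa * sum(p * (r + gamma * V[s_next])
                                     for p, s_next, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once the value function has converged
            return V
```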


Image under CC BY 4.0 from the Deep Learning Lecture.

The first one that I want to show you is based on Monte Carlo techniques. This applies only to episodic tasks. Here, the idea is off-policy: you learn the optimal state value by following an arbitrary policy. It doesn't matter which policy you're using; it could even be multiple policies. Of course, you still have the exploration/exploitation dilemma, so you want to choose policies that really visit all of the states. You don't need information about the dynamics of the environment, because you can simply run many of the episodic tasks and try to reach all of the possible states. If you do so, you can generate episodes using some policy. Then, you loop in the backward direction over one episode and accumulate the expected future reward. Because you have played the game until the end, you can go backward in time over this episode and accumulate the different rewards that have been obtained. If a state has not yet been visited within the episode, you append its return to a list, and you then use this list to compute the update for the state-value function: the new estimate is simply the average over the list for that specific state. This allows you to update your state values, and this way you can iterate in order to approach the optimal state-value function.
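The following is a minimal first-visit Monte Carlo prediction sketch along the lines described above. The helper `generate_episode(policy)`, assumed to return a finished episode as a list of `(state, reward)` pairs, is hypothetical.

```python
from collections import defaultdict

def mc_state_values(generate_episode, policy, n_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo prediction of the state-value function."""
    returns = defaultdict(list)   # observed returns per state
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode(policy)   # hypothetical helper
        G = 0.0
        # loop backwards over the episode and accumulate the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # first-visit check: record G only if the state did not occur earlier
            if state not in [s for s, _ in episode[:t]]:
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V
```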

我要向您展示的第一個基于蒙特卡洛技術。 這僅適用于情景任務。 在這里,這個想法是不合政策的。 因此,您可以通過遵循任意策略來學習最佳狀態值。 您使用什么策略都沒有關系。 因此,這是一個任意政策。 可能是多個策略。 當然,您仍然有探索/開發難題。 因此,您想選擇真正訪問所有州的政策。 您不需要有關環境動態的信息,因為您可以簡單地運行許多情景任務。 您嘗試達到所有可能的狀態。 如果這樣做,則可以使用某些策略來生成這些情節。 然后,您在一個情節中向后循環,并累積了預期的未來獎勵。 因為您一直玩游戲到最后,所以您可以在此情節中向后退,并累積獲得的不同獎勵。 如果尚未訪問狀態,則將其附加到列表中,然后基本上使用該列表來計算狀態值函數的更新。 因此,您看到的只是這些列表中特定狀態的總和。 這將允許您更新狀態值,然后可以通過這種方式進行迭代以實現最佳狀態值功能。

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, another concept is temporal difference learning. This is an on-policy method. Again, it does not need information about the dynamics of the environment. The scheme here is that you loop and follow a certain policy. You use an action from the policy to observe the reward and the new state. You then update your state-value function using the previous state value plus α, which weights the influence of the new observation, times the new reward plus the discounted old state value of the new state minus the value of the old state; in other words, V(s) ← V(s) + α · (r + γ · V(s') − V(s)). This way, you can generate updates, and this actually converges to the optimal solution. A variant of this actually estimates the action-value function and is then known as SARSA.
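A small sketch of the TD(0) update might look as follows; the environment interface (`env.reset()` and a simplified `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` function are assumptions for illustration.

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=0.99, n_episodes=500):
    """TD(0) prediction of the state-value function while following `policy`."""
    V = defaultdict(float)          # state-value estimates, default 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V
```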


Image under CC BY 4.0 from the Deep Learning Lecture.

Q-learning is an off-policy method. It's a temporal difference type of method, and it does not require information about the dynamics of the environment. Here, the idea is that you loop and follow a policy derived from your action-value function; for example, you could use an ε-greedy type of approach. Then, you use the action from the policy to observe your reward and your new state. Next, you update your action-value function using the previous action value plus some weighting factor times the observed reward plus the discounted maximum action value over the actions available in the new state minus the action value of the previous state-action pair; in other words, Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a)). So it's again a kind of temporal difference that you are using here in order to update your action-value function.
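Here is a sketch of tabular Q-learning with an ε-greedy behaviour policy; the simplified environment interface and the known table sizes `n_states`/`n_actions` are again assumptions, not part of the original lecture.

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.99,
               epsilon=0.1, n_episodes=500):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q estimate
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # off-policy target uses the greedy (max) action value of s'
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                         - Q[state, action])
            state = next_state
    return Q
```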


Image under CC BY 4.0 from the Deep Learning Lecture.

Well, if you have universal function approximators, what about just parameterizing your policy with weights w and some loss function? This is known as the policy gradient, and this particular instance is called REINFORCE. You generate an episode using your policy and your weights. Then, you go forward in your episode from time 0 to time T − 1. If you do so, you can actually compute the gradient with respect to the weights, and you use this gradient to update your weights, in a very similar way to what we have previously seen in our learning approaches. You can see that this idea of using the gradient over the policy gives you a rule for how to update the weights, again with a learning rate. We are really getting close to our earlier machine learning ideas now.
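As a rough illustration, the sketch below implements the REINFORCE update for a tabular softmax policy. The helper `generate_episode(w)`, returning `(state, action, reward)` triples for one episode, is hypothetical, and the linear per-state action preferences are an assumption chosen for simplicity rather than the parameterization used in the lecture.

```python
import numpy as np

def softmax_policy(w, state):
    """Action probabilities from the per-state action preferences w[state]."""
    prefs = w[state]
    probs = np.exp(prefs - prefs.max())   # numerically stable softmax
    return probs / probs.sum()

def reinforce(generate_episode, n_states, n_actions,
              lr=0.01, gamma=0.99, n_episodes=1000):
    """REINFORCE (Monte Carlo policy gradient) for a tabular softmax policy."""
    w = np.zeros((n_states, n_actions))   # policy parameters
    for _ in range(n_episodes):
        episode = generate_episode(w)     # hypothetical helper
        for t, (state, action, _) in enumerate(episode):
            # discounted return from time t onwards
            G = sum(gamma ** (k - t) * r
                    for k, (_, _, r) in enumerate(episode[t:], start=t))
            probs = softmax_policy(w, state)
            # gradient of log pi(a|s) for a softmax over tabular preferences
            grad_log = -probs
            grad_log[action] += 1.0
            w[state] += lr * (gamma ** t) * G * grad_log
    return w
```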


Image under CC BY 4.0 from the Deep Learning Lecture.

This is why we will talk in the next video about deep Q-learning, which is the deep learning version of reinforcement learning. So, I hope you liked this video. You've now seen other options for how you can actually determine the optimal state-value and action-value functions. We have seen that there are many different ideas that no longer require exact knowledge of how future states and future rewards are generated. With these ideas, you can also do reinforcement learning, in particular with the idea of the policy gradient. We've seen that this is very much compatible with what we've seen earlier in this class regarding our machine learning and deep learning methods. We will talk about exactly this idea in the next video. So thank you very much for listening and see you in the next video. Bye-bye!

這就是為什么我們在下一個視頻中談論深度Q學習,這是強化學習的深度學習版本。 所以,我希望你喜歡這個視頻。 現在,您已經看到了有關如何實際確定最佳狀態值和動作值函數的其他選項。 這樣,我們已經看到,有許多不同的想法不再需要關于如何生成未來狀態以及如何生成未來獎勵的確切知識。 因此,有了這些想法,您還可以進行強化學習,尤其是政策梯度的想法。 我們已經看到,這與我們之前在本課程中有關機器學習和深度學習方法的內容非常兼容。 我們將在下一個視頻中討論這個想法。 因此,非常感謝您收聽并在下一個視頻中見到您。 再見!

Sonic is still a challenge for today's reinforcement learning methods. Image created using gifify. Source: YouTube.

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.


Links

Link to Sutton’s Reinforcement Learning in its 2018 draft, including Deep Q learning and Alpha Go details


Translated from: https://towardsdatascience.com/reinforcement-learning-part-4-3c51edd8c4bf

