
An Introduction to the Fundamentals of Reinforcement Learning

Reinforcement learning is probably the scientific approach that most closely resembles the way humans learn. Every day we, the learners, learn by interacting with our environment: we find out what to do in certain situations, what the consequences of our actions are, and so on.

When we were babies, we didn’t know that touching a hot kettle would hurt our hand. However, once we learned how the environment responded to our action, i.e. that touching the hot kettle hurts, we learned not to touch a hot kettle. This illustrates the fundamental idea of reinforcement learning.

Reinforcement learning is learning what to do in order to maximize a numerical reward. This means that the learner should discover, by trying them, which actions yield the highest reward in the long run.

In this article, I want to discuss the fundamentals behind reinforcement learning, which include the Markov decision process, policies, value functions, the Bellman equation, and, of course, dynamic programming.

Markov Decision Process

The Markov decision process is the fundamental problem that we try to solve in reinforcement learning. But what exactly is a Markov decision process?

A Markov Decision Process, or MDP, is a formulation of the sequential interaction between an agent and its environment.

Here, the learner and decision maker is called the agent, and the thing it interacts with is called the environment. In an MDP, the agent makes a decision or takes an action, and the environment responds by presenting the agent with a new situation, or state, along with an immediate reward.

Agent-environment interaction in a Markov decision process

In reinforcement learning, the main goal of the agent is to make decisions that maximize the total amount of reward it receives from the environment in the long run.

Let’s say we want to train a robot to play chess. Each time the robot wins a game, the reward is +1, and if it loses a game, the reward is -1. In another example, if we want to train a robot to escape from a maze, the reward decreases by 1 for every time step that passes before it escapes.

The reward in reinforcement learning is how you communicate to the agent what you want it to achieve, not how you want it achieved.

Now the question is: how do we compute the cumulative amount of reward the agent has gathered over a sequence of actions? The mathematical formulation of this cumulative reward is defined as follows.
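In standard notation, with T denoting the final time step of an episode, the return is the sum of the rewards received after time step t:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T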

Above, R is the reward received at each time step of the agent’s interaction and G is the cumulative reward, or expected return. The goal of the agent in reinforcement learning is to maximize this expected return G.

Discounted Expected Return

However, the equation above only applies when we have an episodic MDP, meaning that the sequence of agent-environment interaction is episodic, or finite. What if we have a situation where the interaction between agent and environment is continuing and infinite?

Suppose we have a problem where the agent works like an air-conditioner controller: its task is to adjust the temperature given certain situations or states. In this problem:

  • States: the current temperature, the number of people in the room, and the time of day.

  • Actions: increase or decrease the room temperature.

  • Reward: -1 if a person in the room needs to adjust the temperature manually, and 0 otherwise.

In order to avoid negative rewards, the agent needs to learn and interact with the environment continuously, meaning that there is no end to the MDP sequence. To handle this continuing task, we can use the discounted expected return.
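The discounted return weights each future reward by a power of the discount rate γ:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}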

In the equation above, γ is the discount rate, and its value should lie in the range 0 ≤ γ ≤ 1.

The intuition behind the discount rate is that a reward the agent receives at an earlier time step is worth more than the same reward received several time steps later. This assumption also makes sense in real life: because of inflation, 1 Euro today is worth more than 1 Euro several years in the future.

If γ = 0, the agent is short-sighted: it cares only about the immediate reward of its action at the next time step.

If γ is closer to 1, the agent is far-sighted: it puts more and more weight on future rewards.

With the discounted return, as long as the rewards are bounded and γ < 1, the expected return is no longer infinite.
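For example, if every reward equals a constant R_max, the discounted return becomes a geometric series, which is finite whenever γ < 1:

G_t = R_max · Σ_{k=0}^{∞} γ^k = R_max / (1 − γ)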

Policy, Value Function, and Bellman Equation

Now we know that the goal of the agent in reinforcement learning is to maximize the cumulative reward. In order to do so, the agent needs to choose, in each state, the action that leads to a high cumulative reward. The probability of the agent choosing a certain action in a given state is called the policy.

A policy is the probability of the agent selecting an action A in a given state S.

In reinforcement learning, a policy is normally represented by π. This means that π(A|S) is the probability of the agent choosing action A given that it is in state S.


Now, if an agent is in state S and follows policy π, its expected return is called the state-value function for policy π. The state-value function is normally denoted v_π(s).
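In standard notation, it is the expected return when starting in state s and following π thereafter:

v_π(s) = E_π[ G_t | S_t = s ]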

Similarly, if an agent is in state S and determines its next action based on policy π, its expected return is called the action-value function for policy π. The action-value function is normally denoted q_π(s, a).
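Likewise, it is the expected return when starting in state s, taking action a, and following π thereafter:

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]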

In some sense, the value function and the reward are similar. However, a reward refers to what is good in an immediate sense, while a value function refers to what is good in the long run. Thus, a state might have a low immediate reward but a high value, because it is regularly followed by other states that yield high rewards.

To compute the value function, the Bellman equation is commonly applied.


In reinforcement learning, the Bellman equation works by relating the value function of the current state to the values of future states.

Mathematically, the Bellman equation can be written as follows.
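With p denoting the environment dynamics, the Bellman equation for the state-value function is:

v_π(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ v_π(s′) ]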

As you can see from the equation above, the Bellman equation averages over all of the possible next states and rewards reachable from any given state, weighted by the environment dynamics p.

To make it easier to understand how the Bellman equation works intuitively, let’s relate it to an everyday experience.

Let’s say that two months ago you learned how to ride a bike for the first time. One day while riding, you pulled the brake on a surface covered in sand; the bike lost its balance and you slipped and got injured. This means that you received a negative reward from this experience.

One week later, you rode the bike again. When you reached a surface covered in sand, you slowed down. This is because you know that when the bike loses its balance, bad things happen, even though this time you didn’t actually experience it: the low value of the "losing balance" state has propagated back to the states that lead to it, which is exactly what the Bellman equation captures.

Optimal Policy and Optimal Value Function

Whenever we solve a reinforcement learning task, we want the agent to choose actions that maximize the cumulative reward. To achieve this, the agent should follow a policy that maximizes the value function we have just discussed. The policy that maximizes the value function in all states is called the optimal policy, and it is normally denoted π*.

To understand the optimal policy better, let’s take a look at the following illustration.


Optimal policy definition

As shown in the illustration above, we can say that π′ is at least as good as π because the value function at any given state under policy π′ is as good as or better than under policy π.

If we have an optimal policy, then we can rewrite the Bellman equation as follows:
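Dropping the reference to a particular policy and taking the best action in each state gives:

v_*(s) = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v_*(s′) ]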

The equation above is called the Bellman optimality equation. Note that in its final form there is no reference to any specific policy π. The Bellman optimality equation tells us that the value of a state under an optimal policy must equal the expected return of the best action from that state.

From the equation above, it is straightforward to find the optimal state-value function once we know the optimal policy. However, in real life we often don’t know what the optimal policy is.

To find the optimal policy, a dynamic programming algorithm is normally applied. With dynamic programming, the state-value function of each state is evaluated iteratively until we find the optimal policy.

Dynamic Programming to Find the Optimal Policy

Now let’s dive into the theory behind dynamic programming for finding the optimal policy. At its core, the dynamic programming algorithm applies the Bellman equation iteratively to do two things:

  • Policy evaluation

  • Policy improvement

Policy evaluation is the step that evaluates how good a given policy is. In this step, the state-value function for an arbitrary policy π is computed. We have seen that the Bellman equation gives us a system of linear equations for the state-value function:
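There is one such equation for every state s:

v_π(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ v_π(s′) ],   for all s ∈ S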

With dynamic programming, the state-value function is approximated iteratively, using the Bellman equation as an update rule, until the value function has converged in every state. The converged approximation can then be called the value function of the given policy, v_π.
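In iterative form, each sweep updates the estimate for every state from the previous estimate v_k:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ v_k(s′) ]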

Iterative policy evaluation

After we have found the value function v_π of a given policy, we need to improve the policy. Recall that a policy is optimal if and only if its value function is as good as or better than that of any other policy in every state. With policy improvement, a new policy that is at least as good in every state can be generated by acting greedily with respect to v_π:
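π′(s) = argmax_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v_π(s′) ]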

Notice that in the above equation we use the v_π that we computed in the policy evaluation step to improve the policy. If the policy does not change after we apply the above equation, it means that we have found the optimal policy.

Overall, these two steps, policy evaluation and policy improvement, are carried out iteratively using dynamic programming. First, under any given policy, the corresponding value function is computed. Then the policy is improved. With the improved policy, the next value function is computed, and so on. If a policy no longer improves compared to the previous iteration, it means that we have found the optimal policy for our problem.

Iterative policy evaluation and policy improvement in dynamic programming

Implementation of Dynamic Programming to Find the Optimal Policy

Now that we know the theory behind dynamic programming and optimal policies, let’s implement it in code with a simple use case.

Suppose that we want to manage the growing demand for parking space in a city. To do so, we control the price of parking according to the city’s preferences. In general, the city council takes the view that the more parking space is being used, the higher the social welfare. However, the council also prefers that at least one spot be left unoccupied for emergency use.

We can define the use case above as a Markov Decision Process (MDP) with:

  • State: the number of occupied parking spaces.

  • Action: the parking price.

  • Reward: the city’s preference for the resulting situation.

For this example, let’s assume that there are ten parking spots and four different price levels. This means that we have eleven states (10 plus 1, because there can be a situation in which no parking space is occupied) and four actions.

To find the optimal policy for this use case, we can use dynamic programming with the Bellman optimality equation. First we evaluate the policy, then we improve it, and we repeat these two steps iteratively until the result converges.

As a first step, let’s define a function that applies the Bellman optimality equation, as sketched below.
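The code from the original post is not reproduced on this page, so here is a minimal sketch of what such a function could look like. The names used below (`bellman_optimality_update`, an `env.transitions(state, action)` helper returning `(next_state, reward, probability)` tuples, and the value of `GAMMA`) are illustrative assumptions, not the original article’s code.

```python
import numpy as np

NUM_SPOTS = 10               # ten parking spots
NUM_STATES = NUM_SPOTS + 1   # states 0..10: number of occupied spots
NUM_ACTIONS = 4              # four price levels
GAMMA = 0.9                  # discount rate (an assumed value)


def bellman_optimality_update(env, V, state, gamma=GAMMA):
    """Update V[state] in place using the Bellman optimality equation.

    `env.transitions(state, action)` is assumed to return a list of
    (next_state, reward, probability) tuples describing p(s', r | s, a).
    """
    action_values = np.zeros(NUM_ACTIONS)
    for action in range(NUM_ACTIONS):
        for next_state, reward, prob in env.transitions(state, action):
            action_values[action] += prob * (reward + gamma * V[next_state])
    V[state] = np.max(action_values)
```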

The Bellman optimality update above evaluates the value function at any given state.

Next, let’s define a function to improve the policy. We can improve the policy by greedifying it: we transform the policy so that, in a given state, it selects the action that maximizes the value function with probability 1.
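The original greedification code is likewise not shown here; a sketch, reusing the hypothetical `env.transitions` interface from the previous block and representing the policy as a `(NUM_STATES, NUM_ACTIONS)` array of action probabilities, might look like this:

```python
def greedify_policy(env, V, policy, state, gamma=GAMMA):
    """Make `policy` greedy with respect to V at the given state."""
    action_values = np.zeros(NUM_ACTIONS)
    for action in range(NUM_ACTIONS):
        for next_state, reward, prob in env.transitions(state, action):
            action_values[action] += prob * (reward + gamma * V[next_state])
    best_action = np.argmax(action_values)
    policy[state] = 0.0                # zero out all action probabilities
    policy[state, best_action] = 1.0   # probability 1 for the greedy action
```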

Finally, we can wrap the policy evaluation and policy improvement into one function.

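One possible way to combine the two steps, reusing the two sketches above, is a value-iteration style loop: sweep the Bellman optimality update over all states until the value function stops changing, then greedify the policy. The function name, the stopping threshold `theta`, and the uniform initial policy are illustrative choices, not taken from the original article.

```python
def find_optimal_policy(env, gamma=GAMMA, theta=1e-6):
    """Iterate Bellman optimality updates until convergence, then greedify."""
    V = np.zeros(NUM_STATES)
    policy = np.ones((NUM_STATES, NUM_ACTIONS)) / NUM_ACTIONS  # uniform start
    while True:
        delta = 0.0
        for state in range(NUM_STATES):
            old_v = V[state]
            bellman_optimality_update(env, V, state, gamma)
            delta = max(delta, abs(old_v - V[state]))
        if delta < theta:   # the value function has converged in every state
            break
    for state in range(NUM_STATES):
        greedify_policy(env, V, policy, state, gamma)
    return V, policy
```

Given any environment object that exposes the assumed `transitions` interface for the parking MDP, calling `find_optimal_policy(env)` returns the converged state values and a deterministic greedy policy.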

Now, if we run the function above, we get the result described below.

From the result, we can see that the value function increases as the number of occupied parking spaces increases, except when all of the parking spaces are occupied. This is exactly what we expect, given the preferences of the city council in the use case.

The city council takes the view that the more parking space is used, the higher the social welfare, and it prefers to have at least one parking space left unoccupied. Thus, the more spots are occupied, the higher the value function, except in the last state.

Also note that when parking occupancy is high (states nine and ten), the chosen action changes from 0 (the lowest price value) to 4 (the highest price value) in order to avoid full occupancy.

Translated from: https://towardsdatascience.com/the-fundamentals-of-reinforcement-learning-177dd8626042
