Monte Carlo Tree Search: Implementing Reinforcement Learning in a Real-Time Game Player
In the previous article, we covered the fundamental concepts of reinforcement learning and closed the article with these two key questions:
1 — How can we find the best move among others if we cannot process all the successive states one by one due to a limited amount of time?
2 — How do we map the task of finding the best move to long-term rewards if we are limited in terms of computational resources and time?
In this article, to answer these questions, we go through the fundamentals of Monte Carlo Tree Search. Since in the next articles we will implement this algorithm for the board game HEX, I try to explain the concepts through examples in this board game environment. If you are more interested in the code, you can find it at this link. There is also a more optimized version, applicable on Linux because it uses Cython, which you can find here.
Here is the outline:
1 — Overview
2 — Exploration and Exploitation Trade-off
3 — HEX: A Classic Board Game
4 — Algorithm structure: Selection and Expansion
5 — Algorithm structure: Rollout
6 — Algorithm structure: Backpropagation
7 — Advantages and Disadvantages
Conclusion
Overview
The term "Monte Carlo method" was coined by Stanislaw Ulam after he first applied this statistical approach. The concept is simple: use randomness to solve problems that might be deterministic in principle. For example, in mathematics it is used to estimate an integral when we cannot calculate it directly. In the image below, you can also see how we can estimate pi with Monte Carlo simulations.
The image above indicates that in the Monte Carlo method, the more samples we gather, the more accurate an estimate of the target value we attain.
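As a quick illustration of that idea, here is a minimal Python sketch (not part of the article's HEX code) that estimates pi by sampling random points in the unit square; the estimate tightens as the number of samples grows.

```python
import random

def estimate_pi(num_samples: int) -> float:
    """Estimate pi from the fraction of random points in the unit square
    that land inside the quarter circle of radius 1 (that fraction tends to pi/4)."""
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))  # estimates approach 3.14159... as n grows
```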
- But how do Monte Carlo methods come in handy for general game playing?
We use the Monte Carlo method to estimate the quality of states stochastically, based on simulations, when we cannot process all the states. Each simulation is a self-play game that traverses the game tree from the current state until a leaf state (end of game) is reached.
So this algorithm is just perfect for our problem.
- Since it samples the future state-action space, it can estimate a near-optimal action in the current state while keeping computational effort low (which addresses the first question).
- Also, the fact that it chooses the best action based on long-term rewards (rewarding based on the results at the tree leaves) answers the second question.
This process is exactly like a human trying to estimate future moves to come up with the best possible action in a game of chess. They mentally simulate various games through self-play (from the current state to the last possible future state) and choose the one with the best overall result.
Monte Carlo Tree Search (MCTS), which combines Monte Carlo methods with tree search, is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.
Before we explain the algorithm structure, we should first discuss the exploration and exploitation trade-off.
Exploration and Exploitation Trade-off
As explained, in reinforcement learning an agent always aims to achieve an optimal strategy by repeatedly using the best actions it has found for the problem (remember the chess example in the previous article). However, there is a probability that the current best action is not actually optimal. As such, the agent continues to evaluate alternatives periodically during the learning phase by executing them instead of the perceived optimum. In RL terms, this is known as the exploration-exploitation trade-off. All RL algorithms (MCTS as well) try to balance the exploration-exploitation trade-off.
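As a tiny, generic illustration of this balance, here is an epsilon-greedy action-selection sketch in Python; note that this is a common way to inject exploration in RL, not the mechanism MCTS itself uses (MCTS relies on UCT, covered later).

```python
import random

def epsilon_greedy(action_values: dict, epsilon: float = 0.1):
    """Exploit the best-known action most of the time, but explore a
    uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(action_values))     # explore
    return max(action_values, key=action_values.get)  # exploit

# e.g. epsilon_greedy({"a": 0.4, "b": 0.7, "c": 0.1}) usually returns "b"
```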
I think this video best explains the concept of exploration-exploitation.
HEX: A Classic Board Game
Now it’s time to get to know the Hex game. It has simple rules:
Fig 2: HEX board. The winner is the white player because it connected the two white sides with a chain of stones.

- Black and white alternate turns.
- On each turn, a player places a single stone of their color on any unoccupied cell.
- The winner is the player who forms a chain of their stones connecting their two opposing board sides.
Hex can never end in a draw and can be played on any n × n board [1].
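To make the winning condition concrete, here is a minimal Python sketch of detecting a white left-to-right connection on an n × n board. The board representation (a dict mapping (row, col) to a color string) and the edge convention for white are assumptions made for this sketch; the linked repository may represent the board differently.

```python
# The six neighbours of a cell on a hex board laid out as a parallelogram.
NEIGHBORS = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)]

def white_has_won(board: dict, n: int) -> bool:
    """Depth-first search from white stones on the left edge (column 0)
    to see whether a chain of white stones reaches the right edge."""
    stack = [(r, 0) for r in range(n) if board.get((r, 0)) == "white"]
    seen = set(stack)
    while stack:
        row, col = stack.pop()
        if col == n - 1:  # reached the opposite edge: the chain is complete
            return True
        for dr, dc in NEIGHBORS:
            nxt = (row + dr, col + dc)
            if nxt not in seen and board.get(nxt) == "white":
                seen.add(nxt)
                stack.append(nxt)
    return False
```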
Now let's go through the algorithm structure.
Algorithm structure
1 — Selection and Expansion
In this step, the agent takes the current state of the game, selects a node in the tree (each node represents the state resulting from choosing an action) and traverses the tree. Each move in each state is assigned two parameters, namely total rollouts and wins per rollouts (these will be covered in the rollout section).
The strategy used to select the optimal node among the others really matters. Upper Confidence Bound applied to Trees (UCT) is the simplest, yet an effective, strategy for selecting the optimal node. This strategy is designed to balance the exploration-exploitation trade-off. This is the UCT formula:
Fig 3: UCT formula. The first term (w_i / n_i) is the exploitation term and the second term, c * sqrt(ln(t) / n_i), is the exploration term.

UCT_i = (w_i / n_i) + c * sqrt(ln(t) / n_i)

In this formula, i indicates the i-th node among the children nodes. w_i is that node's number of wins per rollouts and n_i is its number of all rollouts; this part of the formula represents exploitation. c is the exploration coefficient, a constant in the range [0, 1]; this parameter indicates how much the agent should favor unexplored nodes. t is the number of rollouts of the parent node. The second term represents exploration.
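To make the formula concrete, here is a minimal Python sketch of UCT-based child selection. The node attribute names (children, wins, rollouts) are assumptions made for this sketch, not necessarily those used in the linked implementation.

```python
import math

def uct_score(wins: float, rollouts: int, parent_rollouts: int, c: float = 0.5) -> float:
    """UCT value of a child node: exploitation (w/n) plus exploration c*sqrt(ln(t)/n)."""
    if rollouts == 0:
        return math.inf  # unvisited children are selected before any revisits
    exploitation = wins / rollouts
    exploration = c * math.sqrt(math.log(parent_rollouts) / rollouts)
    return exploitation + exploration

def select_child(parent):
    """Return the child of `parent` with the highest UCT score."""
    return max(
        parent.children,
        key=lambda child: uct_score(child.wins, child.rollouts, parent.rollouts),
    )
```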
Let's go through an example so that all the information provided sinks in. Look at the image below:
Fig 4: Selection phase.

Consider the action C3 at depth 2: its two statistics (shown in the figure) are 2 and 1, and t, the number of rollouts of its parent node, is 4. As you can see, the selection phase stops at the depth where we reach an unvisited node. Then, in the expansion phase, when we visit B1 at depth 4, we add it to the tree.
2 — Rollout (also called simulation, playout)
In this step, based on a predefined policy (such as completely random selection), we select actions until we reach a terminal state. At the terminal state, the result of the game for the current player is either 0 (if it loses the rollout) or 1 (if it wins the rollout). In the game of HEX, the terminal state is always reachable and the result of the game is a loss or a win (no draws). But in games like chess, we might end up in an effectively endless simulation because of the large branching factor and the depth of the search tree.
Fig 5: Illustrating the rollout phase following the previous steps (selection and expansion).

In the image above, after the black player chose B1 in the expansion step, a rollout is started in the simulation step and played out to a terminal state of the game.
Here, we chose random actions to reach the terminal state of the game. In the terminal state, as you can see, the white player has won the game by connecting the left side to the right side with its stones. Now it's time to use this information in the backpropagation part.
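Here is a minimal Python sketch of the random rollout described above. The state interface (copy, is_terminal, legal_moves, play, winner, current_player) is a hypothetical one chosen for this sketch, not the article's actual API.

```python
import random

def rollout(state) -> int:
    """Play uniformly random moves from `state` until the game ends and return
    1 if the player to move at `state` wins the simulated game, else 0."""
    player = state.current_player
    sim = state.copy()
    while not sim.is_terminal():
        move = random.choice(sim.legal_moves())  # the "completely random" default policy
        sim.play(move)
    return 1 if sim.winner() == player else 0
```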
3 — Backpropagation
In this part, we update the statistics (the rollout count and the number of wins per total rollouts) in the nodes that we traversed in the tree during the selection and expansion parts.
During backpropagation, we need to update the rollout counts and the win/loss statistics of the nodes. The only thing we need is to figure out which player won the game in the rollout (e.g. the white player in figure 4).
For figure 4, since the black player is the winner (the one who chose the action in the terminal state), all the states resulting from the black player's actions are rewarded with 1, and the states resulting from the white player's actions are given a reward of 0 (we could instead punish by setting it to -1).
For all states (the tree nodes selected through step 1), the total rollout count increases by one, as figure 6 shows.
Fig 6: Wins (for the black player), losses (for the white player) and the total number of rollouts are updated for the nodes visited during the tree search.

These steps keep repeating until a predefined condition ends the loop (such as a time limit).
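Here is a minimal Python sketch of backpropagation together with the overall loop that repeats the phases until a time limit is reached. The node attributes (rollouts, wins, player, children, state) and the select_and_expand / rollout_winner callables are hypothetical placeholders for the phases sketched earlier.

```python
import time

def backpropagate(path, winner) -> None:
    """Update the statistics of every node visited during selection and expansion."""
    for node in path:
        node.rollouts += 1
        if node.player == winner:  # reward 1 for the rollout winner's moves,
            node.wins += 1         # 0 otherwise (or -1 if we choose to punish)

def mcts_search(root, select_and_expand, rollout_winner, time_limit_seconds=1.0):
    """Repeat selection/expansion, rollout and backpropagation until the time
    budget runs out, then return the most visited child of the root node."""
    deadline = time.time() + time_limit_seconds
    while time.time() < deadline:
        path = select_and_expand(root)           # selection + expansion
        winner = rollout_winner(path[-1].state)  # simulation to a terminal state
        backpropagate(path, winner)              # update stats along the path
    return max(root.children, key=lambda child: child.rollouts)
```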
Advantages and Disadvantages
Advantages:
1 — MCTS is a simple algorithm to implement.
2 — Monte Carlo Tree Search is a heuristic algorithm. MCTS can operate effectively without any knowledge of the particular domain, apart from the rules and end conditions, and can find its own moves and learn from them by playing random playouts.
3 — The MCTS tree can be saved at any intermediate state, and that state can be reused in future use cases whenever required.
4 — MCTS supports asymmetric expansion of the search tree based on the circumstances in which it is operating.
Disadvantages:
1 — As the tree grows rapidly after a few iterations, the algorithm might require a huge amount of memory.
2 — There is a bit of a reliability issue with Monte Carlo Tree Search. In certain scenarios there might be a single branch or path that leads to a loss against the opponent in such turn-based games. This is mainly due to the vast number of combinations: each node might not be visited enough times to understand its outcome in the long run.
3 — The MCTS algorithm needs a huge number of iterations to be able to effectively decide on the most efficient path. So there is a bit of a speed issue there.
4 — MCTS can return a recommended move at any time because the statistics about the simulated games are constantly updated. The recommended moves aren’t great when the algorithm starts, but they continually improve as the algorithm runs.
Conclusion
Now we have figured out how the MCTS algorithm can efficiently use randomness to sample the possible scenarios and come up with the best action from its simulations. The quality of the action MCTS chooses each time depends on how well it handles exploration and exploitation in the environment.
OK, now that we have covered the necessary theoretical concepts, we are ready to go to the next level and get our hands dirty with code. In the next article, we will first describe the whole framework and the necessary modules to implement; then we will implement basic MCTS with UCT. After that, we will improve the framework by adding more functionality to our code.
Source: https://towardsdatascience.com/monte-carlo-tree-search-implementing-reinforcement-learning-in-real-time-game-player-25b6f6ac3b43