

Silver-Slides Chapter 1 - Introduction to Reinforcement Learning: Basic Concepts

Published: 2024/3/13


Some key points

Machine learning = supervised learning + unsupervised learning + reinforcement learning

What makes RL different:

There is no supervisor, only a reward signal

Feedback is delayed, not instantaneous

Time really matters (sequential, non-i.i.d. data)

Agent’s actions affect the subsequent data it receives

Rewards in RL

A reward $R_t$ is a scalar feedback signal

Indicates how well agent is doing at step t

The agent’s job is to maximise cumulative reward
All goals can be described by the maximisation of expected cumulative reward
Sequential decision making

Goal: select actions to maximise total future reward

Actions may have long term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward
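
A minimal sketch of that last point, using invented reward sequences. The numbers and the names `greedy_rewards`, `patient_rewards` and `cumulative_reward` are illustrative assumptions, not taken from the slides:

```python
# Hypothetical reward sequences for two behaviours over five steps.
# All numbers are invented for illustration.
greedy_rewards = [1, 1, 1, 1, 1]     # collects a small reward at every step
patient_rewards = [0, 0, 0, 0, 10]   # gives up early reward for a late payoff


def cumulative_reward(rewards):
    """Total (undiscounted) reward collected over the whole sequence."""
    return sum(rewards)


greedy_total = cumulative_reward(greedy_rewards)
patient_total = cumulative_reward(patient_rewards)
print(greedy_total, patient_total)
```

Judged step by step, the greedy sequence looks better for the first four steps; judged by total reward, the patient one wins.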
Exploration and Exploitation

Reinforcement learning is like trial-and-error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way.
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward

It is usually important to explore as well as exploit
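
One common way to mix the two is epsilon-greedy action selection, sketched here on a made-up multi-armed bandit. The technique is not named in the notes above; the function name `epsilon_greedy_bandit`, the arm means, and the value of `epsilon` are all assumptions chosen for illustration:

```python
import random


def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Gaussian bandit (illustrative sketch only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm has been pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm to learn more about it.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm that currently looks best.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical bandit with three arms of increasing mean reward.
estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 1.0])
```

With `epsilon = 0` the agent only exploits and can lock onto a poor arm; with `epsilon = 1` it only explores and never uses what it has learned.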

Elements of RL

References: silver-slides and https://zhuanlan.zhihu.com/p/26608059

Agent

The agent is the protagonist. It comprises three components:

Policy, Value Function, Model.

Policy:

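The policy is the agent's guide to behaviour: a mapping from state (s) to action (a). Policies are either deterministic or stochastic. A deterministic policy assigns exactly one action to each state, $a = \pi(s)$. A stochastic policy assigns a probability to each action available in a state, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$. Which kind to use depends on the problem at hand.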
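
A minimal sketch of the two kinds of policy as lookup tables. The states `s0`/`s1`, the actions `left`/`right`, and the probabilities are hypothetical and chosen only for illustration:

```python
import random

# Hypothetical two-state, two-action example.
deterministic_policy = {"s0": "left", "s1": "right"}   # a = pi(s)
stochastic_policy = {                                   # pi(a | s)
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}


def act_deterministic(state):
    """Return the single action the deterministic policy assigns to state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the probabilities pi(a | state)."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
sampled = [act_stochastic("s0", rng) for _ in range(1000)]
left_fraction = sampled.count("left") / len(sampled)
```

Calling `act_deterministic("s0")` always gives the same action, while repeated calls to `act_stochastic("s0", rng)` give `left` in roughly 80% of the samples.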
Value Function:

The value function is a prediction of total future reward. It is used to evaluate the goodness/badness of states, and therefore to select between actions.
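
One standard way to write this prediction down is the state-value function under a policy $\pi$. The discount factor $\gamma \in [0, 1]$ is an assumption introduced here; it does not appear in the notes above:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
```

With $\gamma$ close to 0 the value mostly reflects immediate reward; with $\gamma$ close to 1 it weighs long-term reward almost as heavily.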
Model:

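The model is the agent's own representation of the environment, built from its reading of the environment's state. It predicts what the environment will do next: for example, which state follows a particular action, or how much reward that action will earn. Some agents have no model at all.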
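
As a sketch, the two predictions described above are often written as a transition part and a reward part. The symbols $\mathcal{P}$ and $\mathcal{R}$ are notation assumed here for illustration:

```latex
\mathcal{P}^a_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right]
\qquad
\mathcal{R}^a_s = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]
```

$\mathcal{P}$ predicts the next state and $\mathcal{R}$ predicts the next (immediate) reward.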

This splits the problems an agent faces in sequential decision making into two kinds: the reinforcement learning problem and the planning problem.

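In the reinforcement learning problem there is no model of the environment, so the agent can only improve its policy step by step by interacting with the environment. The environment is initially unknown. The agent interacts with the environment. The agent improves its policy.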

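In the planning problem a model of the environment is already available, so the outcome of each move can be worked out in advance. The agent only has to use the model to compute which actions are best and improve its policy accordingly. A model of the environment is known. The agent performs computations with its model (without any external interaction). The agent improves its policy, a.k.a. deliberation, reasoning, introspection, pondering, thought, search.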

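For example, in the reinforcement learning problem the rules of the game are unknown: the agent acts through the joystick and learns from the score, which is its reward. In the planning problem the rules are known: the agent can query the emulator and plan ahead to find the optimal policy, for example with tree search.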
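
To make the planning case concrete, here is a minimal sketch that computes a policy purely from a known model, with no interaction. It uses value iteration rather than the tree search mentioned above, and the three-state chain, the rewards, and the discount factor `gamma` are all hypothetical:

```python
# Hypothetical 3-state chain: model[state][action] = (next_state, reward).
# State 2 is terminal. The model is fully known, so no interaction is needed.
model = {
    0: {"stay": (0, 0.0), "go": (1, -1.0)},
    1: {"stay": (1, 0.0), "go": (2, 10.0)},
}
terminal = {2}
gamma = 0.9  # assumed discount factor


def plan(model, terminal, gamma, sweeps=50):
    """Value iteration on a deterministic model, then a greedy policy."""
    values = {s: 0.0 for s in list(model) + list(terminal)}
    for _ in range(sweeps):
        for s, actions in model.items():
            # Back up the best one-step lookahead value for state s.
            values[s] = max(r + gamma * values[s2] for s2, r in actions.values())
    policy = {
        s: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for s, actions in model.items()
    }
    return values, policy


values, policy = plan(model, terminal, gamma)
print(values, policy)
```

In this toy model the planner chooses `go` in state 0 even though it costs an immediate reward of -1, because the model shows a larger reward one step later.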

Categories of agents:

By method, agents are Value Based, Policy Based, or Actor Critic. The first works from a value function, the second works directly from a policy, and the third combines the two.

By whether they contain a model, agents are Model Free or Model Based.

Environment

The setting in which the story takes place. There are two kinds:

Fully Observable Environment: the agent can observe all of the environment's information. Agent state = environment state = information state.
Formally, this is a Markov decision process (MDP).
Partially Observable Environment: the agent can observe only part of the environment's information. Agent state ≠ environment state.

Formally, this is a partially observable Markov decision process (POMDP).

Agent must construct its own state representation:

Complete history: $S^a_t = H_t$
Beliefs of environment state: $S^a_t = (\mathbb{P}[S^e_t = s^1], \dots, \mathbb{P}[S^e_t = s^n])$
Recurrent neural network: $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$
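
The recurrent update in the last line can be sketched in plain Python. The state and observation sizes, the weight values, and the helper names `sigmoid`, `vec_mat` and `rnn_state_update` are assumptions made for illustration; the sigmoid is taken as the nonlinearity σ:

```python
import math


def sigmoid(x):
    """Logistic function, used here as the nonlinearity sigma."""
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(vec, mat):
    """Row vector times matrix, both given as plain Python lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def rnn_state_update(prev_state, observation, W_s, W_o):
    """New agent state: sigma(prev_state W_s + observation W_o), element-wise."""
    recurrent = vec_mat(prev_state, W_s)
    driven = vec_mat(observation, W_o)
    return [sigmoid(r + d) for r, d in zip(recurrent, driven)]


# Hypothetical sizes and weights: 2-dimensional state, 3-dimensional observation.
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_o = [[0.2, 0.0], [-0.4, 0.6], [0.1, 0.1]]

first_state = rnn_state_update([0.0, 0.0], [1.0, 0.0, 0.0], W_s, W_o)

# Fold a short observation sequence into the agent state, one step at a time.
state = [0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    state = rnn_state_update(state, obs, W_s, W_o)
```

Unlike the complete-history representation, the state here has a fixed size no matter how many observations have been folded in.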

State

There are three kinds of state: Environment State, Agent State, and Information State (also called Markov state).

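Environment State:

All the information the environment uses to pick the next observation/reward; it is what the environment truly contains. The agent usually cannot see it, or cannot obtain it completely with its own abilities. Even when the environment state is fully visible, it may contain a lot of irrelevant information.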
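Agent State:

All the information the agent uses to pick its next action, which is also the information our algorithms run on. My own reading is that it is the agent's interpretation of the environment state: it may be incomplete, but it is what decisions are actually based on.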
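Information State/Markov state:

Contains all useful information from the history. To me this is more of a property than a third kind of state on a par with the other two.

Its core idea is that, given the present, the past is of no use for predicting the future. The current state already holds everything useful for predicting the future, so once it is known the earlier history can be thrown away.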

The environment state $S^e_t$ is Markov. The history $H_t$ is Markov.
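
Stated as an equation, under the usual definition (the notes above give it only in words), a state $S_t$ is Markov if and only if:

```latex
\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]
```

Conditioning on the full sequence of earlier states changes nothing once $S_t$ is known, which is the "throw away the past" idea in symbols.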

Closely related to state is the history:

The history is the sequence of observations, actions, rewards: $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$

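It contains all the observable variables up to time t: observations, actions, and rewards. What happens next, whether the agent's action or the environment's observation/reward, therefore depends on the history.

State is defined as a function of the history, $S_t = f(H_t)$. The state is a summary of the relevant information in the history, and it is this information that determines what happens next.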

What happens next depends on the history:

The agent selects actions

The environment selects observations/rewards
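
The interaction described above can be sketched as a loop that builds the history and derives the state from it. The toy environment, its two observations `low`/`high`, the actions `press`/`wait`, and the choice of f (keep only the latest observation) are all assumptions made for illustration:

```python
import random


def environment_emit(prev_observation, prev_action, rng):
    """Hypothetical environment: selects the next observation and reward."""
    reward = 1.0 if (prev_observation, prev_action) == ("high", "press") else 0.0
    observation = rng.choice(["low", "high"])
    return observation, reward


def state_from_history(history):
    """State as a function of history; this f keeps only the latest observation."""
    return [value for kind, value in history if kind == "O"][-1]


def select_action(state):
    """Agent side: pick an action from the current agent state."""
    return "press" if state == "high" else "wait"


rng = random.Random(0)
history = []                      # the history as a flat list of (kind, value) pairs
observation, action = None, None
last_step = 5
for t in range(1, last_step + 1):
    observation, reward = environment_emit(observation, action, rng)
    history += [("O", observation), ("R", reward)]   # history now ends in O_t, R_t
    action = select_action(state_from_history(history))
    if t < last_step:
        history.append(("A", action))                # A_t is part of the next history

kinds = [kind for kind, _ in history]
```

After the loop, `kinds` follows the pattern O, R, A, O, R, A, ..., O, R, matching the definition of the history given earlier.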

Observation

Action

Reward

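Reward is a scalar that measures how well things are going. The agent's ultimate goal is to maximise the cumulative reward over the whole process, so it often pays to take the long view rather than be penny wise and pound foolish.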

A reward $R_t$ is a scalar feedback signal. It indicates how well the agent is doing at step $t$. The agent’s job is to maximise cumulative reward.
