
Reinforcement Learning

 

[Demo: ball-bouncing game - ATARI]

[Demo: racing game with DQN - ATARI]

[Demo: racing game - TORCS]

Ref:
    Hung-yi Lee (李宏毅), YouTube DRL lectures 1-3

On-policy vs. Off-policy

On-policy
    The agent being trained and the agent interacting with the environment are the same.
    Like 阿光 learning by playing Go himself.
Off-policy
    The agent being trained and the agent interacting with the environment are different.

    Like 佐助 playing Go while 阿光 watches from the side.
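
A minimal sketch of the distinction (numbers are hypothetical): off-policy methods reuse data collected by another (behavior) policy, typically reweighting each sample by an importance ratio, whereas on-policy methods train only on data collected by the current policy itself.

    # On-policy: the policy being updated is the same one that collected the data.
    # Off-policy: data comes from a behavior policy (theta'), and each sample is
    # reweighted by the importance ratio pi_theta(a|s) / pi_theta'(a|s).
    def importance_weight(p_target, p_behavior):
        return p_target / p_behavior

    # Hypothetical probabilities the two policies assign to the same sampled action.
    print(importance_weight(p_target=0.30, p_behavior=0.25))  # 1.2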



Add a baseline:
    It is possible that R is always positive,
    so subtract an expected value (a baseline) from R.
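
A minimal sketch of the idea, using the mean return as the baseline (numbers are illustrative):

    import numpy as np

    # All raw returns R are positive, so every sampled action would be pushed up.
    # Subtracting a baseline b (here the mean return) makes the weights signed.
    returns = np.array([10.0, 12.0, 8.0, 11.0])   # hypothetical episode returns
    baseline = returns.mean()                     # b ~= E[R]
    advantages = returns - baseline               # weights used in the policy gradient
    print(advantages)                             # [-0.25  1.75 -2.25  0.75]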


The policy in "Policy Gradient" is what outputs the action, e.g. left / right / fire.
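
For example, a sketch of a hypothetical linear policy that maps an observation to a probability distribution over left/right/fire and samples an action from it:

    import numpy as np

    ACTIONS = ["left", "right", "fire"]

    def policy(observation, W):
        logits = observation @ W               # hypothetical linear policy network
        probs = np.exp(logits - logits.max())  # softmax over the three actions
        return probs / probs.sum()

    rng = np.random.default_rng(0)
    obs, W = rng.normal(size=4), rng.normal(size=(4, 3))
    probs = policy(obs, W)
    print(probs, ACTIONS[rng.choice(3, p=probs)])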


Gamma-discounted rewards:
    Contributions further away in time are given lower weight.
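
A minimal sketch of computing the discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...:

    def discounted_returns(rewards, gamma=0.99):
        # Walk backwards so each step reuses the already-discounted future sum.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return list(reversed(returns))

    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]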

The reward function and the action space are defined prior to training.

MC vs. TD
MC (Monte Carlo): the critic is evaluated after the episode ends. Larger variance (because conditions differ a lot from episode to episode), but unbiased (judging only once the episode is over is fairer).
TD (temporal difference): the critic is updated during the episode. Smaller variance, but possibly biased.
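
A sketch of the two targets for V(s_t) (names are illustrative): MC waits for the full discounted return, TD bootstraps from the current value estimate after one step.

    def mc_target(rewards_from_t, gamma=0.99):
        # Monte Carlo: full discounted return, available only after the episode ends.
        return sum(gamma ** k * r for k, r in enumerate(rewards_from_t))

    def td_target(r_t, v_next, gamma=0.99):
        # Temporal difference: one real reward plus the bootstrapped estimate V(s_{t+1}).
        return r_t + gamma * v_next

    print(mc_target([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.9**2 * 2 = 2.62
    print(td_target(1.0, v_next=1.5, gamma=0.9))  # 1 + 0.9 * 1.5 = 2.35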



Atari: A3C => gym
TORCS: DDPG => gym-torcs
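
A minimal interaction loop using the classic gym API (CartPole-v1 as a stand-in; the Atari and TORCS environments expose the same reset/step interface):

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # random actions as a placeholder policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print(total_reward)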


PPO
   easy to code
   easy to tune
   sample efficient
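
A minimal sketch of PPO's clipped surrogate objective, one reason it is easy to tune: the probability ratio pi_theta / pi_theta_old is clipped so a single update cannot move too far from the policy that collected the data.

    import numpy as np

    def ppo_clip_objective(ratio, advantage, eps=0.2):
        unclipped = ratio * advantage
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
        return np.minimum(unclipped, clipped)       # maximize this surrogate

    ratios = np.array([0.5, 1.0, 1.5])              # hypothetical probability ratios
    print(ppo_clip_objective(ratios, advantage=1.0))  # [0.5 1.  1.2]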


Replay Buffer:
Put each experience (s_t, a_t, r_t, s_{t+1}) into a buffer,
because interacting with the environment costs more time than training:
playing the game usually takes longer than training the network.
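
A minimal sketch of such a buffer (class name and capacity are illustrative):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped first

        def push(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # Random minibatches let slow environment interaction be reused
            # across many cheap training updates.
            return random.sample(self.buffer, batch_size)

    buf = ReplayBuffer()
    buf.push(0, 1, 1.0, 2)
    buf.push(2, 0, 0.0, 3)
    print(buf.sample(2))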
