Paddle-ball game - ATARI
Racing game with DQN - ATARI
Racing game - TORCS
Ref:
Prof. Hung-yi Lee (李宏毅), YouTube DRL lectures 1-3
On-policy vs. Off-policy
On-policy
The agent that learns and the agent that interacts with the environment are the same.
e.g., 阿光 learns by playing Go himself.
Off-policy
The agent that learns and the agent that interacts with the environment are different.
e.g., 佐助 plays Go while 阿光 watches from the side.
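Not in the notes, but a standard tabular illustration of the distinction (problem size and hyperparameters are assumptions): SARSA is on-policy because its update target uses the action the behavior policy actually took next, while Q-learning is off-policy because its target takes the greedy action regardless of what the behavior policy did.

    import numpy as np

    n_states, n_actions = 16, 4     # assumed tabular problem size
    alpha, gamma = 0.1, 0.99        # assumed step size / discount
    Q = np.zeros((n_states, n_actions))

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: target uses a_next, the action the interacting
        # (behavior) policy actually chose in s_next.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def q_learning_update(s, a, r, s_next):
        # Off-policy: target takes the greedy action in s_next,
        # independent of what the behavior policy did.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])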
Add a baseline:
It is possible that R is always positive, so the probability of every sampled action gets pushed up.
Fix: subtract an expected value (a baseline) from R, so the weight can be negative.
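A minimal sketch of the baseline trick (function name and shapes are my own; mean return used as the baseline): with weight (R - b), below-average actions are pushed down instead of everything being pushed up.

    import torch

    def pg_loss_with_baseline(log_probs, returns):
        # log_probs: log pi_theta(a_t | s_t) for the sampled actions
        # returns:   (discounted) return R for each of those steps
        baseline = returns.mean()        # simple baseline: expected return
        advantage = returns - baseline   # signed weight; can be negative now
        # Maximize E[(R - b) * log pi]  ==  minimize its negation
        return -(advantage.detach() * log_probs).mean()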
The "policy" in "Policy Gradient" is the mapping that outputs an action, e.g., left/right/fire.
Gamma-discounted rewards:
Down-weight the contribution of rewards that are further away in time.
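A small sketch of computing the discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... with a backward pass (function name is mine):

    def discounted_returns(rewards, gamma=0.99):
        # Rewards k steps in the future are down-weighted by gamma^k.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return returns[::-1]

    # e.g. discounted_returns([1, 0, 0, 1], gamma=0.9)
    # -> [1.729, 0.81, 0.9, 1.0]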
The reward function and the action space are defined prior to training.
MC vs. TD
MC (Monte Carlo): the critic updates only after the episode ends: larger variance (because conditions differ a lot from episode to episode), unbiased (judged on the full episode, which is fairer).
TD (temporal-difference): the critic updates during the episode: smaller variance, but possibly biased.
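A tabular sketch of the two update targets (step size and discount are assumptions): MC waits for the full observed return, TD bootstraps from the current estimate of the next state, so it can update mid-episode but inherits any error in V.

    alpha, gamma = 0.1, 0.99   # assumed step size / discount
    V = {}                     # state -> value estimate

    def mc_update(episode):
        # episode: list of (state, reward); runs only after the episode ends
        g = 0.0
        for s, r in reversed(episode):
            g = r + gamma * g                          # full observed return
            V[s] = V.get(s, 0.0) + alpha * (g - V.get(s, 0.0))

    def td_update(s, r, s_next):
        # Runs during the episode; bootstraps from the estimate V[s_next],
        # which makes it biased whenever that estimate is wrong.
        target = r + gamma * V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))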
Atari: A3C => gym
TORCS: DDPG => gym-torcs
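For reference, a minimal interaction loop with a gym environment (classic pre-0.26 gym API where step returns a 4-tuple; the environment name is just an example):

    import gym

    env = gym.make("Pong-v0")        # example environment name
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()           # random placeholder policy
        obs, reward, done, info = env.step(action)   # classic 4-tuple API
        total_reward += reward
    print(total_reward)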
PPO
easy to code
easy to tune
sample efficient
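A sketch of the clipped surrogate objective from the PPO paper (Schulman et al., 2017), which is where the easy-to-tune property comes from: the new/old probability ratio is clipped to [1-eps, 1+eps] so a single update cannot move the policy too far. Tensor shapes and eps are assumptions.

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
        ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        # Pessimistic bound: take the worse of the two objectives
        return -torch.min(unclipped, clipped).mean()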
Replay Buffer:
Store each experience (s_t, a_t, r_t, s_{t+1}) in the buffer.
Why: interacting with the environment costs more time than training; playing the game usually takes longer than training the network.
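A minimal replay buffer sketch (capacity and uniform sampling are assumptions): each stored transition can be reused across many gradient updates, amortizing the expensive environment interaction.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)    # oldest experience drops off

        def push(self, s, a, r, s_next, done):
            # One transition: (s_t, a_t, r_t, s_{t+1}) plus a done flag
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            # Uniform random minibatch: breaks temporal correlation and
            # lets one environment interaction feed many training steps.
            return random.sample(self.buffer, batch_size)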