Paddle-ball game: ATARI
Racing game (DQN): ATARI
Racing game: TORCS

Ref: Prof. Hung-yi Lee (李宏毅), YouTube, DRL 1-3

On-policy vs. Off-policy
On-policy: the agent being trained and the agent interacting with the environment are the same.
  Example: Hikaru learns by playing Go himself.
Off-policy: the agent being trained and the agent interacting with the environment are different.
  Example: Sasuke (佐助) plays Go while Hikaru watches from the side.

Add a baseline: the return R may always be positive, so subtract a baseline (an estimate of the expected return) from R.

"Policy" in "Policy Gradient" means the thing that outputs the action, e.g. left / right / fire.

Gamma-discounted rewards: the further away in time a reward is, the lower the weight of its contribution.

The reward function and the actions are defined prior to training.

MC vs. TD
MC (Monte Carlo): judges after the episode ends. Larger variance (because conditions differ a lot from episode to episode), unbiased (it judges only once the episode is over, which is fairer).
TD (temporal-difference approach): judges during the episode. Smaller variance, but may be biased.

ATARI: A3C ...
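
The baseline, gamma-discounting, and MC vs. TD points above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names are hypothetical (they are not from the lecture), the baseline is simply the mean of the returns in one episode, and the TD target shown is the one-step form r + gamma * V(s').

```python
def discounted_returns(rewards, gamma):
    """MC-style return for every step: G_t = r_t + gamma * G_{t+1}.

    Needs the whole episode, so it can only be computed after the episode ends.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def subtract_baseline(returns):
    """Subtract a baseline from each return.

    Assumption: the baseline is the mean return of this episode, one simple
    stand-in for "an expectation value". With it, returns that were all
    positive end up with mixed signs.
    """
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


def td_target(reward, next_value, gamma, done):
    """One-step TD target: r + gamma * V(s'), available during the episode.

    next_value is a (hypothetical) critic's estimate of V(s'); relying on an
    estimate is what can make TD biased.
    """
    return reward if done else reward + gamma * next_value


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]           # every reward is positive
    returns = discounted_returns(rewards, gamma=0.5)
    print("returns:", returns)
    print("with baseline:", subtract_baseline(returns))
    print("TD target:", td_target(1.0, next_value=2.0, gamma=0.5, done=False))
```

With all-positive rewards, every raw return is positive, so every sampled action would be pushed up; after subtracting the baseline, below-average steps get a negative weight.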