
Reinforcement Learning

 

[Demo: ball-bouncing game - ATARI]

[Demo: racing game with DQN - ATARI]

[Demo: racing game - TORCS]

Ref:
    Hung-yi Lee (李宏毅), YouTube DRL lectures 1-3

On-policy vs. Off-policy

On-policy
    The agent being trained and the agent interacting with the environment are the same.
    Like 阿光 learning by playing Go himself.
Off-policy
    The agent being trained and the agent interacting with the environment are different.

    Like 佐助 playing Go while 阿光 watches from the side.
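
A minimal sketch of the distinction (numbers are hypothetical): off-policy methods reuse data collected by another (behavior) policy, typically reweighting each sample by an importance ratio, whereas on-policy methods train only on data collected by the current policy itself.

    # On-policy: the policy being updated is the same one that collected the data.
    # Off-policy: data comes from a behavior policy (theta'), and each sample is
    # reweighted by the importance ratio pi_theta(a|s) / pi_theta'(a|s).
    def importance_weight(p_target, p_behavior):
        return p_target / p_behavior

    # Hypothetical probabilities the two policies assign to the same sampled action.
    print(importance_weight(p_target=0.30, p_behavior=0.25))  # 1.2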



Add a baseline:
    It is possible that R is always positive,
    so subtract an expected value (a baseline) from R.
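
A minimal sketch of the idea, using the mean return as the baseline (numbers are illustrative):

    import numpy as np

    # All raw returns R are positive, so every sampled action would be pushed up.
    # Subtracting a baseline b (here the mean return) makes the weights signed.
    returns = np.array([10.0, 12.0, 8.0, 11.0])   # hypothetical episode returns
    baseline = returns.mean()                     # b ~= E[R]
    advantages = returns - baseline               # weights used in the policy gradient
    print(advantages)                             # [-0.25  1.75 -2.25  0.75]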


The policy in "Policy Gradient" is what outputs the action, e.g. left / right / fire.
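
For example, a sketch of a hypothetical linear policy that maps an observation to a probability distribution over left/right/fire and samples an action from it:

    import numpy as np

    ACTIONS = ["left", "right", "fire"]

    def policy(observation, W):
        logits = observation @ W               # hypothetical linear policy network
        probs = np.exp(logits - logits.max())  # softmax over the three actions
        return probs / probs.sum()

    rng = np.random.default_rng(0)
    obs, W = rng.normal(size=4), rng.normal(size=(4, 3))
    probs = policy(obs, W)
    print(probs, ACTIONS[rng.choice(3, p=probs)])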


Gamma-discounted rewards:
    Contributions further away in time are given lower weight.
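
A minimal sketch of computing the discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...:

    def discounted_returns(rewards, gamma=0.99):
        # Walk backwards so each step reuses the already-discounted future sum.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return list(reversed(returns))

    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]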

The reward function and the action space are defined prior to training.

MC vs. TD
MC (Monte Carlo): the critic is evaluated after the episode ends. Larger variance (because conditions differ a lot from episode to episode), but unbiased (judging only once the episode is over is fairer).
TD (temporal difference): the critic is updated during the episode. Smaller variance, but possibly biased.
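
A sketch of the two targets for V(s_t) (names are illustrative): MC waits for the full discounted return, TD bootstraps from the current value estimate after one step.

    def mc_target(rewards_from_t, gamma=0.99):
        # Monte Carlo: full discounted return, available only after the episode ends.
        return sum(gamma ** k * r for k, r in enumerate(rewards_from_t))

    def td_target(r_t, v_next, gamma=0.99):
        # Temporal difference: one real reward plus the bootstrapped estimate V(s_{t+1}).
        return r_t + gamma * v_next

    print(mc_target([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.9**2 * 2 = 2.62
    print(td_target(1.0, v_next=1.5, gamma=0.9))  # 1 + 0.9 * 1.5 = 2.35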



Atari: A3C => gym
TORCS: DDPG => gym-torcs
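
A minimal interaction loop using the classic gym API (CartPole-v1 as a stand-in; the Atari and TORCS environments expose the same reset/step interface):

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # random actions as a placeholder policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print(total_reward)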


PPO
   easy to code
   easy to tune
   sample efficient
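
A minimal sketch of PPO's clipped surrogate objective, one reason it is easy to tune: the probability ratio pi_theta / pi_theta_old is clipped so a single update cannot move too far from the policy that collected the data.

    import numpy as np

    def ppo_clip_objective(ratio, advantage, eps=0.2):
        unclipped = ratio * advantage
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
        return np.minimum(unclipped, clipped)       # maximize this surrogate

    ratios = np.array([0.5, 1.0, 1.5])              # hypothetical probability ratios
    print(ppo_clip_objective(ratios, advantage=1.0))  # [0.5 1.  1.2]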


Replay Buffer:
Put each experience (s_t, a_t, r_t, s_{t+1}) into a buffer,
because interacting with the environment costs more time than training:
playing the game usually takes longer than training the network.
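
A minimal sketch of such a buffer (class name and capacity are illustrative):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped first

        def push(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # Random minibatches let slow environment interaction be reused
            # across many cheap training updates.
            return random.sample(self.buffer, batch_size)

    buf = ReplayBuffer()
    buf.push(0, 1, 1.0, 2)
    buf.push(2, 0, 0.0, 3)
    print(buf.sample(2))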
