
Reinforcement Learning

 

 Paddle-ball game - ATARI

 



 

Racing game DQN - ATARI


Racing game - TORCS








Ref:
    Prof. Hung-yi Lee's YouTube lectures, DRL 1-3

On-policy vs. Off-policy

On-policy
    The agent being trained and the agent interacting with the environment are the same
    Hikaru learns by playing Go himself
Off-policy
    The agent being trained and the agent interacting with the environment are different
    Sasuke plays Go while Hikaru watches from the side



Add a baseline:
    The return R may always be positive, in which case every sampled action gets pushed up
    So subtract a baseline (e.g. the expected value of R) from R
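
A minimal sketch of the baseline idea, assuming the baseline is simply the mean return over a batch of episodes (one common choice; the notes above do not say which baseline is used). The function name and the sample returns are made up for illustration.

```python
from statistics import mean


def advantages_with_mean_baseline(returns):
    """Subtract the batch-mean return (a simple baseline) from each return."""
    baseline = mean(returns)
    return [r - baseline for r in returns]


# Three hypothetical episode returns, all positive.
episode_returns = [10.0, 12.0, 14.0]
advantages = advantages_with_mean_baseline(episode_returns)
print(advantages)  # [-2.0, 0.0, 2.0]
```

After subtracting the baseline, below-average episodes get a negative weight, so their actions are made less likely instead of merely being pushed up less.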


In "Policy Gradient", the policy is what outputs the action, such as left/right/fire


gamma-discounted rewards:
Rewards further away in time contribute less, so their weight is reduced
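
A small sketch of how the discounted return can be computed for every step of one episode, working backwards from the last reward. The function name and the example values (including gamma = 0.5) are assumptions chosen to keep the arithmetic easy to check.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```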

The reward function and the action space are defined prior to training

MC vs. TD
MC (Monte Carlo): the critic is updated after the episode ends. Larger variance (because conditions differ a lot from episode to episode), but unbiased (it judges only once the episode has ended, which is fairer)
TD (temporal-difference): the critic is updated during the episode. Smaller variance, but possibly biased
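
To make the contrast concrete, here is a rough sketch of the targets each approach would regress the critic towards. It assumes a one-step TD(0) target and made-up critic estimates; the notes above do not specify which TD variant is meant.

```python
def mc_targets(rewards, gamma):
    """Monte Carlo targets: the full discounted return, known only after the episode ends."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


def td_targets(rewards, next_values, gamma):
    """TD(0) targets: r_t + gamma * V(s_{t+1}), available at every step of the episode."""
    return [r + gamma * v_next for r, v_next in zip(rewards, next_values)]


rewards = [0.0, 0.0, 1.0]
# Hypothetical critic estimates for V(s_1), V(s_2) and the terminal state (0).
next_values = [0.2, 0.4, 0.0]

mc = mc_targets(rewards, gamma=0.5)
td = td_targets(rewards, next_values, gamma=0.5)
print(mc)  # [0.25, 0.5, 1.0]
print(td)
```

The MC targets depend on every reward until the end of the episode, which is where the variance comes from. The TD targets depend on the critic's own estimate of the next state, which is where the bias can come from.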



Atari : A3C  => gym
TORCS : DDPG => gym-torcs


PPO
   easy to code
   easy to tune
   sample efficient


Replay Buffer:
Put each experience into the buffer: (s_t, a_t, r_t, s_{t+1})
Interacting with the environment (playing the game) usually takes more time than training the network, so collected experience is stored and reused
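
A minimal replay buffer sketch using only the Python standard library. The class name, the fixed capacity and the uniform random sampling are assumptions for illustration, not the implementation used in the A3C or DDPG code mentioned above.

```python
import random
from collections import deque, namedtuple

# One stored experience: (s_t, a_t, r_t, s_{t+1}).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1)

batch = buffer.sample(2)
print(len(buffer))  # 3
```

Each stored transition can be sampled many times, so one slow step of game play can feed several training updates.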
