2023-Q1-AI 16. Reinforcement Learning, Policy Gradient

16.1. Video / Materials

 

Video: https://youtube.com/live/wgW5p9Hfm8A?feature=share

Jamboard: https://jamboard.google.com/d/1fD-P_7yR8R1cKwzUHaA1II2VkhOVxMTCa2XCrHnuUCg/edit?usp=sharing

Materials:

Policy Gradient: https://www.freecodecamp.org/news/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f/

PPO: https://arxiv.org/abs/1707.06347

 



 


 

Last year's materials

Video: https://youtu.be/t93-leAFnSY

Jamboard: https://jamboard.google.com/d/1KolS31GjEtTkd9rvZZFFu6Kzgj0jG0EOCMALzLnGFO8/edit?usp=sharing

 


 

16.2. Implement a Policy Gradient model

Template: http://share.yellowrobot.xyz/quick/2023-5-7-A9B14AF3-F09D-4169-992D-F282A07A366B.zip

Submit the source code and screenshots of your best results.


 

16.3. Implement an A2C model

Template: http://share.yellowrobot.xyz/quick/2023-5-7-5CCE3179-3499-4E0D-9894-4FF10A5538AF.zip

Submit the source code and screenshots of your best results.


16.4. Homework: Implement a PPO model

Template: http://share.yellowrobot.xyz/quick/2023-5-7-F6A7C309-67A8-453D-BD97-592083ACD5D8.zip

Equations: http://share.yellowrobot.xyz/upic/4e251be9772c476a3d7e15156369e99e_1683463127.jpg

Submit the source code and screenshots of your best results.

 


 

parser.add_argument('-gamma', default=0.8, type=float)

A larger discount factor gamma makes the agent value future rewards more, so the pole survives longer.
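To see what gamma does, it helps to compute discounted returns by hand. The sketch below is illustrative only and not part of the template; `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma):
    """Compute the discounted return G_t for every step, working backwards:
    G_t = r_t + gamma * G_{t+1}. Larger gamma weights future rewards more."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

# CartPole gives +1 per surviving step, so longer survival = larger G_0
episode_rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_returns(episode_rewards, 0.8))
```

With gamma = 0.8 the first step's return is 1 + 0.8 + 0.64 + 0.512 = 2.952; raising gamma toward 1.0 would push it toward the raw episode length.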

 

 


 

TRPO

 

PPO ⚠️ The PPO model does NOT use log p — only the plain ratio of probabilities (my original video mistakenly included the log).

https://github.com/nikhilbarhate99/PPO-PyTorch/blob/bd8b8bf6832dfcfb9125374fd61c0d359e621607/PPO.py

$$
\begin{aligned}
A(s_t, a_t) &= \delta_t - V(s_t) \\
L_{critic} &= \frac{1}{N} \sum \left( \delta_t - V(s_t) \right)^2 \\
d_{ratio} &= \frac{\pi(a_t|s_t) + \epsilon}{\pi_{old}(a_t|s_t) + \epsilon} \\
d_{clamp} &= \mathrm{clamp}\left( d_{ratio},\, 1 - \Delta,\, 1 + \Delta \right) \\
d_{final} &= \min\left( d_{ratio},\, d_{clamp} \right) \\
L_{actor} &= -\frac{1}{N} \sum A(s_t, a_t)\, d_{final} \\
L &= L_{actor} + L_{critic}
\end{aligned}
\tag{1}
$$
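The actor part of the loss above can be sketched in NumPy as follows. This is an illustration only, not code from the linked repo; the function name `ppo_actor_loss` and the toy probabilities are assumptions. Note it works on the plain probability ratio, with no log p.

```python
import numpy as np

def ppo_actor_loss(pi_new, pi_old, advantage, eps=1e-8, delta=0.2):
    """Clipped PPO actor loss as in equation (1): clamp the probability
    ratio, take the minimum, weight by the (detached) advantage."""
    d_ratio = (pi_new + eps) / (pi_old + eps)           # plain ratio, no log p
    d_clamp = np.clip(d_ratio, 1.0 - delta, 1.0 + delta)
    d_final = np.minimum(d_ratio, d_clamp)
    return -np.mean(advantage * d_final)                # minimize = maximize objective

# Toy check: a 2x probability increase gets clipped at 1 + delta = 1.2
loss = ppo_actor_loss(np.array([0.5, 0.1]),   # pi(a|s) under the new policy
                      np.array([0.25, 0.2]),  # pi_old(a|s)
                      np.array([1.0, 1.0]))   # advantages
```

In a real PyTorch implementation `pi_old` and the advantage would be detached from the graph so that only the new policy receives gradients.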


 


A2C

https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f


https://medium.com/deeplearningmadeeasy/advantage-actor-critic-a2c-implementation-944e98616b


$$
\delta_t =
\begin{cases}
r_{t+1}, & \text{if } s_{t+1} \text{ is terminal} \\
r_{t+1} + \gamma V(s_{t+1}), & \text{otherwise}
\end{cases}
\tag{2}
$$

$$
\text{transition} = \{ s_t, a_t, \delta_t \}
\tag{3}
$$
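Equations (2) and (3) can be sketched as two small helpers for collecting transitions during a rollout. The names `td_target` and `collect_transition` are illustrative, not from the template.

```python
def td_target(r_next, v_next, is_terminal, gamma=0.99):
    """delta_t from equation (2): the raw reward at a terminal step,
    otherwise the bootstrapped one-step target r_{t+1} + gamma * V(s_{t+1})."""
    if is_terminal:
        return r_next
    return r_next + gamma * v_next

def collect_transition(s_t, a_t, r_next, v_next, is_terminal, gamma=0.99):
    """transition = {s_t, a_t, delta_t} as in equation (3)."""
    return (s_t, a_t, td_target(r_next, v_next, is_terminal, gamma))
```

At a terminal state there is no V(s_{t+1}) to bootstrap from, which is why the two cases differ.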

 

 


 



$$
\begin{aligned}
A(s_t, a_t) &= \delta_t - V(s_t) \\
L_{actor} &= -\frac{1}{N} \sum A(s_t, a_t)\, \log \pi(a_t|s_t) \\
L_{critic} &= \frac{1}{N} \sum \left( \delta_t - V(s_t) \right)^2 \\
L &= L_{actor} + L_{critic}
\end{aligned}
\tag{4}
$$
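In code, equation (4) might look like the NumPy sketch below (names are illustrative; in a real PyTorch implementation the advantage would be detached so that critic gradients do not flow through the actor loss).

```python
import numpy as np

def a2c_loss(delta, v, pi_a, eps=1e-8):
    """Combined A2C loss from equation (4).
    delta: TD targets delta_t, v: critic values V(s_t), pi_a: pi(a_t|s_t)."""
    advantage = delta - v                            # A(s_t, a_t), treated as constant
    l_actor = -np.mean(advantage * np.log(pi_a + eps))
    l_critic = np.mean((delta - v) ** 2)
    return l_actor + l_critic
```

The eps inside the log guards against log(0) when the policy assigns near-zero probability to the sampled action.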

 


 

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
