-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtorch_a2c.py
42 lines (32 loc) · 1.39 KB
/
torch_a2c.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
"""
Pytorch stable baselines 3 A2C model.
This algorithm is the fastest I have tried so far. It can learn at a rate of
400-500 steps per second.
At 1,000,000 steps, it converges on a single word and assumes that's always right,
probably because it only sees a few positive rewards. There are 11M possible actions,
and to learn english words an algorithm would need to try them all several times,
so we need to do a few hundred million steps, with very slow learning.
Might need to split this into two problems. First is learning vaid english words.
Using random digits, it will take an average of 18,000 guesses to get a single reward.
And then probably some multiple of 11M experiments to start to understand which words
are valid. Once that training is done, the model could be used as the random sample, instead
of a truly random sample. Even this feels sort of like cheating.
"""
import gym
from stable_baselines3 import A2C
from gym.envs.registration import register
# Register our custom environment
register(
id="wordle-v0",
entry_point="wordle:WordleEnv",
)
env = gym.make("wordle-v0")
model = A2C("MlpPolicy", env, verbose=1, learning_rate=0.000001)
model.learn(total_timesteps=1000000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
env.render()
if done:
obs = env.reset()