Monday, October 09, 2017
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
"""
We’ve discovered that evolution strategies (ES), an optimization technique that’s been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo), while overcoming many of RL’s inconveniences.
In particular, ES is simpler to implement (there is no need for backpropagation), it is easier to scale in a distributed setting, it does not suffer in settings with sparse rewards, and has fewer hyperparameters.
[...]
Black-box optimization. In ES, we forget entirely that there is an agent, an environment, that there are neural networks involved, or that interactions take place over time, etc. The whole setup is that 1,000,000 numbers (which happen to describe the parameters of the policy network) go in, 1 number comes out (the total reward), and we want to find the best setting of the 1,000,000 numbers.
[...]
at each step we take a parameter vector w and generate a population of, say, 100 slightly different parameter vectors w1 ... w100 by jittering w with gaussian noise. We then evaluate each one of the 100 candidates independently by running the corresponding policy network in the environment for a while, and [...] The updated parameter vector then becomes the weighted sum of the 100 vectors, where each weight is proportional to the total reward [...] equivalent to estimating the gradient of the expected reward in the parameter space using finite differences, except we only do it along 100 random directions. Yet another way to see it is that we’re still doing RL (Policy Gradients, or REINFORCE specifically), where the agent’s actions are to emit entire parameter vectors using a gaussian policy.
[...]
Injecting noise in the parameters. Notice that the objective is identical to the one that RL optimizes: the expected reward. However, RL injects noise in the action space and uses backpropagation to compute the parameter updates, while ES injects noise directly in the parameter space. Another way to describe this is that RL is a “guess and check” on actions, while ES is a “guess and check” on parameters. Since we’re injecting noise in the parameters, it is possible to use deterministic policies (and we do, in our experiments). It is also possible to add noise in both actions and parameters to potentially combine the two approaches.
[...]
Tradeoffs between ES and RL
ES enjoys multiple advantages over RL algorithms (some of them are a little technical):
No need for backpropagation. ES only requires the forward pass of the policy and does not require backpropagation [...] 2-3 times faster [...] no need to worry about exploding gradients in RNNs. Lastly, we can explore a much larger function class of policies, including networks that are not differentiable
[...]
Highly parallelizable. ES only requires workers to communicate a few scalars between each other, while in RL it is necessary to synchronize entire parameter vectors [...] because we control the random seeds on each worker, so each worker can locally reconstruct the perturbations of the other workers. [...] linear speedups in our experiments as we added on the order of thousands of CPU cores to the optimization.
Higher robustness. Several hyperparameters that are difficult to set in RL implementations are side-stepped in ES.
[...]
Conversely, we also found some challenges to applying ES in practice. One core problem is that in order for ES to work, adding noise in parameters must lead to different outcomes to obtain some gradient signal. [...] further work on effectively parameterizing neural networks to have variable behaviors as a function of noise is necessary.
"""
https://blog.openai.com/evolution-strategies/
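The "highly parallelizable" point in the excerpt rests on a shared-seed trick: if every worker knows the random seed every other worker used, then only the scalar rewards need to be exchanged, and each worker can regenerate all perturbations locally and apply the identical update. A rough sketch of that idea, under assumed names (`perturbation`, `local_update`) rather than the actual distributed implementation:

```python
import numpy as np

def perturbation(seed, n_params):
    """Deterministically reconstruct the noise vector drawn under `seed`."""
    return np.random.RandomState(seed).randn(n_params)

def local_update(w, seeds, rewards, sigma=0.1, alpha=0.01):
    """Update every worker can compute identically from scalar rewards alone."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = sum(a * perturbation(s, w.size) for s, a in zip(seeds, adv))
    return w + alpha / (len(seeds) * sigma) * grad
```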
Labels: Oleg Zabluda