Reinforcement Learning & Self-Play

Meta Learning & Self Play

This post is a learning note on a paper about reinforcement learning and self play.

First of all, a joke.
Title: How does machine learning behave?
Q: Do you know the result of 11 * 12?
A: Yes. My answer is 233.
Q: No, the answer is 132.
A: Ok, my answer is 132.

The Reinforcement Learning Problem

The reinforcement learning framework just tells you that you have an agent in some environment, and you want to find a policy for this agent that will maximize its reward.

It’s a super general framework, because almost any problem you can think of can be described this way: there is an agent that takes some actions, and you want it to take the actions that lead to good rewards, to high rewards.
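The agent, environment, policy, and reward described above can be sketched as a small interaction loop. This is only an illustrative toy: `CoinFlipEnv`, `run_episode`, and the two fixed policies are hypothetical names invented for this note, and the `reset`/`step` interface is an assumed convention, not something taken from the paper.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # a single dummy observation

    def step(self, action):
        # action: 1 = guess heads, 0 = guess tails
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return 0, reward, done


def run_episode(env, policy):
    """Roll out one episode and return the total reward the agent collected."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)                  # the policy maps observations to actions
        obs, reward, done = env.step(action)  # the environment answers with a reward
        total += reward
    return total


def average_return(policy, episodes=200):
    env = CoinFlipEnv(seed=0)
    return sum(run_episode(env, policy) for _ in range(episodes)) / episodes


def always_heads(obs):
    return 1


def always_tails(obs):
    return 0
```

Comparing `average_return(always_heads)` with `average_return(always_tails)` shows what "finding a policy that maximizes reward" means here: with a coin biased towards heads, the first policy collects more reward.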

Now, the reason reinforcement learning is interesting is that there exist reasonably good reinforcement learning algorithms. Or rather than reasonably good, I should say interesting reinforcement learning algorithms that can sometimes solve problems. In the formulation, the environment gives the agent the observations and the rewards, but in the real world, the agent needs to figure out its own rewards from the observations.

Humans and animals are not told by the world whether something is good or bad; it’s on us to figure it out for ourselves.

Agent = neural network

This is how it looks, at least for now: the observation comes in, and a little neural network, or hopefully a big one, does some processing and produces an action.
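To make "observation in, action out" concrete, here is a minimal sketch of a policy network written with the Python standard library only. The class name `TinyPolicyNetwork`, the layer sizes, and the choice of one tanh hidden layer followed by a softmax are all assumptions made for illustration; the source does not specify an architecture.

```python
import math
import random


class TinyPolicyNetwork:
    """Hypothetical two-layer network: observation -> action probabilities."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.w1 = [[self.rng.gauss(0, 0.5) for _ in range(obs_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [[self.rng.gauss(0, 0.5) for _ in range(hidden_dim)]
                   for _ in range(n_actions)]

    def action_probs(self, obs):
        # Hidden layer: weighted sums of the observation passed through tanh
        hidden = [math.tanh(sum(w * x for w, x in zip(row, obs)))
                  for row in self.w1]
        # Output layer: one logit per action, turned into probabilities by softmax
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, obs):
        # Sample an action instead of always taking the most likely one
        probs = self.action_probs(obs)
        return self.rng.choices(range(len(probs)), weights=probs)[0]
```

For example, `TinyPolicyNetwork(obs_dim=4, hidden_dim=8, n_actions=3).act([0.1, -0.2, 0.0, 0.5])` returns an action index in `{0, 1, 2}`. The weights here are random and untrained; training them is what the reinforcement learning algorithm is for.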

In this part I’ll explain the way in which the vast majority of reinforcement learning algorithms work.

  • Add randomness to your actions
  • If the result was better than expected, do more of the same in the future

So these two bullet points say: try something random, and if you do better than expected, do it again.

There is some math around it, but that’s basically the core of it, and everything else is slightly clever ways of making better use of this randomness.
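The two bullet points above can be sketched on a multi-armed bandit. The code below is a gradient bandit with a running-average baseline, chosen here as one simple instance of "add randomness, then do more of what beat expectations"; it is not claimed to be the specific algorithm the paper has in mind. The function name `train_bandit`, the arm means, and the hyperparameters are assumptions for illustration.

```python
import math
import random


def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(true_means, steps=3000, lr=0.1, seed=0):
    """Return the learned action probabilities after `steps` noisy pulls."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)  # policy parameters: one preference per action
    baseline = 0.0                   # running estimate of the expected reward
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # 1) Add randomness: sample an action rather than taking the current best
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = true_means[action] + rng.gauss(0, 1.0)
        # 2) Better than expected? Then make this action more likely in the future
        advantage = reward - baseline
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * advantage * (indicator - probs[a])
        baseline += (reward - baseline) / t
    return softmax(prefs)
```

Calling `train_bandit([0.0, 1.0, 3.0])` returns probabilities that concentrate on the last arm, the one with the highest average reward. The baseline plays the role of "expected": only rewards above it push the sampled action's probability up.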

The reinforcement learning algorithms that we have now can solve some problems, but there are also a lot of things they cannot solve.

If you had a super good reinforcement learning algorithm, you could build a system that achieves super complicated goals really quickly, and basically the technical portion of the field of AI would be complete. A really good algorithm would combine the whole spectrum of ideas from machine learning: reasoning, inference at test time, and training at test time. All of those ideas would be put together in the right way to create a system that figures out how the world works and then achieves its goals in this world, and does so very quickly.

But the algorithms we have today are still nowhere near the level of what they can be, and will be, in the future.

Hindsight Experience Replay

So now let’s discuss ways in which we can improve reinforcement learning algorithms, and I’ll describe one very simple improvement.

The improvement boils down to a really simple idea. As discussed earlier, the way reinforcement learning algorithms work is that you try something random, and if you succeed, that is, if you do better than expected, you do it again.

But what happens if you try lots of random things and nothing works? This is the case when exploration is hard: when your rewards are infrequent, you get a lot of failures and not many successes. So the question is whether we can somehow find a way to learn from failure.

The idea, very briefly, is the following. You try to do one thing, you aim to achieve one thing, but you’ll probably fail unless you’re really good. So you will achieve something else.

So why not use the failure to achieve the one thing as training data for achieving the other thing?

  • Setup: build a system that can reach any state
  • Goal: reach state A
  • Any trajectory ends up in some other state B
  • Use this as training data to reach state B

It’s really intuitive and it works.
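The relabeling step in the list above can be sketched in a few lines. This is a simplified illustration on a random walk over the integers, relabeling every transition with the final state the trajectory actually reached; the helper names (`rollout`, `label`, `hindsight_relabel`) and the sparse 0/1 reward are assumptions made for this note, and a full implementation would also need a goal-conditioned learner and a replay buffer.

```python
import random


def rollout(start, steps, rng):
    """Random walk on the integer line: list of (state, action, next_state)."""
    state, trajectory = start, []
    for _ in range(steps):
        action = rng.choice([-1, 1])
        next_state = state + action
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


def sparse_reward(next_state, goal):
    # Reward only when the goal is reached exactly
    return 1.0 if next_state == goal else 0.0


def label(trajectory, goal):
    """Attach a goal and the corresponding sparse reward to every transition."""
    return [(s, a, goal, s2, sparse_reward(s2, goal)) for s, a, s2 in trajectory]


def hindsight_relabel(trajectory):
    """Pretend the state we ended up in (state B) was the goal all along."""
    achieved = trajectory[-1][2]
    return label(trajectory, achieved)
```

A five-step walk aimed at goal state 100 earns zero reward on every transition, so labeled with the original goal it carries no learning signal. After `hindsight_relabel`, the same walk ends with a reward of 1.0 for the goal it really reached, which is exactly the "learn from failure" trick.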

Learning a Hierarchy of Actions With Meta Learning

This is a simple approach for learning a hierarchy of actions. One of the things that would be nice to do in reinforcement learning is to learn a hierarchy of actions of some kind.

But it has never been truly successful, and I don’t want to claim that this is a success either. It is more of a demonstration of how you could approach the problem of learning a hierarchy. If you had a distribution over tasks, then basically what you want is to train low-level controllers such that they make it possible to solve the tasks quickly.

So you optimize the low-level actions such that they make it possible to solve tasks from your distribution of tasks quickly.

Evolved Policy Gradients

It would be kind of cool if we could evolve a cost function that makes it possible to solve reinforcement learning problems quickly. As is usual in these kinds of situations, you have a distribution over tasks, and you literally evolve the cost function. The fitness of a cost function is the speed with which it lets you solve problems from that distribution of problems.

Goal: learn a cost function that leads to rapid learning.
  • Train a cost function such that reinforcement learning on this function learns very quickly.
  • Ingredients: a distribution over tasks
  • Use evolution strategies to learn the cost function

So the learned cost function allows for extremely rapid learning, but it also contains a lot of information about the distribution of tasks.

In this case, the result is not magic, because you need your training task distribution to be equal to your test task distribution.
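As a heavily simplified sketch of the idea in this section, the toy below evolves a one-parameter cost function with a basic evolution strategy. The inner learner is plain gradient descent on a single number, a task is "reach a target value", and the learned cost is `phi * (x - target) ** 2`. All of this (the names `inner_loop`, `fitness`, `evolve_loss`, the task distribution, and the hyperparameters) is assumed for illustration and is far simpler than a learned loss over agent experience.

```python
import random


def inner_loop(phi, target, steps=3, lr=0.1):
    """Train x for a few steps on the learned cost phi * (x - target)^2.

    Returns the true return after training (higher is better)."""
    x = 0.0
    for _ in range(steps):
        grad = 2.0 * phi * (x - target)  # derivative of the learned cost w.r.t. x
        x -= lr * grad
    return -(x - target) ** 2


def fitness(phi, targets):
    """How well the inner learner does, on average, after very little training."""
    return sum(inner_loop(phi, t) for t in targets) / len(targets)


def evolve_loss(generations=200, population=20, sigma=0.1, es_lr=1.0, seed=0):
    """Evolution strategy over the cost-function parameter phi."""
    rng = random.Random(seed)
    phi = 1.0
    for _ in range(generations):
        # Sample a batch of tasks from the (assumed) task distribution
        targets = [rng.uniform(-1.0, 1.0) for _ in range(10)]
        noises = [rng.gauss(0.0, 1.0) for _ in range(population)]
        scores = [fitness(phi + sigma * n, targets) for n in noises]
        mean_score = sum(scores) / population
        # Move phi towards perturbations that scored above average
        step = sum((s - mean_score) * n for s, n in zip(scores, noises))
        phi += es_lr * step / (population * sigma)
    return phi
```

With only three inner gradient steps available, the strategy pushes `phi` up from its starting value of 1.0, because a steeper cost lets the inner learner get closer to the target in the same number of steps. That is the sense in which a cost function is selected for rapid learning; the evolved value is tuned to this particular task distribution and step budget, which echoes the caveat above about training and test distributions needing to match.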

Self Play

Self play is something which is really interesting. It’s an old idea that has existed for many years, going back to the 1960s.

The first really cool result in self play is from 1992 by Tesauro, who used a cluster of 386 computers to train a neural network with Q-learning to play backgammon through self play. The neural network learned to defeat the world champion, and it discovered strategies that backgammon experts weren’t aware of; they later agreed that those strategies were superior.

Appealing properties of Self Play

  • Simple environment

Self play has the property that you can have very simple environments. If you run self play in a simple environment, you can potentially get behaviors of unbounded complexity.

  • Convert compute into data

Self play gives you a way of converting compute into data, which is great because data is really hard to get but compute is easier to get.

  • Perfect curriculum

Another very nice thing about self play is that it has a natural, perfect curriculum: if you are good, then your opponent is good too, so the task is still difficult. You always win about 50% of the time. So it does not matter how good or how bad you are; the task is always challenging at the right level. This means you have a very smooth path from agents that don’t do much to agents that potentially do a lot of things.
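The "always about 50%" claim can be illustrated with a small calculation. The sketch assumes a logistic, Elo-style model in which the win probability depends only on the skill difference; that model and the names `win_probability` and `curriculum` are assumptions for this note, not something stated in the source.

```python
import math


def win_probability(my_skill, opponent_skill):
    """Assumed logistic (Elo-style) model of who wins a game."""
    return 1.0 / (1.0 + math.exp(opponent_skill - my_skill))


def curriculum(skills, fixed_opponent_skill):
    """For each skill level, compare a fixed opponent with a copy of yourself."""
    rows = []
    for skill in skills:
        vs_fixed = win_probability(skill, fixed_opponent_skill)
        vs_self = win_probability(skill, skill)
        rows.append((skill, vs_fixed, vs_self))
    return rows
```

Against a fixed opponent, a weak agent almost never wins and a strong agent almost always wins, so in both cases the outcome carries little information. Against a copy of itself, the win probability stays at 0.5 at every skill level, which is the sense in which self play keeps the difficulty matched to the agent.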

AI Alignment: Learning from human feedback

The question we are trying to address here is really simple. As we train progressively more powerful AI systems, it will be important to communicate to them goals of greater subtlety and intricacy. How can we do that?

In this work, we investigate one approach, which is having humans judge the behavior of an algorithm, and doing so really efficiently.

The way it works is that human judges provide feedback to the system. All of those bits of feedback are folded into a model of the reward using a triplet loss. The model tries to come up with a single reward function that respects all the human feedback that was given to it.
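A minimal sketch of fitting a single reward function to many small pieces of human feedback is given below. Note the substitution: instead of the triplet loss mentioned above, it uses a pairwise logistic (Bradley-Terry style) loss on comparisons, because that is simple to write down. The linear reward model, the simulated "human" with hidden preferences, and all function names are assumptions for illustration.

```python
import math
import random


def reward(weights, features):
    """Linear reward model: a weighted sum of behavior features."""
    return sum(w * f for w, f in zip(weights, features))


def simulate_human_feedback(hidden_weights, n_pairs, seed=0):
    """Stand-in for human judges: prefers the behavior with higher hidden reward."""
    rng = random.Random(seed)
    dim = len(hidden_weights)
    comparisons = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if reward(hidden_weights, a) >= reward(hidden_weights, b):
            comparisons.append((a, b))  # (preferred, rejected)
        else:
            comparisons.append((b, a))
    return comparisons


def fit_reward_model(comparisons, dim, epochs=200, lr=0.1):
    """Gradient ascent on the log-likelihood of the observed preferences."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in comparisons:
            margin = reward(weights, better) - reward(weights, worse)
            # Probability the model assigns to the human's choice
            p_agree = 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * (1.0 - p_agree) * (better[i] - worse[i])
    return weights
```

For example, with `comparisons = simulate_human_feedback([2.0, -1.0], n_pairs=100)`, calling `fit_reward_model(comparisons, dim=2)` yields weights with the same signs as the hidden ones, and the fitted reward ranks most of the compared pairs the same way the simulated judge did. Each comparison is only one bit of feedback, yet together they pin down one reward function.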