 This post is a part of our Introduction to Machine Learning course.

Till now in our Introduction to Machine Learning course, we have seen that a supervised machine learning algorithm takes in Features of a dataset (e.g., number of hours a student studies) and Labels (e.g., their score on an exam) as inputs and estimates a Function (model) which minimizes the difference between the expected and predicted results. Now think about how a kid learns to ride a bicycle. No one gives him/her datasets to analyze. They learn by trying different things and then observing the effects of their actions. If the kind of actions they perform lead to balancing the bike they get to enjoy the ride or else they end-up falling and hurting themself. Similarly, can we also teach computers to learn from their mistakes? And the answer lies with Reinforcement Learning. Reinforcement learning is the closest one could get to the natural form of learning. It is a paradigm of Machine Learning and Artificial Intelligence which deals with making a set of sequential decisions under uncertainty.

#### What is Reinforcement Learning?

Let’s say we are given a robot inside a room and our goal is to reach to the green block which gives us +1 point. There is also a red block and if we end up there we get -1 point. After reaching to both of these blocks(states) our simulation comes to an end. There is one challenge, because the way mechanics of this world and robot are setup whenever the robot tries to move forward, in only 80% of the cases it actually moves forward. In the other 10% of cases, it goes either goes to the right and in remaining 10% it tries to go the left but since there is no place to go it will stay at the same spot. In real world where all our actions do not always end up with intended consequences (i.e. the world is stochastic) conventional planning algorithms are useless. Given this challenge, what is the best sequence of actions that we can take so as to ultimately reach to our goal?

Reinforcement Learning tries to find answer to this question.

The goal of any reinforcement learning algorithm is to find a set of actions that will result in maximum rewards (now or in near future).

And what’s the solution? Well, the only hope in this case to take an action and then observe its effects i.e. We not only need data from our sensors but also access to the physical/simulated environment. Also note that in most blocks (states) in this room except for the terminating states we do not get any feedback(reward) from the environment.

### Reinforcement Learning Problem Formulation #### Let’s Make sure you understand some RL Lingo

States is your current observation of the world. In the previous example, the positions/blocks in the room were the states of our environment.

Actions the set of actions your RL agent can perform in the environment. In the previous example, the actions that our robot could perform were moving Up, Down, Left and Right.

Rewards are the feedback that you might get from the environment. In the previous example our robot gets a reward of +1 for being in the green block and -1 for being the red block.

Let’s see if you get it. What do you think? What are the states, actions and rewards for the RL problem of playing the game of Pac-Man.

States: What could be an efficient way to represent a state in this game? We can ask ourselves what does the agent (Pac-Man) need to know in order to make a good move? It needs to know its own position, positions of all the ghosts and position of all the food. Right? So, a good representation of the state might be a tuple containing values of

{agent_loc:33; ghost_locations:[12,22,32], food_locations: [10,20,30]}

Actions: Our agent can move only in four directions in this 2D grid so our actions are {Up,Down,Left,Right}

Rewards: Our end goal is to obtain maximum possible score in the game. So, after performing an action, the increase in score can be a good reward signal for our agent. Or, if your intention is to survive in the game as long as possible, time spent in the game can be a good reward signal for the agent.

#### Markov Decision Process (MDP):

To formalize the Reinforcement Learning problem at hand we use Markov Decision Process (MDP). MDPs give us a good way to create a mathematical model of the RL problem and adhere to the Markovian property that the results of our actions only depends on the present state and everything that has happened before this is irrelevant. This might seem a bit limiting at first since intuitively we know that our set of actions/states are correlated and not independent of each other. But as always, we will work our way around this assumption. `S` denotes all the states.

`A` denotes all the actions.

`P(s’|s,a)` denotes the transition probabilities. It indicates the probability of reaching state s’ from state s given that you take the action a.

`P(r|s,a)` denotes the reward probabilities. It indicates the chance of getting reward R given that you take an action on state s.

`s_0` denotes the initial state of the world.

The goal of any reinforcement learning algorithm is to come up with a policy which maps all the states to actions. There could be many possible policies. But, we need to select an optimal policy which will give us the maximum value. Now the question is, “How do we get this maximum value?”

Well, we know that we want to get maximum rewards using a policy. Right?

So, if we sum all our rewards from t = 0 to infinity for a given policy, we should get the value of that policy. And because these rewards are stochastic, we always don’t get these rewards. So, the value of policy is actually the expectation over the sum of rewards. But wait, the sum of all the rewards from t = 0 to infinity will not be bounded. We will run out of memory if we try to compute this value. So let’s introduce a new factor, gamma (Ɣ) into our equation. This gamma (Ɣ) is called as the discount factor, which will between 0 to 1. Gamma (Ɣ) indicates the importance of a reward in our equation. Initially, the value of gamma (Ɣ) is set to 1. And then, the value of gamma (Ɣ) gradually decreases with each iteration. This is a way of saying that our future rewards are less valuable than our current reward. So if someone gives you a \$100 check right now versus \$100 one year latter, the one that you get right now is more valuable, right?

As the value of gamma (Ɣ)decreases in each step, the sum of discounted rewards will eventually converge (remember geometric series?).

Thus we have formalized the problem at hand and we will see latter posts how we go about solving this. Please let us know in the comments if this was helpful.

If this article was helpful to you, check out our Introduction to Machine Learning Course for a complete guide to Machine Learning.