Q-Learning is a model-free reinforcement learning technique. It can estimate the utility of each action without a model of the environment. It does so by keeping a value for every state-action pair, based on the reward expected from taking that action. During this process the agent learns to move around the environment, and the optimal policy emerges from taking, in each state, the action with the highest expected reward. Let us look at an example of this technique.

# Environment

Imagine we are moving from one floor to another in the elevator of an apartment building. In this situation, our agent is rewarded if he starts from a random floor and eventually finds a way to reach the terrace. He can move in either direction by taking random actions. Let us look at the available states, actions, rewards and the goal.

- States – 0, 1, 2, 3, 4, 5 (Terrace)
- Actions – 0, 1, 2, 3, 4, 5 (Terrace)
- Reward – 0, 100
- Goal – 5 (Terrace)

Let us visualise these on a Q-Table.

On this table, the rows represent the states (the floors) and the columns represent the actions that can be taken at a particular floor. A value of 0 means that the action can be taken but has no reward, whereas -1 means the action is not available for that state.
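A table with these conventions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an action is identified by the floor it leads to, only adjacent floors are reachable, and remaining on the terrace counts as an available action worth 100 (the walkthrough later refers to Q(5,5)). The names `build_reward_table`, `N_FLOORS`, `TERRACE` and `GOAL_REWARD` are hypothetical.

```python
N_FLOORS = 6        # states 0..5
TERRACE = 5         # goal state
GOAL_REWARD = 100   # reward for an action that ends on the terrace


def build_reward_table():
    """Rows are current floors, columns are target floors.

    -1 marks an unavailable action, 0 an allowed move with no reward,
    and GOAL_REWARD an allowed move that ends on the terrace.
    """
    table = [[-1] * N_FLOORS for _ in range(N_FLOORS)]
    for floor in range(N_FLOORS):
        for target in (floor - 1, floor + 1):      # adjacent floors only
            if 0 <= target < N_FLOORS:
                table[floor][target] = GOAL_REWARD if target == TERRACE else 0
    # Assumption: staying on the terrace is allowed and rewarded.
    table[TERRACE][TERRACE] = GOAL_REWARD
    return table


reward_table = build_reward_table()
for row in reward_table:
    print(row)
```

Each printed row lists, for one floor, which target floors are reachable and what they pay.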

We know how an elevator works: pressing the highest number on the keypad takes us to the terrace. But let us assume that the agent does not know how to read the buttons on the elevator panel and has to figure out a way to reach the terrace by trial and error.

The basic principle here is that the agent cannot skip a floor while moving; if he does, he is punished. If the agent moves to an adjacent floor, he is not rewarded unless that floor is the terrace. Within these limits, he is free to move.

What the agent will have is the Q matrix, which holds a value for every state-action pair. From it, the agent can read off the optimal policy, the one that gets the maximum expected reward.

# How Does It Work?

The Q-Learning table is built with the following steps:

- The Q matrix is initialized with zeros (the agent has just entered the environment)
- A random action is performed to move to another floor (current state -> next state)
- Every episode starts at some floor and ends at the goal, that is, the terrace
- While the state is not the final state (terrace):
  - Perform a random action
  - Based on the current action, go to the next state
  - Find the maximum Q value among the actions available in the next state, and use it to update the Q value of the action just taken

We continue this process, moving from the current state to the next state. The next state then becomes our current state, and the process continues until we reach the goal.

This is our basic algorithm, which works on the principle of rewarding the agent for the right move and punishing him for the wrong one.
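The loop described above might look like the following minimal sketch. It rests on several assumptions: an action is the floor moved to, unavailable actions (marked -1) are never chosen, every episode starts on a random floor below the terrace and stops on reaching it, the update has no separate learning rate, and the discount factor is the 0.8 used later in the walkthrough. The helper names (`available_actions`, `run_episode`, `train`) are hypothetical.

```python
import random

GAMMA = 0.8    # discount factor
TERRACE = 5    # goal state

# Reward table: rows are floors, columns are target floors (-1 = unavailable).
R = [
    [-1,  0, -1, -1, -1,  -1],
    [ 0, -1,  0, -1, -1,  -1],
    [-1,  0, -1,  0, -1,  -1],
    [-1, -1,  0, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1, -1,  0, 100],
]


def available_actions(state):
    """Target floors that can be reached from `state`."""
    return [a for a, reward in enumerate(R[state]) if reward != -1]


def run_episode(q, start, rng):
    """Take random actions from `start` until the terrace is reached."""
    state = start
    while state != TERRACE:
        action = rng.choice(available_actions(state))
        next_state = action  # the action is the floor we move to
        best_next = max(q[next_state][a] for a in available_actions(next_state))
        q[state][action] = R[state][action] + GAMMA * best_next
        state = next_state


def train(episodes=1000, seed=0):
    """Run `episodes` episodes from random floors below the terrace."""
    rng = random.Random(seed)
    q = [[0.0] * 6 for _ in range(6)]
    for _ in range(episodes):
        run_episode(q, rng.randrange(TERRACE), rng)
    return q


q_table = train()
for row in q_table:
    print([round(value, 2) for value in row])
```

Because episodes stop at the terrace in this sketch, the row for state 5 is never updated; a variant that also starts episodes on the terrace would fill it in and produce larger values overall.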

# Implementation of Q-Learning

Let us consider the Q Table initialized with zeros to start with.

Here is the reward table, which determines how the agent is rewarded for the actions he takes.

# Episode No.1

Let us say the agent starts from the fourth floor (4). The available actions are moving down to the third floor (3) or moving up to the terrace (5).

Now let us say we randomly choose to move to state 5. From there, we have two possibilities: we can move down to the fourth floor, or we can remain at the goal, the terrace. We need to select the biggest Q value among the available ones, which are Q(5,4) and Q(5,5). We use the max function so that the agent favours moving towards the terrace. But at this point the Q table is still filled with zeros, so the maximum is 0.

The new state is now 5. Having reached the terrace, the agent is given a reward of 100. After this computation, our Q table has a single non-zero entry, Q(4,5) = 100.
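Written out, the arithmetic behind this first update is roughly as follows. The symbols are assumptions for illustration: $R(s,a)$ is the entry of the reward table, $s'$ is the state reached by action $a$, and $\gamma = 0.8$ is the discount factor used in the next episode. The rule shown is the simplified one with no separate learning rate.

$$
Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s',a')
$$

$$
Q(4,5) = R(4,5) + 0.8 \cdot \max\{Q(5,4),\ Q(5,5)\} = 100 + 0.8 \cdot 0 = 100
$$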

# Episode No.2

Now the agent randomly chooses another state, let’s say the third floor, which is state 3. The possible actions are moving down to the second floor or moving up to the fourth floor. Let us look at the possible action-reward pairs from the table below.

We now have two possible actions, with values Q(3,2) and Q(3,4). Suppose the agent randomly chooses to go to the fourth floor. We can update the Q table with the help of the Markov Decision Process: the agent receives the immediate reward plus the best Q value of the next state multiplied by a discount factor gamma (0.80). For Q(3,4) we get 0 + 0.80 × 100 = 80. After two episodes, the updated Q table has two non-zero entries: Q(4,5) = 100 and Q(3,4) = 80.
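The two updates so far can be replayed by hand in a short sketch. It assumes the simplified update with no learning rate, an action identified by the floor it leads to, and a reward table laid out as described earlier (with staying on the terrace treated as an available action). The function name `q_update` is hypothetical.

```python
GAMMA = 0.8  # discount factor from the walkthrough

# Reward table: rows are floors, columns are target floors (-1 = unavailable).
R = [
    [-1,  0, -1, -1, -1,  -1],
    [ 0, -1,  0, -1, -1,  -1],
    [-1,  0, -1,  0, -1,  -1],
    [-1, -1,  0, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1, -1,  0, 100],
]


def q_update(q, state, action):
    """Apply one update for moving from `state` to floor `action`."""
    next_state = action
    allowed = [a for a, reward in enumerate(R[next_state]) if reward != -1]
    best_next = max(q[next_state][a] for a in allowed)
    q[state][action] = R[state][action] + GAMMA * best_next
    return q[state][action]


q = [[0.0] * 6 for _ in range(6)]   # Q table starts as all zeros
episode_1 = q_update(q, 4, 5)        # fourth floor -> terrace
episode_2 = q_update(q, 3, 4)        # third floor -> fourth floor
print(episode_1, episode_2)
```

The order matters: Q(3,4) only becomes non-zero because Q(4,5) was filled in first, which is how the reward at the terrace gradually propagates down to the lower floors.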

# Episode No.1,000

Once we have run this process for a thousand episodes, we will reach an optimal Q table, with which the agent knows how to reach the terrace in the optimal number of moves. This is an example of Q-Learning, and here is a visual of how the Q table looks at 1,000 episodes.

Eventually the agent will learn how to move from any floor to the top floor using this optimal policy.
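Reading a policy out of a learned table amounts to picking, on every floor, the action with the largest Q value. The sketch below illustrates this with a stand-in table rather than the author's actual results: its values are built by repeatedly applying the 0.8 discount downward from Q(4,5) = 100, which is what the simplified update would give if episodes stop at the terrace. Both `illustrative_q_table` and `greedy_path` are hypothetical names.

```python
GAMMA = 0.8
TERRACE = 5


def illustrative_q_table():
    """Stand-in Q table: values discounted by GAMMA per floor below the terrace."""
    q = [[0.0] * 6 for _ in range(6)]
    for floor in range(TERRACE):                 # upward moves
        q[floor][floor + 1] = 100 * GAMMA ** (TERRACE - 1 - floor)
    for floor in range(1, TERRACE):              # downward moves
        q[floor][floor - 1] = GAMMA * q[floor - 1][floor]
    return q


def greedy_path(q, start, goal=TERRACE, max_steps=20):
    """Follow the highest-valued action from `start` until `goal`."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        state = max(range(len(q[state])), key=lambda a: q[state][a])
        path.append(state)
    return path


q = illustrative_q_table()
for floor in range(6):
    print(floor, greedy_path(q, floor))
```

The `max_steps` guard is only a safety net against tables that are not yet trained, where the greedy choice could loop between two floors.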

Here is a visual description of how the environment works through every iteration of every episode.

# Conclusion

We saw how an agent eventually learnt to move towards the terrace from any random state in the optimal number of moves. This is a simple example of the Q-Learning technique, using a Markov Decision Process. The random actions can be phased out as we move forward through the episodes by lowering the probability of their occurrence. Random actions are allowed, or forced on the agent, so that he explores the environment completely, that is, sees the unseen areas. This way the agent will master the environment after 1,000 episodes.
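One common way to lower the probability of random actions over time is an epsilon-greedy rule with a decaying epsilon. The sketch below is an assumption about how that could be done, not something the walkthrough above implements: `epsilon_for` uses a linear schedule from 1.0 to 0.0, and `choose_action` picks a random allowed action with probability epsilon and the best-known one otherwise. Both names are hypothetical.

```python
import random


def epsilon_for(episode, total_episodes, start=1.0, end=0.0):
    """Linearly decay the exploration probability over the episodes."""
    fraction = min(episode / max(total_episodes - 1, 1), 1.0)
    return start + (end - start) * fraction


def choose_action(q_row, allowed, epsilon, rng):
    """Explore with probability `epsilon`, otherwise take the best allowed action."""
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_row[a])


rng = random.Random(0)
q_row_floor_4 = [0.0, 0.0, 0.0, 64.0, 0.0, 100.0]   # example values for floor 4
for episode in (0, 500, 999):
    eps = epsilon_for(episode, 1000)
    action = choose_action(q_row_floor_4, [3, 5], eps, rng)
    print(episode, round(eps, 3), action)
```

Early episodes are almost fully random, which lets the agent see the whole building, while late episodes mostly follow the values already learned.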