Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations.
For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy. "Q" refers to the function the algorithm computes: the expected reward for an action taken in a given state.
Reinforcement learning involves an agent, a set of states S, and a set A of actions per state. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state.
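This weighted sum is conventionally written as a discounted return. As a brief sketch (the symbols γ for the discount factor and r for the per-step reward are standard notation, not defined elsewhere in this article):

$$G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}, \qquad 0 \le \gamma \le 1$$

The discount factor γ makes rewards that arrive sooner count more than rewards that arrive later; with γ = 0 the agent is entirely short-sighted.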
As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train doors as soon as they open, minimizing your initial wait time. If the train is crowded, however, you will have a slow entry after that initial action, because people fight past you to leave the train as you attempt to board. The total boarding time, or cost, is then:
0 seconds of wait time + 15 seconds of fight time = 15 seconds
On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is then spent fighting other passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:
5 seconds of wait time + 0 seconds of fight time = 5 seconds
Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than the forceful strategy, the overall cost is lower, revealing a more rewarding strategy.
Q-Learning Algorithm
Q-learning is a model-free, value-based learning algorithm. Value-based algorithms update a value function using an update equation (in particular, the Bellman equation). The other type, policy-based algorithms, instead estimates the value function with a greedy policy obtained from the last policy improvement.
Q-learning is an off-policy learner, meaning it learns the value of the optimal policy independently of the agent's actions. An on-policy learner, on the other hand, learns the value of the policy being carried out by the agent, including the exploration steps, and it will find a policy that is optimal taking into account the exploration inherent in that policy.
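To make the distinction concrete, compare the update target of Q-learning with that of SARSA, its standard on-policy counterpart (using the conventional symbols r for the reward, γ for the discount factor, and s', a' for the next state and action):

$$\text{Q-learning: } r + \gamma \max_{a'} Q(s', a') \qquad\qquad \text{SARSA: } r + \gamma\, Q(s', a')$$

Q-learning bootstraps from the best action available in the next state regardless of what the agent actually does next, whereas SARSA bootstraps from the action a' that the agent's current (possibly exploratory) policy actually takes.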
The Q-table is the data structure used to store the maximum expected future reward for each action at each state. In effect, this table guides us to the best action at each state. The Q-learning algorithm is used to learn each value of the Q-table.
Step 1: initialize the Q-Table
We will first build a Q-table with n columns, where n is the number of actions, and m rows, where m is the number of states. We initialize all values to 0.
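A minimal sketch of this step in Python, assuming a small discrete environment (the state and action counts below are illustrative placeholders, not values from the text):

```python
import numpy as np

n_states = 16   # m rows: one per state (illustrative value)
n_actions = 4   # n columns: one per action (illustrative value)

# Step 1: every Q-value starts at 0.
Q = np.zeros((n_states, n_actions))
```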
Steps 2 and 3: choose and perform an action
This combination of steps runs for an undefined amount of time: it repeats until we stop the training, or until the training loop terminates as defined in the code.
We will choose an action a in the state s based on the Q-table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0, so the agent must sometimes act randomly in order to explore.
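A common way to balance this exploration against exploiting the table, shown here as a hedged sketch rather than the article's prescribed method, is epsilon-greedy selection:

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, state, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon,
    otherwise exploit the best-known action for this state."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # random action (explore)
    return int(np.argmax(Q[state]))           # greedy action (exploit)
```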
Steps 4 and 5: evaluate
Now that we have taken an action and observed an outcome and a reward, we need to update the function Q(s, a).
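The update follows the Bellman-style Q-learning rule. A minimal sketch, assuming a learning rate alpha and a discount factor gamma (both illustrative values):

```python
def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Steps 4 and 5: move Q(s, a) toward the observed reward plus the
    discounted best Q-value of the next state (the TD target)."""
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])
```

Repeating steps 2 through 5 over many episodes lets the Q-table converge toward the optimal action values.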
Applications of Reinforcement Learning
At Fanuc, a robot uses deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do the job with increasing speed and precision.
Many warehousing facilities used by eCommerce sites and other supermarkets use these intelligent robots to sort millions of products every day and help deliver the right products to the right people. Tesla's factory, for example, comprises more than 160 robots that do a major part of the work on its cars, reducing the risk of defects.
Reinforcement learning has helped develop several innovative applications in the financial industry. Combined with machine learning, it has made a substantial difference in the domain over the years. Today, numerous technologies in finance build on it, such as search engines and chatbots.
Several reinforcement learning techniques can help generate more return on investment, reduce costs, and improve customer experience. Together with machine learning, reinforcement learning can improve execution when approving loans, measuring risk factors, and managing investments.
One of the most popular applications of reinforcement learning in finance is portfolio management: building a platform that makes significantly more accurate predictions about stocks and other investments, thereby producing better results. This is one of the main reasons why many investors in the industry wish to build such applications to evaluate the financial market in detail. Moreover, many of these portfolio management applications, including robo-advisors, produce increasingly accurate results over time.
A major issue in supply chain inventory management is the coordination of the inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand.
Reinforcement learning algorithms can be built to reduce transit time for stocking and retrieving products in the warehouse, optimizing space utilization and warehouse operations.
With technology improving and advancing regularly, it has taken over almost every industry, especially healthcare. With the implementation of reinforcement learning, the healthcare system has consistently generated better outcomes. One notable example of reinforcement learning in the healthcare domain is Quotient Health.
Quotient Health is a software application built to reduce the expense of supporting electronic medical record (EMR) systems, which it achieves by standardizing and improving the ways those systems are designed. The main goal is to improve the healthcare system, specifically by lowering unnecessary costs.
Reinforcement learning is also used to solve the Split Delivery Vehicle Routing Problem, where Q-learning is applied to serve the appropriate customers with just one vehicle.
Image processing is treated here as a subcategory of the healthcare domain: it is partly a branch of the medical industry, yet it has a domain of its own. Reinforcement learning has revolutionized not only image processing but the medical industry at large. Here, however, we will discuss some applications of this technology in image processing alone.
The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values.
The technique used experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed. This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.
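A minimal sketch of such a replay buffer (the capacity and batch size below are assumptions, not values reported for the DeepMind system):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches,
    breaking the correlations in the observation sequence."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sample of prior transitions,
        # not just the most recent one.
        return random.sample(self.buffer, batch_size)
```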
Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.
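A sketch of the tabular Double Q-learning update, maintaining two tables Qa and Qb so that one selects the next action while the other evaluates it (the alpha and gamma values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()

def double_q_update(Qa, Qb, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Randomly update one table using the other's value estimate,
    decoupling action selection from action evaluation."""
    if rng.random() < 0.5:
        best = int(np.argmax(Qa[s_next]))  # select with Qa ...
        Qa[s, a] += alpha * (r + gamma * Qb[s_next, best] - Qa[s, a])  # ... evaluate with Qb
    else:
        best = int(np.argmax(Qb[s_next]))  # select with Qb ...
        Qb[s, a] += alpha * (r + gamma * Qa[s_next, best] - Qb[s, a])  # ... evaluate with Qa
```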