Machine Learning

Reinforcement Learning – Reward for Learning

Reinforcement Learning (RL) is the last post in the sub-series “Machine Learning Type” under the master series “Machine Learning Explained”. The next sub-series, “Machine Learning Algorithms Demystified”, is coming up. This post covers reinforcement learning only; the previous posts on Supervised Learning and Unsupervised Learning are available.


Reinforcement Learning

RL can be compared with a scenario like “how some newborn baby animals learn to stand, run, and survive in their environment.”


Some Basics – Reinforcement Learning

Reinforcement Learning (RL) is more general than supervised or unsupervised learning. It learns from interaction with an environment to achieve a goal; put simply, it learns from rewards and punishments. In other words, the algorithm learns how to react to its environment. Temporal-difference (TD) learning seems to be the closest to how humans learn in this type of situation, but Q-learning (itself a TD method) and other algorithms also have their own advantages.

Reinforcement learning refers to both a learning problem and a subfield of machine learning. As a learning problem, it means learning to control a system so as to maximize a numerical value that represents a long-term objective.

Reinforcement Learning

AILabPage’s – Machine Learning Series

Although RL has been around for many years as the third pillar of machine learning, it is now becoming increasingly important for data scientists to know when and how to implement it. The attention RL now receives, on a par with the other two machine learning types, reflects its rising importance in AI.

Some of the goals RL is used for are listed below.

  • Decision Process
  • Reward/Penalty System
  • Recommendation System


What is Reinforcement Learning

Before going deeper into the what and why of RL, let's look at some history of how it originated. From the best research I could find, the term was coined in the 1980s during research on animal behaviour, especially how some newborn animals learn to stand, run, and survive in their environment. The reward is survival through learning; the punishment can be compared with being eaten by others.


Reinforcement learning can be understood through the concepts of agents, environments, states, actions and rewards. It is an area of machine learning where there is no answer key, yet the RL agent still has to decide how to act to perform its task. Inspired by behaviourist psychology, the agent decides which actions to take in an environment so as to maximise some notion of cumulative reward. In the absence of existing training data, the agent learns from experience. It collects training examples of the form:

  • this action was good
  • that action was bad

The learner is not told explicitly which action to take; it is expected to discover which action yields the most reward by trial and error. Typically, an RL setup is composed of two components: an agent and an environment.
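
To make the trial-and-error idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the "environment" is a toy three-armed bandit with made-up success probabilities, and the agent uses an epsilon-greedy rule (explore at random 10% of the time, otherwise pick the action that currently looks best). The names `run_bandit`, `success_probs` and `epsilon` are hypothetical and not part of any library.

```python
import random

def run_bandit(success_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a toy multi-armed bandit."""
    rng = random.Random(seed)
    n_actions = len(success_probs)
    counts = [0] * n_actions      # how often each action was tried
    values = [0.0] * n_actions    # running average reward per action

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])

        # The environment answers with 1.0 ("this action was good")
        # or 0.0 ("that action was bad").
        reward = 1.0 if rng.random() < success_probs[action] else 0.0

        # Incremental mean update: no stored training set is needed.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts

# Assumed success probabilities for three actions; the agent never sees them.
values, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=lambda a: values[a])
print("estimated values:", values, "best action:", best_action)
```

The agent is never told which action is best; it only sees good or bad outcomes and gradually shifts towards the action with the highest estimated reward.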

We can't learn to drive purely via reinforcement learning in the real world, because failure cannot be tolerated. Trial and error in the real environment is not an option when safety is a concern.

Reinforcement Learning Algorithms

RL algorithms are numerous, and it is hard even to count them, let alone compare them all comprehensively. Below are one- or two-line descriptions of some of the widely used RL algorithms. Please note that these will be described in full chapters with calculations, code and examples in subsequent posts.


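  • Q-Learning – A model-free RL algorithm based on the well-known Bellman equation. It is off-policy: the agent may behave exploratively while the policy being learned is the greedy policy. Q-learning is one form of reinforcement learning in which the agent learns an evaluation function over states and actions.
    • Policy Iteration – A related dynamic programming method that alternates policy evaluation and policy improvement; unlike Q-learning, it needs a model of the environment.
    • Value Iteration – A related dynamic programming method that applies the Bellman optimality update to state values directly; it also needs a model of the environment.
  • State-Action-Reward-State-Action (SARSA) – Closely resembles Q-learning. The key difference is that SARSA is an on-policy algorithm: it updates its estimates using the action the current policy actually takes next.
  • Deep Q Network (DQN) – DQN's main ability is to estimate values for unseen states, which a tabular Q-learning agent cannot do. DQN gets rid of Q-learning's two-dimensional array by using a neural network to approximate the Q-values.
  • Deep Deterministic Policy Gradient (DDPG) – Handles action spaces that are too large, or continuous, for DQN by adapting its ideas into an actor-critic method with a deterministic policy.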
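
The off-policy versus on-policy difference between Q-learning and SARSA is easiest to see in the update rules themselves. The sketch below is a minimal tabular version, assuming a dictionary-based Q-table, a learning rate `alpha` and a discount factor `gamma`; the function names, state names and numbers are my own choices for illustration, and terminal-state handling is left out for brevity.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q_q = defaultdict(float)   # table updated with Q-learning
Q_s = defaultdict(float)   # table updated with SARSA
for Q in (Q_q, Q_s):
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 5.0

# Same transition (s0, right, reward 1.0, s1) for both agents.
# The SARSA agent happens to explore and picks "left" in s1.
q_learning_update(Q_q, "s0", "right", 1.0, "s1", actions)
sarsa_update(Q_s, "s0", "right", 1.0, "s1", "left")

print(Q_q[("s0", "right")], Q_s[("s0", "right")])
```

Q-learning uses the greedy value of the next state regardless of what the agent does next, while SARSA's estimate reflects the exploratory action it really took, so the two tables end up with different values for the same experience.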

Some more complex algorithms

  • Trust Region Policy Optimization (TRPO) – It has consistently high performance, but its computation and implementation are extremely complicated.
  • Proximal Policy Optimization (PPO, OpenAI version) – PPO proposes a clipped surrogate objective function, which keeps each update close to the previous policy and is simpler to implement than TRPO.
  • Back Propagation – I am keeping it separate from DQN. Reinforcement learning can use machine learning models, including neural networks, to build its model. When a neural network is used, backpropagation may be used to train it.
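
The clipped surrogate objective mentioned for PPO can be sketched in a few lines. This is a per-sample illustration only, assuming `ratio` is the probability of the action under the new policy divided by its probability under the old policy, `advantage` is an estimate of how much better the action was than average, and `clip_eps` is the clipping range (0.2 is a commonly used value). The function names are hypothetical, and a real implementation would compute this over batches inside a deep learning framework.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def mean_objective(ratios, advantages, clip_eps=0.2):
    """Average of the per-sample terms; this is the quantity PPO maximizes."""
    terms = [clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Good action, policy moved far towards it: the gain is capped at (1 + eps) * A.
gain_capped = clipped_surrogate(1.5, 2.0)
# Bad action, policy moved towards it anyway: the full penalty is kept.
penalty_kept = clipped_surrogate(1.5, -1.0)
print(gain_capped, penalty_kept)
```

Taking the minimum means the objective never rewards moving the policy further than the clip range allows, which is what keeps each update small.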

Comparison of Discussed Algorithms

Deciding which RL algorithm to apply to a specific task can be quite a head-spinning experience. I am attempting to provide an introduction to some of the well-known algorithms here.


Reinforcement Learning Process Flow

Reinforcement learning is mostly useful when there is no supervised training set but there is a reinforcement signal. Learning comes from interaction and is strongly shaped by goals: each action is evaluated and earns a reward or a punishment. Most RL algorithms follow this pattern. The terms below are used in RL and will facilitate the discussion in the next section.


  • Action (A): All the possible moves that the agent can make.
  • State (S): The current situation returned by the environment.
  • Reward (R): An immediate return sent back from the environment to evaluate the last action.
  • Policy (π): The agent's strategy for determining the next action based on the current state.
  • Value (V): The expected long-term return with discount, as opposed to the short-term reward.
  • Q-value or action-value (Q): Similar to Value, except that it takes an extra parameter, the current action a.
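
Using the terms above, the interaction between agent and environment can be sketched as a simple loop. Everything here is a toy assumption of mine: `Corridor` is a hypothetical five-cell environment that pays a reward of 1.0 only at the right end, `random_policy` picks a direction at random, and `gamma` is the discount applied to later rewards.

```python
import random

class Corridor:
    """Hypothetical toy environment: cells 0..4, reward only at the right end."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state, rng):
    """Policy: maps the current state to an action (this one ignores the state)."""
    return rng.choice([-1, 1])

def run_episode(env, policy, gamma=0.9, max_steps=1000, seed=0):
    """Run one episode and return (discounted return, number of steps taken)."""
    rng = random.Random(seed)
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(state, rng)               # agent picks an action
        state, reward, done = env.step(action)    # environment responds
        discounted_return += discount * reward    # later rewards count less
        discount *= gamma
        if done:
            return discounted_return, t + 1
    return discounted_return, max_steps

G, steps = run_episode(Corridor(), random_policy)
print("discounted return:", G, "steps:", steps)
```

A policy that always moves right reaches the goal in four steps and collects a larger discounted return than one that wanders, which is the sense in which Value measures long-term rather than immediate reward.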

Vπ(s) is defined as the expected long-term return of the current state s under policy π. Qπ(s, a) refers to the expected long-term return of the current state s, taking action a and then following policy π.
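
Written out, these two definitions take the following standard form. The notation is my assumption, since the post does not fix one: γ is the discount factor between 0 and 1, R is the reward received at each step, and S_t and A_t are the state and action at time t.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right]

V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
```

The last line shows how the two are linked: the value of a state is the average of its Q-values, weighted by how likely the policy is to choose each action.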


Reinforcement Learning vs Supervised Learning vs Unsupervised Learning

Reinforcement learning addresses a very broad and relevant question: how can we learn to survive in our environment?

  • There are many extensions to speed up learning.
  • There have been many successful real world applications.

Semi-supervised learning, which is essentially a combination of supervised and unsupervised learning, can also be compared with RL. It differs from reinforcement learning in that it learns a direct mapping from inputs to labelled outputs, whereas reinforcement learning has no such mapping and learns from reward signals instead.


When to use Reinforcement Learning

The answer to this question is not simple (trust me, though it could be purely my own opinion). Which kind of ML algorithm you should use depends less on your problem than on your dataset.

Real life business use cases for Reinforcement Learning

Some major domains where RL has been applied are as follows:

  • Robotics – A robot uses deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do the job with great speed and precision.
  • FinTech – Leveraging reinforcement learning to evaluate trading strategies can be a good approach alongside supervised learning. It can turn out to be a robust tool for training systems to optimize financial objectives.
  • Game Theory and Multi-Agent Interaction – Reinforcement learning and games have a long and mutually beneficial common history. The best-known examples are AlphaGo and chess.

There are a lot of other industries and areas where this type of learning is in use and changing the game, such as computer networking, inventory management, vehicle navigation and many more.


Points to Note:

All credit, if any, remains with the original contributors. We have covered reinforcement learning in this post, where we reward and punish algorithms for their predictions and controls. The data used here is either very limited or still to arrive. The last posts on Supervised Machine Learning and Unsupervised Machine Learning got some decent feedback; I would love to hear some feedback here also. Our next post will talk about Reinforcement Learning – Markov Decision Processes.


Conclusion – Reinforcement learning addresses the problem of learning control strategies for autonomous agents with little or no data. RL algorithms are powerful in machine learning because collecting and labelling a large set of sample patterns can cost more than the data itself.

Learning the game of chess can be a tedious task under supervised learning, but RL handles the same task swiftly. The trial-and-error method, in which the agent attempts its task with the goal of maximizing long-term reward, can show better results here. Reinforcement learning is closely related to dynamic programming approaches to Markov decision processes (MDPs). An MDP assumes the agent can fully observe the state; partially observable problems are modelled by POMDPs, which have received a lot of attention in the reinforcement learning community. There are so many things still unexplored, and with the current craze for data science and machine learning, applied reinforcement learning is certainly a breakthrough.
