Reinforcement Learning (RL) – A more general form of machine learning than supervised or unsupervised learning. It learns from interaction with the environment to achieve a goal, or simply learns from rewards (positive adjustments) and punishments (negative adjustments). This style of learning is inspired by behaviourist psychology. A point to note here: there is no anthropomorphization in reinforcement learning.

Reinforcement Learning – History

As far as my research goes, the term was coined in the 1980s during studies of animal behaviour, in particular how some newborn animals learn to stand, run and survive in a given environment. The reward for learning is survival, and the punishment can be compared to being eaten by others.

Reinforcement learning (RL) can be understood through the concepts of agents, environments, states, actions and rewards. It collects training examples of the form:

  • this action was good
  • that action was bad

The learner is not told explicitly which action to take; it is expected to discover which action yields the most lucrative result, in the form of reward, by trial and error. Typically, the RL setup is composed of two components, an agent and an environment. In other words, algorithms learn to react to the environment. TD-learning seems to be closest to how humans learn in this type of situation, but Q-learning and others also have their own advantages.
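The trial-and-error idea above can be sketched in a few lines of Python. This is only a minimal illustration, not a full RL algorithm: the two-action "environment", its reward probabilities and the names run_bandit and epsilon are assumptions made up for this example.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions   # running average reward per action
    counts = [0] * n_actions        # how often each action was tried

    for _ in range(steps):
        # Explore occasionally, otherwise exploit the best estimate so far.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment only answers "good" (1) or "bad" (0); it never
        # reveals which action would have been the right one.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental average: nudge the estimate towards the new reward.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts

# Hypothetical environment: action 1 is rewarded far more often than action 0.
estimates, counts = run_bandit([0.2, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
```

The agent is never told that action 1 is better; it discovers this purely from the "good"/"bad" feedback it collects.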

In simple terms, the concept behind reinforcement learning is the set of adjustments the algorithm makes. In machine learning there is no simulation of the human limbic system, which is involved in emotion, motivation, memory and learning. The algorithm simply adjusts itself in a positive direction when it receives a reward, much as the human brain releases dopamine when you play Candy Crush: upon clearing a level you feel encouraged to move on, even if you don’t have time or your parents are shouting at you. RL just borrows human terminology, i.e. reward/punishment, to show that it is inspired by biological entities.

Formulating – Reinforcement Learning (RL) Problem

Reinforcement Learning – Enables an agent (including a human) to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. The key terms used to formulate a reinforcement learning problem are as follows.

  • Environment: The world (physical or simulated) in which the agent operates.
  • Action (A): All the possible moves that the agent can make.
  • State (S): The current situation returned by the environment, i.e. the situation in which the agent is currently operating.
  • Reward (R): An immediate return sent back from the environment to evaluate the last action; feedback from the environment on the work done.
  • Policy (π): The agent’s strategy for determining the next action based on the current state. It is a mapping from the agent’s states to actions.
  • Value (V): The expected long-term return with discount, as opposed to the short-term reward. It is the future reward that an agent would receive by acting from a particular state onwards.
  • Q-value or action-value (Q): Similar to Value, except that it takes an extra parameter, the current action a.

Vπ(s) is defined as the expected long-term return from the current state s under policy π. Qπ(s, a) is the expected long-term return from the current state s when taking action a and following policy π afterwards.
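To make "long-term return with discount" concrete, here is a small sketch. The discount factor gamma, the function names and the sample reward sequences are assumptions chosen for illustration; averaging sampled returns is just one simple (Monte Carlo) way of estimating Vπ(s).

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (return afterwards).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def estimate_value(episodes, gamma=0.9):
    """Estimate V(s) as the average return over episodes that start in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed after starting in the same state s
# and following the same policy each time.
episodes_from_s = [[0, 0, 1], [0, 1], [1]]
v_s = estimate_value(episodes_from_s, gamma=0.5)
```

With gamma = 0.5, a reward of 1 received two steps later contributes only 0.25, which is why the value favours policies that reach rewards sooner.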

The learning style here is concerned with how an agent ought to take actions: it learns from interaction with the environment to meet a goal, or simply from rewards and punishments.

Demystifying – Reinforcement Learning (RL)

Reinforcement Learning (RL) algorithms learn to react to the environment. TD-learning and Q-learning are two of the best-known algorithms in this style of learning. I remember reading a book on reinforcement learning some years back with a focus on “Intelligent Machines”. Three methods of reinforcement learning were discussed there in greater detail, as below.

  • Q-Learning – Commonly used as a model-free approach. The value update rule is the core of the Q-learning algorithm, and the policy it learns is greedy with respect to the Q-values. Q-learning and TD-learning are related, but not the same: Q-learning is one form of temporal-difference learning in which the agent learns an evaluation function over states and actions, building on ideas from policy and value iteration.
  • Temporal Difference Learning – TD-learning seems to be closest to how humans learn in this type of situation.
  • Model-Based – Learns or is given a model of the MDP and plans with it; model-free methods such as Q-learning are the better fit when the MDP can’t be learned.
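The value update rule mentioned for Q-learning can be sketched as follows. This is a minimal tabular example under stated assumptions: the five-state corridor environment, the step function and all hyperparameters (alpha, gamma, epsilon) are hypothetical and chosen only to keep the example small.

```python
import random

# Hypothetical 5-state corridor: start in state 0, reward 1 for reaching state 4.
N_STATES = 5
ACTIONS = [0, 1]  # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment dynamics (an assumption made for this example)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behave epsilon-greedily, breaking ties at random ...
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # ... but bootstrap from the greedy (max) value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            # Temporal-difference update: move Q(s, a) towards the target.
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

The difference between the target and the current estimate is the temporal-difference error, which is what links Q-learning to TD-learning; the max over next actions is what makes the learned policy greedy.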

Reinforcement Learning (RL) has been around for many years as the third pillar of Machine Learning. It is now becoming increasingly important for Data Scientists to know when and how to apply it. RL is used for goals such as decision processes, reward/penalty systems and recommendation systems.

Reinforcement Learning Algorithms

In the picture below you will find one- or two-line descriptions of widely used RL algorithms. Please note these will be described in full chapters, with calculations and examples, in later posts. How will all these limitations play out when the same algorithms are run on quantum computers? It will be amazing to see how each model will be created and parameterized.


In this article, we have overviewed the major algorithms in reinforcement learning. Each algorithm will be explained in detail in upcoming posts with formulas, graphics, Python code and live examples. In short, it’s correct to say that reinforcement learning is all about taking actions to reap maximum rewards, or facing penalties on failure. This learning is deployed to find the best possible path or solution in a given problem situation.

Some extra complex algorithms

  • Trust Region Policy Optimization (TRPO) – It has consistently high performance, but its computation and implementation are extremely complicated.
  • Proximal Policy Optimization (PPO, OpenAI version) – PPO proposes a clipped surrogate objective function.
  • BackPropagation – Reinforcement learning is a kind of machine learning and uses several of its techniques, including neural networks, to build the best learning model. Since it may use a neural network, backprop may be used in reinforcement learning.
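As a rough sketch of what PPO's clipped surrogate objective computes, consider the per-sample term below. The function names and the default clip_eps = 0.2 are assumptions for illustration; a real implementation would compute the probability ratios and advantages from a neural-network policy and maximise this quantity by gradient ascent.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample.

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimate of how much better the action was than average
    """
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_objective(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate over a batch (a quantity to be maximised)."""
    terms = [clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# A good action (advantage > 0) gets no extra credit once the ratio exceeds 1.2.
capped_gain = clipped_surrogate(1.5, 1.0)
# A bad action (advantage < 0) that became more likely is penalised in full.
full_penalty = clipped_surrogate(1.5, -1.0)
```

The clipping removes the incentive to move the new policy far away from the old one in a single update, which is the role the trust region plays in TRPO, but with a much simpler computation.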

Markov Decision Processes (MDP)

Reinforcement learning is closely related to dynamic programming approaches to Markov decision processes (MDP). An MDP is a discrete-time stochastic control process that provides a mathematical framework for modelling decision-making. A standard MDP assumes the state is fully observable; its extension to partially observable problems, the POMDP, has received a lot of attention in the reinforcement learning community.

A Markov Decision Process model contains:

  • A set of possible world states S
  • A set of possible actions A
  • A real-valued reward function R(s, a)
  • A description T of each action’s effects in each state
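The four ingredients listed above map directly onto plain Python data structures. The sketch below uses a made-up two-state MDP (the state names, probabilities and rewards are all assumptions) and solves it with value iteration, one of the dynamic programming approaches to MDPs.

```python
# Hypothetical two-state MDP used only for illustration.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]

# T[s][a] is a list of (probability, next_state) pairs. Note that it depends
# only on the current state and action, never on earlier history.
T = {
    "low":  {"wait": [(1.0, "low")],
             "work": [(0.7, "high"), (0.3, "low")]},
    "high": {"wait": [(0.8, "high"), (0.2, "low")],
             "work": [(1.0, "high")]},
}

# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}

def q_value(s, a, values, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * values[s2] for p, s2 in T[s][a])

def value_iteration(gamma=0.9, tol=1e-8):
    values = {s: 0.0 for s in STATES}
    while True:
        # Bellman optimality backup: best achievable value from each state.
        new_values = {s: max(q_value(s, a, values, gamma) for a in ACTIONS)
                      for s in STATES}
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values

optimal_values = value_iteration()
optimal_policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, optimal_values, 0.9))
                  for s in STATES}
```

In this toy model the agent accepts a short-term cost ("work" in the low state) to reach the state where waiting pays off, which is the kind of long-term trade-off an MDP is meant to capture.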

There are so many things still unexplored, and with the current craze for data science and machine learning, applied reinforcement learning is certainly a breakthrough. In an MDP, some outcomes are random and some are under the control of the decision-maker.

An interesting fact about MDPs is that the effect of an action taken in a certain state depends only on that state, not on the history or any prior state (the Markov property).

Reinforcement Learning vs Supervised Learning vs Unsupervised Learning

Reinforcement learning addresses a very broad and relevant question: how can we learn to survive in our environment?

  • There are many extensions to speed up learning.
  • There have been many successful real-world applications.

Semi-supervised learning, which is essentially a combination of supervised and unsupervised learning, can also be compared with RL. It differs from reinforcement learning in that it has a direct mapping from inputs to labelled outputs, whereas reinforcement learning does not.

When to use Reinforcement Learning

The answer to the question above is not simple (trust me, though it could be purely my own opinion). Which kind of ML algorithm you should use depends less on your problem than on your dataset.

Real-life business use cases for Reinforcement Learning

Some major domains where RL has been applied are as follows:

  • Robotics – A robot uses deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge and trains itself to do the job with great speed and precision.
  • FinTech – Leveraging reinforcement learning to evaluate trading strategies can be a good approach alongside supervised learning. It can turn out to be a robust tool for training systems to optimize financial objectives.
  • Game Theory and Multi-Agent Interaction – Reinforcement learning and games have a long and mutually beneficial common history. The best examples are AlphaGo and chess.

There are a lot of other industries and areas where this style of learning is in use and changing the game, such as Computer Networking, Inventory Management, Vehicle Navigation and many more.

Points to Note:

All credit, if any, remains with the original contributors only. We have covered reinforcement learning in this post, where we reward and punish algorithms for predictions and controls. The data used here is either very scarce or not yet available. The previous posts on Supervised Machine Learning and Unsupervised Machine Learning got some decent feedback. Our next post will talk about Reinforcement Learning – Markov Decision Processes.

Conclusion – Reinforcement Learning addresses the problem of learning control strategies for autonomous agents with little or no data. RL algorithms are powerful in machine learning because collecting and labelling a large set of sample patterns can cost more than the data itself. Learning the game of chess can be a tedious task under supervised learning, but RL works swiftly for the same task. The trial-and-error method, in which the agent attempts its task with the goal of maximizing long-term reward, can show better results here.


Posted by V Sharma

