Reinforcement Learning (RL) is the last post in the subseries “Machine Learning Type” under the master series “Machine Learning Explained”. The next subseries, “Machine Learning Algorithms Demystified”, is coming up. This post covers reinforcement learning only; the previous posts on Supervised Learning and Unsupervised Learning are available.
RL can be compared with a scenario such as “how some newborn baby animals learn to stand, run, and survive in the given environment.”
Some Basics – Reinforcement Learning
Reinforcement Learning (RL) is, in some respects, more general than supervised or unsupervised learning. It learns from interaction with the environment to achieve a goal, or, put simply, it learns from rewards and punishments. In other words, the algorithm learns how to react to the environment. Temporal-difference (TD) learning, a family that includes Q-learning and SARSA, is often described as the closest to how humans learn in this type of situation, and each method in the family has its own advantages.
Reinforcement learning can refer to a learning problem and to a subfield of machine learning at the same time. As a learning problem, it refers to learning to control a system so as to maximize some numerical value that represents a long-term objective.
RL has been around for many years as the third pillar of machine learning, and it is becoming increasingly important for data scientists to know when and how to apply it. Its growing prominence alongside the other two machine learning types reflects its rising importance in AI.
RL is commonly used for the following:
- Decision Process
- Reward/Penalty System
- Recommendation System
In AILabPage terms, “Reinforcement Learning is the process of getting mature or attaining maturity in any or everything we do”. The corrections we make while learning and working give us a far greater sense of accomplishment. Take learning to ride a bicycle as an example: the reward comes from maintaining your balance and the punishment comes when you lose it. Similarly, algorithms in reinforcement learning learn by being adjusted in either a positive or a negative direction.
What is Reinforcement Learning
Before we go deeper into the what and why of RL, let’s look at some history of how it originated. As far as I could find, the field took shape in the 1980s, drawing on research into animal behaviour, especially how some newborn baby animals learn to stand, run, and survive in the given environment. The reward for learning is survival, and the punishment can be compared with being eaten by others.
Reinforcement learning can be understood through the concepts of agents, environments, states, actions and rewards. It is an area of machine learning where there is no answer key, yet the RL agent still has to decide how to act to perform its task. The idea is inspired by behaviourist psychology: the agent decides which actions to take in an environment so as to maximise some notion of cumulative reward. In the absence of existing training data, the agent learns from experience. It collects its own training examples:
- this action was good
- that action was bad
Trial and error is not always acceptable in the real world. For example, an agent cannot learn to drive on real roads purely through reinforcement learning, because failure cannot be tolerated when safety is a concern.
The learner is not told explicitly which action to take; it is expected to discover, by trial and error, which action yields the highest reward. Typically, an RL setup is composed of two components: an agent and an environment.
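The agent and environment loop described above can be sketched in a few lines. This is only an illustrative sketch: the `CorridorEnv` class, its reward values, and the `run_episode` helper are hypothetical names made up for this post, not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial state

    def step(self, action):
        # action: 0 = move left, 1 = move right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed values: reward at the goal, small penalty otherwise
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return the total reward and the (state, action, reward) history."""
    state = env.reset()
    total_reward = 0.0
    history = []
    for _ in range(max_steps):
        action = policy(state)                       # the agent decides
        next_state, reward, done = env.step(action)  # the environment responds
        history.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, history


def random_policy(state):
    # Pure trial and error: ignore the state and pick a random action.
    return random.choice([0, 1])


random.seed(0)
total, history = run_episode(CorridorEnv(), random_policy)
print(f"steps: {len(history)}, total reward: {total:.1f}")
```

The history collected here is exactly the kind of “this action was good, that action was bad” experience that a learning algorithm would then use to improve the policy.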
Reinforcement Learning Algorithms
There are a great number of RL algorithms, so many that it is hard even to count them, let alone compare each of them comprehensively. Below are one- or two-line descriptions of some of the widely used RL algorithms. Please note these will be described in full chapters with calculations, code and examples in subsequent posts.
- Q-Learning – Model-free RL algorithm based on the well-known Bellman equation. It is an off-policy method: its update assumes the greedy policy (the best action in the next state), regardless of the action the agent actually takes next. The agent learns an evaluation function over states and actions.
- Policy Iteration – Dynamic programming method that alternates between evaluating the current policy and improving it; it requires a known model of the environment.
- Value Iteration – Dynamic programming method that repeatedly applies the Bellman optimality update to the value function until it converges; it also requires a known model.
- State-Action-Reward-State-Action (SARSA) – Closely resembles Q-learning. The key difference is that SARSA is an on-policy algorithm: its update uses the action actually chosen in the next state rather than the greedy one.
- Deep Q Network (DQN) – DQN’s main ability is to estimate values for unseen states, which a tabular Q-learning agent cannot do. DQN gets rid of Q-learning’s two-dimensional state-action table by approximating it with a neural network.
- Deep Deterministic Policy Gradient (DDPG) – Designed for continuous or very large action spaces, where DQN’s maximisation over all actions becomes impractical. DDPG is an actor-critic method that borrows ideas from DQN, such as experience replay and target networks.
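To make the off-policy versus on-policy difference between Q-learning and SARSA concrete, here is a minimal tabular sketch. The function names, the state labels "s0" and "s1", and the values of the learning rate `alpha` and discount `gamma` are assumptions chosen for illustration; terminal-state handling is left out to keep it short.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best (greedy) action in next_state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])


def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually chosen in next_state."""
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])


actions = [0, 1]
Q = defaultdict(float)  # the Q-table; unseen (state, action) pairs start at 0.0
Q[("s1", 0)] = 1.0
Q[("s1", 1)] = 5.0

# Same reward and next state, but the two rules look at different next-state values.
q_learning_update(Q, "s0", 0, reward=1.0, next_state="s1", actions=actions)
sarsa_update(Q, "s0", 1, reward=1.0, next_state="s1", next_action=0)
print(Q[("s0", 0)], Q[("s0", 1)])
```

With these made-up numbers, the Q-learning entry moves to about 0.55 because it uses the best next value (5.0), while the SARSA entry moves to about 0.19 because it uses the value of the action actually taken next (1.0).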
Some Extra Complex Algorithms
- Trust Region Policy Optimization (TRPO) – It performs consistently well, but its computation and implementation are extremely complicated.
- Proximal Policy Optimization (PPO, OpenAI version) – PPO proposes a clipped surrogate objective function, which limits how far a single update can move the policy and is simpler to implement than TRPO.
- Back Propagation – I am keeping it separate from DQN. Reinforcement learning can use machine learning models, including neural networks, to build its model of values or policies. Whenever a neural network is used, backpropagation may be used to train it within reinforcement learning.
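The clipped surrogate objective mentioned for PPO above can be sketched for a single sample as follows. This is a simplified illustration, not a full PPO implementation: the function name is made up for this post, and `epsilon=0.2` is an assumed, commonly used clipping value.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new policy probability / old policy probability for the sampled action
    advantage -- estimate of how much better the action was than average
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the action's probability is capped.
print(clipped_surrogate(1.5, 2.0))
# Negative advantage: the more pessimistic (clipped) value is kept.
print(clipped_surrogate(0.5, -1.0))
# Ratio inside the clipping range: the objective is left unchanged.
print(clipped_surrogate(1.1, 2.0))
```

In a real training loop this quantity would be averaged over a batch and maximised with gradient ascent; the clipping removes the incentive to push the ratio far outside the range from 1 - epsilon to 1 + epsilon in one update.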
Comparison of Discussed Algorithms
Deciding which RL algorithm to apply to a specific task can be quite a head-spinning experience. Here I am attempting to provide an introduction to some of the well-known algorithms.
Reinforcement Learning Process Flow
Reinforcement learning is most useful when there is no labelled training set for supervised learning but reinforcement signals are available. Learning comes from interaction and is strongly driven by goals: each action is evaluated and earns a reward or a punishment. Most RL algorithms follow this pattern. The terms below are used throughout RL and will help with the discussion in the next section.
- Action (A): All the possible moves that the agent can make.
- State (S): The current situation returned by the environment.
- Reward (R): The immediate return sent back by the environment to evaluate the last action.
- Policy (π): The agent’s strategy for choosing the next action based on the current state.
- Value (V): The expected long-term return with discount, as opposed to the short-term reward R.
- Q-value or action-value (Q): Similar to Value, except that it takes an extra parameter, the current action “a”.
Vπ(s) is defined as the expected long-term return of the current state “s” under policy π. Qπ(s, a) refers to the expected long-term return from the current state “s” when taking action “a” and following policy π afterwards.
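The “long-term return with discount” behind V and Q can be worked out for one concrete sequence of rewards. This is a small sketch: the function name and the discount factor `gamma` are assumptions for illustration. Vπ(s) is then the average of this quantity over the trajectories that start in s and follow π.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for one reward sequence."""
    g = 0.0
    # Work backwards so each step is: return = reward + gamma * (return of the rest)
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# 1 + 0.5 * 1 + 0.25 * 1
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A smaller `gamma` makes the agent short-sighted, since later rewards count for less; a value close to 1 makes it weigh the long term more heavily.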
Reinforcement Learning vs Supervised Learning vs Unsupervised Learning
Reinforcement learning addresses a very broad and relevant question: how can we learn to survive in our environment?
- There are many extensions to speed up learning.
- There have been many successful real-world applications.
Semi-supervised learning, which is essentially a combination of supervised and unsupervised learning, can also be compared with RL. It differs from reinforcement learning in that it learns a direct mapping from inputs to labels, whereas reinforcement learning has no such mapping and relies on reward feedback instead.
Reinforcement Learning – When to use
The answer to the question above is not simple (trust me, though it could be purely my own opinion). Which kind of ML algorithm you should use depends less on your problem than on your dataset.
Real life business use cases for Reinforcement Learning
Some major domains where RL has been applied are as follows:
- Robotics – A robot uses deep reinforcement learning to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do the job with great speed and precision.
- FinTech – Leveraging reinforcement learning to evaluate trading strategies can be a good approach alongside supervised learning. It can turn out to be a robust tool for training systems to optimize financial objectives.
- Game Theory and Multi-Agent Interaction – Reinforcement learning and games have a long and mutually beneficial common history. The best examples are AlphaGo and computer chess.
There are a lot of other industries and areas where this type of learning is in use and changing the game, such as computer networking, inventory management, vehicle navigation and many more. Markov decision processes (MDPs) provide the standard formal framework for these problems, and partially observable MDPs (POMDPs), which deal with the partially observable case, have received a lot of attention in the reinforcement learning community. There are so many things still unexplored, and with the current craze for data science and machine learning, applied reinforcement learning is certainly a breakthrough.
Points to Note:
All credit, if any, remains with the original contributors only. We have covered reinforcement learning in this post, where we reward and punish algorithms for their predictions and controls. This technique is used when there is very little data or when the data is still to come. The last posts on Supervised Machine Learning and Unsupervised Machine Learning got some decent feedback, and I would love to hear some feedback here as well. Our next post will talk about Reinforcement Learning – Markov Decision Processes.
Books + Other readings Referred
- Research through open internet, news portals, white papers and imparted knowledge via live conferences & lectures.
- Lab and hands-on experience of @AILabPage (Self-taught learners group) members.
- Machine Learning – An Introduction
- Reinforcement Machine Learning – An Introduction
- Asynchronous Methods for Deep Reinforcement Learning
- Data-efficient Deep Reinforcement Learning for Dexterous Manipulation
Feedback & Further Question
Do you have any questions about Reinforcement Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.
Conclusion – Reinforcement learning addresses the problem of learning control strategies for autonomous agents with little or no data. RL algorithms are attractive in machine learning because collecting and labelling a large set of samples can cost more than the data itself. An RL agent keeps learning, so it continually gets better at the task at hand. Learning the game of chess can be a tedious task under supervised learning, whereas RL can handle the same task well. Its trial-and-error approach, with the goal of maximizing long-term reward, can show better results here. Reinforcement learning is closely related to dynamic programming approaches to Markov decision processes (MDPs).
Thank you all for spending your time reading this post. Please share your feedback, comments, criticism, agreement or disagreement. For more details about the posts, subjects and their relevance, please read the disclaimer.