Regression – In the machine learning domain, the general perception is that "ML is just regression", which is far from the truth. That said, regression models are very popular in machine learning for predicting target variables on a continuous scale, and they are among the most widely used models in the field. Regression analysis can be treated as a kind of crystal ball, i.e. a predictive modelling technique. It examines the relationship between dependent variables and independent variables, i.e. it predicts the dependent variables through the independent variables.
Regression analysis is used for understanding the relationship between a set of variables. In data sets, we observe some variables as
- Independent variables (IV) – Characteristics that can be measured directly
- Dependent variables (DV) – Characteristics whose values depend on the independent variables.
A dependent variable "depends" on the independent variables. The two must coexist; there cannot be a dependent variable without an independent variable. In a nutshell, quality data is a more important factor than fancy ML architectures.
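To make the IV/DV split concrete, here is a minimal sketch in code; the hotel figures below are invented purely for illustration:

```python
import numpy as np

# Independent variables (IV): characteristics we can measure directly.
# Here: hotel star rating and nightly price (hypothetical values).
stars = np.array([3, 4, 5, 3, 4])
price = np.array([80, 120, 200, 90, 110])

# Dependent variable (DV): rooms booked, which we assume depends on the IVs.
rooms_booked = np.array([40, 55, 70, 42, 50])

# A regression always pairs the DV with at least one IV:
X = np.column_stack([stars, price])   # design matrix of IVs
y = rooms_booked                      # target (DV)
print(X.shape, y.shape)
```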
One ground rule you need to know as a reader is this: no one becomes a data scientist without passing through the regression corridor. Regression algorithms need to be mastered, not just glanced through at a basic level. No one can solve real business problems without persistent effort to master regression.
As per Wikipedia – In statistical modelling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modelling and analysing several variables when the focus is on the relationship between a dependent variable and one or more independent variables.
In short, we can say this: "Regression is a parametric technique which uses a statistical process to draw out the values of predictive variables (unknown values) from given inputs of independent variables (known values)." Because of its parametric nature, the output can be either astonishing or disastrous, though accuracy improves over time with some tweaks.
Regression is like a futurist who looks at current data (variables) and applies calculation methods to derive predictions. It is illuminating to see how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. This technique is used in
- Forecasting Matters
- Time Series Modelling
- Determining the relationship between IV and DV, e.g. how many rooms will be booked in a hotel during a certain period of time.
Regression is a very important tool for analysing and modelling data. The whole idea is to reduce the distance between a line and the data points in order to get the best fit. Minimum distance produces minimum error (most likely).
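The idea of minimising the distance between the line and the points can be sketched as follows (synthetic data; `np.polyfit` performs the ordinary-least-squares fit):

```python
import numpy as np

# Synthetic points that roughly follow a line (assumed data).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# np.polyfit minimises the sum of squared vertical distances
# between the line and the data points (ordinary least squares).
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (slope * x + intercept)
sse = np.sum(residuals ** 2)  # total squared error of the best-fit line

# Any other line produces a larger squared error.
worse = np.sum((y - (1.5 * x + 1.0)) ** 2)
print(round(slope, 2), round(intercept, 2), sse < worse)
```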
Artificial intelligence and machine learning are often used interchangeably, but they are not the same. Machine learning is one of the most active areas of AI and a way to achieve it. Why is ML so successful today? There are several reasons, including but not limited to:
- The explosion of big data
- Hunger for new business and revenue streams in times of shrinking markets
- Advancements in machine learning algorithms
- Development of extremely powerful machines with high capacity and faster computing ability
- Storage capacity
Today's machines are learning and performing tasks that in the past could only be done by humans, such as making judgements, taking decisions and playing games. This is possible because machines can now analyse patterns and retain what they learn for future use. This post will help readers understand that machine learning is not just regression; it goes much beyond it. Today the major challenge is to find people skilled enough to apply and differentiate what they learned from university and PhD books in real business, rather than just arguing with others on social media.
Machine learning should be treated as a culture in an organisation where business teams, managers and executives should have some basic knowledge of this technology. To achieve this as a culture, there have to be continuous programs and roadshows for them. There are many courses which are designed for students, employees with little or no experience, managers, professionals and executives to give them a better understanding of how to harness this magnificent technology in their business.
Regression in Machine Learning
In machine learning, regression analysis is used to establish the relationship between independent variables and dependent variables: how are the independent variables going to affect the dependent variables, and is there a strong relationship or merely an incidental one between IV and DV? In this regard, machine learning is a goal, not a technique, while linear regression is a technique. ML can be achieved through many different means and techniques. In short, we can say this
- Regression – A technique to find the relationship between DV and IV, whose performance is measured by how closely the fitted line matches the data.
- Machine learning – It is measured by how good its results are for solving a certain problem, with whatever means necessary.
Regression techniques, especially linear regression, pave the way to more complex algorithms such as LASSO (Least Absolute Shrinkage and Selection Operator), ridge regression and least angle regression, which are close relatives and very useful in machine learning. Linear regression remains the foundation, though. When you really want to start understanding machine learning, keep linear regression in your pocket at all times. The journey towards the "Machine Learning Expert" dream surely starts with regression.
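As a sketch of how ridge regression builds on linear regression, the snippet below implements the ridge closed form with plain NumPy; the data and penalty value are invented for illustration:

```python
import numpy as np

# Ridge regression in closed form: a minimal sketch showing how the
# penalty alpha shrinks coefficients relative to plain OLS.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_beta = np.array([3.0, -2.0, 0.5])
y = X @ true_beta + rng.normal(0, 0.1, size=100)

def ridge(X, y, alpha):
    # (X'X + alpha*I)^(-1) X'y  -- alpha = 0 recovers ordinary least squares
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

beta_ols = ridge(X, y, alpha=0.0)
beta_ridge = ridge(X, y, alpha=50.0)

# The penalised coefficients are pulled towards zero.
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```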
Why Regression Analysis is Used
It is used for its simplicity and its ability to estimate the relationship between independent variables and dependent variables. Regression analysis comes out on top whenever there is a need to see or find relationships between what we have or do and what we will get from it. How many rooms will be booked in a hotel if the hotel has 3 or more stars, the price is good, the location is good, it is the holiday season, and attractive activities are available around the hotel?
Now take the same example for our hotel. Let’s say, we want to estimate the number of rooms to be sold out in the next 2 months based on the parameters given above. We have the recent guest data which indicates the number of rooms booked, the profile of guests, date of check-in and check-out, the location they travelled from, how much they spent in a hotel, and what activities they did through the concierge desk of the hotel.
Another use of regression besides establishing the relationship between IV and DV is the ability of comparability. Regression analysis allows us to compare the variable change effect from different scales. For example change in the price of a room based on the location of the hotel etc.
Using these insights, we can predict future room bookings based on current and past information. When we ask "Why regression?", the answer is simple: lucrative benefits from a simple analysis. First, its ability to establish, record and demonstrate the significance of the relationship between a dependent variable and an independent variable. Second, its ability to measure the impact of multiple independent variables on a dependent variable.
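To make the hotel example concrete, here is a small sketch that fits a multiple regression on invented hotel data; the features and numbers are hypothetical, not real guest data:

```python
import numpy as np

# Hypothetical hotel data: star rating, average price, holiday-season flag.
# All values are invented for illustration only.
X = np.array([
    [3, 90, 0],
    [4, 120, 0],
    [5, 200, 1],
    [3, 85, 1],
    [4, 110, 1],
    [5, 210, 0],
])
rooms_booked = np.array([40, 52, 75, 48, 58, 65])

# Add an intercept column and fit by least squares.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, rooms_booked, rcond=None)

# Predict bookings for a new 4-star hotel at price 130 in holiday season.
new_hotel = np.array([1, 4, 130, 1])
prediction = float(new_hotel @ coef)
print(round(prediction, 1))
```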
How Regression Works
Regression uses a linear mathematical function to predict the (output or dependent) variable based on the function given below. The goal is to find the best ordinary-least-squares regression line.
Y = a + bX + E
- Y – The dependent variable, i.e. the variable we predict
- X – The independent variable, i.e. the variable we use to make our predictions
- a – The intercept; the predicted value when X = 0
- b – The slope; it explains the change in Y when X changes by one unit
- E – The error or residual value; the difference between actual and predicted values. It is always there and cannot be ignored even if you use the OLS (ordinary least squares) technique. It is also a reminder to us that "the future is uncertain."
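The formula above can be computed directly; here is a minimal sketch using the classic OLS estimates for a and b (the sample points are invented):

```python
import numpy as np

# Estimating a (intercept) and b (slope) in Y = a + bX + E
# using the classic OLS formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# E (residuals): the part of y the line cannot explain.
residuals = y - (a + b * x)

print(round(a, 2), round(b, 2))
# Residuals of an OLS fit with an intercept sum to (numerically) zero.
print(abs(residuals.sum()) < 1e-9)
```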
The steps involved in regression analysis are as below
- Defining problem statement
- Specifying the required model
- Data collection (Quality data)
- Data analysis, i.e. descriptive analytics
- Estimation of unknown parameters
- Model evaluation
- Error margin calculation
- Using the data model for predictions
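The steps above can be sketched end to end on synthetic data:

```python
import numpy as np

# A minimal walk-through of the regression steps above, on synthetic data.

# 1-2. Problem and model: predict y from x with a straight line.
# 3. Data collection (a synthetic stand-in for quality data).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 80)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 80)

# 4. Descriptive analysis: a quick look at the data.
print("mean x:", round(x.mean(), 2), "mean y:", round(y.mean(), 2))

# 5. Estimate the unknown parameters (slope, intercept) via OLS.
b, a = np.polyfit(x, y, deg=1)

# 6-7. Evaluate the model and compute an error margin (RMSE here).
pred = a + b * x
rmse = np.sqrt(np.mean((y - pred) ** 2))
print("slope:", round(b, 2), "rmse:", round(rmse, 2))

# 8. Use the fitted model for a new prediction.
print("prediction at x=12:", round(a + b * 12, 1))
```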
You should check why OLS is the favourite technique for reducing error. There are many more, such as GLS (generalised least squares), PLS (percentage least squares), TLS (total least squares) and LAD (least absolute deviation). We will not discuss them here, though.
Another problem learners encounter, or a wrong perception they build at the beginning of the learning stage, concerns the types of regression. Learners often struggle to differentiate between regression as a single method and the many useful and popular regression algorithms for predictive modelling. Linear and logistic regression come first and are the most widely used. For this post, we will limit ourselves to the 2 most commonly used and 5 less commonly used regressions, as below
- Linear Regression – One of the most used and, due to its simplicity, most underrated. As it is simple to understand and use, it has gained the highest popularity. Linear regression is an extremely versatile method that can be used for predicting
- The temperature of the day or in an hour
- Likely housing prices in an area
- Likelihood of customers to churn
- Revenue per customer
- Logistic Regression – Here the target is discrete, unlike in linear regression where the target is an interval variable. Predicted values are probabilities of an event occurring, obtained by fitting the data to the logistic curve. Predictors can be numerical or categorical.
- Ridge Regression – A technique to mitigate the problem of the independent variables suffering from multicollinearity.
- ElasticNet Regression – A hybrid of the Lasso and Ridge techniques, useful when the data contains multiple correlated features. One downside is that it can suffer from double shrinkage, but it also encourages a grouping effect among highly correlated variables. There is no limit on the number of selected variables.
- Polynomial Regression – Fits a curve to the data points rather than a straight best-fit line.
- Lasso Regression – Similar to Ridge Regression, but it shrinks coefficients towards zero, even exactly to zero. Because of this, it is useful for feature selection.
- Stepwise Regression – Technique to deal with multiple independent variables. It adds and removes predictors as needed for each step.
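As a sketch of how logistic regression differs from linear regression, the snippet below fits one by plain gradient descent on synthetic binary data; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

# Logistic regression sketch: the target is binary, and predictions are
# probabilities from the logistic (sigmoid) curve.
rng = np.random.default_rng(7)
x = rng.normal(size=200)
# Assumed true rule: probability of the event rises with x.
y = (1 / (1 + np.exp(-2.0 * x)) > rng.uniform(size=200)).astype(float)

w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):                      # plain gradient descent
    p = 1 / (1 + np.exp(-(w * x + b)))     # logistic curve
    w -= lr * np.mean((p - y) * x)         # gradient of log-loss wrt w
    b -= lr * np.mean(p - y)               # gradient of log-loss wrt b

# Probabilities stay inside (0, 1); classify by thresholding at 0.5.
p = 1 / (1 + np.exp(-(w * x + b)))
acc = ((p > 0.5) == (y == 1)).mean()
print(round(w, 2), round(acc, 2))
```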
Assumptions made in Regression
As mentioned above, "Regression is a parametric technique which uses a statistical process to draw out the values of predictive variables (unknown values) from given inputs of independent variables (known values)." Its parametric nature makes the assumptions below easy to state, though they also make the technique restrictive.
- Multicollinearity – Correlation among the independent variables is assumed not to exist.
- Heteroskedasticity – Assumed absent, i.e. the error terms are assumed to have constant variance (homoscedasticity).
- Data linearity – The dependent and independent variables are assumed to have a linear and additive relationship.
Improving the accuracy of a Regression Model
Improving accuracy by tweaking the data has very little scope in regression, unlike in other machine learning algorithms. With its assumptions, regression can give a pretty decent or a terrible result for similar problems with only a small change in the data. A few steps for modest improvement are as below:
- Multicollinearity – This issue can be detected with a correlation matrix; check for correlated variables and drop redundant ones.
- Data linearity – These issues can be dealt with by transforming the IVs using techniques such as log or square transforms.
- Heteroskedasticity – A simple remedy is to transform the DV using the techniques from the previous point.
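Both remedies can be sketched in a few lines: a correlation matrix to detect multicollinearity, and a log transform that turns an exponential relationship into a linear one (all data is synthetic):

```python
import numpy as np

# Multicollinearity check: a correlation matrix flags collinear pairs.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 * 0.95 + rng.normal(0, 0.1, 200)   # nearly a copy of x1
x3 = rng.normal(size=200)

# Off-diagonal values near +/-1 suggest dropping one of the pair.
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print("corr(x1, x2):", round(corr[0, 1], 2))

# Linearity fix: an exponential relationship becomes linear after np.log.
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x)
r_raw = np.corrcoef(x, y)[0, 1]          # imperfect linear correlation
r_log = np.corrcoef(x, np.log(y))[0, 1]  # essentially perfect after log
print(round(r_raw, 2), round(r_log, 2))
```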
Points to Note:
All credits, if any, remain with the original contributor only. We have covered the basics of machine learning: it is all about data, computing power and algorithms to look for information. In the previous post, we covered Generative Adversarial Networks, a family of artificial neural networks.
Books + Other readings Referred
- Research through open internet, news portals, white papers and imparted knowledge via live conferences & lectures.
- Lab and hands-on experience of @AILabPage (Self-taught learners group) members.
Feedback & Further Question
Do you have any questions about Supervised Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.
Conclusion – In this post, we discussed the basic concepts of regression. You should now be able to solve small regression problems. Though the focus of this post was theory, it was important to understand it. I particularly think that getting to know regression, its assumptions, their violations, model fit, residual plots and algorithms actually helps to see a somewhat clearer picture.
The answer to the question "What machine learning algorithm should I use?" is always "It depends." It depends on the size, quality and nature of the data, and on the objective behind torturing the data; the more we torture the data, the more useful information we get. It depends on how the maths of the algorithm was translated into instructions for the computer you are using. And it depends on how much time you have.
======================== About the Author ===================
Read about Author at : About Me
Thank you all, for spending your time reading this post. Please share your opinion / comments / critics / agreements or disagreement. Remark for more details about posts, subjects and relevance please read the disclaimer.