Regression – In the machine learning domain, the general perception is that "ML is just regression", which is a terribly incorrect perception. Regression analysis is used to understand the relationship between a set of variables. In data sets we observe two kinds of variables:
- Independent variables (IV) – Characteristics that can be measured directly.
- Dependent variables (DV) – Characteristics whose values depend on the independent variables.
The dependent variable "depends" on the independent variable; the two must coexist, and there can be no dependent variable without an independent variable. So, in a nutshell, quality data is a more important factor than just fancy ML architectures.
One ground rule you need to know as a reader: no one becomes a data scientist without passing through the regression corridor. Regression algorithms need to be mastered, not just glanced through or learned at a basic level. No one can solve real business problems without a persistent effort to master regression.
As per Wikipedia – In statistical modelling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modelling and analysing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
In short we can say this: "Regression is a parametric technique that uses a statistical process to estimate the values of predicted variables (unknown values) from given sets of independent variables (known values)." Because it is parametric in nature, the output can be either astonishing or disastrous; accuracy improves over time with some tweaks, though.
Regression analysis is a kind of crystal ball: a predictive modelling technique that examines the relationship between the dependent (predicted) variables and the independent variables. Like a futurist, it looks at current variables and applies calculation methods to derive predictions. It shows how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. This technique is used in:
- Forecasting
- Time series modelling
- Relationships between IV and DV, e.g. how many rooms will be booked in a hotel during a certain period of time.
Regression is a very important tool for analysing and modelling data. The whole idea is to reduce the distance between the line and the data points in order to get the best fit; the minimum distance will (most likely) produce the minimum errors.
Artificial intelligence and machine learning are often used interchangeably, but they are not the same. Machine learning is one of the most active areas of AI and a way to achieve it. Why is ML doing so well today? There are a couple of reasons, including but not limited to:
- Explosion of big data
- Hunger for new business and revenue streams in times when businesses are shrinking
- Advancements in machine learning algorithms
- Development of extremely powerful machines with high capacity and faster computing ability
- Storage capacity
Today's machines are learning and performing tasks that in the past could only be done by humans, like making better judgements and decisions, or playing games. This is possible because machines can now analyse and read through patterns and remember what they learn for future use. This post will help readers understand that machine learning is not just regression; it goes much beyond it. Today the major issue is finding resources skilled enough to demonstrate and apply their learning from university and PhD books in real business, rather than just arguing with others on social media.
Machine learning should be treated as a culture in an organisation, where business teams, managers and executives have some basic knowledge of the technology. Achieving this culture takes continuous programmes and road shows. There are many courses designed for students, employees with little or no experience, managers, professionals and executives, to give them a better understanding of how to harness this magnificent technology in their business.
Regression in Machine Learning
In machine learning, regression analysis is used to establish the relationship between the independent variables and the dependent variables: how the independent variables affect the dependent variables, and whether the relationship between IV and DV is strong or merely casual. In this regard machine learning is a goal, not a technique, while linear regression is a technique; ML can be achieved through many different means and techniques. In short we can say this:
- Regression – A technique for finding the relationship between DV and IV, whose performance is measured by how closely it fits an expected line.
- Machine learning – Measured by how good its results are at solving a certain problem, by whatever means necessary.
Regression techniques, especially linear regression, lead on to more complex algorithms such as LASSO (Least Absolute Shrinkage and Selection Operator), ridge regression and least angle regression, which are close relatives and very useful in machine learning. Linear regression remains the foundation, though. When you really want to start understanding machine learning, keep linear regression in your pocket at all times. The journey towards the "machine learning expert" dream starts with regression, for sure.
Why Regression Analysis is Used
It is used for its simplicity and its ability to estimate the relationship between independent variables and dependent variables. Regression analysis comes out on top whenever there is a need to find relationships between what we have or do and what we will get from it: how many rooms will be booked in a hotel if the hotel has 3 or more stars, the price is good, the location is good, it is the holiday season, there are activities available around the hotel, and so on.
Now take the same hotel example. Let's say we want to estimate the number of rooms to be sold in the next 2 months based on the parameters given above. We have recent guest data indicating the number of rooms booked, the profiles of guests, check-in and check-out dates, the locations they travelled from, how much they spent in the hotel, and all the activities they booked through the hotel's concierge desk.
Another use of regression, besides establishing the relationship between IV and DV, is comparability. Regression analysis allows us to compare the effects of variables measured on different scales, for example the change in room price based on the location of the hotel.
Using these insights, we can predict future room bookings based on current and past information. When we talk about "why regression", the answer is simple: lucrative benefits from simple analysis. First, its ability to establish, record and demonstrate the significance of relationships between the dependent variable and the independent variables; second, its ability to quantify the impact of multiple independent variables on a dependent variable.
How Regression Works
Regression uses a mathematical linear function to predict the output (dependent) variable based on the function given below. Its goal is to achieve the best fit, supported by an ordinary-least-squares regression line.
Y = a + bx + E
- Y – The dependent variable, the variable we predict
- x – The independent variable, the variable we use to make our predictions
- a – The intercept, the predicted value when x = 0
- b – The slope; it explains the change in Y when x changes by 1 unit
- E – The error or residual value, the difference between the actual and predicted values. It is always there and cannot be ignored, even if you use the OLS (ordinary least squares) technique. It is also a reminder that "the future is uncertain."
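As a minimal sketch, the function above can be fitted with NumPy's least squares; the hotel numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: rooms booked (Y) against average nightly price (x)
x = np.array([80, 90, 100, 110, 120], dtype=float)
y = np.array([52, 48, 45, 40, 37], dtype=float)

# Ordinary least squares fit of Y = a + b*x
b, a = np.polyfit(x, y, 1)     # degree-1 fit returns [slope, intercept]
y_hat = a + b * x              # predicted values
residuals = y - y_hat          # E: the part the line cannot explain

# Here a = 82.4 and b = -0.38: bookings fall as price rises
```

Note that with an intercept in the model, the residuals always sum to zero; the error term E never disappears, it just balances out on average.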
The steps involved in regression analysis are as follows:
- Defining problem statement
- Specifying the required model
- Data collection (Quality data)
- Data analysis i.e descriptive analytics
- Estimation of unknown parameters
- Evaluate model
- Error margin calculation
- Using data model for predictions
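The steps above can be sketched end to end with plain NumPy; the hotel figures below are hypothetical and only illustrate the flow:

```python
import numpy as np

# Data collection: hypothetical hotel data, columns = [stars, price], target = rooms booked
X = np.array([[3, 80], [4, 120], [5, 200], [3, 90],
              [4, 110], [5, 180], [3, 85], [4, 130]], dtype=float)
y = np.array([40, 55, 70, 42, 52, 68, 41, 57], dtype=float)

# Data analysis (descriptive analytics): a quick look at averages
print("feature means:", X.mean(axis=0), "target mean:", y.mean())

# Estimation of unknown parameters: least squares with an intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Model evaluation / error margin: mean squared error on the data
mse = np.mean((y - A @ coef) ** 2)

# Using the model for predictions: a 4-star hotel priced at 100
pred = np.array([1.0, 4.0, 100.0]) @ coef
```

In a real project the evaluation step would use held-out data rather than the training data, but the sequence of steps is the same.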
You should check why OLS is the favourite technique for reducing error. There are many more techniques for the same purpose, such as GLS (generalised least squares), PLS (partial least squares), TLS (total least squares) and LAD (least absolute deviations). We will not discuss them here, though.
Another problem learners encounter, or a wrong perception they build at the beginning of the learning stage, concerns the types of regression. Learners often struggle to see that there is no single regression method, and to identify the most useful or popular regression algorithms for predictive modelling. Linear and logistic regression come first and are the most widely used. For this post we will limit ourselves to the 2 most commonly used and 5 less commonly used regressions, as below:
- Linear Regression – One of the most used and, because of its simplicity, underrated. As it is simple to understand and use, it has gained the highest popularity. Linear regression is an extremely versatile method that can be used for predicting:
- Temperature of the day or in an hour
- Likely housing prices in an area
- Likelihood of customers to churn
- Revenue per customer
- Logistic Regression – Here the target is discrete, unlike in linear regression where the target is an interval variable. The predicted values are the probabilities of an event occurring, obtained by fitting the data to a logistic curve. Predictors can be numerical or categorical.
- Ridge Regression – A technique to mitigate the problem of the data (independent variables) suffering from multicollinearity.
- ElasticNet Regression – A hybrid of the Lasso and Ridge regression techniques, useful when multiple correlated features are present in the data. On the downside it can suffer from double shrinkage, but it also encourages a grouping effect in the case of highly correlated variables, and there is no limit on the number of selected variables.
- Polynomial Regression – A curved regression, with a curve that fits the data points rather than a best-fit straight line.
- Lasso Regression – Similar to ridge regression, but it shrinks coefficients towards zero, even to exactly zero. Because of this it is useful for feature selection.
- Stepwise Regression – A technique for dealing with multiple independent variables; it adds and removes predictors as needed at each step.
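Logistic regression from the list above can be sketched with plain NumPy and gradient descent on the log-loss; the churn data below is invented for illustration:

```python
import numpy as np

# Hypothetical churn data: x = months since last booking, y = churned (1) or not (0)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit intercept a and slope b by gradient descent on the logistic log-loss
a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    p = sigmoid(a + b * x)          # predicted churn probabilities
    a -= lr * np.mean(p - y)        # gradient of the log-loss w.r.t. a
    b -= lr * np.mean((p - y) * x)  # gradient of the log-loss w.r.t. b

p_two_months = sigmoid(a + b * 2.0)   # churn probability after 2 months
```

The fitted curve pushes the probability below 0.5 for short gaps and above 0.5 for long ones, which is exactly the discrete-target behaviour described above.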
Assumptions made in Regression
As mentioned above, "Regression is a parametric technique that uses a statistical process to estimate the values of predicted variables (unknown values) from given sets of independent variables (known values)." Making assumptions along the same line of thought is easy, though it makes the technique restrictive.
- Multicollinearity – No correlation exists among the independent variables.
- Homoscedasticity – The error term has constant variance; heteroskedasticity arises when this assumption is violated.
- Data linearity – The dependent and independent variables have a linear and additive relationship.
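A rough numeric sketch of what multicollinearity does and how ridge regression (listed earlier) mitigates it; the two near-duplicate predictors below are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic data: two near-duplicate predictors (strong multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam * I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_coef = ridge(X, y, 0.0)     # plain least squares: unstable coefficients
ridge_coef = ridge(X, y, 10.0)  # penalised: the two coefficients share the effect
```

With no penalty the two coefficients can take large offsetting values; with the penalty they settle near equal shares whose sum approximates the true effect of about 3.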
Improving the accuracy of a Regression Model
Unlike with other machine learning algorithms, improving accuracy by tweaking the data has very little scope in regression; the scope is quite limited. Because of its assumptions, regression can give a pretty decent or a terrible result for similar problems with only a little change in the data. A few steps for small improvements are as below:
- Multicollinearity – This issue can be addressed by using a correlation matrix to check for correlated variables.
- Data linearity – These issues can be dealt with by transforming the IVs using techniques like log, square root, etc.
- Heteroskedasticity – The simple way is to transform the DV using the same techniques as in the previous point.
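These improvement steps can be sketched in NumPy; the hotel features below are simulated purely for illustration:

```python
import numpy as np

# Simulated hotel features: price, star rating (tracks price), distance to centre
rng = np.random.default_rng(1)
price = rng.uniform(50, 250, size=100)
stars = np.clip(np.round(price / 50), 1, 5)     # deliberately correlated with price
distance = rng.uniform(0, 20, size=100)
features = np.column_stack([price, stars, distance])

# Multicollinearity: inspect the correlation matrix of the predictors
corr = np.corrcoef(features, rowvar=False)
print(corr.round(2))   # price and stars show a high correlation

# Linearity / heteroskedasticity: a log transform can linearise a skewed DV
bookings = np.exp(0.01 * price + rng.normal(scale=0.1, size=100))
log_bookings = np.log(bookings)   # roughly linear in price after the transform
```

A high off-diagonal entry in the correlation matrix flags a candidate variable to drop or combine; the log transform turns the exponential-looking bookings variable into one that is nearly linear in price.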
Points to Note:
All credit, if any, remains with the original contributors. We have covered all the basics around machine learning. Machine learning is all about data, computing power and algorithms looking for information. In the previous post we covered Generative Adversarial Networks, a family of artificial neural networks.
Books + Other readings Referred
- Research through the open internet, news portals, white papers and knowledge imparted via live conferences and lectures.
- Lab and hands-on experience of @AILabPage (self-taught learners group) members.
Feedback & Further Question
Do you have any questions about Supervised Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.
Conclusion – In this post we discussed the basic concepts around regression; you should now be able to solve small regression problems. Though the focus of this post was on theory, it was important to understand it. I particularly think that getting to know regression, its assumptions, violations, model fit, residual plots and algorithms actually helps to see a somewhat clearer picture. The answer to the question "What machine learning algorithm should I use?" is always "It depends." It depends on the size, quality and nature of the data, and on the objective or motive of torturing the data: the more we torture the data, the more useful information we get. It depends on how the maths of the algorithm was translated into instructions for the computer you are using. And it depends on how much time you have.
======================== About the Author ===================
Read about Author at : About Me
Thank you all for spending your time reading this post. Please share your opinions, comments, critiques, agreements or disagreements. For more details about posts, subjects and relevance, please read the disclaimer.