Training Data – There is a famous punch line about data: “Data is not good enough if it’s not quality data”. If your data model is not working or performing as expected, look first at your data and its source. Instead of struggling to find opportunities for performance tuning, look at improving the quality of the data fed into the model.

Machine Learning – Basics

AILabPage defines machine learning as “A focal point where business, data, experience meets emerging technology and decides to work together”. 

ML instructs an algorithm to learn for itself by analyzing data. Algorithms learn by mapping inputs to outputs (supervised learning), by detecting patterns (unsupervised learning), or through reward and punishment (reinforcement learning). Generally, the more quality data it processes, the smarter the algorithm gets.

In other words, machine learning algorithms “learn” from observations. When exposed to more observations, an algorithm typically improves its predictive performance. You can follow the post below for more basic details about machine learning.

Thanks to statistics, machine learning became very popular in the 1990s. Machine learning is about the use and development of fancy learning algorithms. The intersection of computer science and statistics gave birth to probabilistic approaches in AI, which shifted the field further toward data-driven approaches. Data science is more about extracting knowledge from data (knowledge discovery in databases, KDD) through algorithms to answer a particular question or solve a particular problem.

It All Boils Down to the Training Data

Setting up the data factory, deciding what comes in and what does not, and running continuous quality checks are the most crucial steps for any business to look at during initial setup, since the success of any data model depends heavily on its training data. The algorithms used are mostly off the shelf, and picking the correct one is another challenge. After sorting out the data, the data model, and the choice of algorithm, the last challenge is creating a high-quality data flow pipeline with check gates.
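The idea of a check gate can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: the function name `quality_gate`, the record fields `text` and `label`, and the three checks (missing values, unknown labels, duplicates) are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class GateReport:
    accepted: list   # records allowed into the training set
    rejected: list   # (record, reason) pairs kept for review


def quality_gate(records, required_fields, valid_labels):
    """Split raw records into accepted and rejected before training."""
    accepted, rejected, seen = [], [], set()
    for rec in records:
        # Check 1: every required field must be present and non-empty.
        missing = [f for f in required_fields if rec.get(f) in (None, "")]
        if missing:
            rejected.append((rec, "missing: " + ", ".join(missing)))
            continue
        # Check 2: the label must be one the model is meant to learn.
        if rec.get("label") not in valid_labels:
            rejected.append((rec, "unknown label"))
            continue
        # Check 3: exact duplicates add no information.
        key = tuple(rec[f] for f in required_fields)
        if key in seen:
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        accepted.append(rec)
    return GateReport(accepted, rejected)


raw = [
    {"text": "Win a free lotto ticket", "label": "spam"},
    {"text": "Meeting moved to 3pm", "label": "ham"},
    {"text": "", "label": "spam"},                      # empty text
    {"text": "Lunch tomorrow?", "label": "maybe"},      # label outside the schema
    {"text": "Meeting moved to 3pm", "label": "ham"},   # exact duplicate
]
report = quality_gate(raw, required_fields=("text", "label"),
                      valid_labels={"spam", "ham"})
print(len(report.accepted), "accepted,", len(report.rejected), "rejected")
for _, reason in report.rejected:
    print("rejected:", reason)
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it possible to trace quality problems back to the data source.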

Training Data

The digital transformation of any modern business requires quality data, not just any data. The three key success factors that help a good business grow with the correct digital marketing drive are:

  • Quality of Data collected.
  • Data Scientists.
  • Tools to visualise, analyse and summarise the data.

So if data is the new fuel of our time, then we must accept data scientists as the oil refineries and data tools as the important ingredients that help refine it and produce the desired results. Cognitive Analytics provides a 360-degree view of the data to make the correct decision at the right time. The output of Big Data Analytics paints an excellent picture in the categories below.

  • Descriptive Analytics
  • Diagnostic Analytics
  • Predictive Analytics
  • Prescriptive Analytics

Continuously and consistently feeding high-quality data to train the algorithm of choice for business applications is another challenging part of machine learning. In short, what we are saying is that “building an ecosystem to choose a correct algorithm, feed and train it, and label/categorize the data that gets integrated with data models” is all we need for machine learning. This is the core idea behind machine learning.

Model Regularisation – Poor Performance of Models 

Algorithms work well for many applications but can suffer from overfitting or underfitting. Regularisation is needed to rein in a model that has high variance, i.e. one that overfits; a model that underfits suffers from high bias instead. The data set for prediction or classification problems is always considered critical, as accuracy becomes the make-or-break point. The implementation happens in two parts:

  • First, implementing a designed model on the training data set
  • Second, testing its accuracy on data held back from training.
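A minimal sketch of why the accuracy check is worth running on data the model has not seen. The synthetic data, the 20% label noise, and the 1-nearest-neighbour model are all assumptions chosen for illustration; this model memorises its training set, which makes the gap easy to see.

```python
import random


def make_data(n, rng, noise=0.2):
    """Points in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label   # simulate a mislabelled record
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorisation)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in `data` that the 1-NN model labels correctly."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)


rng = random.Random(0)            # fixed seed so the run is repeatable
train_set = make_data(200, rng)
test_set = make_data(100, rng)

train_acc = accuracy(train_set, train_set)
test_acc = accuracy(train_set, test_set)
print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

Each training point is its own nearest neighbour, so the training accuracy is perfect even though a fifth of the labels are wrong. Only the held-out figure says anything about how the model will behave on new data.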

Looking at the outcome, i.e. the accuracy achieved, forces the data scientist to decide whether to try to improve it by playing around with data feature selection, i.e. feature engineering. Poor performance can have two causes:

  • The quality of the data needs to be looked at again
  • The model chosen is too simple or too complex

A word of caution here: we will not always get the results we want, and we might get poor results as well. That’s machine learning for us.
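The post does not name a specific regularisation method, so the sketch below uses ridge (L2) regularisation as one common example. It fits a single slope with no intercept; the data points and the penalty values are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimises sum((y - w*x)**2) + lam * w**2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x with a little noise

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>5}: slope={ridge_slope(xs, ys, lam):.3f}")
```

With `lam = 0` this is ordinary least squares. As the penalty grows, the slope is pulled toward zero: the model becomes simpler and less sensitive to noise in the training data, at the cost of fitting that data less closely. Choosing the penalty is the trade-off between a model that is too complex and one that is too simple.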

Data as Greatest Natural Resource – Data Intelligence

Social media ranks first among data generation sources, and the winners there are doing an excellent job. Second is payment data, which is as big as social media. Payments on mobile for e-commerce, online food orders, etc. in Africa and Asia combined are almost 30 to 50 times more than in the U.S. Of course, all this data is quality data for making more money as well as for improving the user experience. Data is also used as a yardstick for comparing algorithms.

So, coming back to our core discussion point: the importance of training data in machine learning, which we also call learning by choosing the best algorithm. If the data fed in for training is not of the correct quality and standard, the machine will only give us garbage; there would be no machine learning, only machine spoiling, maybe. Machines execute algorithms on data as a fixed sequence of steps, and on completing a task a machine can evaluate and track the performance of each algorithm to find the best one. A machine’s performance increases over time:

  • With more and more quality data
  • As soon as the machine stumbles upon the algorithm it needs.

This appears to the outside world as if the machine is gradually learning over time to master the task it has been assigned to.
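Evaluating and tracking candidate algorithms can be sketched as follows. The three candidate rules and the tiny validation set are hypothetical; in practice the candidates would be trained models and the validation data would be far larger.

```python
def score(rule, data):
    """Fraction of (x, label) pairs that the rule labels correctly."""
    return sum(rule(x) == y for x, y in data) / len(data)


# Hypothetical validation data: label is 1 for "large" values of x.
validation = [(0.1, 0), (0.2, 0), (0.35, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

# Candidate "algorithms": here just simple decision rules.
candidates = {
    "always_zero": lambda x: 0,
    "threshold_0.3": lambda x: int(x > 0.3),
    "threshold_0.5": lambda x: int(x > 0.5),
}

# Track the performance of every candidate, then keep the best one.
scores = {name: score(rule, validation) for name, rule in candidates.items()}
best = max(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
print("best:", best)
```

The choice is only as good as the validation data: if those labels are wrong or unrepresentative, the machine will confidently pick the wrong rule.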

Let’s take an example to demystify the jargon above: scanning email for spam. The filter processes each email and tags it as SPAM or NOT SPAM. The algorithm behind this picks up on words like “Lotto, Free, Casino, Next of Kin” etc. The more emails the filter processes, the stronger it gets, as it simply maps input to output. If, in the same case, the data it starts getting is “chocolate, candy, sweet or love” etc., you can imagine the performance. Machine learning can diligently evaluate millions of keywords or word-list variants to pick the ones that most accurately detect spam.
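The spam example can be sketched as a small word-counting filter. This is a simplified, naive-Bayes-style score (per-word log-odds with add-one smoothing and no class prior), not a description of how any real mail filter works; the training messages and the names `train`, `spam_score`, and `classify` are made up for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Count how often each word appears in spam and in non-spam ("ham")."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def spam_score(counts, text):
    """Sum of per-word log-odds with add-one smoothing; above 0 leans spam."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    total = 0.0
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + 1) / (n_spam + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (n_ham + len(vocab))
        total += math.log(p_spam / p_ham)
    return total


def classify(counts, text):
    return "SPAM" if spam_score(counts, text) > 0 else "NOT SPAM"


good_training = [
    ("free lotto win casino", "spam"),
    ("next of kin claim free prize", "spam"),
    ("casino bonus free spins", "spam"),
    ("project meeting at noon", "ham"),
    ("lunch with the team tomorrow", "ham"),
    ("please review the project report", "ham"),
]
model = train(good_training)
print(classify(model, "free casino prize"))       # words seen mostly in spam
print(classify(model, "team meeting tomorrow"))   # words seen mostly in ham

# Training data that says nothing about real spam teaches the filter nothing.
off_topic = train([("chocolate candy sweet", "spam"), ("love sweet candy", "ham")])
print(classify(off_topic, "free casino prize"))
```

The first model flags the lottery-style message and lets the meeting note through. The second model, fed off-topic words, has no evidence about “free”, “casino”, or “prize” and lets the same spam message pass, which is the point about training data quality.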

Points to Note:

All credit, if any, remains with the original contributors only. We have covered the basics of data models and the importance of quality data and training data. The next post will talk about implementation, usage, and practical experience for markets.

Books + Other readings Referred

  • Research through open internet, news portals, white papers and imparted knowledge via live conferences & lectures.
  • Lab and hands-on experience of  @AILabPage (Self-taught learners group) members.

Feedback & Further Question

Do you have any questions about AI, Machine Learning, Data Science or Big Data Analytics? Leave a question in the comment section or ask via email. We will try our best to answer it.

Machine Learning (ML) - Everything You Need To Know

Conclusion – With the rise of interest in machine learning, there are a couple of different perspectives out there on the similarities between statistics and ML. One goes from the general to the specific and the other from the specific to the general, but as a matter of fact the two disciplines can’t be divorced; they are better known as two sides of the same coin. They represent two key aspects of data science that should become integrated in the long run.

So statistical machine learning may come up soon. Statistics departments cannot run without people who have programming skills, so it seems reasonable to include computer science classes in a statistics curriculum. The two are taught the same way, using the same books and the same mathematics. Whether to choose an inductive or a deductive research methodology depends on the data and the research objective.


By V Sharma

A seasoned technology specialist with over 22 years of experience, I specialise in fintech and possess extensive expertise in integrating fintech with trust (blockchain), technology (AI and ML), and data (data science). My expertise includes advanced analytics, machine learning, and blockchain (including trust assessment, tokenization, and digital assets). I have a proven track record of delivering innovative solutions in mobile financial services (such as cross-border remittances, mobile money, mobile banking, and payments), IT service management, software engineering, and mobile telecom (including mobile data, billing, and prepaid charging services). With a successful history of launching start-ups and business units on a global scale, I offer hands-on experience in both engineering and business strategy. In my leisure time, I'm a blogger, a passionate physics enthusiast, and a self-proclaimed photography aficionado.

