Convolutional neural network – In this article, we give an intuitive, high-level explanation of convolutional neural networks (CNNs) in simple language. CNNs are inspired by the structure of the brain, but our focus will not be on neuroscience, as we do not have any expertise or academic knowledge of the biological aspects. We are going artificial in this post. CNNs are a class of neural networks that have proven very effective in image recognition, processing and classification.
Convolutional neural networks are a special kind of multi-layer neural network.
What are Convolutional Neural Networks
Convolutional neural networks (CNNs) might look like magic to many, but in reality it is just science and mathematics. CNNs are a class of neural networks that have proven very effective in image recognition, so in most cases they are applied to image processing. The network is a variation of the multilayer perceptron designed for processing and classification. It is a deep learning algorithm that takes an image as input, assigns learnable weights and biases to the objects in it, and is finally able to differentiate images from each other.
As per Wiki – In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analysing visual imagery.
The artificial intelligence solutions built on CNNs are transforming how businesses and developers create user experiences and solve real-world problems. CNNs are sometimes described as an application of neuroscience to machine learning. They employ a mathematical operation known as “convolution”, which is a specialised kind of linear operation.
Applications of convolutional neural networks include high-calibre AI systems such as AI-based robots, virtual assistants and self-driving cars. Other common applications are:
- Image Processing
- Video labelling
- Text analysis
- Speech Recognition
- Natural language processing
- Text classification
Data Processing – Convolutional Neural Network
CNNs process data that has a grid-like topology. The topology is called grid-like because the processing exploits the spatial correlation between neighbouring data points.
- 1D Grid – Time series data – Takes samples at regular time intervals
- 2D Grid – Image data – Grid of pixels
These neural networks use convolution instead of general matrix multiplication in at least one of their layers. Convolution leverages:
- Equivariant representations – This simply means that if the input shifts, the output shifts in the same way.
- Sparse interactions – The filter is much smaller than the input, so each output value depends on only a few neighbouring inputs. This allows the network to efficiently describe complicated interactions between many variables from simple building blocks.
- Parameter sharing – Using the same parameter for more than one function in a model; the same filter weights are reused at every position of the input.
These three ideas are what allow convolution to improve a machine learning system. CNNs also allow working with inputs of variable size.
A significant limitation of these neural networks is their constraint at the API level: the input (e.g. an image) and the output (e.g. class probabilities) are both fixed-size vectors. Even the computation is performed as a mapping through a fixed number of layers.
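The three properties listed above can be seen in a very small 1D example. The sketch below is only an illustration: the function name `conv1d`, the kernel and the signal are made up for this post, and like most deep learning libraries it slides the kernel without flipping it (strictly speaking a cross-correlation).

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over a 1D signal (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Parameter sharing: the same three weights are reused at every position.
kernel = [1, 0, -1]

# Sparse interactions: each output value only looks at 3 neighbouring inputs.
signal = [0, 0, 5, 5, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]      # the same signal, moved two steps right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Equivariance: shifting the input shifts the output by the same amount.
print(out)          # [-5, -5, 5, 5, 0, 0]
print(out_shifted)  # [0, 0, -5, -5, 5, 5]
```

A fully connected layer on the same 8-value input would need a separate weight for every input-output pair, while the convolution above gets by with three weights in total.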
Some history around – Convolutional Neural Networks (CNN’s)
ConvNets have been successful in identifying faces, objects and traffic signs, apart from powering vision in robots and self-driving cars. Yann LeCun's pioneering architecture was named LeNet-5 after many previous successful iterations since 1988. LeNet was one of the very first convolutional neural networks, and it helped push deep learning forward.
In 2012, Alex Krizhevsky used a convolutional neural network to win the ImageNet competition, and ever since then all the big companies have been racing to adopt them. CNNs are among the most influential innovations in the computer vision field. In the 1990s, the LeNet architecture was used mainly for character recognition tasks such as reading zip codes and digits.
Image Processing – Human vs Computers
For humans, recognising objects is one of the first skills we learn, right from birth. A newborn baby starts recognising faces as Papa, Mumma and so on. By the time we turn into adults, recognition has become an effortless and almost automatic process.
The way humans process images is very different from the way machines do. Just by looking around, humans immediately characterise the scene and give each object a label without even consciously noticing.
For a computer, recognising objects is more complex: it sees everything as an input and an output, where the output is a class or a set of classes. This is known as image processing, which we will discuss in detail in the next section. In computers, CNNs do image recognition and image classification. They are very useful in object detection and face recognition, and have also had success in various text classification tasks with word embeddings.
So in simple words, computer vision is the ability to automatically understand any image or video based on visual elements and patterns.
Inputs and Outputs – How it works
Our focus in this post will be on image processing only.
CNN models have to be trained and tested. Each input image passes through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers, and a softmax function is then applied to classify the object with probabilistic values between 0 and 1. Softmax is a generalisation of the logistic function that “squashes” a K-dimensional vector of arbitrary real values into a K-dimensional vector of real values between 0 and 1 that add up to 1. To be processed this way, every image in a CNN is represented as a matrix of pixel values.
The convolutional neural network classifies an input image into categories, e.g. dog, cat, deer, lion or bird.
The convolution + pooling layers act as feature extractors from the input image, while the fully connected layer acts as a classifier. In the figure above, on receiving a deer image as input, the network correctly assigns it the highest probability (0.94) among all four categories. The sum of all probabilities in the output layer should be one. There are four main operations in the ConvNet shown in the image above:
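To make the softmax step concrete, here is a minimal sketch in plain Python. The class scores are hypothetical numbers chosen for illustration, not values from a real network.

```python
import math

def softmax(scores):
    """Map K real-valued scores to K probabilities that add up to 1."""
    peak = max(scores)  # subtracting the largest score avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 0.5, 4.0]  # hypothetical raw class scores
probs = softmax(scores)

print(probs)                    # every value lies between 0 and 1
print(sum(probs))               # the values add up to 1 (up to rounding)
print(probs.index(max(probs)))  # index of the predicted class
```

The largest score always gets the largest probability, so softmax does not change which class wins; it only turns the scores into values that can be read as probabilities.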
- Convolution
- Non-linearity (ReLU)
- Pooling or Sub Sampling
- Classification (Fully Connected Layer)
These operations are the basic building blocks of every convolutional neural network, so understanding how they work is an important step towards developing a sound understanding of ConvNets or CNNs.
Convolutional Neural Networks – Architecture
As mentioned with the picture above, a convolutional neural network is built from layers. A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations into another through a differentiable function. We use four main types of layers to build the ConvNet architecture above:
Convolutional Layer, ReLU, Pooling, and Fully Connected Layer.
Training starts with the initialisation of all filters and parameters/weights with random values.
The input holds the raw pixel values of the training image. In the example above, an image (deer) of width 32, height 32 and three colour channels (R, G, B) is used. It goes through the forward propagation step, and the network finds the output probabilities for each class. The convolutional layer preserves the spatial relationship between pixels by learning image features using small squares of input data.
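The idea of learning features from "small squares of input data" can be sketched by sliding a small filter over an image. This is an illustration only: the 5x5 image, the 3x3 filter and the function name `conv2d` are made up, the filter moves one pixel at a time without padding, and in a real network the filter values would be learned rather than written by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # multiply the small square of pixels by the filter and add up
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Each number in the result is computed from one 3x3 patch of the image, which is why neighbouring pixels stay related in the feature map.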
- Let's assume the output probabilities for the image above are [0.2, 0.1, 0.3, 0.4].
- The size of the feature map is controlled by three parameters:
- Depth – The number of filters used for the convolution operation.
- Stride – The number of pixels by which the filter matrix slides over the input matrix.
- Padding – It is often convenient to pad the input matrix with zeros around the border, so that the filter can also be applied to the bordering elements of the input matrix.
- The total error at the output layer is calculated by summing over all 4 classes.
- Total Error = ∑ ½ (target probability – output probability)²
- The convolutional layer computes the output of neurons that are connected to local regions in the input. This may result in a volume such as [32x32x16] for 16 filters.
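The two numeric pieces of the list above, the feature map size and the total error, can be worked through in a few lines. This is a sketch under assumptions: the 5x5 filter size and the target vector (fourth class correct) are not given in this post and are picked only for illustration, and the size formula is the standard one for a square input.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of the feature map for one spatial dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def total_error(target, output):
    """Total Error = sum over classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

# A 32-pixel side with an assumed 5x5 filter, stride 1:
same_size = feature_map_size(32, 5, stride=1, padding=2)  # padding keeps 32
no_padding = feature_map_size(32, 5, stride=1, padding=0) # shrinks to 28
depth = 16                                                # one map per filter
print([same_size, same_size, depth])                      # [32, 32, 16]

output = [0.2, 0.1, 0.3, 0.4]   # output probabilities from the list above
target = [0, 0, 0, 1]           # assumed: the fourth class is the right one
error = total_error(target, output)
print(round(error, 4))          # 0.25
```

During training, this error is what the network tries to reduce by adjusting the filter values and weights.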
Rectified Linear Unit (ReLU) Layer
This layer applies an element-wise, non-linear activation function. ReLU is used after every convolution operation. It is applied per pixel and replaces all negative pixel values in the feature map with zero. This leaves the size of the volume unchanged ([32x32x16]).
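ReLU is simple enough to write in one line. The sketch below uses a made-up 2x3 feature map to show that negative values become zero while the shape stays the same.

```python
def relu(feature_map):
    """Replace every negative value with zero, element by element."""
    return [[max(0, value) for value in row] for row in feature_map]

feature_map = [
    [-3, 2, 0],
    [5, -1, 7],
]  # hypothetical values coming out of a convolution

rectified = relu(feature_map)
print(rectified)  # [[0, 2, 0], [5, 0, 7]]
```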
Pooling Layer
Pooling is also called subsampling or downsampling. The pooling layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x16], i.e. it reduces the dimensionality of each feature map but retains the most important information.
- Max pooling operates on a rectified feature map and keeps the largest value in each window.
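Max pooling can be sketched as follows. The 2x2 window with stride 2 is an assumption (it is the choice that halves 32x32 to 16x16), and the 4x4 input values are made up for illustration.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]  # hypothetical rectified feature map

pooled = max_pool(rectified)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 map becomes 2x2: width and height are halved, and the strongest response in every window survives.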
Fully Connected Layer
In a fully connected layer, each node is connected to every node in the adjacent layer. The FC layer computes the class scores with a traditional multilayer perceptron that uses a softmax activation function in the output layer. It results in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10.
The main job of this layer is to take the input volume that comes out of the preceding conv, ReLU or pool layer and arrange it into an N-dimensional vector, where N is the number of classes that the program has to choose from.
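A rough sketch of this last step: the volume is flattened into one long vector, and every value in it contributes to every class score. The tiny 2x2x2 volume, the three classes, and all weights and biases below are hypothetical; in a real network the weights are learned during training.

```python
def flatten(volume):
    """Unroll a [height][width][depth] volume into one flat vector."""
    return [value for row in volume for cell in row for value in cell]

def fully_connected(inputs, weights, biases):
    """One score per class; every input is connected to every output node."""
    return [
        sum(w * x for w, x in zip(class_weights, inputs)) + bias
        for class_weights, bias in zip(weights, biases)
    ]

volume = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]  # hypothetical 2x2x2 output of the last pooling layer

features = flatten(volume)  # 8 values
weights = [
    [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],  # class 0
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # class 1
    [0.1] * 8,                                 # class 2
]
biases = [0.0, -1.0, 0.0]

scores = fully_connected(features, weights, biases)
print(len(scores))                     # 3, i.e. N = number of classes
print([round(s, 2) for s in scores])   # [1.5, 1.0, 0.5]
```

Passing these N scores through a softmax function turns them into the class probabilities described earlier.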
Convolutional Neural Networks – Real life Business Use Cases
Many modern companies are using CNNs as the backbone of their business, e.g. Pinterest uses them for home feed personalisation and Instagram for its search infrastructure. Three of the biggest use cases are listed below.
- Automatic Tagging Algorithms – Tagging, or social bookmarking, refers to the action of associating a relevant keyword or phrase with an entity (e.g. a document, image or video). Experiments on automatic tagging suggest that an effective time-frequency representation matters and that more complex models benefit from more training data.
- Photo Search – Finding images that are similar to an image the user supplies, or that match a text query, is available in Google search results. It works well in the Chrome app. Google’s algorithms rely on more than 200 unique signals or “clues” that make it possible to guess what you are searching for. The attributes here include websites, the age of the content, the IP-address-based region and PageRank. Sadly, the results can be highly biased by skin colour. You can give it a try though.
- Product Recommendations – Large-scale recommender systems are in use in almost every e-commerce, retail, video-on-demand or music streaming business. Algorithms in recommender systems are typically classified into two categories, content-based and collaborative filtering methods, although modern recommenders combine both approaches.
Points to Note:
All credit, if any, remains with the original contributors only. We have covered convolutional neural networks in this post. The last post was on Supervised Machine Learning. The next post will talk about reinforcement machine learning.
Feedback & Further Question
Do you have any questions about Deep Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.
Conclusion – This post was an attempt to explain the main concepts behind convolutional neural networks in simple terms. A CNN is a neural network with some convolutional layers and some other layers. A convolutional layer has a number of filters that perform the convolution operation. The process of building a CNN always involves four major steps, i.e. convolution, pooling, flattening and full connection, which were covered in detail. Choosing the parameters, applying filters with strides (and padding if required), performing the convolution on the image and applying the ReLU activation to the resulting matrix is the core process in a CNN, and if you get this wrong the whole joy is over then and there.
Thank you all for spending your time reading this post. Please share your opinions, comments, criticism, agreements or disagreements. For more details about posts, subjects and relevance, please read the disclaimer.