Transformer –  Transformers are a type of neural network architecture known for dependably producing strong results on the tasks they are assigned, which is why they are gaining popularity so rapidly. We all know deep learning is a powerful tool, a game-changer that can play both roles, i.e. a saviour and a cause of fiasco. Transformers are used by big names such as OpenAI, and by DeepMind for AlphaStar. The Transformer model uses attention to speed up training while maintaining accuracy. It won’t be wrong to say that transformers outperform the Google Neural Machine Translation model on specific tasks.


What is Deep Learning?

AILabPage defines deep learning as “Undeniably a mind-blowing synchronisation technique applied on the basis of 3 foundation pillars: large data, computing power, skills (enriched algorithms) and experience, which practically has no limits”.

Deep learning is a subfield of machine learning. It is entirely concerned with artificial neural networks, algorithms whose structure and function are inspired by the human brain (inspired only, please). Deep learning is now used with great ease to predict what once seemed unpredictable. In our opinion, “We all are so busy creating artificial intelligence by using a combination of non-bio neural networks and natural intelligence rather than exploring what we have in hand.”

Deep learning, a subset of machine learning, is a specialist with an extremely complex skill set that achieves far better results from the same data set. It is modelled loosely on the NI (Natural Intelligence) mechanics of the biological neuron system. Its skill set is complex because of the methods it uses for training: learning in deep learning is based on “learning data representations” rather than on “task-specific algorithms”, which is the case for other methods.

“I think people need to understand that deep learning is making a lot of things, behind the scenes, much better” – Sir Geoffrey Hinton

Human Brain  – It is a special, critical point of discussion for everyone and the puzzle of all time as well. How our brain is designed and how it functions can’t be covered in this post, as I am nowhere close to being a neuroscientist and can’t even dream of being one. Out of curiosity, I am tempted to compare artificial neural networks with the human brain (with the help of talk shows on such topics). It’s fascinating to me how the human brain is able to decode technologies, numbers and puzzles, handle entertainment, understand science, and set the body’s mode to pleasure, aggression, art, etc. How does the brain train itself to name a certain object after looking at just 2-3 images, where ANNs need millions of them?

Introduction to the Transformer

An astonishing neural network model named the “Transformer” came to light in 2017 from a Google-led team. The transformer is a deep learning model proposed in the paper “Attention Is All You Need”. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. Transformer neural networks (TNNs) have proven their effectiveness and worth, especially for natural language processing (#NLP) tasks. No, TNNs are not going to replace RNNs outright, which trace back to the work of David Rumelhart in 1986. Unfortunately, RNNs do have serious limitations. When RNNs are trained on long sequences, gradients tend to either explode out of control or vanish to nothing. The long short-term memory network (LSTM) came to the rescue to ease this shortcoming.
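
To make the gradient problem a bit more concrete, here is a minimal sketch in plain Python. It assumes a toy scalar, linear RNN (a simplification that is not part of the original discussion), where backpropagating through many time steps multiplies the gradient by the same recurrent weight again and again. The function name gradient_through_time is purely illustrative.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient factor after backpropagating through `steps` time steps
    of a toy scalar linear RNN: h_t = w * h_(t-1) + x_t.

    Here d h_T / d h_0 = w ** steps, so |w| < 1 shrinks the gradient
    towards zero and |w| > 1 blows it up.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


if __name__ == "__main__":
    for w in (0.5, 1.0, 1.5):
        factor = gradient_through_time(w, 50)
        print(f"w={w}: gradient factor after 50 steps = {factor:.3e}")
```

Real RNNs use weight matrices and non-linear activations, so the behaviour is less clean than in this toy case, but the repeated multiplication is the same basic reason gradients vanish or explode on long sequences.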

Thanks to advances in machine learning algorithms, falling storage prices and sizes, ever more computing power at lower cost, and an explosion in data generation of all kinds, new deep learning models are being introduced at a speed that is difficult to keep track of. The beauty of TNNs is in how they add value to neural networks through their staunch use of parallelization.

When machines are able to learn to classify and analyse data (any kind of data) by themselves, we can safely say “Yes”, we have achieved a small percentage of our deep learning goals. Deep learning-powered tools are able to recognise images that contain dogs, cats or any other object, and that too without the need to specify what a dog or cat looks like. It’s even going to the next level, where these tools are able to recognise the species, bio-specifics and other details in the image such as a table, chair, carpet, room size, etc. Which details can be recognised, and how, is advancing almost every day. The year 2020 saw many downsides besides Covid (the pandemic), including questions about how much of the AI industry is hype and how much is reality; let’s discuss in brief below.
Like recurrent neural networks (#RNNs), transformers are designed to handle sequential data. The good news is that the transformer is much more effective, efficient and speedy: training time is reduced because it does not require the sequential data to be processed in order. Transformers perform their task in a highly parallelised environment, which makes their architecture very resilient. The immediacy of the transformer speaks to the rapid rate of progress in machine learning and artificial intelligence.


It is one of the best deep learning models, helpful for everything from question answering to grammar correction and many more tasks. As of now, the transformer is in much the same state as convolutional neural networks were in 2012, and its architecture is still going through transformation (hopefully for the better). The good part is that it is already out of incubation.

In the above example, the Transformer is represented as a black box. An entire sequence (Thai characters) is parsed simultaneously in a feed-forward manner, resulting in a transformed output tensor. In the above picture, the output sequence is more concise than the input sequence. For NLP tasks, word order, spacing (Thai sentences have no spaces between words) and sentence length may vary substantially depending on the input language. If the English text below needed to be translated to Thai, it would require almost 3 times more effort with an RNN compared to the transformer.

“Accomplishments are those which get stuck delightfully in your memories and distinguish you from others; remember it’s not about winning over others all the time or competing with others, as that can lead to a fiasco sometimes. It’s all about your own achievements and rewards to yourself to be staunch, tranquil and harmonious.”

Remember, as mentioned above, that unlike other architectures for NLP (recurrent neural networks and LSTMs), Transformers have no recurrent connections and thus no real memory of previous states. Instead, Transformers perceive entire sequences simultaneously.
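
As a rough illustration of how a whole sequence can be looked at simultaneously, here is a minimal sketch of scaled dot-product self-attention in plain Python. It makes one loud simplifying assumption: the queries, keys and values are the input vectors themselves, whereas real Transformers first apply learned projections. The function names softmax and self_attention are illustrative, not taken from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with Q = K = V = sequence.

    ASSUMPTION: learned projection matrices are omitted to keep the
    sketch short. Each output row depends only on the inputs, never on
    another output row, so all rows could be computed in parallel.
    """
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        # Similarity of this position to every position in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        # Weighted average of all value vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, sequence))
             for i in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings
    for row in self_attention(tokens):
        print([round(v, 3) for v in row])
```

Note that nothing in the loop carries state from one position to the next, which is the contrast with a recurrent network.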


Recurrent neural networks and Transformers

Like recurrent neural networks, transformers are designed to handle sequential data, but they do so with a much more efficient and powerful method.

Recurrent neural networks are a linear architectural variant of recursive networks. They have a “memory”, which sets them apart from other neural networks. This memory retains information about what has been calculated in previous states. An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output.
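
The “memory” and the shared parameters described above can be sketched in a few lines of plain Python. This is a toy scalar RNN under assumed, made-up parameter values (w_in, w_rec and bias are hypothetical defaults chosen only for illustration), not a production implementation.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a toy scalar RNN over `inputs` and return every hidden state.

    The same three parameters are reused at every time step, and each
    hidden state depends on the previous one, so the loop has to run
    in order.
    """
    hidden = 0.0  # the "memory", empty before the first input
    states = []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


if __name__ == "__main__":
    # The first input still influences later states even though the
    # later inputs are zero.
    print(rnn_forward([1.0, 0.0, 0.0]))
```

Because each step needs the previous hidden state, the time steps cannot be computed in parallel, which is the limitation the transformer avoids.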

The transformers come in as a remedy for many of the issues of #RNN, as they do not require the sequential data to be processed in order. So you really don’t need to worry if you get your hands on the transformer directly, without first going through other neural networks. Transformers are the latest trendy deep learning (neural network) models for dealing with sequences, and are most prominent in machine translation.


Let’s look at Transformers in a Little Depth

We will pick up the same example as above: a Thai sentence (the first language) that the machine translation tool translates into another language (English).

“ผมต้องการ PS5” (I need ps5)

Following the Transformer architecture, we can zoom in a little on this transformer to see an encoding component, a decoding component and some connections between the two.


  • Encoders, all with an identical structure (2 levels: self-attention –> feedforward neural network)
  • Decoders, all with an identical structure (3 levels: self-attention –> encoder-decoder attention –> feedforward neural network)

Both form a stack of multiple levels, and the two stacks have the same number of levels, i.e. if the encoder stack has 5 levels then the decoder stack will have 5 as well. In each encoder, the input first enters the “Self-Attention” layer, and the output of this layer becomes the input for the feedforward layer. Decoders treat their input the same way, with one exception: between those two layers sits an encoder-decoder attention layer, which helps the decoder focus on relevant parts of the input sentence.
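
The layer layout described above can be written down as a small structural sketch in plain Python. It only records the order of the sub-layers and does no actual computation; the names ENCODER_SUBLAYERS, DECODER_SUBLAYERS, build_stack, build_transformer and trace are all hypothetical and chosen for illustration.

```python
ENCODER_SUBLAYERS = ("self-attention", "feed-forward")
DECODER_SUBLAYERS = (
    "self-attention",
    "encoder-decoder attention",
    "feed-forward",
)


def build_stack(sublayers, depth):
    """Return `depth` identical layers, each a tuple of sub-layer names."""
    return [tuple(sublayers) for _ in range(depth)]


def build_transformer(depth):
    """Build encoder and decoder stacks that share the same depth."""
    encoder = build_stack(ENCODER_SUBLAYERS, depth)
    decoder = build_stack(DECODER_SUBLAYERS, depth)
    return encoder, decoder


def trace(stack, label):
    """List the order in which data passes through a stack's sub-layers."""
    return [
        f"{label} {index + 1}: {name}"
        for index, layer in enumerate(stack)
        for name in layer
    ]


if __name__ == "__main__":
    encoder, decoder = build_transformer(depth=5)
    for step in trace(encoder, "encoder") + trace(decoder, "decoder"):
        print(step)
```

Keeping both stacks behind a single depth argument mirrors the point that the encoder and decoder always have the same number of levels.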


Points to Note:

In the paper Fine-Tuning Language Models from Human Preferences, OpenAI demonstrated how a transformer model such as GPT-2 can generate extremely humanlike text; its successor GPT-3 is also a transformer model. All credit, if any, remains with the original contributors only. This post covered the Transformer, a deep learning model for sequential data. The last post was on Supervised Machine Learning. The next post will talk about Deep Reinforcement Learning.


Feedback & Further Question

Do you have any questions about Deep Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.


Conclusion – This post was an attempt to explain the main concepts behind the Transformer. It also attempted to outline recent key advancements in the technology and to provide insight into areas in which deep learning can improve investigation. A CNN is a neural network with some convolutional layers and some other layers. The convolutional layer has a number of filters that perform the convolution operation. Building a CNN always involves four major steps, i.e. convolution, pooling, flattening and full connection. Choosing parameters, applying filters with strides and padding if required, performing convolution on the image and applying ReLU activation to the matrix is the core process in a CNN, and if you get this wrong the whole joy is over then and there.


============================ About the Author =======================

Thank you all for spending your time reading this post. Please share your opinions, comments, criticism, agreement or disagreement. For more details about posts, subjects and relevance, please read the disclaimer.

====================================================================

Posted by V Sharma

Technology specialist in Financial Technology(FinTech), Photography, Artificial Intelligence. Mobile Financial Services (Cross Border Remittances, Mobile Money, Mobile Banking, Mobile Payments), Data Science, IT Service Management, Machine Learning, Neural Networks and Deep Learning techniques. Mobile Data and Billing & Prepaid Charging Services (IN, OCS & CVBS) with over 15 years experience. Led start ups & new business units successfully at local and international levels with Hands-on Engineering & Business Strategy.
