Transformers are a neural network architecture that has become very popular because of its strong results across a wide range of tasks. Deep learning, widely recognized as a powerful tool, has significantly transformed the way we work. Big players use Transformers too: DeepMind in AlphaStar and OpenAI in its GPT models. By relying on attention, the Transformer speeds up training while preserving accuracy.

It is fair to say that on certain tasks, Transformers outperform the Google Neural Machine Translation model.

Introduction to the Transformer

A remarkable neural network model named the “Transformer” was introduced in 2017 by a Google-led team in the paper Attention is All You Need. A TensorFlow implementation is available as part of the Tensor2Tensor package. The Transformer neural network (TNN) has proven its effectiveness and worth, especially for natural language processing (NLP) tasks.

TNNs are not going to replace RNNs, which were introduced by David Rumelhart in 1986. Unfortunately, RNNs have serious limitations. When RNNs are trained on long sequences, gradients tend to either explode out of control or vanish to nothing. Long short-term memory (LSTM) networks came to the rescue to address this shortcoming.
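
To see why long sequences cause trouble, here is a minimal sketch, assuming a scalar linear recurrence with a single recurrent weight `w` and no activation function. Under that simplifying assumption, backpropagating through many time steps multiplies the gradient by `w` once per step. The function name `gradient_through_time` is hypothetical and only for illustration.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient factor after backpropagating through `steps` time steps
    of the scalar recurrence h_t = w * h_{t-1} (an illustrative assumption)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each time step contributes one more factor of w
    return grad


# A weight below 1 shrinks the gradient towards zero (vanishing),
# a weight above 1 blows it up (exploding), and exactly 1 keeps it stable.
for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, 50))
```

Real RNNs use weight matrices and nonlinear activations, but the same repeated multiplication is what drives gradients to vanish or explode.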

AILabPage defines the Transformer as “A special kind of computer software code is called a transformer that possesses the ability to acquire knowledge in self-learning mode. It utilizes a unique approach to attentiveness, focusing on specific segments of the data provided in order to identify the significant elements”.

Transformers are mainly applied in the domains of computer vision and natural language processing.

Thanks to advancements in machine learning algorithms, reductions in the price and size of storage, more and more computing power at lower cost, and an explosion in data generation of all kinds, new deep learning models are being introduced at a speed that is difficult to keep track of. The beauty of TNNs is in how they add value to neural networks through the heavy use of parallelization.

When machines are able to learn to classify and analyze data (any kind of data) by themselves, we can safely say, “Yes, we have achieved a small percentage of our deep learning goals.” Deep learning-powered tools are able to recognize images that contain dogs, cats, or any other object, without anyone needing to specify what a dog or cat looks like. They are even going to the next level, where they can recognize the species, bio-specifics, and other details in the image, like a table, chair, carpet, room size, etc.

Which details are recognized, and how, is getting more advanced almost every day. Besides the COVID pandemic, the year 2020 also raised questions about how much hype and how much reality there is in the AI industry. Let’s discuss these in brief below.

Transformers and DL: Power of Sequential Data and Beyond

Transformers are a specific type of deep learning architecture introduced in the paper “Attention is All You Need” in 2017. They are designed to handle sequential data and to address the limitations that traditional recurrent neural networks (RNNs) have with long-range dependencies.

  • The key innovation of Transformers is the self-attention mechanism, which allows the model to weigh the importance of different words or elements in the input sequence when processing each element.
  • Transformers have been especially successful in natural language processing tasks, including machine translation, language modeling, sentiment analysis, and text generation.
  • Transformers are a type of neural network architecture that has demonstrated remarkable capabilities in solving various complex problems, especially those related to natural language processing and sequential data.
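
The self-attention idea described above can be sketched in a few lines. This is a minimal sketch of scaled dot-product attention in plain Python, assuming the token vectors themselves serve as queries, keys, and values; a real Transformer first passes the inputs through learned linear projections, which are omitted here. The toy input numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this element against every element of the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequence: three "tokens", each a 2-dimensional vector (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, which is the "weighing of importance" the first bullet refers to.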

Deep learning is a subset of machine learning that involves training artificial neural networks with multiple layers to learn from data and make predictions or decisions. These deep neural networks are capable of automatically learning hierarchical representations from the input data, allowing them to handle complex tasks such as image recognition, natural language processing, speech recognition, and more.

  • Deep learning is used with too much ease to predict the unpredictable. In our opinion, “we are all so busy creating artificial intelligence by using a combination of non-biological neural networks and natural intelligence rather than exploring what we have in hand.”
  • AILabPage defines Deep learning as “An innovative approach that is undoubtedly remarkable and extremely speedy and relies on three crucial elements: substantial volumes of information, tremendous computational capability, and cutting-edge algorithms and proficiency”. The potential of deep learning seems to have no boundaries.
  • Deep learning requires a specialist with an extremely complex skillset to achieve far better results from the same data set.
  • Learning in deep learning is based on “learning data representations” rather than “task-specific algorithms,” which is the case for other methods.

It’s so fascinating to me how the human brain can decode technologies, numbers, and puzzles, handle entertainment, understand science, and set the body into modes of pleasure, aggression, art, etc. How does the brain train itself to name a certain object after looking at just 2-3 images when ANNs need millions of them?

Transformers: Revolutionizing Sequential Data Handling

The best deep learning models are helpful for everything from question answering to grammar correction and many more tasks. The Transformer is in the same state that convolutional neural networks were in 2012: its architecture is still going through transformation (which should be for the better). The good part is that it’s out of incubation already.

Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data. The good news is that the Transformer is more efficient and faster to train, because it does not require the sequential data to be processed in order, one element at a time. Because all positions of a sequence can be processed in parallel, Transformers make good use of high-performance parallel hardware.

The immediacy of the transformer speaks to the rapid rate of progress in machine learning and artificial intelligence.


In the above example, the Transformer is represented as a black box. An entire Thai sequence is processed simultaneously in a feed-forward manner, resulting in a transformed output tensor. In the above picture, the output sequence is more concise than the input sequence. For NLP tasks, word order, spacing (Thai sentences have no spaces between words), and sentence length may vary substantially depending on the input language. Now consider the English text below: translating it to Thai would require almost three times more effort with an RNN than with a Transformer.

“Accomplishments are those that get stuck delightfully in your memories and distinguish you from others; remember, it’s not about winning over others all the time or competing with others, as that can lead to fiascos sometimes. It’s all about your own achievements and rewards to yourself for being staunch, tranquil, and harmonious.”

Remember, as mentioned above, unlike other NLP architectures such as recurrent neural networks and LSTMs, Transformers have no recurrent connections and thus no real memory of previous states. Instead, they perceive entire sequences simultaneously.

Recurrent neural networks and Transformers

Like recurrent neural networks, Transformers are designed to handle sequential data, but with a much more efficient and powerful method.

Recurrent neural networks are a linear architectural variant of recursive networks. They have a “memory,” which sets them apart from other neural networks. This memory retains information about what was calculated in previous states. An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output.

Transformers address many of the issues of RNNs, chiefly because they do not require the sequential data to be processed in order. So you really don’t need to worry about diving directly into Transformers without first working through other neural networks. Transformers are the latest trendy deep learning (neural network) model, most prominent in machine translation and other tasks that deal with sequences.
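
The "memory" and shared parameters of an RNN can be illustrated with a minimal sketch, assuming a scalar hidden state and a tanh activation. The names `rnn_forward`, `w_x`, `w_h`, and `b` are hypothetical and chosen only for this example.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Scalar RNN step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).

    The same three parameters are reused at every time step, and each
    step needs the previous hidden state, so the loop is strictly sequential.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # memory of earlier inputs lives in h
        states.append(h)
    return states


# Only the first input is non-zero, yet later states are non-zero too:
# the hidden state carries information forward from step to step.
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```

Because every step depends on the one before it, the time steps cannot be computed in parallel, which is the bottleneck that attention removes.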

Let’s Look at Transformers in a Little Depth

The success of Transformers in handling complex and context-rich information showcases their potential to drive advancements and breakthroughs in various real-world applications.

As research and development in Transformers continue to progress, we can anticipate further transformative impacts on the way machines understand and process sequential data, paving the way for more sophisticated AI systems and enhancing our capabilities in solving challenging problems across different domains.

We will pick up the same example as above: a Thai sentence (the source language) that the machine translation tool translates into another language (English).

“ผมต้องการ PS5” (I need a PS5)

Following the Transformer architecture, we can zoom in a little on this black box to see an encoding component, a decoding component, and the connections between the two:

  • Encoders, all with an identical structure (2 sub-layers: self-attention -> feedforward neural network)
  • Decoders, all with an identical structure (3 sub-layers: self-attention -> encoder-decoder attention -> feedforward neural network)

Both form a stack of multiple layers, and the two stacks have the same depth, i.e., if the encoder stack has 5 layers, then the decoder stack will also have 5. The encoder’s input first enters the self-attention layer, and the output from this layer becomes the input for the feedforward layer. Decoders work the same way, with one exception: between these two sits an additional encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
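
The layer ordering described above can be traced with a structural sketch. This is not a working Transformer: the sub-layers are stand-in functions that only record when they are called, and residual connections, layer normalization, and learned per-layer parameters are left out. All function names and the word-level split of the Thai sentence are assumptions made for illustration.

```python
trace = []  # records the order in which sub-layers run


def self_attention(x):
    trace.append("self-attention")
    return x  # stand-in: a real layer would mix information across positions


def encoder_decoder_attention(y, memory):
    trace.append("encoder-decoder attention")
    return y  # stand-in: a real layer would attend over the encoder output


def feed_forward(x):
    trace.append("feed-forward")
    return x  # stand-in: a real layer would transform each position


def encoder_layer(x):
    # Encoder: self-attention -> feed-forward
    return feed_forward(self_attention(x))


def decoder_layer(y, memory):
    # Decoder: self-attention -> encoder-decoder attention -> feed-forward
    y = self_attention(y)
    y = encoder_decoder_attention(y, memory)
    return feed_forward(y)


def transformer(src, tgt, num_layers):
    memory = src
    for _ in range(num_layers):  # encoder stack
        memory = encoder_layer(memory)
    out = tgt
    for _ in range(num_layers):  # decoder stack of the same depth
        out = decoder_layer(out, memory)
    return out


transformer(["ผม", "ต้องการ", "PS5"], ["<start>"], num_layers=2)
print(trace)
```

Printing `trace` shows the two encoder sub-layers repeated once per encoder layer, followed by the three decoder sub-layers repeated once per decoder layer.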

Transformers – Use Cases

Transformers, a type of neural network architecture, have demonstrated remarkable capabilities in solving various complex problems, especially those related to natural language processing and sequential data. Some of the key problems that can be effectively addressed using Transformers include:

  1. Machine Translation: Transformers have revolutionized machine translation tasks by enabling more accurate and contextually-aware language translations. They excel in handling long-range dependencies and capturing intricate language structures.
  2. Language Modeling: Transformers are widely used for language modeling tasks, where they learn to predict the next word in a sequence based on the context of the preceding words. This is the foundation for various language-related applications.
  3. Sentiment Analysis: Transformers can effectively analyze and classify sentiments in text data, distinguishing between positive, negative, and neutral sentiments, making them valuable for understanding customer feedback and social media sentiment.
  4. Text Generation: Transformers have the ability to generate coherent and contextually appropriate text based on given prompts or input. This is useful in chatbots, automatic text summarization, and creative writing.
  5. Question Answering: Transformers have been employed in question-answering systems, where they can comprehend questions and provide accurate answers by analyzing context and relevant information.
  6. Document Summarization: Transformers can summarize lengthy documents by extracting key information and generating concise summaries, streamlining information retrieval and understanding.
  7. Speech Recognition: Transformers are utilized in speech recognition systems to transcribe spoken language into written text, enhancing speech-to-text conversion accuracy.
  8. Language Understanding: Transformers are applied in natural language understanding tasks, allowing machines to grasp context and nuances in human language, enhancing conversational AI and chatbot interactions.
  9. Image Captioning: Transformers can generate descriptive captions for images, understanding the visual content and expressing it in natural language.
  10. Protein Folding: Transformers are now being employed in the field of bioinformatics to predict protein folding, a complex and essential task in understanding biological functions and diseases.

The versatility of Transformers lies in their attention mechanism, which enables them to effectively process long-range dependencies in sequential data. As a result, they have become a powerful tool for solving a wide range of problems in natural language processing, speech recognition, and various other domains that involve sequential and structured data.

Points to Note:

In the paper Fine-Tuning Language Models from Human Preferences, OpenAI demonstrated how the transformer model GPT-2 can be fine-tuned to generate extremely humanlike text. All credit, if any, remains with the original contributors only. In this post we have covered the Transformer, a deep learning architecture. The last post was on Supervised Machine Learning. The next post will talk about Deep Reinforcement Learning.

Feedback & Further Questions

Do you have any questions about Deep Learning or Machine Learning? Leave a comment or ask your question via email. I will try my best to answer it.


Conclusion: This post was an attempt to explain the main concepts behind the Transformer. It also attempted to outline recent key advancements in the technology and provide insight into areas in which deep learning can improve investigation. Transformers have emerged as a versatile and powerful tool in the field of artificial intelligence and machine learning. Their attention mechanism allows them to process long-range dependencies in sequential data, making them highly effective in tasks such as natural language processing, speech recognition, and other domains involving structured data.

Thank you all for spending your time reading this post. Please share your opinions, comments, criticism, agreement, or disagreement. For more details about posts, subjects, and relevance, please read the disclaimer.

Posted by V Sharma

A Technology Specialist boasting 22+ years of exposure to Fintech, Insuretech, and Investtech with proficiency in Data Science, Advanced Analytics, AI (Machine Learning, Neural Networks, Deep Learning), and Blockchain (Trust Assessment, Tokenization, Digital Assets). Demonstrated effectiveness in Mobile Financial Services (Cross Border Remittances, Mobile Money, Mobile Banking, Payments), IT Service Management, Software Engineering, and Mobile Telecom (Mobile Data, Billing, Prepaid Charging Services). Proven success in launching start-ups and new business units - domestically and internationally - with hands-on exposure to engineering and business strategy. "A fervent Physics enthusiast with a self-proclaimed avocation for photography" in my spare time.