Transformers in non-NLP applications

Recently, transformers have shown great potential in applications associated with Natural Language Processing. For example, modern Large Language Models rely heavily on them; the transformer is even in the name of ChatGPT (the "GPT" stands for Generative Pre-trained Transformer!). But what are they, and does this new technology have potential elsewhere?

But first, why are transformers used in Natural Language Processing at all? After all, the standard Recurrent Neural Network should do the trick with variable input sizes. While technically true, the greatest drawback of Recurrent Neural Networks is capturing long-term dependencies. These networks can't remember long paragraphs or conversations, because the amount of information they can hold and attend to at any moment is limited. For situations where a large volume of information has to be continuously analyzed, a new architecture had to be developed.
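To make that bottleneck concrete, here is a minimal sketch of the recurrence at the heart of an RNN, in plain NumPy with made-up toy dimensions (this is an illustration, not any particular production architecture). A single fixed-size hidden state is updated one input at a time, so information from early steps is gradually overwritten:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RNN: one fixed-size hidden state carries everything the network "remembers".
hidden_size, seq_len = 8, 200
W_h = rng.normal(scale=0.3, size=(hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(scale=0.3, size=(hidden_size, hidden_size))  # input weights

h = np.zeros(hidden_size)
inputs = rng.normal(size=(seq_len, hidden_size))

for x in inputs:                    # strictly sequential: step t depends on step t-1
    h = np.tanh(W_h @ h + W_x @ x)  # each update partially overwrites old information

# After 200 updates, the influence of the very first input on h has largely
# washed out -- exactly the long-term dependency problem described above.
```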

The transformer model was invented by researchers at Google and published in the now-famous 2017 paper, "Attention Is All You Need".

In non-technical terms, a transformer finds numerical relationships between the words in long sequences of data. The text is first split into tokens (a process called tokenization), each token is mapped to a vector, and the transformer's attention mechanism then scores how strongly each vector relates to every other. As you can imagine, this approach can be used just about anywhere in AI, since connections between data vectors exist everywhere from audio processing to image recognition.
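As an illustration, here is a stripped-down sketch of the scaled dot-product attention at the core of a transformer, again in plain NumPy. It deliberately omits the learned query/key/value projections (and much else) that a real model has, and the shapes and data are invented for the example:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over token vectors X of shape (n, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)   # pairwise "relatedness" of every token to every other
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X              # each output is a weighted mix of the whole sequence

tokens = np.random.default_rng(1).normal(size=(5, 16))  # 5 token embeddings, 16 dims each
print(self_attention(tokens).shape)  # (5, 16) -- all positions computed at once, no recurrence
```

Note that there is no loop over time steps: every position is handled in the same pair of matrix products, which is what makes the computation so parallelizable.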

But there is another benefit to using transformers. Although large models may require expensive specialized hardware to train, transformers can consume data like no other model. For example, Llama 3, an open-source model, was trained on 15 trillion tokens (roughly, words or word fragments). Because a transformer reads an entire stretch of text simultaneously rather than word by word, it can be used in applications requiring fast data processing.
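A quick aside on the word/token distinction: tokenizers usually split text into sub-word pieces, so token counts run somewhat higher than word counts (different models ship different tokenizers; OpenAI's open-source tiktoken library is used here purely as a convenient example):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE encoding
text = "Transformers can consume data like no other model."
ids = enc.encode(text)
print(len(text.split()), "words ->", len(ids), "tokens")
```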

Perhaps one of the most interesting applications of transformers is code processing. After all, the relationships between individual lines of code are not only critical, but also unusually direct and consistent. In the future, standardization of and improvements to such technology could make programmers even more efficient than they are now.