So… wtf is a token?


If you’ve ever wondered how large language models like GPT generate text so easily, and why some people on Twitter call them a “glorified autocomplete”, it all comes down to one thing: tokens. Now, I'm no expert on this stuff, but I've recently learned a thing or two about how tokens work, and it’s way more fascinating than I expected.

What are tokens?

In the context of natural language processing and LLMs like GPT, tokens are the building blocks of text. A tokenizer takes a piece of text and divides it into smaller chunks. These chunks can be whole words, parts of words, or even single characters. Each token is assigned a unique numerical ID, and by working with those IDs GPT can learn the patterns and structures of the language.
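
If you want to see this in action, here's a minimal sketch using the open-source tiktoken library (my choice for illustration; the post doesn't name a specific tool, and the exact IDs you get depend on which vocabulary you load):

```python
# Minimal sketch using the open-source tiktoken library (an assumption on my
# part; any BPE tokenizer would do). The exact IDs depend on the vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of the GPT tokenizer vocabularies

text = "Tokens are the building blocks of text."
token_ids = enc.encode(text)   # text -> list of integer token IDs
print(token_ids)

print(enc.decode(token_ids) == text)  # True: the IDs map back to the original text
```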

Why does GPT do this tokenization thing?

Good question! Here's why:

  • It makes things consistent: Tokens give GPT one uniform format for text. It's like putting everything into the same shape so the model can understand and work with it, no matter what you feed in.
  • It saves memory: Tokens are memory-friendly. Storing a list of token IDs takes less space than storing every word or character separately, and that matters because GPT deals with a ton of text and needs to be efficient.
  • It handles complexity: Language can be pretty complex, so tokenization breaks sentences into smaller pieces, which makes it easier for GPT to work out what's going on and how the different parts relate to each other.
  • It manages the vocabulary: GPT has a fixed vocabulary of tokens it knows, which it settles on during training. If it comes across a word that isn't in that vocabulary, it falls back to smaller pieces it does know instead of giving up (see the sketch after this list).
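
To make that last point concrete, here's a rough sketch (again assuming tiktoken and its cl100k_base vocabulary, and "snackify" is just a made-up word I picked for illustration): an unfamiliar word simply gets split into smaller pieces the tokenizer already knows.

```python
# Sketch of the "unknown word" case, assuming tiktoken's cl100k_base vocabulary.
# "snackify" is a made-up word chosen for illustration; it almost certainly
# isn't a single token, so the tokenizer splits it into pieces it does know.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.n_vocab)  # size of the fixed vocabulary (roughly 100k for this encoding)

ids = enc.encode("snackify")
print(ids)                             # likely more than one ID for a single "word"
print([enc.decode([i]) for i in ids])  # the smaller sub-word pieces it became
```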

If you're still having trouble understanding how tokenization works, there's an awesome visual tokenizer created by Simon Willison that you can play with.

The art of tokenization

Breaking down text

[Image: the sentence "I like my tokens like my snacks, bite-sized." split into individual tokens]

Here we have the sentence "I like my tokens like my snacks, bite-sized." This sentence is now our test subject for understanding how tokenization works. When we pass it through the token encoder, it breaks down into smaller tokens such as "I," "like," "my," "tokens," "snacks," ",", "bite," "-", and "sized." Each token receives its own unique numerical representation, kind of like a secret code that captures its essence.

In this particular example, all the words are small and commonly used in the English language. As a result, each word corresponds to a single token. However, it's important to note that this isn't always the case.
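
If you want to reproduce this at home, here's a hedged little sketch (tiktoken with cl100k_base again; the screenshot may come from a different tokenizer, so your splits and IDs won't necessarily match what's shown above):

```python
# Sketch: run the post's example sentence through a tokenizer and inspect each
# piece. (tiktoken's cl100k_base; a different tokenizer would give different
# splits and IDs.)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "I like my tokens like my snacks, bite-sized."
for token_id in enc.encode(sentence):
    # print each ID next to its text piece; note the leading spaces baked in
    print(f"{token_id:>6}  {enc.decode([token_id])!r}")
```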

Nuances and quirks

[Image: a sentence with less common words, split into multiple sub-word tokens]

If we look at the example above, we encounter some fancier, less common words. The token encoder deals with these in a simple but effective way: it breaks the big words down into even smaller sub-word tokens, which helps the model process language more efficiently. So here the word "Token" becomes token 22906, but remember how in the previous example we had " token" with a leading space and a lowercase "t"? That one turned into token 16326!

This is where tokenization gets pretty interesting. It assigns different tokens to the same word based on things like capitalization or leading spaces. That way the tokenizer can encode each variation accurately, and it saves tokens by not spending one on every space in the text. That's clever if you ask me.
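
Here's a quick sketch of that quirk (same assumptions as before: tiktoken with cl100k_base, so the IDs you see may not be the 22906 and 16326 from the screenshot):

```python
# Sketch of the capitalization / leading-space quirk, assuming tiktoken's
# cl100k_base vocabulary (so the IDs here won't necessarily match the post's).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for variant in ["Token", "token", " Token", " token"]:
    print(f"{variant!r:>10} -> {enc.encode(variant)}")
# Same word, four variants, four different encodings: case and leading
# whitespace are part of the token itself.
```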

TL;DR

Tokenization is really just a simple technique that breaks text into smaller pieces so models like GPT can make sense of it. But the interesting part is how it handles the tiny details, like whether a word starts with a capital letter or has a space in front. Those small things can totally change the token. It's a small trick that helps the model stay efficient and understand the messy, nuanced way we use language.

If you're curious about more tokenizer quirks and anomalies, I highly recommend checking out Simon Willison's in-depth blog post.

This post was last updated on Aug 14, 2025