Tokenization is the process of breaking text into "pieces" and putting each one in its right place before the network can understand it.
Before a neural network like GPT can process language, the text must first be converted into a form it can understand. Neural networks don't work with words directly; they work with numbers. This conversion starts with tokenization.
The secret of all victory lies in the organization of the non-obvious.
Marcus Aurelius
A token is a chunk of text: a full word ("Running"), a piece of a word ("run"), or even punctuation ("."). The tokenizer decides how to break text into tokens and assigns each token a unique number (an ID) from the model's vocabulary.
Take this sentence:
Running makes me stronger every day.
Step 1 – Tokenization (splitting into pieces):
["Running", " makes", " me", " stronger", " every", " day", "."]
Step 2 – Convert tokens to IDs (numbers from vocabulary):
[14405, 1620, 502, 6789, 1234, 789, 13]
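Steps 1 and 2 can be sketched in a few lines of Python. The vocabulary below is a hypothetical toy mapping built just for this sentence; real GPT-style tokenizers use byte-pair encoding (BPE) over vocabularies of roughly 50,000 to 100,000 tokens, and the split rules are more subtle than this regex.

```python
import re

# Hypothetical toy vocabulary for this one sentence; real models learn
# theirs (e.g. via byte-pair encoding) over huge text corpora.
VOCAB = {
    "Running": 14405, " makes": 1620, " me": 502, " stronger": 6789,
    " every": 1234, " day": 789, ".": 13,
}

def tokenize(text):
    # Split into word pieces that keep their leading space, plus punctuation,
    # mimicking how GPT-style tokenizers mark word boundaries.
    return re.findall(r" ?\w+|[^\w\s]", text)

tokens = tokenize("Running makes me stronger every day.")
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['Running', ' makes', ' me', ' stronger', ' every', ' day', '.']
print(ids)     # [14405, 1620, 502, 6789, 1234, 789, 13]
```

Note how every token after the first carries its leading space: that is how GPT-style tokenizers encode word boundaries without a separate "space" token.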
Step 3 – Embeddings (map numbers into dense vectors):
Each ID is turned into a high-dimensional vector (like coordinates in space). This allows the neural network to capture meaning, such as “Running” being closer to “Jogging” than “Chair.”
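The "closer to" idea can be made concrete with cosine similarity. The 4-dimensional vectors below are made-up illustrations; real embeddings have hundreds or thousands of dimensions and their values are learned during training.

```python
import math

# Hypothetical 4-dimensional embeddings; real models use hundreds or
# thousands of dimensions, learned during training.
EMBEDDINGS = {
    "Running": [0.9, 0.8, 0.1, 0.0],
    "Jogging": [0.85, 0.75, 0.2, 0.05],
    "Chair":   [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Semantically similar words end up close together in embedding space.
print(cosine(EMBEDDINGS["Running"], EMBEDDINGS["Jogging"]))  # high, near 1.0
print(cosine(EMBEDDINGS["Running"], EMBEDDINGS["Chair"]))    # much lower
```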
Step 4 – Transformer layers (Attention, Feed-Forward, etc.):
The neural network processes these vectors through attention mechanisms, linear projections, and residual connections. This is where the model applies the patterns it learned during training, such as associating "stronger" with "Running" in the sentence.
Step 5 – Output (prediction or generation):
Finally, the network uses probabilities to predict the next token or generate an entire response.
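The final step can be sketched as a softmax over the model's output scores (logits). The vocabulary and logit values below are hypothetical; a real model produces one logit for every token in its vocabulary and may sample from the distribution rather than always taking the most likely token.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer logits, one score per token in a tiny vocabulary.
vocab = ["run", "chair", "stronger", "."]
logits = [2.1, -1.3, 4.0, 0.5]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "stronger" has the highest logit, so the highest probability
```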
So tokenization is the entry point of the whole system. Without it, the neural network wouldn’t know how to translate human language into numbers it can work with.
