Understanding Tokenization in GPT
Language models like GPT do not process text directly; instead, they divide the text into smaller units before performing any computation. This process is known as Tokenization.
In this lesson, we will explore what tokenization is and how tokens are used in GPT.
What is Tokenization?
A token is a small unit of text, such as a word, punctuation mark, or number, that a sentence is broken into.
When an AI model receives a prompt like "The cat climbed up the tree.", it splits the sentence into tokens:
The / cat / climbed / up / the / tree / .
In English, tokenization mainly splits words based on spaces and punctuation (such as periods and commas).
For instance, the sentence "The quick brown fox jumps over the lazy dog." is tokenized as follows:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
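To see this in practice, here is a minimal sketch using OpenAI's open-source tiktoken library. This is an assumption for illustration; the lesson itself does not name a specific tokenizer, and the exact splits depend on which encoding you choose.

```python
# Minimal sketch using the tiktoken library (pip install tiktoken).
# "cl100k_base" is the encoding used by several recent GPT models; other
# encodings will produce different token boundaries.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(sentence)                    # integer IDs the model actually sees
pieces = [enc.decode([tid]) for tid in token_ids]   # each ID mapped back to its text piece

print(token_ids)  # e.g. a list of integers, one per token
print(pieces)     # e.g. ['The', ' quick', ' brown', ' fox', ' jumps', ...]
```

Note that real byte-pair-encoding tokenizers often attach the leading space to a token (' quick' rather than 'quick'), so the output can differ slightly from the simplified word-level split shown above.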
A single word can also be divided into multiple tokens based on prefixes, patterns, and suffixes. For example, the word "unconscious" can be divided into sub-components such as un (a prefix indicating negation), consc (a common pattern in English words), and ious (a common suffix in English words), resulting in three tokens.
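The same approach can be used to inspect how a single long word breaks into subword pieces. Again, this sketch assumes the tiktoken library; the pieces it prints for "unconscious" depend on the encoding and may not match the intuitive un / consc / ious split described above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "unconscious"
token_ids = enc.encode(word)
# Print the subword pieces the tokenizer actually uses, which may or may not
# line up with a prefix/root/suffix decomposition.
print([enc.decode([tid]) for tid in token_ids])
```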
Token handling varies depending on the AI model and the language or character set it is processing. ChatGPT usually maps one token to roughly one to four characters of English text, while other languages may be split along different boundaries (for example, by morphemes) and often require more tokens for the same amount of text.
Note: Most text-generating AIs, such as ChatGPT, charge based on the number of input and output tokens, so it is important to avoid unnecessary tokens in your prompts.
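Because billing is per token, counting tokens before sending a prompt is a common practice. Below is a small sketch of that idea; the per-1,000-token prices are placeholder values rather than real pricing, and the use of tiktoken is an assumption, not something specified in this lesson.

```python
import tiktoken

def estimate_cost(prompt: str, expected_output: str,
                  usd_per_1k_input: float = 0.0005,     # placeholder price, not a real rate
                  usd_per_1k_output: float = 0.0015) -> float:  # placeholder price
    """Rough cost estimate based on token counts of the prompt and expected output."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_input = len(enc.encode(prompt))
    n_output = len(enc.encode(expected_output))
    return (n_input / 1000) * usd_per_1k_input + (n_output / 1000) * usd_per_1k_output

print(estimate_cost("The cat climbed up the tree.", "A short answer about the cat."))
```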
AI models learn statistical relationships between tokens and generate new text based on the prompt.
In the next lesson, we will look into Hallucination, one of the critical issues in generative AI.
What is the process of breaking down text into smaller units in language models like GPT called?