Lecture

Understanding Tokenization in GPT

Language models like GPT do not process raw text directly; they first divide it into smaller units and perform their computations on those units.

This process is known as Tokenization.

In this lesson, we will explore what tokenization is and how tokens are used in GPT.


What is Tokenization?

A token is a small unit of text, such as a word, punctuation mark, or number, that a sentence is broken into.

When an AI model receives a prompt such as "The cat climbed up the tree.", it splits the sentence into tokens.

Tokenized Sentence Example
The / cat / climbed / up / the / tree / .

In English, a simple tokenizer splits text mainly on spaces and punctuation (such as periods and commas).

For instance, the sentence "The quick brown fox jumps over the lazy dog." is tokenized as follows:

English Tokenization Example
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
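A minimal sketch of this kind of space-and-punctuation splitting, using Python's standard re module. The function name simple_tokenize is a hypothetical helper chosen for illustration; it is not how GPT's own tokenizer works.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```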

A single word can also be divided into multiple tokens based on prefixes, patterns, and suffixes.

For example, the word "unconscious" can be divided into sub-components like un (a prefix indicating negation), consc (a common pattern in English words), and ious (a common suffix in English words), resulting in three tokens.
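The split above can be imitated with a toy greedy longest-match tokenizer. This is only a sketch: the three-entry vocabulary and the function name greedy_subword_tokenize are assumptions made for illustration, and real GPT tokenizers learn a far larger vocabulary from data and may split the same word differently.

```python
def greedy_subword_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical toy vocabulary containing only the pieces from the example.
toy_vocab = {"un", "consc", "ious"}
print(greedy_subword_tokenize("unconscious", toy_vocab))
# ['un', 'consc', 'ious']
```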


Token handling methods vary depending on the AI model and the language or character set it is processing. For English text in ChatGPT, one token corresponds to roughly four characters on average, though an individual token can be as short as a single character. Tokenizers for other languages may instead rely on morphological analysis.
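As a rough sketch, a token count for English text can be estimated from its character count. The default of four characters per token is an assumed average (a common rule of thumb for English), and estimate_token_count is a hypothetical helper; an exact count requires the model's actual tokenizer.

```python
import math

def estimate_token_count(text, chars_per_token=4.0):
    # chars_per_token is an assumed average, not an exact property of any model.
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

# The example sentence has 28 characters, so the estimate is 28 / 4 = 7 tokens.
print(estimate_token_count("The cat climbed up the tree."))
# 7
```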

Note: Most text-generating AI services, such as the APIs behind ChatGPT, bill by the number of input and output tokens. Trimming unnecessary tokens from prompts and responses therefore reduces cost.
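To illustrate how token counts translate into cost, the sketch below multiplies input and output token counts by per-1,000-token prices. The function name estimate_cost and the prices used are placeholders invented for this example, not real rates for any service.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    # Input and output tokens are often priced separately, so they are summed separately.
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Hypothetical prices: 0.5 per 1,000 input tokens and 1.5 per 1,000 output tokens.
print(estimate_cost(2000, 1000, input_price_per_1k=0.5, output_price_per_1k=1.5))
# 2.5
```

With these placeholder prices, halving the prompt to 1,000 input tokens would lower the total from 2.5 to 2.0.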


AI models learn statistical relationships between tokens and generate new text based on the prompt.
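The idea of learning statistical relationships between tokens can be illustrated with a deliberately tiny stand-in: counting which token follows which in a small corpus and predicting the most frequent follower. This bigram counter is a simplification chosen for illustration; GPT itself uses a neural network trained on far more data, and the function names and corpus here are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    # For each token, count how often each other token appears immediately after it.
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, token):
    # Return the most frequent follower, or None if the token was never seen.
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = "the cat sat . the cat ran . the dog sat .".split()
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "the"))
# cat
```

In this corpus "the" is followed by "cat" twice and "dog" once, so "cat" is predicted.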


In the next lesson, we will look into Hallucination, one of the critical issues in generative AI.

Quiz

What is the process of breaking down text into smaller units in language models like GPT called?

The process of breaking down text into smaller units in language models like GPT is called ____.
Tokenization
Parsing
Preprocessing
Morphological Analysis
