Understanding Tokenization in GPT
Language models like GPT do not process text directly; instead, they divide the text into smaller units before performing any computation. This process is known as Tokenization.
In this lesson, we will explore what tokenization is and how tokens are used in GPT.
What is Tokenization?
A token is a small unit of text, such as a word, punctuation mark, or number, that a sentence is broken into.
When an AI model receives a prompt like "The cat climbed up the tree.", it splits the sentence into tokens:
The / cat / climbed / up / the / tree / .
In English, tokenization mainly splits words based on spaces and punctuation (such as periods and commas).
For instance, the sentence "The quick brown fox jumps over the lazy dog." is tokenized as follows:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
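To see this in practice, here is a minimal sketch using OpenAI's open-source tiktoken library. This is an assumption for illustration; the lesson itself does not name a specific tokenizer, and the exact splits depend on which encoding you choose.

```python
# Minimal sketch using the tiktoken library (pip install tiktoken).
# "cl100k_base" is the encoding used by several recent GPT models; other
# encodings will produce different token boundaries.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(sentence)                    # integer IDs the model actually sees
pieces = [enc.decode([tid]) for tid in token_ids]   # each ID mapped back to its text piece

print(token_ids)  # e.g. a list of integers, one per token
print(pieces)     # e.g. ['The', ' quick', ' brown', ' fox', ' jumps', ...]
```

Note that real byte-pair-encoding tokenizers often attach the leading space to a token (' quick' rather than 'quick'), so the output can differ slightly from the simplified word-level split shown above.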
A single word can also be divided into multiple tokens based on prefixes, patterns, and suffixes. For example, the word "unconscious" can be divided into sub-components such as un (a prefix indicating negation), consc (a common pattern in English words), and ious (a common suffix in English words), resulting in three tokens.
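The same approach can be used to inspect how a single long word breaks into subword pieces. Again, this sketch assumes the tiktoken library; the pieces it prints for "unconscious" depend on the encoding and may not match the intuitive un / consc / ious split described above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "unconscious"
token_ids = enc.encode(word)
# Print the subword pieces the tokenizer actually uses, which may or may not
# line up with a prefix/root/suffix decomposition.
print([enc.decode([tid]) for tid in token_ids])
```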
Token handling varies depending on the AI model and the language or character set it is processing. ChatGPT usually maps one token to roughly one to four characters of English text, while other languages may be split along different boundaries (for example, by morphemes) and often require more tokens for the same amount of text.
Note: Most text-generating AIs, such as ChatGPT, charge based on the number of input and output tokens, so it is important to avoid unnecessary tokens in your prompts.
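Because billing is per token, counting tokens before sending a prompt is a common practice. Below is a small sketch of that idea; the per-1,000-token prices are placeholder values rather than real pricing, and the use of tiktoken is an assumption, not something specified in this lesson.

```python
import tiktoken

def estimate_cost(prompt: str, expected_output: str,
                  usd_per_1k_input: float = 0.0005,     # placeholder price, not a real rate
                  usd_per_1k_output: float = 0.0015) -> float:  # placeholder price
    """Rough cost estimate based on token counts of the prompt and expected output."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_input = len(enc.encode(prompt))
    n_output = len(enc.encode(expected_output))
    return (n_input / 1000) * usd_per_1k_input + (n_output / 1000) * usd_per_1k_output

print(estimate_cost("The cat climbed up the tree.", "A short answer about the cat."))
```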
AI models learn statistical relationships between tokens and generate new text based on the prompt.
In the next lesson, we will look into Hallucination, one of the critical issues in generative AI.
What is the process of breaking down text into smaller units in language models like GPT called?