Categorical Data Encoding
AI and machine learning models work with numerical input.
However, much of the data we work with is text-based.
This kind of data, grouped into categories without numerical meaning, is called categorical data.
| ID | Color  | Region      | Occupation |
|----|--------|-------------|------------|
| 1  | Red    | New York    | Student    |
| 2  | Blue   | Chicago     | Employee   |
| 3  | Green  | Los Angeles | Student    |
| 4  | Yellow | New York    | Doctor     |
In the data above, color, region, and occupation are categorical data.
These values cannot be used directly in calculations, and in this table, comparing their magnitude or order is not meaningful.
Categorical data can be divided into two main types.
Nominal Data
This is categorical data without any order. Examples of nominal data include colors (red, blue, green) and regions (New York, Chicago, Los Angeles).
Ordinal Data
This is categorical data with an order. Examples of ordinal data include education levels (elementary, middle, high school) and customer satisfaction levels (low, medium, high).
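The difference between the two types can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the names `satisfaction_order`, `rank`, and `responses` are hypothetical, and the sample responses are made up for the example. Once the order of an ordinal category is written down explicitly, its values can be ranked and compared, which has no counterpart for nominal values such as colors.

```python
# Ordinal data: the order of the levels is part of the data's meaning.
# The list below states that order explicitly (lowest to highest).
satisfaction_order = ["low", "medium", "high"]

# Map each level to its position in the order (hypothetical helper mapping).
rank = {level: position for position, level in enumerate(satisfaction_order)}

# Made-up sample responses, used only for illustration.
responses = ["medium", "low", "high", "medium"]

# Because the levels are ordered, sorting and comparing them is meaningful.
sorted_responses = sorted(responses, key=lambda level: rank[level])
is_high_above_low = rank["high"] > rank["low"]

print(sorted_responses)
print(is_high_above_low)
```

For nominal data such as colors, no such ranking exists, so any order imposed on the values would be arbitrary.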
Categorical data needs to be converted into numerical form for machine learning, a process known as encoding.
What is Data Encoding?
Categorical data must be transformed into numbers so that machine learning models can comprehend it. This transformation process is known as data encoding.
For example, let's convert the color data above into numbers.
| ID | Color  | Color (Encoded) |
|----|--------|-----------------|
| 1  | Red    | 0               |
| 2  | Blue   | 1               |
| 3  | Green  | 2               |
| 4  | Yellow | 3               |
This allows the model to process color data numerically.
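The conversion in the table can be reproduced with a short Python sketch. It assumes that codes are assigned in the order in which each color first appears, which matches the table; the function name `encode_by_first_appearance` is hypothetical, and other tools may assign the integers in a different order.

```python
# Color column from the table, in row order (ID 1 to 4).
colors = ["Red", "Blue", "Green", "Yellow"]

def encode_by_first_appearance(values):
    """Assign each distinct value an integer code in order of first appearance.

    Hypothetical helper for illustration; returns the mapping and the codes.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused code is the number of categories seen so far.
            mapping[value] = len(mapping)
    return mapping, [mapping[value] for value in values]

color_to_code, encoded_colors = encode_by_first_appearance(colors)

print(color_to_code)   # mapping from color name to integer code
print(encoded_colors)  # encoded column, one code per row
```

The integers here are only identifiers. Because color is a category without order, the fact that Yellow receives a larger code than Red carries no meaning.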
Methods for this transformation include Label Encoding and One-Hot Encoding.
We will discuss each method in more detail in the following lessons.
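As a brief preview, the general idea behind One-Hot Encoding can be sketched in plain Python: each category gets its own column, and a row holds 1 in the column of its category and 0 everywhere else. This is a minimal sketch under stated assumptions, not the implementation of any particular library; the variable names are hypothetical, and the column order (first appearance) is a choice made for the example.

```python
# Color column from the lesson's table, in row order.
colors = ["Red", "Blue", "Green", "Yellow"]

# Distinct categories in order of first appearance (dicts preserve insertion order).
categories = list(dict.fromkeys(colors))

# One row per data point, one column per category:
# 1 marks the row's own category, 0 marks every other category.
one_hot = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

for color, row in zip(colors, one_hot):
    print(color, row)
```

Unlike a single integer code per color, this representation does not suggest any ordering between the categories.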
What is the process of converting categorical data into numbers called?
Standardization
Normalization
Encoding
Clustering