# Preprocessing Data for Seamless Consumption

`Data preprocessing` refers to the process of **cleaning and transforming** data before analyzing it or training an AI model.

Simply put, it involves making raw data, which may be disorganized or incomplete, clean and consistent.

<br />

## Why is Preprocessing Necessary?

Datasets can contain the following issues:

- *Missing values*: Cases where some data is absent

- *Duplicate values*: Instances where the same data occurs multiple times

- *Inconsistent data*: Situations where the data format is not uniform

Without preprocessing, an AI model may learn from flawed data, leading to inaccurate predictions.
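The three issues above can be counted before any cleaning is done. The following is a minimal sketch, not a definitive implementation: the sample records, the `audit` function name, and the choice of `age` as the checked field are all hypothetical, and "inconsistent" is assumed here to mean "present but not stored as an integer".

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON object per line (JSONL).
RAW = """\
{"name": "Ana", "age": 25}
{"name": "Ben"}
{"name": "Ana", "age": 25}
{"name": "Cy", "age": "41"}
"""


def audit(jsonl_text, field="age"):
    """Count missing, duplicate, and inconsistently typed values."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

    # Missing value: the field is absent from the record.
    missing = sum(1 for r in records if field not in r)

    # Duplicate value: the same record appears more than once.
    # Serializing with sorted keys gives a comparable, hashable form.
    counts = Counter(json.dumps(r, sort_keys=True) for r in records)
    duplicates = sum(n - 1 for n in counts.values())

    # Inconsistent data (assumption): the field exists but is not an int.
    inconsistent = sum(
        1 for r in records if field in r and not isinstance(r[field], int)
    )

    return {"missing": missing, "duplicates": duplicates, "inconsistent": inconsistent}


report = audit(RAW)
print(report)
```

Running an audit like this first shows which preprocessing steps a dataset actually needs.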

<br />

## JSONL Data Preprocessing Example

Here's how you can handle missing values, fix inconsistent formats, and remove duplicates in a dataset:

```json title="Original JSONL Data"
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Sam Brown", "city": "Chicago"}
{"name": "John Doe", "age": "thirty", "city": "New York"}
```

⬇

```json title="Missing Value Processed JSONL Data"
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Sam Brown", "age": 0, "city": "Chicago"}
{"name": "John Doe", "age": "thirty", "city": "New York"}
```

Sam Brown's record had no `age`, so the missing value is replaced with the placeholder `0`.

⬇

```json title="Inconsistent Data Format Converted JSONL Data"
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Sam Brown", "age": 0, "city": "Chicago"}
{"name": "John Doe", "age": 30, "city": "New York"}
```

Every `age` is now a number: the string `"30"` and the word `"thirty"` were both converted to `30`.

⬇

```json title="Duplicate Value Removed JSONL Data"
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Sam Brown", "age": 0, "city": "Chicago"}
```

After conversion, the second "John Doe" record was identical to the first, so it was removed as a duplicate.
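The three steps above can be sketched in Python using only the standard `json` module. This is an illustrative sketch under stated assumptions rather than a prescribed method: the names `preprocess`, `normalize_age`, and `WORD_TO_NUMBER` are hypothetical, the word-to-number lookup covers only the single word that appears in the example, and `0` is used as the placeholder for a missing age simply because the example does so.

```python
import json

# The original JSONL data from the example above.
RAW = """\
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Sam Brown", "city": "Chicago"}
{"name": "John Doe", "age": "thirty", "city": "New York"}
"""

# Assumed lookup for spelled-out numbers; real data would need more entries.
WORD_TO_NUMBER = {"thirty": 30}


def normalize_age(value):
    """Return the age as an int, whether it arrived as 30, "30", or "thirty"."""
    if isinstance(value, int):
        return value
    text = str(value).strip().lower()
    if text.isdigit():
        return int(text)
    return WORD_TO_NUMBER[text]  # raises KeyError for words not in the lookup


def preprocess(jsonl_text):
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    cleaned = []
    seen = set()
    for record in records:
        # Step 1: handle missing values (placeholder 0, as in the example).
        record.setdefault("age", 0)
        # Step 2: convert inconsistent formats to a single numeric type.
        record["age"] = normalize_age(record["age"])
        # Step 3: remove duplicates, comparing records by their sorted-key JSON form.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


cleaned = preprocess(RAW)
for record in cleaned:
    print(json.dumps(record))
```

Duplicates are removed last on purpose: the two "John Doe" records only become identical once `"30"` and `"thirty"` have both been converted to `30`.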

<br/>

As demonstrated, thorough preprocessing is a critical step in constructing a reliable dataset for model fine-tuning.

Preprocessing refers to organizing and transforming data before analyzing it or training an AI model. This process includes handling missing values, converting data formats, and removing duplicates.

### True or False: Preprocessing is the process of organizing data after training an AI model.