
Data Preprocessing and Model Evaluation Using Scikit-Learn

Preprocessing is the process of transforming data to make it suitable for a model.

Before training a machine learning model, it's necessary to prepare and preprocess the training data.

Scikit-Learn provides tools for data preprocessing as well as metrics for model evaluation.


Data Preprocessing

This lecture covers two common preprocessing steps: handling missing values and feature scaling.


Handling Missing Values

Missing values are empty entries in the dataset, typically represented as NaN.

When a dataset contains missing values, a common approach is to fill each one with the mean or median of its column.

Handling Missing Values
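from sklearn.impute import SimpleImputer
import numpy as np

# Two rows, each with one missing entry (np.nan)
data = np.array([[1, 2, np.nan], [4, np.nan, 6]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled_data = imputer.fit_transform(data)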
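The median strategy mentioned above works the same way. The following is a minimal sketch on a small made-up array (the values are illustrative, not from the lecture); the first column contains an outlier, which pulls the mean up to 36 while the median stays at 7.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 contains an outlier (100.0)
data = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
    [100.0, 6.0],
])

# strategy="median" fills each gap with the median of its column
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(data)

print(imputer.statistics_)  # per-column fill values learned during fit
print(filled)
```

The median is often the safer choice when a column contains extreme values, since a single outlier does not shift it much.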

Feature Scaling

Feature scaling puts all features on a comparable scale.

If the magnitudes of feature values differ significantly, model performance can suffer, so features are rescaled (for example, by standardization) to make them comparable.

Feature Scaling
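import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features whose magnitudes differ by a factor of 100
X = np.array([[1, 100], [2, 200], [3, 300]])

# Standardize each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)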

In the code above, StandardScaler rescales each feature (column) to have a mean of 0 and a standard deviation of 1.

Both columns of X_scaled therefore become approximately [-1.22, 0, 1.22], even though the original columns differed by a factor of 100.
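To check the claim about mean and standard deviation, the per-column statistics can be computed directly. The sketch below also shows MinMaxScaler, a different Scikit-Learn scaler that maps each column to the range [0, 1]; it is included only for contrast and is not part of the lecture's example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1, 100], [2, 200], [3, 300]], dtype=float)

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(col_means, col_stds)

# Min-max scaling (for contrast): each column is mapped onto [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax)
```

Which scaler is appropriate depends on the model and the data; this sketch only illustrates the difference in output ranges.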


Model Evaluation

There are various metrics available to evaluate model performance.


Classification Model Evaluation

Classification models assign each sample to one of a set of discrete classes.

Accuracy is calculated to determine how correctly the model made predictions.

Accuracy Evaluation
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]  # actual labels
y_pred = [0, 1, 0, 0]  # predicted labels (third one is wrong)

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

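Accuracy is the ratio of correct predictions to the total number of predictions. In the example above, 3 of the 4 predictions match the actual labels, so the accuracy is 0.75. Higher accuracy generally indicates better model performance.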
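The definition can be reproduced by hand. This is a minimal sketch using NumPy on the same labels as the lecture's example; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# Fraction of positions where the prediction equals the actual label
manual_accuracy = np.mean(y_true == y_pred)

# The library function should agree with the manual calculation
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```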


Regression Model Evaluation

Regression analysis models the relationship between variables in order to predict a continuous value.

A regression model is evaluated by measuring the difference between its predicted values and the actual values.

Regression Model Evaluation
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]  # actual values
y_pred = [2.5, 0.0, 2.0, 8.0]   # predicted values

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.2f}")

The code above calculates the Mean Squared Error (MSE) to measure the difference between predictions and actual values.

MSE averages the squared differences between predicted and actual values; the smaller the value, the closer the predictions are to the actual values. In the example above, the squared errors are 0.25, 0.25, 0, and 1, so the MSE is 1.5 / 4 = 0.375.
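The same calculation can be written out with NumPy. The sketch below reuses the lecture's values and additionally takes the square root (RMSE), a common variant that is not covered in the lecture but expresses the error in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of the squared differences between actual and predicted values
manual_mse = np.mean((y_true - y_pred) ** 2)

# The library function should agree with the manual calculation
library_mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of the MSE, in the same units as y
rmse = np.sqrt(manual_mse)

print(manual_mse, library_mse, rmse)
```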



