Dataset Structure: Features and Labels

In supervised machine learning, a dataset is divided into two main parts:

  • Features (X) — input variables used by the model to make predictions (e.g., age, height, or number of purchases).
  • Labels (y) — the target variable the model is trying to predict (e.g., whether an email is spam or the price of a house).

In supervised learning, the model learns the relationship between features and labels to make accurate predictions.
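
For example, a tiny, hypothetical dataset might store the size and number of rooms of a few houses as features and their prices as labels. The sketch below uses made-up numbers purely for illustration:

Toy Features and Labels
import numpy as np

# Features (X): one row per house, one column per input variable
# (hypothetical values: size in square meters, number of rooms)
X = np.array([
    [50, 2],
    [80, 3],
    [120, 4],
])

# Labels (y): the target value for each house (hypothetical price in thousands)
y = np.array([150, 240, 360])

print("X shape:", X.shape)  # (3, 2) -> 3 samples, 2 features
print("y shape:", y.shape)  # (3,)   -> one label per sample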


Loading a Dataset in Scikit-learn

Scikit-learn includes several built-in datasets for experimentation. One of the most commonly used is the Iris dataset, which contains sepal and petal measurements for three species of iris flowers.

Loading the Iris Dataset
from sklearn.datasets import load_iris

iris = load_iris()

# Features (X) - shape: (samples, features)
X = iris.data
print("Feature shape:", X.shape)
print("First row of features:", X[0])

# Labels (y) - shape: (samples,)
y = iris.target
print("Label shape:", y.shape)
print("First label:", y[0])
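
For the Iris dataset, the feature shape printed here is (150, 4) — 150 flower samples, each described by four measurements — and the label shape is (150,), one label per sample.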

Inspecting Feature and Label Names

You can check the feature and target names to understand what each column and label represents:

Feature and Label Names
print("Feature names:", iris.feature_names) print("Target names:", iris.target_names)

Key points about features and labels:

  • Features are the information your model uses to make predictions.
  • Labels define the correct answers during training.
  • X: input features, a 2D array of shape (n_samples, n_features).
  • y: target labels, a 1D array of shape (n_samples,).

Organizing data correctly into X and y is essential for Scikit-learn functions like train_test_split() and .fit(). Proper separation of features and labels is the first step in preparing data for training.
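
As a minimal sketch of how X and y feed into these functions (the choice of LogisticRegression here is just one option; any Scikit-learn estimator follows the same pattern):

Splitting the Data and Fitting a Model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load features (X) and labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a model on the training features and labels
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
print("Test accuracy:", model.score(X_test, y_test))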

Quiz

Understanding Dataset Structure

In a dataset used for machine learning, the input variables are referred to as ____.
Features
Labels
Targets
Outputs
