Dataset Structure: Features and Labels
In supervised machine learning, a dataset is divided into two main parts:
- Features (X): input variables the model uses to make predictions (e.g., age, height, or number of purchases).
- Labels (y): the target variable the model is trying to predict (e.g., whether an email is spam or the price of a house).
In supervised learning, the model learns the relationship between features and labels to make accurate predictions.
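As a quick illustration, features and labels are typically stored as two separate arrays. The values below are made up for demonstration only, not taken from a real dataset:

import numpy as np

# Each row is one example; each column is one feature
# (e.g., [age, number of purchases]) - hypothetical values
X = np.array([[25, 3],
              [40, 10],
              [31, 1]])

# One label per example (e.g., 1 = made another purchase, 0 = did not)
y = np.array([1, 1, 0])

print(X.shape)  # (3, 2) -> 3 samples, 2 features
print(y.shape)  # (3,)   -> one label per sample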
Loading a Dataset in Scikit-learn
Scikit-learn includes several built-in datasets for experimentation.
One of the most commonly used is the Iris dataset, which contains measurements of iris flowers from three species.
from sklearn.datasets import load_iris

iris = load_iris()

# Features (X) - shape: (samples, features)
X = iris.data
print("Feature shape:", X.shape)
print("First row of features:", X[0])

# Labels (y) - shape: (samples,)
y = iris.target
print("Label shape:", y.shape)
print("First label:", y[0])
Inspecting Feature and Label Names
You can check the feature and target names to understand what each column and label represents:
print("Feature names:", iris.feature_names) print("Target names:", iris.target_names)
The following are some key points about features and labels:
- Features are the information your model uses to make predictions.
- Labels define the correct answers during training.
- X: input features, a 2D array of shape (n_samples, n_features).
- y: target labels, a 1D array of shape (n_samples,).
Organizing data correctly into X and y is essential for Scikit-learn functions like train_test_split() and .fit().
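The sketch below shows how the X and y arrays from the Iris example plug into that workflow. The choice of LogisticRegression and the split parameters are illustrative, not prescribed by this lesson:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split features and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple classifier on the training portion
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))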
Proper separation of features and labels is the first step in preparing data for training.