# Splitting Data: Train vs Test

In machine learning, datasets are divided into `training` and `testing` sets to evaluate how well a model generalizes to unseen data.

* `Training set` — used to teach the model patterns in the data.
* `Testing set` — used to evaluate performance on data the model hasn’t seen before.

Without this separation, you cannot tell whether a model is **overfitting**: memorizing the training data instead of learning generalizable patterns.
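To make the idea concrete, the sketch below fits an unconstrained decision tree on the Iris data and compares accuracy on the training and testing sets. The choice of `DecisionTreeClassifier` and the 80/20 split are assumptions made for illustration; `train_test_split` itself is covered in the next section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can keep growing until it fits the training rows closely
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

Training accuracy on its own tends to look optimistic. A large gap between the two numbers is a common sign of overfitting, although on a small, clean dataset like Iris the gap may be modest.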

<br/>

## Using `train_test_split` in Scikit-learn

The `train_test_split()` function shuffles the data and divides it into training and testing sets in a single call.

```python title="Basic Train-Test Split"
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
```
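`test_size` also accepts an integer, in which case it is read as an absolute number of test samples rather than a fraction. The sketch below shows both forms on the same Iris data; the specific values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Fraction: 20% of 150 rows -> 30 test rows
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Integer: exactly 45 test rows
_, X_test_abs, _, _ = train_test_split(X, y, test_size=45, random_state=42)

print("Fractional test size:", X_test_frac.shape)
print("Absolute test size:", X_test_abs.shape)
```

A fraction is usually more convenient because it keeps working when the dataset grows; an integer is handy when you need a fixed-size evaluation set.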

<br/>

## Controlling Randomness

Use the `random_state` parameter to make your results **reproducible**.
Without it, each run shuffles the data differently and produces a different split, so evaluation metrics can vary from run to run.

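```python title="Fixed Random State"
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# The same random_state always produces the same 70/30 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
```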
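One way to check reproducibility is to split twice with the same seed and compare the results. The sketch below does that and then repeats the split with a different seed; the seed values `123` and `7` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same seed return identical splits
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=123)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=123)

# A different seed shuffles the rows differently
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)

same_seed_match = np.array_equal(X_test_a, X_test_b)
diff_seed_match = np.array_equal(X_test_a, X_test_c)

print("Same seed, same split:", same_seed_match)
print("Different seed, same split:", diff_seed_match)
```

The particular seed value carries no meaning; what matters is that it stays fixed when you want to compare runs.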

<br/>

## Stratified Splits

For *classification tasks*, set `stratify=y` to keep class proportions consistent between training and testing sets.

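```python title="Stratified Split"
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Stratify on the labels so each class keeps its share in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Count samples per class in each set
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)

print("Train distribution:", dict(zip(unique_train, counts_train)))
print("Test distribution:", dict(zip(unique_test, counts_test)))
```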
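Iris is perfectly balanced, so the benefit of stratification is easier to see on imbalanced labels. The sketch below uses a small synthetic dataset (an assumption for illustration, not part of the lesson's data) with a 90/10 class ratio and compares a plain split with a stratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

# Plain random split: the minority share in the test set can drift from 10%
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: each class keeps its overall proportion
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("Minority count in test (plain):", int(y_test_plain.sum()))
print("Minority count in test (stratified):", int(y_test_strat.sum()))
```

With stratification, the 20-sample test set contains 2 minority samples, matching the 10% overall share. The plain split gives no such guarantee, and with rarer classes it can leave the test set with very few minority samples or none at all.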

<br/>

## Key Takeaways

* Always *split* your data before training so that overfitting shows up on held-out data.
* Use `train_test_split()` — it’s simple, flexible, and built into Scikit-learn.
* Apply `stratify=y` for classification to preserve label proportions.
* Set `random_state` for consistent, reproducible results.

Splitting a dataset into training and testing sets is crucial for assessing a model's ability to generalize beyond the training data. It exposes overfitting, where a model memorizes the training data rather than learning general rules, because performance is measured on data the model never saw during training.
