Train-Test Split¶
Model evaluation is a critical part of the machine learning workflow. To assess how well a model generalizes to unseen data, we typically divide the original dataset into two parts: a training set used to build the model and a test set used to evaluate it.
Mathematical Context¶
Given a dataset $\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^n$ of size $n$, the goal is to partition it into two disjoint subsets:
- Training Set ($\mathcal{D}_{train}$): $(1 - \alpha) \times n$ samples.
- Test Set ($\mathcal{D}_{test}$): $\alpha \times n$ samples.
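Here, $\alpha$ is the test fraction, passed as test_size (commonly between 0.1 and 0.3). When $\alpha \times n$ is not an integer, the two counts are rounded to whole numbers. Together, the two subsets satisfy: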
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{test}, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset$$
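The partition above can be checked numerically on index arrays. The following is a minimal NumPy sketch, not sorix's implementation: `partition_indices` is a hypothetical helper, and rounding $\alpha \times n$ to the nearest integer is an assumption made only for this illustration.

```python
import numpy as np


def partition_indices(n, alpha, seed=0):
    """Split the indices 0..n-1 into disjoint train and test index arrays.

    Hypothetical helper for illustration only; the rounding rule
    (nearest integer) is an assumption, not sorix's documented behavior.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n_test = int(round(alpha * n))      # size of D_test
    rng = np.random.default_rng(seed)   # seeded generator for repeatability
    perm = rng.permutation(n)           # random ordering of all indices
    test_idx = perm[:n_test]            # first n_test shuffled indices
    train_idx = perm[n_test:]           # everything that is left
    return train_idx, test_idx


train_idx, test_idx = partition_indices(n=100, alpha=0.2)

print(len(train_idx), len(test_idx))  # 80 20

# Intersection is empty: no sample appears in both subsets.
print(np.intersect1d(train_idx, test_idx).size == 0)  # True

# Union recovers every index: no sample is lost.
print(np.array_equal(np.union1d(train_idx, test_idx), np.arange(100)))  # True
```

Working with indices rather than the rows themselves keeps the check independent of the data type, so the same idea applies to arrays and DataFrames alike.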
Using train_test_split in sorix¶
sorix provides a utility function in sorix.model_selection to perform this split quickly and reproducibly.
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'
import pandas as pd
import numpy as np
from sorix.model_selection import train_test_split
# Create a synthetic dataset
data = {
'feature1': np.random.randn(100),
'feature2': np.random.randn(100),
'target': np.random.randint(0, 2, 100)
}
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']
print(f"Original dataset size: {len(X)}")
Original dataset size: 100
Basic Split¶
By default, train_test_split shuffles the data before splitting, so that an ordered dataset (for example, one sorted by class) is less likely to produce unrepresentative sets.
# Perform split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True
)
print(f"X_train size: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"X_test size: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")
X_train size: 80 (80%)
X_test size: 20 (20%)
Parameters¶
- test_size: Float between 0 and 1. The proportion of the dataset to include in the test split.
- random_state: Seed for the random number generator used for shuffling. Fixing it makes the split reproducible.
- shuffle: Boolean. Whether to shuffle the data before splitting. If False, the split is sequential: the first $(1 - \alpha) \times n$ samples form the training set and the remaining samples form the test set.
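To make the effect of these parameters concrete, here is a small NumPy sketch of the semantics described above. `split_indices` is a hypothetical stand-in written for this illustration, not the sorix function: its rounding rule and random number generator are assumptions, so the shuffled indices it produces will not necessarily match those of train_test_split.

```python
import numpy as np


def split_indices(n, test_size=0.2, shuffle=True, random_state=None):
    """Return (train_idx, test_idx) for a dataset of n samples.

    Hypothetical illustration of the parameter semantics; not sorix code.
    """
    n_test = int(round(test_size * n))  # assumed rounding rule
    n_train = n - n_test
    idx = np.arange(n)                  # sequential order by default
    if shuffle:
        # The same random_state always yields the same permutation.
        idx = np.random.default_rng(random_state).permutation(n)
    return idx[:n_train], idx[n_train:]


# shuffle=False: the first (1 - alpha) * n samples go to training.
seq_train, seq_test = split_indices(10, test_size=0.3, shuffle=False)
print(seq_train)  # [0 1 2 3 4 5 6]
print(seq_test)   # [7 8 9]

# Same random_state: two calls give identical splits.
a_train, a_test = split_indices(10, test_size=0.3, random_state=42)
b_train, b_test = split_indices(10, test_size=0.3, random_state=42)
print(np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test))  # True
```

A sequential split is mainly useful when the order of the rows carries meaning; for data stored in arbitrary or sorted order, shuffling with a fixed seed is usually the safer choice.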
Conclusion¶
Always perform a train_test_split before training your models in sorix. Holding out a test set does not prevent overfitting by itself, but it lets you detect it and gives an honest estimate of performance on unseen data.