Train-Test Split

Model evaluation is a critical part of the machine learning workflow. To assess how well a model generalizes to unseen data, we typically divide the original dataset into two parts: a training set used to build the model and a test set used to evaluate it.

Mathematical Context

Given a dataset $\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^n$ of size $n$, the goal is to partition it into two disjoint subsets:

Training Set ($\mathcal{D}_{train}$): $(1 - \alpha) \times n$ samples.
Test Set ($\mathcal{D}_{test}$): $\alpha \times n$ samples.

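Here $\alpha \in (0, 1)$ is the fraction of samples held out for testing, set through the test_size parameter (typically between 0.1 and 0.3). Together the two subsets cover the whole dataset without overlap: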

$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{test}, \quad \mathcal{D}_{train} \cap \mathcal{D}_{test} = \emptyset$$
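
The partition above can be sketched with plain NumPy index arrays. This is only an illustration of the math, not sorix's implementation: the helper name `partition_indices` is hypothetical, and rounding $\alpha \times n$ to the nearest integer is an assumption (sorix may round differently when $\alpha \times n$ is not a whole number).

```python
import numpy as np


def partition_indices(n, alpha, seed=0):
    """Hypothetical helper: split range(n) into disjoint train/test index arrays."""
    # Assumed rounding rule: nearest integer number of test samples.
    n_test = int(round(alpha * n))
    n_train = n - n_test
    # Shuffle the indices 0..n-1, then cut the permutation in two.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return perm[:n_train], perm[n_train:]


train_idx, test_idx = partition_indices(n=100, alpha=0.2)
print(len(train_idx), len(test_idx))  # 80 20

# The intersection is empty and the union recovers every index.
print(np.intersect1d(train_idx, test_idx).size)  # 0
print(np.array_equal(np.union1d(train_idx, test_idx), np.arange(100)))  # True
```

Because both subsets are slices of a single permutation, disjointness and full coverage hold by construction, whatever seed is used.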

Using train_test_split in sorix

sorix provides a utility function in sorix.model_selection to perform this split quickly and reproducibly.

In [1]:

# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@main'

In [2]:

import pandas as pd
import numpy as np
from sorix.model_selection import train_test_split

# Create a synthetic dataset
data = {
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'target': np.random.randint(0, 2, 100)
}
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

print(f"Original dataset size: {len(X)}")

Original dataset size: 100

Basic Split

By default, train_test_split shuffles the data before splitting, which makes both subsets more likely to be representative of the full dataset.

In [3]:

# Perform split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print(f"X_train size: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"X_test  size: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

X_train size: 80 (80%)
X_test  size: 20 (20%)

Parameters

test_size: Float between 0 and 1. Represents the proportion of the dataset to include in the test split.
random_state: Seed used by the random number generator for shuffling. Ensures reproducibility.
shuffle: Boolean. Whether to shuffle the data before splitting. If False, the split is sequential: the first $(1 - \alpha) \times n$ samples go to the training set and the remaining $\alpha \times n$ samples go to the test set.
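
To make the effect of these parameters concrete, here is a minimal NumPy sketch of the index logic they describe. It is a stand-in written for illustration, not sorix's actual code: the function name `sketch_split` and the nearest-integer rounding of the test count are assumptions.

```python
import numpy as np


def sketch_split(n, test_size=0.2, random_state=None, shuffle=True):
    """Hypothetical stand-in for the index logic; not sorix's implementation."""
    indices = np.arange(n)
    if shuffle:
        # A fixed seed yields the same permutation on every call.
        rng = np.random.default_rng(random_state)
        indices = rng.permutation(n)
    # Assumed rounding rule: nearest integer number of test samples.
    n_train = n - int(round(test_size * n))
    return indices[:n_train], indices[n_train:]


# shuffle=False: sequential split, first rows train, last rows test.
seq_train, seq_test = sketch_split(10, test_size=0.2, shuffle=False)
print(seq_train)  # [0 1 2 3 4 5 6 7]
print(seq_test)   # [8 9]

# Same random_state: identical split on repeated calls.
a_train, a_test = sketch_split(10, test_size=0.2, random_state=42)
b_train, b_test = sketch_split(10, test_size=0.2, random_state=42)
print(np.array_equal(a_test, b_test))  # True
```

A sequential split is mainly useful when row order carries meaning (for example, time-ordered data); for data sorted by class or by source, it can leave the test set unrepresentative.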

Conclusion

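Always split your data with train_test_split before training your models in sorix. Holding out a test set does not by itself prevent overfitting, but it lets you detect it and gives an honest estimate of how the model performs on unseen data.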