train_test

sorix.model_selection.train_test ¶

train_test_split ¶

train_test_split(
    X, Y=None, test_size=0.2, random_state=42, shuffle=True
)

Method to split the data into train and test sets.

Parameters:

X (DataFrame) –

Features.
Y (Series, default: None ) –

Labels.
test_size (float, default: 0.2 ) –

Proportion of the dataset to include in the test split. Default is 0.2.
random_state (int, default: 42 ) –

Random state for reproducibility. Default is 42.
shuffle (bool, default: True ) –

Whether to shuffle the data before splitting. Default is True.

Returns:

tuple ( tuple ) –

(X_train, X_test, y_train, y_test)

Source code in sorix/model_selection/train_test.py

def train_test_split(X: pd.DataFrame,
                     Y: pd.Series = None,
                     test_size: float = 0.2,
                     random_state: int = 42,
                     shuffle: bool = True) -> tuple:
    """
    Method to split the data into train and test sets.

    Args:
        X (pd.DataFrame): Features.
        Y (pd.Series): Labels.
        test_size (float, optional): Proportion of the dataset to include in the test split. Default is 0.2.
        random_state (int, optional): Random state for reproducibility. Default is 42.
        shuffle (bool, optional): Whether to shuffle the data before splitting. Default is True.

    Returns:
        tuple: (X_train, X_test, y_train, y_test)
    """

    # Validate test_size
    if not (0 < test_size < 1):
        raise ValueError("test_size must be between 0 and 1.")

    # Set random seed for reproducibility
    np.random.seed(random_state)

    # Shuffle the data if specified
    if shuffle:
        indices = np.random.permutation(len(X))
        X = X.iloc[indices]
        if Y is not None:
            Y = Y.iloc[indices]

    # Calculate the split index
    split_index = int(len(X) * (1 - test_size))

    # Split the data
    X_train = X.iloc[:split_index]
    X_test = X.iloc[split_index:]
    if Y is not None:
        Y_train = Y.iloc[:split_index]
        Y_test = Y.iloc[split_index:]
        return X_train, X_test, Y_train, Y_test

    return X_train, X_test