Data Processing

OneHotEncoding(data, columns, drop=True, verbose=False)[source]

Performs One Hot Encoding (OHE), usually used in Machine Learning.

  • data (pandas.DataFrame) – Data Frame on which we apply One Hot Encoding.

  • columns (list) – Columns to be converted to dummy variables.

  • drop (bool, optional) – Drop the column for one attribute (first value that appears in the dataset). This helps avoid multicollinearity issues in subsequent models, defaults to True.

  • verbose (bool, optional) – Display progression, defaults to False.

>>> from import OneHotEncoding
>>> print(df)
... +----+--------+----------+-----+
... | Id | Gender | Category | Age |
... +----+--------+----------+-----+
... |  1 | Male   |        A |  23 |
... |  2 | Female |        B |  21 |
... |  3 | Female |        A |  31 |
... |  4 | Male   |        C |  22 |
... |  5 | Female |        A |  26 |
... +----+--------+----------+-----+
>>> # Encoding columns "Gender" and "Category"
>>> new_df = OneHotEncoding(df, columns=["Gender", "Category"])
>>> print(new_df)
... +----+---------------+------------+------------+-----+
... | Id | Gender_Female | Category_B | Category_C | Age |
... +----+---------------+------------+------------+-----+
... |  1 |             0 |          0 |          0 |  23 |
... |  2 |             1 |          1 |          0 |  21 |
... |  3 |             1 |          0 |          0 |  31 |
... |  4 |             0 |          0 |          1 |  22 |
... |  5 |             1 |          0 |          0 |  26 |
... +----+---------------+------------+------------+-----+
>>> # Listing the newly created columns
>>> print(new_df.meta._ohe)
... {'Gender': ['Gender_Female'],
...  'Category': ['Category_B', 'Category_C']}
>>> # Get the aggregated list of encoded columns
>>> print(new_df.meta._ohe_all_columns)
... ['Gender_Female', 'Category_B', 'Category_C']

Transformed data with One Hot Encoded variables. New attributes are added to the data frame:

  • df.meta._ohe: contains the encoded columns and the created columns.

  • df.meta._ohe_all_columns: aggregates the newly created columns in one list. This list can directly be passed or appended to the input columns argument of subsequent models.
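
To make the drop behaviour concrete, here is a minimal pandas sketch of the encoding described above. It is an illustration under assumptions, not the library's implementation: the helper name one_hot_sketch is hypothetical, and it returns the mapping of created columns as a plain dictionary instead of attaching it to the data frame.

```python
import pandas as pd


def one_hot_sketch(df, columns, drop=True):
    """Hypothetical helper: one-hot encode `columns` of `df` with plain pandas."""
    out = df.copy()
    created = {}
    for col in columns:
        # pd.unique keeps the order of first appearance in the data
        levels = list(pd.unique(out[col]))
        if drop:
            # Skip the first value that appears to avoid perfectly collinear dummies
            levels = levels[1:]
        created[col] = []
        for level in levels:
            name = f"{col}_{level}"
            out[name] = (out[col] == level).astype(int)
            created[col].append(name)
        # The original categorical column is replaced by its dummies
        out = out.drop(columns=col)
    return out, created


df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Category": ["A", "B", "A", "C", "A"],
    "Age": [23, 21, 31, 22, 26],
})
encoded, created = one_hot_sketch(df, columns=["Gender", "Category"])
print(created)
```

In this sketch the new columns are appended at the end of the frame, so the column order differs from the table shown in the example above.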

Return type

pandas.DataFrame

class Scaler(data, columns)[source]

Bases: object

Data scaler.

  • data (pandas.DataFrame) – Data set to scale.

  • columns (list) – Columns to scale.

>>> from import Scaler, generate_dataset
>>> coeffs = [1.2556, -0.465, 1.665414, 2.5444, -7.56445]
>>> data = generate_dataset(coeffs, n=10, std_dev=2.6)
>>> # Original dataset
>>> print(data)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Load scaler class
>>> scaler = Scaler(data=data, columns=['X1', 'X2'])
>>> # Scale our dataset with MinMax method
>>> scaled_df = scaler.MinMax()
>>> print(scaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.000000 |  0.000000 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  0.765898 |  0.185088 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 |  0.039781 |  0.552904 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 |  0.000000 |  1.000000 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 |  0.207162 |  0.163399 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 |  0.249221 |  0.616890 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 |  0.351444 |  0.319501 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 |  0.175148 |  0.399244 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 |  0.273307 |  0.268801 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.491445 |  0.030328 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Unscale the new dataset to retrieve the previous data scale
>>> unscaled_df = scaler.unscaleMinMax(scaled_df)
>>> print(unscaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
MinMax(data=None, columns=None, feature_range=(0, 1), col_suffix='')[source]

Min-max scaler. By default, the scaled data range between 0 and 1.

  • data (pandas.DataFrame, optional) – Data set to scale, defaults to None, in which case the data provided in __init__() is used.

  • columns (list, optional) – Columns to be scaled, defaults to None, in which case the list provided in __init__() is used.

  • feature_range (tuple, optional) – Expected value range of the scaled data, defaults to (0, 1).

  • col_suffix (str, optional) – Suffix to add to column names, defaults to ‘’, in which case the existing columns are overwritten.

\[x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)} \cdot (f\_max - f\_min) + f\_min\]

where \((f\_min, f\_max)\) defaults to \((0, 1)\) and corresponds to the expected data range of the scaled data from argument feature_range.
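
As a quick check of the formula, the NumPy sketch below applies it to the X1 column from the Scaler example above. The function name min_max is a hypothetical stand-in used for illustration only; it is not the method's actual code.

```python
import numpy as np


def min_max(x, feature_range=(0, 1)):
    """Scale a 1-D array to feature_range with the min-max formula."""
    x = np.asarray(x, dtype=float)
    f_min, f_max = feature_range
    return (x - x.min()) / (x.max() - x.min()) * (f_max - f_min) + f_min


# X1 column of the example dataset
x1 = [1.669510, 1.025392, -0.972473, -1.081926, -0.511933,
      -0.396210, -0.114952, -0.600019, -0.329940, 0.270254]

scaled = min_max(x1)
print(scaled.round(6))

# A different target range only changes the final affine step
scaled_sym = min_max(x1, feature_range=(-1, 1))
```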


Data set with scaled features.

Return type

pandas.DataFrame

Normalize(center=True, reduce=True, data=None, columns=None, col_suffix='')[source]

Data normalizer. Centers and reduces features (from mean and standard deviation).

  • center (bool, optional) – Center the variable, i.e. subtract the mean, defaults to True.

  • reduce (bool, optional) – Reduce the variable, i.e. divide by the standard deviation, defaults to True.

  • data (pandas.DataFrame, optional) – Data set to normalize, defaults to None, in which case the data provided in __init__() is used.

  • columns (list, optional) – Columns to be normalized, defaults to None, in which case the list provided in __init__() is used.

  • col_suffix (str, optional) – Suffix to add to column names, defaults to ‘’

\[x_{\text{scaled}} = \dfrac{x - \bar{x}}{\sqrt{\mathbb{V}(x)}}\]
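
The effect of the center and reduce flags can be sketched with NumPy as follows. This is only an illustration: the function name normalize is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the formula above does not say which variance estimator is used.

```python
import numpy as np


def normalize(x, center=True, reduce=True):
    """Center and/or reduce a 1-D array, following the formula above."""
    x = np.asarray(x, dtype=float)
    mean = x.mean() if center else 0.0
    # Assumption: population standard deviation (NumPy default, ddof=0)
    std = x.std() if reduce else 1.0
    return (x - mean) / std


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
z = normalize(x)
centered_only = normalize(x, reduce=False)
print(z)
```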

Data set with normalized features.

Return type

pandas.DataFrame

unscaleMinMax(data=None, columns=None, columns_mapping={})[source]

Unscale from min-max. Retrieves data in the same range as the original features.

  • data (pandas.DataFrame, optional) – Data set to unscale, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be unscaled, defaults to None, takes the list provided in __init__().

  • columns_mapping (dict, optional) – Mapping between any renamed columns and the original scaled columns.

\[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \left(\max(x) - \min(x) \right) + \min(x)\]
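
The formula above inverts the scaling for a feature range of (0, 1). The NumPy sketch below checks the round trip and shows how the inverse would generalise to another range by first undoing the affine step. Both helper names are hypothetical, and the handling of other ranges is an assumption made for illustration, not a description of the method's code.

```python
import numpy as np


def min_max(x, feature_range=(0, 1)):
    """Return the scaled array together with the original min and max."""
    x = np.asarray(x, dtype=float)
    f_min, f_max = feature_range
    scaled = (x - x.min()) / (x.max() - x.min()) * (f_max - f_min) + f_min
    return scaled, x.min(), x.max()


def unscale_min_max(x_scaled, x_min, x_max, feature_range=(0, 1)):
    """Map scaled values back to the original range."""
    f_min, f_max = feature_range
    # With feature_range=(0, 1) this reduces to x_scaled * (max - min) + min
    return (x_scaled - f_min) / (f_max - f_min) * (x_max - x_min) + x_min


x = np.array([-1.385569, -0.639291, 0.843746, 2.646441])
scaled, x_min, x_max = min_max(x, feature_range=(-1, 1))
restored = unscale_min_max(scaled, x_min, x_max, feature_range=(-1, 1))
print(restored)
```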

Unscaled data set.

Return type

pandas.DataFrame

unscaleNormalize(data=None, columns=None, columns_mapping={})[source]

Denormalize data to retrieve the same range as the original data set.

  • data (pandas.DataFrame, optional) – Data set to unscale, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be unscaled, defaults to None, takes the list provided in __init__().

  • columns_mapping (dict, optional) – Mapping between any renamed columns and the original scaled columns.

\[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \sqrt{\mathbb{V}(x)} + \bar{x}\]
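
A typical use of this inverse is to bring values computed on the normalized scale, such as model predictions, back to the original units. The sketch below is a hedged illustration with NumPy: the variable names are hypothetical and the population standard deviation (ddof=0) is assumed.

```python
import numpy as np

y = np.array([8.501692, -7.186085, 1.939924, 0.389257, -0.125787])

# Statistics of the original series, kept for the inverse transform
y_mean, y_std = y.mean(), y.std()  # assumption: ddof=0

y_scaled = (y - y_mean) / y_std

# Apply the formula above to return to the original scale
y_restored = y_scaled * y_std + y_mean
print(y_restored)
```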

De-normalized data set.

Return type

pandas.DataFrame

…(data, n_in=1, n_out=1, dropnan=True)[source]

Function to convert a DataFrame into a multivariate time series format readable by Keras LSTM.

  • data (pandas.DataFrame) – DataFrame on which to apply the transformation.

  • n_in (int, optional) – Input dimension also known as look back or size of the window, defaults to 1

  • n_out (int, optional) – Output dimension, defaults to 1

  • dropnan (bool, optional) – Remove empty values in the series, defaults to True
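
One common way to build such a format is to place lagged copies of the frame side by side with pandas shift. The sketch below illustrates the roles of n_in, n_out and dropnan under that assumption; the helper name lagged_frame and the column naming scheme are hypothetical and may differ from what the function actually produces.

```python
import pandas as pd


def lagged_frame(data, n_in=1, n_out=1, dropnan=True):
    """Hypothetical helper: lagged inputs (t-n_in..t-1) followed by outputs (t..t+n_out-1)."""
    frames, names = [], []
    # Input window: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        frames.append(data.shift(i))
        names += [f"{c}(t-{i})" for c in data.columns]
    # Output window: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        frames.append(data.shift(-i))
        names += [f"{c}(t)" if i == 0 else f"{c}(t+{i})" for c in data.columns]
    out = pd.concat(frames, axis=1)
    out.columns = names
    # Shifting creates empty values at the edges of the series
    return out.dropna() if dropnan else out


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
lagged = lagged_frame(df, n_in=1, n_out=1)
print(lagged)
```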


Features converted for Keras LSTM.

Return type


multivariate_time_series(data)

Convert a DataFrame into a NumPy array of multivariate time series.


data (pandas.DataFrame) – Input data to transform.

>>> from import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)

Transformed multivariate time series data.

Return type

numpy.ndarray

parse_formula(formula, data, check_values=True, return_all=False)[source]

This function is used in regression models to apply transformations to the data from a formula given as a str. See below for examples.

  • formula (str) –

    Regression formula to be run, of the form y ~ x1 + x2. Accepted functions are:

    • \(\log(x)\) : log(X)

    • \(\exp(x)\) : exp(X)

    • \(\sqrt{x}\) : sqrt(X)

    • \(\cos(x)\) : cos(X)

    • \(\sin(x)\) : sin(X)

    • \(x^{z}\) : X ** Z

    • \(\dfrac{x}{z}\) : X/Z

    • \(x \times z\) : X*Z

  • data (pandas.DataFrame) – Data on which to perform the transformations.

  • check_values (bool, optional) – For each transformation, check whether the data range satisfies the domain of definition of the function, defaults to True.

  • return_all (bool, optional) – Returns the transformed data, column Y and columns X, defaults to False.


>>> from import parse_formula
>>> print(input_df)
... +-----------+-----------+-----------+
... |        X1 |        X2 |         Y |
... +-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |
... |  1.155661 |  0.391129 | -7.780989 |
... | -0.299251 | -0.445602 | -8.146673 |
... | -0.978311 |  1.312146 |  8.653818 |
... | -0.225917 |  0.522016 | -9.684332 |
... +-----------+-----------+-----------+
>>> form = 'Y ~ X1 + X2 + exp(X2) + X1*X2'
>>> new_df = parse_formula(form, data=input_df)
>>> print(new_df)
... +-----------+-----------+-----------+-----------+-----------+
... |        X1 |        X2 |         Y |   exp(X2) |     X1*X2 |
... +-----------+-----------+-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |  1.976017 |  0.378066 |
... |  1.155661 |  0.391129 | -7.780989 |  1.478649 |  0.452012 |
... | -0.299251 | -0.445602 | -8.146673 |  0.640438 |  0.133347 |
... | -0.978311 |  1.312146 |  8.653818 |  3.714134 | -1.283687 |
... | -0.225917 |  0.522016 | -9.684332 |  1.685422 | -0.117932 |
... +-----------+-----------+-----------+-----------+-----------+

ValueError – Raised when the data do not satisfy the domain of definition of the requested transformation.
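
The kind of domain check that check_values refers to can be sketched as follows. This is an illustration under assumptions rather than the function's actual logic: the TRANSFORMS table and the apply_transform helper are hypothetical, and only three of the accepted functions are covered.

```python
import numpy as np
import pandas as pd

# Hypothetical table: function name -> (NumPy function, domain check on a Series)
TRANSFORMS = {
    "log": (np.log, lambda s: bool((s > 0).all())),
    "sqrt": (np.sqrt, lambda s: bool((s >= 0).all())),
    "exp": (np.exp, lambda s: True),
}


def apply_transform(data, func, column, check_values=True):
    """Add a column named like 'exp(X2)' after checking the function's domain."""
    fn, in_domain = TRANSFORMS[func]
    if check_values and not in_domain(data[column]):
        raise ValueError(f"{func}({column}) is undefined for some values of {column}")
    out = data.copy()
    out[f"{func}({column})"] = fn(data[column])
    return out


df = pd.DataFrame({"X2": [0.681083, 0.391129, -0.445602]})
with_exp = apply_transform(df, "exp", "X2")
print(with_exp)

try:
    apply_transform(df, "log", "X2")  # X2 contains a negative value
except ValueError as err:
    print(err)
```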


Transformed data set

Return type

pandas.DataFrame

rankdata(x)

Assigns ranks to data. This is mainly used for analyses such as Spearman’s correlation.


x (numpy.array) – Input vector. Format can be numpy.array, list or pandas.Series.


>>> rankdata([2., 5.44, 3.93, 3.3, 1.1])
... array([1, 4, 3, 2, 0])
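
Zero-based ranks like those in the example can be obtained from a single sort. The NumPy sketch below is a minimal illustration with a hypothetical function name; it does not average tied values, and how the library treats ties is not stated here.

```python
import numpy as np


def rank_sketch(x):
    """Zero-based rank of each element (ties broken by sort order)."""
    x = np.asarray(x)
    order = np.argsort(x)              # indices that would sort x
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(len(x))   # position in the sorted order = rank
    return ranks


ranks = rank_sketch([2., 5.44, 3.93, 3.3, 1.1])
print(ranks)  # [1 4 3 2 0]
```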

Vector with ranked values.

Return type

numpy.array

split_sequences(data, look_back=1)[source]

Split a multivariate time series from multivariate_time_series() into a Keras-friendly format for LSTM.

  • data (numpy.ndarray) – Data in the format of sequences to transform.

  • look_back (int) – Size of the trailing window, number of time steps to consider, defaults to 1.

>>> from import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)
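
The windowing idea behind look_back can be sketched with NumPy as below. This is not the function's implementation: the helper name split_windows is hypothetical, and it assumes that the last column of the array is the target and that the target is taken at the last time step of each window.

```python
import numpy as np


def split_windows(data, look_back=1):
    """Hypothetical helper: slide a window of `look_back` steps over a 2-D array."""
    X, y = [], []
    for start in range(len(data) - look_back + 1):
        end = start + look_back
        X.append(data[start:end, :-1])  # feature columns over the window
        y.append(data[end - 1, -1])     # assumed target: last column, last step
    return np.array(X), np.array(y)


data = np.arange(30, dtype=float).reshape(10, 3)  # 10 time steps, 2 features + 1 target
X, y = split_windows(data, look_back=3)
print(X.shape, y.shape)  # (8, 3, 2) (8,)
```

The resulting X has the (samples, time steps, features) layout that Keras LSTM layers expect as input.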

  • x: Input data converted for Keras LSTM.

  • y: Target series converted for Keras LSTM.

Return type

  • numpy.ndarray

  • numpy.ndarray