Data Processing

statinf.data.ProcessData.OneHotEncoding(data, columns, drop=True, verbose=False)

Performs One Hot Encoding (OHE), commonly used in Machine Learning.

Parameters
  • data (pandas.DataFrame) – Data Frame on which we apply One Hot Encoding.

  • columns (list) – Columns to be converted to dummy variables.

  • drop (bool, optional) – Drop the column for one attribute (the first value that appears in the dataset). This helps avoid multicollinearity issues in subsequent models, defaults to True.

  • verbose (bool, optional) – Display progression, defaults to False.

Example
>>> from statinf.data import OneHotEncoding
>>> print(df)
... +----+--------+----------+-----+
... | Id | Gender | Category | Age |
... +----+--------+----------+-----+
... |  1 | Male   |        A |  23 |
... |  2 | Female |        B |  21 |
... |  3 | Female |        A |  31 |
... |  4 | Male   |        C |  22 |
... |  5 | Female |        A |  26 |
... +----+--------+----------+-----+
>>> # Encoding columns "Gender" and "Category"
>>> new_df = OneHotEncoding(df, columns=["Gender", "Category"])
>>> print(new_df)
... +----+---------------+------------+------------+-----+
... | Id | Gender_Female | Category_B | Category_C | Age |
... +----+---------------+------------+------------+-----+
... |  1 |             0 |          0 |          0 |  23 |
... |  2 |             1 |          1 |          0 |  21 |
... |  3 |             1 |          0 |          0 |  31 |
... |  4 |             0 |          0 |          1 |  22 |
... |  5 |             1 |          0 |          0 |  26 |
... +----+---------------+------------+------------+-----+
>>> # Listing the newly created columns
>>> print(new_df.meta._ohe)
... {'Gender': ['Gender_Female'],
...  'Category': ['Category_B', 'Category_C']}
>>> # Get the aggregated list of encoded columns
>>> print(new_df.meta._ohe_all_columns)
... ['Gender_Female', 'Category_B', 'Category_C']
Returns

Transformed data with One Hot Encoded variables. New attributes are added to the data frame:

  • df.meta._ohe: maps each encoded column to the list of columns created from it.

  • df.meta._ohe_all_columns: aggregates the newly created columns in one list. This list can directly be passed or appended to the input columns argument of subsequent models.

Return type

pandas.DataFrame
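
As an illustration of the encoding described above, here is a minimal pandas sketch. It is not statinf's implementation: the helper name one_hot_sketch and its return value (a tuple instead of the df.meta attributes) are assumptions made for this example. It drops the first value that appears in each column, as the drop parameter describes.

```python
import pandas as pd


def one_hot_sketch(data, columns, drop=True):
    """Hypothetical helper illustrating one hot encoding (not statinf code)."""
    out = data.copy()
    created = {}
    for col in columns:
        # Categories in order of first appearance in the data
        values = list(pd.unique(out[col]))
        if drop:
            # Drop the first value seen to avoid perfectly collinear dummies
            values = values[1:]
        new_cols = []
        for value in values:
            name = f"{col}_{value}"
            out[name] = (out[col] == value).astype(int)
            new_cols.append(name)
        created[col] = new_cols
        # The original categorical column is replaced by its dummies
        out = out.drop(columns=col)
    return out, created


df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Category": ["A", "B", "A", "C", "A"],
    "Age": [23, 21, 31, 22, 26],
})
encoded, created = one_hot_sketch(df, columns=["Gender", "Category"])
print(created)
```

In this sketch the new columns are appended at the end of the data frame, so the column order may differ from the output shown in the example above.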

class statinf.data.ProcessData.Scaler(data, columns)

Bases: object

Data scaler.

Parameters
  • data (pandas.DataFrame) – Data set to scale.

  • columns (list) – Columns to scale.

Example
>>> from statinf.data import Scaler, generate_dataset
>>> coeffs = [1.2556, -0.465, 1.665414, 2.5444, -7.56445]
>>> data = generate_dataset(coeffs, n=10, std_dev=2.6)
>>> # Original dataset
>>> print(data)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Load scaler class
>>> scaler = Scaler(data=data, columns=['X1', 'X2'])
>>> # Scale our dataset with MinMax method
>>> scaled_df = scaler.MinMax()
>>> print(scaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.000000 |  0.000000 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  0.765898 |  0.185088 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 |  0.039781 |  0.552904 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 |  0.000000 |  1.000000 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 |  0.207162 |  0.163399 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 |  0.249221 |  0.616890 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 |  0.351444 |  0.319501 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 |  0.175148 |  0.399244 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 |  0.273307 |  0.268801 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.491445 |  0.030328 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Unscale the new dataset to retrieve the previous data scale
>>> unscaled_df = scaler.unscaleMinMax(scaled_df)
>>> print(unscaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |        X0 |        X1 |        X2 |        X3 |        X4 |         Y |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
MinMax(data=None, columns=None, feature_range=(0, 1), col_suffix='')

Min-max scaler. By default, the scaled data range between 0 and 1.

Parameters
  • data (pandas.DataFrame, optional) – Data set to scale, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be scaled, defaults to None, takes the list provided in __init__().

  • feature_range (tuple, optional) – Expected value range of the scaled data, defaults to (0, 1).

  • col_suffix (str, optional) – Suffix to add to column names, defaults to ‘’, which overrides the existing columns.

Formula
\[x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)} \cdot (f\_max - f\_min) + f\_min\]

where \((f\_min, f\_max)\) defaults to \((0, 1)\) and corresponds to the expected data range of the scaled data from argument feature_range.

Returns

Data set with scaled features.

Return type

pandas.DataFrame
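
The formula above can be sketched directly with pandas. This is an illustrative example under stated assumptions, not the library's code: the function name min_max_sketch is hypothetical, and the input reuses a few X1 values from the Scaler example, including that column's minimum and maximum.

```python
import pandas as pd


def min_max_sketch(series, feature_range=(0, 1)):
    """Hypothetical helper applying the min-max formula to one column."""
    f_min, f_max = feature_range
    x_min, x_max = series.min(), series.max()
    # (x - min) / (max - min) maps to [0, 1]; then stretch to [f_min, f_max]
    return (series - x_min) / (x_max - x_min) * (f_max - f_min) + f_min


x1 = pd.Series([1.669510, 1.025392, -0.972473, -1.081926])
scaled = min_max_sketch(x1)                         # range [0, 1]
wide = min_max_sketch(x1, feature_range=(-1, 1))    # range [-1, 1]
print(scaled.round(6).tolist())
```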

Normalize(center=True, reduce=True, data=None, columns=None, col_suffix='')

Data normalizer. Centers and reduces features (from mean and standard deviation).

Parameters
  • center (bool, optional) – Center the variable, i.e. subtract the mean, defaults to True.

  • reduce (bool, optional) – Reduce the variable, i.e. divide by the standard deviation, defaults to True.

  • data (pandas.DataFrame, optional) – Data set to normalize, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be normalized, defaults to None, takes the list provided in __init__().

  • col_suffix (str, optional) – Suffix to add to column names, defaults to ‘’.

Formula
\[x_{\text{scaled}} = \dfrac{x - \bar{x}}{\sqrt{\mathbb{V}(x)}}\]
Returns

Data set with normalized features.

Return type

pandas.DataFrame
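
A minimal NumPy sketch of the centering and reduction described above. The helper normalize_sketch is hypothetical, and it assumes the population standard deviation (NumPy's default, ddof=0); the library may use a different variance estimator.

```python
import numpy as np


def normalize_sketch(x, center=True, reduce=True):
    """Hypothetical helper: subtract the mean and/or divide by the std."""
    x = np.asarray(x, dtype=float)
    mean = x.mean() if center else 0.0
    # Assumption: population standard deviation (ddof=0)
    std = x.std() if reduce else 1.0
    return (x - mean) / std


x = [0.977594, -0.953802, -0.148140, 0.499385]
z = normalize_sketch(x)                           # centered and reduced
centered_only = normalize_sketch(x, reduce=False) # centered, original spread
print(z.mean(), z.std())
```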

unscaleMinMax(data=None, columns=None, columns_mapping={})

Unscale from min-max. Retrieves data in the same range as the original features.

Parameters
  • data (pandas.DataFrame, optional) – Data set to unscale, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be unscaled, defaults to None, takes the list provided in __init__().

  • columns_mapping (dict, optional) – Mapping between any renamed columns and the original scaled columns.

Formula
\[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \left(\max(x) - \min(x) \right) + \min(x)\]
Returns

Unscaled data set.

Return type

pandas.DataFrame
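
The inverse formula only works if the minimum and maximum of the original data are kept from the scaling step. The following NumPy round trip is a sketch under that assumption, using the default feature range (0, 1); the variable names are illustrative and not part of the library.

```python
import numpy as np

x = np.array([1.669510, 1.025392, -0.972473, -1.081926])

# Statistics must come from the original data, not from the scaled data
x_min, x_max = x.min(), x.max()

scaled = (x - x_min) / (x_max - x_min)
restored = scaled * (x_max - x_min) + x_min

# The round trip recovers the original values up to floating-point error
print(np.allclose(restored, x))
```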

unscaleNormalize(data=None, columns=None, columns_mapping={})

Denormalize data to retrieve the same range as the original data set.

Parameters
  • data (pandas.DataFrame, optional) – Data set to unscale, defaults to None, takes data provided in __init__().

  • columns (list, optional) – Columns to be unscaled, defaults to None, takes the list provided in __init__().

  • columns_mapping (dict, optional) – Mapping between any renamed columns and the original scaled columns.

Formula
\[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \sqrt{\mathbb{V}(x)} + \bar{x}\]
Returns

De-normalized data set.

Return type

pandas.DataFrame
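
A short NumPy sketch of the de-normalization formula, assuming the mean and standard deviation were stored when the data were normalized and that the population standard deviation is used. This illustrates the formula only and is not the library's code.

```python
import numpy as np

x = np.array([8.501692, -7.186085, 1.939924, 0.389257])

# Store the statistics of the original data at normalization time
mean, std = x.mean(), x.std()

z = (x - mean) / std          # normalized values
restored = z * std + mean     # de-normalized values

print(np.allclose(restored, x))
```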

statinf.data.ProcessData.create_dataset(data, n_in=1, n_out=1, dropnan=True)

Converts a DataFrame into a multivariate time series format readable by Keras LSTM.

Parameters
  • data (pandas.DataFrame) – DataFrame on which to apply the transformation.

  • n_in (int, optional) – Input dimension also known as look back or size of the window, defaults to 1

  • n_out (int, optional) – Output dimension, defaults to 1

  • dropnan (bool, optional) – Remove empty values in the series, defaults to True

Returns

Features converted for Keras LSTM.

Return type

pandas.DataFrame
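
One common way to build such a look-back frame is to concatenate shifted copies of the data. The sketch below illustrates that general technique; the helper name to_supervised_sketch and the column naming scheme are assumptions, and the actual output layout of create_dataset may differ.

```python
import pandas as pd


def to_supervised_sketch(data, n_in=1, n_out=1, dropnan=True):
    """Hypothetical helper framing a series as lagged inputs and outputs."""
    frames, names = [], []
    # Lagged inputs: t-n_in, ..., t-1
    for lag in range(n_in, 0, -1):
        frames.append(data.shift(lag))
        names += [f"{col}(t-{lag})" for col in data.columns]
    # Outputs: t, t+1, ..., t+n_out-1
    for step in range(n_out):
        frames.append(data.shift(-step))
        suffix = "(t)" if step == 0 else f"(t+{step})"
        names += [f"{col}{suffix}" for col in data.columns]
    out = pd.concat(frames, axis=1)
    out.columns = names
    # Shifting creates incomplete rows at the edges of the series
    return out.dropna() if dropnan else out


series = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
framed = to_supervised_sketch(series, n_in=2, n_out=1)
print(framed)
```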

statinf.data.ProcessData.multivariate_time_series(data)

Convert a dataframe into numpy array multivariate time series.

Parameters

data (pandas.DataFrame) – Input data to transform.

Example
>>> from statinf.data import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)
Returns

Transformed multivariate time series data.

Return type

numpy.ndarray

statinf.data.ProcessData.parse_formula(formula, data, check_values=True, return_all=False)

This function is used in regression models to apply transformations to the data from a formula given as a str. See below for examples.

Parameters
  • formula (str) –

    Regression formula to be run, of the form y ~ x1 + x2. Accepted functions are:

    • \(\log(x)\) : log(X)

    • \(\exp(x)\) : exp(X)

    • \(\sqrt{x}\) : sqrt(X)

    • \(\cos(x)\) : cos(X)

    • \(\sin(x)\) : sin(X)

    • \(x^{z}\) : X ** Z

    • \(\dfrac{x}{z}\) : X/Z

    • \(x \times z\) : X*Z

  • data (pandas.DataFrame) – Data on which to perform the transformations.

  • check_values (bool, optional) – For each transformation, check whether the data range satisfies the domain of definition of the function, defaults to True.

  • return_all (bool, optional) – Returns the transformed data, column Y and columns X, defaults to False.

Example

>>> from statinf.data import parse_formula
>>> print(input_df)
... +-----------+-----------+-----------+
... |        X1 |        X2 |         Y |
... +-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |
... |  1.155661 |  0.391129 | -7.780989 |
... | -0.299251 | -0.445602 | -8.146673 |
... | -0.978311 |  1.312146 |  8.653818 |
... | -0.225917 |  0.522016 | -9.684332 |
... +-----------+-----------+-----------+
>>> form = 'Y ~ X1 + X2 + exp(X2) + X1*X2'
>>> new_df = parse_formula(form, data=input_df)
>>> print(new_df)
... +-----------+-----------+-----------+-----------+-----------+
... |        X1 |        X2 |         Y |   exp(X2) |     X1*X2 |
... +-----------+-----------+-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |  1.976017 |  0.378066 |
... |  1.155661 |  0.391129 | -7.780989 |  1.478649 |  0.452012 |
... | -0.299251 | -0.445602 | -8.146673 |  0.640438 |  0.133347 |
... | -0.978311 |  1.312146 |  8.653818 |  3.714134 | -1.283687 |
... | -0.225917 |  0.522016 | -9.684332 |  1.685422 | -0.117932 |
... +-----------+-----------+-----------+-----------+-----------+
Raises

ValueError – Raised when the data do not satisfy the domain of definition of the requested transformation.

Returns

Transformed data set

Return type

pandas.DataFrame
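
The columns produced in the example above can be reproduced by hand with NumPy and pandas. The sketch below is not the formula parser itself: it only computes the two derived columns for the first three rows of the example data and illustrates the kind of domain check that check_values describes, through a hypothetical helper checked_log.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({
    "X1": [0.555096, 1.155661, -0.299251],
    "X2": [0.681083, 0.391129, -0.445602],
})


def checked_log(series):
    """Hypothetical domain check: log is only defined for positive values."""
    if (series <= 0).any():
        raise ValueError("log requires strictly positive values")
    return np.log(series)


new_df = input_df.copy()
# Terms of 'Y ~ X1 + X2 + exp(X2) + X1*X2' that need a new column
new_df["exp(X2)"] = np.exp(new_df["X2"])
new_df["X1*X2"] = new_df["X1"] * new_df["X2"]
print(new_df.round(6))
```

Calling checked_log(input_df["X1"]) raises a ValueError here because X1 contains a negative value.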

statinf.data.ProcessData.rankdata(x)

Assigns rank to data. This is mainly used for analysis like Spearman’s correlation.

Parameters

x (numpy.array) – Input vector. Format can be numpy.array, list or pandas.Series.

Example

>>> rankdata([2., 5.44, 3.93, 3.3, 1.1])
... array([1, 4, 3, 2, 0])
Returns

Vector with ranked values.

Return type

numpy.array
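
Zero-based ranks like those in the example can be obtained in NumPy by applying argsort twice. This is a sketch of the general technique and may not match how rankdata is implemented; in particular, it breaks ties by sort order rather than averaging them.

```python
import numpy as np

x = np.array([2., 5.44, 3.93, 3.3, 1.1])

# The first argsort gives the indices that sort x; the second gives,
# for each element, its position in that sorted order (its rank)
ranks = np.argsort(np.argsort(x))
print(ranks)
```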

statinf.data.ProcessData.split_sequences(data, look_back=1)

Split a multivariate time series from statinf.data.ProcessData.multivariate_time_series() into a Keras-friendly format for LSTM.

Parameters
  • data (numpy.ndarray) – Data in the format of sequences to transform.

  • look_back (int) – Size of the trailing window, number of time steps to consider, defaults to 1.

Example
>>> from statinf.data import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)
Returns

  • x: Input data converted for Keras LSTM.

  • y: Target series converted for Keras LSTM.

Return type

  • numpy.ndarray

  • numpy.ndarray
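
A sliding-window split of this kind can be sketched in NumPy as follows. The helper split_sequences_sketch is hypothetical, and it assumes that the last column of the array is the target and that each window predicts the target at its final time step; the library's own convention may differ.

```python
import numpy as np


def split_sequences_sketch(data, look_back=1):
    """Hypothetical helper: windows of features and the matching target."""
    X, y = [], []
    for start in range(len(data) - look_back + 1):
        end = start + look_back
        # Assumption: all columns but the last are features
        X.append(data[start:end, :-1])
        # Assumption: the target is the last column at the window's end
        y.append(data[end - 1, -1])
    return np.array(X), np.array(y)


steps = np.arange(10)
arr = np.column_stack([steps, steps * 2, steps * 3])
X, y = split_sequences_sketch(arr, look_back=3)
print(X.shape, y.shape)
```

The resulting X has shape (samples, look_back, features), which is the three-dimensional input layout Keras LSTM layers expect.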