Data Processing
- statinf.data.ProcessData.OneHotEncoding(data, columns, drop=True, verbose=False)[source]
Performs One Hot Encoding (OHE), usually used in Machine Learning.
- Parameters
  - data (pandas.DataFrame) – Data frame on which to apply One Hot Encoding.
  - columns (list) – Columns to be converted to dummy variables.
  - drop (bool, optional) – Drop the column for one attribute (the first value that appears in the dataset). This helps avoid multicollinearity issues in subsequent models, defaults to True.
  - verbose (bool, optional) – Display progression, defaults to False.
- Example
>>> from statinf.data import OneHotEncoding
>>> print(df)
... +----+--------+----------+-----+
... | Id | Gender | Category | Age |
... +----+--------+----------+-----+
... | 1  | Male   | A        | 23  |
... | 2  | Female | B        | 21  |
... | 3  | Female | A        | 31  |
... | 4  | Male   | C        | 22  |
... | 5  | Female | A        | 26  |
... +----+--------+----------+-----+
>>> # Encoding columns "Gender" and "Category"
>>> new_df = OneHotEncoding(df, columns=["Gender", "Category"])
>>> print(new_df)
... +----+---------------+------------+------------+-----+
... | Id | Gender_Female | Category_B | Category_C | Age |
... +----+---------------+------------+------------+-----+
... | 1  | 0             | 0          | 0          | 23  |
... | 2  | 1             | 1          | 0          | 21  |
... | 3  | 1             | 0          | 0          | 31  |
... | 4  | 0             | 0          | 1          | 22  |
... | 5  | 1             | 0          | 0          | 26  |
... +----+---------------+------------+------------+-----+
>>> # Listing the newly created columns
>>> print(new_df.meta._ohe)
... {'Gender': ['Gender_Female'],
...  'Category': ['Category_B', 'Category_C']}
>>> # Get the aggregated list of encoded columns
>>> print(new_df.meta._ohe_all_columns)
... ['Gender_Female', 'Category_B', 'Category_C']
- Returns
Transformed data with One Hot Encoded variables. New attributes are added to the data frame:
  - df.meta._ohe: maps each encoded column to the columns created from it.
  - df.meta._ohe_all_columns: aggregates the newly created columns in one list. This list can directly be passed or appended to the input columns argument of subsequent models.
- Return type
pandas.DataFrame
- class statinf.data.ProcessData.Scaler(data, columns)[source]
Bases:
object
Data scaler.
- Parameters
  - data (pandas.DataFrame) – Data set to scale.
  - columns (list) – Columns to scale.
- Example
>>> from statinf.data import Scaler, generate_dataset
>>> coeffs = [1.2556, -0.465, 1.665414, 2.5444, -7.56445]
>>> data = generate_dataset(coeffs, n=10, std_dev=2.6)
>>> # Original dataset
>>> print(data)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |    X0     |    X1     |    X2     |    X3     |    X4     |     Y     |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Load scaler class
>>> scaler = Scaler(data=data, columns=['X1', 'X2'])
>>> # Scale our dataset with MinMax method
>>> scaled_df = scaler.MinMax()
>>> print(scaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |    X0     |    X1     |    X2     |    X3     |    X4     |     Y     |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.000000 |  0.000000 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  0.765898 |  0.185088 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 |  0.039781 |  0.552904 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 |  0.000000 |  1.000000 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 |  0.207162 |  0.163399 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 |  0.249221 |  0.616890 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 |  0.351444 |  0.319501 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 |  0.175148 |  0.399244 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 |  0.273307 |  0.268801 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.491445 |  0.030328 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
>>> # Unscale the new dataset to retrieve the previous data scale
>>> unscaled_df = scaler.unscaleMinMax(scaled_df)
>>> print(unscaled_df)
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |    X0     |    X1     |    X2     |    X3     |    X4     |     Y     |
... +-----------+-----------+-----------+-----------+-----------+-----------+
... |  0.977594 |  1.669510 | -1.385569 |  0.696975 | -1.207098 |  8.501692 |
... | -0.953802 |  1.025392 | -0.639291 |  0.658251 |  0.746814 | -7.186085 |
... | -0.148140 | -0.972473 |  0.843746 |  1.306845 |  0.269834 |  1.939924 |
... |  0.499385 | -1.081926 |  2.646441 |  0.910503 |  0.857189 |  0.389257 |
... | -0.563977 | -0.511933 | -0.726744 | -0.630345 | -0.486822 | -0.125787 |
... | -0.434994 | -0.396210 |  1.101739 | -0.660236 | -1.197566 |  7.735832 |
... |  0.032478 | -0.114952 | -0.097337 |  1.794769 |  1.239423 | -5.510332 |
... |  0.085569 | -0.600019 |  0.224186 |  0.301771 |  1.278387 | -8.648084 |
... | -0.028844 | -0.329940 | -0.301762 |  0.946077 | -0.359133 |  5.099971 |
... | -0.665312 |  0.270254 | -1.263288 |  0.545625 |  0.499162 | -6.126528 |
... +-----------+-----------+-----------+-----------+-----------+-----------+
- MinMax(data=None, columns=None, feature_range=(0, 1), col_suffix='')[source]
Min-max scaler. Data are scaled to range between 0 and 1 (or the range given by feature_range).
- Parameters
  - data (pandas.DataFrame, optional) – Data set to scale, defaults to None (takes the data provided in __init__()).
  - columns (list, optional) – Columns to be scaled, defaults to None (takes the list provided in __init__()).
  - feature_range (tuple, optional) – Expected value range of the scaled data, defaults to (0, 1).
  - col_suffix (str, optional) – Suffix to add to column names, defaults to '' (overrides the existing columns).
- Formula
- \[x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)} \cdot (f\_max - f\_min) + f\_min\]
where \((f\_min, f\_max)\) defaults to \((0, 1)\) and corresponds to the expected range of the scaled data, from the feature_range argument.
- Returns
Data set with scaled features.
- Return type
pandas.DataFrame
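The formula above can be reproduced in a few lines of pandas. The sketch below is an illustrative re-implementation, not statinf's actual code; the function name minmax_scale is hypothetical.

```python
import pandas as pd

def minmax_scale(df, columns, feature_range=(0, 1)):
    """Illustrative sketch of the min-max formula; not statinf's implementation."""
    f_min, f_max = feature_range
    out = df.copy()
    for col in columns:
        x = df[col]
        # x_scaled = (x - min(x)) / (max(x) - min(x)) * (f_max - f_min) + f_min
        out[col] = (x - x.min()) / (x.max() - x.min()) * (f_max - f_min) + f_min
    return out

df = pd.DataFrame({"X1": [2.0, 4.0, 6.0], "X2": [1.0, 2.0, 3.0]})
scaled = minmax_scale(df, ["X1"])
print(scaled["X1"].tolist())  # [0.0, 0.5, 1.0] -- X2 is left untouched
```

With feature_range=(-1, 1) the same inputs would map to [-1.0, 0.0, 1.0], matching the \((f\_min, f\_max)\) terms of the formula.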
- Normalize(center=True, reduce=True, data=None, columns=None, col_suffix='')[source]
Data normalizer. Centers and reduces features (subtracts the mean and divides by the standard deviation).
- Parameters
  - center (bool, optional) – Center the variable, i.e. subtract the mean, defaults to True.
  - reduce (bool, optional) – Reduce the variable, i.e. divide by the standard deviation, defaults to True.
  - data (pandas.DataFrame, optional) – Data set to normalize, defaults to None (takes the data provided in __init__()).
  - columns (list, optional) – Columns to be normalized, defaults to None (takes the list provided in __init__()).
  - col_suffix (str, optional) – Suffix to add to column names, defaults to ''.
- Formula
- \[x_{\text{scaled}} = \dfrac{x - \bar{x}}{\sqrt{\mathbb{V}(x)}}\]
- Returns
Data set with normalized features.
- Return type
pandas.DataFrame
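The center/reduce options can be sketched as below. This is an illustrative version, not statinf's code: it uses the population standard deviation (ddof=0), matching \(\sqrt{\mathbb{V}(x)}\) in the formula, while the actual implementation may use the sample version.

```python
import pandas as pd

def normalize(df, columns, center=True, reduce=True):
    """Illustrative centering/reduction sketch; not statinf's implementation."""
    out = df.copy()
    for col in columns:
        x = df[col]
        if center:
            out[col] = out[col] - x.mean()       # subtract the mean
        if reduce:
            out[col] = out[col] / x.std(ddof=0)  # divide by the (population) std
    return out

df = pd.DataFrame({"X1": [1.0, 2.0, 3.0, 4.0]})
z = normalize(df, ["X1"])  # z["X1"] now has mean ~0 and population std ~1
```

With center=False the values are only divided by the standard deviation, and with reduce=False they are only shifted by the mean.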
- unscaleMinMax(data=None, columns=None, columns_mapping={})[source]
Unscale from min-max. Recovers data in the same range as the original features.
- Parameters
  - data (pandas.DataFrame, optional) – Data set to unscale, defaults to None (takes the data provided in __init__()).
  - columns (list, optional) – Columns to be unscaled, defaults to None (takes the list provided in __init__()).
  - columns_mapping (dict, optional) – Mapping between possibly renamed columns and the original scaled columns.
- Formula
- \[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \left(\max(x) - \min(x) \right) + \min(x)\]
- Returns
Unscaled data set.
- Return type
pandas.DataFrame
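The inverse formula is a one-liner once the original minimum and maximum are known (the Scaler stores them at construction). A minimal sketch, with a hypothetical function name and assuming the default feature_range of (0, 1):

```python
def unscale_minmax(x_scaled, orig_min, orig_max):
    """x_unscaled = x_scaled * (max(x) - min(x)) + min(x); illustrative only."""
    return x_scaled * (orig_max - orig_min) + orig_min

# Round trip: scaling then unscaling recovers the original value.
orig_min, orig_max = 2.0, 6.0
x_scaled = (4.0 - orig_min) / (orig_max - orig_min)  # 0.5
print(unscale_minmax(x_scaled, orig_min, orig_max))  # 4.0
```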
- unscaleNormalize(data=None, columns=None, columns_mapping={})[source]
Denormalize data to retrieve the same range as the original data set.
- Parameters
  - data (pandas.DataFrame, optional) – Data set to unscale, defaults to None (takes the data provided in __init__()).
  - columns (list, optional) – Columns to be unscaled, defaults to None (takes the list provided in __init__()).
  - columns_mapping (dict, optional) – Mapping between possibly renamed columns and the original scaled columns.
- Formula
- \[x_{\text{unscaled}} = x_{\text{scaled}} \cdot \sqrt{\mathbb{V}(x)} + \bar{x}\]
- Returns
De-normalized data set.
- Return type
pandas.DataFrame
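Similarly, denormalizing only needs the stored mean and standard deviation. A minimal sketch with a hypothetical function name, directly mirroring the formula above:

```python
def unscale_normalize(x_scaled, mean, std):
    """x_unscaled = x_scaled * sqrt(V(x)) + mean(x); illustrative only."""
    return x_scaled * std + mean

# A value one standard deviation above the mean maps back accordingly.
print(unscale_normalize(1.0, 2.5, 1.5))  # 4.0
```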
- statinf.data.ProcessData.create_dataset(data, n_in=1, n_out=1, dropnan=True)[source]
Function to convert a DataFrame into a multivariate time-series format readable by a Keras LSTM.
- Parameters
  - data (pandas.DataFrame) – DataFrame on which to apply the transformation.
  - n_in (int, optional) – Input dimension, also known as look-back or window size, defaults to 1.
  - n_out (int, optional) – Output dimension, defaults to 1.
  - dropnan (bool, optional) – Remove empty values in the series, defaults to True.
- Returns
Features converted for Keras LSTM.
- Return type
pandas.DataFrame
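The look-back/look-ahead windowing that create_dataset performs can be sketched with pandas shift(). This is a hypothetical re-implementation to illustrate the idea; the column naming and exact output of statinf's function may differ.

```python
import pandas as pd

def to_supervised(df, n_in=1, n_out=1, dropnan=True):
    """Illustrative sketch of shift-based windowing; not statinf's implementation."""
    cols, names = [], []
    # Lagged inputs: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f"{c}(t-{i})" for c in df.columns]
    # Outputs: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        names += [f"{c}(t+{i})" if i > 0 else f"{c}(t)" for c in df.columns]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
windowed = to_supervised(df, n_in=2)
print(list(windowed.columns))  # ['a(t-2)', 'a(t-1)', 'a(t)']
```

With n_in=2 the first two rows contain NaN lags and are dropped, leaving three usable samples.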
- statinf.data.ProcessData.multivariate_time_series(data)[source]
Convert a DataFrame into a numpy multivariate time-series array.
- Parameters
  - data (pandas.DataFrame) – Input data to transform.
- Example
>>> from statinf.data import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)
- Returns
Transformed multivariate time series data.
- Return type
numpy.ndarray
- statinf.data.ProcessData.parse_formula(formula, data, check_values=True, return_all=False)[source]
This function is used in regression models to apply transformations to the data from a str formula. See below for examples.
- Parameters
  - formula (str) – Regression formula to be run, of the form y ~ x1 + x2. Accepted functions are:
    - \(\log(x)\): log(X)
    - \(\exp(x)\): exp(X)
    - \(\sqrt{x}\): sqrt(X)
    - \(\cos(x)\): cos(X)
    - \(\sin(x)\): sin(X)
    - \(x^{z}\): X ** Z
    - \(\dfrac{x}{z}\): X/Z
    - \(x \times z\): X*Z
  - data (pandas.DataFrame) – Data on which to perform the transformations.
  - check_values (bool, optional) – For each transformation, check whether the data range satisfies the domain of definition of the function, defaults to True.
  - return_all (bool, optional) – Return the transformed data, column Y and columns X, defaults to False.
- Example
>>> from statinf.data import parse_formula
>>> print(input_df)
... +-----------+-----------+-----------+
... |    X1     |    X2     |     Y     |
... +-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |
... |  1.155661 |  0.391129 | -7.780989 |
... | -0.299251 | -0.445602 | -8.146673 |
... | -0.978311 |  1.312146 |  8.653818 |
... | -0.225917 |  0.522016 | -9.684332 |
... +-----------+-----------+-----------+
>>> form = 'Y ~ X1 + X2 + exp(X2) + X1*X2'
>>> new_df = parse_formula(form, data=input_df)
>>> print(new_df)
... +-----------+-----------+-----------+-----------+-----------+
... |    X1     |    X2     |     Y     |  exp(X2)  |   X1*X2   |
... +-----------+-----------+-----------+-----------+-----------+
... |  0.555096 |  0.681083 | -1.383428 |  1.976017 |  0.378066 |
... |  1.155661 |  0.391129 | -7.780989 |  1.478649 |  0.452012 |
... | -0.299251 | -0.445602 | -8.146673 |  0.640438 |  0.133347 |
... | -0.978311 |  1.312146 |  8.653818 |  3.714134 | -1.283687 |
... | -0.225917 |  0.522016 | -9.684332 |  1.685422 | -0.117932 |
... +-----------+-----------+-----------+-----------+-----------+
- Raises
ValueError – Raised when the data does not satisfy the domain of definition of the required transformation.
- Returns
Transformed data set.
- Return type
pandas.DataFrame
- statinf.data.ProcessData.rankdata(x)[source]
Assigns ranks to data. This is mainly used for analyses like Spearman's correlation.
- Parameters
  - x (numpy.array) – Input vector. Format can be numpy.array, list or pandas.Series.
- Example
>>> rankdata([2., 5.44, 3.93, 3.3, 1.1])
... array([1, 4, 3, 2, 0])
- Returns
Vector with ranked values.
- Return type
numpy.array
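The example output (0-based ranks, no ties) can be reproduced with a double numpy argsort. This is an illustrative equivalent, not statinf's actual implementation, and ties would need explicit handling.

```python
import numpy as np

def rank_values(x):
    """Double argsort assigns 0-based ranks; illustrative equivalent of rankdata."""
    x = np.asarray(x)
    # First argsort gives the sorted order; the second gives each value's rank.
    return np.argsort(np.argsort(x))

print(rank_values([2., 5.44, 3.93, 3.3, 1.1]))  # [1 4 3 2 0]
```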
- statinf.data.ProcessData.split_sequences(data, look_back=1)[source]
Split a multivariate time series from statinf.data.ProcessData.multivariate_time_series() into a Keras-friendly format for LSTM.
- Parameters
  - data (numpy.ndarray) – Data in the format of sequences to transform.
  - look_back (int, optional) – Size of the trailing window (number of time steps to consider), defaults to 1.
- Example
>>> from statinf.data import multivariate_time_series, split_sequences
>>> train_to_split = multivariate_time_series(train)
>>> X, y = split_sequences(train_to_split, look_back=7)
- Returns
  - x: Input data converted for Keras LSTM.
  - y: Target series converted for Keras LSTM.
- Return type
numpy.ndarray, numpy.ndarray
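The returned pair can be illustrated with a small sliding-window sketch. It assumes, purely for illustration, that the last column of the array holds the target series; statinf's actual layout and windowing convention may differ, so treat the function below (split_seq) as hypothetical.

```python
import numpy as np

def split_seq(data, look_back=1):
    """Illustrative sliding-window split; assumes the target is the last column."""
    X, y = [], []
    for i in range(len(data) - look_back + 1):
        X.append(data[i:i + look_back, :-1])   # window of feature columns
        y.append(data[i + look_back - 1, -1])  # target at the window's end
    return np.array(X), np.array(y)

data = np.arange(30, dtype=float).reshape(10, 3)  # 10 time steps, 3 series
X, y = split_seq(data, look_back=4)
print(X.shape, y.shape)  # (7, 4, 2) (7,)
```

The shape (samples, look_back, features) is what a Keras LSTM layer expects as input.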