Linear regression

OLS

class statinf.regressions.LinearModels.LinearBayes[source]

Bases: object

Class for a Bayesian linear regression with known standard deviation of the residual distribution.

Warning

This class is still under development. This is a beta version; please be aware that some functionalities might not be available yet. The full stable version will be released soon.

Parameters
  • w_0 (numpy.array) – Mean of the prior distribution of the weights, assumed to be Gaussian, default is 0.

  • V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, default is the identity matrix.

  • w_n (numpy.array) – Mean of the posterior distribution of the weights after observing the data.

  • V_n (numpy.array) – Covariance matrix of the posterior distribution of the weights after observing the data.

References
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Source

Inspired by: https://jessicastringham.net/2018/01/03/bayesian-linreg/

fit(X, y, true_sigma=None, w_0=None, V_0=None)[source]

Fits a linear regression model to the data and computes the posterior distribution of the weights, assuming the standard deviation of the residuals is known (see the sketch below the parameter list). The case where the standard deviation is unknown will be added in a future version.

Parameters
  • X (numpy.array) – Input data.

  • y (numpy.array) – Data values.

  • true_sigma (float) – Standard deviation of the residual distribution.

  • w_0 (numpy.array) – Mean of the prior distribution of the weights, assumed to be Gaussian, default is 0.

  • V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, default is the identity matrix.
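
For reference, the conjugate posterior update that this setting admits (Murphy, 2012, §7.6) can be written in a few lines of NumPy. The sketch below is a standalone illustration assuming a Gaussian prior and a known residual standard deviation; it is not the class internals.

import numpy as np

def posterior_update(X, y, sigma, w_0, V_0):
    # Posterior covariance: (V_0^{-1} + X'X / sigma^2)^{-1}
    V_n = np.linalg.inv(np.linalg.inv(V_0) + (X.T @ X) / sigma**2)
    # Posterior mean: V_n (V_0^{-1} w_0 + X'y / sigma^2)
    w_n = V_n @ (np.linalg.inv(V_0) @ w_0 + (X.T @ y) / sigma**2)
    return w_n, V_n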

plot_posterior_line(X, y, n_lines=200, res=100, xlim=(-1, 10))[source]

Plots regression lines whose weights are sampled from the posterior distribution of the weights.

Parameters
  • X (numpy.array) – Input data.

  • y (numpy.array) – Data values.

  • n_lines (int) – Number of lines sampled from the posterior distribution, default is 200.

  • res (int) – Resolution of the grid, default is 100.

  • xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-1,10).
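
Under the hood this amounts to drawing weight vectors from the posterior and plotting one line per draw. A hypothetical sketch, assuming a two-weight (intercept, slope) model as in the blog post referenced above; w_n and V_n stand in for the fitted posterior parameters:

import numpy as np

w_n, V_n = np.array([1.0, 2.0]), np.eye(2) * 0.01        # illustrative posterior
ws = np.random.multivariate_normal(w_n, V_n, size=200)   # n_lines samples
grid = np.linspace(-1, 10, 100)                          # res points on xlim
lines = ws[:, 0:1] + ws[:, 1:2] * grid                   # one line per sample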

plot_weight_distributions(res=100, xlim=(-8, 8), ylim=(-8, 8))[source]

Plots the prior and posterior distributions of the weights.

Parameters
  • res (int) – Resolution of the grid, default is 100.

  • xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-8,8).

  • ylim (tuple) – Tuple of the min and max values of the grid along the y axis, default is (-8,8).

class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=False)[source]

Bases: object

Ordinary Least Squares regression.

Parameters
  • formula (str) – Regression formula to be run, of the form y ~ x1 + x2. See parse_formula() in ProcessData.

  • data (pandas.DataFrame) – Input data with Pandas format.

  • fit_intercept (bool, optional) – If True, adds an intercept to the regression formula, defaults to False.

adjusted_r_squared()[source]

Adjusted-\(R^{2}\) – Goodness of fit

Formula
\[R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

References

Theil, Henri (1961). Economic Forecasts and Policy.

Returns

Adjusted goodness of fit.

Return type

float
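
As a quick sanity check, the formula can be evaluated with the figures from the OLS summary example at the bottom of this page (plain Python, not part of the API):

r2, n, p = 0.9846, 999, 7
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# 1 - 0.0154 * 998 / 991 ≈ 0.98449, matching the R² Adj. shown in the summary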

fitted_values()[source]

Computes the estimated values of Y

Formula
\[\hat{Y} = X \hat{\beta}\]
Returns

Fitted values for Y.

Return type

numpy.array

get_betas()[source]

Computes the estimates for each explanatory variable

Formula
\[\hat{\beta} = (X'X)^{-1} X'Y\]
Returns

Estimated coefficients

Return type

numpy.array
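
The formula corresponds to the normal equations, illustrated below in NumPy (an illustration of the math, not the class internals; solving the linear system is numerically preferable to forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y
y_hat = X @ beta_hat                           # corresponds to fitted_values()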

predict(new_data, conf_level=None)[source]

Predicted \(\hat{Y}\) values for a new dataset

Parameters
  • new_data (pandas.DataFrame) – New data to evaluate with pandas data-frame format.

  • conf_level (float) – Level of the confidence interval, defaults to None.

Formulae
\[\hat{Y} = X \hat{\beta}\]

The confidence interval is computed as:

\[\left[ \hat{Y} \pm z_{1 - \frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n - 1}} \right]\]
Returns

Predictions \(\hat{Y}\)

Return type

numpy.array
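
The interval formula can be evaluated numerically as follows; the prediction, residual standard deviation and sample size below are illustrative assumptions, not outputs of the library:

import numpy as np
from scipy.stats import norm

y_hat, sigma, n = 4.0, 2.6, 1000        # illustrative values
alpha = 0.05                            # conf_level = 0.95
z = norm.ppf(1 - alpha / 2)             # ≈ 1.96
half_width = z * sigma / np.sqrt(n - 1)
lower, upper = y_hat - half_width, y_hat + half_width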

r_squared()[source]

\(R^{2}\) – Goodness of fit

Formula
\[R^{2} = 1 - \dfrac{RSS}{TSS}\]
Returns

Goodness of fit.

Return type

float

rss()[source]

Residual Sum of Squares

Formula
\[RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\hat{y}_{i}\) denotes the predicted value of \(y\) for individual \(i\).

Returns

Residual Sum of Squares.

Return type

float

summary(return_df=False)[source]

Statistical summary for OLS

Parameters

return_df (bool) – Return the summary as a Pandas DataFrame, else returns a string, defaults to False.

Formulae
  • Fisher test:

\[\mathcal{F} = \dfrac{TSS - RSS}{\frac{RSS}{n - p}}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

  • Covariance matrix:

\[\mathbb{V}(\hat{\beta}) = \sigma^{2} (X'X)^{-1}\]

where \(\sigma^{2} = \frac{RSS}{n - p - 1}\)

  • Coefficients’ significance:

\[p = 2 \left( 1 - T_{n} \left( \left| \dfrac{\hat{\beta}}{\sqrt{\mathbb{V}(\hat{\beta})}} \right| \right) \right)\]

where \(T_{n}\) denotes the Student cumulative distribution function (c.d.f.) with \(n\) degrees of freedom

Returns

Model’s summary.

Return type

pandas.DataFrame or str
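
The significance computation can be reproduced for a single coefficient with SciPy. A sketch using the X0 row of the example summary further down; the figures are taken from that table, and df = n is an assumption matching the formula as printed:

from scipy.stats import t

beta, se, n = 1.2898, 0.03218, 999
t_value = beta / se                            # ≈ 40.08
p_value = 2 * (1 - t.cdf(abs(t_value), df=n))  # ≈ 0.0, hence '***'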

tss()[source]

Total Sum of Squares

Formula
\[TSS = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\bar{y}\) denotes the average value of \(y\).

Returns

Total Sum of Squares.

Return type

float
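
Both sums of squares, and the \(R^{2}\) built from them, follow directly from their definitions (a plain NumPy illustration, reusing y and y_hat from the get_betas() sketch above):

import numpy as np

# y and y_hat as computed in the get_betas() sketch above
rss = np.sum((y - y_hat) ** 2)        # cf. rss()
tss = np.sum((y - y.mean()) ** 2)     # cf. tss()
r2 = 1 - rss / tss                    # cf. r_squared()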

Examples

OLS

import statinf.data as gd
from statinf.regressions import OLS

# Generate a synthetic dataset
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# We set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4 + X1*X2 + exp(X2)"

# We fit the OLS with the data, the formula and without intercept
ols = OLS(formula, data, fit_intercept=False)

ols.summary()

Output will be:

==================================================================================
|                                  OLS summary                                   |
==================================================================================
| R²             =             0.9846 | R² Adj.      =                   0.98449 |
| n              =                999 | p            =                         7 |
| Fisher value   =           10568.56 |                                          |
==================================================================================
| Variables         | Coefficients   | Std. Errors  | t-values   | Probabilities |
==================================================================================
| X0                |         1.2898 |      0.03218 |     40.085 |     0.0   *** |
| X1                |       -0.50096 |      0.03187 |    -15.718 |     0.0   *** |
| X2                |        1.62202 |      0.04264 |     38.039 |     0.0   *** |
| X3                |        2.56471 |      0.03196 |     80.252 |     0.0   *** |
| X4                |       -7.58065 |      0.03226 |   -234.983 |     0.0   *** |
| X1*X2             |       -0.03968 |      0.03438 |     -1.154 |   0.249       |
| exp(X2)           |        0.00301 |      0.01692 |      0.178 |   0.859       |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================

You can also predict new values with their confidence interval

# Generate a new synthetic dataset
test_data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# Predict with 95% confidence interval
ols.predict(test_data, conf_level=.95)

Output will be:

    Prediction  LowerBound  UpperBound
0    -19.252926  -19.265841  -19.240012
1      4.988078    4.975164    5.000993
2     10.824623   10.811708   10.837537
3     -2.725563   -2.738477   -2.712649
4      4.057040    4.044125    4.069954

LinearBayes

from statinf.regressions import LinearBayes
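
A minimal end-to-end sketch follows; the no-argument constructor and the array shapes are assumptions based on the signatures documented above:

import numpy as np
from statinf.regressions import LinearBayes

# Synthetic data: intercept 1.0, slope 2.0, known residual std 0.5
rng = np.random.default_rng(0)
x = rng.uniform(-1, 10, size=100)
X = np.column_stack([np.ones(100), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)

lb = LinearBayes()
lb.fit(X, y, true_sigma=0.5)

# Compare prior and posterior weight distributions, then plot sampled lines
lb.plot_weight_distributions()
lb.plot_posterior_line(X, y)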