Linear regression

OLS

class statinf.regressions.LinearModels.LinearBayes[source]

Bases: object

Class for a Bayesian linear regression with known standard deviation of the residual distribution.

Warning

This class is still under development. This is a beta version; please be aware that some functionalities might not be available yet. The full stable version will be released soon.

Parameters
  • w_0 (numpy.array) – Mean of the prior distribution of the weights, assumed to be Gaussian, default is 0.

  • V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, default is the identity matrix.

  • w_n (numpy.array) – Mean of the posterior distribution of the weights after observing the data.

  • V_n (numpy.array) – Covariance matrix of the posterior distribution of the weights after observing the data.

References
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Source

Inspired by: https://jessicastringham.net/2018/01/03/bayesian-linreg/

fit(X, y, true_sigma=None, w_0=None, V_0=None)[source]

Fits a linear regression model to the data and computes the posterior distribution of the weights, assuming the standard deviation of the residuals is known (see the sketch below the parameter list). The case where the standard deviation is unknown will be added in a future version.

Parameters
  • X (numpy.array) – Input data.

  • y (numpy.array) – Data values.

  • true_sigma (float) – Standard deviation of the residual distribution.

  • w_0 (numpy.array) – Mean of the prior distribution of the weights, assumed to be Gaussian, default is 0.

  • V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, default is the identity matrix.
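
For reference, the conjugate posterior update that this setting admits (Murphy, 2012, §7.6) can be written in a few lines of NumPy. The sketch below is a standalone illustration assuming a Gaussian prior and a known residual standard deviation; it is not the class internals.

import numpy as np

def posterior_update(X, y, sigma, w_0, V_0):
    # Posterior covariance: (V_0^{-1} + X'X / sigma^2)^{-1}
    V_n = np.linalg.inv(np.linalg.inv(V_0) + (X.T @ X) / sigma**2)
    # Posterior mean: V_n (V_0^{-1} w_0 + X'y / sigma^2)
    w_n = V_n @ (np.linalg.inv(V_0) @ w_0 + (X.T @ y) / sigma**2)
    return w_n, V_n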

plot_posterior_line(X, y, n_lines=200, res=100, xlim=(-1, 10))[source]

Plots regression lines whose weights are sampled from the posterior distribution of the weights.

Parameters
  • X (numpy.array) – Input data.

  • y (numpy.array) – Data values.

  • n_lines (int) – Number of lines sampled from the posterior distribution, default is 200.

  • res (int) – Resolution of the grid, default is 100.

  • xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-1,10).
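
Under the hood this amounts to drawing weight vectors from the posterior and plotting one line per draw. A hypothetical sketch, assuming a two-weight (intercept, slope) model as in the blog post referenced above; w_n and V_n stand in for the fitted posterior parameters:

import numpy as np

w_n, V_n = np.array([1.0, 2.0]), np.eye(2) * 0.01        # illustrative posterior
ws = np.random.multivariate_normal(w_n, V_n, size=200)   # n_lines samples
grid = np.linspace(-1, 10, 100)                          # res points on xlim
lines = ws[:, 0:1] + ws[:, 1:2] * grid                   # one line per sample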

plot_weight_distributions(res=100, xlim=(-8, 8), ylim=(-8, 8))[source]

Plots the prior and posterior distributions of the weights.

Parameters
  • res (int) – Resolution of the grid, default is 100.

  • xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-8,8).

  • ylim (tuple) – Tuple of the min and max values of the grid along the y axis, default is (-8,8).

class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=False)[source]

Bases: object

Ordinary Least Squares regression.

Parameters
  • formula (str) – Regression formula to be run, of the form y ~ x1 + x2. See parse_formula() in ProcessData.

  • data (pandas.DataFrame) – Input data with Pandas format.

  • fit_intercept (bool, optional) – If True, adds an intercept to the regression formula, defaults to False.

adjusted_r_squared()[source]

Adjusted-\(R^{2}\) – Goodness of fit

Formula
\[R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

References

Theil, Henri (1961). Economic Forecasts and Policy.

Returns

Adjusted goodness of fit.

Return type

float
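
As a quick sanity check, the formula can be evaluated with the figures from the OLS summary example at the bottom of this page (plain Python, not part of the API):

r2, n, p = 0.9846, 999, 7
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# 1 - 0.0154 * 998 / 991 ≈ 0.98449, matching the R² Adj. shown in the summary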

fitted_values()[source]

Computes the estimated values of Y

Formula
\[\hat{Y} = X \hat{\beta}\]
Returns

Fitted values for Y.

Return type

numpy.array

get_betas()[source]

Computes the estimates for each explanatory variable

Formula
\[\hat{\beta} = (X'X)^{-1} X'Y\]
Returns

Estimated coefficients

Return type

numpy.array
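
The formula corresponds to the normal equations, illustrated below in NumPy (an illustration of the math, not the class internals; solving the linear system is numerically preferable to forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y
y_hat = X @ beta_hat                           # corresponds to fitted_values()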

predict(new_data, conf_level=None)[source]

Predicted \(\hat{Y}\) values for a new dataset

Parameters
  • new_data (pandas.DataFrame) – New data to evaluate with pandas data-frame format.

  • conf_level (float) – Level of the confidence interval, defaults to None.

Formulae
\[\hat{Y} = X \hat{\beta}\]

The confidence interval is computed as:

\[\left[ \hat{Y} \pm z_{1 - \frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n - 1}} \right]\]
Returns

Predictions \(\hat{Y}\)

Return type

numpy.array
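
The interval formula can be evaluated numerically as follows; the prediction, residual standard deviation and sample size below are illustrative assumptions, not outputs of the library:

import numpy as np
from scipy.stats import norm

y_hat, sigma, n = 4.0, 2.6, 1000        # illustrative values
alpha = 0.05                            # conf_level = 0.95
z = norm.ppf(1 - alpha / 2)             # ≈ 1.96
half_width = z * sigma / np.sqrt(n - 1)
lower, upper = y_hat - half_width, y_hat + half_width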

r_squared()[source]

\(R^{2}\) – Goodness of fit

Formula
\[R^{2} = 1 - \dfrac{RSS}{TSS}\]
Returns

Goodness of fit.

Return type

float

rss()[source]

Residual Sum of Squares

Formula
\[RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\hat{y}_{i}\) denotes the predicted value of \(y\) for individual \(i\).

Returns

Residual Sum of Squares.

Return type

float

summary(return_df=False)[source]

Statistical summary for OLS

Parameters

return_df (bool) – Return the summary as a Pandas DataFrame, else returns a string, defaults to False.

Formulae
  • Fisher test:

\[\mathcal{F} = \dfrac{TSS - RSS}{\frac{RSS}{n - p}}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

  • Covariance matrix:

\[\mathbb{V}(\hat{\beta}) = \sigma^{2} (X'X)^{-1}\]

where \(\sigma^{2} = \frac{RSS}{n - p - 1}\)

  • Coefficients’ significance:

\[p = 2 \left( 1 - T_{n} \left( \left| \dfrac{\hat{\beta}}{\sqrt{\mathbb{V}(\hat{\beta})}} \right| \right) \right)\]

where \(T_{n}\) denotes the Student cumulative distribution function (c.d.f.) with \(n\) degrees of freedom

Returns

Model’s summary.

Return type

pandas.DataFrame or str
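
The significance computation can be reproduced for a single coefficient with SciPy. A sketch using the X0 row of the example summary further down; the figures are taken from that table, and df = n is an assumption matching the formula as printed:

from scipy.stats import t

beta, se, n = 1.2898, 0.03218, 999
t_value = beta / se                            # ≈ 40.08
p_value = 2 * (1 - t.cdf(abs(t_value), df=n))  # ≈ 0.0, hence '***'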

tss()[source]

Total Sum of Squares

Formula
\[TSS = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\bar{y}\) denotes the average value of \(y\).

Returns

Total Sum of Squares.

Return type

float
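
Both sums of squares, and the \(R^{2}\) built from them, follow directly from their definitions (a plain NumPy illustration, reusing y and y_hat from the get_betas() sketch above):

import numpy as np

# y and y_hat as computed in the get_betas() sketch above
rss = np.sum((y - y_hat) ** 2)        # cf. rss()
tss = np.sum((y - y.mean()) ** 2)     # cf. tss()
r2 = 1 - rss / tss                    # cf. r_squared()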

Examples

OLS

import statinf.data as gd
from statinf.regressions import OLS

# Generate a synthetic dataset
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# We set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4 + X1*X2 + exp(X2)"

# We fit the OLS with the data, the formula and without intercept
ols = OLS(formula, data, fit_intercept=False)

ols.summary()

Output will be:

==================================================================================
|                                  OLS summary                                   |
==================================================================================
| R²             =             0.9846 | R² Adj.      =                   0.98449 |
| n              =                999 | p            =                         7 |
| Fisher value   =           10568.56 |                                          |
==================================================================================
| Variables         | Coefficients   | Std. Errors  | t-values   | Probabilities |
==================================================================================
| X0                |         1.2898 |      0.03218 |     40.085 |     0.0   *** |
| X1                |       -0.50096 |      0.03187 |    -15.718 |     0.0   *** |
| X2                |        1.62202 |      0.04264 |     38.039 |     0.0   *** |
| X3                |        2.56471 |      0.03196 |     80.252 |     0.0   *** |
| X4                |       -7.58065 |      0.03226 |   -234.983 |     0.0   *** |
| X1*X2             |       -0.03968 |      0.03438 |     -1.154 |   0.249       |
| exp(X2)           |        0.00301 |      0.01692 |      0.178 |   0.859       |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================

You can also predict new values with their confidence interval

# Generate a new synthetic dataset
test_data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# Predict with 95% confidence interval
ols.predict(test_data, conf_level=.95)

Output will be:

    Prediction  LowerBound  UpperBound
0    -19.252926  -19.265841  -19.240012
1      4.988078    4.975164    5.000993
2     10.824623   10.811708   10.837537
3     -2.725563   -2.738477   -2.712649
4      4.057040    4.044125    4.069954

LinearBayes

from statinf.regressions import LinearBayes
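
A minimal end-to-end sketch follows; the no-argument constructor and the array shapes are assumptions based on the signatures documented above:

import numpy as np
from statinf.regressions import LinearBayes

# Synthetic data: intercept 1.0, slope 2.0, known residual std 0.5
rng = np.random.default_rng(0)
x = rng.uniform(-1, 10, size=100)
X = np.column_stack([np.ones(100), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)

lb = LinearBayes()
lb.fit(X, y, true_sigma=0.5)

# Compare prior and posterior weight distributions, then plot sampled lines
lb.plot_weight_distributions()
lb.plot_posterior_line(X, y)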