# Linear regression

## LinearBayes

class statinf.regressions.LinearModels.LinearBayes[source]

Bases: object

Class for Bayesian linear regression with a known standard deviation of the residual distribution.

Warning

This function is still under development. This is a beta version; please be aware that some functionalities might not be available. The full stable version will be released soon.

Parameters
• w_0 (numpy.array) – mean of the prior distribution of the weights, assuming it is Gaussian, default is 0.

• V_0 (numpy.array) – covariance matrix of the prior distribution of the weights, default is the identity matrix.

• w_n (numpy.array) – mean of the posterior distribution of the weights after observing the data.

• V_n (numpy.array) – covariance matrix of the posterior distribution of the weights after observing the data.

References
• Murphy, K. P. (2012). Machine learning: a probabilistic perspective.

fit(X, y, true_sigma=None, w_0=None, V_0=None)[source]

Fits a linear regression model to the data and finds the posterior distribution of the weights, assuming the standard deviation of the residuals is known. The case of unknown standard deviation will be added in a future implementation.

Parameters
• X (numpy.array) – Input data.

• y (numpy.array) – Data values.

• true_sigma (float) – Standard deviation of the residual distribution.

• w_0 (numpy.array) – Mean of the prior distribution of the weights, assuming it is Gaussian, default is 0.

• V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, default is the identity matrix.
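The math behind this update is the standard conjugate Gaussian posterior with known noise scale (Murphy, 2012). As an illustrative sketch in plain NumPy, bypassing the class entirely (the data and sigma below are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data (not from statinf): y = 1 + 2x plus noise with known sigma
true_sigma = 1.5
X = np.column_stack([np.ones(200), rng.uniform(-1, 10, 200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, true_sigma, 200)

# Documented defaults: w_0 = 0, V_0 = identity
w_0 = np.zeros(2)
V_0 = np.eye(2)

# Conjugate Gaussian posterior with known sigma (Murphy, 2012):
#   V_n = (V_0^{-1} + sigma^{-2} X'X)^{-1}
#   w_n = V_n (V_0^{-1} w_0 + sigma^{-2} X'y)
V_0_inv = np.linalg.inv(V_0)
V_n = np.linalg.inv(V_0_inv + (X.T @ X) / true_sigma**2)
w_n = V_n @ (V_0_inv @ w_0 + (X.T @ y) / true_sigma**2)
```

With enough data the posterior mean `w_n` concentrates near the ordinary least-squares solution, and `V_n` shrinks toward zero.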

plot_posterior_line(X, y, n_lines=200, res=100, xlim=(-1, 10))[source]

Plots the model’s distribution sampled from the posterior distribution of the weights.

Parameters
• X (numpy.array) – Input data.

• y (numpy.array) – Data values.

• n_lines (int) – Number of lines sampled from the posterior distribution, default is 200.

• res (int) – Resolution of the grid, default is 100.

• xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-1,10).

plot_weight_distributions(res=100, xlim=(-8, 8), ylim=(-8, 8))[source]

Plots the weight distribution for the prior and the posterior probabilities.

Parameters
• res (int) – Resolution of the grid, default is 100.

• xlim (tuple) – Tuple of the min and max values of the grid along the x axis, default is (-8,8).

• ylim (tuple) – Tuple of the min and max values of the grid along the y axis, default is (-8,8).

## OLS

class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=False)[source]

Bases: object

Ordinary Least Squares regression.

Parameters
• formula (str) – Regression formula to be run, of the form y ~ x1 + x2. See parse_formula() in ProcessData.

• data (pandas.DataFrame) – Input data with Pandas format.

• fit_intercept (bool, optional) – Used for adding intercept in the regression formula, defaults to False.

adjusted_r_squared()[source]

Adjusted-$$R^{2}$$ – Goodness of fit

Formula
$R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}$

where $$p$$ denotes the number of estimates (i.e. explanatory variables) and $$n$$ the sample size

References

Theil, Henri (1961). Economic Forecasts and Policy.

Returns

Adjusted goodness of fit.

Return type

float

fitted_values()[source]

Computes the estimated values of Y

Formula
$\hat{Y} = X \beta$
Returns

Fitted values for Y.

Return type

numpy.array

get_betas()[source]

Computes the estimates for each explanatory variable

Formula
$\beta = (X'X)^{-1} X'Y$
Returns

Estimated coefficients

Return type

numpy.array
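The closed-form estimator above can be sketched in a few lines of NumPy; solving the linear system $(X'X)\beta = X'Y$ with `np.linalg.solve` avoids forming the explicit inverse (the data below is illustrative, not statinf output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative design matrix and response
X = rng.normal(size=(500, 3))
beta_true = np.array([1.2556, -0.465, 1.665414])
y = X @ beta_true + rng.normal(0.0, 0.1, 500)

# beta = (X'X)^{-1} X'Y, solved as the linear system (X'X) beta = X'Y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```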

predict(new_data, conf_level=None)[source]

Predicted $$\hat{Y}$$ values for a new dataset

Parameters
• new_data (pandas.DataFrame) – New data to evaluate with pandas data-frame format.

• conf_level (float) – Level of the confidence interval, defaults to None.

Formulae
$\hat{Y} = X \hat{\beta}$

The confidence interval is computed as:

$\left[ \hat{Y} \pm z_{1 - \frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n - 1}} \right]$
Returns

Predictions $$\hat{Y}$$

Return type

numpy.array
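The interval formula above can be sketched with the standard library's `statistics.NormalDist` for the Gaussian quantile $z_{1 - \frac{\alpha}{2}}$; the predictions and sigma below are hypothetical placeholders, not statinf output:

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(y_hat, sigma, n, conf_level=0.95):
    """Interval y_hat +/- z_{1 - alpha/2} * sigma / sqrt(n - 1)."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    half_width = z * sigma / np.sqrt(n - 1)
    return y_hat - half_width, y_hat + half_width

# Hypothetical predictions and residual sigma
y_hat = np.array([-19.25, 4.99, 10.82])
lower, upper = prediction_interval(y_hat, sigma=0.2, n=1000)
```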

r_squared()[source]

$$R^{2}$$ – Goodness of fit

Formula
$R^{2} = 1 - \dfrac{RSS}{TSS}$
Returns

Goodness of fit.

Return type

float

rss()[source]

Residual Sum of Squares

Formula
$RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$

where $$y_{i}$$ denotes the true/observed value of $$y$$ for individual $$i$$ and $$\hat{y}_{i}$$ denotes the predicted value of $$y$$ for individual $$i$$.

Returns

Residual Sum of Squares.

Return type

float

summary(return_df=False)[source]

Statistical summary for OLS

Parameters

return_df (bool) – Return the summary as a Pandas DataFrame, else returns a string, defaults to False.

Formulae
• Fisher test:

$\mathcal{F} = \dfrac{(TSS - RSS) / p}{RSS / (n - p - 1)}$

where $$p$$ denotes the number of estimates (i.e. explanatory variables) and $$n$$ the sample size

• Covariance matrix:

$\mathbb{V}(\beta) = \sigma^{2} (X'X)^{-1}$

where $$\sigma^{2} = \frac{RSS}{n - p -1}$$

• Coefficients’ significance:

$p = 2 \left( 1 - T_{n} \left( \dfrac{\beta}{\sqrt{\mathbb{V}(\beta)}} \right) \right)$

where $$T$$ denotes the Student cumulative distribution function (c.d.f.) with $$n$$ degrees of freedom
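As a sketch of the significance computation, the Gaussian c.d.f. from the standard library can stand in for the Student c.d.f. (a large-sample approximation; an exact Student version would need something like `scipy.stats.t`). The t-value below is taken from the X1*X2 row of the example output further down:

```python
from statistics import NormalDist

def p_value(t_stat):
    # Two-sided p = 2 * (1 - Phi(|t|)): the Gaussian c.d.f. Phi stands in
    # for the Student c.d.f., a good approximation for large n
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

# A coefficient with t-value -1.154
p = p_value(-1.154)  # about 0.249
```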

Returns

Model’s summary.

Return type

pandas.DataFrame or str

tss()[source]

Total Sum of Squares

Formula
$TSS = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}$

where $$y_{i}$$ denotes the true/observed value of $$y$$ for individual $$i$$ and $$\bar{y}$$ denotes the average value of $$y$$.

Returns

Total Sum of Squares.

Return type

float
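Taken together, `rss`, `tss`, `r_squared` and `adjusted_r_squared` reduce to a few lines of NumPy. A minimal sketch on illustrative data (independent of statinf):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 4

# Illustrative data with a known linear signal plus noise
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(0.0, 2.6, n)

# OLS fit via the normal equations, then the goodness-of-fit quantities
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta

rss = np.sum((y - y_hat) ** 2)            # Residual Sum of Squares
tss = np.sum((y - y.mean()) ** 2)         # Total Sum of Squares
r2 = 1.0 - rss / tss                      # R-squared
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared
```

Note that the adjusted value is always below $$R^{2}$$, since the factor $$(n-1)/(n-p-1)$$ exceeds 1 whenever $$p \geq 1$$.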

## Examples

### OLS

import statinf.data as gd
from statinf.regressions import OLS

# Generate a synthetic dataset
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# We set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4 + X1*X2 + exp(X2)"

# We fit the OLS with the data, the formula and without intercept
ols = OLS(formula, data, fit_intercept=False)

ols.summary()


Output will be:

==================================================================================
|                                  OLS summary                                   |
==================================================================================
| R²             =             0.9846 | R² Adj.      =                   0.98449 |
| n              =                999 | p            =                         7 |
| Fisher value   =           10568.56 |                                          |
==================================================================================
| Variables         | Coefficients   | Std. Errors  | t-values   | Probabilities |
==================================================================================
| X0                |         1.2898 |      0.03218 |     40.085 |     0.0   *** |
| X1                |       -0.50096 |      0.03187 |    -15.718 |     0.0   *** |
| X2                |        1.62202 |      0.04264 |     38.039 |     0.0   *** |
| X3                |        2.56471 |      0.03196 |     80.252 |     0.0   *** |
| X4                |       -7.58065 |      0.03226 |   -234.983 |     0.0   *** |
| X1*X2             |       -0.03968 |      0.03438 |     -1.154 |   0.249       |
| exp(X2)           |        0.00301 |      0.01692 |      0.178 |   0.859       |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================


You can also predict new values with their confidence interval:

# Generate a new synthetic dataset
test_data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# Predict with 95% confidence interval
ols.predict(test_data, conf_level=.95)


Output will be:

    Prediction  LowerBound  UpperBound
0    -19.252926  -19.265841  -19.240012
1      4.988078    4.975164    5.000993
2     10.824623   10.811708   10.837537
3     -2.725563   -2.738477   -2.712649
4      4.057040    4.044125    4.069954


### LinearBayes

from statinf.regressions import LinearBayes