# Linear regression¶

## OLS¶

class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=True)

Bases: object

Ordinary Least Squares regression

Parameters
• formula (str) – Regression formula to be run, of the form y ~ x1 + x2.

• data (pandas.DataFrame) – Input data with Pandas format.

• fit_intercept (bool, optional) – Used for adding intercept in the regression formula, defaults to True.

adjusted_r_squared()

Adjusted-$$R^{2}$$ – Goodness of fit

Formula
$R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}$

where $$p$$ denotes the number of estimates (i.e. explanatory variables) and $$n$$ the sample size

References

Theil, Henri (1961). Economic Forecasts and Policy.

Returns

Adjusted goodness of fit.

Return type

float
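
As a quick illustration of the adjustment, here is a minimal Python sketch (a plain helper for illustration, not part of statinf) applying the formula above:

```python
def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for the number of explanatory variables p
    and the sample size n: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The adjustment always lowers R-squared (for p >= 1), penalising
# models that add explanatory variables without improving the fit.
print(adjusted_r_squared(0.9, 100, 5))
```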

fitted_values()

Computes the estimated values of Y

Formula
$\hat{Y} = X \beta$
Returns

Fitted values for Y.

Return type

numpy.array

get_betas()

Computes the estimates for each explanatory variable

Formula
$\beta = (X'X)^{-1} X'Y$
Returns

Estimated coefficients

Return type

numpy.array
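
The closed form above can be reproduced with NumPy alone. A minimal sketch (independent of statinf) recovering known coefficients from noise-free data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # design matrix, 3 explanatory variables
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true                        # noise-free response

# beta = (X'X)^{-1} X'Y, computed with np.linalg.solve rather than an
# explicit matrix inverse, for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers [1.5, -2.0, 0.5] up to floating-point error
```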

predict(new_data)

Predicted $$\hat{Y}$$ values for a new dataset

Parameters

new_data (pandas.DataFrame) – New data to evaluate, in pandas DataFrame format.

Formula
$\hat{Y} = X \hat{\beta}$
Returns

Predictions

Return type

numpy.array

r_squared()

$$R^{2}$$ – Goodness of fit

Formula
$R^{2} = 1 - \dfrac{RSS}{TSS}$
Returns

Goodness of fit.

Return type

float

rss()

Residual Sum of Squares

Formula
$RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$

where $$y_{i}$$ denotes the true/observed value of $$y$$ for individual $$i$$ and $$\hat{y}_{i}$$ denotes the predicted value of $$y$$ for individual $$i$$.

Returns

Residual Sum of Squares.

Return type

float

summary(return_df=False)

Statistical summary for OLS

Parameters

return_df (bool) – Return the summary as a Pandas DataFrame, else print a string, defaults to False.

Formulae
• Fisher test:

$\mathcal{F} = \dfrac{TSS - RSS}{\frac{RSS}{n - p}}$

where $$p$$ denotes the number of estimates (i.e. explanatory variables) and $$n$$ the sample size

• Covariance matrix:

$\mathbb{V}(\hat{\beta}) = \sigma^{2} (X'X)^{-1}$

where $$\sigma^{2} = \frac{RSS}{n - p - 1}$$

• Coefficients’ significance:

$p = 2 \left( 1 - T_{n} \left( \dfrac{\beta}{\sqrt{\mathbb{V}(\beta)}} \right) \right)$

where $$T$$ denotes the Student cumulative distribution function (c.d.f.) with $$n$$ degrees of freedom
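
The significance computation can be sketched in pure Python using the normal approximation to the Student c.d.f. (reasonable for large samples); the helper below is a hypothetical illustration, not a statinf function:

```python
from statistics import NormalDist

def two_sided_p(beta, se):
    """Two-sided p-value for H0: beta = 0, using the standard normal
    c.d.f. as a large-sample approximation to the Student c.d.f."""
    t_stat = abs(beta / se)
    return 2 * (1 - NormalDist().cdf(t_stat))

# A coefficient of zero is maximally insignificant (p = 1), while a
# large t-statistic drives the p-value towards 0.
print(two_sided_p(0.0, 1.0))   # -> 1.0
print(two_sided_p(1.96, 1.0))  # -> ~0.05
```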

tss()

Total Sum of Squares

Formula
$TSS = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}$

where $$y_{i}$$ denotes the true/observed value of $$y$$ for individual $$i$$ and $$\bar{y}$$ denotes the average value of $$y$$.

Returns

Total Sum of Squares.

Return type

float
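
Tying the goodness-of-fit quantities together, a short NumPy sketch (independent of statinf, with hypothetical fitted values) applying the RSS, TSS and $$R^{2}$$ formulas above:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed values
y_hat = np.array([2.8, 5.1, 7.2, 8.9])   # hypothetical fitted values

rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1 - rss / tss                       # R^2 = 1 - RSS/TSS

print(rss, tss, r2)  # -> 0.1 20.0 0.995
```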

## Example¶

import statinf.GenerateData as gd
from statinf.regressions import OLS

# Generate a synthetic dataset with five explanatory variables
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# We set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4"
# We fit the OLS with the data, the formula and without intercept
ols = OLS(formula, data, fit_intercept=False)

ols.summary()


Output will be:

=========================================================================
OLS summary
=========================================================================
| R² = 0.98484                  | Adjusted-R² = 0.98477
| n  =    999                   | p =     5
| Fisher = 16146.04006
=========================================================================
Variables  Coefficients  Standard Errors    t values  Probabilites
X0      1.284372         0.032941   38.989887           0.0
X1     -0.477014         0.031606  -15.092496           0.0
X2      1.645034         0.034310   47.945992           0.0
X3      2.571289         0.031940   80.504863           0.0
X4     -7.634125         0.032077 -237.994821           0.0