Linear regression

OLS

class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=True)

Bases: object

Ordinary Least Squares regression

Parameters
  • formula (str) – Regression formula to be run, of the form y ~ x1 + x2.

  • data (pandas.DataFrame) – Input data with Pandas format.

  • fit_intercept (bool, optional) – Used for adding intercept in the regression formula, defaults to True.

adjusted_r_squared()

Adjusted-\(R^{2}\) – Goodness of fit

Formula
\[R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

References

Theil, Henri (1961). Economic Forecasts and Policy.

Returns

Adjusted goodness of fit.

Return type

float
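The adjustment above can be sketched in plain Python. The values of n, p and r2 below are illustrative numbers, not taken from a fitted model:

```python
# Adjusted R-squared from a plain R-squared value.
# n, p and r2 are illustrative, not library output.
n, p = 1000, 5                 # sample size and number of explanatory variables
r2 = 0.98484                   # goodness of fit (R²)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 5))        # → 0.98476
```

Note that the adjusted value is always below the raw R² whenever p > 0, which is the point of the penalty term.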

fitted_values()

Computes the estimated values of Y

Formula
\[\hat{Y} = X \beta\]
Returns

Fitted values for Y.

Return type

numpy.array

get_betas()

Computes the estimates for each explanatory variable

Formula
\[\beta = (X'X)^{-1} X'Y\]
Returns

Estimated coefficients

Return type

numpy.array
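A minimal NumPy sketch of the normal equations above, on synthetic data. All names here (X, y, true_beta, beta_hat) are illustrative and not part of the statinf API:

```python
import numpy as np

# Estimate beta = (X'X)^{-1} X'Y on synthetic data.
rng = np.random.default_rng(0)
true_beta = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ true_beta + rng.normal(scale=0.1, size=500)

# Solve the linear system X'X beta = X'y rather than forming the
# explicit inverse, which is numerically more stable.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.5, -2.0, 0.5]
```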

predict(new_data)

Predicted \(\hat{Y}\) values for a new dataset

Parameters

new_data (pandas.DataFrame) – New data to evaluate with pandas data-frame format.

Formula
\[\hat{Y} = X \hat{\beta}\]
Returns

Predictions

Return type

numpy.array
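The matrix product underlying prediction can be sketched with NumPy. The coefficient vector and design matrix below are illustrative, not library output:

```python
import numpy as np

# y_hat = X_new @ beta_hat for three new observations.
# beta_hat and X_new are illustrative values.
beta_hat = np.array([1.2556, -0.465, 1.665414, 2.5444, -7.56445])
X_new = np.ones((3, 5))        # three observations, five regressors
y_hat = X_new @ beta_hat
print(y_hat)                   # three copies of the coefficients' sum
```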

r_squared()

\(R^{2}\) – Goodness of fit

Formula
\[R^{2} = 1 - \dfrac{RSS}{TSS}\]
Returns

Goodness of fit.

Return type

float

rss()

Residual Sum of Squares

Formula
\[RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\hat{y}_{i}\) denotes the predicted value of \(y\) for individual \(i\).

Returns

Residual Sum of Squares.

Return type

float

summary(return_df=False)

Statistical summary for OLS

Parameters

return_df (bool) – Return the summary as a Pandas DataFrame, else print a string, defaults to False.

Formulae
  • Fisher test:

\[\mathcal{F} = \dfrac{(TSS - RSS) / p}{RSS / (n - p - 1)}\]

where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size

  • Covariance matrix:

\[\mathbb{V}(\beta) = \sigma^{2} (X'X)^{-1}\]

where \(\sigma^{2} = \frac{RSS}{n - p - 1}\)

  • Coefficients’ significance:

\[p = 2 \left( 1 - T_{n} \left( \dfrac{\beta}{\sqrt{\mathbb{V}(\beta)}} \right) \right)\]

where \(T\) denotes the Student cumulative distribution function (c.d.f.) with \(n\) degrees of freedom
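The coefficient significance test can be sketched with scipy.stats. The values of beta, se and df below are illustrative numbers, not library output:

```python
from scipy import stats

# Two-sided p-value: p = 2 * (1 - T_df(|beta / se|)),
# using SciPy's Student-t cumulative distribution function.
# beta, se and df are illustrative values.
beta, se, df = 1.284372, 0.032941, 994

t_value = beta / se
p_value = 2 * (1 - stats.t.cdf(abs(t_value), df=df))
print(t_value, p_value)        # t ≈ 38.99, p ≈ 0.0
```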

tss()

Total Sum of Squares

Formula
\[TSS = \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}\]

where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\bar{y}\) denotes the average value of \(y\).

Returns

Total Sum of Squares.

Return type

float
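RSS, TSS and the resulting R² fit together as follows. The arrays y and y_hat below are illustrative, not library output:

```python
import numpy as np

# RSS, TSS and R² = 1 - RSS/TSS on small illustrative arrays.
y = np.array([3.0, -1.0, 2.0, 0.5])      # observed values
y_hat = np.array([2.5, -0.5, 2.0, 1.0])  # fitted values

rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1 - rss / tss
print(rss, tss, round(r2, 4))            # → 0.75 9.1875 0.9184
```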

Example

import statinf.GenerateData as gd
from statinf.regressions import OLS

# Generate a synthetic dataset
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)

# Set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4"
# Fit the OLS with the data and the formula, without an intercept
ols = OLS(formula, data, fit_intercept=False)

ols.summary()

Output will be:

=========================================================================
                            OLS summary
=========================================================================
| R² = 0.98484                  | Adjusted-R² = 0.98477
| n  =    999                   | p =     5
| Fisher = 16146.04006
=========================================================================
Variables  Coefficients  Standard Errors    t values  Probabilities
    X0      1.284372         0.032941   38.989887           0.0
    X1     -0.477014         0.031606  -15.092496           0.0
    X2      1.645034         0.034310   47.945992           0.0
    X3      2.571289         0.031940   80.504863           0.0
    X4     -7.634125         0.032077 -237.994821           0.0