Linear regression
OLS
- class statinf.regressions.LinearModels.LinearBayes[source]
Bases:
object
Class for a Bayesian linear regression with known standard deviation of the residual distribution.
Warning
This function is still under development. This is a beta version; please be aware that some functionalities might not be available. The full stable version will be released soon.
- Parameters
w_0 (numpy.array) – Mean of the prior distribution of the weights, assuming it is Gaussian, defaults to 0.
V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, defaults to the identity matrix.
w_n (numpy.array) – Mean of the posterior distribution of the weights after observing the data.
V_n (numpy.array) – Covariance matrix of the posterior distribution of the weights after observing the data.
- References
Murphy, K. P. (2012). Machine learning: a probabilistic perspective.
- Source
Inspired by: https://jessicastringham.net/2018/01/03/bayesian-linreg/
- fit(X, y, true_sigma=None, w_0=None, V_0=None)[source]
Fits a linear regression model and finds the posterior distribution of the weights, assuming the standard deviation of the residuals is known. The case of unknown standard deviation will be added in a future implementation.
- Parameters
X (numpy.array) – Input data.
y (numpy.array) – Data values.
true_sigma (float) – Standard deviation of the residual distribution.
w_0 (numpy.array) – Mean of the prior distribution of the weights, assuming it is Gaussian, defaults to 0.
V_0 (numpy.array) – Covariance matrix of the prior distribution of the weights, defaults to the identity matrix.
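With a Gaussian prior and known residual standard deviation, the posterior parameters w_n and V_n have a closed form (Murphy, 2012). The sketch below illustrates that update with plain NumPy; the names X, y, w_0, V_0, and true_sigma mirror the documented parameters, but this is an illustrative reconstruction, not the library's own code.

```python
import numpy as np

# Illustrative closed-form posterior update for Bayesian linear
# regression with known residual std (not the library's internals).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # design matrix
true_w = np.array([1.0, 2.0])
sigma = 0.5                                              # known residual std
y = X @ true_w + rng.normal(scale=sigma, size=50)

w_0 = np.zeros(2)   # prior mean (default 0)
V_0 = np.eye(2)     # prior covariance (default identity)

# Posterior covariance and mean of the weights
V_n = np.linalg.inv(np.linalg.inv(V_0) + (X.T @ X) / sigma**2)
w_n = V_n @ (np.linalg.inv(V_0) @ w_0 + (X.T @ y) / sigma**2)
```

The posterior mean w_n shrinks the least-squares solution toward the prior mean w_0, with the amount of shrinkage controlled by V_0 and sigma.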
- plot_posterior_line(X, y, n_lines=200, res=100, xlim=(-1, 10))[source]
Plots the model’s distribution sampled from the posterior distribution of the weights.
- Parameters
X (numpy.array) – Input data.
y (numpy.array) – Data values.
n_lines (int) – Number of lines sampled from the posterior distribution, defaults to 200.
res (int) – Resolution of the grid, defaults to 100.
xlim (tuple) – Tuple of the min and max values of the grid along the x axis, defaults to (-1, 10).
- plot_weight_distributions(res=100, xlim=(-8, 8), ylim=(-8, 8))[source]
Plots the weight distribution for the prior and the posterior probabilities.
- Parameters
res (int) – Resolution of the grid, defaults to 100.
xlim (tuple) – Tuple of the min and max values of the grid along the x axis, defaults to (-8, 8).
ylim (tuple) – Tuple of the min and max values of the grid along the y axis, defaults to (-8, 8).
- class statinf.regressions.LinearModels.OLS(formula, data, fit_intercept=False)[source]
Bases:
object
Ordinary Least Squares regression.
- Parameters
formula (str) – Regression formula to be run, of the form y ~ x1 + x2. See parse_formula() in ProcessData.
data (pandas.DataFrame) – Input data in Pandas format.
fit_intercept (bool, optional) – Whether to add an intercept to the regression formula, defaults to False.
- adjusted_r_squared()[source]
Adjusted-\(R^{2}\) – Goodness of fit
- Formula
- \[R^{2}_{adj} = 1 - (1 - R^{2}) \dfrac{n - 1}{n - p - 1}\]
where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size
- References
Theil, Henri (1961). Economic Forecasts and Policy.
- Returns
Adjusted goodness of fit.
- Return type
float
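The formula above can be checked numerically. The values of n, p, and R² below are taken from the example summary shown further down in this page; the arithmetic is illustrative only.

```python
# Adjusted R² from the formula above, using the n, p and R²
# reported in the example OLS summary on this page.
n, p = 999, 7                 # sample size and number of estimates
r2 = 0.9846                   # plain R²
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2_adj, 5))       # ≈ 0.98449, as in the summary
```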
- fitted_values()[source]
Computes the estimated values of Y
- Formula
- \[\hat{Y} = X \beta\]
- Returns
Fitted values for Y.
- Return type
numpy.array
- get_betas()[source]
Computes the estimates for each explanatory variable
- Formula
- \[\beta = (X'X)^{-1} X'Y\]
- Returns
Estimated coefficients
- Return type
numpy.array
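The normal-equations formula above can be reproduced directly with NumPy. This is a generic sketch on synthetic data, not the library's implementation.

```python
import numpy as np

# Closed-form OLS estimator: beta = (X'X)^{-1} X'Y
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))              # 200 observations, 3 regressors
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # close to beta_true
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` is preferred over an explicit inverse for numerical stability, but the formula is the same.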
- predict(new_data, conf_level=None)[source]
Predicted \(\hat{Y}\) values for a new dataset
- Parameters
new_data (pandas.DataFrame) – New data to evaluate, in pandas DataFrame format.
conf_level (float) – Level of the confidence interval, defaults to None.
- Formulae
- \[\hat{Y} = X \hat{\beta}\]
The confidence interval is computed as:
\[\left[ \hat{Y} \pm z_{1 - \frac{\alpha}{2}} \dfrac{\sigma}{\sqrt{n - 1}} \right]\]
- Returns
Predictions \(\hat{Y}\)
- Return type
numpy.array
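The confidence interval formula above can be sketched as follows. The values of y_hat, sigma, and n are illustrative placeholders, not output from the library.

```python
import numpy as np
from scipy.stats import norm

# Confidence interval around predictions, following the formula above.
y_hat = np.array([-19.25, 4.99, 10.82])  # illustrative predictions X @ beta_hat
sigma = 0.1                              # residual standard error (illustrative)
n = 1000                                 # sample size
alpha = 0.05                             # 95% confidence level

z = norm.ppf(1 - alpha / 2)              # z_{1 - alpha/2} ≈ 1.96
half_width = z * sigma / np.sqrt(n - 1)
lower, upper = y_hat - half_width, y_hat + half_width
```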
- r_squared()[source]
\(R^{2}\) – Goodness of fit
- Formula
- \[R^{2} = 1 - \dfrac{RSS}{TSS}\]
- Returns
Goodness of fit.
- Return type
float
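On toy values, the ratio RSS/TSS can be computed directly to illustrate the formula above; the numbers are made up.

```python
import numpy as np

# R² = 1 - RSS/TSS on toy observed and fitted values.
y = np.array([3.0, -1.0, 2.5, 0.0])       # observed values
y_hat = np.array([2.8, -1.1, 2.9, 0.3])   # fitted values
rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
r2 = 1 - rss / tss
```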
- rss()[source]
Residual Sum of Squares
- Formula
- \[RSS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\]
where \(y_{i}\) denotes the true/observed value of \(y\) for individual \(i\) and \(\hat{y}_{i}\) denotes the predicted value of \(y\) for individual \(i\).
- Returns
Residual Sum of Squares.
- Return type
float
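The sum above is a one-liner in NumPy; the observed and fitted values below are made up for illustration.

```python
import numpy as np

# RSS = sum of squared residuals between observed and fitted values.
y = np.array([3.0, -1.0, 2.5, 0.0])       # observed y_i
y_hat = np.array([2.8, -1.1, 2.9, 0.3])   # fitted y_hat_i
rss = np.sum((y - y_hat) ** 2)
```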
- summary(return_df=False)[source]
Statistical summary for OLS
- Parameters
return_df (bool) – Return the summary as a Pandas DataFrame, else return a string, defaults to False.
- Formulae
Fisher test:
\[\mathcal{F} = \dfrac{TSS - RSS}{\frac{RSS}{n - p}}\]
where \(p\) denotes the number of estimates (i.e. explanatory variables) and \(n\) the sample size.
Covariance matrix:
\[\mathbb{V}(\beta) = \sigma^{2} X'X\]
where \(\sigma^{2} = \frac{RSS}{n - p - 1}\)
Coefficients’ significance:
\[p = 2 \left( 1 - T_{n} \left( \dfrac{\beta}{\sqrt{\mathbb{V}(\beta)}} \right) \right)\]
where \(T_{n}\) denotes the Student cumulative distribution function (c.d.f.) with \(n\) degrees of freedom.
- References
Student. (1908). The probable error of a mean. Biometrika, 1-25.
Shen, Q., & Faraway, J. (2004). An F test for linear models with functional responses. Statistica Sinica, 1239-1257.
Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.
Cameron, A. C., & Trivedi, P. K. (2009). Microeconometrics using stata (Vol. 5, p. 706). College Station, TX: Stata press.
- Returns
Model’s summary.
- Return type
pandas.DataFrame or str
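The coefficient-significance formula can be checked with scipy's Student t distribution. The coefficient and standard error below are taken from the X2 row of the example summary; this is an illustrative calculation, not the library's code.

```python
from scipy.stats import t

# p-value of a coefficient: twice the upper tail of the Student t
# c.d.f. at the t-statistic beta / sqrt(V(beta)).
beta = 1.62202                   # estimated coefficient (X2 in the example)
std_err = 0.04264                # its standard error
n = 999                          # degrees of freedom

t_stat = beta / std_err          # large t-statistic, ≈ 38.04
p_value = 2 * (1 - t.cdf(abs(t_stat), df=n))
print(t_stat, p_value)           # p-value indistinguishable from 0
```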
Examples
OLS
import statinf.data as gd
from statinf.regressions import OLS
# Generate a synthetic dataset
data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)
# We set the OLS formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4 + X1*X2 + exp(X2)"
# We fit the OLS with the data, the formula and without intercept
ols = OLS(formula, data, fit_intercept=False)
ols.summary()
Output will be:
==================================================================================
| OLS summary |
==================================================================================
| R² = 0.9846 | R² Adj. = 0.98449 |
| n = 999 | p = 7 |
| Fisher value = 10568.56 | |
==================================================================================
| Variables | Coefficients | Std. Errors | t-values | Probabilities |
==================================================================================
| X0 | 1.2898 | 0.03218 | 40.085 | 0.0 *** |
| X1 | -0.50096 | 0.03187 | -15.718 | 0.0 *** |
| X2 | 1.62202 | 0.04264 | 38.039 | 0.0 *** |
| X3 | 2.56471 | 0.03196 | 80.252 | 0.0 *** |
| X4 | -7.58065 | 0.03226 | -234.983 | 0.0 *** |
| X1*X2 | -0.03968 | 0.03438 | -1.154 | 0.249 |
| exp(X2) | 0.00301 | 0.01692 | 0.178 | 0.859 |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================
You can also predict new values with their confidence interval
# Generate a new synthetic dataset
test_data = gd.generate_dataset(coeffs=[1.2556, -0.465, 1.665414, 2.5444, -7.56445], n=1000, std_dev=2.6)
# Predict with 95% confidence interval
ols.predict(test_data, conf_level=.95)
Output will be:
Prediction LowerBound UpperBound
0 -19.252926 -19.265841 -19.240012
1 4.988078 4.975164 5.000993
2 10.824623 10.811708 10.837537
3 -2.725563 -2.738477 -2.712649
4 4.057040 4.044125 4.069954
LinearBayes
from statinf.regressions import LinearBayes