Generalized Linear Models
GLM
- class statinf.regressions.glm.GLM(formula, data, family='binomial', fit_intercept=False, initializer='zeros')[source]
Bases:
object
Generalized Linear Model fitted with the Newton-Raphson method.
- Parameters
  - formula (str) – Regression formula to be run, of the form y ~ x1 + x2. See statinf.data.ProcessData.parse_formula().
  - data (pandas.DataFrame) – Input data in Pandas format.
  - family (str, optional) – Family distribution of the dependent variable, defaults to 'binomial'.
  - fit_intercept (bool, optional) – Whether to add an intercept to the regression formula, defaults to False.
  - initializer (str, optional) – Method for initializing the first parameters (see statinf.ml.initializations()), defaults to 'zeros'.
Note
The module allows Binomial (for Logit) and Gaussian (for Probit) distributions. It will soon be extended to other distributions.
- adjusted_r_squared()[source]
McFadden’s adjusted pseudo-\(R^{2}\) – Adjusted goodness of fit
- Formula
- \[R^{2}_{adj} = 1 - \dfrac{LL(\hat{\beta}) - p}{LL(\bar{Y})}\]
- Returns
Adjusted goodness of fit.
- Return type
float
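The two pseudo-\(R^{2}\) formulae above reduce to a few lines of Python. The sketch below is illustrative, not statinf’s implementation; mcfadden_r2 is a hypothetical helper, and the log-likelihood values are taken from the summary example at the bottom of this page.

```python
def mcfadden_r2(ll_model, ll_null, n_params=None):
    """McFadden's pseudo-R2: 1 - LL(beta_hat) / LL(Y_bar).

    Pass n_params (the number of estimated coefficients, p) to get the
    adjusted version: 1 - (LL(beta_hat) - p) / LL(Y_bar).
    """
    if n_params is None:
        return 1.0 - ll_model / ll_null
    return 1.0 - (ll_model - n_params) / ll_null

# Log-likelihoods from the summary example at the bottom of this page
r2 = mcfadden_r2(-227.62, -692.45)                 # -> 0.67128, as in the table
r2_adj = mcfadden_r2(-227.62, -692.45, n_params=5)
```

Since \(LL(\bar{Y})\) is negative, subtracting \(p\) in the numerator makes the adjusted value smaller than the unadjusted one, penalizing model size.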
- fit(maxit=15, cov_type='nonrobust', improvement_threshold=0.0005, keep_hist=True, plot=False)[source]
Fits the GLM regression model using the Newton-Raphson method.
- Parameters
  - maxit (int, optional) – Maximum number of iterations, defaults to 15.
  - cov_type (str, optional) – Type of the covariance matrix (nonrobust or sandwich), defaults to nonrobust.
  - improvement_threshold (float, optional) – Threshold above which the likelihood is considered to have improved, defaults to 0.0005.
  - keep_hist (bool, optional) – Keep the training history (gradients, Hessian, etc.); can be turned off to save memory, defaults to True.
  - plot (bool, optional) – Plot the evolution of the log-likelihood across iterations (requires keep_hist = True), defaults to False.
- Formulae
Log-likelihood:
\[l(\beta) = \sum_{i=1}^{N} y_{i} \log \left[ G(\mathbf{x_i} \beta) \right] + (1 - y_{i}) \log \left[ 1 - G(\mathbf{x_i} \beta) \right]\]
Newton’s method:
\[\hat{\beta}_{s+1} = \hat{\beta}_{s} - H^{-1}_{s} G_{s}\]
with
\[G_{s} = \dfrac{\partial}{\partial \beta_{s}} l(\beta) = \sum_{i=1}^{N} x_i (y_{i} - G(x_{i} \beta))\]
and
\[H_{s} = \dfrac{\partial^2}{\partial \beta_{s}^2} l(\beta) = -X'S_{s}X\]
where \(S_{s} = \text{diag} \left( \hat{p}_i (1 - \hat{p}_i) \right)\) and \(G\) denotes the link function (statinf.ml.activations.sigmoid() for logit or the Gaussian c.d.f. for probit).
- References
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.
McCullagh, P. (2018). Generalized linear models. Routledge.
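The iteration above can be sketched in a few lines of NumPy for the logit case. This is a minimal illustration of the formulae, not statinf’s internal code; because the logit Hessian is negative definite, subtracting \(H^{-1}_{s} G_{s}\) moves \(\beta\) uphill on the likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit_newton(X, y, maxit=15, improvement_threshold=0.0005):
    """Newton-Raphson for the logit log-likelihood (illustrative sketch)."""
    beta = np.zeros(X.shape[1])      # 'zeros' initializer
    ll_prev = -np.inf
    for _ in range(maxit):
        p = sigmoid(X @ beta)        # G(x_i beta)
        ll = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        if ll - ll_prev < improvement_threshold:
            break                    # likelihood no longer improving
        ll_prev = ll
        grad = X.T @ (y - p)                    # G_s = sum_i x_i (y_i - G(x_i beta))
        H = -X.T @ np.diag(p * (1.0 - p)) @ X   # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(H, grad)  # beta_{s+1} = beta_s - H^{-1} G_s
    return beta
```

On well-behaved data this converges in a handful of iterations, consistent with the "Iterations = 8" line of the summary example below.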
- partial_effects(variables, new_data=None, average=False)[source]
Computes Partial and Average Partial Effects (APE).
- Parameters
  - variables (str or list) – Variable(s) for which to compute the PE/APE.
  - new_data (pandas.DataFrame, optional) – Data to use for the computations, defaults to None (uses the training set).
  - average (bool, optional) – Whether to compute Average Partial Effects, defaults to False.
- Formula
- \[PE(X_{i}) = \beta_{i} \dfrac{e^{-\beta X}}{(e^{-\beta X} + 1)^{2}}\]
If average = True, \(APE = \overline{PE}\).
- Raises
  - TypeError – If argument variables is neither str nor list.
- Returns
Dictionary including Partial Effects (PE) or Average Partial Effects (APE).
- Return type
dict
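For the logit case, the partial-effect formula above can be sketched directly with NumPy. This is an illustrative helper, not statinf’s implementation; it assumes the columns of X line up with the names in variables.

```python
import numpy as np

def partial_effects_logit(X, beta, variables, average=False):
    """PE(X_i) = beta_i * e^{-X beta} / (e^{-X beta} + 1)^2 for a logit model.

    `variables` lists the column names of X, in order (an assumption of
    this sketch). With average=True, returns the mean effect over rows.
    """
    xb = X @ beta
    dens = np.exp(-xb) / (np.exp(-xb) + 1.0) ** 2   # logistic density at X beta
    pe = {v: beta[i] * dens for i, v in enumerate(variables)}
    if average:
        return {v: float(np.mean(val)) for v, val in pe.items()}
    return pe
```

At \(X\beta = 0\) the density term equals \(1/4\), its maximum, so partial effects are largest near predicted probabilities of one half.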
- predict(new_data, return_proba=False)[source]
Predicted \(\hat{Y}\) values for a new dataset.
- Parameters
  - new_data (pandas.DataFrame) – New data to evaluate, in pandas data-frame format.
  - return_proba (bool, optional) – Whether to return probabilities or binary values, defaults to False.
- Formula
- \[f(X) = \dfrac{1}{e^{-\beta X} + 1}\]
- Returns
Predictions \(\hat{Y}\)
- Return type
numpy.ndarray
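A minimal NumPy sketch of this prediction step, assuming (as is conventional, though not stated here) a 0.5 cutoff when binary values are requested:

```python
import numpy as np

def predict_logit(X, beta, return_proba=False, threshold=0.5):
    """f(X) = 1 / (1 + e^{-X beta}); binary labels unless return_proba=True.

    The 0.5 threshold is an assumption of this sketch, not documented above.
    """
    proba = 1.0 / (1.0 + np.exp(-(X @ beta)))
    if return_proba:
        return proba
    return (proba >= threshold).astype(int)
```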
- r_squared()[source]
McFadden’s pseudo-\(R^{2}\) – Goodness of fit
- Formula
- \[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]
- Returns
Goodness of fit.
- Return type
float
- summary(return_df=False)[source]
Statistical summary for GLM model
- Parameters
  - return_df (bool, optional) – Return the summary as a pandas DataFrame, otherwise return a string, defaults to False.
- Formulae
McFadden’s \(R^{2}\):
\[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]
where \(LL\) denotes the log-likelihood.
- References
Student. (1908). The probable error of a mean. Biometrika, 1-25.
Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.
Agresti, A. (2003). Categorical data analysis (Vol. 482). John Wiley & Sons.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
- Returns
Model’s summary.
- Return type
pandas.DataFrame or str
- variance(cov_type='nonrobust')[source]
Compute the covariance matrix for the fitted model.
- Parameters
  - cov_type (str, optional) – Type of the covariance matrix, defaults to nonrobust.
- Formulae
Non-robust covariance matrix:
\[\sigma_{\beta} = {(X'SX)}^{-1}\]
Sandwich covariance matrix:
\[\sigma_{\beta} = {(X'X)}^{-1} (X'\hat{S}X) {(X'X)}^{-1}\]
with \(S = \text{diag} \left( \hat{p}_i (1 - \hat{p}_i) \right)\) and \(\hat{S} = \text{diag} \left( (Y_i - \hat{p}_i)^{2} \right)\)
Note
Only the non-robust covariance matrix is currently available. The sandwich estimator and \(HC0\), \(HC1\), \(HC2\), \(HC3\) will be implemented soon.
- Returns
Covariance matrix of the estimated coefficients (the inverse of the Fisher information in the non-robust case).
- Return type
numpy.ndarray
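Both estimators above can be sketched with NumPy. This is an illustration of the formulae, not statinf’s code; p_hat stands for the fitted probabilities \(\hat{p}_i\).

```python
import numpy as np

def glm_covariance(X, y, p_hat, cov_type="nonrobust"):
    """Covariance of beta_hat for a logit fit (sketch of the formulae above)."""
    if cov_type == "nonrobust":
        S = np.diag(p_hat * (1.0 - p_hat))         # S = diag(p(1-p))
        return np.linalg.inv(X.T @ S @ X)          # (X'SX)^{-1}
    if cov_type == "sandwich":
        S_hat = np.diag((y - p_hat) ** 2)          # S_hat = diag((y - p)^2)
        bread = np.linalg.inv(X.T @ X)
        return bread @ (X.T @ S_hat @ X) @ bread   # (X'X)^{-1}(X'ŜX)(X'X)^{-1}
    raise ValueError(f"unknown cov_type: {cov_type}")
```

Standard errors in the summary table are the square roots of the diagonal of this matrix.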
Example
from statinf.regressions import GLM
# We set the Logit formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4"
# We fit the GLM on data (a pandas.DataFrame), with the formula and without intercept
logit = GLM(formula, data, fit_intercept=False, family='binomial')
logit.fit(cov_type='nonrobust', plot=False)
logit.summary()
Output will be:
==================================================================================
| Logit summary |
==================================================================================
| McFadden's R² = 0.67128 | McFadden's R² Adj. = 0.6424 |
| Log-Likelihood = -227.62 | Null Log-Likelihood = -692.45 |
| LR test p-value = 0.0 | Covariance = nonrobust |
| n = 999 | p = 5 |
| Iterations = 8 | Convergence = True |
==================================================================================
| Variables | Coefficients | Std. Errors | t-values | Probabilities |
==================================================================================
| X0 | -1.13024 | 0.10888 | -10.381 | 0.0 *** |
| X1 | 0.02963 | 0.07992 | 0.371 | 0.711 |
| X2 | -1.40968 | 0.1261 | -11.179 | 0.0 *** |
| X3 | 0.5253 | 0.08966 | 5.859 | 0.0 *** |
| X4 | 0.14705 | 0.25018 | 0.588 | 0.557 |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================