Generalized Linear Models

GLM

class statinf.regressions.glm.GLM(formula, data, family='binomial', fit_intercept=False, initializer='zeros')

Bases: object

Generalized Linear Model implemented with Newton-Raphson’s method.

Parameters

formula (str) – Regression formula to be run of the form y ~ x1 + x2. See statinf.data.ProcessData.parse_formula().
data (pandas.DataFrame) – Input data as a pandas DataFrame.
family (str, optional) – Family distribution of the dependent variable, defaults to binomial.
fit_intercept (bool, optional) – Whether to add an intercept to the regression formula, defaults to False.
initializer (str, optional) – Method for initializing the first parameters (see statinf.ml.initializations()), defaults to ‘zeros’.

Note

The module currently supports the Binomial (for Logit) and Gaussian (for Probit) distributions. It will soon be extended to other distributions.

adjusted_r_squared()

McFadden’s adjusted pseudo-\(R^{2}\) – Adjusted goodness of fit

Formula: \[R^{2}_{adj} = 1 - \dfrac{LL(\hat{\beta}) - p}{LL(\bar{Y})}\]
Returns: Adjusted goodness of fit.
Return type: float
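The formula above can be sketched with NumPy as follows. This is an illustration only, not statinf's implementation: the helper names are hypothetical, and \(LL(\bar{Y})\) is assumed to be the log-likelihood of a model that predicts the sample mean of \(Y\) for every observation.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Sum of y*log(p) + (1 - y)*log(1 - p) over all observations."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities to avoid log(0)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_adjusted_r2(y, p_hat, n_params):
    """1 - (LL(beta_hat) - p) / LL(y_bar), with p the number of parameters."""
    y = np.asarray(y, dtype=float)
    ll_model = bernoulli_log_likelihood(y, p_hat)
    # Null model (assumption): every observation gets the sample mean of y
    ll_null = bernoulli_log_likelihood(y, np.full_like(y, y.mean()))
    return 1 - (ll_model - n_params) / ll_null

# Toy labels and fitted probabilities, made up for illustration
y = [0, 0, 1, 1, 1, 0]
p_hat = [0.1, 0.3, 0.8, 0.7, 0.9, 0.2]
adj_r2 = mcfadden_adjusted_r2(y, p_hat, n_params=1)
```

Subtracting \(p\) from the model log-likelihood penalises each extra parameter, so the adjusted value is never larger than the unadjusted one.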

fit(maxit=15, cov_type='nonrobust', improvement_threshold=0.0005, keep_hist=True, plot=False)

Fits the GLM regression model using the Newton-Raphson method.

Parameters

maxit (int, optional) – Maximum number of iterations, defaults to 15.
cov_type (str, optional) – Type of the covariance matrix (non-robust or sandwich), defaults to nonrobust.
improvement_threshold (float, optional) – Threshold from which we consider the likelihood improved, defaults to 0.0005.
keep_hist (bool, optional) – Keeps training history (gradients, hessian, etc…), can be turned off for saving memory, defaults to True.
plot (bool, optional) – Plots evolution of log-likelihood through the different iterations (requires keep_hist = True), defaults to False.

Formulae

Log-likelihood:

\[l(\beta) = \sum_{i=1}^{N} \left( y_{i} \log \left[ G(\mathbf{x_i} \beta) \right] + (1 - y_{i}) \log \left[1 - G(\mathbf{x_i} \beta) \right] \right)\]

Newton’s method:

\[\hat{\beta}_{s+1} = \hat{\beta}_{s} - H^{-1}_{s} G_{s}\]

with

\[G_{s} = \dfrac{\partial}{\partial \beta_{s}} l(\beta) = \sum_{i=1}^{N} x_i (y_{i} - G(x_{i} \beta))\]

and

\[H_{s} = \dfrac{\partial^2}{\partial \beta_{s}^2} l(\beta) = X'S_{s}X\]

where \(S = \text{diag} \left( (Y_i - \hat{p}_i)^{2} \right)\), \(G_{s}\) is the gradient at iteration \(s\), and \(G(\cdot)\) denotes the link function (statinf.ml.activations.sigmoid() for logit or the Gaussian c.d.f. for probit).

References

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.
McCullagh, P. (2018). Generalized linear models. Routledge.
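As a rough illustration of the Newton-Raphson update for the logit case, here is a minimal NumPy sketch. It is not statinf's code. It makes two assumptions: the Hessian is the textbook exact form \(-X' \text{diag}(\hat{p}_i(1 - \hat{p}_i)) X\), which differs from the \(S\) written in the formulae above, and the loop stops once the log-likelihood gain falls below improvement_threshold. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def newton_raphson_logit(X, y, maxit=15, improvement_threshold=0.0005):
    """Illustrative Newton-Raphson loop for a logit model."""
    beta = np.zeros(X.shape[1])            # 'zeros' initialisation
    ll_old = log_likelihood(beta, X, y)
    for it in range(1, maxit + 1):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)               # gradient: sum_i x_i (y_i - p_i)
        S = p * (1 - p)                    # diagonal of the weight matrix
        hess = -(X.T * S) @ X              # exact Hessian: -X' S X (assumption)
        beta = beta - np.linalg.solve(hess, grad)
        ll_new = log_likelihood(beta, X, y)
        # Assumed stopping rule: gain in log-likelihood below the threshold
        if ll_new - ll_old < improvement_threshold:
            return beta, it, True
        ll_old = ll_new
    return beta, maxit, False

# Synthetic data with made-up coefficients, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.0, -0.5])
y = (rng.uniform(size=500) < sigmoid(X @ true_beta)).astype(float)
beta_hat, n_iter, converged = newton_raphson_logit(X, y)
```

Solving the linear system with np.linalg.solve avoids forming \(H^{-1}\) explicitly, which is usually the more stable choice.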

partial_effects(variables, new_data=None, average=False)

Computes Partial and Average Partial Effects (APE).

Parameters

variables (list) – List of variables for which to compute the PE/APE.
new_data (pandas.DataFrame, optional) – Data to use for computations, defaults to None (uses training set).
average (bool, optional) – Whether to compute Average Partial Effects or not, defaults to False.

Formula

\[PE(X_{i}) = \beta_{i} \dfrac{e^{-\beta X}}{(e^{-\beta X} + 1)^{2}}\]

If average = True, \(APE = \bar{PE}\), i.e. the mean of the partial effects over all observations.

Raises: TypeError – If argument variables is neither str nor list.
Returns: Dictionary including Partial Effects (PE) or Average Partial Effects (APE).
Return type: dict
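The partial-effect formula for the logit case can be sketched with NumPy as shown here. The function name and the array-based return value are assumptions made for illustration; the method documented above returns a dict.

```python
import numpy as np

def logit_partial_effects(beta, X, average=False):
    """PE_ij = beta_j * exp(-x_i beta) / (1 + exp(-x_i beta))**2 for each row i."""
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    e = np.exp(-(X @ beta))                  # shape (n,)
    density = e / (1.0 + e) ** 2             # logistic density at x_i beta
    pe = density[:, None] * beta[None, :]    # shape (n, k)
    # Averaging over observations gives the APE
    return pe.mean(axis=0) if average else pe

# Made-up coefficients and observations
beta = np.array([0.5, -1.0])
X = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])
pe = logit_partial_effects(beta, X)                 # one row per observation
ape = logit_partial_effects(beta, X, average=True)  # one value per variable
```

For the first row, \(\beta X = 0\), so the density term equals 1/4 and the partial effects are simply \(\beta / 4\).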

predict(new_data, return_proba=False)

Predicted \(\hat{Y}\) values for a new dataset.

Parameters

new_data (pandas.DataFrame) – New data to evaluate, as a pandas DataFrame.
return_proba (bool, optional) – Whether to return probabilities or binary values, defaults to False.

Formula

\[f(X) = \dfrac{1}{e^{-\beta X} + 1}\]

Returns

Predictions \(\hat{Y}\)

Return type

numpy.ndarray
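A minimal sketch of the prediction formula for the logit case follows. The function name is hypothetical, and the 0.5 cut-off used to turn probabilities into binary values is an assumption; this page does not state which threshold the library applies.

```python
import numpy as np

def predict_logit(beta, X, return_proba=False, threshold=0.5):
    """Evaluate f(X) = 1 / (exp(-beta X) + 1), optionally thresholded."""
    z = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    proba = 1.0 / (np.exp(-z) + 1.0)
    if return_proba:
        return proba
    # Assumed rule: class 1 when the probability reaches the threshold
    return (proba >= threshold).astype(int)

# Made-up coefficients and observations
beta = [1.0, -2.0]
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
proba = predict_logit(beta, X, return_proba=True)
labels = predict_logit(beta, X)
```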

r_squared()

McFadden’s pseudo-\(R^{2}\) – Goodness of fit

Formula: \[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]
Returns: Goodness of fit.
Return type: float
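As a quick arithmetic check, the formula can be applied to the two log-likelihood values printed in the summary output of the Example section on this page (-227.62 and -692.45). The helper name is hypothetical.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R²: 1 - LL(beta_hat) / LL(y_bar)."""
    return 1 - ll_model / ll_null

r2 = mcfadden_r2(-227.62, -692.45)
print(round(r2, 5))  # 0.67128
```

The result agrees with the McFadden's R² value reported in that summary.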

summary(return_df=False)

Statistical summary for the GLM model.

Parameters

return_df (bool, optional) – Whether to return the summary as a pandas DataFrame instead of a string, defaults to False.

Formulae

McFadden’s \(R^{2}\):

\[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]

where \(LL\) represents the log-likelihood.

References

Student. (1908). The probable error of a mean. Biometrika, 1-25.
Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.
Agresti, A. (2003). Categorical data analysis (Vol. 482). John Wiley & Sons.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

Returns

Model’s summary.

Return type

pandas.DataFrame or str

variance(cov_type='nonrobust')

Computes the covariance matrix for the fitted model.

Parameters

cov_type (str, optional) – Type of the covariance matrix, defaults to nonrobust.

Formulae

Non-robust covariance matrix:

\[\sigma_{\beta} = {(X'SX)}^{-1}\]

Sandwich covariance matrix:

\[\sigma_{\beta} = {(X'X)}^{-1} (X'\hat{S}X) {(X'X)}^{-1}\]

with \(S = \text{diag}(\hat{p}_i(1 - \hat{p}_i))\) and \(\hat{S} = \text{diag} \left( (Y_i - \hat{p}_i)^{2} \right)\)

Note

Only the non-robust covariance matrix is currently available. The sandwich estimate and \(HC0\), \(HC1\), \(HC2\), \(HC3\) will soon be implemented.

Returns: Covariance matrix of the estimated coefficients (the inverse of the Fisher information matrix).
Return type: numpy.ndarray
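The non-robust formula \({(X'SX)}^{-1}\) can be illustrated with NumPy for the logit case. The function name is hypothetical, and the coefficients below are placeholders rather than fitted values; they are chosen so the result is easy to verify by hand.

```python
import numpy as np

def nonrobust_covariance(beta, X):
    """(X' S X)^{-1} with S = diag(p_i (1 - p_i)) for a logit model."""
    X = np.asarray(X, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(X @ np.asarray(beta, dtype=float))))
    S = p * (1.0 - p)                     # diagonal of S
    return np.linalg.inv((X.T * S) @ X)

# Placeholder design matrix and coefficients (not a fitted model)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
beta = np.array([0.0, 0.0])
cov = nonrobust_covariance(beta, X)
# Standard errors are the square roots of the diagonal entries
std_errors = np.sqrt(np.diag(cov))
```

With \(\beta = 0\), every \(\hat{p}_i\) equals 0.5, so \(S = 0.25\,I\) and, since \(X'X = 3I\) here, the covariance reduces to \(\tfrac{4}{3} I\).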

Example

from statinf.regressions import GLM

# We set the Logit formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4"
# `data` is a pandas.DataFrame containing the columns Y, X0, X1, X2, X3 and X4
# We fit the GLM with the data, the formula and without intercept
logit = GLM(formula, data, fit_intercept=False, family='binomial')
logit.fit(cov_type='nonrobust', plot=False)

logit.summary()

Output will be:

==================================================================================
|                                  Logit summary                                 |
==================================================================================
| McFadden's R²   =          0.67128 | McFadden's R² Adj.  =              0.6424 |
| Log-Likelihood  =          -227.62 | Null Log-Likelihood =             -692.45 |
| LR test p-value =              0.0 | Covariance          =           nonrobust |
| n               =              999 | p                   =                  5  |
| Iterations      =                8 | Convergence         =                True |
==================================================================================
| Variables         | Coefficients   | Std. Errors  | t-values   | Probabilities |
==================================================================================
| X0                |       -1.13024 |      0.10888 |    -10.381 |     0.0   *** |
| X1                |        0.02963 |      0.07992 |      0.371 |   0.711       |
| X2                |       -1.40968 |       0.1261 |    -11.179 |     0.0   *** |
| X3                |         0.5253 |      0.08966 |      5.859 |     0.0   *** |
| X4                |        0.14705 |      0.25018 |      0.588 |   0.557       |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================
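The example above assumes that a `data` DataFrame already exists. One possible way to build a synthetic one with NumPy and pandas is sketched below. The coefficients, seed and sample size are arbitrary choices for illustration, so fitting the model on this data will not reproduce the figures in the summary shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 999
columns = ["X0", "X1", "X2", "X3", "X4"]

# Independent standard normal regressors
X = rng.normal(size=(n, len(columns)))

# Hypothetical coefficients, chosen only so that Y depends on the regressors
beta = np.array([-1.0, 0.0, -1.5, 0.5, 0.0])
proba = 1.0 / (1.0 + np.exp(-(X @ beta)))

data = pd.DataFrame(X, columns=columns)
# Draw a binary outcome from the logistic probabilities
data["Y"] = (rng.uniform(size=n) < proba).astype(int)
```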