Generalized Linear Models

GLM

class statinf.regressions.glm.GLM(formula, data, family='binomial', fit_intercept=False, initializer='zeros')[source]

Bases: object

Generalized Linear Model fitted with the Newton-Raphson method.

Parameters
  • formula (str) – Regression formula of the form y ~ x1 + x2. See statinf.data.ProcessData.parse_formula().

  • data (pandas.DataFrame) – Input data as a pandas DataFrame.

  • family (str, optional) – Family distribution of the dependent variable, defaults to binomial.

  • fit_intercept (bool, optional) – Used for adding intercept in the regression formula, defaults to False.

  • initializer (str, optional) – Method for initializing the first parameters (see statinf.ml.initializations()), defaults to ‘zeros’.

Note

The module supports the Binomial (for Logit) and Gaussian (for Probit) distributions. It will soon be extended to other distributions.

adjusted_r_squared()[source]

McFadden’s adjusted pseudo-\(R^{2}\) – Adjusted goodness of fit

Formula
\[R^{2}_{adj} = 1 - \dfrac{LL(\hat{\beta}) - p}{LL(\bar{Y})}\]
Returns

Adjusted goodness of fit.

Return type

float
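As a minimal sketch of the formula above, the adjusted pseudo-\(R^{2}\) can be computed from two log-likelihoods and the number of parameters \(p\). The function name and the numeric inputs below are illustrative and are not part of statinf.

```python
def mcfadden_adjusted_r2(ll_model, ll_null, p):
    """Adjusted McFadden pseudo-R2: 1 - (LL(beta_hat) - p) / LL(Y_bar)."""
    # Log-likelihoods are negative, so subtracting p makes the numerator
    # more negative and lowers the score for each extra parameter.
    return 1.0 - (ll_model - p) / ll_null


# Hypothetical values chosen for easy arithmetic
adj = mcfadden_adjusted_r2(ll_model=-100.0, ll_null=-200.0, p=4)
print(round(adj, 4))  # 0.48
```

Without the penalty the same inputs would give 0.5, so each added parameter costs 1 / |LL(Y_bar)| in this example.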

fit(maxit=15, cov_type='nonrobust', improvement_threshold=0.0005, keep_hist=True, plot=False)[source]

Fits the GLM regression model using the Newton-Raphson method.

Parameters
  • maxit (int, optional) – Maximum number of iterations, defaults to 15.

  • cov_type (str, optional) – Type of the covariance matrix (non-robust or sandwich), defaults to nonrobust.

  • improvement_threshold (float, optional) – Minimum increase in log-likelihood between two iterations that counts as an improvement, defaults to 0.0005.

  • keep_hist (bool, optional) – Keeps the training history (gradients, Hessian, etc.); can be turned off to save memory, defaults to True.

  • plot (bool, optional) – Plots evolution of log-likelihood through the different iterations (requires keep_hist = True), defaults to False.

Formulae
  • Log-likelihood:

\[l(\beta) = \sum_{i=1}^{N} y_{i} \log \left[ G(\mathbf{x_i} \beta) \right] + (1 - y_{i}) \log \left[1 - G(\mathbf{x_i} \beta) \right]\]
  • Newton’s method:

\[\hat{\beta}_{s+1} = \hat{\beta}_{s} - H^{-1}_{s} G_{s}\]

with

\[G_{s} = \dfrac{\partial}{\partial \beta_{s}} l(\beta) = \sum_{i=1}^{N} x_i (y_{i} - G(x_{i} \beta))\]

and

\[H_{s} = \dfrac{\partial^2}{\partial \beta_{s}^2} l(\beta) = -X'S_{s}X\]

where \(S = \text{diag} \left( \hat{p}_i (1 - \hat{p}_i) \right)\) with \(\hat{p}_i = G(x_{i} \beta)\), and \(G\) without a subscript denotes the link function (statinf.ml.activations.sigmoid() for logit or Gaussian c.d.f. for probit), as opposed to the gradient \(G_{s}\).
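The update rule can be sketched in NumPy for the logit case. This is an illustration, not the library's code: it takes the Hessian to be the textbook logit form \(-X'SX\) with \(S = \text{diag}(p_i(1 - p_i))\), starts from a zero vector, and stops when the log-likelihood gain falls below a threshold. The function names and the simulated data are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logit_newton(X, y, maxit=15, improvement_threshold=0.0005):
    """Newton-Raphson for a logit model, starting from a zero vector."""
    beta = np.zeros(X.shape[1])
    prev_ll = -np.inf
    for _ in range(maxit):
        p = sigmoid(X @ beta)
        ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        # Stop once the log-likelihood no longer improves by the threshold
        if ll - prev_ll < improvement_threshold:
            break
        prev_ll = ll
        gradient = X.T @ (y - p)                # sum_i x_i (y_i - p_i)
        hessian = -(X.T * (p * (1 - p))) @ X    # -X'SX, S = diag(p (1 - p))
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta


# Simulated data with known (hypothetical) coefficients, no intercept
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.0, -2.0])
y = (rng.uniform(size=500) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logit_newton(X, y)
print(beta_hat)
```

Solving the linear system with `np.linalg.solve` avoids forming \(H^{-1}\) explicitly, which is usually the more stable choice numerically.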

References
  • Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data.

  • Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.

  • McCullagh, P. (2018). Generalized linear models. Routledge.

partial_effects(variables, new_data=None, average=False)[source]

Computes Partial and Average Partial Effects (APE).

Parameters
  • variables (str or list) – Variable name or list of variables for which to compute the PE/APE.

  • new_data (pandas.DataFrame, optional) – Data to use for the computation, defaults to None (uses the training set).

  • average (bool, optional) – Whether to compute Average Partial Effects or not, defaults to False.

Formula
\[PE(X_{i}) = \beta_{i} \dfrac{e^{-\beta X}}{(e^{-\beta X} + 1)^{2}}\]

If average = True, \(APE = \overline{PE}\), the mean of the partial effects over the observations.

Raises

TypeError – If argument variables is neither str nor list.

Returns

Dictionary including Partial Effects (PE) or Average Partial Effects (APE).

Return type

dict
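The partial-effect formula above can be checked by hand with a short sketch. The helper name, the coefficients and the rows are hypothetical; the real method takes variable names and a DataFrame rather than indices and lists.

```python
import math


def logit_partial_effects(beta, rows, variable_index, average=False):
    """Partial effect of one regressor at each row, or its mean (APE)."""
    effects = []
    for x in rows:
        xb = sum(b * v for b, v in zip(beta, x))
        # e^{-xb} / (e^{-xb} + 1)^2 is the derivative of the sigmoid at xb
        density = math.exp(-xb) / (math.exp(-xb) + 1.0) ** 2
        effects.append(beta[variable_index] * density)
    if average:
        return sum(effects) / len(effects)
    return effects


beta = [0.5, -1.0]                 # hypothetical fitted coefficients
rows = [[0.0, 0.0], [2.0, 1.0]]    # both rows give x'beta = 0
print(logit_partial_effects(beta, rows, variable_index=0))                # [0.125, 0.125]
print(logit_partial_effects(beta, rows, variable_index=1, average=True))  # -0.25
```

At \(\beta X = 0\) the sigmoid slope reaches its maximum of 0.25, so each partial effect here is a quarter of the corresponding coefficient.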

predict(new_data, return_proba=False)[source]

Predicted \(\hat{Y}\) values for a new dataset.

Parameters
  • new_data (pandas.DataFrame) – New data to evaluate with pandas data-frame format.

  • return_proba (bool, optional) – Whether to return probabilities or binary values, defaults to False.

Formula
\[f(X) = \dfrac{1}{e^{-\beta X} + 1}\]
Returns

Predictions \(\hat{Y}\)

Return type

numpy.ndarray
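A minimal sketch of the prediction formula for the logit case follows. The 0.5 cut-off with a strict comparison is an assumption made for illustration, as are the helper name and the inputs; this sketch returns plain lists rather than a numpy.ndarray.

```python
import math


def predict_logit(beta, rows, return_proba=False, threshold=0.5):
    """Sigmoid of x'beta per row; optionally cut at a threshold."""
    probabilities = [
        1.0 / (math.exp(-sum(b * v for b, v in zip(beta, x))) + 1.0)
        for x in rows
    ]
    if return_proba:
        return probabilities
    # Assumed rule: class 1 only when the probability exceeds the threshold
    return [1 if p > threshold else 0 for p in probabilities]


beta = [2.0, -1.0]                                  # hypothetical coefficients
rows = [[1.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
print(predict_logit(beta, rows))                         # [1, 0, 0]
print(predict_logit(beta, rows, return_proba=True)[2])   # 0.5
```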

r_squared()[source]

McFadden’s pseudo-\(R^{2}\) – Goodness of fit

Formula
\[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]
Returns

Goodness of fit.

Return type

float
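The sketch below evaluates this ratio by hand. It assumes that \(LL(\bar{Y})\) is the Bernoulli log-likelihood of a model predicting the sample mean for every observation; that reading, and both helper names, are assumptions rather than documented behaviour.

```python
import math


def null_log_likelihood(y):
    """Log-likelihood when every row is predicted with the sample mean.

    Requires both classes to be present, so that 0 < mean(y) < 1.
    """
    n = len(y)
    y_bar = sum(y) / n
    return n * (y_bar * math.log(y_bar) + (1 - y_bar) * math.log(1 - y_bar))


def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R2: 1 - LL(beta_hat) / LL(Y_bar)."""
    return 1.0 - ll_model / ll_null


# Balanced toy outcome: the null log-likelihood is n * log(0.5)
y = [0, 1] * 4
print(round(null_log_likelihood(y), 4))  # -5.5452

# Log-likelihoods taken from the example summary at the end of this page
print(round(mcfadden_r2(-227.62, -692.45), 5))  # 0.67128
```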

summary(return_df=False)[source]

Statistical summary for the GLM model.

Parameters

return_df (bool, optional) – If True, returns the summary as a pandas DataFrame; otherwise returns a string. Defaults to False.

Formulae
  • McFadden’s \(R^{2}\):

\[R^{2} = 1 - \dfrac{LL(\hat{\beta})}{LL(\bar{Y})}\]

where \(LL\) represents the log-likelihood.

References
  • Student. (1908). The probable error of a mean. Biometrika, 1-25.

  • Wooldridge, J. M. (2016). Introductory econometrics: A modern approach. Nelson Education.

  • Agresti, A. (2003). Categorical data analysis (Vol. 482). John Wiley & Sons.

  • Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

Returns

Model’s summary.

Return type

pandas.DataFrame or str

variance(cov_type='nonrobust')[source]

Compute the covariance matrix for the fitted model.

Parameters

cov_type (str, optional) – Type of the covariance matrix, defaults to nonrobust.

Formulae
  • Non-robust covariance matrix:

\[\sigma_{\beta} = {(X'SX)}^{-1}\]
  • Sandwich covariance matrix:

\[\sigma_{\beta} = {(X'X)}^{-1} (X'\hat{S}X) {(X'X)}^{-1}\]

with \(S = \text{diag}(\hat{p}_i(1 - \hat{p}_i))\) and \(\hat{S} = \text{diag} \left( (Y_i - \hat{p}_i)^{2} \right)\)

Note

Only the non-robust covariance matrix is currently available. The sandwich estimate and \(HC0\), \(HC1\), \(HC2\), \(HC3\) will soon be implemented.

Returns

Covariance matrix of the estimated coefficients (the inverse of the Fisher information matrix in the non-robust case).

Return type

numpy.ndarray
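The non-robust formula can be sketched in NumPy for a logit model. The helper name and the toy design matrix are illustrative, chosen so that the result can be verified by hand.

```python
import numpy as np


def nonrobust_covariance(X, beta):
    """(X'SX)^{-1} with S = diag(p_hat (1 - p_hat)) for a logit model."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    weights = p_hat * (1.0 - p_hat)
    # X.T * weights scales each observation by its weight, giving X'S
    return np.linalg.inv((X.T * weights) @ X)


# Two orthogonal regressors and beta = 0, so every weight equals 0.25
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
beta = np.zeros(2)

cov = nonrobust_covariance(X, beta)
std_errors = np.sqrt(np.diag(cov))
print(cov)         # identity matrix: X'SX = 0.25 * X'X = 0.25 * 4 * I
print(std_errors)  # [1. 1.]
```

Standard errors are the square roots of the diagonal of this matrix.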

Example

from statinf.regressions import GLM

# `data` is assumed to be a pandas.DataFrame with columns Y, X0, X1, X2, X3 and X4
# Logit formula
formula = "Y ~ X0 + X1 + X2 + X3 + X4"
# Fit the GLM on the data with this formula, without an intercept
logit = GLM(formula, data, fit_intercept=False, family='binomial')
logit.fit(cov_type='nonrobust', plot=False)

logit.summary()

Output will be:

==================================================================================
|                                  Logit summary                                 |
==================================================================================
| McFadden's R²   =          0.67128 | McFadden's R² Adj.  =              0.6424 |
| Log-Likelihood  =          -227.62 | Null Log-Likelihood =             -692.45 |
| LR test p-value =              0.0 | Covariance          =           nonrobust |
| n               =              999 | p                   =                  5  |
| Iterations      =                8 | Convergence         =                True |
==================================================================================
| Variables         | Coefficients   | Std. Errors  | t-values   | Probabilities |
==================================================================================
| X0                |       -1.13024 |      0.10888 |    -10.381 |     0.0   *** |
| X1                |        0.02963 |      0.07992 |      0.371 |   0.711       |
| X2                |       -1.40968 |       0.1261 |    -11.179 |     0.0   *** |
| X3                |         0.5253 |      0.08966 |      5.859 |     0.0   *** |
| X4                |        0.14705 |      0.25018 |      0.588 |   0.557       |
==================================================================================
| Significance codes: 0. < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < '' < 1 |
==================================================================================
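The last three columns of the table can be reproduced from the first two. The sketch below assumes a standard normal approximation for the two-sided p-value, which is practically indistinguishable from a Student distribution at n = 999; the library may use a different reference distribution. The helper names are hypothetical.

```python
import math


def t_value(coefficient, std_error):
    return coefficient / std_error


def two_sided_p_value(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def significance_code(p):
    """Star codes following the legend printed under the summary table."""
    for cutoff, code in [(0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")]:
        if p < cutoff:
            return code
    return ""


# Coefficients and standard errors copied from the summary above
for name, coef, se in [("X0", -1.13024, 0.10888), ("X1", 0.02963, 0.07992)]:
    t = t_value(coef, se)
    p = two_sided_p_value(t)
    print(name, round(t, 3), round(p, 3), significance_code(p))
# X0 -10.381 0.0 ***
# X1 0.371 0.711
```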