# Statistical tests

statinf.stats.tests.dispersion_test(x, alpha=0.05, two_sided=True, return_tuple=False)[source]

Dispersion test for count data.

In the two-sided setup, we aim at testing:

$H_{0}: \mathbb{E}(X) = \mathbb{V}(X) \text{ against } H_{1}: \mathbb{E}(X) \neq \mathbb{V}(X)$

The p-value is computed as:

$\mathbb{P}(T \leq t \mid H_{0} \text{ holds})$

with, under $$H_{0}$$:

$t = \dfrac{\dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 - \bar{X}}{\sqrt{\dfrac{2}{n-1}} \bar{X}} \sim \chi^2_{n-1}$
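The statistic above can be computed by hand with NumPy. A minimal sketch (the function name `dispersion_stat` is illustrative, not part of the statinf API), which reproduces the T-value shown in the example below:

```python
import numpy as np

def dispersion_stat(x):
    """Dispersion test statistic: (s^2 - x_bar) / (sqrt(2/(n-1)) * x_bar).
    Hand-rolled sketch of the formula, not the statinf implementation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x_bar = x.mean()
    s2 = x.var(ddof=1)  # unbiased sample variance, 1/(n-1) * sum((x_i - x_bar)^2)
    return (s2 - x_bar) / (np.sqrt(2.0 / (n - 1)) * x_bar)

t = dispersion_stat([36, 31, 39, 25, 40, 36, 30, 36, 27, 35, 28, 37, 39, 39, 30])
# t ≈ -0.74998
```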

Parameters
• x (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• alpha (float, optional) – Confidence level, defaults to 0.05.

• two_sided (bool, optional) – Perform a two-sided test, defaults to True.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> stats.dispersion_test([36, 31, 39, 25, 40, 36, 30, 36, 27, 35, 28, 37, 39, 39, 30])
... +------------------------------------------------------------+
... |                      Dispersion test                       |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value |    T-value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |         15 |   26.118948045 | -0.7499767 |     0.0 | False |
... +------------+----------------+------------+---------+-------+
...  * We reject H0, hence E(X) != V(X)
...  * E(X) = 33.87
...  * V(X) = 22.65

Reference
• Böhning, D. (1994). A note on a test for Poisson overdispersion. Biometrika, 81(2), 418-419.

• de Oliveira, J. T. (1965). Some elementary tests of mixtures of discrete distributions. Classical and contagious discrete distributions, 379-384.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.kstest(x1, x2='normal', alpha=0.05, return_tuple=False, **kwargs)[source]

Kolmogorov-Smirnov test for one or two samples.

We test whether two samples come from the same distribution, $$H_{0}: F(x) = G(x)$$, against $$H_{1}: F(x)$$ follows a distribution different from $$G(x)$$.

The statistic is:

$D_{mn} = \sup_{-\infty < x < + \infty} |F_{n}(x) - G_{m}(x)|$

and the critical value is given by:

$c = \mathbb{P} \left[ \left( \dfrac{mn}{n + m} \right)^{1/2} D_{mn} < K_{\alpha} \right]$

where $$K_{\alpha}$$ represents the quantile of order $$\alpha$$ from a Kolmogorov distribution.

This test is an alternative to the $$\chi^{2}$$-test for comparing two distributions. By comparing an unknown empirical distribution $$F_{n}$$ with a distribution $$G_{n}$$ drawn from a known family, we can assess whether $$F_{n}$$ follows the same distribution as $$G_{n}$$. For instance, when comparing $$F_{n}$$ (unknown) to $$G_{n} \sim \mathcal{N}(\mu, \sigma^{2})$$, not rejecting $$H_{0}$$ means that we cannot reject the hypothesis that $$F_{n} \sim \mathcal{N}(\mu, \sigma^{2})$$.
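The statistic $$D_{mn}$$ can be computed by evaluating both empirical CDFs on the pooled sample and taking the largest gap. A didactic sketch (`ks_stat` is an illustrative name, not the statinf API):

```python
import numpy as np

def ks_stat(x1, x2):
    """Two-sample Kolmogorov-Smirnov statistic D_mn = sup |F_n(x) - G_m(x)|.
    Hand-rolled sketch, not the statinf implementation."""
    x1, x2 = np.sort(x1), np.sort(x2)
    pooled = np.concatenate([x1, x2])
    # Empirical CDF of each sample evaluated at every pooled observation
    f = np.searchsorted(x1, pooled, side='right') / len(x1)
    g = np.searchsorted(x2, pooled, side='right') / len(x2)
    return np.abs(f - g).max()
```

The supremum over all real x is attained at one of the sample points, so evaluating on the pooled sample is sufficient.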

Parameters
• x1 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• x2 (str, optional) – Sample to be compared. Can be an external empirical sample in the same format as x1 or the name of a cdf which can be ‘normal’, ‘beta’, ‘gamma’, ‘poisson’, ‘chisquare’, ‘exponential’, defaults to ‘normal’.

• alpha (float, optional) – Confidence level, defaults to 0.05.

• return_tuple (bool, optional) – Return a tuple with K statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> import numpy as np
>>> stats.kstest(np.random.normal(size=100))
... +------------------------------------------------------------+
... |                   Kolmogorov-Smirnov test                  |
... +------------+----------------+------------+---------+-------+
... | D value    | Critical value | K-value    | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |       0.09 |  1.35809863932 |  0.6363961 | 0.81275 | True  |
... +------------+----------------+------------+---------+-------+
...  * We cannot reject the hypothesis H0: F(x) ~ normal
...  * Confidence level is 95.0%, we need p > 0.05

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.

• Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov’s distribution. Journal of Statistical Software, 8(18), 1-4.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.ttest(x, mu=0, alpha=0.05, is_bernoulli=False, two_sided=True, return_tuple=False)[source]

One sample Student’s test.

In the two-sided setup, we aim at testing:

$H_{0}: \bar{X} = \mu \text{ against } H_{1}: \bar{X} \neq \mu$

The one-sided setup tests $$H_{0}: \bar{X} \leq \mu$$ against $$H_{1}: \bar{X} > \mu$$.

The p-value is computed as:

$\mathbb{P}(|Z| \geq |t| \mid H_{0} \text{ holds})$

with, under $$H_{0}$$:

$t = \dfrac{\bar{X}_{1} - \mu }{ \dfrac{s}{\sqrt{n}} } \sim \mathcal{N}(0, 1)$

if $$s^{2} = \mathbb{V}(\mathbf{X})$$ is known or if $$n \gg 30$$, otherwise $$t \sim \mathcal{T}_{n - 1}$$, a Student distribution with $$n-1$$ degrees of freedom. In the case of a Bernoulli distribution, $$s^{2} = \mu(1 - \mu)$$.
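The statistic reduces to a few lines of NumPy. A minimal sketch (the name `one_sample_t` is illustrative, not the statinf API), which reproduces the T-value in the example below:

```python
import numpy as np

def one_sample_t(x, mu=0.0):
    """One-sample t statistic t = (x_bar - mu) / (s / sqrt(n)).
    Hand-rolled sketch of the formula, not statinf's implementation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)  # sample standard deviation
    return (x.mean() - mu) / (s / np.sqrt(n))

t = one_sample_t([30.02, 29.99, 30.11, 29.97, 30.01, 29.99], mu=30)
# t ≈ 0.7393
```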

Parameters
• x (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• mu (int, optional) – Theoretical mean to be evaluated in the null hypothesis, defaults to 0.

• alpha (float, optional) – Confidence level, defaults to 0.05.

• is_bernoulli (bool, optional) – Input value follows a Bernoulli distribution, i.e. $$\mathbf{X} \sim \mathcal{B}(p)$$ with $$p \in [0, 1]$$, defaults to False.

• two_sided (bool, optional) – Perform a two-sided test, defaults to True.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> stats.ttest([30.02, 29.99, 30.11, 29.97, 30.01, 29.99], mu=30)
... +------------------------------------------------------------+
... |                   One Sample Student test                  |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value |    T-value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |          5 |   2.5705818366 |  0.7392961 | 0.49295 | True  |
... +------------+----------------+------------+---------+-------+
... * Confidence level is 95.0%, we need p > 0.025 for two-sided test
... * We cannot reject the hypothesis H0: X_bar = 30

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Student. (1908). The probable error of a mean. Biometrika, 1-25.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.ttest_2samp(x1, x2, alpha=0.05, paired=False, is_bernoulli=False, two_sided=True, return_tuple=False)[source]

Two samples Student’s test.

If the samples are independent ($$X_{1} \bot X_{2}$$), we test:

$H_{0}: \bar{X}_{1} = \bar{X}_{2} \text{ against } H_{1}: \bar{X}_{1} \neq \bar{X}_{2}$

The test statistic is:

$t = \dfrac{\bar{X}_{1} - \bar{X}_{2} }{ \sqrt{ \dfrac{(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}}{n_{1} + n_{2} - 2} } \sqrt{\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}} }$

where:

• $$t \sim \mathcal{T}_{n_{1} + n_{2} - 2}$$, if $$\mathbf{X} \sim \mathcal{N}(\mu_{X}, \sigma^{2}_{X})$$ and $$\mathbf{Y} \sim \mathcal{N}(\mu_{Y}, \sigma^{2}_{Y})$$

• $$t \sim \mathcal{N}(0, 1)$$, if $$n_{1} \gg 30$$ and $$n_{2} \gg 30$$
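The pooled-variance statistic above can be sketched directly in NumPy (`two_sample_t` is an illustrative name; statinf may select a different variance treatment internally, as the example output below notes for unequal variances):

```python
import numpy as np

def two_sample_t(x1, x2):
    """Two-sample t statistic with pooled variance:
    t = (x1_bar - x2_bar) / (s_p * sqrt(1/n1 + 1/n2)),
    s_p^2 = ((n1-1) s1^2 + (n2-1) s2^2) / (n1 + n2 - 2).
    Hand-rolled sketch, not statinf's implementation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s1, s2 = x1.var(ddof=1), x2.var(ddof=1)
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / (np.sqrt(pooled) * np.sqrt(1 / n1 + 1 / n2))
```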

If samples are paired:

$H_{0}: \bar{X}_{1} = \bar{X}_{2} \Leftrightarrow \bar{X}_{1} - \bar{X}_{2} = 0 \Leftrightarrow \bar{X}_{D} = 0$

We then perform a one-sample test on the difference vector $$X_{D} = X_{1} - X_{2}$$, testing whether its mean is equal to 0.
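The reduction to a one-sample test can be sketched as follows (`paired_t` is an illustrative name, not the statinf API):

```python
import numpy as np

def paired_t(x1, x2):
    """Paired two-sample test as a one-sample t test on the
    differences x_d = x1 - x2 against mu = 0. Sketch only."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    n = len(d)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))
```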

Parameters
• x1 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• x2 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• alpha (float, optional) – Confidence level, defaults to 0.05.

• paired (bool, optional) – Performs one sample test of the difference if samples are paired, defaults to False.

• is_bernoulli (bool, optional) – Input variables follow a Bernoulli distribution, i.e. $$\mathbf{X} \sim \mathcal{B}(p)$$ with $$p \in [0, 1]$$, defaults to False.

• two_sided (bool, optional) – Perform a two-sided test, defaults to True.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> a = [30.02, 29.99, 30.11, 29.97, 30.01, 29.99]
>>> b = [29.89, 29.93, 29.72, 29.98, 30.02, 29.98]
>>> stats.ttest_2samp(a, b)
... +------------------------------------------------------------+
... |                   Two Samples Student test                 |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value |    T-value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |         10 |  -1.8124611228 |  1.1310325 | 0.28444 | True  |
... +------------+----------------+------------+---------+-------+
...  * Confidence level is 95.0%, we need p > 0.05
...  * We cannot reject the hypothesis H0: X1 = X2
...  * Samples with unequal variances
...  * Same sample sizes

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Student. (1908). The probable error of a mean. Biometrika, 1-25.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.wilcoxon(x, y=None, alpha=0.05, alternative='two-sided', mode='auto', zero_method='wilcox', return_tuple=False)[source]

Wilcoxon signed-rank test.

Parameters
• x (numpy.array) – First sample to compare. If y is not provided, will correspond to the difference $$x - y$$.

• y (numpy.array, optional) – Second sample to compare, defaults to None.

• alpha (float, optional) – Confidence level, defaults to 0.05.

• alternative (str, optional) – Perform a one or two-sided test. Values can be two-sided, greater, less, defaults to ‘two-sided’.

• mode (str, optional) – Method to calculate the p-value. Computes the exact distribution if the sample size is less than 25, otherwise uses a normal approximation. Values can be auto, approx or exact, defaults to ‘auto’.

• zero_method (str, optional) – Method to handle the zero differences, defaults to ‘wilcox’.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.
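The underlying signed-rank statistic sums the ranks of the absolute differences that are positive, after discarding zero differences under the ‘wilcox’ zero_method. A didactic sketch without tie handling (`signed_rank_stat` is an illustrative name, not the statinf API):

```python
import numpy as np

def signed_rank_stat(d):
    """Wilcoxon signed-rank statistic: drop zero differences ('wilcox'
    zero_method), rank |d| from 1 to n, and sum the ranks of the
    positive differences. Sketch only; ties in |d| are not averaged."""
    d = np.asarray(d, dtype=float)
    d = d[d != 0]                            # 'wilcox': discard zero differences
    order = np.abs(d).argsort()
    ranks = np.empty(len(d))
    ranks[order] = np.arange(1, len(d) + 1)  # rank 1 = smallest |d|
    return ranks[d > 0].sum()
```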

Example

>>> from statinf import stats
>>> import numpy as np
>>> x = np.random.poisson(2, size=100)
>>> y = x + np.random.normal(loc=0, scale=1, size=100)
>>> stats.wilcoxon(x, y)
... +------------------------------------------------------------+
... |                       Wilcoxon test                        |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value | Stat value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |        100 |   1.9599639845 |  -1.316878 | 0.18788 | True  |
... +------------+----------------+------------+---------+-------+
...  * We cannot reject H0: x - y ~ symmetric distribution centered in 0
...  * The T-value is: 2142.0

Reference
• Wilcoxon, F., Individual Comparisons by Ranking Methods, Biometrics Bulletin, Vol. 1, 1945, pp. 80-83.

• Cureton, E.E., The Normal Approximation to the Signed-Rank Sampling Distribution When Zero Differences are Present, Journal of the American Statistical Association, Vol. 62, 1967, pp. 1068-1069.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple