# Statistical tests¶

statinf.stats.tests.kstest(x1, x2='normal', alpha=0.05, return_tuple=False, **kwargs)[source]

Kolmogorov-Smirnov goodness-of-fit test for one or two samples.

We test whether two samples come from the same distribution, $$H_{0}: F(x) = G(x)$$, against $$H_{1}:$$ $$F(x)$$ follows a distribution different from $$G(x)$$.

The statistic is:

$D_{mn} = \sup_{-\infty < x < + \infty} |F_{n}(x) - G_{m}(x)|$

and the critical value is given by:

$c = \mathbb{P} \left[ \left( \dfrac{mn}{n + m} \right)^{1/2} D_{mn} < K_{\alpha} \right]$

where $$K_{\alpha}$$ represents the quantile of order $$\alpha$$ from a Kolmogorov distribution.

This test is an alternative to the $$\chi^{2}$$ goodness-of-fit test for comparing two distributions. By comparing an unknown empirical distribution $$F_{n}$$ with a sample $$G_{n}$$ drawn from a known distribution, we can assess whether $$F_{n}$$ follows that distribution. For instance, when comparing $$F_{n}$$ (unknown) to $$G_{n} \sim \mathcal{N}(\mu, \sigma^{2})$$, not rejecting $$H_{0}$$ means we cannot reject the hypothesis that $$F_{n} \sim \mathcal{N}(\mu, \sigma^{2})$$.
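The statistic $$D_{mn}$$ can be sketched directly with NumPy; the cross-check against scipy.stats is an assumption for illustration and is not part of statinf:

```python
import numpy as np
from scipy import stats as sps

rng = np.random.default_rng(0)
x = rng.normal(size=200)
n = len(x)
xs = np.sort(x)

# G(x): the reference cdf, here a standard normal
g = sps.norm.cdf(xs)

# D = sup_x |F_n(x) - G(x)|: the ecdf jumps at each order statistic,
# so the supremum is attained just before or just after a jump
d_plus = np.max(np.arange(1, n + 1) / n - g)
d_minus = np.max(g - np.arange(0, n) / n)
d = max(d_plus, d_minus)
```

With one sample against a named cdf, `stats.kstest(x)` plays the same role as this sketch.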

Parameters
• x1 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• x2 (numpy.array or str, optional) – Sample to compare against. Either an empirical sample in the same format as x1, or the name of a cdf among ‘normal’, ‘beta’, ‘gamma’, ‘poisson’, ‘chisquare’, ‘exponential’; defaults to ‘normal’.

• alpha (float, optional) – Significance level of the test, defaults to 0.05.

• return_tuple (bool, optional) – Return a tuple with K statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> import numpy as np
>>> stats.kstest(np.random.normal(size=100))
... +------------------------------------------------------------+
... |                   Kolmogorov-Smirnov test                  |
... +------------+----------------+------------+---------+-------+
... | D value    | Critical value | K-value    | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |       0.09 |  1.35809863932 |  0.6363961 | 0.81275 | True  |
... +------------+----------------+------------+---------+-------+
...  * We cannot reject the hypothesis H0: F(x) ~ normal
...  * Confidence level is 95.0%, we need p > 0.05

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione.

• Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov’s distribution. Journal of Statistical Software, 8(18), 1-4.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.ttest(x, mu=0, alpha=0.05, is_bernoulli=False, two_sided=True, return_tuple=False)[source]

One-sample Student’s t-test.

In the two-sided setup, we aim at testing:

$H_{0}: \bar{X} = \mu \text{ against } H_{1}: \bar{X} \neq \mu$

The one-sided setup tests $$H_{0}: \bar{X} \leq \mu$$ against $$H_{1}: \bar{X} > \mu$$.

The p-value is computed as:

$\mathbb{P}(|Z| \geq |t| \mid H_{0} \text{ holds})$

with, under $$H_{0}$$:

$t = \dfrac{\bar{X}_{1} - \mu }{ \dfrac{s}{\sqrt{n}} } \sim \mathcal{N}(0, 1)$

if the standard deviation $$s$$ is known, with $$s^{2} = \mathbb{V}(\mathbf{X})$$, or if $$n \gg 30$$; otherwise $$t \sim \mathcal{T}_{n - 1}$$, a Student distribution with $$n-1$$ degrees of freedom. In the case of a Bernoulli distribution, $$s^{2} = \mu(1 - \mu)$$.
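As a sketch, the statistic can be reproduced with NumPy on the data from the example below (the value is computed here from the formula, not taken from statinf):

```python
import numpy as np

x = np.array([30.02, 29.99, 30.11, 29.97, 30.01, 29.99])
mu = 30
n = len(x)

# t = (x_bar - mu) / (s / sqrt(n)), with s the sample standard deviation (ddof=1)
s = x.std(ddof=1)
t = (x.mean() - mu) / (s / np.sqrt(n))
# t ≈ 0.7393, the T-value reported in the example below
```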

Parameters
• x (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• mu (int, optional) – Theoretical mean to be evaluated in the null hypothesis, defaults to 0.

• alpha (float, optional) – Significance level of the test, defaults to 0.05.

• is_bernoulli (bool, optional) – Input value follows a Bernoulli distribution, i.e. $$\mathbf{X} \sim \mathcal{B}(p)$$ with $$p \in [0, 1]$$, defaults to False.

• two_sided (bool, optional) – Perform a two-sided test, defaults to True.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> stats.ttest([30.02, 29.99, 30.11, 29.97, 30.01, 29.99], mu=30)
... +------------------------------------------------------------+
... |                   One Sample Student test                  |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value |    T-value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |          5 |   2.5705818366 |  0.7392961 | 0.49295 | True  |
... +------------+----------------+------------+---------+-------+
... * Confidence level is 95.0%, we need p > 0.025 for two-sided test
... * We cannot reject the hypothesis H0: X_bar = 30

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Student. (1908). The probable error of a mean. Biometrika, 1-25.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple

statinf.stats.tests.ttest_2samp(x1, x2, alpha=0.05, paired=False, is_bernoulli=False, two_sided=True, return_tuple=False)[source]

Two-sample Student’s t-test.

If the samples are independent ($$X_{1} \bot X_{2}$$), we test:

$H_{0}: \bar{X}_{1} = \bar{X}_{2} \text{ against } H_{1}: \bar{X}_{1} \neq \bar{X}_{2}$

The test statistic is:

$t = \dfrac{\bar{X}_{1} - \bar{X}_{2} }{ \sqrt{ \dfrac{(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}}{n_{1} + n_{2} - 2} } \sqrt{\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}} }$

where:

• $$t \sim \mathcal{T}_{n_{1} + n_{2} - 2}$$, if $$\mathbf{X} \sim \mathcal{N}(\mu_{X}, \sigma^{2}_{X})$$ and $$\mathbf{Y} \sim \mathcal{N}(\mu_{Y}, \sigma^{2}_{Y})$$

• $$t \sim \mathcal{N}(0, 1)$$, if $$n_{1} \gg 30$$ and $$n_{2} \gg 30$$
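The pooled statistic can be sketched with NumPy on synthetic data; the cross-check against scipy.stats.ttest_ind is an assumption for illustration and is not part of statinf:

```python
import numpy as np
from scipy import stats as sps

rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, size=25)
x2 = rng.normal(0.5, 1.0, size=30)
n1, n2 = len(x1), len(x2)

# Pooled variance: ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

# t statistic under equal variances, compared against T_{n1 + n2 - 2}
t = (x1.mean() - x2.mean()) / (np.sqrt(sp2) * np.sqrt(1 / n1 + 1 / n2))
```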

If samples are paired:

$H_{0}: \bar{X}_{1} = \bar{X}_{2} \Leftrightarrow \bar{X}_{1} - \bar{X}_{2} = 0 \Leftrightarrow \bar{X}_{D} = 0$

We then perform a one-sample test on the difference vector $$X_{D} = X_{1} - X_{2}$$, testing whether its mean equals 0.
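The paired case thus reduces to a one-sample test on the differences; a minimal sketch with scipy.stats (an assumption for illustration, not statinf's API) using the data from the example below:

```python
import numpy as np
from scipy import stats as sps

a = np.array([30.02, 29.99, 30.11, 29.97, 30.01, 29.99])
b = np.array([29.89, 29.93, 29.72, 29.98, 30.02, 29.98])

# A paired two-sample test is a one-sample test of X_D = X_1 - X_2 against mu = 0
t_paired, p_paired = sps.ttest_rel(a, b)
t_diff, p_diff = sps.ttest_1samp(a - b, 0)
```

Both calls return identical statistics and p-values, which is the equivalence stated above.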

Parameters
• x1 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• x2 (numpy.array) – Input variable. Format can be numpy.array, list or pandas.Series.

• alpha (float, optional) – Significance level of the test, defaults to 0.05.

• paired (bool, optional) – Performs one sample test of the difference if samples are paired, defaults to False.

• is_bernoulli (bool, optional) – Input variables follow a Bernoulli distribution, i.e. $$\mathbf{X} \sim \mathcal{B}(p)$$ with $$p \in [0, 1]$$, defaults to False.

• two_sided (bool, optional) – Perform a two-sided test, defaults to True.

• return_tuple (bool, optional) – Return a tuple with t statistic, critical value and p-value, defaults to False.

Example

>>> from statinf import stats
>>> a = [30.02, 29.99, 30.11, 29.97, 30.01, 29.99]
>>> b = [29.89, 29.93, 29.72, 29.98, 30.02, 29.98]
>>> stats.ttest_2samp(a, b)
... +------------------------------------------------------------+
... |                   Two Samples Student test                 |
... +------------+----------------+------------+---------+-------+
... |     df     | Critical value |    T-value | p-value |   H0  |
... +------------+----------------+------------+---------+-------+
... |         10 |  -1.8124611228 |  1.1310325 | 0.28444 | True  |
... +------------+----------------+------------+---------+-------+
...  * Confidence level is 95.0%, we need p > 0.05
...  * We cannot reject the hypothesis H0: X1 = X2
...  * Samples with unequal variances
...  * Same sample sizes

Reference
• DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics. Pearson Education.

• Student. (1908). The probable error of a mean. Biometrika, 1-25.

Returns

Summary for the test or tuple statistic, critical value, p-value.

Return type

str or tuple