Introduction: elementary statistics in Python

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
print(os.listdir("../input")) #listing data files
['data.csv', 'test.csv', 'test2.csv', 'melb_data.csv.zip', 'melb_data.csv', 'matches.csv', 'deliveries.csv.zip', 'deliveries.csv', 'data1.csv', 'el1.csv', 'social_deltas.csv', 'AEP_hourly.csv', 'social_totals_agg.csv', 'social_totals.csv', 'airports.csv', 'airport-frequencies.csv', 'runways.csv', 'navaids.csv', 'countries.csv', 'regions.csv', 'airports.txt']
In [2]:
import matplotlib.pyplot as plt
print(plt.style.available) # look at available plot styles
['seaborn-ticks', 'ggplot', 'dark_background', 'bmh', 'seaborn-poster', 'seaborn-notebook', 'fast', 'seaborn', 'classic', 'Solarize_Light2', 'seaborn-dark', 'seaborn-pastel', 'seaborn-muted', '_classic_test', 'seaborn-paper', 'seaborn-colorblind', 'seaborn-bright', 'seaborn-talk', 'seaborn-dark-palette', 'tableau-colorblind10', 'seaborn-darkgrid', 'seaborn-whitegrid', 'fivethirtyeight', 'grayscale', 'seaborn-white', 'seaborn-deep']
In [3]:
#emulates the aesthetics of ggplot
plt.style.use("ggplot")
from operator import attrgetter

Frequency distributions

In [4]:
#data list indicating preference (notes) in interval [1,6]
data1 = pd.DataFrame({"preferences":[4,6,2,2,1,2,3,2,4,4]})
In [5]:
fv1 = data1["preferences"].value_counts(sort=False)  #fv1 is a series
#1st column (indicating notes) is the index of the series
fv1
Out[5]:
1    1
2    4
3    1
4    3
6    1
Name: preferences, dtype: int64
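
Grade 5 never occurs in the data, so it is absent from the counts above. A minimal sketch, assuming we want every grade of the scale [1,6] listed (the name `all_grades` is illustrative and not part of the notebook), reindexes the counts and fills the gaps with 0. As a side effect, the row order no longer depends on how `value_counts` orders its result.

```python
import pandas as pd

preferences = pd.Series([4, 6, 2, 2, 1, 2, 3, 2, 4, 4], name="preferences")

# all_grades is a hypothetical name: every grade allowed by the scale
all_grades = range(1, 7)

# count each grade, then list the grades that never occur with frequency 0
counts = preferences.value_counts().reindex(all_grades, fill_value=0)
print(counts)
```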
In [6]:
#making data look nice: create data frame from series above
notes = pd.DataFrame({"notes":fv1.index,"frequence":fv1.values},index=range(len(fv1)))
notes
Out[6]:
notes frequence
0 1 1
1 2 4
2 3 1
3 4 3
4 6 1
In [7]:
#keep the returned Axes in ax, so the imported plt module is not overwritten
ax = notes.frequence.plot.bar()
In [8]:
#Let's calculate frequency distributions of another data set using intervals of data
mylist = [20,18,6,24,33,9,10,19,27,33,22,17,19,31,25,21,28,13,21,12,33,23,18,13,7,16,7,26]

#create dataframe from the list above
data2 = pd.DataFrame( {'values':mylist} )
In [9]:
#defining explicit intervals of classification
fv2 = data2["values"].value_counts(sort=False,bins=[4,9,14,19,24,29,34])
print(fv2)
(3.999, 9.0]    4
(9.0, 14.0]     4
(14.0, 19.0]    6
(19.0, 24.0]    6
(24.0, 29.0]    4
(29.0, 34.0]    4
Name: values, dtype: int64
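
The intervals above are closed on the right, so a value equal to an upper limit (for example 9) is counted in the lower class. The first interval is printed as (3.999, 9.0] so that the lowest limit, 4, would itself be included. A rough sketch of the same classification done explicitly with `pd.cut` (the variable names here are illustrative, and this is one possible way to reproduce the counts, not necessarily how the notebook author would do it):

```python
import pandas as pd

values = pd.Series([20, 18, 6, 24, 33, 9, 10, 19, 27, 33, 22, 17, 19, 31,
                    25, 21, 28, 13, 21, 12, 33, 23, 18, 13, 7, 16, 7, 26])
edges = [4, 9, 14, 19, 24, 29, 34]

# pd.cut assigns each value to a right-closed interval (a, b]
classes = pd.cut(values, bins=edges)

# count the members of each class, listed in interval order
counts = classes.value_counts().sort_index()
print(counts)

# the value 9 (position 5 in the list) belongs to the first class, (4, 9]
print(classes[5])
```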
In [10]:
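#alternatively, find the range of the data to choose the interval limits
#(names chosen so that the built-in max and min are not overwritten)
max_value = np.max(data2['values'])
min_value = np.min(data2['values'])
print("min =",min_value,"max =",max_value)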
min = 6 max = 33
In [11]:
fv2 = data2["values"].value_counts(sort=False,bins=range(4,35,5))
fv2
Out[11]:
(3.999, 9.0]    4
(9.0, 14.0]     4
(14.0, 19.0]    6
(19.0, 24.0]    6
(24.0, 29.0]    4
(29.0, 34.0]    4
Name: values, dtype: int64
In [12]:
#convert series result to a dataframe
table1 = pd.DataFrame({"intervals":fv2.index,"f":fv2.values},index=range(len(fv2)))
table1
Out[12]:
intervals f
0 (3.999, 9.0] 4
1 (9.0, 14.0] 4
2 (14.0, 19.0] 6
3 (19.0, 24.0] 6
4 (24.0, 29.0] 4
5 (29.0, 34.0] 4
In [13]:
#cumulative frequency
table1 = table1.assign(F=table1.f.cumsum())
#total number of observations (sum of the frequencies)
N = table1.f.sum()
#relative and cumulative relative frequencies, in percent
table1['f%'] = (table1['f'] / N) * 100
table1['F%'] = (table1['F'] / N) * 100
table1
Out[13]:
intervals f F f% F%
0 (3.999, 9.0] 4 4 14.285714 14.285714
1 (9.0, 14.0] 4 8 14.285714 28.571429
2 (14.0, 19.0] 6 14 21.428571 50.000000
3 (19.0, 24.0] 6 20 21.428571 71.428571
4 (24.0, 29.0] 4 24 14.285714 85.714286
5 (29.0, 34.0] 4 28 14.285714 100.000000
In [14]:
#keep the returned Axes in ax, so the imported plt module is not overwritten
ax = table1.plot.bar(x='intervals', y="f")

Mean, variance and standard deviation

\begin{align} Arithmetic\,mean = {Sum\,of\,all\,numbers \over No.\,of\,values\,in\,the \,set}\,\,\,\, \end{align}

Because it is important to distinguish between the population (the entire set of elements under investigation) and a sample (a representative part of that set), we use the following notation:

Sample mean: $\bar{x} = {\sum_{i=1}^{n} x_{i} \over n}$

Population mean: $\mu = {\sum_{i=1}^{N} x_{i} \over N}$

where n and N are the number of elements in the sample and in the population, respectively.

In general, quantities that describe a population are called parameters, and quantities computed from a sample are called statistics.

In [15]:
mylist = [2,3,3,5,8,9,12]
#create data frame
data = pd.DataFrame( {'values':mylist} )
data
Out[15]:
values
0 2
1 3
2 3
3 5
4 8
5 9
6 12
In [16]:
#select the column explicitly; data.values would sum over every column of the frame
sum_of_data = data['values'].sum()
n = len(data.index)
mean = sum_of_data / n
mean
Out[16]:
6.0

or

In [17]:
mean = data['values'].mean()
mean
Out[17]:
6.0

The mean is the "center of gravity" of the data: the deviations of the values below the mean exactly balance the deviations of the values above it.

In [18]:
data['dev'] = data['values'] - mean
data
Out[18]:
values dev
0 2 -4.0
1 3 -3.0
2 3 -3.0
3 5 -1.0
4 8 2.0
5 9 3.0
6 12 6.0

As we can easily verify, for any data set the sum of the deviations from the mean is 0.

$ \bar{x} = {\sum_{i=1}^{n} x_{i} \over n} \Rightarrow n \times \bar{x} = \sum_{i=1}^{n} x_{i} \quad (1) \\ \sum_{i=1}^n(x_i - \bar{x}) = x_1 - \bar{x} + x_2 - \bar{x} + \dots + x_n - \bar{x} = \\ = (x_1 + x_2 + \dots + x_n) - n \times \bar{x} = \sum_{i=1}^{n} x_{i} - n \times \bar{x} \quad (2) \\ (1), (2) \Rightarrow \sum_{i=1}^n(x_i - \bar{x}) = \sum_{i=1}^{n} x_{i} - \sum_{i=1}^{n} x_{i} = 0 $

So the sum of the deviations cannot measure the spread of the data around the mean: it is always 0.

In [19]:
data.dev.sum()
Out[19]:
0.0
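
The result does not depend on the particular data set. As a quick check, here is a sketch on randomly generated data (the seed, the distribution and the tolerance are arbitrary choices made for this illustration). With floating-point numbers the sum is 0 only up to rounding error, so it is compared against a small tolerance rather than tested for equality.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
sample = rng.normal(loc=50.0, scale=10.0, size=1000)

# deviations of every value from the sample mean
deviations = sample - sample.mean()

# not exactly 0.0 in floating point, only 0 up to rounding error
total = deviations.sum()
print(abs(total) < 1e-9)
```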

To avoid that result, we could use the absolute values of the deviations from the mean:

Mean deviation = ${\sum_{i=1}^{n} |x_{i}-\bar{x}| \over n}$
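
The formula divides the sum of the absolute deviations by the number of values n. A compact sketch of it as a function (the name `mean_deviation` is chosen here for illustration), applied to the data set used in this section:

```python
import numpy as np

def mean_deviation(values):
    """Average absolute deviation of the values from their mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()).mean()

# absolute deviations are 4, 3, 3, 1, 2, 3, 6: their sum is 22, and n = 7
result = mean_deviation([2, 3, 3, 5, 8, 9, 12])
print(result)
```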

In [20]:
data['abs_dev'] = np.absolute(data['dev'])
data
Out[20]:
values dev abs_dev
0 2 -4.0 4.0
1 3 -3.0 3.0
2 3 -3.0 3.0
3 5 -1.0 1.0
4 8 2.0 2.0
5 9 3.0 3.0
6 12 6.0 6.0
In [21]:
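#mean deviation: the sum of the absolute deviations divided by the number of values
mean_deviation = data['abs_dev'].sum() / len(data.index)
mean_deviation
Out[21]:
3.142857142857143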

The mean deviation is a measure of how the data spread around the mean, but for mathematical reasons (squared deviations are easier to work with, for example in regression analysis) the variance is used instead:

Sample variance: $S^2={\sum_{i=1}^{n} (x_{i}-\bar{x})^2 \over n-1}$ (dividing by n-1 instead of n gives an unbiased estimate of the population variance)

n - size of the sample

Population variance: $\sigma^2={\sum_{i=1}^{N} (x_{i}-\mu)^2 \over N}$

N - size of the population

The variance is expressed in the squared units of the data. To measure the spread around the mean in the same units as the data, we take the square root of the variance.

This is the standard deviation.

Standard deviation for a sample: $S = \sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})}^2 \over n - 1}$

Standard deviation for a population: $\sigma = \sqrt{\sum_{i=1}^{N}{(x_i - \mu)}^2 \over N}$

In [22]:
data['square_dev'] = data['dev']**2
data
Out[22]:
values dev abs_dev square_dev
0 2 -4.0 4.0 16.0
1 3 -3.0 3.0 9.0
2 3 -3.0 3.0 9.0
3 5 -1.0 1.0 1.0
4 8 2.0 2.0 4.0
5 9 3.0 3.0 9.0
6 12 6.0 6.0 36.0
In [23]:
variance = data.square_dev.sum()/len(data.index)
print("variance=",variance)
print("std_dev=",np.sqrt(variance))
variance= 12.0
std_dev= 3.4641016151377544
In [24]:
#pandas calculations
#Delta Degrees of Freedom: the denominator of the fraction is (n - ddof)
#in this case, the biased formula with n in the denominator
data['values'].std(ddof = 0)
Out[24]:
3.4641016151377544
In [25]:
#having a sample, we use unbiased formula with n-1
data['values'].std(ddof = 1)
Out[25]:
3.7416573867739413
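
One point worth keeping in mind: NumPy and pandas use different defaults for `ddof`. `np.std` divides by n (ddof=0) unless told otherwise, while the pandas `std` method divides by n-1 (ddof=1). A short sketch on the same data; passing `ddof` explicitly, as in the cells above, avoids the ambiguity.

```python
import numpy as np
import pandas as pd

values = [2, 3, 3, 5, 8, 9, 12]

# NumPy default is ddof=0: the biased (population) formula
population_std = np.std(values)

# pandas default is ddof=1: the unbiased (sample) formula
sample_std = pd.Series(values).std()

print(population_std, sample_std)

# stating ddof explicitly makes both libraries agree
print(np.isclose(np.std(values, ddof=1), sample_std))
```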

The standard deviation of a frequency distribution

If we start from the "biased" formula for the standard deviation of a sample (it tends to underestimate the parameter, i.e. the standard deviation of the population under investigation):

$$S = \sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})}^2 \over n}$$

we obtain the following equivalent formula:

$$S = \sqrt{{\sum_{i=1}^{n} x_i^2 \over n}-({\sum_{i=1}^{n} x_i \over n})^2} \quad (1)$$

Proof:

$ {\sum_{i=1}^{n} (x_i - \bar{x})^2 \over n} = \\ = {(x_1^2 - 2 x_1 \bar{x} + \bar{x}^2) + (x_2^2 - 2 x_2 \bar{x} + \bar{x}^2) + \dots + (x_n^2 - 2 x_n \bar{x} + \bar{x}^2) \over n} = \\ = {x_1^2 + x_2^2 + \dots + x_n^2 - 2 \bar{x} (x_1 + x_2 + \dots + x_n) + n \bar{x}^2 \over n} = \\ = {{{\sum_{i=1}^{n} x_i^2} - 2 { \sum_{i=1}^{n} x_i \over n} \sum_{i=1}^{n} x_i + n ({\sum_{i=1}^{n} x_i \over n})^2} \over n} = \\ = {{\sum_{i=1}^{n} x_i^2-2{(\sum_{i=1}^{n}x_i)^2\over n}+{(\sum_{i=1}^{n}x_i)^2\over n}} \over n} = \\ = {{\sum_{i=1}^{n} x_i^2 - {(\sum_{i=1}^{n} x_i)^2 \over n}} \over n} = \\ = {{\sum_{i=1}^{n} x_i^2} \over n} - ({\sum_{i=1}^{n} x_i \over n})^2 $
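
The identity can also be checked numerically. A small sketch on the data set from the previous section (variable names are illustrative): both expressions give the same value, the square root of 12. As a general caution, the shortcut form subtracts two numbers of similar size, so it can lose precision in floating point when the mean is large compared with the spread.

```python
import numpy as np

x = np.array([2, 3, 3, 5, 8, 9, 12], dtype=float)
n = len(x)

# definition: square root of the mean of the squared deviations
by_definition = np.sqrt(((x - x.mean()) ** 2).sum() / n)

# formula (1): mean of the squares minus the square of the mean
by_shortcut = np.sqrt((x ** 2).sum() / n - (x.sum() / n) ** 2)

print(by_definition, by_shortcut)
```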

Similarly, starting from the "unbiased" formula for the standard deviation of a sample:

$$S = \sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})}^2 \over n - 1}$$

we will have the following equivalent formula:

$$S = \sqrt{{ {n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2} \over n(n-1)}} \quad (2)$$

because:

$ {\sum_{i=1}^{n} (x_i - \bar{x})^2 \over n-1} = {{\sum_{i=1}^{n} x_i^2 - {(\sum_{i=1}^{n} x_i)^2 \over n}} \over n-1} = {{n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2} \over n(n-1)} $
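
The same kind of numerical check works for formula (2). In this sketch (illustrative variable names, same data set as before) the sums are 42 and 336, so the expression under the root is (7 x 336 - 42^2) / (7 x 6) = 14, which agrees with the value returned by `std(ddof = 1)` earlier.

```python
import numpy as np

x = np.array([2, 3, 3, 5, 8, 9, 12], dtype=float)
n = len(x)

sum_x = x.sum()          # 42
sum_x2 = (x ** 2).sum()  # 336

# formula (2): unbiased standard deviation from the two sums only
s_unbiased = np.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(s_unbiased, np.std(x, ddof=1))
```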

Finally, formulas (1) and (2) above, applied to a frequency distribution, become:

$$S_{biased} = \sqrt{{\sum fm^2 \over \sum f}-({\sum fm \over \sum f})^2}$$

$$S_{unbiased} = \sqrt{{ {\sum f \sum fm^2 - (\sum fm)^2} \over \sum f (\sum f-1)}}$$

where:

f - interval frequencies

m - midpoints of intervals.
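
In other words, each midpoint m is treated as if it occurred f times. A hedged sketch of both formulas as one function (the name `grouped_std` and the tiny frequency table are made up for this illustration), checked against the result of literally repeating each midpoint f times:

```python
import numpy as np

def grouped_std(f, m, unbiased=True):
    """Standard deviation of a frequency distribution.

    f: interval frequencies, m: interval midpoints.
    """
    f = np.asarray(f, dtype=float)
    m = np.asarray(m, dtype=float)
    sum_f = f.sum()
    sum_fm = (f * m).sum()
    sum_fm2 = (f * m ** 2).sum()
    if unbiased:
        return np.sqrt((sum_f * sum_fm2 - sum_fm ** 2) / (sum_f * (sum_f - 1)))
    return np.sqrt(sum_fm2 / sum_f - (sum_fm / sum_f) ** 2)

# made-up table: three intervals with midpoints 5, 15, 25
freq = [2, 5, 3]
mid = [5, 15, 25]

# the same data written out value by value: 5, 5, 15, 15, ...
expanded = np.repeat(mid, freq)

print(grouped_std(freq, mid, unbiased=False), np.std(expanded, ddof=0))
print(grouped_std(freq, mid, unbiased=True), np.std(expanded, ddof=1))
```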

Applied to an example:

In [26]:
# read csv (comma separated value) into dataframe
data2 = pd.read_csv('../input/test2.csv')
data2.head()
Out[26]:
weight
0 130
1 130
2 120
3 110
4 89
In [27]:
data_mean = data2.weight.mean()
data_sd = data2.weight.std(ddof = 1)
print("mean =",data_mean,"standard deviation =",data_sd)
mean = 119.97368421052632 standard deviation = 17.27439417657974
In [28]:
data2.describe()
Out[28]:
weight
count 38.000000
mean 119.973684
std 17.274394
min 87.000000
25% 110.000000
50% 120.000000
75% 130.000000
max 170.000000

Now, let's group the data set into intervals of 10 units, from 85 to 174:

In [29]:
#freq by intervals (interval limits listed explicitly)
weight_f = data2["weight"].value_counts(sort=False,bins=[84,94,104,114,124,134,144,154,164,174])
weight_f
Out[29]:
(83.999, 94.0]     3
(94.0, 104.0]      3
(104.0, 114.0]     6
(114.0, 124.0]    11
(124.0, 134.0]     9
(134.0, 144.0]     4
(144.0, 154.0]     0
(154.0, 164.0]     1
(164.0, 174.0]     1
Name: weight, dtype: int64
In [30]:
#interval limits defined using range
weight_f = data2["weight"].value_counts(sort=False,bins=range(84,175,10))
weight_f
Out[30]:
(83.999, 94.0]     3
(94.0, 104.0]      3
(104.0, 114.0]     6
(114.0, 124.0]    11
(124.0, 134.0]     9
(134.0, 144.0]     4
(144.0, 154.0]     0
(154.0, 164.0]     1
(164.0, 174.0]     1
Name: weight, dtype: int64
In [31]:
#convert series result to a dataframe
table = pd.DataFrame({"intervals":weight_f.index, "f":weight_f.values},index=range(len(weight_f)))
table
Out[31]:
intervals f
0 (83.999, 94.0] 3
1 (94.0, 104.0] 3
2 (104.0, 114.0] 6
3 (114.0, 124.0] 11
4 (124.0, 134.0] 9
5 (134.0, 144.0] 4
6 (144.0, 154.0] 0
7 (154.0, 164.0] 1
8 (164.0, 174.0] 1
In [32]:
#get the midpoint of each interval (mean of its two limits)
#m stands in for the x values as the "representative" value of the interval
#so for the 1st row, for example, we treat the data as 3 values of 88.9995
table['m'] = table['intervals'].map(attrgetter('mid'))
table
Out[32]:
intervals f m
0 (83.999, 94.0] 3 88.9995
1 (94.0, 104.0] 3 99.0000
2 (104.0, 114.0] 6 109.0000
3 (114.0, 124.0] 11 119.0000
4 (124.0, 134.0] 9 129.0000
5 (134.0, 144.0] 4 139.0000
6 (144.0, 154.0] 0 149.0000
7 (154.0, 164.0] 1 159.0000
8 (164.0, 174.0] 1 169.0000
In [38]:
table['fm'] = table['f'] * table['m']
table['fm2'] = table['f'] * table['m'] ** 2
table
Out[38]:
intervals f m fm fm2
0 (83.999, 94.0] 3 88.9995 266.9985 23762.733001
1 (94.0, 104.0] 3 99.0000 297.0000 29403.000000
2 (104.0, 114.0] 6 109.0000 654.0000 71286.000000
3 (114.0, 124.0] 11 119.0000 1309.0000 155771.000000
4 (124.0, 134.0] 9 129.0000 1161.0000 149769.000000
5 (134.0, 144.0] 4 139.0000 556.0000 77284.000000
6 (144.0, 154.0] 0 149.0000 0.0000 0.000000
7 (154.0, 164.0] 1 159.0000 159.0000 25281.000000
8 (164.0, 174.0] 1 169.0000 169.0000 28561.000000
In [51]:
sum_f = table.f.sum()
sum_fm = table.fm.sum()
sum_fm2 = table.fm2.sum()

s = np.sqrt((sum_f * sum_fm2 - sum_fm ** 2) / (sum_f * (sum_f-1)) ) 
print("std. dev for freq. distribution =", s)
#which is close to
print("std. dev of data =", data_sd)
std. dev for freq. distribution = 17.2691761602393
std. dev of data = 17.27439417657974

Similarly, the mean of a frequency distribution is:

$$\bar{x} = {\sum fm \over \sum f}$$
In [54]:
mean_f = sum_fm / sum_f
print("mean of freq. distrib. =",mean_f)
#which is close to
print("data mean =", data_mean)
mean of freq. distrib. = 120.31575
data mean = 119.97368421052632
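
The mean of a frequency distribution is a weighted mean of the midpoints, with the frequencies as weights. As a sketch, the same number can be obtained with `np.average` and its `weights` argument; the frequencies and midpoints below are copied from the table above, and the variable names are illustrative.

```python
import numpy as np

# frequencies and midpoints copied from the table above
f = [3, 3, 6, 11, 9, 4, 0, 1, 1]
m = [88.9995, 99.0, 109.0, 119.0, 129.0, 139.0, 149.0, 159.0, 169.0]

# weighted mean: sum(f * m) / sum(f)
grouped_mean = np.average(m, weights=f)
print(grouped_mean)
```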