Cheers! Stats with Beers

Welcome to the second module in Engineering Computations, our series in computational thinking for undergraduate science and engineering students. This module explores practical statistical analysis with Python.

This first lesson explores how we can answer questions using data combined with practical methods from statistics.

We'll need some fun data to work with. We found a neat data set of canned craft beers in the US, scraped from the web and cleaned up by Jean-Nicholas Hould (@nickhould on GitHub)—who we want to thank for having a permissive license on his GitHub repository so we can reuse his work!

The data source doesn't say that the set includes all the canned beers brewed in the country. So we have to assume that the data is a sample and may contain biases.

To manipulate the data, you'll start with NumPy—the array library for Python that you learned about in Module 1, lesson 4. But you'll also learn about a new Python library for data analysis: pandas. It is an open-source library providing high-performance, easy-to-use data structures and data-analysis tools. Even though pandas is great for data analysis, we won't exploit all its power in this lesson. But you'll learn more about it later on!

With pandas, you will read the data file (in csv format, for comma-separated values), display it in a nice table, and extract the columns that we need, which we'll convert to numpy arrays to work with.

Let's start by importing the two Python libraries that we need.

In [1]:

import pandas
import numpy

Step 1: Read the data file

Below, we'll take a peek into the data file, beers.csv, using the system command head (which we can use with a bang, thanks to IPython).

But first, we will download the data using a Python library for opening a URL on the Internet. We created a short URL for the data file in the public repository with our course materials.

The cell below should download the data in your current working directory. The next cell shows you the first few lines of the data.

In [2]:

from urllib.request import urlretrieve
URL = 'http://go.gwu.edu/engcomp2data1'
urlretrieve(URL, 'beers.csv')

Out[2]:

('beers.csv', <http.client.HTTPMessage at 0x11d88c9e8>)

In [3]:

!head "beers.csv"

,abv,ibu,id,name,style,brewery_id,ounces
0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0
1,0.066,,2265,Devil's Cup,American Pale Ale (APA),177,12.0
2,0.071,,2264,Rise of the Phoenix,American IPA,177,12.0
3,0.09,,2263,Sinister,American Double / Imperial IPA,177,12.0
4,0.075,,2262,Sex and Candy,American IPA,177,12.0
5,0.077,,2261,Black Exodus,Oatmeal Stout,177,12.0
6,0.045,,2260,Lake Street Express,American Pale Ale (APA),177,12.0
7,0.065,,2259,Foreman,American Porter,177,12.0
8,0.055,,2258,Jade,American Pale Ale (APA),177,12.0

You can use pandas to read the data from the csv file, and save it into a new variable called beers. Let's then check the type of this new variable; remember that we can use the function type() to do this.

In [4]:

beers = pandas.read_csv("beers.csv")

In [5]:

type(beers)

Out[5]:

pandas.core.frame.DataFrame

This is a new data type for us: a pandas DataFrame. From the pandas documentation: "A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types" [4]. You can think of it as the contents of a spreadsheet, saved into one handy Python variable. If you print it out, you get a nicely laid-out table:

In [6]:

beers

Out[6]:

	Unnamed: 0	abv	ibu	id	name	style	brewery_id	ounces
0	0	0.050	NaN	1436	Pub Beer	American Pale Lager	408	12.0
1	1	0.066	NaN	2265	Devil's Cup	American Pale Ale (APA)	177	12.0
2	2	0.071	NaN	2264	Rise of the Phoenix	American IPA	177	12.0
3	3	0.090	NaN	2263	Sinister	American Double / Imperial IPA	177	12.0
4	4	0.075	NaN	2262	Sex and Candy	American IPA	177	12.0
5	5	0.077	NaN	2261	Black Exodus	Oatmeal Stout	177	12.0
6	6	0.045	NaN	2260	Lake Street Express	American Pale Ale (APA)	177	12.0
7	7	0.065	NaN	2259	Foreman	American Porter	177	12.0
8	8	0.055	NaN	2258	Jade	American Pale Ale (APA)	177	12.0
9	9	0.086	NaN	2131	Cone Crusher	American Double / Imperial IPA	177	12.0
10	10	0.072	NaN	2099	Sophomoric Saison	Saison / Farmhouse Ale	177	12.0
11	11	0.073	NaN	2098	Regional Ring Of Fire	Saison / Farmhouse Ale	177	12.0
12	12	0.069	NaN	2097	Garce Selé	Saison / Farmhouse Ale	177	12.0
13	13	0.085	NaN	1980	Troll Destroyer	Belgian IPA	177	12.0
14	14	0.061	60.0	1979	Bitter Bitch	American Pale Ale (APA)	177	12.0
15	15	0.060	NaN	2318	Ginja Ninja	Cider	154	12.0
16	16	0.060	NaN	2170	Cherried Away	Cider	154	12.0
17	17	0.060	NaN	2169	Rhubarbarian	Cider	154	12.0
18	18	0.060	NaN	1502	BrightCider	Cider	154	12.0
19	19	0.082	NaN	1593	He Said Baltic-Style Porter	Baltic Porter	368	12.0
20	20	0.082	NaN	1592	He Said Belgian-Style Tripel	Tripel	368	12.0
21	21	0.099	92.0	1036	Lower De Boom	American Barleywine	368	8.4
22	22	0.079	45.0	1024	Fireside Chat	Winter Warmer	368	12.0
23	23	0.079	NaN	976	Marooned On Hog Island	American Stout	368	12.0
24	24	0.044	42.0	876	Bitter American	American Pale Ale (APA)	368	12.0
25	25	0.049	17.0	802	Hell or High Watermelon Wheat (2009)	Fruit / Vegetable Beer	368	12.0
26	26	0.049	17.0	801	Hell or High Watermelon Wheat (2009)	Fruit / Vegetable Beer	368	12.0
27	27	0.049	17.0	800	21st Amendment Watermelon Wheat Beer (2006)	Fruit / Vegetable Beer	368	12.0
28	28	0.070	70.0	799	21st Amendment IPA (2006)	American IPA	368	12.0
29	29	0.070	70.0	797	Brew Free! or Die IPA (2008)	American IPA	368	12.0
...	...	...	...	...	...	...	...	...
2380	2380	0.080	31.0	761	P-51 Porter	American Porter	509	16.0
2381	2381	0.055	NaN	2149	#001 Golden Amber Lager	American Amber / Red Lager	211	12.0
2382	2382	0.071	60.0	2148	#002 American I.P.A.	American IPA	211	12.0
2383	2383	0.052	NaN	2147	#003 Brown & Robust Porter	American Porter	211	12.0
2384	2384	0.048	38.0	2146	#004 Session I.P.A.	American IPA	211	12.0
2385	2385	0.059	NaN	2047	Tarasque	Saison / Farmhouse Ale	239	12.0
2386	2386	0.062	61.0	1470	Ananda India Pale Ale	American IPA	239	12.0
2387	2387	0.045	23.0	1469	Tiny Bomb	American Pilsner	239	12.0
2388	2388	0.058	72.0	2627	Train Hopper	American IPA	14	12.0
2389	2389	0.045	NaN	2626	Edward’s Portly Brown	American Brown Ale	14	12.0
2390	2390	0.059	135.0	1676	Troopers Alley IPA	American IPA	344	12.0
2391	2391	0.047	15.0	1468	Wolverine Premium Lager	American Pale Lager	402	12.0
2392	2392	0.050	NaN	822	Woodchuck Amber Hard Cider	Cider	501	12.0
2393	2393	0.065	82.0	2417	4000 Footer IPA	American IPA	109	12.0
2394	2394	0.028	15.0	2306	Summer Brew	American Pilsner	109	12.0
2395	2395	0.065	69.0	1697	Be Hoppy IPA	American IPA	339	16.0
2396	2396	0.069	69.0	2194	Worthy IPA	American IPA	199	12.0
2397	2397	0.045	25.0	1514	Easy Day Kolsch	Kölsch	199	12.0
2398	2398	0.077	30.0	1513	Lights Out Vanilla Cream Extra Stout	American Double / Imperial IPA	199	12.0
2399	2399	0.069	69.0	1512	Worthy IPA (2013)	American IPA	199	12.0
2400	2400	0.060	50.0	1511	Worthy Pale	American Pale Ale (APA)	199	12.0
2401	2401	0.042	NaN	1345	Patty's Chile Beer	Chile Beer	424	12.0
2402	2402	0.082	NaN	1316	Colorojo Imperial Red Ale	American Strong Ale	424	12.0
2403	2403	0.055	NaN	1045	Wynkoop Pumpkin Ale	Pumpkin Ale	424	12.0
2404	2404	0.075	NaN	1035	Rocky Mountain Oyster Stout	American Stout	424	12.0
2405	2405	0.067	45.0	928	Belgorado	Belgian IPA	424	12.0
2406	2406	0.052	NaN	807	Rail Yard Ale	American Amber / Red Ale	424	12.0
2407	2407	0.055	NaN	620	B3K Black Lager	Schwarzbier	424	12.0
2408	2408	0.055	40.0	145	Silverback Pale Ale	American Pale Ale (APA)	424	12.0
2409	2409	0.052	NaN	84	Rail Yard Ale (2009)	American Amber / Red Ale	424	12.0

2410 rows × 8 columns

Inspect the table above. The first column is a numbering scheme for the beers. The other columns contain the following data:

abv: Alcohol-by-volume of the beer.
ibu: International Bittering Units of the beer.
id: Unique identifier of the beer.
name: Name of the beer.
style: Style of the beer.
brewery_id: Unique identifier of the brewery.
ounces: Ounces of beer in the can.
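
The "Unnamed: 0" header in the table comes from the empty first field in the file's header line. As a small sketch, the index_col argument of pandas.read_csv() can tell pandas to treat that column as the row labels instead. The code below uses an in-memory copy of the first three rows rather than the downloaded file, and the names csv_text, default_frame and indexed_frame are made up for illustration; they are not part of the lesson.

```python
import io
import pandas

# A tiny stand-in for beers.csv: three rows copied from the file's head above
csv_text = """,abv,ibu,id,name,style,brewery_id,ounces
0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0
1,0.066,,2265,Devil's Cup,American Pale Ale (APA),177,12.0
2,0.071,,2264,Rise of the Phoenix,American IPA,177,12.0
"""

# Default behavior: the unnamed first column is kept as data
default_frame = pandas.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels instead
indexed_frame = pandas.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_frame.columns))
print(list(indexed_frame.columns))
print(indexed_frame.dtypes)   # one data type per column
```

The rest of this lesson keeps the default call, so the extra column stays in the beers data frame; it does no harm because we only use the abv and ibu columns.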

Step 2: Explore the data

In the field of statistics, Exploratory Data Analysis (EDA) has the goal of summarizing the main features of our data, and seeing what the data can tell us without formal modeling or hypothesis-testing. [2]

Let's start by extracting the columns with the abv and ibu values, and converting them to NumPy arrays. One of the advantages of data frames in pandas is that we can access a column simply using its header, like this:

data_frame['name_of_column']

The output of this action is a pandas Series. From the documentation: "a Series is a 1-dimensional labeled array capable of holding any data type." [4]

Check the type of a column extracted by header:

In [7]:

type(beers['abv'])

Out[7]:

pandas.core.series.Series

Of course, you can index and slice a data series like you know how to do with strings, lists and arrays. Here, we display the first ten elements of the abv series:

In [8]:

beers['abv'][:10]

Out[8]:

0    0.050
1    0.066
2    0.071
3    0.090
4    0.075
5    0.077
6    0.045
7    0.065
8    0.055
9    0.086
Name: abv, dtype: float64

Inspect the data in the table again: you'll notice that there are NaN (not-a-number) elements in both the abv and ibu columns. Those values mean that there was no data reported for that beer. A typical task when cleaning up data is to deal with these pesky NaNs.

Let's extract the two series corresponding to the abv and ibu columns, clean the data by removing all NaN values, and then access the values of each series and assign them to a NumPy array.

In [9]:

abv_series = beers['abv']

In [10]:

len(abv_series)

Out[10]:

Another advantage of pandas is that it can handle missing data. The dropna() method, available on both data frames and series, returns a new object with only the good values of the original: all the null values are thrown out. This is super useful!

In [11]:

abv_clean = abv_series.dropna()
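
To see what dropna() does on something small enough to inspect by eye, here is a minimal sketch with a made-up series (toy_series and its values are illustrative, not taken from the beer data). It also uses the isna() method to count the missing entries directly.

```python
import numpy
import pandas

# Made-up values for illustration only; numpy.nan marks a missing entry
toy_series = pandas.Series([0.05, numpy.nan, 0.071, numpy.nan, 0.075])

num_missing = toy_series.isna().sum()   # count the NaN entries
toy_clean = toy_series.dropna()         # new series without the NaNs

print(num_missing)
print(len(toy_series), len(toy_clean))
print(toy_clean.index.tolist())         # the surviving labels are unchanged
```

Note that the original series is left untouched, and the cleaned series keeps the labels of the rows that survived.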

Check out the length of the cleaned-up abv data; you'll see that it's shorter than the original. NaNs gone!

In [12]:

len(abv_clean)

Out[12]:

Remember that a pandas Series consists of a column of values, and their labels. You can extract the values via the series.values attribute, which returns a numpy.ndarray (multidimensional array). In the case of the abv_clean series, you get a one-dimensional array. We save it into the variable name abv.

In [13]:

abv = abv_clean.values

In [14]:

print(abv)

[0.05  0.066 0.071 ... 0.055 0.055 0.052]

In [15]:

type(abv)

Out[15]:

numpy.ndarray
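
As a side note, pandas also provides the Series.to_numpy() method, which its documentation suggests as an alternative to the .values attribute. The sketch below, on a made-up series, shows that both give the same NumPy array for simple numeric data like ours.

```python
import numpy
import pandas

toy_series = pandas.Series([0.05, 0.066, 0.071])  # made-up values

from_values = toy_series.values        # attribute used in this lesson
from_method = toy_series.to_numpy()    # method form of the same conversion

print(type(from_method))
print(numpy.array_equal(from_values, from_method))
```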

Now we repeat the whole process for the ibu column: extract the column into a series, clean it up by removing NaNs, extract the series values as an array, and check how many values we lost.

In [16]:

ibu_series = beers['ibu']

len(ibu_series)

Out[16]:

In [17]:

ibu_clean = ibu_series.dropna()

ibu = ibu_clean.values

len(ibu)

Out[17]:

Exercise:

Write a Python function that calculates the percentage of missing values for a certain data series. Use the function to calculate the percentage of missing values for the abv and ibu data sets.

For the original series, before cleaning, remember that you can access the values with series.values (e.g., abv_series.values).
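
If you want something to compare your answer against, here is one possible approach, shown on made-up data. The function name percent_missing and the toy array are assumptions for illustration; try writing your own version first.

```python
import numpy

def percent_missing(values):
    """Return the percentage of NaN entries in a numeric array."""
    values = numpy.asarray(values, dtype=float)
    # numpy.isnan gives True for each NaN; summing booleans counts the Trues
    return 100 * numpy.isnan(values).sum() / len(values)

toy_values = numpy.array([0.05, numpy.nan, 0.071, numpy.nan])  # made-up data
print(percent_missing(toy_values))
```

With the beer data, you would pass in abv_series.values or ibu_series.values.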

Important:

Notice that in the case of the variable ibu we are missing almost 42% of the values. This is important, because it will affect our analysis. When we do descriptive statistics, we will ignore these missing values, and having 42% missing will very likely cause bias.

Step 3: Ready, stats, go!

Now that we have NumPy arrays with clean data, let's see how we can manipulate them to get some useful information.

Focusing on the numerical variables abv and ibu, we'll walk through some "descriptive statistics," below. In other words, we aim to generate statistics that summarize the data concisely.

Maximum and minimum

The maximum and minimum values of a dataset are helpful as they tell us the range of our sample: the range gives some indication of the variability in the data. We can obtain them for our abv and ibu arrays with the min() and max() functions from NumPy.

abv

In [18]:

abv_min = numpy.min(abv)
abv_max = numpy.max(abv)

In [19]:

print('The minimum value for abv is: ', abv_min)
print('The maximum value for abv is: ', abv_max)

The minimum value for abv is:  0.001
The maximum value for abv is:  0.128

ibu

In [20]:

ibu_min = numpy.min(ibu)
ibu_max = numpy.max(ibu)

In [21]:

print('The minimum value for ibu is: ', ibu_min)
print('The maximum value for ibu is: ', ibu_max)

The minimum value for ibu is:  4.0
The maximum value for ibu is:  138.0
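
The range itself is just the difference between those two numbers. A minimal sketch on a made-up array (toy_data is illustrative, not the beer data):

```python
import numpy

toy_data = numpy.array([4.0, 15.0, 60.0, 138.0])  # made-up values

# Range: largest value minus smallest value
data_range = numpy.max(toy_data) - numpy.min(toy_data)
print('The range is: ', data_range)
```

The same expression works with abv_max - abv_min and ibu_max - ibu_min.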

Mean value

The mean value is one of the main measures to describe the central tendency of the data: an indication of where the "center" of the data is. If we have a sample of $N$ values, $x_i$, the mean, $\bar{x}$, is calculated by:

\begin{equation*} \bar{x} = \frac{1}{N}\sum_{i} x_i \end{equation*}

In words, that is the sum of the data values divided by the number of values, $N$.
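
As a quick check of the formula, this sketch computes the mean of a small made-up array by hand and compares it with numpy.mean(). The array toy_data is an assumption for illustration.

```python
import numpy

toy_data = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up

N = len(toy_data)
mean_by_hand = numpy.sum(toy_data) / N   # sum of the values divided by N

print(mean_by_hand)
print(numpy.mean(toy_data))
```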

You've already learned how to write a function to compute the mean in Module 1 Lesson 5, but you also learned that NumPy has a built-in mean() function. We'll use this to get the mean of the abv and ibu values.

In [22]:

abv_mean = numpy.mean(abv)
ibu_mean = numpy.mean(ibu)

Next, we'll print these two variables, but we'll use a fancy new way of printing with Python's string formatter, the str.format() method. There's a sweet site dedicated to Python's string formatter, called PyFormat, where you can learn lots of tricks!

The basic trick is to use curly brackets {} as placeholder for a variable value that you want to print in the middle of a string (say, a sentence that explains what you are printing), and to pass the variable name as argument to .format(), preceded by the string.

Let's try something out…

In [23]:

print('The mean value for abv is {} and for ibu {}'.format(abv_mean, ibu_mean))

The mean value for abv is 0.059773424190800686 and for ibu 42.71316725978647

Ugh! That doesn't look very good, does it? Here's where Python's string formatting gets fancy. We can print fewer decimal digits, so the sentence is more readable. For example, if we want to have four decimal digits, we specify it this way:

In [24]:

print('The mean value for abv is {:.4f} and for ibu {:.4f}'.format(abv_mean, ibu_mean))

The mean value for abv is 0.0598 and for ibu 42.7132

Inside the curly brackets—the placeholders for the values we want to print—the f is for float and the .4 is for four digits after the decimal dot. The colon here marks the beginning of the format specification (as there are options that can be passed before). There are so many tricks to Python's string formatter that you'll usually look up just what you need. Another useful resource for string formatting is the Python String Format Cookbook. Check it out!
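
Python 3.6 and later also accept the same format specification inside formatted string literals ("f-strings"), where the variable name goes directly in the curly brackets. This sketch is an aside, not something the lesson relies on; the two values are copied from the output printed above.

```python
# Values copied from the output printed earlier in this lesson
abv_mean = 0.059773424190800686
ibu_mean = 42.71316725978647

with_format = 'abv {:.4f}, ibu {:.4f}'.format(abv_mean, ibu_mean)
with_fstring = f'abv {abv_mean:.4f}, ibu {ibu_mean:.4f}'

print(with_fstring)
print(with_format == with_fstring)   # both approaches give the same text
```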

Variance and standard deviation

While the mean indicates where the center of your data is, the variance and standard deviation describe the spread or variability of the data. We already mentioned that the range (difference between largest and smallest data values) is also an indication of variability. But the standard deviation is the most common measure of variability.

We really like the way Prof. Kristin Sainani, of Stanford University, presents this in her online course on Statistics in Medicine. In her lecture "Describing Quantitative Data: What is the variability in the data?", available on YouTube, she asks: What if someone were to ask you to devise a statistic that gives the average distance from the mean? Think about this a little bit.

The distance from the mean, for any data value, is $x_i - \bar{x}$. So what is the average of the distances from the mean? If we try to simply compute the average of all the values $x_i - \bar{x}$, some of which are negative, you'll just get zero! It doesn't work.

Since the problem is the negative distances from the mean, you might suggest using absolute values. But this is just mathematically inconvenient. Another way to get rid of negative values is to take the squares. And that's how we get to the expression for the variance: it is the average of the squares of the deviations from the mean. For a set of $N$ values,

\begin{equation*} \text{var} = \frac{1}{N}\sum_{i} (x_i - \bar{x})^2 \end{equation*}

The variance itself is hard to interpret. The problem with it is that the units are strange (they are the square of the original units). The standard deviation, the square root of the variance, is more meaningful because it has the same units as the original variable. Often, the symbol $\sigma$ is used for it:

\begin{equation*} \sigma = \sqrt{\text{var}} = \sqrt{\frac{1}{N}\sum_{i} (x_i - \bar{x})^2} \end{equation*}
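
The argument above is easy to verify numerically. In this sketch, on a made-up array chosen so the numbers come out round, the plain deviations average to zero, while their squares give the variance and its square root gives the standard deviation (using the divide-by-$N$ formulas above).

```python
import numpy

toy_data = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up

deviations = toy_data - numpy.mean(toy_data)

mean_dev = numpy.mean(deviations)      # positives and negatives cancel
var = numpy.mean(deviations**2)        # average of squared deviations
sigma = numpy.sqrt(var)                # back to the original units

print(deviations)
print(mean_dev, var, sigma)
```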

Sample vs. population

The above definitions are used when $N$ (the number of values) represents the entire population. But if we have a sample of that population, the formulas have to be adjusted: instead of dividing by $N$ we divide by $N-1$. This is important, especially when we work with real data since usually we have samples of populations.

The standard deviation of a sample is denoted by $s$, and the formula is:

\begin{equation*} s = \sqrt{\frac{1}{N-1}\sum_{i} (x_i - \bar{x})^2} \end{equation*}

Why? This gets a little technical, but the reason is that if you have a sample of the population, you don't know the real value of the mean, and $\bar{x}$ is actually an estimate of the mean. That's why you'll often find the symbol $\mu$ used to denote the population mean, and distinguish it from the sample mean, $\bar{x}$. Using $\bar{x}$ to compute the standard deviation introduces a small bias: $\bar{x}$ is computed from the sample values, and the data are on average (slightly) closer to $\bar{x}$ than the population is to $\mu$. Dividing by $N-1$ instead of $N$ corrects this bias!

Prof. Sainani explains it by saying that we lost one degree of freedom when we estimated the mean using $\bar{x}$. For example, say we have 100 people and I give you their mean age, and the actual age for 99 people from the sample: you'll be able to calculate the age of that 100th person. Once we calculated the mean, we only have 99 degrees of freedom left because that 100th person's age is fixed.
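
NumPy's built-in numpy.var() and numpy.std() expose the same choice through their ddof argument ("delta degrees of freedom"): the divisor is $N$ minus ddof, and the default is ddof=0. A minimal sketch on a made-up array (the variable names are illustrative):

```python
import numpy

toy_data = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up

var_population = numpy.var(toy_data)          # divides by N (default ddof=0)
var_sample = numpy.var(toy_data, ddof=1)      # divides by N - 1
std_sample = numpy.std(toy_data, ddof=1)      # square root of var_sample

print(var_population, var_sample, std_sample)
```

The sample version is always a little larger, and the difference shrinks as $N$ grows.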

Let's code!

Now that we have the math sorted out, we can program functions to compute the variance and the standard deviation. In our case, we are working with samples of the population of craft beers, so we need to use the formulas with $N-1$ in the denominator.

In [25]:

def sample_var(array):
    """ Calculates the variance of an array that contains values of a sample of a 
    population. 
    
    Arguments
    ---------
    array : array, contains sample of values. 
    
    Returns
    -------
    var   : float, variance of the array .
    """
    
    sum_sqr = 0 
    mean = numpy.mean(array)
    
    for element in array:
        sum_sqr += (element - mean)**2
    
    N = len(array)
    var = sum_sqr / (N - 1)
    
    return var
    

Notice that we used numpy.mean() in our function: do you think we can make this function even more Pythonic?

Hint: Yes, we totally can!

Exercise:

Re-write the function sample_var() using numpy.sum() to replace the for-loop. Name the function var_pythonic.
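
Once you have tried it yourself, you can compare with the sketch below. It is one possible version, not the only one; it relies on the fact that subtracting a number from a NumPy array acts on every element at once.

```python
import numpy

def var_pythonic(array):
    """Sample variance (N - 1 denominator) without an explicit loop."""
    array = numpy.asarray(array, dtype=float)
    mean = numpy.mean(array)
    # (array - mean)**2 squares every deviation in one vectorized step
    sum_sqr = numpy.sum((array - mean)**2)
    return sum_sqr / (len(array) - 1)

toy_data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
print(var_pythonic(toy_data))
```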

We have the sample variance, so we take its square root to get the standard deviation. We can make it a function, even though it's just one line of Python, to make our code more readable:

In [26]:

def sample_std(array):
    """ Computes the standard deviation of an array that contains values
    of a sample of a population.
    
    Arguments
    ---------
    array : array, contains sample of values. 
    
    Returns
    -------
    std   : float, standard deviation of the array.
    """
    
    std = numpy.sqrt(sample_var(array))
    
    return std

Let's call our brand new functions and assign the output values to new variables:

In [27]:

abv_std = sample_std(abv)
ibu_std = sample_std(ibu)

If we print these values using the string formatter, only printing 4 decimal digits, we can display our descriptive statistics in a pleasant, human-readable way.

In [28]:

print('The standard deviation for abv is {:.4f} and for ibu {:.4f}'.format(abv_std, ibu_std))

The standard deviation for abv is 0.0135 and for ibu 25.9541

These numbers tell us that the abv values are quite concentrated around the mean value, while the ibu values are quite spread out from their mean. How could we check these descriptions of the data? A good way of doing so is using graphics: various types of plots can tell us things about the data.

We'll learn about histograms in this lesson, and in the following lesson we'll explore box plots.

Step 4: Distribution plots

Every time that we work with data, visualizing it is very useful. Visualizations give us a better idea of how our data behaves. One way of visualizing data is with a frequency-distribution plot known as a histogram: a graphical representation of how the data is distributed. To make a histogram, first we need to "bin" the range of values (divide the range into intervals) and then we count how many data values fall into each interval. The intervals are usually consecutive (not always), of equal size and non-overlapping.
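
To make the binning step concrete, this sketch uses numpy.histogram(), which does the counting without drawing anything. The array toy_data is made up for illustration.

```python
import numpy

toy_data = numpy.array([1, 2, 2, 3, 3, 3, 4, 4, 5])  # made-up values

# Split the range [1, 5] into 4 equal-width bins and count values in each
counts, bin_edges = numpy.histogram(toy_data, bins=4)

print(bin_edges)   # 5 edges delimit 4 bins
print(counts)      # number of values that fall into each bin
print(counts.sum() == len(toy_data))   # every value lands in exactly one bin
```

Each bin includes its left edge but not its right edge, except the last bin, which includes both; that is why the value 5 is counted in the final bin.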

Thanks to Python and Matplotlib, making histograms is easy. We recommend that you always read the documentation, in this case about histograms. We'll show you here an example using the hist() function from pyplot, but this is just a starting point.

Let's import the libraries that we need for plotting, as you learned in Module 1 Lesson 5, then study the plotting commands used below. Try changing some of the plot options and seeing the effect.

In [29]:

from matplotlib import pyplot
%matplotlib inline

#set font styles
pyplot.rc('font', family='serif', size=16)

In [30]:

#You can set the size of the figure by doing:
pyplot.figure(figsize=(10,5))

#Plotting
pyplot.hist(abv, bins=20, color='#3498db', histtype='bar', edgecolor='white') 
#The \n is to leave a blank line between the title and the plot
pyplot.title('abv \n')
pyplot.xlabel('Alcohol by Volume (abv) ')
pyplot.ylabel('Frequency');

In [31]:

#You can set the size of the figure by doing:
pyplot.figure(figsize=(10,5))

#Plotting
pyplot.hist(ibu, bins=20, color='#e67e22', histtype='bar', edgecolor='white') 
#The \n is to leave a blank line between the title and the plot
pyplot.title('ibu \n')
pyplot.xlabel('International Bittering Units (ibu)')
pyplot.ylabel('Frequency');

Exploratory exercise:

Play around with the plots, change the values of the bins, colors, etc.

Comparing with a normal distribution

A normal (or Gaussian) distribution is a special type of distribution that behaves as shown in the figure: 68% of the values are within one standard deviation $\sigma$ from the mean; 95% lie within $2\sigma$; and at a distance of $\pm3\sigma$ from the mean, we cover 99.7% of the values. This fact is known as the $3$-$\sigma$ rule, or 68-95-99.7 (empirical) rule.

Standard deviation and coverage in a normal distribution. Modified figure based on original from Wikimedia Commons, the free media repository.
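
The 68-95-99.7 figures can be reproduced from the error function in Python's standard library: for a normal distribution, the fraction of values within $k$ standard deviations of the mean is $\text{erf}(k/\sqrt{2})$. The function name normal_coverage below is made up for this sketch.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print('{} sigma: {:.2f} %'.format(k, 100 * normal_coverage(k)))
```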

Notice that our histograms don't follow the shape of a normal distribution, known as the Bell Curve. Our histograms are not centered on the mean value, and they are not symmetric with respect to it. They are what we call skewed to the right (yes, to the right). A right (or positive) skewed distribution looks like it's been pushed to the left: the right tail is longer and most of the values are concentrated on the left of the figure. Imagine that "right-skewed" means that a force from the right pushes on the curve.
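
One way to put a number on skewness, beyond looking at the plot, is sketched below on made-up data with a long right tail. It compares the mean with the median (the long tail pulls the mean to the right) and computes one common skewness measure, the average cubed deviation divided by the cube of the population standard deviation. This measure is an addition for illustration; the lesson itself only judges skewness visually.

```python
import numpy

# Made-up values with a long right tail
toy_data = numpy.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])

mean = numpy.mean(toy_data)
median = numpy.median(toy_data)

# Moment coefficient of skewness: mean cubed deviation over sigma cubed
deviations = toy_data - mean
skewness = numpy.mean(deviations**3) / numpy.mean(deviations**2)**1.5

print(mean, median)
print(skewness)   # a positive value indicates skew to the right
```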

Discuss with your neighbor:

How do you think that skewness will affect the percentages of coverage by standard deviation compared to the Bell Curve?
Can we calculate those percentages?

Spoiler alert! (and Exercise):

Yes we can, and guess what: we can do it in a few lines of Python. But before doing that, we want you to explain in your own words how the following piece of code works.

Hints:

Check what the logical operation numpy.logical_and(1<x, x<4) returns.
Check what happens if you sum booleans. For example, True + True, True + False and so on.

In [32]:

x = numpy.array([1,2,3,4])
num_ele = numpy.logical_and(1<x, x<4).sum()
print(num_ele)

Now, using the same idea, we will calculate the number of elements that lie within $1\sigma$, $2\sigma$ and $3\sigma$ of the mean, and get the corresponding percentages.

Since we want to compute this for both of our variables, abv and ibu, we'll write a function to do so. Study carefully the code below. Better yet, explain it to your neighbor.

In [33]:

def std_percentages(x, x_mean, x_std):
    """ Computes the percentage of coverage at 1std, 2std and 3std from the
    mean value of a certain variable x.
    
    Arguments
    ---------
    x      : array, data we want to compute on. 
    x_mean : float, mean value of x array.
    x_std  : float, standard deviation of x array.
    
    Returns
    -------
    
    per_std_1 : float, percentage of values within 1 standard deviation.
    per_std_2 : float, percentage of values within 2 standard deviations.
    per_std_3 : float, percentage of values within 3 standard deviations.    
    """
    
    std_1 = x_std
    std_2 = 2 * x_std
    std_3 = 3 * x_std
    
    elem_std_1 = numpy.logical_and((x_mean - std_1) < x, x < (x_mean + std_1)).sum()
    per_std_1 = elem_std_1 * 100 / len(x) 
    
    elem_std_2 = numpy.logical_and((x_mean - std_2) < x, x < (x_mean + std_2)).sum()
    per_std_2 = elem_std_2 * 100 / len(x) 
    
    elem_std_3 = numpy.logical_and((x_mean - std_3) < x, x < (x_mean + std_3)).sum()
    per_std_3 = elem_std_3 * 100 / len(x) 
    
    return per_std_1, per_std_2, per_std_3
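
As an aside, the three near-identical blocks in the function above could be folded into a single helper that takes the number of standard deviations as an argument. The sketch below is one such variant, with a made-up name (coverage_percent) and made-up data. Unlike std_percentages(), it computes the mean and the sample standard deviation itself.

```python
import numpy

def coverage_percent(x, k):
    """Percentage of values in x strictly within k sample standard
    deviations of the mean of x."""
    x = numpy.asarray(x, dtype=float)
    center = x.mean()
    spread = x.std(ddof=1)            # sample standard deviation (N - 1)
    inside = numpy.abs(x - center) < k * spread
    # The mean of a boolean array is the fraction of True values
    return 100 * inside.mean()

toy_data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
for k in (1, 2, 3):
    print('{} std: {:.2f} %'.format(k, coverage_percent(toy_data, k)))
```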
    

Let's compute the percentages next. Notice that the function above returns three values. If we want to assign each value to a different variable, we need to follow a specific syntax. In our example this would be:

abv

In [34]:

abv_std1_per, abv_std2_per, abv_std3_per = std_percentages(abv, abv_mean, abv_std)

Let's pretty-print the values of our variables so we can inspect them:

In [35]:

print('The percentage of coverage at 1 std of the abv_mean is : {:.2f} %'.format(abv_std1_per))
print('The percentage of coverage at 2 std of the abv_mean is : {:.2f} %'.format(abv_std2_per))
print('The percentage of coverage at 3 std of the abv_mean is : {:.2f} %'.format(abv_std3_per))

The percentage of coverage at 1 std of the abv_mean is : 74.06 %
The percentage of coverage at 2 std of the abv_mean is : 94.34 %
The percentage of coverage at 3 std of the abv_mean is : 99.79 %

ibu

In [36]:

ibu_std1_per, ibu_std2_per, ibu_std3_per = std_percentages(ibu, ibu_mean, ibu_std)

In [37]:

print('The percentage of coverage at 1 std of the ibu_mean is : {:.2f} %'.format(ibu_std1_per))
print('The percentage of coverage at 2 std of the ibu_mean is : {:.2f} %'.format(ibu_std2_per))
print('The percentage of coverage at 3 std of the ibu_mean is : {:.2f} %'.format(ibu_std3_per))

The percentage of coverage at 1 std of the ibu_mean is : 68.11 %
The percentage of coverage at 2 std of the ibu_mean is : 95.66 %
The percentage of coverage at 3 std of the ibu_mean is : 99.72 %

Notice that in both cases the percentages are not that far from the values for a normal distribution (68%, 95%, 99.7%), especially for $2\sigma$ and $3\sigma$. So usually you can use these values as a rule of thumb.

What we've learned

Read data from a csv file using pandas.
The concepts of Data Frame and Series in pandas.
Clean null (NaN) values from a Series using pandas.
Convert a pandas Series into a numpy array.
Compute maximum and minimum, and range.
Review the concept of mean value.
Compute the variance and standard deviation.
Use the mean and standard deviation to understand how the data is distributed.
Plot frequency distribution diagrams (histograms).
Normal distribution and 3-sigma rule.

References

1. Craft beer dataset by Jean-Nicholas Hould.
2. Exploratory Data Analysis, Wikipedia article.
3. Think Python: How to Think Like a Computer Scientist (2012). Allen Downey. Green Tea Press. PDF available
4. Intro to data structures, pandas documentation.
5. Think Stats: Probability and Statistics for Programmers version 1.6.0 (2011). Allen Downey. Green Tea Press. PDF available

Recommended viewing

From "Statistics in Medicine," a free course offered years ago in Stanford Online by Prof. Kristin Sainani, we highly recommend that you watch these three lectures:
