This is the first topic where we are going to see how to perform statistical analysis using two software environments: R and Python. Every language has its strengths and weaknesses, so instead of attempting to replicate exactly the same chart or output in each one, we shall try to understand their respective approaches better.
Our end goal is to analyse data - toward that end, we should be versatile and adaptable. Focus on learning how to be fast and productive in whichever environment you are told to work in. If you have a choice, then be aware of what each framework can do best, so that you can choose the right one.
Although it is a simplistic view, the following diagram, Figure 3.1, provides a useful taxonomy of the types of columns we might have in our dataset. Having at least an initial inkling of the type of data matters, because it helps decide what type of summary to generate or what type of plot to make. Quantitative data are sometimes also referred to as numerical data.
There are two main ways of summarising data: numerically and graphically. This topic covers both approaches, focusing on quantitative variables.
Techniques for categorical variables will be covered in Section 4.1.
Before proceeding, let us introduce one of the datasets that we’ll be using in this textbook. The dataset comes from the UCI Machine Learning Repository, which is a very useful place to get datasets for practice.
Example 3.1 (Student Performance: Dataset Introduction)
The particular dataset can be downloaded from this page. Once you unzip the file, you will find two csv files in the student/ folder:

- student-mat.csv (performance in Mathematics)
- student-por.csv (performance in Portuguese)
(performance in Portugese)Each dataset was collected using school reports and questionnaires. Each row corresponds to a student. The columns are attributes, including student grades, demographic information, and other social and school-related information. Each file corresponds to the students’ performance in one of the two subjects. For more information, you can refer to Cortez and Silva (2008).
# | Feature | Description (Type) | Details |
---|---|---|---|
1 | school | student’s school (binary) | “GP” - Gabriel Pereira, “MS” - Mousinho da Silveira |
2 | sex | student’s sex (binary) | “F” - female, “M” - male |
3 | age | student’s age (numeric) | from 15 to 22 |
4 | address | student’s home address type (binary) | “U” - urban, “R” - rural |
5 | famsize | family size (binary) | “LE3” - less or equal to 3, “GT3” - greater than 3 |
6 | Pstatus | parent’s cohabitation status (binary) | “T” - living together, “A” - apart |
7 | Medu | mother’s education (numeric) | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, 4 - higher education |
8 | Fedu | father’s education (numeric) | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, 4 - higher education |
9 | Mjob | mother’s job (nominal) | teacher, health, civil services, at_home, other |
10 | Fjob | father’s job (nominal) | teacher, health, civil services, at_home, other |
11 | reason | reason to choose this school (nominal) | close to home, school reputation, course preference, other |
12 | guardian | student’s guardian (nominal) | mother, father, other |
13 | traveltime | home to school travel time (numeric) | 1 - <15 min, 2 - 15 to 30 min, 3 - 30 min to 1 hour, 4 - >1 hour |
14 | studytime | weekly study time (numeric) | 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - >10 hours |
15 | failures | number of past class failures (numeric) | n if 1<=n<3, else 4 |
16 | schoolsup | extra educational support (binary) | yes or no |
17 | famsup | family educational support (binary) | yes or no |
18 | paid | extra paid classes within the course subject (Math or Portuguese) (binary) | yes or no |
19 | activities | extra-curricular activities (binary) | yes or no |
20 | nursery | attended nursery school (binary) | yes or no |
21 | higher | wants to take higher education (binary) | yes or no |
22 | internet | Internet access at home (binary) | yes or no |
23 | romantic | with a romantic relationship (binary) | yes or no |
24 | famrel | quality of family relationships (numeric) | from 1 - very bad to 5 - excellent |
25 | freetime | free time after school (numeric) | from 1 - very low to 5 - very high |
26 | goout | going out with friends (numeric) | from 1 - very low to 5 - very high |
27 | Dalc | workday alcohol consumption (numeric) | from 1 - very low to 5 - very high |
28 | Walc | weekend alcohol consumption (numeric) | from 1 - very low to 5 - very high |
29 | health | current health status (numeric) | from 1 - very bad to 5 - very good |
30 | absences | number of school absences (numeric) | from 0 to 93 |
In the explanatory variables above, notice that although several variables are stored in the dataset file as numbers, they are in fact categorical. Examples of these are variables 24 to 29. It seems fair to treat the three output columns as numeric, though; G3 is the main output variable.
# | Feature | Description (Type) | Details |
---|---|---|---|
31 | G1 | first period grade (numeric) | from 0 to 20 |
32 | G2 | second period grade (numeric) | from 0 to 20 |
33 | G3 | final grade (numeric, output target) | from 0 to 20 |
Numerical summaries include measures of centre (such as the mean and the median), measures of spread (such as the standard deviation, the quartiles and the range), and simple counts (the number of observations and the number of missing values).
Example 3.2 (Student Performance: Numerical Summaries)
Let us read in the dataset and generate numerical summaries of the output variable of interest (G3).
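A minimal R sketch for doing this is given below. The file path student/student-mat.csv and the data frame name mat are assumptions; the raw UCI files use ";" as the field separator.

```r
# Read the Mathematics file (semicolon-separated) into a data frame.
mat <- read.csv("student/student-mat.csv", sep = ";")

dim(mat)            # 395 rows (students) and 33 columns
summary(mat$G3)     # five-number summary plus the mean, as discussed below:
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   0.00    8.00   11.00   10.42   14.00   20.00
sum(is.na(mat$G3))  # number of missing values; 0 for this column
```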
From the output, we can understand that we have 395 observations, with grades ranging from 0 to 20. I would surmise that the data are more or less symmetric in the middle, since the distance from the 3rd quartile to the median (14 - 11 = 3) is identical to the distance from the median to the 1st quartile (11 - 8 = 3). There are no missing values in the data.
However, summaries of a single variable are rarely useful since we do not have a basis for comparison. In this dataset, we are interested in how the grade varies with one or some of the other variables. Let’s begin with Mother’s education.
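One way to produce such a breakdown in R is sketched below (using the mat data frame from earlier); the grouped output that follows comes first from R's aggregate() and table(), and then from the pandas equivalent, groupby('Medu')['G3'].describe().

```r
# Five-number summary (plus mean) of G3 within each level of Medu.
aggregate(G3 ~ Medu, data = mat, FUN = summary)

# How many students fall into each Medu category?
table(mat$Medu)
```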
Medu G3.Min. G3.1st Qu. G3.Median G3.Mean G3.3rd Qu. G3.Max.
1 0 9.00 12.00 15.00 13.00 15.00 15.00
2 1 0.00 7.50 10.00 8.68 11.00 16.00
3 2 0.00 8.00 11.00 9.73 13.00 19.00
4 3 0.00 8.00 10.00 10.30 13.00 19.00
5 4 0.00 9.50 12.00 11.76 15.00 20.00
0 1 2 3 4
3 59 103 99 131
G3
count mean std min 25% 50% 75% max
Medu
0 3.0 13.000000 3.464102 9.0 12.0 15.0 15.0 15.0
1 59.0 8.677966 4.364594 0.0 7.5 10.0 11.0 16.0
2 103.0 9.728155 4.636163 0.0 8.0 11.0 13.0 19.0
3 99.0 10.303030 4.623486 0.0 8.0 10.0 13.0 19.0
4 131.0 11.763359 4.267646 0.0 9.5 12.0 15.0 20.0
Now we begin to understand the context of G3 a little better. As the education level of the mother increases, the mean grade does increase, and the middle 50% of the grades seems to shift upwards as well. The exception is the case where the mother has no education, but there are only 3 observations in that category, so we should not read too much into it.
Here are some things to note about numerical summaries:
While numerical summaries provide us with some basic information about the data, they also leave out a lot. Even for experts, it is possible to have a wrong mental idea about the data from the numerical summaries alone. For instance, all three histograms in Figure 3.2 have a mean of 0 and standard deviation of 1!
That’s why we have to turn to graphics as well, in order to summarise our data.
The first chart type we shall consider is a histogram. A histogram is a graph that uses bars to portray the frequencies or relative frequencies of the possible outcomes for a quantitative variable.
When we create a histogram, here are some things that we look for:

- the overall shape of the distribution (unimodal or multimodal, symmetric or skewed),
- where the bulk of the observations lies, and how spread out they are,
- any outliers or unusual spikes.
Example 3.3 (Student Performance: Histograms)
Now let us return to the student performance dataset.
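A sketch of how such a histogram could be drawn in R is given below; the number of bins is an arbitrary choice, and the published figures may have been produced differently.

```r
# Histogram of the final grade; try different values of breaks to see
# how the bin width affects the picture.
hist(mat$G3, breaks = 20, xlab = "G3", main = "Final grade (Mathematics)")
```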
Do you notice anything different between the two histograms?

In general, we now have a little more information to accompany the five-number summaries. It appears that the distribution is not a simple unimodal one: there is a large spike of about 40 students with very low scores.
As we mentioned earlier, it is not useful to inspect a histogram in a silo. Hence we shall condition on Mother's education once more, to create separate histograms for each group. Remember that this is a case where the response variable G3 is quantitative, and the explanatory variable Medu is ordinal.
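One way to obtain such conditioned histograms is with the lattice package, as sketched below; this may not match the published panels exactly.

```r
# Separate histogram panels of G3 for each level of mother's education.
library(lattice)
histogram(~ G3 | factor(Medu), data = mat, type = "percent")
```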
Although the heights of the panels in the two versions are not identical, we can see from either one that the proportion of 0 scores decreases as the mother's education increases. Perhaps this is reading too much into the dataset, but there also seem to be more scores at the higher end for the most highly educated mothers.
Histograms are not perfect: when using them, we have to experiment with the bin width, since a poor choice can mask details in the data. It is also easy to get distracted by the blockiness of histograms. An alternative to the histogram is the kernel density plot. Essentially, this is obtained by smoothing the heights of the rectangles in a histogram.
Suppose we have observed an i.i.d. sample \(x_1, x_2, \ldots, x_n\) from a continuous pdf \(f(\cdot)\). Then the kernel density estimate at \(x\) is given by
\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K \left( \frac{x - x_i}{h} \right) \]
where \(K(\cdot)\) is a kernel function (a symmetric density, such as the standard Normal pdf) and \(h > 0\) is the bandwidth, which controls the amount of smoothing.
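To make the formula concrete, here is a direct (and deliberately naive) implementation in R with a Gaussian kernel; the bandwidth h = 1.5 is an arbitrary choice for illustration.

```r
# Kernel density estimate evaluated over a grid, using the standard
# Normal density dnorm() as the kernel K.
kde <- function(x, data, h) {
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}

xs <- seq(0, 20, by = 0.1)
plot(xs, kde(xs, mat$G3, h = 1.5), type = "l",
     xlab = "G3", ylab = "Estimated density")
```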
Example 3.4 (Student Performance: Density Estimates)
As you can see, with density plots it is also possible to overlay several estimates for closer comparison. This is not possible with histograms, unless some transparency is applied to the rectangle colours.
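For instance, an overlay of two groups could be produced along these lines (a sketch; the grouping chosen here is just for illustration):

```r
# Overlay the density of G3 for students whose mothers have higher
# education (Medu == 4) on the density for all other students.
d_rest <- density(mat$G3[mat$Medu < 4])
d_high <- density(mat$G3[mat$Medu == 4])
plot(d_rest, xlab = "G3", main = "Density of G3")
lines(d_high, lty = 2)
legend("topleft", legend = c("Medu < 4", "Medu = 4"), lty = c(1, 2))
```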
A boxplot provides a skeletal representation of a distribution. Boxplots are very well suited for comparing multiple groups.
Here are the steps for drawing a boxplot:

- Draw a box from the lower quartile (Q1) to the upper quartile (Q3); the length of the box is the interquartile range (IQR).
- Mark the median with a line inside the box.
- Extend whiskers from each end of the box to the most extreme observations that lie within 1.5 × IQR of the box.
- Plot any observations beyond the whiskers individually; these are flagged as potential outliers.
A boxplot helps us to identify the median, the lower and upper quartiles, and outlier(s) (see Figure 3.6).
Example 3.5 (Student Performance: Boxplots)
Instead of using mother's education once again, this time we use the number of times a student goes out (goout) as the explanatory variable. From the boxplot outputs, it appears there is no strong, strictly increasing or decreasing trend in G3. Instead, although the differences between the categories are not large, it seems as though there is an “optimal” amount of going out: too little or too much is associated with lower median G3 scores.
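The side-by-side boxplots described above could be produced with something like the following sketch (the published figure may differ in styling):

```r
# Boxplots of G3 broken down by how often the student goes out with friends.
boxplot(G3 ~ goout, data = mat, xlab = "goout", ylab = "G3")
```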
Finally, we turn to QQ-plots. A Quantile-Quantile plot is a graphical diagnostic tool for assessing if a dataset follows a particular distribution. Most of the time we would be interested in comparing against a Normal distribution.
A QQ-plot plots the standardised sample quantiles against the theoretical quantiles of a N(0, 1) distribution. If the points fall on a straight line, then we would say that there is evidence that the data came from a Normal distribution.
Especially for unimodal datasets, the points in the middle will typically fall close to the line. The real value of a QQ-plot lies in judging whether the tails of the data are fatter or thinner than the tails of the Normal.
Please be careful! Some software/packages will switch the axes (i.e. plot the sample quantiles on the x-axis instead of the y-axis, unlike Figure 3.7). Please observe and interpret accordingly.
Example 3.6 (Concrete Slump: Dataset Introduction)
Concrete is a highly complex material. The slump flow of concrete is determined not only by the water content, but also by the other concrete ingredients. The UCI page for this dataset is here. The reference for this dataset is Yeh (2007).
The dataset includes 103 data points, with 7 input variables and 3 output variables. These are the input columns in the data, all in units of kg per \(m^3\) of concrete:
# | Feature | Details |
---|---|---|
1 | Cement | |
2 | Slag | |
3 | Fly ash | |
4 | Water | |
5 | SP | A super plasticizer to improve consistency. |
6 | Coarse Aggr. | |
7 | Fine Aggr. |
There are three output variables in the dataset. You can read more about Slump and Flow from this wikipedia page.
# | Feature | Units |
---|---|---|
1 | SLUMP | cm |
2 | FLOW | cm |
3 | 28-day Compressive Strength | MPa |
To read in the data in R:
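A minimal sketch is given below. The file name slump_test.data is an assumption (use whatever name the downloaded file has), as is the renaming of the final column to Comp.Strength, which is done here to match the variable name used in the text.

```r
# Read the concrete slump data (comma-separated with a header row).
concrete <- read.csv("slump_test.data")

# Give the 28-day compressive strength column a shorter name
# (assumed to be the last column; adjust if necessary).
names(concrete)[ncol(concrete)] <- "Comp.Strength"
str(concrete)
```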
Let us consider the Comp.Strength output variable. The histogram overlay in Figure 3.8 suggests some skewness and fatter tails than the Normal.
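A sketch of how such an overlay might be drawn in R, superimposing a Normal density with matching mean and standard deviation on the histogram:

```r
# Histogram of compressive strength on the density scale, with a
# fitted Normal curve overlaid for comparison.
hist(concrete$Comp.Strength, freq = FALSE,
     xlab = "Compressive strength (MPa)", main = "")
curve(dnorm(x, mean = mean(concrete$Comp.Strength),
            sd = sd(concrete$Comp.Strength)),
      add = TRUE, lty = 2)
```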
Example 3.7 (Concrete: QQ-plots)
The next chart is a QQ-plot, for assessing deviations from Normality.
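In R, a sketch of the corresponding commands is as follows: qqnorm() plots the sample quantiles against the Normal quantiles, and qqline() adds a reference line.

```r
# Normal QQ-plot of compressive strength, with a reference line.
qqnorm(concrete$Comp.Strength)
qqline(concrete$Comp.Strength)
```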
The deviation of the tails does not seem to be that large, judging from the QQ-plot.
When we are studying two quantitative variables, the most common numerical summary of the relationship between them is the correlation coefficient. Suppose that \(x_1, x_2, \ldots, x_n\) and \(y_1, y_2, \ldots, y_n\) are measurements of two variables on the same set of \(n\) objects or people. The sample correlation between these two variables is computed as:
\[ r = \frac{1}{n-1} \sum_{i=1}^n \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} \] where \(s_x\) and \(s_y\) are the sample standard deviations. \(r\) is an estimate of the correlation between random variables \(X\) and \(Y\).
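As a quick check of the formula, the sketch below computes \(r\) by hand for the student grades G1 and G3 and compares it with R's built-in cor(); the choice of variable pair is only for illustration.

```r
# Sample correlation between the first-period and final grades,
# computed directly from the formula and via cor().
x <- mat$G1
y <- mat$G3
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
c(manual = r_manual, builtin = cor(x, y))  # the two values should agree
```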
A few things to note about the value \(r\), which is also referred to as the Pearson correlation: it always lies between -1 and +1, its sign gives the direction of the association, and its magnitude reflects only the strength of the linear relationship between the two variables.
Figure 3.9 and Figure 3.10 contain sample plots of data and their corresponding \(r\) values. Notice how \(r\) does not pick up strong non-linear relationships; it only measures linear association.
Example 3.8 (Concrete: Scatterplots) When we have multiple quantitative variables in a dataset, it is common to create a matrix of scatterplots. This allows for simultaneous inspection of bivariate relationships.
We can see that a few of the variables have several zero values: cement, slag, slump and flow. Water appears to have some relationship with slump and with flow.
The scatterplots allow a visual understanding of the patterns, but it is usually also good to compute the correlation of all pairs of variables.
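In R, the full matrix of pairwise correlations can be obtained with cor(); the sketch below rounds the values for readability. The pandas version used to draw the heatmap follows.

```r
# Correlation matrix of all numeric columns in the concrete data.
round(cor(concrete[sapply(concrete, is.numeric)]), 2)
```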
```python
corr = concrete[['Cement', 'Slag', 'Comp_Strength', 'Water',
                 'SLUMP(cm)', 'FLOW(cm)']].corr()
corr.style.background_gradient(cmap='coolwarm_r')
```
 | Cement | Slag | Comp_Strength | Water | SLUMP(cm) | FLOW(cm) |
---|---|---|---|---|---|---|
Cement | 1.000000 | -0.243553 | 0.445725 | 0.221091 | 0.145913 | 0.186461 |
Slag | -0.243553 | 1.000000 | -0.331588 | -0.026775 | -0.284037 | -0.327231 |
Comp_Strength | 0.445725 | -0.331588 | 1.000000 | -0.254235 | -0.223358 | -0.124029 |
Water | 0.221091 | -0.026775 | -0.254235 | 1.000000 | 0.466568 | 0.632026 |
SLUMP(cm) | 0.145913 | -0.284037 | -0.223358 | 0.466568 | 1.000000 | 0.906135 |
FLOW(cm) | 0.186461 | -0.327231 | -0.124029 | 0.632026 | 0.906135 | 1.000000 |
The plots you see above are known as heatmaps. They enable us to pick out groups of variables that are similar to one another. As you can see from the blue block in the lower right corner, Water, SLUMP.cm and FLOW.cm are very similar to one another.
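For reference, a correlation heatmap can also be drawn in base R, as sketched below; the colours and ordering will differ from the pandas-styled table above, since heatmap() reorders the variables by similarity.

```r
# Heatmap of the correlation matrix; similar variables are clustered
# together along the margins.
cmat <- cor(concrete[sapply(concrete, is.numeric)])
heatmap(cmat, symm = TRUE, margins = c(10, 10))
```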