Statistics - Unlocking the World of Data

My notes from the Coursera EdinburghX course: Statistics - Unlocking the World of Data.

Week 1

Types of data

Data corresponds to information.
- Qualitative: descriptions of qualities (red, tall, etc)
  - Nominal: no natural ordering (race, gender)
  - Ordinal: a natural ordering (rating, level of agreement)
- Quantitative: measurements of quantities (23 cm, 10 kg, etc)
  - Discrete: can take specific values (number, cost)
  - Continuous: can take any value in a given interval (weight, percentage)
Usually we are interested in data that comes from recording some quantity or quality of interest. We may know in advance what the possible values of the data could be, but until we collect the data we do not know the actual values.

The statistics cycle

Pose a question -> Collect data -> Analyse data -> Interpret results -> Pose a question -> …
We typically start with a question that we wish to answer. Given the question we collect relevant data that allows us to investigate this question. Using the data we conduct an appropriate statistical test and interpret the results accordingly. Given our updated understanding of the world from the results obtained we may pose a further follow-up or in-depth question that we wish to investigate, and so on…

Visualizing data

Line graphs can be used whenever we want to see how a variable changes over time.
The bar chart is one of the most commonly seen types of graph or chart today, and is a useful way of showing and comparing amounts or values across different categories.
To get a better idea of the shape of the data set, we can group together similar values, and produce a bar chart for the grouped values. This type of bar chart is often referred to as a histogram.
Histograms and bar charts allow us to see the “shape” of a data set. There are several different features to look out for in these types of charts. The shape of a histogram or bar chart is often described as the distribution of the data.
Pie charts, like bar charts, allow us to show and compare amounts or values across different categories.
The type of graph you should use depends not only on the type of your data, but also on what you want to do with your data.
- Do you want to compare values in your data? Then bar charts, histograms and line graphs are typically most useful.
- Do you want to see the distribution of your data? Bar charts and histograms allow you to easily see the distribution of your data.
- Do you want to look for trends in your data over time? A line graph is a useful graphical tool to use.
- Do you want to look at the composition of your data? A pie chart is a useful graphical representation.

Summarising data

Ways of summarising data can be divided into two categories: Those that aim to give a ‘typical value’ (or average) of a data set, and those that give a measure of how different the values of the data set generally are from this typical value. Such measures are called summary statistics.
Given a set of data, an average gives an idea of what a typical value of that data set is.
- The most widely used average is called the arithmetic mean, or, more commonly, just the mean. We calculate the mean of a data set by adding up all the values in the data set, and dividing it by the number of pieces of data in the set.
- Another important average is the median. The median is found by ordering the data from smallest to largest and choosing the middle value.
In statistics, an unusually low or high data value is called an outlier.
The mean is sensitive to outliers, while the median is much less affected by them, but the overall distribution of the data has an impact too.
When we take an average, we will always lose information about our data - we are summarising the full data set with just a single “typical” value. How much information is lost, and how critical this is, depends on how much the data values vary from the average.
There are many ways to measure the variability of a data set.
- The most straightforward way of measuring the variability of a data set is to use the range, which is simply the difference between the smallest and the largest values in a data set.
- The interquartile range is the difference between the lower quartile and the upper quartile. The lower quartile is the value such that one quarter of the data lies below it, and the upper quartile is the value such that one quarter of the data lies above it.
- The variance and the standard deviation work by giving a measure of how far away the different data values are from the mean. The two are closely related, with the standard deviation being the square root of the variance.
- The box plot can be used to visualize variability. It’s different from many other visual representations of data because rather than displaying the entire data set, it shows only certain summary statistics: the minimum, lower quartile, median, upper quartile and maximum values.
Summary statistics do not capture all the information we might want to, or should, know about a data set, and hence it’s important to visualise data as well as look at the summary statistics.
Visualising data allows us to see patterns in the data that we would not spot from looking at summary statistics alone.
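As a rough illustration of the summary statistics above, here is a minimal Python sketch using the standard `statistics` module. The data set is made up for the example, and the quartiles use the module's default method (several slightly different quartile conventions exist, so other tools may give slightly different values).

```python
import statistics

# Hypothetical data set, invented purely for illustration.
# The value 40 plays the role of an outlier.
data = [2, 4, 4, 5, 7, 9, 40]

# Averages: a 'typical value' of the data set.
mean = statistics.mean(data)
median = statistics.median(data)

# Range: difference between the largest and smallest values.
data_range = max(data) - min(data)

# quantiles(n=4) returns three cut points:
# lower quartile, median, upper quartile (default 'exclusive' method).
lower_q, _, upper_q = statistics.quantiles(data, n=4)
iqr = upper_q - lower_q

# Variance and standard deviation (treating the data as a whole population).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(f"mean={mean:.2f} median={median} range={data_range}")
print(f"IQR={iqr} variance={variance:.2f} sd={std_dev:.2f}")
```

Note how the single outlier pulls the mean well above the median, while the interquartile range is unaffected by it.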

Week 2 - Patterns in data

Visualizing data with two variables

Scatter plots are a very useful way of presenting and looking at relationships between two different variables.
When looking for patterns in a scatter plot, it is important to first make sure that the scatter plot is drawn correctly and clearly.
The variable that we think is having an effect on the other is called the explanatory variable, while the variable that we think is being affected is called the response variable (so that the response variable responds to changes in the explanatory variable). It is a convention to place the explanatory variable on the horizontal axis, and the response variable on the vertical axis.
Sometimes there will not be an explanatory and a response variable. In this case, we can usually put either variable on either axis.
When using a scatter plot to look for patterns in data, it can be helpful to adjust the axes to ensure that data is as easy to read as possible.
An outlier is a data point that doesn’t fit in with the general pattern of the data in the scatter plot.
Conventionally, graphs use what we call a linear scale on their axes, where evenly spaced markers on the scale correspond to evenly spaced measurements of the variable.
We can also use a logarithmic scale, where the markers are evenly spaced, but the values that they correspond to are not. This makes certain data plots easier to analyse.

Relationships in data

There are many different types of relationship that we might find between two variables. Of course, there may well be no relationship between two variables.
If there is a relationship between two variables, then this can take on one of many different forms.
A linear relationship occurs when one variable increases or decreases at a steady rate as the other variable increases, so that the points on a scatter plot roughly follow a straight line.
- Positive linear relationship: as one variable increases, so does the other.
- Negative linear relationship: as one variable increases, the other decreases.
A relationship that is not linear is a nonlinear relationship.
A monotonic relationship is any relationship where an increase in one variable always goes with an increase in the other variable (or always goes with a decrease), whether or not the relationship is linear.
When there is a linear relationship between two variables, we say that they are correlated, or that there is correlation between them. Correlation can be either positive or negative.
Where there is strong correlation between two variables, the points on the corresponding scatter plot lie very close to the straight line describing the general relationship. Where there is weak correlation, the points might be scattered quite far from this underlying straight line.
The correlation coefficient can be used to give a measurement of the linear relationship between two variables. It works by assigning a single number between plus and minus one to the two variables. The closer to 1 or -1, the stronger the correlation; the closer to 0, the weaker the correlation.
There are several different, albeit similar, ways of calculating a correlation coefficient between two variables, but the most commonly used is the Pearson’s correlation coefficient.
Line graphs can be used to visualize correlation.
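To make the correlation coefficient concrete, here is a minimal sketch of Pearson's correlation coefficient written out by hand in Python. The function name `pearson_r` and the data values are assumptions made for this example only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Hypothetical data: hours of study and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r_positive = pearson_r(hours, scores)                 # close to +1
r_negative = pearson_r(hours, [-s for s in scores])   # close to -1

print(round(r_positive, 3), round(r_negative, 3))
```

Flipping the sign of one variable flips the sign of the coefficient but not its strength.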

Patterns in data

The line of best fit is a simpler line that is plotted on a graph that essentially summarises the main underlying trend of the points over time. The line is fitted such that it minimises in some way the distances between the points and this simpler line of best fit.
The simplest line of best fit will be a straight line, and so these are often used. Such lines generally fit the data well if the two variables of interest have a linear relationship.
A line of best fit is very useful for giving us an indication of the general pattern, or trend, that the data is following. For this reason, a line of best fit is sometimes referred to as a trend line.

Linear regression

The concept of regression allows us to remove any ambiguity by allowing us to calculate a ‘best’ line of best fit for any graph.
The most commonly used method to calculate the best fit line is the method of least-squares regression.
The vertical distance between a point and the regression line is called a residual. By squaring the residuals and adding them all together, we get a number that gives us an indication of how far away the points on the scatter plot are from the line. We can calculate this value for any straight line, and there is always one unique line that minimises it; we call this the linear regression line.
A linear regression line is very informative if we have a linear relationship between two variables, as it gives us an idea of what a ‘typical’ data point looks like, and can then be used to estimate or predict the value of one of the variables for a given value of the other variable.
When we do not have a linear relationship between the two variables, it is not appropriate to fit a linear regression line, as important information about the relationship between the variables can be lost.
Any straight line can be described in terms of its gradient, or slope, and its intercept, the point at which the line crosses the vertical axis.
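The least-squares idea above can be sketched in a few lines of Python. This is only an illustration: the function names and the data are made up, and the formulas used are the standard closed-form expressions for the gradient and intercept of the least-squares line.

```python
def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

def sum_squared_residuals(xs, ys, gradient, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: explanatory variable xs, response variable ys.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 64, 70]

gradient, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, gradient, intercept)
# Any other straight line gives a larger sum of squared residuals.
other = sum_squared_residuals(xs, ys, gradient + 0.5, intercept)

print(gradient, intercept, best < other)
```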

Interpreting patterns

Extrapolation is the process of using a line of best fit to estimate the value that a variable might take outside of the range of the observed data values. Extrapolation must be treated with care, because we have no data to tell us whether the pattern continues beyond the observed range.
Interpolation is the process of using a line of best fit to estimate the value that a variable might take when it is within the range of the observed data values, but different from them.
Simpson’s paradox occurs when a trend is shown in different groups of data, but when the groups are combined the trend disappears or reverses.
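A small numerical sketch can make Simpson's paradox easier to see. The counts below are invented for illustration (they do not come from the course): treatment A has the higher success rate in each group, yet treatment B has the higher success rate once the groups are combined, because the two treatments were used in very different proportions across the groups.

```python
# Hypothetical counts, invented for illustration: (successes, trials).
results = {
    "group 1": {"A": (90, 100), "B": (800, 1000)},
    "group 2": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    return successes / trials

def combined_rate(treatment):
    """Success rate for a treatment after pooling all groups together."""
    successes = sum(results[group][treatment][0] for group in results)
    trials = sum(results[group][treatment][1] for group in results)
    return successes / trials

# Within every group, A does better than B ...
a_wins_each_group = all(
    rate(*results[group]["A"]) > rate(*results[group]["B"]) for group in results
)
# ... but after combining the groups, B does better than A.
b_wins_overall = combined_rate("B") > combined_rate("A")

print(a_wins_each_group, b_wins_overall)
```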
A causal relationship, or cause-and-effect relationship, is one where two variables are related in a way such that a change in one variable directly causes a change in the other. Whenever we have a causal relationship between two variables, those variables will be correlated, but correlation does not imply causation.

Week 3 - Collecting data

Populations and samples

Generally in statistics, we are interested in an entire collection or group of items. Such a group is called a population.
A sample is a subset of the population of interest, obtained by selecting a certain number of the members of the population.
The process of drawing conclusions about a whole population by studying only a sample is called inference. Because of the difficulty in studying an entire population, inference is the basis of most of statistics.
When using a sample to infer something about a population we do lose information about the population. How much information is lost depends on the size of the sample and how the sample was chosen. A sample whose properties closely reflect those of the population is called a representative sample.

Sampling strategies

In order to draw reliable conclusions about our population of interest, it is important to carefully choose how we collect a sample from the population.
Random sampling: members of the population are randomly chosen to be included in the sample.
Systematic sampling: members of the population are chosen using some fixed method. Generally, the population will be ordered, and every, for example, tenth member chosen to be included in the sample. The main advantage of systematic sampling is its simplicity, but there are cases in which it is not appropriate.
Stratified sampling: if we want to ensure that the diversity of a population is maintained in the final sample, then our best sampling method is stratified sampling. To choose a stratified sample, we first divide the population into strata (sub-populations), and then collect a random sample from each stratum, as described above.
The variation between different samples of a population is called sampling variability. The less variability, the better.
Sampling bias occurs where the properties of samples are consistently different from the properties of the population.
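The three sampling strategies above can be sketched with Python's standard `random` module. Everything here is an assumption made for the example: the population of 100 members, the "urban"/"rural" strata, the 10% sampling fraction and the helper name `stratified_sample`.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 urban members followed by 20 rural members.
population = [
    {"id": i, "stratum": "urban" if i < 80 else "rural"} for i in range(100)
]

# Random sampling: every member is equally likely to be chosen.
random_sample = rng.sample(population, 10)

# Systematic sampling: every tenth member, starting from a random position.
start = rng.randrange(10)
systematic_sample = population[start::10]

def stratified_sample(members, key, fraction, rng):
    """Take a random sample of the given fraction from each stratum."""
    strata = {}
    for member in members:
        strata.setdefault(member[key], []).append(member)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample

stratified = stratified_sample(population, "stratum", 0.1, rng)
urban_count = sum(1 for m in stratified if m["stratum"] == "urban")
rural_count = sum(1 for m in stratified if m["stratum"] == "rural")

print(len(random_sample), len(systematic_sample), urban_count, rural_count)
```

The stratified sample always keeps the 80/20 split of the population, whereas a simple random sample of 10 members only does so on average.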

Experiments and studies

To start an experiment, we begin with a question that we want to answer.
We cannot generally run an experiment on every member of the population, and so we have to select a sample; we call these sampled members of the population experimental units (if they are people we call them subjects).
In order to assess the actual effect that we are interested in, we include in the experiment a second sample of experimental units that are not subject to the treatment, intervention or condition. We call this the control group, and the analysis of the experiment involves comparing data from the control group with data from the other group, which is called a treatment group, to see if there is a difference. To reduce the chance of bias, it is important that the experimental units in both the treatment and the control group are comparable.
Different units may still respond differently to the treatment, intervention or condition we want to study, which is called natural variability.
In some cases, it is not ethical or possible to collect data using an experiment. In such cases, we can often collect data instead using an observational study.
A confounding variable is any variable that is related to both the explanatory and the response variable in a study, in a way that helps to explain an apparent causal relationship between the explanatory and response variables.

Using data

Publication bias happens when only studies reporting significant results are published in scientific journals. If a study ends with neutral or negative findings, the leading researcher often struggles to find a journal that will agree to publish the results. This means that evidence from these studies might get discarded, and disappear from the scientific world. This creates a significant gap in the evidence available for systematic reviews and meta-analysis, and leads to bias in the results.
A Capture-recapture study can be applied in many different situations to estimate a population size, by taking two samples and recording the number of individuals caught in both samples, in the first sample but not the second, and in the second sample but not the first.
- a is the number of individuals caught in both samples, b is the number of individuals observed in the first sample but not the second, c is the number of individuals observed in the second sample but not the first, and ? is the number of individuals that were not observed in either sample. It is the ? that we do not know, and want to estimate. By making use of patterns in the data, we can find a formula for ?, and this allows us to derive the following formula for the estimated population size, called the Lincoln-Petersen estimator.
- Pop size = (a + b)(a + c) / a, where a + b is the size of the first sample and a + c is the size of the second sample.
- Note that the Lincoln-Petersen estimator gives only an estimate of the true population size and, in order to produce this estimate, a number of assumptions are made.
- More advanced techniques allow us to remove some of these assumptions, although they do require additional data to be collected.
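As a quick worked example of the Lincoln-Petersen estimator, here is a minimal Python sketch. It uses the standard form of the estimator (size of the first sample times size of the second sample, divided by the number caught in both). The function name and the counts are made up for illustration.

```python
def lincoln_petersen(a, b, c):
    """Estimate population size from a capture-recapture study.

    a: individuals caught in both samples
    b: individuals caught in the first sample only
    c: individuals caught in the second sample only
    """
    if a == 0:
        raise ValueError("the estimator is undefined when no individual is recaptured")
    first_sample_size = a + b
    second_sample_size = a + c
    return first_sample_size * second_sample_size / a

# Hypothetical counts: 10 caught twice, 40 in the first sample only,
# 30 in the second sample only.
estimate = lincoln_petersen(a=10, b=40, c=30)

# Estimated number of individuals never observed (the '?' in the notes).
unseen = estimate - (10 + 40 + 30)

print(estimate, unseen)
```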

Week 4 - Uncertainty on data

Quantifying chance

Whenever we consider how likely it is that something will happen, we are thinking about chance. Probability is the area of maths that allows us to quantify and study chance.
In probability, we are interested in the likelihood of some event happening. An event is an outcome, or set of outcomes, of an experiment, or an observation, or set of observations, of a variable.
Probability assigns to an event a number between 0 and 1, indicating how likely it is that the event will occur. A probability of 0 means that the event is impossible and cannot happen, whilst a probability of 1 means that the event will certainly happen.
We write the probability of an event A happening as P(A), for example P(A) = 1/6.
To calculate a theoretical probability for an event, we first need to know what all the possible outcomes of the experiment are, or what values the variable can possibly take. The set of all possible outcomes of an experiment, or values a variable can take, is called the sample space.
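Most events that we are interested in are much more complex than those we have seen so far. To calculate more complex probabilities, we need to be able to construct more complex events out of simpler ones: the probability that A or B happens, P(A || B), the probability that A and B both happen, P(A && B), and the probability that A does not happen, P(A’).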
Two events, A and B, are independent of one another if the probability that one of the events happens is unaffected by whether or not the other one happens. The events are dependent on one another if they are not independent of each other.
- P(A && B) = P(A) x P(B) for independent events.
- P(A || B) = P(A) + P(B) - P(A && B)
- P(A || B) = P(A) + P(B) for mutually exclusive events.
- P(A) = 1 - P(A’)
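The rules above can be checked by listing the sample space for a single roll of a fair six-sided die. This is a minimal sketch: the events chosen and the helper `prob` are assumptions made for the example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Theoretical probability: favourable outcomes / all possible outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
at_most_two = {1, 2}
above_three = {4, 5, 6}

# Independent events: P(A && B) equals P(A) x P(B).
independent = prob(even & at_most_two) == prob(even) * prob(at_most_two)

# Dependent events: the product rule does not hold.
dependent = prob(even & above_three) != prob(even) * prob(above_three)

# Addition rule: P(A || B) = P(A) + P(B) - P(A && B).
addition_rule = prob(even | above_three) == (
    prob(even) + prob(above_three) - prob(even & above_three)
)

# Mutually exclusive events have no outcomes in common.
exclusive_rule = prob(at_most_two | above_three) == (
    prob(at_most_two) + prob(above_three)
)

# Complement rule: P(A) = 1 - P(A').
complement_rule = prob(even) == 1 - prob(sample_space - even)

print(independent, dependent, addition_rule, exclusive_rule, complement_rule)
```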

Conditional probability

A conditional probability is the probability that an event occurs, given that, or if, some other event has occurred.
Given an event A and an event B, we write the probability that A occurs, given that B has already happened, as P(A|B).
We can use conditional probabilities to test whether or not two events are dependent: Two events, A and B, are independent if and only if P(A|B) = P(A) and P(B|A) = P(B).
Bayes’ Theorem can be used to calculate conditional probabilities: P(A|B) = ( P(B|A) * P(A) ) / P(B). It follows from the definition of conditional probability, P(A|B) = P(A and B) / P(B).
No medical test is perfect. There is always the risk of false positives - where someone does not have the disease, but tests positive; and false negatives - when someone tests negative for the disease, but does in fact have the disease.
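Put another way, we want to maximise the probability of a true positive, where someone who has the disease tests positive (this is called the sensitivity of the test), and the probability of a true negative, where someone who does not have the disease tests negative (this is called the specificity of the test).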
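To see how Bayes' Theorem applies to a medical test, here is a minimal Python sketch. The prevalence, sensitivity and specificity values are invented for illustration, and the function name is an assumption of this example.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), calculated with Bayes' Theorem."""
    # P(positive and disease) = P(positive | disease) * P(disease)
    true_positive = sensitivity * prevalence
    # P(positive and no disease) = P(positive | no disease) * P(no disease)
    false_positive = (1 - specificity) * (1 - prevalence)
    # P(positive) is the sum of the two ways of testing positive.
    return true_positive / (true_positive + false_positive)

# Hypothetical test: 1% of people have the disease, the test detects 90%
# of true cases and correctly clears 95% of healthy people.
p = prob_disease_given_positive(prevalence=0.01, sensitivity=0.90, specificity=0.95)

print(round(p, 3))
```

With these made-up numbers, most positive results are false positives, because the disease is rare compared with the false positive rate.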

Visualizing probability

Venn diagrams are a common way of presenting simple proportions or probabilities. Whenever we have two events, A and B, we can create a Venn diagram to illustrate the probabilities P(A), P(B), P(A && B), P(A && B’) and P(B && A’).
Probability trees are a useful way of allowing us to see and compare all the possible outcomes of an experiment. They are particularly useful when the outcomes we are interested in occur after several different events have happened, and these events are not independent of one another. Each path through the tree represents a unique outcome of the experiment. The probability on each branch is conditional on the branches that come before it on the path, and so the probability of each outcome along a given path from the initial state is calculated by multiplying together the probabilities on each branch of the given path.
A probability distribution is simply a graph showing the probability of the different outcomes of an experiment, or values of a variable, occurring.
A Bernoulli trial is simply any experiment in which there are two possible outcomes (e.g. a mother is giving birth to a single baby; there are two possibilities: the baby could be a boy, or it could be a girl, and there is a (roughly) 50% chance of each). Generally, we consider one of the outcomes of the experiment to be a “success”, and the other a “failure”. The probability of a success is called the success probability of the Bernoulli trial.
A binomial distribution can be used whenever we want to determine how likely it is that we will obtain a given number of successes in a sequence of independent and identical Bernoulli trials.
The precise binomial distribution that we use depends on the number of Bernoulli trials involved and the success probability of each Bernoulli trial. We call the number of Bernoulli trials and the success probability the parameters of the distribution. For shorthand, we write down the binomial distribution with n Bernoulli trials and success probability p as Bin(n,p).
The geometric distribution with success probability p, written Geom(p), tells us the probability that each trial (the first, the second, the third and so on) will be the first successful trial in a sequence of independent and identical Bernoulli trials with success probability p.
The Poisson distribution is commonly used when events occur at some known mean rate, and we are interested in the number of these events that occur within a given period of time.
The Pascal distribution describes the number of failures that occur in a sequence of independent and identical Bernoulli trials before a specified number of successes is reached.
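The discrete distributions above all have short formulas for the probability of each value. The following Python sketch writes them out using the standard textbook formulas. The function names are assumptions of this example, and the Pascal distribution is taken in the "number of failures before a given number of successes" form described above.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n Bernoulli trials), for Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(the first success happens on trial k), k = 1, 2, 3, ..., for Geom(p)."""
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, rate):
    """P(exactly k events in a period), when events occur at mean rate `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def pascal_pmf(failures, successes, p):
    """P(exactly `failures` failures before the `successes`-th success)."""
    return (
        math.comb(failures + successes - 1, failures)
        * p**successes
        * (1 - p) ** failures
    )

# Probability of exactly 2 boys in 4 births, assuming a 50% chance each time.
two_boys_in_four = binomial_pmf(2, 4, 0.5)

# Probability that the first boy is the third child.
first_boy_is_third = geometric_pmf(3, 0.5)

print(two_boys_in_four, first_boy_is_third)
```

With one required success, the Pascal distribution matches the geometric distribution: two failures before the first success is the same event as the first success arriving on the third trial.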

Shapes of data

The continuous probability distribution of a variable describes the probability that the variable will lie within any given range of values. The probability distribution can be described by a distribution function.
One distribution shape appeared over and over again, in applications as diverse as studying people’s heights, predicting outcomes in gambling, and estimating the error of measurements. It became known as the normal distribution.
Generally, the mean is represented by the Greek letter “mu”, μ, and the variance by “sigma squared”, σ², because the standard deviation is written as σ, and the variance is equal to the square of the standard deviation. We write the normal distribution with mean μ and variance σ² as N(μ, σ²).
The variance and the standard deviation are measures of the variability, or spread, of a variable.
The exponential distribution describes the wait time between consecutive events that happen randomly at a known rate. Unlike the normal distribution, the exponential distribution has just one parameter affecting its shape: the rate at which events occur, which we will write as the Greek letter lambda, λ. We write an exponential distribution with rate λ as Exp(λ).
The Student’s t-distribution has just one parameter, called the ‘degrees of freedom’ of the distribution. The degrees of freedom can be any positive number, although, in practice, we will usually use the t-distribution when the degrees of freedom is a whole number (we will see why later this week). We commonly write the degrees of freedom as k (it is also often denoted by ‘df’), and write a t-distribution with k degrees of freedom as t(k).
Whenever we are studying a variable, we have to think carefully about what distribution it might follow. Many natural processes follow a normal distribution, but some do not. The great advantage of having a variable that is normally distributed is that it is relatively easy to calculate probabilities from the normal distribution, as the distribution function has a known mathematical formula.
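As an illustration of calculating probabilities from a normal distribution, here is a short sketch using `NormalDist` from Python's standard `statistics` module. The mean of 165 and standard deviation of 7 are made-up values for the example, not figures from the course.

```python
from statistics import NormalDist

# Hypothetical variable: heights in cm, assumed to follow N(165, 7²).
# Note that NormalDist takes the standard deviation, not the variance.
heights = NormalDist(mu=165, sigma=7)

# Probability that a value lies below 172 (one standard deviation above the mean).
p_below_172 = heights.cdf(172)

# Probability that a value lies within one standard deviation of the mean.
p_within_one_sd = heights.cdf(172) - heights.cdf(158)

print(round(p_below_172, 4), round(p_within_one_sd, 4))
```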

Sampling distributions

Often, we are interested in some property of a population or experiment, which we call a parameter. The true parameter for the whole population is called the population parameter. Generally, we can’t gather data on an entire population, and so we would take a sample from the population, which we use to estimate the population parameter. An estimate for our parameter that we obtain from sample data is called a sample parameter.
The properties of the sample parameter can be explored by investigating the sampling distribution of the sample parameter: This is the distribution of the sample parameter for all samples of some specified size.

The central limit theorem

We have seen that whenever a variable, such as the height of a female, is normally distributed, then the distribution of the sample mean of the given variable is also normal. In fact, under some very mild conditions, whatever the distribution of a variable, providing we take large enough samples of the population (usually of size around 30 or higher), and that these samples are chosen independently of each other, then the corresponding sample mean will be approximately normally distributed.
In such cases, if a variable has mean μ and variance σ², then the distribution of the sample mean from independent samples of size n will be approximately normal with mean μ and variance σ²/n, that is, distribution N(μ, σ²/n). This is the central limit theorem.
One consequence is that, when n is large, the normal distribution N(n×p, n×p×(1−p)) approximates the binomial distribution Bin(n,p).
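The central limit theorem can be explored with a small simulation. The sketch below is only an illustration under assumed settings: it draws 5000 samples of size 30 from an exponential distribution with rate 0.5 (which has mean 2 and variance 4, and is clearly not normal), and compares the mean and variance of the sample means with the values the theorem predicts.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

rate = 0.5          # Exp(0.5): mean 1/rate = 2, variance 1/rate**2 = 4
sample_size = 30
num_samples = 5000

# Take many independent samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.expovariate(rate) for _ in range(sample_size))
    for _ in range(num_samples)
]

observed_mean = statistics.fmean(sample_means)
observed_variance = statistics.pvariance(sample_means)

# Values predicted by the central limit theorem: N(mu, sigma^2 / n).
predicted_mean = 1 / rate
predicted_variance = (1 / rate**2) / sample_size

print(round(observed_mean, 3), round(predicted_mean, 3))
print(round(observed_variance, 3), round(predicted_variance, 3))
```

A histogram of `sample_means` would look roughly bell-shaped, even though the individual values come from a very skewed distribution.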

Estimation

One of the simplest ways to estimate a property of a population is using a point estimate. This is a single estimated value of an unknown population parameter, based on observed sample data.
One tool that statisticians use to quantify the uncertainty of an estimate is called the confidence interval. Because every sample is different, every sample we might observe would give rise to a different confidence interval.
A confidence interval has an associated confidence level, and we might talk about, for example, a 90% confidence interval or 95% confidence interval. A 95% confidence interval is one that is constructed in such a way that 95% of all such confidence intervals, from all possible samples, will contain the true population parameter. The higher the confidence level, the more confident we can be that the interval contains the true population parameter, but the wider the interval has to be.
A confidence interval can be expressed as [64.8, 65.8] or 65.3 ± 0.5.
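One common way of building a confidence interval for a population mean is sketched below. This is an illustration rather than the course's exact method: it relies on the normal approximation from the central limit theorem, which is only rough for a sample as small as the made-up one used here (in practice a t-distribution is usually preferred for small samples). The function name and data are assumptions of the example.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Uses the normal approximation: point estimate ± z * standard error.
    """
    n = len(sample)
    point_estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is the value that leaves (1 - confidence) / 2 in each tail of N(0, 1).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin

# Hypothetical sample of measurements, invented for illustration.
sample = [64.1, 66.0, 65.2, 63.8, 66.5, 65.9, 64.7, 65.4, 66.2, 65.2]

low_95, high_95 = mean_confidence_interval(sample, confidence=0.95)
low_99, high_99 = mean_confidence_interval(sample, confidence=0.99)

print(f"95%: [{low_95:.2f}, {high_95:.2f}]")
print(f"99%: [{low_99:.2f}, {high_99:.2f}]")
```

The 99% interval comes out wider than the 95% interval: a higher confidence level is paid for with a less precise interval.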

Week 6 - Statistical testing

Hypotheses

A statistical hypothesis is a statement about a parameter of some population that we are interested in.
Any statistical hypothesis test begins with the statement of two hypotheses: A null hypothesis, and an alternative hypothesis.
- The null hypothesis is a statement about the parameter of interest corresponding to the ‘default position’ - the current belief about the parameter, in normal circumstances. The null hypothesis is denoted by H0.
- In a hypothesis test, we test the null hypothesis against an alternative hypothesis, which is some other hypothesis that we might suspect is true. The alternative hypothesis is often denoted by H1, or HA.
A hypothesis test can be either one-sided or two-sided.
- A hypothesis test is one-sided if the alternative hypothesis states that the population parameter is either greater than some value, or alternatively that the population parameter is less than some value.
- The hypothesis test is two-sided if the alternative hypothesis states that the population parameter is not equal to something.
A hypothesis test runs in a similar way to a legal trial, in which a defendant is considered innocent until proven guilty. The null hypothesis corresponds to the proposition that the defendant is innocent, while the alternative hypothesis corresponds to the proposition that the defendant is guilty.
1. At the start of the experiment the null hypothesis is assumed to be true.
2. Data is collected.
3. If the data is inconsistent with, or contradicts, the null hypothesis, then we conclude that the null hypothesis is not true. We say that we ‘reject the null hypothesis in favour of the alternative hypothesis’.
4. Alternatively, if the data is consistent with, and does not contradict, the null hypothesis, then we do not reject the null hypothesis in favour of the alternative hypothesis.
Note that we would never accept the null hypothesis. If our data is deemed to be consistent with the null hypothesis, then we do not have evidence against the null hypothesis, but this does not mean that we have evidence for it. It is, therefore, wrong to ever say that you would accept the null hypothesis. Instead, we generally say that we ‘do not reject’ the null hypothesis, that we ‘fail to reject’ the null hypothesis, or that we have ‘no evidence against’ the null hypothesis.
The data that we collect could be a very large set of values, and so, in order to determine whether or not it is an unlikely outcome, we need to simplify it by summarising it in some way. In order to do this, we choose a test statistic, which is simply some summary statistic of the data, such as the sample mean.
- For the data that we have collected, we can calculate the observed test statistic for that data. If the observed test statistic is within the set of the most unlikely values that we would see if the null hypothesis is true, then this is evidence against the null hypothesis, and so we reject the null hypothesis in favour of the alternative hypothesis.
- The set of most unlikely values is determined by the significance level of the hypothesis test, which is a percentage that is always specified before performing such a test. A 5% significance level means that we reject the null hypothesis in favour of the alternative hypothesis if our observed test statistic is within the 5% most unlikely possible test statistics if the null hypothesis is true.
Another way of conducting a hypothesis test is to calculate the p-value of the observed test statistic. This is the probability that, assuming the null hypothesis is true, we would obtain a value for the test statistic that is at least as extreme as the one that we have observed. We reject the null hypothesis in favour of the alternative hypothesis if our p-value is less than the significance level of the hypothesis test.
- The p-value of a sample parameter can tell us more than just whether or not to reject the null hypothesis in favour of the alternative hypothesis. It can also tell us how confident we can be in this rejection. The lower the p-value, the less likely we would be to observe a sample parameter at least as extreme as the observed value, if the null hypothesis is true.
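The p-value approach can be sketched for one simple case: a two-sided test about a population mean, where the population standard deviation is assumed to be known and the sample mean is assumed to be normally distributed. These assumptions, the function name and all the numbers below are made up for this example.

```python
import math
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """p-value for H0: mean = null_mean against H1: mean != null_mean."""
    # Under H0 the sample mean follows N(null_mean, sigma^2 / n).
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Probability of a value at least this extreme, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

significance_level = 0.05  # chosen before looking at the data

# Hypothetical study: H0 says the mean is 65; a sample of 64 values
# has mean 65.5, and the population standard deviation is taken to be 2.
p_value = two_sided_p_value(sample_mean=65.5, null_mean=65, sigma=2, n=64)

if p_value < significance_level:
    decision = "reject the null hypothesis"
else:
    decision = "do not reject the null hypothesis"

print(round(p_value, 4), decision)
```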