My personal notes on the book The Art of Data Science, by Roger D. Peng.

## 2. Epicycles of analysis

• The five activities:
1. State the question
2. Exploratory data analysis
3. Model building
4. Interpret
5. Communicate
• These five activities occur at different time scales.
• For each of these core activities, it is critical that you engage in the following steps:
• Setting expectations.
• Collecting data (information), comparing it to your expectations, and determining whether they match.
• Setting expectations: deliberately think about what you expect before you do anything.
• Comparing expectations to data: compare the results and see if they match your expectations; otherwise, iterate.
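The expectation-setting loop above can be sketched in a few lines of Python (my own illustration, not code from the book; the coffee-spend numbers are made up):

```python
import statistics

def epicycle(expected_mean, data, tolerance=1.0):
    """Compare an expectation to observed data; flag whether they match."""
    observed = statistics.mean(data)
    matches = abs(observed - expected_mean) <= tolerance
    return observed, matches

# Expectation: average daily coffee spend of about $4 (hypothetical data).
observed, ok = epicycle(4.0, [3.5, 4.2, 4.1, 3.8, 4.4])
print(observed, ok)  # the observed mean matches the expectation
```

If `ok` comes back `False`, that is the cue to iterate: revise the expectation, the data, or both.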

## 3. Stating and refining the question

• Types of questions
• Descriptive: it seeks to summarize a characteristic of a set of data.
• Exploratory: you analyze the data to see if there are patterns, trends or relationships between variables. This is also called hypothesis-generating, because you look for patterns that would support proposing a hypothesis.
• Inferential: it would be a restatement of this proposed hypothesis as a question which would be answered by analyzing a different set of data.
• Predictive: you are interested in predicting a certain value or event.
• Causal: it asks about whether changing one factor will change another factor.
• Mechanistic: how a factor change affects other factors.
• Characteristics of a good question
• The question should be of interest to your audience.
• The question should stem from a plausible framework.
• The question should be answerable.
• The question should be specific.
• Translating a question into a data problem
• Every question must be operationalized as a data analysis that leads to a result.
• Some questions do not lead to interpretable results. The typical question that fails this criterion is one that uses inappropriate data.

## 4. Exploratory data analysis

• It is the process of exploring your data.
• The most used tool for exploratory data analysis is data visualization.
• Goals
• Determine if there are any problems with the dataset.
• Determine whether the question you are asking can be answered by the data you have.
• The epicycle of analysis still applies to exploratory data analysis.
• If you do not find evidence of a signal in the data using just a single plot or analysis, then often it’s unlikely that you will find something using a more sophisticated analysis.
• Follow-up questions
• Do you have the right data?
• Do you need other data?
• Do you have the right question?
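As a toy illustration of the first EDA goal (dataset, field names, and values all made up), a quick pass over a small dataset to spot missing values and implausible ranges might look like:

```python
import statistics

# Hypothetical records: ozone readings by city, one of them missing.
records = [
    {"city": "A", "ozone": 41.0}, {"city": "B", "ozone": None},
    {"city": "C", "ozone": 36.5}, {"city": "D", "ozone": 52.0},
]

missing = [r["city"] for r in records if r["ozone"] is None]
values = [r["ozone"] for r in records if r["ozone"] is not None]
print("missing:", missing)                      # problems with the dataset?
print("n =", len(values), "mean =", statistics.mean(values))
print("range:", min(values), "-", max(values))  # any impossible values?
```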

## 5. Using models to explore your data

• A model is something we construct to help us understand the real world.
• A statistical model serves two key purposes in data analysis: to provide a quantitative summary of your data, and to impose a specific structure on the population from which the data was sampled.
• Data reduction: you take the original set of numbers contained in your dataset and transform them into a smaller set of numbers. This typically ends in a statistic that summarizes the data, such as the mean or median.
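Data reduction in code (illustrative numbers): a dataset collapsed into a handful of summary statistics:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 12]
summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": statistics.stdev(data),
}
print(summary)  # eight numbers reduced to four
```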

### 5.1 Models as expectations

• A statistical model must also impose some structure on the data.
• A statistical model provides a description of how the world works and how the data was generated.
• Normal model: it says that the randomness in a set of data can be explained by the normal distribution, as a bell-shaped curve.
• The normal distribution is fully specified by two parameters: the mean and the standard deviation.
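A quick sketch of the normal model's two parameters, using simulated data (the particular mean and standard deviation values are arbitrary):

```python
import random
import statistics

random.seed(0)
mu, sigma = 100.0, 15.0                      # the model's two parameters
sample = [random.gauss(mu, sigma) for _ in range(10_000)]

print(round(statistics.mean(sample), 1))     # close to mu
print(round(statistics.stdev(sample), 1))    # close to sigma
```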

### 5.2 Comparing model expectations to reality

• The usefulness of a model depends on how closely it mirrors the data we collect in the real world.
• We can check this by plotting the model's distribution over a histogram of the data.
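Without a plotting library, the same model-versus-data comparison can be done numerically: bin the data as a histogram would, and compare the observed count in each bin with the count the normal model predicts (a sketch on simulated standard-normal data):

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal model."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

random.seed(1)
data = [random.gauss(0, 1) for _ in range(1000)]

edges = [-3, -2, -1, 0, 1, 2, 3]
for lo, hi in zip(edges, edges[1:]):
    observed = sum(lo <= x < hi for x in data)
    expected = 1000 * (normal_cdf(hi, 0, 1) - normal_cdf(lo, 0, 1))
    print(f"[{lo:>2},{hi:>2}): observed {observed:4d}  expected {expected:6.1f}")
```

Large gaps between observed and expected counts are the numerical version of a histogram that does not match the model's curve.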

### 5.3 Reacting to data: refining our expectations

• If the model and the data do not match very well, we need to get a different model, different data, or both.
• Gamma distribution: it allows only positive values.
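A small sketch of that property, using Python's standard library gamma sampler (the shape and scale values are arbitrary):

```python
import random

random.seed(2)
# In random.gammavariate's terms, alpha is the shape and beta the scale;
# the distribution's mean is alpha * beta.
sample = [random.gammavariate(2.0, 3.0) for _ in range(5000)]

print(min(sample) > 0)                      # True: every draw is positive
print(round(sum(sample) / len(sample), 1))  # near alpha * beta = 6
```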

### 5.4 Examining linear relationships

• Linear regression is a useful statistical technique that allows us to understand linear relationships between variables of interest.
• In most real-life cases, we need to run many iterations of the data analysis process.
• If possible, try to replicate the analysis using a different, possibly independent, dataset.
• It is always important to be hyper-critical of your findings and to challenge them as much as possible.

## 6. Inference

• The goal of inference is to be able to make a statement about something that is not observed, and ideally to be able to characterize any uncertainty you have about that statement.

### 6.1 Identifying the population

• We refer to the things you cannot observe as the population, and the data we observe as the sample.
• The goal is to use the sample to make a statement about the population.
• First of all, you must figure out what the population is and which feature of the population you want to make a statement about.
• If you cannot coherently identify or describe the population, then you cannot make an inference.
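A toy version of sample-to-population inference (all values simulated): estimate the unobserved population mean from a sample, with a rough standard error:

```python
import math
import random
import statistics

random.seed(3)
population = [random.gauss(50, 10) for _ in range(100_000)]  # unobserved
sample = random.sample(population, 400)                      # what we see

estimate = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(len(sample))
print(round(estimate, 1), "+/-", round(2 * std_err, 1))  # near the true mean of 50
```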

### 6.2 Describe the sampling process

• Being able to describe the sampling process is important for determining whether the data is useful for making inferences about features of the population.

### 6.3 Describe a model for the population

• We need to have an abstract representation of how the elements of the population are related to each other.

### 6.4 Factors affecting the quality of inference

• Selection bias
• Sampling variability

## 7. Formal modelling

• Often it is useful to represent a model using mathematical notation because it is a compact notation and can be easy to interpret once you get used to it.
• The key goal of formal modelling is to develop a precise specification of your question and of how the data can be used to answer it.
• General framework: we can apply the basic epicycle of analysis to the formal modelling portion of the data analysis.
• Setting the expectations.
• Collecting information.
• Revising expectations.
• Associational analyses are those in which we look for an association between two or more features in the presence of other potentially confounding factors.
• The basic form of a model in an associational analysis will be:
$$y = \alpha + \beta \cdot x + \gamma \cdot z + \epsilon$$
• Where
• $$y$$ is the outcome.
• $$x$$ is the key predictor.
• $$z$$ is a potential confounder.
• $$\alpha$$ is the intercept (the value of $$y$$ when $$x = 0, z = 0$$).
• $$\beta$$ is the change in $$y$$ associated with a 1-unit increase in $$x$$.
• $$\gamma$$ is the change in $$y$$ associated with a 1-unit increase in $$z$$.
• $$\epsilon$$ is the independent random error.
• Prediction analyses let us use all of the information available to predict $$y$$.
• For many prediction analyses it is not possible to literally write down the model that is being used to predict because it cannot be represented using standard mathematical notation.
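For models that can be written down, the associational model above can be fit by ordinary least squares. A self-contained sketch on simulated data (the coefficients, noise level, and sample size are all made up), solving the normal equations directly:

```python
import random

random.seed(4)
alpha, beta, gamma = 1.0, 2.0, -0.5          # true coefficients (chosen here)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]   # key predictor
z = [random.gauss(0, 1) for _ in range(n)]   # potential confounder
y = [alpha + beta * xi + gamma * zi + random.gauss(0, 0.3)
     for xi, zi in zip(x, z)]

# Build A = X^T X and b = X^T y for the design matrix with columns (1, x, z).
cols = [[1.0] * n, x, z]
A = [[sum(a * b_ for a, b_ in zip(ci, cj)) for cj in cols] for ci in cols]
b = [sum(c * yk for c, yk in zip(ci, y)) for ci in cols]

# Gaussian elimination with partial pivoting to solve A * theta = b.
for i in range(3):
    p = max(range(i, 3), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, 3):
        f = A[r][i] / A[i][i]
        A[r] = [arj - f * aij for arj, aij in zip(A[r], A[i])]
        b[r] -= f * b[i]
theta = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    theta[i] = (b[i] - sum(A[i][j] * theta[j] for j in range(i + 1, 3))) / A[i][i]

print([round(t, 2) for t in theta])  # close to [alpha, beta, gamma]
```

In practice a library routine would do the fitting, but the point stands: the model is explicit enough to write down and solve, which is exactly what many prediction algorithms lack.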

## 8. Inference versus prediction

• In any data analysis, you want to ask yourself: “Am I asking an inferential question or a prediction question?” The answer can guide the entire modelling strategy.