# The Art of Data Science

My personal notes on the book The Art of Data Science, by Roger D. Peng.

## 2. Epicycles of analysis

- The five activities:
  - State the questions
  - Exploratory data analysis
  - Model building
  - Interpret
  - Communicate

- These five activities occur at different time scales.
- For each of these core activities, it is critical that you engage in the following steps:
  - Setting expectations.
  - Collecting data (information), comparing it to your expectations, and checking whether they match.
  - Revising your expectations or fixing the data so that your data and your expectations match.

- **Setting expectations:** deliberately think about what you expect before you do anything.
- **Collecting information:** collect information about your question or your data.
- **Comparing expectations to data:** compare the results and see if they match your expectations; otherwise, iterate.

## 3. Stating and refining the question

- Types of questions
  - **Descriptive:** it seeks to summarize a characteristic of a set of data.
  - **Exploratory:** you analyze the data to see if there are patterns, trends or relationships between variables. This is also called *hypothesis-generating*, because you look for patterns that would support proposing a hypothesis.
  - **Inferential:** it restates the proposed hypothesis as a question to be answered by analyzing a different set of data.
  - **Predictive:** you are interested in predicting a certain value or event.
  - **Causal:** it asks whether changing one factor will change another factor.
  - **Mechanistic:** it asks how a change in one factor affects other factors.

**Characteristics of a good question**

- The question should be of interest to your audience.
- The question has not already been answered.
- The question should stem from a plausible framework.
- The question should be answerable.
- The question should be specific.

**Translating a question into a data problem**

- Every question must be operationalized as a data analysis that leads to a result.
- Some questions do not lead to interpretable results. The typical question that fails this criterion is one that uses inappropriate data.

## 4. Exploratory data analysis

- It is the process of exploring your data.
- The most used tool for exploratory data analysis is **data visualization**.
- Goals
  - Determine if there are any problems with the dataset.
  - Determine whether the question you are asking can be answered by the data you have.
  - Develop a sketch of the answer to your question.

- The epicycle of analysis still applies to exploratory data analysis.
- If you do not find evidence of a signal in the data using just a simple plot or analysis, then it is often unlikely that you will find something using a more sophisticated analysis.

**Follow-up questions**

- Do you have the right data?
- Do you need other data?
- Do you have the right question?

## 5. Using models to explore your data

- A model is something we construct to help us understand the real world.
- A statistical model serves two key purposes in data analysis: to provide a **quantitative summary** of your data and to impose a specific **structure** on the population from which the data was sampled.
- **Data reduction:** you take the original set of numbers contained in your dataset and transform them into a smaller set of numbers. This typically ends up with a statistic, a summary of the data such as the mean or the median.
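As a small illustration of data reduction, the sketch below turns a list of numbers into a handful of summary statistics. The values and the `summarize` helper are made up for this note; they are not from the book.

```python
import statistics

# Hypothetical sample: nine measurements, one of them an outlier.
values = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.2, 12.5, 4.0]

def summarize(data):
    """Reduce a list of numbers to a few summary statistics."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }

summary = summarize(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Different reductions keep different information: here the mean is pulled upwards by the single large value, while the median is not.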

### 5.1 Models as expectations

- A statistical model must also impose some structure on the data.
- A statistical model provides a description of how the world works and how the data was generated.
- **Normal model:** it says that the randomness in a set of data can be explained by the *normal distribution*, a bell-shaped curve.
- The normal distribution is fully specified by two parameters: the mean and the standard deviation.
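A minimal sketch of how a normal model turns two numbers into concrete expectations, using `statistics.NormalDist` from the Python standard library. The observations are hypothetical and only serve to illustrate the idea.

```python
import statistics

# Hypothetical observations (for example, heights in cm); not from the book.
observations = [162.0, 171.5, 168.2, 175.9, 166.4, 170.1, 173.3, 164.8]

# The two parameters that fully specify the normal model.
mu = statistics.mean(observations)
sigma = statistics.stdev(observations)
model = statistics.NormalDist(mu, sigma)

# Expectations the model sets about data we have not seen yet.
p_within_one_sd = model.cdf(mu + sigma) - model.cdf(mu - sigma)
p_above_180 = 1 - model.cdf(180.0)

print(f"mean = {mu:.2f}, sd = {sigma:.2f}")
print(f"P(within one sd of the mean) = {p_within_one_sd:.3f}")
print(f"P(value > 180) = {p_above_180:.3f}")
```

Once the mean and standard deviation are fixed, every probability statement follows from the model, which is what makes it usable as an expectation to compare against data.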

### 5.2 Comparing model expectations to reality

- The usefulness of a model depends on how closely it mirrors the data we collect in the real world.
- We can check this by drawing the model's curve on top of a histogram of the data.
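The comparison can also be done numerically. The sketch below is a text-only stand-in for the histogram overlay: for each bin it compares the observed count with the count a fitted normal model expects. The data is simulated (right-skewed, exponential) purely as an assumption for illustration, and `compare_bins` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed data: exponential with mean 10 (an assumption).
data = [random.expovariate(1 / 10) for _ in range(1000)]

# Fit a normal model using the sample mean and standard deviation.
model = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))

def compare_bins(data, model, edges):
    """Return (low, high, observed, expected) for each histogram bin."""
    rows = []
    n = len(data)
    for low, high in zip(edges[:-1], edges[1:]):
        observed = sum(low <= x < high for x in data)
        expected = n * (model.cdf(high) - model.cdf(low))
        rows.append((low, high, observed, expected))
    return rows

edges = [0, 5, 10, 15, 20, 30, 40]
rows = compare_bins(data, model, edges)
for low, high, observed, expected in rows:
    print(f"[{low:>2}, {high:>2}): observed = {observed:4d}, expected = {expected:7.1f}")

# The normal model also expects values below zero, which this data cannot contain.
observed_negative = sum(x < 0 for x in data)
expected_negative = len(data) * model.cdf(0)
print(f"below 0: observed = {observed_negative}, expected = {expected_negative:.1f}")
```

Large gaps between the observed and expected columns, or mass expected where no data can exist, are signs that the model does not mirror the data well.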

### 5.3 Reacting to data: refining our expectations

- If the model and the data do not match very well, we need to get a different model, different data, or both.
- **Gamma distribution:** it allows only positive values.
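A sketch of what switching models can look like for positive, skewed data. The data is simulated with `random.gammavariate`, and the gamma parameters are estimated by the method of moments; both choices are assumptions of this note, not the book's procedure.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical positive, right-skewed data simulated from a gamma distribution.
true_shape, true_scale = 2.0, 5.0
data = [random.gammavariate(true_shape, true_scale) for _ in range(5000)]

# Method-of-moments estimates, using mean = shape * scale
# and variance = shape * scale**2.
mean = statistics.mean(data)
variance = statistics.variance(data)
shape_hat = mean ** 2 / variance
scale_hat = variance / mean

# A normal model fitted to the same data puts some probability below zero;
# a gamma model puts none there by construction.
normal_model = statistics.NormalDist(mean, variance ** 0.5)
p_negative_normal = normal_model.cdf(0)

print(f"estimated shape = {shape_hat:.2f}, estimated scale = {scale_hat:.2f}")
print(f"P(value < 0) under the normal model = {p_negative_normal:.3f}")
```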

### 5.4 Examining linear relationships

- **Linear regression** is a useful statistical technique that allows us to understand linear relationships between variables of interest.
- In most real-life cases, we need to run many iterations of the data analysis process.
- If possible, try to replicate the analysis using a different, possibly independent, dataset.

**It is always important to be hyper-critical of your findings and to challenge them as much as possible.**
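A minimal sketch of fitting a straight line by ordinary least squares and then replicating the fit on a second, independently simulated dataset. The true intercept and slope, the noise level and the helper names are assumptions chosen for illustration.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

def simulate(n):
    """Hypothetical data: y = 1.0 + 0.5 * x plus random noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [1.0 + 0.5 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# First analysis.
intercept, slope = fit_line(*simulate(200))

# Replication on a different, independent dataset.
intercept_rep, slope_rep = fit_line(*simulate(200))

print(f"original:    intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"replication: intercept = {intercept_rep:.2f}, slope = {slope_rep:.2f}")
```

If the two fits disagree by more than the noise would explain, that is a reason to challenge the original finding.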

## 6. Inference

- The goal of inference is to be able to make a statement about something that *is not observed*, and ideally to be able to characterize any uncertainty you have about that statement.

### 6.1 Identifying the population

- We refer to the things you cannot observe as the **population**, and the data we observe as the **sample**.
- The goal is to use the sample to make a statement about the population.
- First of all, you must figure out what the population is and which feature of the population you want to make a statement about.
- If you cannot coherently identify or describe the population, then you cannot make an inference.

### 6.2 Describe the sampling process

- Being able to describe the sampling process is important for determining whether the data is useful for making inferences about features of the population.

### 6.3 Describe a model for the population

- We need to have an abstract representation of how the elements of the population are related to each other.

### 6.4 Factors affecting the quality of inference

- **Selection bias**
- **Sampling variability**
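Both factors can be seen in a small simulation. The sketch below builds a hypothetical population, then estimates its mean from random samples of two sizes (sampling variability) and from samples that can only reach part of the population (selection bias). The population, the threshold of 20 and the helper `sample_means` are assumptions made for illustration.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 positive values.
population = [random.gammavariate(2.0, 15.0) for _ in range(100_000)]
population_mean = statistics.mean(population)

def sample_means(source, size, repeats=200):
    """Draw `repeats` simple random samples of `size` and return their means."""
    return [statistics.mean(random.sample(source, size)) for _ in range(repeats)]

# Sampling variability: each random sample gives a slightly different estimate,
# and the spread of the estimates shrinks as the sample size grows.
means_small = sample_means(population, 25)
means_large = sample_means(population, 400)

# Selection bias: only units above a threshold can enter the sample,
# so the estimate stays off target however large the sample is.
reachable = [x for x in population if x > 20.0]
means_biased = sample_means(reachable, 400)

print(f"population mean: {population_mean:.2f}")
print(f"n = 25:  spread of estimates = {statistics.stdev(means_small):.2f}")
print(f"n = 400: spread of estimates = {statistics.stdev(means_large):.2f}")
print(f"biased sampling: average estimate = {statistics.mean(means_biased):.2f}")
```

A larger sample reduces sampling variability but does nothing about selection bias, which is why the sampling process has to be described before making an inference.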

## 7. Formal modelling

- Often it is useful to represent a model using mathematical notation because it is a compact notation and can be easy to interpret once you get used to it.
- The key goal of formal modelling is to develop a precise specification of your question and how data can be used to answer it.
- **General framework:** we can apply the basic epicycle of analysis to the formal modelling portion of the data analysis.
  - Setting the expectations.
  - Collecting information.
  - Revising expectations.

- **Associational analyses** are those in which we look for an association between two or more features in the presence of other potentially confounding factors.
- The basic form of a model in an associational analysis will be:

\(y = \alpha + \beta \cdot x + \gamma \cdot z + \epsilon\)

- Where
  - \(y\) is the outcome.
  - \(x\) is the key predictor.
  - \(z\) is a potential confounder.
  - \(\alpha\) is the intercept (the value of \(y\) when \(x = 0, z = 0\)).
  - \(\beta\) is the change in \(y\) associated with a 1-unit increase in \(x\).
  - \(\gamma\) is the change in \(y\) associated with a 1-unit increase in \(z\).
  - \(\epsilon\) is the independent random error.
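To see why the confounder term matters, the sketch below simulates data from \(y = \alpha + \beta \cdot x + \gamma \cdot z + \epsilon\) with \(x\) correlated with \(z\), then estimates \(\beta\) with and without adjusting for \(z\). The parameter values, the simulated data and the small least-squares helpers are assumptions for illustration; in practice a statistics library would do the fitting.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Assumed "true" parameter values for the simulation.
ALPHA, BETA, GAMMA = 2.0, 1.5, -3.0
n = 2000
zs = [random.gauss(0, 1) for _ in range(n)]
xs = [0.8 * z + random.gauss(0, 1) for z in zs]  # x is correlated with z
ys = [ALPHA + BETA * x + GAMMA * z + random.gauss(0, 1) for x, z in zip(xs, zs)]

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, k))
        coef[r] = (m[r][k] - tail) / m[r][r]
    return coef

def least_squares(columns, ys):
    """Least-squares coefficients via the normal equations; intercept comes first."""
    design = [[1.0] * len(ys)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in design]
    return solve(xtx, xty)

# Adjusted model: y on x and z.
alpha_hat, beta_hat, gamma_hat = least_squares([xs, zs], ys)

# Unadjusted model: y on x only, ignoring the confounder.
_, beta_unadjusted = least_squares([xs], ys)

print(f"beta with z in the model:    {beta_hat:.2f}")
print(f"beta without z in the model: {beta_unadjusted:.2f}")
```

In this simulation the unadjusted estimate of \(\beta\) lands far from the value used to generate the data, because part of the effect of \(z\) is attributed to \(x\).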

- **Prediction analyses** let you use all of the information available to predict \(y\).
- For many prediction analyses it is not possible to literally write down the model that is being used to predict because it cannot be represented using standard mathematical notation.
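One example of a predictor that is a procedure rather than a formula is k-nearest neighbours. This is my own illustration, not a method taken from the book; the training data and the choice of `k` are assumptions.

```python
import random

random.seed(5)  # fixed seed so the sketch is reproducible

# Hypothetical training data: a nonlinear relationship plus noise.
train_x = [random.uniform(0, 10) for _ in range(500)]
train_y = [(x - 5) ** 2 + random.gauss(0, 1) for x in train_x]

def knn_predict(x_new, xs, ys, k=15):
    """Predict y at x_new as the mean outcome of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

prediction_at_2 = knn_predict(2.0, train_x, train_y)
prediction_at_5 = knn_predict(5.0, train_x, train_y)

print(f"prediction at x = 2: {prediction_at_2:.2f}")
print(f"prediction at x = 5: {prediction_at_5:.2f}")
```

The "model" here is the training data together with the averaging rule; there are no coefficients such as \(\beta\) to interpret, only predictions to evaluate.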

## 8. Inference versus prediction

- In any data analysis, you want to ask yourself “Am I asking an inferential question or a prediction question?” The answer to this question can guide the entire modelling strategy.

## 9. Interpreting your results

**Principles of interpretation**

- Revisit your original question.
- Focus on the nature of the result: its directionality, magnitude and uncertainty.
- Develop an overall interpretation based on the totality of your analysis and the context of what is already known about the subject matter.
- Consider the implications of the result.

## 10. Communication

- Communication is both one of the tools of data analysis, and also its final product.