The Art of Data Science

My personal notes on the book The Art of Data Science, by Roger D. Peng and Elizabeth Matsui.

2. Epicycles of analysis

  • The five activities:
    1. State the questions
    2. Exploratory data analysis
    3. Model building
    4. Interpret
    5. Communicate
  • These five activities occur at different time scales.
  • For each of these core activities, it is critical that you engage in the following steps:
    • Setting expectations.
    • Collecting data (information) and comparing it to your expectations to see whether they match.
    • Revising your expectations or fixing the data so that the two match.
  • Setting expectations: deliberately think about what you expect before you do anything.
  • Collecting information: collect information about your question or your data.
  • Comparing expectations to data: compare the results and see if they match your expectations; otherwise, iterate.

3. Stating and refining the question

  • Types of questions
    • Descriptive: it seeks to summarize a characteristic of a set of data.
    • Exploratory: you analyze the data to see if there are patterns, trends or relationships between variables. This is also called hypothesis-generating, because you look for patterns that would support proposing a hypothesis.
    • Inferential: it would be a restatement of this proposed hypothesis as a question which would be answered by analyzing a different set of data.
    • Predictive: you are interested in predicting a certain value or event.
    • Causal: it asks about whether changing one factor will change another factor.
    • Mechanistic: it asks how a change in one factor affects other factors.
  • Characteristics of a good question
    • The question should be of interest to your audience.
    • The question has not already been answered.
    • The question should stem from a plausible framework.
    • The question should be answerable.
    • The question should be specific.
  • Translating a question into a data problem
    • Every question must be operationalized as a data analysis that leads to a result.
    • Some questions do not lead to interpretable results. The typical question that fails this criterion is one that uses inappropriate data.

4. Exploratory data analysis

  • It is the process of exploring your data.
  • The most used tool for exploratory data analysis is data visualization.
  • Goals
    • Determine if there are any problems with the dataset.
    • Determine whether the question you are asking can be answered by the data you have.
    • Develop a sketch of the answer to your question.
  • The epicycle of analysis still applies to exploratory data analysis.
  • If you do not find evidence of a signal in the data using just a simple plot or analysis, then it is often unlikely that you will find something using a more sophisticated analysis.
  • Follow-up questions
    • Do you have the right data?
    • Do you need other data?
    • Do you have the right question?

5. Using models to explore your data

  • A model is something we construct to help us understand the real world.
  • A statistical model serves two key purposes in data analysis: to provide a quantitative summary of your data and to impose a specific structure on the population from which the data was sampled.
  • Data reduction: you take the original set of numbers contained in your dataset and transform them into a smaller set of numbers. This typically ends with a statistic, a summary of the data such as the mean or the median.
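
As a small illustration of data reduction, the sketch below collapses a handful of numbers into a few summary statistics. The values in `readings` are made up for this example and do not come from the book, which works in R rather than Python.

```python
import statistics

# Hypothetical sample (made-up values, used only for illustration).
readings = [31.0, 28.5, 40.2, 35.1, 29.9, 33.3, 90.0]

# Reduce seven numbers to four summary statistics.
summary = {
    "n": len(readings),
    "mean": statistics.fmean(readings),
    "median": statistics.median(readings),
    "stdev": statistics.stdev(readings),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the single large value pulls the mean above the median, which is the kind of detail a summary can reveal or hide depending on which statistic is chosen.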

5.1 Models as expectations

  • A statistical model must also impose some structure on the data.
  • A statistical model provides a description of how the world works and how the data was generated.
  • Normal model: it says that the randomness in a set of data can be explained by the normal distribution, which has a bell-shaped curve.
    • The normal distribution is fully specified by two parameters: the mean and the standard deviation.
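
One way to see a normal model as a set of expectations is to ask what it predicts before looking at the data. The sketch below uses Python's `statistics.NormalDist`; the mean and standard deviation are hypothetical values chosen for illustration, and `expected_fraction` is a helper name invented here.

```python
from statistics import NormalDist

# Hypothetical parameters; in practice they would be estimated from the data.
model = NormalDist(mu=17.2, sigma=8.8)

def expected_fraction(dist, low, high):
    """Fraction of observations the model expects to fall in [low, high]."""
    return dist.cdf(high) - dist.cdf(low)

# Expectation 1: about 68% of values lie within one standard deviation.
within_one_sd = expected_fraction(
    model, model.mean - model.stdev, model.mean + model.stdev
)

# Expectation 2: the model assigns some probability to negative values,
# which matters if the quantity being measured cannot be negative.
below_zero = model.cdf(0.0)

print(f"within one sd: {within_one_sd:.3f}")
print(f"below zero:    {below_zero:.3f}")
```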

5.2 Comparing model expectations to reality

  • The usefulness of a model depends on how closely it mirrors the data we collect in the real world.
  • We can check this by overlaying the model's curve on a histogram of the data.
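
The same comparison can be made without a plot by counting how many observations fall in each histogram bin and setting that against what a fitted normal model expects. This is only a sketch: the data values and the bin edges are made up for illustration.

```python
from statistics import NormalDist

# Hypothetical right-skewed sample (made-up values).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 35]

# Fit the normal model: estimate its mean and standard deviation from the data.
model = NormalDist.from_samples(data)

# Bin edges chosen by hand for this example.
edges = [0, 5, 10, 20, 40]
rows = []
for low, high in zip(edges, edges[1:]):
    observed = sum(low <= x < high for x in data)
    expected = len(data) * (model.cdf(high) - model.cdf(low))
    rows.append((low, high, observed, expected))

for low, high, observed, expected in rows:
    print(f"[{low:>2}, {high:>2}): observed={observed:>2}  expected={expected:5.1f}")
```

Large gaps between the observed and expected columns suggest the model does not mirror the data well.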

5.3 Reacting to data: refining our expectations

  • If the model and the data do not match very well, we need to get a different model, different data, or both.
  • Gamma distribution: it allows only positive values.
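
A rough sketch of swapping the normal model for a gamma model follows. It fits the gamma distribution by the method of moments, which is an assumption made here for simplicity and not necessarily the fitting method the book uses. The data values and the function names `fit_gamma_moments` and `gamma_pdf` are hypothetical.

```python
import math
import statistics

def fit_gamma_moments(data):
    """Method-of-moments estimates: mean = shape * scale, var = shape * scale**2."""
    mean = statistics.fmean(data)
    var = statistics.variance(data)  # sample variance
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale

def gamma_pdf(x, shape, scale):
    """Gamma density; zero for non-positive x, so only positive values are allowed."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)) / (
        math.gamma(shape) * scale ** shape
    )

# Hypothetical right-skewed, strictly positive sample (made-up values).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 35]
shape, scale = fit_gamma_moments(data)

print(f"shape={shape:.3f}, scale={scale:.3f}")
print("density at -1:", gamma_pdf(-1, shape, scale))
print("density at  5:", gamma_pdf(5, shape, scale))
```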

5.4 Examining linear relationships

  • Linear regression is a useful statistical technique that allows us to understand linear relationships between variables of interest.
  • In most real-life cases, we need to run many iterations of the data analysis process.
  • If possible, try to replicate the analysis using a different, possibly independent, dataset.
  • It is always important to be hyper-critical of your findings and to challenge them as much as possible.
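
To make the idea of a linear relationship concrete, here is a minimal least-squares fit of a straight line with one predictor. The `x` and `y` values are invented for illustration, and `fit_line` is a hypothetical helper written from the textbook closed-form formulas.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying close to a line with slope 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Rerunning `fit_line` on a second, independent dataset and comparing the slopes is one simple way to try to replicate a finding.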

6. Inference

  • The goal of inference is to be able to make a statement about something that is not observed, and ideally to be able to characterize any uncertainty you have about that statement.

6.1 Identifying the population

  • We refer to the things we cannot observe as the population, and to the data we do observe as the sample.
    • The goal is to use the sample to make a statement about the population.
    • First of all, you must figure out what the population is and which feature of the population you want to make a statement about.
    • If you cannot coherently identify or describe the population, then you cannot make an inference.

6.2 Describe the sampling process

  • Being able to describe the sampling process is important for determining whether the data is useful for making inferences about features of the population.

6.3 Describe a model for the population

  • We need to have an abstract representation of how the elements of the population are related to each other.

6.4 Factors affecting the quality of inference

  • Selection bias
  • Sampling variability
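
Both factors can be illustrated with a small simulation. Everything below is an assumption made for the sketch: the population is synthetic, and the selection rule (only units above 55 can be sampled) is invented to show how a biased sampling process behaves.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population with a true mean near 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

# Sampling variability: random samples of 50 give different means each
# time, scattered around the population mean.
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(200)
]
spread = statistics.stdev(sample_means)

# Selection bias: if only units above 55 can be sampled, every sample
# mean is too high, and taking more samples does not fix it.
selected = [value for value in population if value > 55]
biased_means = [
    statistics.fmean(rng.sample(selected, 50)) for _ in range(200)
]

print(f"population mean:           {pop_mean:.2f}")
print(f"mean of random samples:    {statistics.fmean(sample_means):.2f}")
print(f"spread of random samples:  {spread:.2f}")
print(f"mean of biased samples:    {statistics.fmean(biased_means):.2f}")
```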

7. Formal modelling

  • Often it is useful to represent a model using mathematical notation because it is a compact notation and can be easy to interpret once you get used to it.
  • The key goal of formal modelling is to develop a precise specification of your question and how data can be used to answer it.
  • General framework: we can apply the basic epicycle of analysis to the formal modelling portion of the data analysis.
    • Setting the expectations.
    • Collecting information.
    • Revising expectations.
  • Associational analyses are those in which we look for an association between two or more features in the presence of other potentially confounding factors.
    • The basic form of a model in an associational analysis will be:
      \(y = \alpha + \beta \cdot x + \gamma \cdot z + \epsilon\)
    • Where
      • \(y\) is the outcome.
      • \(x\) is the key predictor.
      • \(z\) is a potential confounder.
      • \(\alpha\) is the intercept (the value of \(y\) when \(x = 0, z = 0\)).
      • \(\beta\) is the change in \(y\) associated with a 1-unit increase in \(x\).
      • \(\gamma\) is the change in \(y\) associated with a 1-unit increase in \(z\).
      • \(\epsilon\) is the independent random error.
  • Prediction analyses let us use all of the available information to predict \(y\).
    • For many prediction analyses it is not possible to literally write down the model that is being used to predict because it cannot be represented using standard mathematical notation.
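
The associational model \(y = \alpha + \beta \cdot x + \gamma \cdot z + \epsilon\) can be fitted by least squares. The sketch below simulates data in which \(z\) confounds the relationship between \(x\) and \(y\), then compares the estimate of \(\beta\) with and without adjusting for \(z\). The true coefficients (1, 2, 3), the simulated data, and the helper names `centered_cross` and `fit_two_predictors` are all assumptions made for this illustration.

```python
import random
import statistics

def centered_cross(u, v):
    """Sum of cross-products of u and v after centering each at its mean."""
    u_bar, v_bar = statistics.fmean(u), statistics.fmean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

def fit_two_predictors(x, z, y):
    """Least squares for y = alpha + beta * x + gamma * z (2x2 normal equations)."""
    sxx, szz, sxz = centered_cross(x, x), centered_cross(z, z), centered_cross(x, z)
    sxy, szy = centered_cross(x, y), centered_cross(z, y)
    det = sxx * szz - sxz ** 2
    beta = (szz * sxy - sxz * szy) / det
    gamma = (sxx * szy - sxz * sxy) / det
    alpha = statistics.fmean(y) - beta * statistics.fmean(x) - gamma * statistics.fmean(z)
    return alpha, beta, gamma

# Simulated data: z influences both x and y, so it confounds x and y.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
y = [1 + 2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

alpha, beta, gamma = fit_two_predictors(x, z, y)

# Slope of y on x alone, ignoring the confounder.
unadjusted = centered_cross(x, y) / centered_cross(x, x)

print(f"adjusted:   alpha={alpha:.2f}, beta={beta:.2f}, gamma={gamma:.2f}")
print(f"unadjusted: beta={unadjusted:.2f}")
```

Leaving \(z\) out of the model lets its effect leak into the coefficient of \(x\), which is why the unadjusted slope overstates \(\beta\) in this simulation.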

8. Inference versus prediction

  • In any data analysis, you want to ask yourself “Am I asking an inferential question or a prediction question?”. The answer to this question can guide the entire modelling strategy.

9. Interpreting your results

  • Principles of interpretation
    1. Revisit your original question.
    2. Focus on the nature of the result: its directionality, magnitude and uncertainty.
    3. Develop an overall interpretation based on the totality of your analysis and the context of what is already known about the subject matter.
    4. Consider the implications of the result.

10. Communication

  • Communication is both one of the tools of data analysis, and also its final product.