Probability and statistics
A study guide through simulation (WIP)
This study guide is designed as a complementary resource to “Simulation of economic processes”. It provides additional references for studying and practicing the fundamentals of probability and statistics.
It’s possible you didn’t enjoy your statistics classes. That was my case, despite a passion for the field. In this journey of decision science, we have to give statistics another chance, but with a dramatically different learning strategy. We’ll heavily use simulations, stories, and case-studies which highlight the practical relevance of key theoretical ideas.1 Speegle’s Probability, Statistics, and Data: A fresh approach using R and Çetinkaya-Rundel’s Introduction to Modern Statistics are excellent examples of this new way of teaching.
1 Just flip through a modern book which takes a similar approach to get a sense of how different it looks from traditional teaching.
There is absolutely no shame in going back to a subject again with a mindset of mastering the fundamentals. For me, this exercise turned out to be one of the most valuable things I’ve done.
Probability theory
First, read the introductory chapter in which I explain how probability fits in the larger context of decision science and how we can use simulation to learn and appreciate it better. I highly recommend these lectures by Santosh A. Venkatesh (SV) and Joseph Blitzstein’s Statistics 110 (Probability) course (JB) as your main references for probability theory.
Combinatorics and sampling
In the first lab, where we simulated the birthday problem, I mentioned the importance of knowing the stories behind core concepts in combinatorics, like \({n \choose k}\) “n choose k”, factorial, falling factorial, and the multiplication rule. They translate into urn models and sampling procedures (ordered, unordered, with and without replacement).
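To make this concrete, here is a minimal sketch (not the lab’s own code) that checks a Monte Carlo estimate of the birthday-match probability against the exact answer obtained from the falling factorial and the multiplication rule:

```python
import math
import random

def birthday_match_prob_exact(n_people: int, n_days: int = 365) -> float:
    """P(at least two people share a birthday), via the falling factorial."""
    # P(no match) = n_days * (n_days - 1) * ... * (n_days - n_people + 1) / n_days^n_people
    p_no_match = math.perm(n_days, n_people) / n_days**n_people
    return 1 - p_no_match

def birthday_match_prob_sim(n_people: int, n_days: int = 365, n_trials: int = 100_000) -> float:
    """Monte Carlo estimate: sample birthdays with replacement, check for duplicates."""
    hits = 0
    for _ in range(n_trials):
        birthdays = [random.randrange(n_days) for _ in range(n_people)]
        hits += len(set(birthdays)) < n_people
    return hits / n_trials

print(birthday_match_prob_exact(23))   # ~0.507
print(birthday_match_prob_sim(23))     # close to the exact value
```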
The first three lectures from SV and JB should be more than sufficient as a review of combinatorics and naive probability.2 In the latter you will see how urn models were successfully applied in modeling different phenomena in physics.
2 Pay close attention to their story proofs and the effort they put in to get you over the fear of mathematical abstraction
In S. Venkatesh, Tableau 3, there is an elegant proof of unordered sampling with replacement, which was used by Bose and Einstein to describe the behavior of bosons. The result will be useful to us when we introduce the bootstrap; in the meantime, you can check out a neat explanation by Riaz Khan.
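If you want to convince yourself of the unordered-sampling-with-replacement count without the proof, a brute-force enumeration is enough for small cases; the sketch below (illustrative, not from SV) compares it with the closed form \({n + k - 1 \choose k}\):

```python
import math
from itertools import combinations_with_replacement

# Count multisets of size k drawn from n distinguishable "boxes" (unordered, with
# replacement) and compare with the Bose-Einstein closed form C(n + k - 1, k).
n, k = 5, 3
enumerated = sum(1 for _ in combinations_with_replacement(range(n), k))
closed_form = math.comb(n + k - 1, k)
print(enumerated, closed_form)  # both 35
```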
I understand if your reaction at this point is: “Are you kidding me, this is just a warm-up?”. I would like to convince you that the time investment is not that big and the payoff is huge. You will encounter fun stories and applications in multiple domains, get comfortable with mathematical abstraction, and see that probability doesn’t have to be a chore or just a prerequisite to get out of the way.
Probability Triple and Random Variables
Read the second lecture, which is a bit abstract and theoretical, but essential for understanding probability. Once you understand how the probability triple is constructed and why we need the abstraction of random variables, the rest of the probabilistic machinery will start making much more sense.
Some of the biggest debates in science, spanning decades and causing much confusion and controversy, could have been resolved much more quickly with an explicit distinction between Scientific Hypotheses, Process Models, and Statistical Models or estimators.
I highly recommend this first lecture by Richard McElreath, where he shows how tricky it can be to map the correspondences between these three.
This is a good place for you to understand the distinction between collectivity (physical structure / entities), statistical population, and sampling procedures.3 A second distinction concerns the data generating process: parameters (estimands), estimators, and estimates (statistics). Never confuse these!
Binomial distribution. Simulation
In the second lab, I introduce the Bernoulli (coin flip) and Binomial distributions with a story of people booking and showing up for a safari trip. We will see how our conclusions and calculations change when people do not decide to show up independently.
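Here is a rough sketch of the kind of simulation the lab builds on, with illustrative numbers rather than the lab’s own: independent show-ups follow a Binomial distribution, while a simple form of dependence (people booking in pairs that show up together) keeps the mean the same but fattens the tails:

```python
import numpy as np

rng = np.random.default_rng(42)
bookings, capacity, p_show = 60, 55, 0.85  # illustrative numbers, not the lab's
n_sims = 100_000

# Independent show-ups: the number who show up is Binomial(bookings, p_show).
shows_indep = rng.binomial(bookings, p_show, size=n_sims)

# Dependent show-ups: people book in pairs and each *pair* shows up together.
pairs = bookings // 2
shows_pairs = 2 * rng.binomial(pairs, p_show, size=n_sims)

print("P(overbooked), independent:", np.mean(shows_indep > capacity))
print("P(overbooked), pairs      :", np.mean(shows_pairs > capacity))
# Same expected attendance, but the dependent version has fatter tails,
# so the probability of running out of seats is noticeably larger.
```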
You might’ve heard about the hot hand fallacy. We model streaks of successful shots in basketball and ask whether streaks of length \(k\) are surprising. Read the original paper by A. Tversky and ignore, for now, the statistical test they designed.
Their insight that fans and experts in basketball overestimate the magnitude of this effect is fundamentally correct. However, what doesn’t follow from their research is that there is no effect at all, as a few Bayesian statisticians have pointed out in recent years. The debate goes on, but the limitations of the original statistical test are undeniable.
This is a true rabbit hole. Right now, we have the tools to simulate what we can expect if there were no effect or only a small one, but we can’t design a statistical test yet. For that, refer to the last section in Tableau 10 of SV, A. Tversky’s original paper, and recent papers which challenge the statistical methodology and modeling approach based on which the original conclusion was reached.4
4 You can look at this article by Kevin Ross for a good overview of the original permutation test and of the ongoing debate
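Before diving into tests, a small simulation (with made-up shooting numbers) already shows why our intuition about streaks is poor under the no-effect model:

```python
import numpy as np

rng = np.random.default_rng(1)

def longest_streak(shots: np.ndarray) -> int:
    """Length of the longest run of made shots (1s) in a 0/1 sequence."""
    best = current = 0
    for made in shots:
        current = current + 1 if made else 0
        best = max(best, current)
    return best

# A 50% shooter taking 100 shots, with NO hot hand by construction.
n_shots, p_hit, n_sims = 100, 0.5, 20_000
streaks = [longest_streak(rng.random(n_shots) < p_hit) for _ in range(n_sims)]

print("P(longest streak >= 7):", np.mean(np.array(streaks) >= 7))
# Long streaks are surprisingly common under pure chance, which is why an
# "impressive" run is weak evidence of a hot hand on its own.
```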
Despite the ongoing debate about the magnitude of the effect in the hot hand phenomenon, it is undeniable that such simulations are extremely valuable – they prevent us from being fooled by randomness. Let’s look at a few more examples in which simulation can be valuable in figuring out what’s going on:
- Amateur traders might see “winning” patterns in stock markets which can be explained by a random walk (see the sketch after this list). Be suspicious any time you see a “technical analysis”
- Policy-makers might not realize that amazing SAT performances in some schools with few students could be just due to small sample sizes
- In chess, some might attribute observed gender gaps in top Elo ratings to nonsensical explanations, when a simple simulation shows that an initial disparity in the number of players, combined with looking only at the top k players overall, produces similar outcomes
- Pollsters and political scientists might underestimate the support for a political party due to differential non-response
- Medical research might overestimate the effectiveness of a treatment or underestimate side-effects due to drop-out and right-censoring
- We might think there is an association between variables, when both can be explained by a common cause, a confounder
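For the first bullet, here is the promised sketch: a handful of simulated random walks that contain no exploitable structure, yet readily invite “pattern” stories (matplotlib is only used for plotting):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# A purely random "price": cumulative sum of i.i.d. daily returns.
# Our eyes readily see "trends", "support levels", and "breakouts"
# in noise that contains no exploitable pattern at all.
n_days, n_paths = 250, 5
returns = rng.normal(loc=0.0, scale=0.01, size=(n_paths, n_days))
prices = 100 * np.exp(np.cumsum(returns, axis=1))

for path in prices:
    plt.plot(path)
plt.xlabel("trading day")
plt.ylabel("simulated price")
plt.title("Random walks that look like 'chart patterns'")
plt.show()
```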
Three core theorems: CLT, LLN, Uniform
In the third lecture, I introduce the most important theorems in probability: the central limit theorem, the law of large numbers, and the universality of the uniform. The latter two justify why simulation works and give us an immediate practical tool for numerical integration. Moreover, we’ll get to know the three types of convergence these theorems refer to.
In both the lecture and lab, I use simulation to show graphically how the theorems work. We also get to know the stories and use-cases behind common distributions used in statistical modeling and the relationships between them.5 The accept-reject method will serve us as a useful tool for sampling from complex distributions (including mixtures) not available out-of-the-box in R or Python.
5 You will also encounter the empirical cumulative distribution function (ECDF), which will be a very important tool in nonparametric approaches to statistics
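As a teaser for the accept-reject method, here is a minimal sketch (with an illustrative Beta-mixture target, not one from the lab) that uses a uniform proposal and a bound M on the target density:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def target_pdf(x):
    """A two-component Beta mixture (illustrative, not from the lab)."""
    return 0.6 * stats.beta(2, 8).pdf(x) + 0.4 * stats.beta(8, 2).pdf(x)

# Accept-reject with a Uniform(0, 1) proposal: accept x ~ U(0, 1) with
# probability target_pdf(x) / M, where M bounds the target density.
M = 4.0  # any M >= max of target_pdf on [0, 1] works; 4 is a safe bound here

def sample_target(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0, 1)
        if rng.uniform(0, M) < target_pdf(x):
            samples.append(x)
    return np.array(samples)

draws = sample_target(10_000)
print(draws.mean(), 0.6 * 2 / 10 + 0.4 * 8 / 10)  # simulated vs exact mean (0.44)
```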
I think that “The most dangerous equation” is a must read for anyone, not just practicing scientists and statisticians. The example I use for simulation is about the dubious U.S. policy in the ’90s of splitting big schools into smaller ones. The key theoretical idea we will encounter for the rest of our journey is the sampling distribution of the sample mean (or of any other estimator) – and we will always have to worry about sample size, especially when comparing groups.
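A compact version of the schools simulation (with illustrative numbers for sizes and scores) shows the mechanism: when every school draws students from the same score distribution, the “best” schools are mostly the small ones:

```python
import numpy as np

rng = np.random.default_rng(11)

# Schools of very different sizes, all drawing students from the SAME score
# distribution. Which schools end up at the top of the "average score" table?
sizes = rng.integers(low=30, high=2000, size=1000)           # students per school
school_means = np.array([rng.normal(1000, 180, size=n).mean() for n in sizes])

top = np.argsort(school_means)[-20:]      # the 20 "best" schools
print("median size, all schools :", int(np.median(sizes)))
print("median size, top schools :", int(np.median(sizes[top])))
# The "best" (and the "worst") schools are overwhelmingly the small ones:
# the standard error of the mean is sigma / sqrt(n), de Moivre's equation.
```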
I highly recommend the case-study from Bergstrom and West about the best barbecue in the US, which is a great exercise in critical and statistical thinking. This problem is prevalent in online platforms which have to rank products, posts, and comments. Moreover, they face the challenge of how to take the sample size into account. It depends on their objectives, but for inspiration, I recommend you check out the HackerNews ranking algorithm.
Out of this lecture and lab, there are two directions we have to investigate further: probability distributions and properties of estimators. The latter is a topic in mathematical statistics which is relevant even for machine learning.
Newsvendor problem. Distribution stories
You probably studied the properties of a whole range of probability distributions, but then, in statistics, encountered just a few – especially in the context of hypothesis testing. In the previous lecture we saw the motivation behind \(\chi^2_k\), \(t_k\), \(N(\mu, \sigma)\), \(F(d_1, d_2)\) and the relationships between them.6
6 There are particular physical processes and phenomena (stories, in general) which underlie the patterns we observe. This is the reason they can be accurately described by particular probability distributions.
7 For more details, look at additional examples with simulations, stories, and applications
We also did simulations for Beta, Gamma, Exponential, Negative Binomial and saw how mixtures are helpful in modeling various business outcomes, like customer purchasing patterns.7 These distributions will be our building blocks when developing more complicated and realistic Bayesian models.
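One of those mixtures is worth sketching here because it comes up so often in customer analytics: if each customer’s purchase rate is Gamma-distributed and purchases are Poisson given the rate, the marginal count is Negative Binomial. The parameters below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Heterogeneous customers: each has their own purchase rate lambda ~ Gamma(shape, scale),
# and makes Poisson(lambda) purchases. Marginally this is a Negative Binomial --
# the classic overdispersed alternative to a single Poisson.
shape, scale, n_customers = 2.0, 1.5, 100_000   # illustrative parameters
rates = rng.gamma(shape, scale, size=n_customers)
purchases = rng.poisson(rates)

# Matching Negative Binomial: n = shape, p = 1 / (1 + scale)
nb = stats.nbinom(n=shape, p=1 / (1 + scale))
for k in range(5):
    print(k, round(np.mean(purchases == k), 4), round(nb.pmf(k), 4))
```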
Perhaps you studied the Poisson distribution and then never saw it again in practice. If that sounds familiar, you’re either aware that it’s important and wonder why it didn’t come up in practice, or you believe it was a tedious academic exercise. The point is not about the Poisson distribution, but about probability and statistics theory in general. First, let me assure you that it is helpful in practical applications.
The Poisson distribution and process can be a good choice for modeling counts of events per unit of time or space, when there is a large number of “trials”, each with a small probability of success (a small simulation sketch follows the list of examples below).
\[ P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, \dots \]
- Arrivals per hour: requests in a call center, arrivals at a restaurant, website visits. We can use it for capacity planning.
- Bankruptcies filed per month, mechanical piece failures per week, engine shutdowns, work-related accidents. We can use these insights to assess risk and improve safety.
- Forecasting slow-moving items in a retail store, e.g. for clothing
- A famous example comes from L. Bortkiewicz: in the Prussian army there were 0.70 deaths per corps per year caused by horse kicks (the “law of small numbers”). Here is the historical data and a blog post telling the story.
- Number of asthma or kidney cancer related deaths per US county (examples from Gelman’s Bayesian Data Analysis, 3rd ed., recommended in the next appendix)
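Here is the small sketch promised above: the “law of rare events” in action, where a Binomial with many trials and a tiny success probability is practically indistinguishable from a Poisson with the same mean (using the 0.70 rate quoted for the horse-kick example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Deaths per corps per year with rate 0.70 (the figure quoted above), simulated
# as many rare opportunities for an accident, each with a tiny probability.
lam, n_trials = 0.70, 10_000                    # lambda = n_trials * p
deaths = rng.binomial(n=n_trials, p=lam / n_trials, size=100_000)

for k in range(5):
    print(k,
          round(np.mean(deaths == k), 4),       # Binomial(n, lam/n) simulation
          round(stats.poisson(lam).pmf(k), 4))  # Poisson(lam) limit
```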
Just before you get all excited about these applications, keep in mind that every distribution also has a set of assumptions that have to be met.
Next, we’re going to simulate the newsvendor problem which I outlined in the introduction and use different probability distributions to represent different types of demand patterns which we might encounter in practice (slow moving, fast moving, intermittent).8
8 Demand forecasting and inventory optimization is a large field, and we’ll need to come back to it with more powerful statistical models
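As a preview, here is a bare-bones version of such a simulation, with illustrative prices and a Poisson demand model (the lab uses different demand patterns and parameters):

```python
import numpy as np

rng = np.random.default_rng(9)

# Newsvendor sketch (illustrative numbers): buy q units at unit cost, sell at
# price, unsold units are lost. Pick q by simulating demand and comparing
# expected profits across candidate order quantities.
price, cost = 5.0, 2.0
demand = rng.poisson(lam=40, size=100_000)        # one possible demand model

def expected_profit(q: int) -> float:
    sold = np.minimum(demand, q)
    return float(np.mean(price * sold - cost * q))

candidates = range(20, 61)
best_q = max(candidates, key=expected_profit)
print(best_q, round(expected_profit(best_q), 2))

# The classic critical-ratio answer is the (price - cost) / price quantile of demand:
print(np.quantile(demand, (price - cost) / price))
```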
If you’re into sports betting, you have to understand the idea of point spreads. You can find an example of using the normal distribution as a population model, instead of empirical frequencies, in Chapter 1 of “Bayesian Data Analysis, 3rd ed”. You can use this data about matches to compute the probability that a team wins. You can also investigate whether experts (and their subjective point spreads) are right, on average.
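A minimal sketch of that population model, with an assumed standard deviation of 14 points (in practice you would estimate it from the matches data), turns a point spread into a win probability:

```python
from scipy import stats

# If the favorite is favored by d points, model (favorite score - underdog score)
# as Normal(d, sigma). sigma = 14 is an illustrative value, not an estimate.
def p_favorite_wins(spread: float, sigma: float = 14.0) -> float:
    return 1 - stats.norm(loc=spread, scale=sigma).cdf(0)

for d in (3.5, 7.0, 10.5):
    print(d, round(p_favorite_wins(d), 3))
# You can then compare these model probabilities with empirical win frequencies
# to check how well calibrated the experts' point spreads are.
```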
Conditioning and Bayes Rule
I very much like the quote “Conditioning is the soul of statistics”. The Bayes rule, which follows from the axioms of probability, is an essential tool for reasoning under uncertainty and updating your beliefs. It is worth understanding not only the details of its computation, but also its conceptual role as a tool in decision-making.
There are a few great resources which will teach you about conditioning, marginalization, priors, and how to think like a Bayesian. Check out Chapter 1 and 2 from A. Gelman’s “Bayesian Data Analysis, 3rd ed.”,9 A. Johnson’s Bayes Rules!, and R. McElreath’s Statistical Rethinking. If you prefer videos, enjoy G. Sanderson’s (3Blue1Brown) visual guide to the Bayes rule and its alternative formulation for medical testing.
9 We’ll refer to it as BDA3 from now on
We will code up and simulate an example of medical testing for rare diseases, then use the same idea to reason about how confident we are that our code has no bugs. If you remember the Covid-19 rapid tests and their confusion matrices printed on the instruction sheet, you could’ve applied the same idea, but you would have had to pick your priors very thoughtfully!
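The core computation is short enough to sketch here, with illustrative prevalence, sensitivity, and specificity (pick your own numbers from the test’s confusion matrix):

```python
# Rare-disease screening with the Bayes rule (illustrative numbers, not the lab's):
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
prevalence  = 0.001   # prior: 1 in 1000 has the disease
sensitivity = 0.95    # P(test positive | disease)
specificity = 0.99    # P(test negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # ~0.087: most positives are false alarms
```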
Or maybe you’re passionate about biology, where you could apply it to Mendelian genetics and think about the mystery of the persistence of deadly genes (SV, Tableau 10). This case study is more complex than the others recommended here.
Simpson’s paradox is usually introduced to highlight the importance of conditioning. However, the only resources I found which get to the core of the problem are B. Neal’s first lecture on causal inference and R. McElreath’s note on it in “Statistical Rethinking”.
The “paradox” is resolved only when we think about the underlying causal process and the DAG of influences (i.e. the probabilistic story of the use-case).
In order to make a small step from Bayes rule towards the Bayesian approach to statistics, I recommend the following provocative talk by Richard McElreath, “Bayesian inference is just counting” (slides and recording).
(BDA3, Ch1): You can implement a rudimentary spell checker, based on empirical frequencies provided by Peter Norvig. You will have to code it up for yourself, but it’s an excellent exercise.
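If you want a head start, here is a toy, Norvig-style sketch with a stand-in corpus (his version uses empirical counts from a large text file and a richer error model):

```python
import re
from collections import Counter

# Toy corrector: P(word | typo) is proportional to P(typo | word) * P(word).
# With a crude error model (all edits within distance 1 equally likely), the
# choice reduces to the most frequent known candidate. The corpus is a stand-in.
corpus = "the quick brown fox jumps over the lazy dog the fox"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", corpus.lower()))

def edits1(word):
    """All strings at edit distance one from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    return set(deletes + replaces + inserts + transposes)

def correct(word):
    candidates = ({word} if word in WORD_COUNTS else set()) \
                 or (edits1(word) & set(WORD_COUNTS)) \
                 or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correct("teh"))   # -> 'the'
```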
In Chapter 1 of John Winn’s “Model-based Machine Learning”, you will use the Bayes rule and conditioning to solve a hypothetical murder mystery and update your opinions based on the increasing amount of evidence collected.
For the simplest models, one approach of comparing the “relative evidence” of different hypotheses is Bayes Factors, explained well by D. Lakens in Chapter 4 of “Improving your Statistical Inferences”. However, it doesn’t translate well in practice for more sophisticated models and is generally not recommended.
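For the simplest possible case, a coin flipped n times, the Bayes factor has a closed form, and a few lines of code make the “relative evidence” idea concrete (flat Beta(1, 1) prior under the alternative, purely for illustration):

```python
import numpy as np
from scipy.special import betaln

# Bayes factor BF01 for k heads in n flips (the binomial coefficient cancels):
#   H0: theta = 0.5        ->  P(data | H0) proportional to 0.5^n
#   H1: theta ~ Beta(1, 1) ->  P(data | H1) proportional to B(k + 1, n - k + 1)
def bayes_factor_01(k: int, n: int) -> float:
    log_bf = n * np.log(0.5) - betaln(k + 1, n - k + 1)
    return float(np.exp(log_bf))

print(round(bayes_factor_01(52, 100), 2))   # ~7.4: the data favor the fair coin
print(round(bayes_factor_01(70, 100), 4))   # ~0.002: strong evidence against theta = 0.5
```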
Mathematical statistics
For the fundamentals of hypothesis testing, our go-to resource will be D. Lakens’ “Improving your statistical inferences” (LI). D. Speegle’s “Probability, Statistics, and Data: A Fresh Approach Using R” is perhaps the most accessible book. There are many excellent and well-established resources, but they get heavy and mathematical really fast, for example, Casella & Berger’s “Statistical Inference” and M. Schervish’s “Theory of Statistics”.10
10 Perhaps after gaining an intuition from our lectures and doing some statistical modeling, you will go back and study probability and mathematical statistics more rigorously
Properties of estimators
We will worry a lot about causal and methodological aspects when we learn statistical modeling, but everything else being equal, there are some estimators which make better use of our data than others (if their assumptions hold).
For an accessible explanation of bias, consistency, and efficiency, showcased with the corresponding R code, see Chapter 6 of I. Svetunkov’s “Statistics for Business Analytics”. An important idea in the likelihood approach to estimation is Fisher’s information, which is presented clearly in Chapter 4 of B. Efron’s “Computer age statistical inference”,11 along with the important result of the Cramér-Rao lower bound.
11 Unfortunately, “Computer Age Statistical Inference” is no longer available online as a free pdf
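A simulation sketch (illustrative normal data) makes two of these properties visible: the bias of the divide-by-n variance estimator and the lower efficiency of the sample median relative to the sample mean:

```python
import numpy as np

rng = np.random.default_rng(17)

# Sampling-distribution view of two classic estimator properties.
mu, sigma, n, n_sims = 10.0, 2.0, 20, 50_000
samples = rng.normal(mu, sigma, size=(n_sims, n))

var_n  = samples.var(axis=1, ddof=0)   # divide by n   (biased)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)
print("true variance :", sigma**2)
print("mean of var_n :", round(var_n.mean(), 3))    # systematically too small
print("mean of var_n1:", round(var_n1.mean(), 3))   # on target

means   = samples.mean(axis=1)
medians = np.median(samples, axis=1)
print("variance of sample mean  :", round(means.var(), 4))
print("variance of sample median:", round(medians.var(), 4))  # ~1.5x larger under normality
```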
In modern statistics, we often have to deal with highly multidimensional parameter spaces, where unbiasedness is no longer the most important criterion for good estimation. Therefore, knowing about the curse of dimensionality and the bias-variance tradeoff is tremendously important in applied statistics. For this topic, I recommend Lecture 8 of Y. Abu-Mostafa’s “Learning from Data” course.
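To see the classical decomposition in code, here is a self-contained sketch (an illustrative sine curve and polynomial fits, not the setup from the lecture): averaging predictions over many training sets separates bias from variance:

```python
import numpy as np

rng = np.random.default_rng(23)

# Fit polynomials of different degrees to many noisy training sets drawn from
# the same true curve, then decompose the error at a grid of test points.
def true_f(x):
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50)
n_train, noise_sd, n_sims = 15, 0.3, 2_000

for degree in (1, 3, 9):
    preds = np.empty((n_sims, x_test.size))
    for s in range(n_sims):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[s] = np.polyval(coefs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
# Low-degree fits have high bias and low variance; high-degree fits the reverse.
```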
There are objections in the context of Deep Learning to the traditional bias-variance decomposition. See how this tradeoff needs an update for modern deep learning. However, in the most general sense, it is a universal problem not only in statistics, but also for human cognition.12
12 When it comes to notions of rationality, it’s useful to think about the tradeoff between efficiency and resilience
Bootstrap and Nonparametric Statistics
Often, we can’t afford to make strict distributional assumptions about our data-generating process, functional relationship, or population model. This means that we sometimes resort to more flexible or robust models, but then we’re out of luck when it comes to theoretical, closed-form sampling distributions. Thus, we need computational methods for calculating confidence intervals.
The bootstrap was a revolutionary technique at the time, which enabled people to use estimators that were previously too hard or impossible to work with. Chapter 12 of “Introduction to Modern Statistics, 2nd ed.” is a good introduction to it.
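The whole idea fits in a few lines; here is a percentile-bootstrap sketch for the median of a skewed sample (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(31)

# Percentile bootstrap for an estimator without a friendly closed-form sampling
# distribution -- here the median of a skewed (exponential) sample.
data = rng.exponential(scale=2.0, size=200)

n_boot = 10_000
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])
low, high = np.percentile(boot_medians, [2.5, 97.5])
print(f"sample median = {np.median(data):.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```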
Hypothesis testing. Neyman-Pearson
In order to make sense of hypothesis testing, we need to go back to the original idea of Neyman and Pearson – error control in the long run. This is a “path of action” perspective on statistics, in contrast with the Bayesian updating of beliefs and the Fisherian likelihood or relative evidence. It’s a powerful approach when used well for the appropriate problem, but it can also backfire spectacularly when misused or misunderstood.13
13 This is one of the core reasons for the replication crisis in multiple disciplines like psychology and social science
As you have seen in Chapter v. (12 steps of statistics), we will have to make a lot of choices in the phase of study / experiment design. You have to justify why you picked the default action, population, sampling scheme, measurement, minimal effect size, sample size, etc. Read the following chapters from D. Lakens, which tackle most of the statistical choices:
- Chapter 1 debunks common misconceptions about p-values and simulates their distribution under repeated studies (see the sketch after this list). You can find an interactive visualization of the p-curve here.
- Chapter 2 tells you how to justify type I and type II error rates and which outcomes you can expect a priori from your study
- Chapter 6 deep-dives into effect sizes and Chapter 7 focuses on the correct interpretation of confidence intervals
- Chapter 8 puts it all together into power analysis and how to justify sample sizes, given the practical constraints of your problem
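Here is the sketch referenced in the first bullet: simulating many two-sample studies (illustrative effect size and sample size) shows that p-values are uniform under the null and right-skewed under an effect, with the proportion below alpha giving the empirical power:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(41)

# Distribution of p-values over many repeated two-sample studies:
# uniform when the null is true, right-skewed when there is an effect.
n_per_group, effect, n_studies, alpha = 30, 0.5, 10_000, 0.05

def simulate_pvalues(true_effect):
    a = rng.normal(0, 1, size=(n_studies, n_per_group))
    b = rng.normal(true_effect, 1, size=(n_studies, n_per_group))
    return stats.ttest_ind(a, b, axis=1).pvalue

p_null   = simulate_pvalues(0.0)
p_effect = simulate_pvalues(effect)
print("type I error rate:", np.mean(p_null < alpha))     # ~ alpha
print("empirical power  :", np.mean(p_effect < alpha))   # ~0.48 for d = 0.5, n = 30
```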
All of the above is pretty standard and well-known material among statisticians, but you will avoid tons of pitfalls as a practitioner by knowing it. Perhaps D. Lakens’ most important contribution is the idea of asking better and riskier questions, in a very specific, statistical sense. This involves using one-sided, equivalence, and non-inferiority tests when appropriate (range predictions) – otherwise we risk getting the right answer to the wrong question.
The Neyman-Pearson approach requires a little detour into epistemology and K. Popper’s idea of falsification. You can read Chapter 5, section 6 and watch this lecture on Popper and Lakatos from Z. Dienes. The same author has another comprehensive overview of frequentist statistics – you can watch the lecture here.
It is also good to know that there are alternatives to the p-value approach to the final decision and reporting of results. Two of the most notable are W. Stahel’s “Relevance” paper and A. Gelman’s “Sign and Magnitude”.
There is a zoo of different statistical tests and procedures, which can be very confusing – especially when trying to remember their particularities. It’s important to realize that a lot of seemingly unrelated statistical tests in frequentist statistics are particular versions of linear models. The following book will save you lots of headaches: Common statistical tests are linear models and its Python port.
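A quick way to convince yourself, sketched below with simulated data: the two-sample t-test (equal variances) and an OLS regression on a group dummy give the same t statistic and p-value:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(53)

# Two groups with a small mean difference (simulated data).
a = rng.normal(10.0, 2.0, size=40)
b = rng.normal(11.0, 2.0, size=40)

# Classic two-sample t-test, assuming equal variances.
t, p = stats.ttest_ind(a, b, equal_var=True)

# The same test as a linear model: outcome ~ intercept + group dummy.
y = np.concatenate([a, b])
group = np.concatenate([np.zeros(40), np.ones(40)])
ols = sm.OLS(y, sm.add_constant(group)).fit()

print(round(abs(t), 4), round(abs(ols.tvalues[1]), 4))   # same t statistic (up to sign)
print(round(p, 4), round(ols.pvalues[1], 4))             # same p-value
```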
Frequentism vs Likelihood vs Bayes
There are three main schools of thought in statistics, which have their respective metaphors: “path of action” (Neyman-Pearson frequentism), “path of devotion” (Fisherian likelihood), and “path of belief / knowledge” (Bayesian). I like very much the presentation of each school in “Computer Age Statistical Inference”, Chapters 2, 3, and 4.
Each one has its strengths and weaknesses, and contributes tools and insights for our future use-cases. For example, the Neyman-Pearson approach is excellent for quality control at scale, the likelihood approach is widely used in machine learning, and Bayesian inference is a good alternative to frequentism for statistical modeling.
Replication Crisis
At last, we can’t avoid a conversation about the replication crisis happening in multiple disciplines, but especially in social sciences. What scientific literature can we trust? This is relevant not just for research, but will help you avoid many pitfalls in business practice. Here are some of the topics you will encounter in this conversation:
- Sloppy science: multiple testing, p-hacking, HARKing, snooping
- Ethics and research integrity, increasing rates of scientific fraud
- Publication bias, open science, and pre-registration
- False-discovery rate, underpowered studies and vague questions
- Confounding, mediation, and all that causal jazz
- Computational reproducibility vs replication, meta-analysis
There is a famous study in which researchers put a dead salmon in an fMRI machine and, lo and behold, the statistical software showed some neural activation. It put into question standard practices of applied statistics in neuroscience and led to a larger conversation on controversies in medicine, psychology, and social science.