Probability and Statistics: Study Guide
Many mistakes in statistical modeling and inference come from a misunderstanding of the nature of statistical inference. This is why we start from the very beginning, with counting. But first, read the motivation in the course introduction.
In contrast with the previous module, the lectures here are generally more mathematical, hands-on, and time-consuming. But don’t worry, I keep the promise of stories and graphs over proofs.
Combinatorics. Sampling. Urn Models
There are several excellent courses which start from counting, and I think there are some useful and practical ideas about combinatorics that you may have missed. There are use-cases and stories that make the important results about sampling with/without replacement easy to reconstruct, so you never have to learn them by heart. 1
1 There are three resources I recommend: Joe Blitzstein’s Lecture 1/2 of Probability 110, Santosh Venkatesh’s Probability Lectures, Tableau 2, and Richard McElreath’s “Bayesian inference is just counting”.
- Multiplication rule and the “garden of forking paths”
- Sampling with and without replacement. Ordered vs unordered
- Naive Probability and motivation for Kolmogorov’s breakthrough
- Read this short article about the difference between probability and statistics
One of the reasons we’re not intuitively good at probability is that in many cases it is difficult to count the ways in which favorable and possible outcomes can occur. The birthday problem is one such example: we’ll run simulations and derive the analytical solution, which suggests unexpected applications in computer science and engineering.
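As a minimal sketch of the birthday problem, here is the analytical solution next to a simulation. It assumes 365 equally likely birthdays and independent people; the function names and the number of trials are my own choices for illustration.

```python
import random

def birthday_exact(n, days=365):
    """P(at least two of n people share a birthday), assuming equally likely days."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

def birthday_simulated(n, trials=10_000, days=365, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        if len(set(birthdays)) < n:  # a repeated value means a shared birthday
            hits += 1
    return hits / trials

for n in (23, 40):
    print(n, round(birthday_exact(n), 4), round(birthday_simulated(n), 4))
```

The counting argument is the multiplication rule applied to the complement: it is easier to count the ways in which nobody shares a birthday than the ways in which somebody does.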
Probability Triple. Random Variables. Models
If you understand and can explain the above ideas in a simple, yet rigorous way – you’re ready for the journey. Otherwise, if it feels shaky2, here are some readings:
2 Some of you might find reviewing this insulting because it’s “trivial”, or useless theory, or a frustrating reminder of probability classes. Please, bear with me – because we will eliminate a whole class of errors practitioners make by not keeping these things in mind.
- Probability Triple and Random Variables - a quasi formal introduction is written in this chapter of the course website. From my experience, not many students have this understanding after their probability theory classes.
- Collectivity (“physical” structure), Statistical Population, Sample. We need to be a bit more precise in what we mean by a statistical model and a DGP.
- Parameter (Estimand), Estimator, Estimate/Statistic. Never confuse those!
Some of the biggest debates in science, spanning decades and causing much confusion and controversy, could’ve been resolved much more quickly with an explicit distinction between Scientific Hypotheses, Process Models, and Statistical Models or procedures.
I highly recommend this first lecture by Richard McElreath, showing how tricky it can be to map the correspondences between these three.
Stories behind distributions
You studied the properties of a whole zoo of probability distributions, but then, in statistics, encountered just a few – especially in the context of hypothesis testing. In Module 3 (Applied Bayesian Statistics) we will need to know the stories3 behind most of them, since the goal will be to build custom models for each application.
3 There are particular physical processes and phenomena (stories, in general) which underlie the patterns we observe. Often, those patterns can be accurately described by a particular probability distribution, governed by its parameters.
- Bernoulli, Binomial, Hypergeometric, Negative Binomial 4
- Poisson. Limiting and Conditioning. Overdispersion
- Beta, Gamma, Exponential. Exponential Family and Information Theory
- Statistical superstars: \(\chi^2_k\), \(t_k\), \(N(\mu, \sigma)\), \(F(d_1, d_2)\)
- Weirdos: Mixtures, Dirichlet, Multinomial, Weibull, heavy tails
- Remember the differences between PMF, PDF, CDF, MGF, \(\mathbb{E}\), \(\mathbb{V}\), \(\mathbb{E}g(x)\)
4 Look at some examples with simulations and stories / applications with more math from Joe Blitzstein.
We will analyze and simulate a simple but tricky example: what is the probability that a given number of the people who signed up actually go on the safari?
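A minimal sketch of how such a question could be set up, assuming each person who signed up shows up independently with the same probability (a Binomial story). The numbers of sign-ups and the show-up probability below are hypothetical, and the independence assumption may be exactly where the real example gets tricky.

```python
import math
import random

def binom_pmf(k, n, p):
    """P(exactly k of n people show up) under the Binomial model."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def simulate_show_ups(n, p, trials=50_000, seed=1):
    """Simulated distribution of the number of people who show up."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        shows = sum(rng.random() < p for _ in range(n))
        counts[shows] += 1
    return [c / trials for c in counts]

n_signed_up, p_show = 10, 0.8  # hypothetical values, not from the course example
simulated = simulate_show_ups(n_signed_up, p_show)
for k in range(n_signed_up + 1):
    print(k, round(binom_pmf(k, n_signed_up, p_show), 4), round(simulated[k], 4))
```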
We model streaks of successful shots in basketball and ask whether streaks of \(k\) are surprising. I review the original “hot hand” paper by Gilovich, Vallone, and Tversky: what they did very well, but also potential problems with the statistical model.
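To get a feeling for how surprising a streak is, here is a sketch of the simplest null model: independent shots with a constant success probability. This is an assumption of the sketch, not a claim about real players, and the values of `n`, `k`, and `p` are hypothetical.

```python
import random

def prob_streak_exact(n, k, p):
    """P(at least one run of k successes in n independent shots)."""
    # state[j] = probability that the current run has length j and no run of k occurred yet
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * (1 - p)   # a miss resets the run
        for j in range(k - 1):
            new[j + 1] = state[j] * p   # a hit extends the run
        state = new                     # mass leaving state[k-1] on a hit is absorbed
    return 1.0 - sum(state)

def prob_streak_simulated(n, k, p, trials=10_000, seed=2):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _trial in range(trials):
        run = 0
        for _shot in range(n):
            run = run + 1 if rng.random() < p else 0
            if run >= k:
                hits += 1
                break
    return hits / trials

print(prob_streak_exact(100, 6, 0.5), prob_streak_simulated(100, 6, 0.5))
```

Even under pure independence, long streaks show up far more often than intuition suggests, which is why the choice of the null model matters so much in this debate.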
A classic early application of the Poisson distribution was modeling deaths by horse kicks in the Prussian Army. Here is the historical data and a blog post telling the story.
I limit the number of hands-on applications for this chapter/lecture, not only because of time constraints, but also because most use-cases come in Module 3, in the context of more realistic problems. I hope, however, that I sparked an interest in how to approach Probability, especially when we draw DAGs to tell stories.
Conditioning and Bayes Rule
There is a quote I like a lot: “Conditioning is the soul of statistics”. Bayes’ rule, which follows directly from the axioms of probability, is essential in decision-making and the most important tool in this course – both conceptually and technically. Any introduction to the subject will do:
- A few excellent resources are Chapter 1/2 of BDA3, or Chapter 1/2 of Bayes Rules, or Chapter 1/2 of Statistical Rethinking. They will teach you about:
- Conditioning, Marginalization, Priors, and Updating
- If you prefer videos, enjoy the 3Blue1Brown visual masterpiece on how to think like a Bayesian or the explanation here.
- I introduce the idea of Likelihood, which will serve us in future use-cases. It is another important perspective on statistical modeling to consider
Medical testing for rare diseases: a hypothetical example with code in my course repository. We use the same idea to reason about how confident we are that our code has no bugs.
If you remember the Covid-19 rapid tests and their confusion matrices printed on instructions, you could’ve applied the same idea!
Or maybe you’re passionate about biology, where you could apply it to Mendelian genetics and think about the mystery of why deadly genes persist.
For the simplest models, one approach to comparing different hypotheses is Bayes Factors. However, these do not translate well to practice for more sophisticated, multilevel models. You can look it up in the following courses here and here for the theory and examples.
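The rare-disease testing idea can be sketched in a few lines. This is not the code from the course repository, just a minimal illustration of Bayes' rule; the prevalence, sensitivity, and specificity values are hypothetical.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity  # false positive rate
    # Marginalize over disease status to get P(positive).
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_healthy * (1 - prevalence))
    return p_pos_given_disease * prevalence / p_pos

# Hypothetical numbers: a rare disease and a fairly accurate test.
print(posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95))
```

With a 1% prevalence, most positive results come from the large healthy group, so the posterior stays far below the test's accuracy. The same function applies to the rapid-test confusion matrices once you plug in the printed sensitivity and specificity and a prior you are willing to defend.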
(BDA3, Ch1): Football spreads, that can be estimated from data about matches. What is the probability that a team wins? Are experts right, on average?
- If you’re into betting and sports, can you replicate the analysis on other datasets? What are your options for data collection?
- For brevity, from now on I won’t elaborate much on how to take a use-case or example to its limit. If you’re passionate about a particular topic – go for it!
(BDA3, Ch1): Spelling correction, based on empirical frequencies provided by Peter Norvig. As in the previous case-study, you will have to code it up and figure it out for yourself – it is good for a warm-up, but challenging enough to keep you occupied.
Simpson’s paradox is usually introduced to highlight the importance of conditioning. However, the only resource I found which gets to the core of the problem is Brady Neal’s first lecture on causal inference.
The “paradox” part of it is resolved (or at least not puzzling), when we think about the causal structure of the problem (or the DAG of influences).
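A small numerical sketch of the reversal. The counts are illustrative, patterned on the widely cited kidney-stone treatment example, so treat them as an assumption of this sketch rather than as data from the course.

```python
# (treatment, stone size) -> (successes, patients); illustrative counts
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

# Success rate within each stratum (conditioning on stone size).
by_stratum = {key: s / t for key, (s, t) in counts.items()}

def pooled_rate(treatment):
    """Success rate ignoring stone size (marginalizing over it)."""
    successes = sum(s for (trt, _), (s, _t) in counts.items() if trt == treatment)
    patients = sum(t for (trt, _), (_s, t) in counts.items() if trt == treatment)
    return successes / patients

for key, rate in by_stratum.items():
    print(key, round(rate, 3))
print("pooled A:", round(pooled_rate("A"), 3), "pooled B:", round(pooled_rate("B"), 3))
```

Treatment A wins within each stratum and loses in the pooled table, because the harder cases were sent to A more often. Which comparison answers the question depends on the DAG: here the stratum influences both the treatment and the outcome.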
LLNs, CLTs. Estimator properties
Halfway through the module, we switch from Probability Theory to Mathematical Statistics. The goal is to develop the fundamentals needed for applied statistics, designing randomized experiments, and even machine learning. 5
5 Although the perspective I take in Module 3 is Bayesian, I will take time in Module 2 to cover and re-contextualize the Neyman-Pearson frequentism
- We continue with the key idea of estimators and sampling distributions, review laws of large numbers and the central limit theorem. See simulations here.
- If you’re interested in the underlying theory, I go on a technical detour about convergence types: in probability, in distribution, and almost-sure
- What does a statistician want? Review important properties of estimators.
- For an accessible explanation of Bias, Consistency, Efficiency – showcased with the corresponding R code, see openforecast
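The ideas in the list above can be checked with a short simulation. This sketch draws repeated samples from an Exponential(1) population (my choice of a skewed distribution, with mean 1 and standard deviation 1) and looks at the sampling distribution of the mean.

```python
import math
import random
from statistics import fmean, stdev

def sample_means(n, reps=5_000, seed=3):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    rng = random.Random(seed)
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

for n in (5, 50, 500):
    means = sample_means(n, reps=1_000)
    se_theory = 1.0 / math.sqrt(n)  # sigma / sqrt(n), with sigma = 1
    # Share of sample means within 1.96 standard errors of the true mean.
    within = sum(abs(m - 1.0) < 1.96 * se_theory for m in means) / len(means)
    print(n, round(fmean(means), 3), round(stdev(means), 3),
          round(se_theory, 3), round(within, 3))
```

The means concentrate around 1 as `n` grows (law of large numbers), their spread shrinks like \(1/\sqrt{n}\), and the share within 1.96 standard errors approaches the Normal value of about 95% (central limit theorem).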
I think that “The most dangerous equation” is a must-read for anyone, not just practicing scientists and statisticians. The example I usually demonstrate is the dubious U.S. policy of splitting up bigger schools.
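The equation in question is de Moivre's: the standard error of a mean is \(\sigma / \sqrt{n}\). Here is a sketch of why it misleads in the schools example, assuming every student's score comes from the same distribution regardless of school size. The school counts and sizes are hypothetical.

```python
import random
from statistics import fmean

def school_means(n_schools, students_per_school, rng):
    """Average score per school; all students share the same N(0, 1) distribution."""
    return [fmean(rng.gauss(0.0, 1.0) for _ in range(students_per_school))
            for _ in range(n_schools)]

def share_small_in_extremes(top_k=20, seed=4):
    """Share of small schools among the top_k best and top_k worst schools."""
    rng = random.Random(seed)
    small = [("small", m) for m in school_means(200, 25, rng)]
    large = [("large", m) for m in school_means(200, 400, rng)]
    ranked = sorted(small + large, key=lambda school: school[1])
    bottom, top = ranked[:top_k], ranked[-top_k:]

    def share(group):
        return sum(kind == "small" for kind, _ in group) / top_k

    return share(top), share(bottom)

print(share_small_in_extremes())
```

Small schools dominate the top of the ranking, but they dominate the bottom just as much. Looking only at the best performers would wrongly suggest that being small causes good results.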
Continuing with the reddit examples, there are some amazing case-studies in the “Calling Bullshit” website and book. One of them is exactly such a ranking problem: the best barbecue in the States. I recommend you watch the whole playlist and work through the case studies: it is fun, and calling out bullshit is an essential skill.
Online platforms which have to rank posts and comments face the challenge of taking the sample size into account. The right answer depends on the context, but for inspiration, see the hackernoon ranking algorithm.
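One commonly cited way to account for sample size is to rank by the lower bound of the Wilson score interval for the proportion of upvotes. The sketch below is an illustration of that idea only; I am not claiming that any particular platform uses exactly this formula, and the function name is my own.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(upvotes, total, confidence=0.95):
    """Lower end of the Wilson score interval for the upvote proportion."""
    if total == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = upvotes / total
    centre = p_hat + z * z / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

# A single upvote (100% positive) versus 90 upvotes out of 100 votes.
print(wilson_lower_bound(1, 1), wilson_lower_bound(90, 100))
```

The post with one vote has a perfect raw score but a low lower bound, so it ranks below the post with many votes and a slightly lower raw proportion.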
Bias-Variance. Fisher Information
I spend another lecture to deep-dive into estimators, because the concepts of bias-variance tradeoff and Fisher information have far-reaching consequences in a myriad of tools, applications and fields – especially machine learning. It is also an appropriate point in time to introduce a technique which was revolutionary in its time: the bootstrap.
There are objections to the Bias-Variance decomposition when seen as a tradeoff, in the context of Deep Learning – however, in the most general sense, it is a universal problem not only in statistics, but also for human cognition. For an intuitive explanation, watch lecture 8, slides. See how this tradeoff needs an update for modern deep learning.
This lecture is highly mathematical, but we will get some powerful intuitions about some fundamental tradeoffs we make in statistics, when selecting a model or estimator.
- Bias-Variance decomposition and the curse of dimensionality
- Fisher Information and the Cramér-Rao lower bound
- The Bootstrap scheme: motivation, applications, and limitations
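A small simulation can make the bias-variance decomposition concrete. This sketch compares three estimators of a variance that differ only in the divisor, assuming Normal data; the sample size and number of repetitions are arbitrary choices.

```python
import random

def variance_estimator_report(n=10, reps=20_000, sigma=1.0, seed=5):
    """Bias, variance and MSE of sum((x - mean)^2) / d for d in {n-1, n, n+1}."""
    rng = random.Random(seed)
    divisors = {"n-1": n - 1, "n": n, "n+1": n + 1}
    estimates = {name: [] for name in divisors}
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)
        for name, d in divisors.items():
            estimates[name].append(ss / d)
    true_var = sigma ** 2
    report = {}
    for name, vals in estimates.items():
        avg = sum(vals) / reps
        bias = avg - true_var
        variance = sum((v - avg) ** 2 for v in vals) / reps
        mse = sum((v - true_var) ** 2 for v in vals) / reps  # equals bias^2 + variance
        report[name] = {"bias": bias, "variance": variance, "mse": mse}
    return report

for name, row in variance_estimator_report().items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

Dividing by `n-1` gives an (approximately) unbiased estimate, yet under this Normal model the biased divisors reach a lower mean squared error: a little bias is traded for less variance.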
Bias-variance can be made more relatable in code, simulations, and visualization. However, I will not leave you hanging without introducing a technique you can use for solving practical and concrete problems, namely – bootstrap.
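Here is a minimal sketch of the bootstrap idea: resample the observed data with replacement, recompute the statistic each time, and read an interval off the resulting distribution. It uses the simple percentile interval, one of several bootstrap variants, and the data are simulated purely for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_boot=2_000, level=0.95, seed=6):
    """Percentile bootstrap interval for `stat` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample has the same size as the data, drawn with replacement.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return boot_stats[lo_idx], boot_stats[hi_idx]

# Hypothetical skewed sample; in practice this would be your observed data.
data_rng = random.Random(6)
data = [data_rng.lognormvariate(0.0, 1.0) for _ in range(200)]
print(statistics.median(data), bootstrap_ci(data))
```

The appeal is that the same few lines work for statistics with no convenient formula for the standard error, such as the median. The limitations (small samples, extremes, dependent data) are part of the lecture.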
Hypothesis testing. Neyman-Pearson
In order to make sense of frequentist hypothesis testing, I strongly recommend you read about the original idea of Neyman and Pearson (error control – don’t make a fool of yourself too often in the long run). It is a “path of action” perspective of statistics.
I start from the first principles and will let go of mechanical application of procedures and conventions (p-values, \(\alpha, \beta\), test choice). You should be able to justify all the choices you make during the phase of experiment design.
- Picking a default action. Type I, II errors. How costly is each type of mistake?
- Minimal effect size of interest, Cohen’s \(d\)
- Power Analysis and Sample Size justification. How surprising are significant findings under each hypothesis? Positive Predictive Value
- p-values simulation, p-curve under \(H_0, H_A\).
- Confidence Intervals - first check out this simulation. Also see chapter 12, which uses the bootstrap to estimate them. The tricky idea of “capture percent”
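The p-value and power items above can be explored with a simulation. To keep this sketch dependency-free, it uses a two-sample z-test with a known standard deviation of 1 instead of a t-test, which is a simplifying assumption; the effect size and group size are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def simulate_p_values(effect, n_per_group, reps=2_000, seed=7):
    """Two-sided p-values of a z-test comparing two group means (sd known, equal to 1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = math.sqrt(2 / n_per_group)  # standard error of the difference in means
    p_values = []
    for _ in range(reps):
        control = fmean(rng.gauss(0.0, 1.0) for _ in range(n_per_group))
        treated = fmean(rng.gauss(effect, 1.0) for _ in range(n_per_group))
        z = (treated - control) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

def rejection_rate(p_values, alpha=0.05):
    """Share of simulated studies with p < alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

print("Type I error rate under H0:", rejection_rate(simulate_p_values(0.0, 64)))
print("Power under d = 0.5:", rejection_rate(simulate_p_values(0.5, 64)))
```

Under \(H_0\) the p-values are uniform, so about 5% fall below \(\alpha = 0.05\). Under \(H_A\) they pile up near zero, and the share below \(\alpha\) is the power. Plotting histograms of the two lists gives the p-curves.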
In order to put everything together, there are four resources I can recommend:
- Speegle’s book on data+probability+R
- Huber’s Chapter 6 of Modern Statistics
- Statistical thinking for 21st century
- Improving your statistical inferences
The most complicated part of hypothesis testing is asking better questions. I mean that in a highly technical sense, and wholeheartedly recommend the following course by a TU Eindhoven professor, “Improving your statistical questions”.
- Make riskier predictions: Non-Inferiority testing, Equivalence Testing, Range predictions
- Publication bias, open science, pre-registrations
- Minimal Effect Size of interest: telescope method and resource-based
- Type 3 errors (solving the wrong problem)
- Read Werner Stahel’s “Relevance” paper and Gelman’s “Sign and Magnitude” paper
- Understanding the philosophy of falsification and how it applies to hypothesis testing. Week 2 of this course has a great 20-minute explanation.
- Philosophy of science: Popper and Lakatos, in this lecture. “The null is always false”
There is a zoo of different statistical tests and procedures, which might be very confusing – especially when trying to remember their particularities. It’s important to realize that a lot of seemingly unrelated statistical tests in frequentist statistics are particular versions of linear models.
- Common statistical tests are linear models and the python port
- Choosing a statistical test: difference in proportions and means, test of \(\sigma\), correlations
- For a Bayesian alternative to t-tests, see Kruschke’s example
- If you’re not clear if your distributional assumptions hold, use a nonparametric test
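As a small check of the claim that common tests are linear models, this sketch computes the classic equal-variance two-sample t statistic and the t statistic of the slope in a regression of the outcome on a 0/1 group indicator. The data are simulated and the function names are my own.

```python
import math
import random
from statistics import fmean

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = fmean(a), fmean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def ols_slope_t_statistic(x, y):
    """t statistic of the slope in the simple regression y = intercept + slope * x."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope / se_slope

rng = random.Random(9)
group_a = [rng.gauss(0.0, 1.0) for _ in range(30)]
group_b = [rng.gauss(0.5, 1.0) for _ in range(40)]
indicator = [0] * len(group_a) + [1] * len(group_b)  # dummy variable for group B
outcome = group_a + group_b
print(pooled_t_statistic(group_a, group_b), ols_slope_t_statistic(indicator, outcome))
```

The two numbers agree up to floating-point error: the slope is the difference in group means, and the residual variance is the pooled variance.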
Frequentism vs Likelihood vs Bayes
There are three main schools of thought in statistics, which have their respective metaphors: “path of action” (Neyman-Pearson frequentism), “path of devotion” (Fisherian Likelihood), and “path of belief / knowledge” (Bayesian). I very much like the presentation of each school of thought in Efron and Hastie’s book “Computer Age Statistical Inference”, chapters 2, 3, and 4.
Each one has its strengths and weaknesses, and contributes tools & insights for our future use-cases. When we got into the topic of A/B testing and experiment design, we unavoidably stumbled upon a few fascinating philosophical questions in relation to the nature of evidence. The philosophical debate is fierce, but in statistical practice, less so. I suggest a level of pragmatism to pick the right tool/perspective for the particular job. In the courses I teach, I dedicate quite a lot of time to how not to fall into the most common pitfalls when applying frequentist methods. It’s a useful skill when critically reading the literature.
By now, you encountered the Neyman-Pearson (frequentist) approach. If you want another presentation, watch this lecture by Zoltan Dienes to get a sense of the orthodox approach: its power and limitations.
The likelihood approach is widely used in Machine Learning / Statistical Learning teaching and practice. This lecture by Zoltan Dienes contrasts Bayes Factors vs classical methods in t-test situations.
We can pick a simple example of inferring a proportion, which has many practical applications that you might remember from “Distribution Stories”. We care not just about the estimation, but also about confidence/credible intervals and the practical workflow.
- Frequentist: Normal Approximation, Agresti-Coull intervals
- Likelihood: Maximum likelihood, point estimates, bootstrapping. Check out this interactive visualization and lecture / lab.
- Bayes: The full posterior distribution, the tricky business of prior choice
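To compare the approaches in the list above on one tiny dataset, here is a sketch with hypothetical counts (7 successes in 20 trials). The Bayesian part assumes a uniform Beta(1, 1) prior, which is only one possible choice, and approximates the credible interval by sampling from the posterior.

```python
import math
import random
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Normal-approximation interval around the maximum likelihood estimate x / n."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def agresti_coull_interval(successes, n, level=0.95):
    """Agresti-Coull interval: add z^2 / 2 pseudo-successes and pseudo-failures."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

def beta_credible_interval(successes, n, level=0.95, prior_a=1.0, prior_b=1.0,
                           draws=20_000, seed=8):
    """Equal-tailed credible interval from draws of the Beta posterior."""
    rng = random.Random(seed)
    a = prior_a + successes
    b = prior_b + n - successes
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

x, n = 7, 20  # hypothetical data
print("MLE:", x / n)
print("Wald:", wald_interval(x, n))
print("Agresti-Coull:", agresti_coull_interval(x, n))
print("Beta posterior:", beta_credible_interval(x, n))
```

The three intervals are numerically close here, but they answer different questions: the first two are statements about the procedure in repeated sampling, while the last is a statement about the parameter given the prior and the data.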
Dead Salmon Experiment. Replication Crisis
Lastly, we can’t avoid a conversation about the replication crisis happening in multiple disciplines, but especially in social sciences. What scientific literature can we trust? This is relevant not just for research and science, but will help you avoid many pitfalls in business practice – therefore, you will be less likely to be fooled by randomness.
- Multiple testing, p-hacking, HARKing, snooping. Ethics and Integrity
- Underpowered studies and vague questions
- Publication Bias, Open Science, Pre-registration and simulation
- False-discovery rate, Bonferroni correction
- Confounding, Mediation and all that causal jazz
- Computational Reproducibility vs Replication. Meta-Analysis
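The multiple testing items in the list above can be sketched as follows: how quickly the chance of at least one false positive grows with the number of tests, and two standard corrections. The independence assumption in the first function and the p-values are hypothetical, chosen for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank  # largest rank whose p-value passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

p_values = [0.001, 0.012, 0.014, 0.035, 0.3]  # hypothetical results of five tests
print(familywise_error_rate(20))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni is the more conservative of the two: it guards against any false positive, while Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for more power.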
An examination of a famous experiment in neuroscience, putting into question standard/current statistical practices, leads to a conversation of controversies in medicine, psychology, and social science.
Just think about how important this experiment was for the field of medicine – it won the Nobel Prize!