Module I: Business School for Data Scientists
The “business school” emphasizes again and again the idea of Decision-Making Under Uncertainty at Scale, across many industries and use-cases. We walk through three perspectives: Analytics, Machine Learning, and Statistics – then develop methodologies for each one. There are two representative case-studies in which we go in-depth: the Newsvendor Problem and Price Optimization.
By convention, lectures whith mathematical inclination are marked by (*) and programming in bold and (**). Often we do both.
Lecture Agenda | Keywords | Case Studies / Activities | |
---|---|---|---|
1 | Introduction & Course Philosophy. Roadmap | Decisions, VUCA, Modeling, prerequisites, project requirements | survey of applications in various domains, admin stuff |
2 | Data Science in Business Context. Strategy | SWOT, Tradeoffs, Strategy, Business Analyst’s Workflow, Value Chain, Fermi Estimation | case-studies: LRB subscriptions, DTC fashion e-commerce |
3 | Decision-Making Under Uncertainty at Scale | AI, Cybernetics, Analytics vs Stats vs ML, Interdisciplinarity, Uncertainty | data roles in firms, Uber and dynamic pricing, open datasets overview |
4* | Newsvendor Problem and Demand Planning | DAGs, model choice, inference and optimization, constraints. The general demand planning problem | TIME INC. printing quantity decisions. Simulations and restaurant data (ddop) |
5* | Learning, Intuition, and Bias. What is ML? | intelligence, bias, learning problem, on bullshit, ERM, bias-variance tradeoff | credit defaults, recommender systems, demand forecasting |
6** | 12 Steps of ML | CRISP-DM, Feasibility, Validation | case-study: Mercari challenge |
7* | Price optimization under limited capacity | demand elasticity, dynamic pricing, regression, pricing strategy & revenue management | case-studies: concerts, airlines, taxis/rides |
8* | 12 Steps of Statistics. A/B Testing Scheme | default action, hypotheses, confidence intervals, relevance, effect sizes, power | case-study: conversion rate optimization and marketing |
9** | Churn and Lifetime Value. Open datasets | BTYD, Contractual & PAYG, Survival Models, EDA & Data Mining. BI & Analytics | kaggle e-commerce relational database, duckDB and SQL |
Module II: Probability and Statistics
There are so many pitfalls in statistics and many of them come from a misunderstanding about the nature of statistical inference, a mechanical application of methods, and an insufficient grasp of fundamentals.
I prefer to fill in the gaps with stories and simulations, however, in some cases the mathematical formalism and abstractions can’t be avoided and actually help understanding
Therefore, fundamental – doesn’t mean easy, nor basic, nor trivial. Most courses make you solve puzzles, but I ask you to appropriately define, justify, and apply the choice of method and tool in the context of business applications.
Lecture Agenda | Keywords | Case Studies / Activities | |
---|---|---|---|
1* | Combinatorics. Sampling & Urn Models | replacement, multiplication rule, naive \(\mathbb{P}\) | Birthday Paradox, Bootstrap |
2* | Probability Triple. Random Variables. What is a model? | experiment, sample space, measure, population, estimand, estimators, golemns | Probability vs Statistics, Kolmogorov, Garden Forking Paths |
3** | Binomial, Poisson, Beta, Gamma. Stories behind distributions | PMF, PDF, CDF, Hypergeometric, Negative Binomial, \(\chi^2_k\), mixtures, \(t_k\), \(N\), Weibull | couples showing up to safari, arrival times, basketball shots, hot hand |
4* | Conditioning and Bayes Rule. Likelihood Ratios | DAGs, Conditioning, Marginalization, Priors, Updating | Monty Hall, Simpsons’ Paradox, Football Spreads, Medical testing |
5* | CLT and LLNs. Convergence types. Estimator properties | sample size, convergence in probability, distribution, almost-sure | The most dangerous equation, hackernoon ranking algo |
6* | Bias-Variance, Fisher Information, Rao-Cramer | Efficiency, Bias, James-Stein, curse of dimensionality | Bootstrap, simulations and visualizations |
7* | Neyman-Pearson frequentism. Popper and falsification | Type I, II errors, confint, p-value, PPV, effect sizes, power analysis, range predictions, \(H_0\), \(H_A\) | How to ask better questions. Stahel’s relevance |
8 | Frequentism vs Likelihood vs Bayes | philosophical roots, agreements, weaknesses | The path of action vs devotion vs belief |
9 | Dead Salmon Experiment. Replication Crisis | multiple testing, harking, snooping, publishing bias, open science, underpowered studies | Examination of controversies in medicine, psychology, and social science |
Module III: Applied Bayesian Statistics
After the first two modules, we should have a solid foundation for building more realistic models. As discussed in the introduction, randomized experiments and A/B tests have obvious limitations: might be unfeasible, unethical, or just not the right tool.
We are not wasting time by switching to Bayes (7 lectures to get to GLMs), because we treat all models under the same framework.
Instead of taking regression as the starting point, then going into advanced econometrics – I prefer we switch to a Bayesian perspective, so we can build custom models taking into account the particularities of the phenomena of interest. The course culminates in Hierarchical GLMs, which should be sufficient for the majority of problems you encounter in practice, or at least a good starting point.
Lecture Agenda | Keywords | Case Studies / Activities | |
---|---|---|---|
1 | Introduction to Bayesian Statistics. Simple Models and Conjugate priors | Beta-Binomial, Gamma-Poisson, Mixtures, Link functions, IG-Normal | Elections, Football Goals, Coal Mining disaseetrs, Golf accuracy, Speed of Light |
2 | Multiple Groups and Hierarchical Models | Complete, No, Partial pooling | A/B testing, baseball, review ranking, Radon, Starbucks wait times, Customer LTV |
3 | Regression and Bayesian Workflow | Categorical variables, Results summarization, Modeling assumptions | OCEAN, Education, height, waffles |
4 | Regression and Model Critique | Validation, Prior and Posterior Checks, LOO-CV, LOO-PIT, overfitting | Golf example continued, Flight Delays |
5 | Regression and Model Selection | Variable selection | McElreath example: height, waffles |
6 | Regression and nonlinearities | Regularization, Variable selection, Splines, Robustness | Applying the concepts on same examples as before. Missing Data |
7 | Generalized Linear Models | Logistic regression, Poisson and overdispersion | Income, Voting, Campus Crime, Student absence |
8 | Introduction to Hierarchical Models | Hierarchical GLMs. Priors informed by prior knowledge | Radon continued, Stage Anxiety, Plumonary Fibrosis |
9 | Hierarchical Models and Longitudinal Data | Stochastic processes. Challenges of Hierarchical time series | Teenage drinking, sleep study, charter schools, Graduate Admissions |
10 | Hierarchical Models case studies. BART | Basketball Fouls, AirBnb reviews, Seat Belts | Bike Shares, Agriculture and Seed Germination |
Python/R Computational Toolbox
In order to make the most out of the three modules, we need to be skilled (or at least competent) programmers, from the perspective of data science. It is a skill which can be developed independently – I give a few recommendations in the introduction (prerequisites section).
In this course I heavily rely on the principles of reproducible research, which have 3 key aspects: data, computational environment, and code.
A second aspect that we need for the programming part not to slow us down and stand in the way, is to have a good, pleasant, and reliable development environment setup. Here is a starting, non-exhaustive list of tools you will need.
For Windows 11+ users, I strongly recommend you use WSLII – and get familiar with the command line & linux. Between R and Python, I choose technologies which make it easy to draw parallels and transition from one to another.
- RStudio as our main IDE
- Quarto for literate programming
- git & github
- renv for managing environments
- database: duckdb (analytical, in-process)
- data wrangling: tidyverse, data.table (bigger data), arrow
- data visualization: ggplot, plotly
- interactive applications: shiny
- APIs: plumbr
R has many gotchas, which is the main reason that makes the language hard. Therefore, we need to some additional concepts in R, especially functional programming and understanding the S3 object system. Tidyverse is an important ecosystem which tries to solve a lot of the issues in base R.
- VScode as our main IDE
- Jupyter Notebooks or Quarto for literate programming
- git & github
- conda and poetry for managing environments
- database: duckdb (analytical, in-process), sqlite (transactional, in-process)
- data wrangling: pandas 2.0, pypolars (bigger data), arrow
- interactive applications: streamlit, shiny, or dash
- APIs: fastapi