Module I: Business School for Data Scientists
The “business school” emphasizes, again and again, the idea of Decision-Making Under Uncertainty at Scale, across many industries and use-cases. We walk through three perspectives – Analytics, Machine Learning, and Statistics – and then develop processes for each one. This module is heavy on methodology, since I want to bring science back into “data science”, but don’t worry: there will be plenty of case studies.
Lectures with more mathematics are marked with (*), programming-focused ones in bold, and methodology lectures with (~).
Lecture | Agenda | Key ideas | Case Studies / Activities |
---|---|---|---|
1 | Introduction & Course Overview | Question \(\rightarrow\) Outcomes, Modeling, 3 challenges of statistics, Prerequisites, Project requirements. Inferential vs Causal | purpose of a business, survey of applications and domains |
2 | Decision-Making Under Uncertainty at Scale | Weak AI and Cybernetics. Analytics vs Stats vs ML. Scientific process is all you need – why methodology is important | Fermi estimation, demystify buzzwords, Wikipedia A/B test |
3 | Data Science in Business Context | Performance trajectories, Strategy, VUCA, Business Analyst’s Workflow, Value Chain meets DS use-cases | Fermi estimation process, overview of realistic datasets |
4* | Newsvendor Problem and Demand Planning | Objectives, decisions, uncertainty. Tradeoffs. Stories behind distributions: traffic, slow-moving, intermittent, luxury | TIME INC. printing, simulations, Yaz restaurant data |
~ | Study Design | Question-first vs data-first. Estimands. Sampling, measurement, validity and reliability. Biases, artifacts, and confounds | FINER, PICOT. Confounds, mediators, colliders, and chains |
~ | Scientific writing | Writing as thinking and communication. Structure of a paper | Good titles, abstracts |
5* | What is ML? Learning, Intuition, and Bias. | Patterns, learning problem, bias-variance tradeoff. Experience and expertise. Model, algorithm, data. Generalization | overview of learning paradigms, models, and use-cases |
6* | 12 Steps of ML | Feasibility, objectives, splitting data, training/tuning, validation, testing, deployment and launch | E2E example of the Mercari pricing challenge |
7* | 12 Steps of Statistics. A/B Testing Scheme | default action, hypotheses, data collection strategies, simulation, confidence intervals, relevance, effect sizes, power | conversion rate optimization and marketing |
8* | Churn and Lifetime Value | Churn, repurchase, contractual and non-contractual settings | BTYD and survival models |
9* | Price optimization and revenue management | positioning, limited supply, demand elasticity, dynamic pricing, promotion, willingness to pay | stories of concerts, airlines, taxis/rides |
~ | Stories from the trenches | Personal exposition of challenges in marketing, demand planning, CRM, and recommender systems | building a thriving data ecosystem and growing competence |
Module II: Probability and Statistics
There are many pitfalls in statistics, and most of them come from misunderstanding the nature of statistical inference, applying methods mechanically, and an insufficient grasp of the fundamentals. I prefer to fill in the gaps with stories and simulations; however, in some cases mathematical formalism and abstraction can’t be avoided and actually help understanding.
I think the best way to start understanding inferential statistics is Daniel Lakens’ “Improving Your Statistical Inferences”, combined with another resource such as Speegle’s “Probability, Statistics, and Data” for a deep dive into the technicalities.
Fundamental, therefore, does not mean easy, basic, or trivial. Most courses make you solve puzzles; here, I ask you to appropriately define, justify, and apply the choice of method and tool in the context of business applications.
Lecture | Agenda | Key Ideas | Case Studies / Activities |
---|---|---|---|
1* | Combinatorics. Sampling & Urn Models | Sampling with and without replacement, multiplication rule, naive \(\mathbb{P}\), probability trees | Birthday Paradox simulation, Why bootstrap works |
2 | Probability Triple. Random Variables. What is a model? | collectivity, population, sample; estimand, estimator, estimation; \((\Omega, \mathcal{F}, \mathbb{P})\), \(X\) r.v., \(H(y \mid F(u), \phi(y, u))\) | Probability vs Statistics. \(\mathcal{H}\), Process models, Statistical models, Inference |
3 | Stories behind distributions | Binomial, Poisson, Mixtures, \(N(\mu, \sigma)\), \(\chi^2_k\), \(t_k\), F, Beta, Gamma, Exponential, NBin, HGeom | couples showing up to safari, arrival times, basketball shots, hot hand |
4* | Conditioning and Bayes Rule. Likelihood Ratios | Updating prior beliefs, Bayesian vs frequentist interpretation of probability. Graphical models | Monty Hall, Simpson’s Paradox, Football Spreads, Medical testing |
5* | CLT and LLNs. Convergence types. Estimator properties | Never forget about sample size. Why simulation works. Limitations: rare events, fat tails, mixtures, non-iid | The most dangerous equation: US schools, gold coins. Simulation |
6* | Fisher Information. Bias-Variance tradeoff | Efficiency, bias, James-Stein paradox and the curse of dimensionality. Cramér-Rao bound. Likelihood approach | Overfitting, underfitting, and mis-specification. Bootstrap, simulations |
7* | Neyman-Pearson frequentism. Long-run action | Type I and II errors, confidence intervals, p-values, PPV, effect sizes, power analysis, range predictions, \(H_0\), \(H_A\) | How to ask better questions. Stahel’s relevance, Interval \(\mathcal{H}\) testing. Default action |
8 | Frequentist vs Likelihood vs Bayesian inference | Popperian philosophical roots, practical agreements, strengths and weaknesses | Discussion on practical interpretation: path of action vs devotion vs belief |
9 | Dead Salmon Experiment. Replication Crisis | multiple testing, HARKing, snooping, publication bias, open science, underpowered studies | Examination of controversies in medicine, psychology, and social science |
Module III: Applied Bayesian Statistics
After the first two modules, we should have a solid foundation for building more realistic models. As discussed in the introduction, randomized experiments and A/B tests have obvious limitations: they might be infeasible, unethical, or simply not the right tool. Even when we design experiments or surveys, we still need to model non-response, correct for biases, and account for confounds – in other words, engage in modeling.
In my opinion, among many excellent resources, Richard McElreath’s 2023 “Statistical Rethinking” lectures are the best way to learn the kind of modeling that fits our course and problem space.
I think that the best way to study regression and generalized linear models is to think very carefully about the causal processes underlying the phenomenon we’re studying. The big advantage of switching to a Bayesian perspective and GLMs is that we treat all models within a unified framework. This means we can focus on the science and much less on the mathematical intricacies of advanced econometrics.
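To make the “unified framework” point concrete, here is a minimal sketch of what that looks like in practice. It assumes the brms package and simulated placeholder data, both of which are my illustrative choices rather than course requirements: the same formula-plus-family interface covers the linear, binomial, and hierarchical Poisson models that appear in the table below.

```r
# A minimal sketch of the unified formula-plus-family interface, using the brms
# package and simulated placeholder data (both are illustrative assumptions).
library(brms)

set.seed(1)
d <- data.frame(
  weight = rnorm(200, 60, 10),
  dept   = sample(c("A", "B", "C"), 200, replace = TRUE),
  school = sample(1:20, 200, replace = TRUE)
)
d$height   <- 100 + 0.9 * d$weight + rnorm(200, 0, 5)   # continuous outcome
d$applied  <- rpois(200, 30) + 1                        # binomial denominator
d$admitted <- rbinom(200, d$applied, prob = 0.3)        # binomial numerator
d$absences <- rpois(200, lambda = 3)                    # count outcome

# Gaussian linear regression (cf. Lecture 2)
m1 <- brm(height ~ weight, family = gaussian(), data = d)

# Aggregated binomial GLM (cf. Lecture 5)
m2 <- brm(admitted | trials(applied) ~ dept, family = binomial(), data = d)

# Hierarchical Poisson GLM with a varying intercept per school (cf. Lectures 6-7)
m3 <- brm(absences ~ 1 + (1 | school), family = poisson(), data = d)
```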
Lecture | Agenda | Keywords | Case Studies / Activities |
---|---|---|---|
1 | Introduction to Bayesian Statistics | Estimating proportions with Beta-Binomial, counts with Gamma-Poisson, defining priors | Elections, Football Goals, Speed of Light |
2 | Regression and the Bayesian Workflow | DAGs of influence, categorical variables, assumptions, nonlinearities, splines, simulation | E2E example on Howell’s study on height and weight. Summarizing estimands |
3 | Confounds. Good and Bad controls | Deep-dive on translating DAGs into statistical models. Adjustment set, backdoor criterion | E2E example of Waffle houses and divorce rates. Spurious correlations |
~ | Overfitting and Validation | Prior and posterior checks, LOO-CV, LOO-PIT, regularization, robust regression, simulation-based calibration | TBD |
4 | Multiple Groups and Hierarchical Models | Complete pooling, No pooling, Partial pooling. Post-stratification and non-representative samples | Radon concentration, Starbucks wait times, Customer LTV, Elections polls |
5 | Binomial GLMs | Link functions, Simpson’s paradox | 1973 Berkeley admissions |
6 | Poisson GLMs | Overdispersion, zero-inflated, Negative Binomial | Campus Crime, Student absence |
7 | Hierarchical GLMs | Deep-dive into model formulation. Studies on stage anxiety and pulmonary fibrosis | Basketball Fouls, Airbnb reviews, Seat Belts |
8 | Longitudinal Data | Challenges of multilevel time series | Teenage drinking, sleep study |
~ | Special topics | BART and Gaussian Processes | TBD |
~ | Special topics | Missing data and measurement error | TBD |
Python/R Computational Toolbox
To make the most of the three modules, we need to be skilled (or at least competent) programmers from a data science perspective. This is a skill that can be developed independently – I give a few recommendations in the introduction (prerequisites section).
In this course I rely heavily on the principles of reproducible research, which rest on three key aspects: data, the computational environment, and code.
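To make the environment part concrete, here is a minimal sketch of a typical renv-based workflow (one reasonable way to do it, not the only one):

```r
# Typical renv workflow for pinning the computational environment (a sketch).
# install.packages("renv")   # one-time setup

renv::init()       # create a project-local library and a lockfile (renv.lock)
# ...install and use packages as usual, e.g. install.packages("dplyr")...
renv::snapshot()   # record the exact package versions in renv.lock
renv::restore()    # on another machine: reinstall the recorded versions

set.seed(2024)     # make stochastic code (simulations, resampling) reproducible
sessionInfo()      # document the R version and loaded packages
```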
The second thing we need, so that programming does not slow us down or get in our way, is a good, pleasant, and reliable development environment. Here is a starting, non-exhaustive list of tools you will need – first for R, then for Python.
For Windows 11+ users, I strongly recommend using WSL2 and getting familiar with the command line and Linux. Between R and Python, I choose technologies that make it easy to draw parallels and transition from one to the other.
- RStudio as our main IDE
- Quarto for literate programming
- git & github
- renv for managing environments
- database: duckdb (analytical, in-process)
- data wrangling: tidyverse, data.table (bigger data), arrow
- data visualization: ggplot, plotly
- interactive applications: shiny
- APIs: plumber
R has many gotchas, which is the main reason the language feels hard. Therefore, we need to cover some additional concepts in R, especially functional programming and the S3 object system. The tidyverse is an important ecosystem that tries to solve many of the issues in base R.
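To give a small taste of what I mean, here is a toy example (illustrative only, not part of the course materials) of a classic base-R gotcha and the bare bones of S3 dispatch:

```r
# A classic base-R gotcha: factors convert via their integer codes, not their labels
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3   -- the level indices
as.numeric(as.character(f))   # 10 20 30 -- the intended conversion

# S3 in a nutshell: the same generic dispatches to a different method per class
summary(c(1, 2, 3))                 # numeric method: quartiles, mean
summary(factor(c("a", "a", "b")))   # factor method: counts per level

# Defining our own generic and methods
describe <- function(x, ...) UseMethod("describe")
describe.default    <- function(x, ...) cat("a vector of length", length(x), "\n")
describe.data.frame <- function(x, ...) cat(nrow(x), "rows,", ncol(x), "columns\n")
describe(1:5)      # dispatches to describe.default
describe(mtcars)   # dispatches to describe.data.frame
```

The corresponding Python stack: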
- VS Code as our main IDE
- Jupyter Notebooks or Quarto for literate programming
- git & github
- uv (recommended) or conda for managing environments
- database: duckdb (analytical, in-process), sqlite (transactional, in-process)
- data wrangling: pandas 2.0, polars (bigger data), pyarrow
- interactive applications: streamlit, shiny, or dash
- APIs: fastapi
A selection of realistic datasets
Deciding which topic, domain, and problem to choose for your project is a daunting task in itself. During the course, I present tools for formulating good questions and designing your research study. Unfortunately, in most cases we will be limited by what open data is available. We run into the same problem as with “too much content”: there is so much data to work with, but most of it isn’t any good for our purposes.
You can rely entirely on simulation, but the way you design the underlying data-generating process has to be very well thought out, informed by the specifics of the problem and an expectation of what data we’ll encounter in practice.
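As an illustration, a toy data-generating process for daily product demand might look like the sketch below (all parameters are invented): weekday seasonality, occasional promotions, overdispersed counts, and censoring of observed sales by stock-outs.

```r
# A toy data-generating process for daily demand (all parameters invented):
# weekday seasonality, promotions, overdispersed counts, and stock-out censoring.
set.seed(42)

n_days   <- 365
weekday  <- rep(1:7, length.out = n_days)
promo    <- rbinom(n_days, size = 1, prob = 0.1)             # ~10% of days on promotion
wday_eff <- c(0, 0.05, 0.05, 0.1, 0.2, 0.5, 0.4)             # weekend uplift on the log scale
mu       <- exp(log(20) + wday_eff[weekday] + 0.6 * promo)   # expected demand per day

demand <- rnbinom(n_days, mu = mu, size = 5)                 # size controls overdispersion
sales  <- pmin(demand, 35)                                   # we only observe censored sales

df <- data.frame(day = 1:n_days, weekday, promo, demand, sales)
head(df)
```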
Finding good data about business challenges is incredibly hard, since few firms are open to sharing it. Even when we get our hands on a good Kaggle competition, the problem is that the data has already been framed, curated, and put together for us. This means we don’t get the real experience of end-to-end problem solving that we would encounter in practice. Therefore, I collected a list of datasets that are realistic and diverse enough. Consider them as inspiration and a starting point.
# | Dataset name | Domain | Problem / Area | Comments |
---|---|---|---|---|
1 | Yaz restaurant demand | business | Newsvendor problem | A real dataset from a restaurant in Stuttgart, used for benchmarking inventory optimization algorithms. Source: ddop |
2 | Mercari vintage clothes marketplace | business | Price suggestion | An excellent dataset for GLMs and machine learning, where the text data is important. Source: kaggle / Mercari |
3 | Avocado prices and volumes from the Hass Avocado Board | agriculture | Market research | This is a good opportunity to understand the dynamics of a whole industry. Source: kaggle / hass |
4 | Olist e-commerce | business | EDA, databases | A very rare example of freely available data published as a relational database. Good for open-ended projects. Source: kaggle |
5 | Corporación Favorita | business | Demand planning | One of the best datasets to practice demand forecasting for groceries. Source: kaggle |
6 | Tevec retail sales | business | Inventory management | Good for short-term demand forecasting for the purposes of inventory optimization. Source: kaggle |
7 | Lending club loans | finance | Credit risk | A large dataset of peer-to-peer, graded loans, with data about clients, loans, and interest rates. Source: kaggle |
8 | DataCo orders | business | Logistics and fulfillment | One of the very few datasets that lets you get a better grasp of outbound and inbound logistics. Source: kaggle |
9 | Criteo Campaign | business | Marketing | The largest randomized controlled trial marketing campaign dataset for uplift modeling. Source: kaggle |
10 | Ease my trip | airlines | Price prediction | This dataset has a few gotchas related to how the flights were selected and the unavailability of remaining-seats information. Source: kaggle |
11 | Amazon reviews for beauty products | business | NLP, Customer analytics | Analyzing customer reviews and feedback is a widespread use-case in businesses. Lots of data is available. Source: nijianmo |
12 | Expresso churn prediction | telecom | Churn | We will discuss the challenges around modeling customer churn, especially survival models and interventions. Source: kaggle |
13 | Santander customer satisfaction | banking | Churn | Since the data is anonymized and mostly numeric, we would have to take an ML approach. Source: kaggle |
14 | Supply chain allocation | business | Logistics and shipments | Another rare dataset on logistics, in which you assign routes to purchase orders from manufacturers. Source: kaggle |
15 | Hospital customer satisfaction | healthcare | Customer analytics | A pretty large-scale, general-purpose survey on client satisfaction. A lot of EDA and data cleaning is needed. Source: kaggle |
16 | Bike sharing demand | transportation | Demand planning | Several datasets from different cities and bike-sharing firms: Washington, Boston, London |
17 | NYC subway rides | transportation | Demand planning | Another aspect of demand planning is load prediction and minimizing delays: NYC traffic, Toronto subway delays, NYC entry and exits |
18 | Taxi trips | transportation | Demand planning, pricing | Multiple datasets from different firms and cities: Chicago, NYC taxis, NYC uber |
The datasets in this curated list don’t appear in most of the books and resources that I recommend. The challenges they present are not easy and require quite a lot of work. Some of them will also teach you how to work with larger amounts of data. That said, there is still a lot of value in didactic examples, so here are a few more directions I recommend looking into:
- P. Fader, B. Hardie, and E. Ascarza’s research on BTYD (buy-till-you-die) models of customer repurchase behavior, churn, and LTV
- This website on Bayesian networks has a few amazing case-studies on self-worth and depression, which are perfect for practicing causal thinking
- Facebook has synthesised the recent research in Marketing Mix Modeling into their open-source project called Robyn.
- Rohan Alexander has a good example of multilevel modeling with post-stratification on the US 2020 election. Andrew Gelman’s “Regression and Other Stories” has lots of great examples from political and social science.