Module I: Business School for Data Scientists
The “business school” emphasizes, again and again, the idea of Decision-Making Under Uncertainty at Scale, across many industries and use-cases. We walk through three perspectives – Analytics, Machine Learning, and Statistics – and then develop processes for each one. This module is heavy on methodology, since I want to bring science back into “data science”, but don’t worry: there will be plenty of case studies.
Lectures with more mathematics are marked with (*), programming-focused ones in bold, and methodology lectures with (~).
Lecture | Agenda | Key ideas | Case Studies / Activities |
---|---|---|---|
1 | Introduction & Course Overview | Question \(\rightarrow\) Outcomes, Modeling, 3 challenges of statistics, Prerequisites, Project requirements. Inferential vs Causal | purpose of a business, survey of applications and domains |
2 | Decision-Making Under Uncertainty at Scale | Weak AI and Cybernetics. Analytics vs Stats vs ML. Scientific process is all you need – why methodology is important | Fermi estimation, demystify buzzwords, Wikipedia A/B test |
3 | Data Science in Business Context | Performance trajectories, Strategy, VUCA, Business Analyst’s Workflow, Value Chain meets DS use-cases | Fermi estimation process, overview of realistic datasets |
4* | Newsvendor Problem and Demand Planning | Objectives, decisions, uncertainty. Tradeoffs. Stories behind distributions: traffic, slow-moving, intermittent, luxury | TIME INC. printing, simulations, Yaz restaurant data |
~ | Study Design | Question-first vs data-first. Estimands. Sampling, measurement, validity and reliability. Biases, artifacts, and confounds | FINER, PICOT. Confounds, mediators, colliders, and chains |
~ | Scientific writing | Writing as thinking and communication. Structure of a paper | Good titles, abstracts |
5* | What is ML? Learning, Intuition, and Bias. | Patterns, learning problem, bias-variance tradeoff. Experience and expertise. Model, algorithm, data. Generalization | overview of learning paradigms, models, and use-cases |
6* | 12 Steps of ML | Feasibility, objectives, splitting data, training/tuning, validation, testing, deployment and launch | E2E example of the Mercari pricing challenge |
7* | 12 Steps of Statistics. A/B Testing Scheme | default action, hypotheses, data collection strategies, simulation, confidence intervals, relevance, effect sizes, power | conversion rate optimization and marketing |
8* | Churn and Lifetime Value | Churn, repurchase, contractual and non-contractual settings | BTYD and survival models |
9* | Price optimization and revenue management | positioning, limited supply, demand elasticity, dynamic pricing, promotion, willingness to pay | stories of concerts, airlines, taxis/rides |
~ | Stories from the trenches | Personal exposition of challenges in marketing, demand planning, CRM, and recommender systems | building a thriving data ecosystem and growing competence |
Module II: Probability and Statistics
There are many pitfalls in statistics, and most of them come from misunderstanding the nature of statistical inference, applying methods mechanically, and an insufficient grasp of the fundamentals. I prefer to fill in the gaps with stories and simulations; however, in some cases mathematical formalism and abstraction can’t be avoided and actually help understanding.
I think the best way to start understanding inferential statistics is Daniel Lakens’ “Improving Your Statistical Inferences”, combined with another resource such as Speegle’s “Probability, Statistics, and Data” for a deep dive into the technicalities.
Fundamental, therefore, does not mean easy, basic, or trivial. Most courses make you solve puzzles; here, I ask you to appropriately define, justify, and apply the choice of method and tool in the context of business applications.
Lecture | Agenda | Key Ideas | Case Studies / Activities |
---|---|---|---|
1* | Combinatorics. Sampling & Urn Models | Sampling with and without replacement, multiplication rule, naive \(\mathbb{P}\), probability trees | Birthday Paradox simulation, Why bootstrap works |
2 | Probability Triple. Random Variables. What is a model? | collectivity, population, sample; estimand, estimator, estimation; \((\Omega, \mathcal{F}, \mathbb{P})\), \(X\) r.v., \(H(y \mid F(u), \phi(y, u))\) | Probability vs Statistics. \(\mathcal{H}\), Process models, Statistical models, Inference |
3 | Stories behind distributions | Binomial, Poisson, Mixtures, \(N(\mu, \sigma)\), \(\chi^2_k\), \(t_k\), F, Beta, Gamma, Exponential, NBin, HGeom | couples showing up to safari, arrival times, basketball shots, hot hand |
4* | Conditioning and Bayes Rule. Likelihood Ratios | Updating prior beliefs, Bayesian vs frequentist interpretation of probability. Graphical models | Monty Hall, Simpson’s Paradox, Football Spreads, Medical testing |
5* | CLT and LLNs. Convergence types. Estimator properties | Never forget about sample size. Why simulation works. Limitations: rare events, fat tails, mixtures, non-iid | The most dangerous equation: US schools, gold coins. Simulation |
6* | Fisher Information. Bias-Variance tradeoff | Efficiency, bias, James-Stein paradox and the curse of dimensionality. Cramér-Rao bound. Likelihood approach | Overfitting, underfitting, and mis-specification. Bootstrap, simulations |
7* | Neyman-Pearson frequentism. Long-run action | Type I and II errors, confidence intervals, p-values, PPV, effect sizes, power analysis, range predictions, \(H_0\), \(H_A\) | How to ask better questions. Stahel’s relevance, Interval \(\mathcal{H}\) testing. Default action |
8 | Frequentist vs Likelihood vs Bayesian inference | Popperian philosophical roots, practical agreements, strengths and weaknesses | Discussion on practical interpretation: path of action vs devotion vs belief |
9 | Dead Salmon Experiment. Replication Crisis | multiple testing, HARKing, snooping, publication bias, open science, underpowered studies | Examination of controversies in medicine, psychology, and social science |
Module III: Applied Bayesian Statistics
After the first two modules, we should have a solid foundation for building more realistic models. As discussed in the introduction, randomized experiments and A/B tests have obvious limitations: they might be infeasible, unethical, or simply not the right tool. Even when we design experiments or surveys, we still need to model non-response, correct for biases, and account for confounds – in other words, engage in modeling.
In my opinion, among many excellent resources, Richard McElreath’s 2023 “Statistical Rethinking” lectures are the best way to learn the kind of modeling that fits our course and problem space.
I think that the best way to study regression and generalized linear models is to think very carefully about the causal processes underlying the phenomenon we’re studying. The big advantage of switching to a Bayesian perspective and GLMs is that we treat all models within a unified framework. This means we can focus on the science and much less on the mathematical intricacies of advanced econometrics.
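To make the “unified framework” point concrete, here is a minimal sketch of what that looks like in practice. It assumes the brms package and simulated placeholder data, both of which are my illustrative choices rather than course requirements: the same formula-plus-family interface covers the linear, binomial, and hierarchical Poisson models that appear in the table below.

```r
# A minimal sketch of the unified formula-plus-family interface, using the brms
# package and simulated placeholder data (both are illustrative assumptions).
library(brms)

set.seed(1)
d <- data.frame(
  weight = rnorm(200, 60, 10),
  dept   = sample(c("A", "B", "C"), 200, replace = TRUE),
  school = sample(1:20, 200, replace = TRUE)
)
d$height   <- 100 + 0.9 * d$weight + rnorm(200, 0, 5)   # continuous outcome
d$applied  <- rpois(200, 30) + 1                        # binomial denominator
d$admitted <- rbinom(200, d$applied, prob = 0.3)        # binomial numerator
d$absences <- rpois(200, lambda = 3)                    # count outcome

# Gaussian linear regression (cf. Lecture 2)
m1 <- brm(height ~ weight, family = gaussian(), data = d)

# Aggregated binomial GLM (cf. Lecture 5)
m2 <- brm(admitted | trials(applied) ~ dept, family = binomial(), data = d)

# Hierarchical Poisson GLM with a varying intercept per school (cf. Lectures 6-7)
m3 <- brm(absences ~ 1 + (1 | school), family = poisson(), data = d)
```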
Lecture | Agenda | Keywords | Case Studies / Activities |
---|---|---|---|
1 | Introduction to Bayesian Statistics | Estimating proportions with Beta-Binomial, counts with Gamma-Poisson, defining priors | Elections, Football Goals, Speed of Light |
2 | Regression and the Bayesian Workflow | DAGs of influence, categorical variables, assumptions, nonlinearities, splines, simulation | E2E example on Howell’s study on height and weight. Summarizing estimands |
3 | Confounds. Good and Bad controls | Deep-dive on translating DAGs into statistical models. Adjustment set, backdoor criterion | E2E example of Waffle houses and divorce rates. Spurious correlations |
~ | Overfitting and Validation | Prior and posterior checks, LOO-CV, LOO-PIT, regularization, robust regression, simulation-based calibration | TBD |
4 | Multiple Groups and Hierarchical Models | Complete pooling, No pooling, Partial pooling. Post-stratification and non-representative samples | Radon concentration, Starbucks wait times, Customer LTV, Elections polls |
5 | Binomial GLMs | Link functions, Simpson’s paradox | 1973 Berkeley admissions |
6 | Poisson GLMs | Overdispersion, zero-inflated, Negative Binomial | Campus Crime, Student absence |
7 | Hierarchical GLMs | Deep-dive into model formulation. Studies on stage anxiety and pulmonary fibrosis | Basketball Fouls, Airbnb reviews, Seat Belts |
8 | Longitudinal Data | Challenges of multilevel time series | Teenage drinking, sleep study |
~ | Special topics | BART and Gaussian Processes | TBD |
~ | Special topics | Missing data and measurement error | TBD |
Python/R Computational Toolbox
To make the most of the three modules, we need to be skilled (or at least competent) programmers from a data science perspective. This is a skill that can be developed independently – I give a few recommendations in the introduction (prerequisites section).
In this course I rely heavily on the principles of reproducible research, which rest on three key aspects: data, the computational environment, and code.
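To make the environment part concrete, here is a minimal sketch of a typical renv-based workflow (one reasonable way to do it, not the only one):

```r
# Typical renv workflow for pinning the computational environment (a sketch).
# install.packages("renv")   # one-time setup

renv::init()       # create a project-local library and a lockfile (renv.lock)
# ...install and use packages as usual, e.g. install.packages("dplyr")...
renv::snapshot()   # record the exact package versions in renv.lock
renv::restore()    # on another machine: reinstall the recorded versions

set.seed(2024)     # make stochastic code (simulations, resampling) reproducible
sessionInfo()      # document the R version and loaded packages
```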
The second thing we need, so that programming does not slow us down or get in our way, is a good, pleasant, and reliable development environment. Here is a starting, non-exhaustive list of tools you will need – first for R, then for Python.
For Windows 11+ users, I strongly recommend using WSL2 and getting familiar with the command line and Linux. Between R and Python, I choose technologies that make it easy to draw parallels and transition from one to the other.
- RStudio as our main IDE
- Quarto for literate programming
- git & github
- renv for managing environments
- database: duckdb (analytical, in-process)
- data wrangling: tidyverse, data.table (bigger data), arrow
- data visualization: ggplot, plotly
- interactive applications: shiny
- APIs: plumber
R has many gotchas, which is the main reason the language feels hard. Therefore, we need to cover some additional concepts in R, especially functional programming and the S3 object system. The tidyverse is an important ecosystem that tries to solve many of the issues in base R.
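To give a small taste of what I mean, here is a toy example (illustrative only, not part of the course materials) of a classic base-R gotcha and the bare bones of S3 dispatch:

```r
# A classic base-R gotcha: factors convert via their integer codes, not their labels
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3   -- the level indices
as.numeric(as.character(f))   # 10 20 30 -- the intended conversion

# S3 in a nutshell: the same generic dispatches to a different method per class
summary(c(1, 2, 3))                 # numeric method: quartiles, mean
summary(factor(c("a", "a", "b")))   # factor method: counts per level

# Defining our own generic and methods
describe <- function(x, ...) UseMethod("describe")
describe.default    <- function(x, ...) cat("a vector of length", length(x), "\n")
describe.data.frame <- function(x, ...) cat(nrow(x), "rows,", ncol(x), "columns\n")
describe(1:5)      # dispatches to describe.default
describe(mtcars)   # dispatches to describe.data.frame
```

The corresponding Python stack: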
- VS Code as our main IDE
- Jupyter Notebooks or Quarto for literate programming
- git & github
- uv (recommended) or conda for managing environments
- database: duckdb (analytical, in-process), sqlite (transactional, in-process)
- data wrangling: pandas 2.0, polars (bigger data), pyarrow
- interactive applications: streamlit, shiny, or dash
- APIs: fastapi
A selection of realistic datasets
Deciding which topic, domain, and problem to choose for your project is a daunting task in itself. During the course, I present tools for formulating good questions and designing your research study. Unfortunately, in most cases we will be limited by what open data is available. We run into the same problem as with “too much content”: there is so much data to work with, but most of it isn’t any good for our purposes.
You can rely entirely on simulation, but the way you design the underlying data-generating process has to be very well thought out, informed by the specifics of the problem and an expectation of what data we’ll encounter in practice.
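As an illustration, a toy data-generating process for daily product demand might look like the sketch below (all parameters are invented): weekday seasonality, occasional promotions, overdispersed counts, and censoring of observed sales by stock-outs.

```r
# A toy data-generating process for daily demand (all parameters invented):
# weekday seasonality, promotions, overdispersed counts, and stock-out censoring.
set.seed(42)

n_days   <- 365
weekday  <- rep(1:7, length.out = n_days)
promo    <- rbinom(n_days, size = 1, prob = 0.1)             # ~10% of days on promotion
wday_eff <- c(0, 0.05, 0.05, 0.1, 0.2, 0.5, 0.4)             # weekend uplift on the log scale
mu       <- exp(log(20) + wday_eff[weekday] + 0.6 * promo)   # expected demand per day

demand <- rnbinom(n_days, mu = mu, size = 5)                 # size controls overdispersion
sales  <- pmin(demand, 35)                                   # we only observe censored sales

df <- data.frame(day = 1:n_days, weekday, promo, demand, sales)
head(df)
```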
Finding good data about business challenges is incredibly hard, since few firms are open to sharing it. Even when we get our hands on a good Kaggle competition, the problem is that the data has already been framed, curated, and put together for us. This means we don’t get the real experience of end-to-end problem solving that we would encounter in practice. Therefore, I collected a list of datasets that are realistic and diverse enough. Consider them as inspiration and a starting point.
# | Dataset name | Domain | Problem / Area | Comments |
---|---|---|---|---|
1 | Yaz restaurant demand | business | Newsvendor problem | A real dataset from a restaurant in Stuttgart, used for benchmarking inventory optimization algorithms. Source: ddop |
2 | Mercari vintage clothes marketplace | business | Price suggestion | An excellent dataset for GLMs and machine learning, where the text data is important. Source: kaggle / Mercari |
3 | Avocado prices and volumes from the Hass Avocado Board | agriculture | Market research | This is a good opportunity to understand the dynamics of a whole industry. Source: kaggle / hass |
4 | Olist e-commerce | business | EDA, databases | A very rare example of freely available data published as a relational database. Good for open-ended projects. Source: kaggle |
5 | Corporación Favorita | business | Demand planning | One of the best datasets to practice demand forecasting for groceries. Source: kaggle |
6 | Tevec retail sales | business | Inventory management | Good for short-term demand forecasting for the purposes of inventory optimization. Source: kaggle |
7 | Lending club loans | finance | Credit risk | A large dataset of peer-to-peer, graded loans, with data about clients, loans, and interest rates. Source: kaggle |
8 | DataCo orders | business | Logistics and fulfillment | One of the very few datasets that lets you get a better grasp of outbound and inbound logistics. Source: kaggle |
9 | Criteo Campaign | business | Marketing | The largest randomized controlled trial marketing campaign dataset for uplift modeling. Source: kaggle |
10 | Ease my trip | airlines | Price prediction | This dataset has a few gotchas related to how the flights were selected and the unavailability of remaining-seats information. Source: kaggle |
11 | Amazon reviews for beauty products | business | NLP, Customer analytics | Analyzing customer reviews and feedback is a widespread use-case in businesses. Lots of data is available. Source: nijianmo |
12 | Expresso churn prediction | telecom | Churn | We will discuss the challenges around modeling customer churn, especially survival models and interventions. Source: kaggle |
13 | Santander customer satisfaction | banking | Churn | Since the data is anonymized and mostly numeric, we would have to take an ML approach. Source: kaggle |
14 | Supply chain allocation | business | Logistics and shipments | Another rare dataset on logistics, in which you assign routes to purchase orders from manufacturers. Source: kaggle |
15 | Hospital customer satisfaction | healthcare | Customer analytics | A pretty large-scale, general-purpose survey on client satisfaction. A lot of EDA and data cleaning is needed. Source: kaggle |
16 | Bike sharing demand | transportation | Demand planning | Several datasets from different cities and bike-sharing firms: Washington, Boston, London |
17 | NYC subway rides | transportation | Demand planning | Another aspect of demand planning is load prediction and minimizing delays: NYC traffic, Toronto subway delays, NYC entry and exits |
18 | Taxi trips | transportation | Demand planning, pricing | Multiple datasets from different firms and cities: Chicago, NYC taxis, NYC uber |
The datasets in this curated list don’t appear in most of the books and resources that I recommend. The challenges they present are not easy and require quite a lot of work. Some of them will also teach you how to work with larger amounts of data. That said, there is still a lot of value in didactic examples, so here are a few more directions I recommend looking into:
- P. Fader, B. Hardie, and E. Ascarza’s research on BTYD (buy-till-you-die) models of customer repurchase behavior, churn, and LTV
- This website on Bayesian networks has a few amazing case-studies on self-worth and depression, which are perfect for practicing causal thinking
- Facebook has synthesised the recent research in Marketing Mix Modeling into their open-source project called Robyn.
- Rohan Alexander has a good example of multilevel modeling with post-stratification on the US 2020 election. Andrew Gelman’s “Regression and Other Stories” has lots of great examples from political and social science.