Open Datasets
Business applications in diverse industries
It is very hard to find good, realistic datasets which map well on representative use-cases from this course. This is why I curated a list of public datasets in a variety of domains. These should help if you don’t know where to start your project, that is don’t have particular problems, hypotheses, or research questions in mind.
A selection of realistic datasets
Deciding what topic, domain, and problem to choose for your project is a daunting task in itself. During the course, I present new tools to formulate good questions and how to design your research study. Unfortunately, in most cases, we will be limited by what open data is available. We encounter the same problem of “too much content”: there is so much data to work with, but most of it isn’t any good for our purposes.
You can rely entirely on simulation, but the way you desing the underlying data-generating process has to be very well thought out, informed by the specificities of the problem and an expectation of what data we’ll encounter in practice.
Finding good data about business challenges is incredibly hard, since few firms will be open to sharing it. Even when we get our hands on a good kaggle competition, the problem is that the data has been already framed, curated, and put together for us. This means that we don’t get the real experience of end-to-end problem solving that we would encounter in practice. Therefore, I collected a list of datasets which are realistic and diverse enough. You can consider them as an inspiration and starting point.
Dataset name | Domain | Problem / Area | Comments | |
---|---|---|---|---|
1 | Yaz restaurant demand | business | Newsvendor problem | A real dataset used for benchmarking inventory optimization algorithms from a restaurant in Stuttgart. Source: ddop |
2 | Mercari vintage clothes marketplace | business | Price suggestion | An excellent dataset for GLMs and machine learning, where the text data is important. Source: kaggle / Mercari |
3 | Avocado prices and volumes from Hass board | agriculture | Market research | This is a good opportunity to understand the dynamics of a whole industry. Source: kaggle / hass |
4 | Olist e-commerce | business | EDA, databases | A very rare example of freely available data published as a relational database. Good for open-ended projects. Source: kaggle |
5 | Corporacion favorita | business | Demand planning | One of the best datasets to practice demand forecasting for groceries. Source: kaggle |
6 | Tevec retail sales | business | Inventory management | Good for short-term demand forecasting for the purposes of inventory optimization. Source: kaggle |
7 | Lending club loans | finance | Credit risk | A large dataset of peer-to peer, graded loans, with data about clients, loan, and interest. Source: kaggle |
8 | DataCo orders | business | Logistics and fulfillment | One of very few datasets for you to get a better grasp over outbound and inbound logistics. Source: kaggle |
9 | Criteo Campaign | business | Marketing | The biggest dataset of randomized control trial marketing campaign for Uplift modeling. Source: kaggle |
10 | Ease my trip | airlines | Price prediction | This dataset has a few gotchas related to how the flights were selected and unavailability of seats remaining. Source: kaggle |
11 | Amazon reviews for beauty products | business | NLP, Customer analytics | Analyzing customer reviews and feedback is a widespread use-case in businesses. Lots of data is available. Source: nijianmo |
12 | Expresso churn prediction | telekom | Churn | We will discuss the challenges around modeling customer churn, especially survival models and interventions. Source: kaggle |
13 | Santader customer satisfaction | banking | Churn | Since the data is anonymized and mostly numeric, we would have to take a ML approach. Source: kaggle |
14 | Supply chain allocation | business | Logistics and shipments | Another rare dataset on logistics, in which you assign routes to purchase orders from manufacturers. Source: kaggle |
15 | Hospital customer satisfaction | healthcare | Customer analytics | A pretty large-scale, general-purpose survey on client satisfaction. A lot of EDA and data cleaning is needed. Source: kaggle |
16 | Bike sharing demand | transportation | Demand planning | Several datasets from different cities and ride sharing firms: Washington, Boston, London |
17 | NYC subway rides | transportation | Demand planning | Another aspect of demand planning is load prediction and minimizing delays: NYC traffic, Toronto subway delays, NYC entry and exits |
18 | Taxi trips | transportation | Demand planning, pricing | Multiple datasets from different firms and cities: Chicago, NYC taxis, NYC uber |
The curated list of datasets doesn’t appear in most books and resources that I recommend. The challenges they present are not easy and require quite a lot of work. Some of them will also teach you how to work with larger amounts of data. That said, there is still a lot of value in the didactic examples, so here are a few more directions I recommend to look into:
- P. Fader, B. Hardie, and E. Ascarza research on BTYD (buy till you die) models of customer repurchase behavior, churn, and LTV
- This website on Bayesian networks has a few amazing case-studies on self-worth and depression, which are perfect for practicing causal thinking
- Facebook has synthesised the recent research in Marketing Mix Modeling into their open-source project called Robyn.
- Rohan Alexander has a good example of multilevel modeling with post-stratification on US 2020 elections. Andrew Gelman in “Regression and other stories” has lots of great examples from political and social science.
I use a relational dataset from a real e-commerce which was made public in Kaggle, in order to showcase how to interact with databases and highlight the importance of knowing SQL.
We use a recent innovation in databases, duckDB, which allows us to have an analytic, in-process database, with almost zero setup or dependencies. Once the e-commerce data is loaded and modeled inside the DB, we can start cleaning it, putting it all together, and extracting insights from it. This particular dataset is very rich in information and lends itself well to open-ended investigations.