Are you ready for an adventure?
There is one question I lose sleep over: “How should data science be taught?” The vision and roadmap outlined on this website emerge out of:

- my industry experience as a data scientist, ML engineer, then leader
- teaching creative, motivated, but confused students
- my frustrations with the shortcomings of statistical education

As the author of the course (book), all views and mistakes are entirely my own and do not represent any organization.
I aspire to contribute to the understanding of this complex landscape and teach people how to navigate it, how to develop valuable skills, and become more effective at problem-solving. To achieve this, we need to master “the fundamentals.”
I believe that improving decision-making is the reason why we practice “AI”.1 The danger when a program lacks a unified vision and a coordinated roadmap is that we miss the interdisciplinary, big picture. Local optimization, global valley. We forget the why.
1 We will have to clarify all the loaded terms like AI, Cybernetics, Data Science, Analytics, ML, and Data Mining
The commitment needed to get started in the field is so high that a student is guaranteed to get lost in the details and be overwhelmed by all those 800-page hardcore textbooks. On the other hand, seemingly pragmatic cookbook- and bootcamp-style approaches miss depth, nuance, key theoretical ideas, and methodological aspects.
Decision-Making for the Brave and True
This is an introductory, free, and open-source course which bridges the gap between theory and practice,2 cultivating the skills and understanding necessary to bring value to organisations by improving decision-making. It is an attempt to find a golden mean between the two worlds – so that you choose your own path intentionally. Here’s the metaphor:
2 There is a similar gap between the “math world” of elegant models / abstractions and the messy “real world” of decisions in businesses
- Illuminating theoretical ideas (contemplating in the library)
- Practicing battle-tested technologies (engineering in the trenches).
This is the best advice I ever got as a novice data scientist. It is easy to lose track of this idea, especially when enamored with the technicalities of statistics and machine learning.
This is the course I wish I had when starting my journey in data science – one that would have prepared me for the realities of industry, which are often very different from the academic world. In my teaching, I achieve this by putting business decisions, understanding the domain, and working software at the forefront of each (statistical) tool we learn.
This course is for anyone lost, confused, stuck, or overwhelmed by the complexities of Data Science and Machine Learning, who wants to see the big picture, a path forward, and the possibilities.
If you stumbled upon this website, you’re probably a student in Business Analytics, or know me personally – well, because I shamelessly promoted it.
Maybe, you’re an engineer getting curious about ML or an analyst with a knack for the business, looking to improve your workflow and expand the quantitative toolbox. Maybe you’re a product manager or an entrepreneur who wants to infuse AI into your startup.
In my opinion, junior data scientists and ML practitioners a few years into their journey will benefit the most from the re-contextualization of fundamentals that I’m doing here, which could enable them to take another leap in their career.
The course is challenging due to its breadth across disciplines and depth (starting from fundamentals).3 In service of understanding, it becomes conceptual at times, but I hope you bear with me until you see the benefits of those abstractions. Keep in mind that:
3 Of course, you can just listen, read and extract valuable insights, but the course ultimately fails without deep self-study.
- It is NOT a theoretical course, nor a replacement for probability, mathematical statistics, and other prerequisites. There are two possibilities:
    - You already had those courses and will look at them with a fresh perspective
    - Otherwise, treat it as a learning roadmap of the most relevant ideas
- At the same time, this is NOT a bootcamp, nor a replacement for learning programming in Python/R/SQL for data science
    - Therefore, it is not enough by itself to land you a job
    - For that, there are many wonderful resources for practice and study
Think of it as a skeleton, a conceptual frame which ties together everything you have learned so far and can be built upon as you progress in your career and studies. You will probably go back to the same idea years later, with greater wisdom and skill – to unlock its real power. I think that we should embrace the fact that learning is not linear.
This course stands on the shoulders of giants, and I can only aspire to get to the level of clarity and rigor provided by Hastie/Tibshirani or Andrew Ng when it comes to ML, and Richard McElreath or Andrew Gelman on Bayesian Statistics / Causal Inference.
Given the over-abundance of learning resources, it is too much to sift through a hundred pages of Coursera’s data science section, plus countless books and tutorials, to find the best ones. There is a lot of great content, but it can be hard to decide where to start, and our time is very limited. Even worse, poor-quality resources can do more damage than good.
The roadmap provided here should help you navigate the landscape of data science and find the shortest path towards developing your skills and improving decisions in your firm. Pick a topic you’re passionate about – for example, generalized linear models – and look in the study guide for tips on where to read and how to practice.
I have to clarify what I mean by contemplating in the library and engineering in the trenches. In contemplation, we try to see how fundamental theoretical ideas translate into real-world stories and solutions to practical problems. We also focus on understanding the problem space in various domains, our clients’ needs and challenges – which enables us to ask the right questions, formulate the problem correctly, then pick the right model and methodology for the task. Last, but not least, we’ll gain insights into pitfalls and common mistakes to avoid.
During the course, I present stories and toy problems which motivate probability distributions, LLN/CLT, Bayes’ rule, sampling, measurement, etc. This is a way to highlight the practical relevance of otherwise dry theoretical ideas.
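To give a flavor of what such a toy problem looks like – this is my own illustration with made-up numbers, not necessarily one from the lectures – here is the classic rare-disease screening example, where Bayes’ rule shows why a positive test result is less alarming than intuition suggests:

```python
# Bayes' rule on the classic rare-disease screening example (illustrative numbers)
prevalence = 0.01          # P(disease)
sensitivity = 0.95         # P(positive | disease)
false_positive = 0.05      # P(positive | healthy)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.1%}")   # ~16%
```

Despite a 95%-sensitive test, the low base rate drags the posterior probability of disease down to roughly 16%.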
On the other hand, engineering in the trenches is a skill. The only way to become better at modeling and data analysis is hands-on practice. It is not sufficient to know about the modeling workflow and methodology. In practice, we have to develop pragmatic solutions, then test and implement our ideas in code. If we are to increase our chances of success, we’ll also have to follow best practices of reproducible research, automation, and model operationalization.
Why should you care?
You might’ve heard that data scientist is the sexiest job of the 21st century, that AI is going to take over repetitive jobs, that deep reinforcement learning models are beating people at Go, Dota, and chess, and solving almost-impossible protein-folding problems. The world is awash in the newest ChatGPT and Midjourney frenzy, with new developments every week and month. There are so many cool applications of AI and modeling.
You don’t have to keep up with all of it – at least not if you want to have a balanced life. That’s why I choose to focus on fundamentals which have stood the test of time, and which are in any case a prerequisite before you dive into the technicalities of those cutting-edge models and systems.
You will be surprised how far you can go with simple, even linear, models. In the end, it is not a competition, nor am I against deep learning: we learn to solve a very different class of problems that businesses encounter.
What does it actually mean, if we step outside the hype and buzzwords, use plain language, and apply these ideas to down-to-earth, day-to-day problems and challenges in businesses? Data science and AI can serve as decision-making support or be an essential part of the system/product itself, as in the case of Uber, Amazon, Netflix, Spotify, Google, and many others.
To get a better sense of what I mean by decision-making support, here is a list of challenges many data scientists are working on. Notice the common thread of optimization at scale, and that many of these applications relate to important decisions in a firm’s value chain. These problems also can’t be solved optimally by simple rule-based business logic or exploratory data analysis alone.
- Demand forecasting, inventory optimization, production planning
- Revenue management, pricing, and promotion optimization
- Advertisement, marketing mix, and conversion optimization
- Customer behavior and preferences:
    - Churn, repurchase, engagement, LTV, willingness to pay
- Fraud, credit default, insurance risk
- Choice modeling, recommender systems, targeting and uplift models
- Improving products, assortment, merchandising
Speaking from experience, every problem listed here is challenging. The good news is that by having a good grasp of fundamentals, we can tackle any one of them. What I can’t give you though, are universal recipes – they do not exist!
I want you to take away ONE thing: “AI” and Data Science in businesses boil down to Decision-Making under Uncertainty at Scale.
Even if you are not a data scientist, you will work with them in one role or another (Quant, Data Analyst, Business Analyst, ML Engineer, Data/BI Engineer, Decision-Maker, Domain Expert). Therefore, you have to understand their language and what they are doing, and know how to ask for and make sure they solve the right problem for you.
In your university years, you’ve probably been tortured by (or enjoyed) linear algebra, mathematical analysis, probability and statistics, operations research, differential equations, mathematical economics and cybernetics, algorithms and data structures, databases, object-oriented programming, econometrics, and so on.
We live in a volatile, uncertain, complex and ambiguous world,4 but we still have to make decisions. Those decisions will bring better outcomes if they are informed by understanding the causal processes, driven by evidence and robust predictions.
We studied all of those prerequisites precisely in order to be well equipped for such challenges. For a more in-depth explanation of the essence of each subject and its relevance for decision science, read section iv.
When somebody asks what you have learned in this book and course, I suggest two metaphors: one of simplification and another of seeing relations.
What will you learn?
I am tempted to say “act rationally” to improve business outcomes like top line, bottom line, customer satisfaction, and efficiency. However, what you will learn is much more nuanced than that – hence, the short answer below which will take a while to unpack.
\[\text{Question} \longrightarrow \text{Modeling} \longrightarrow \text{Insight} \longrightarrow \text{Action} \longrightarrow \text{Outcome}\]
Necessary, but insufficient,5 fundamentals of business economics and statistical modeling:
- in order to ask the right questions
- so that we can build custom statistical models
- which bring insight into consequences of actions and interventions
- so that we’re informed about which actions should be taken
- such that a firm can achieve its tactical and strategic objectives
5 The insufficiency comes from the fact that this course involves a lot of self-study and I can’t ever claim an exhaustive coverage of such a complex field
6 Imagine how many possible paths there are if, for every minute, we have a choice between 20 actions: read, eat, move, watch, etc.
At this point you’ve heard the word “model” a lot of times. In the most general sense, a mental, mathematical, or statistical model is a simplified representation of reality. Reality is overwhelming and combinatorially explosive in its possibilities.6 We build models because we want to understand a system and capture the essential aspects of the phenomena of interest, in order to make good decisions. This notion of understanding hints at the idea of causality: if we intervene, this is what is likely going to happen.
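To make the combinatorial explosion concrete – a back-of-the-envelope calculation of my own, using the footnote’s assumption of 20 possible actions per minute – a single hour already admits

\[20^{60} \approx 1.15 \times 10^{78}\]

possible sequences of actions, a number comparable to estimates of the count of atoms in the observable universe. No wonder we need simplified representations.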
We collect data, perform experiments, and build models in order to minimize the effect of biases and our foolishness. It’s also important to remember that all models have assumptions, presuppositions, and limitations. They are little machines, golems of Prague, which follow instructions precisely and can backfire if used outside their range of validity. Remember that all models are wrong, but some are useful.
A. Gelman highlights three different aspects of statistical inference. Always remember this when designing your study or learning about a new statistical tool! We want to generalize from sample to the population of interest, from treatment to control group, and from measurement to the underlying theoretical construct.
\[\text{Sample} \longrightarrow \text{Population}\]
\[\text{Treatment} \longrightarrow \text{Control}\]
\[\text{Measurement} \longrightarrow \text{Construct}\]
The holy grail is to build statistical models based on the causal processes, informed by theories and hypotheses. If we take into account how we measured7 and collected the data, we’ll increase our chances of generalizing our conclusions and will have stronger evidence.
7 We’ll dedicate one lecture to measurement, but you might have whole specialized classes on this topic
In organizing the teaching, my approach is to split the course into self-contained modules, which can be mixed and matched for diverse audiences and needs. The book will lag years behind my teaching, but the good news is that you can safely use the carefully curated lists of resources and suggested practices.
Conceptual understanding by itself is not enough. So, I curated a list of resources to practice on interesting case-studies and datasets, which directly apply the models, tools, and methodologies presented. These are written by experts in the field, are usually well thought out, easy to follow, and reproducible, and highlight important aspects of a problem and model.
Also, keep an eye on the course GitHub repo, in which we’ll do a lot of simulations, build some exciting projects (full-stack data apps), and investigate common challenges with a fresh perspective.
Prerequisites and background
I think this course will bring the most value to master’s students, professionals, and students in their last year of a BSc, because of the presumed exposure and competence in the following subjects. For more resources and explanations, see section iv.
- For Linear Algebra and Calculus – only exposure is needed, but competence and mathematical maturity will help a lot. If you need a refresher, a crash course will do – for example, Grant Sanderson’s “Essence of Linear Algebra”
- For Probability Theory – competence is needed, even though we start from the very beginning with a review of combinatorics. I suggest you read along and practice with a more comprehensive resource, like Joe Blitzstein’s “Statistics 110: Probability”
- For Mathematical Statistics, the story is the same as for Probability. You will need to be at least familiar with estimators, sampling distributions, LLN/CLT, hypothesis testing, and regression. This course can catalyze the process.
- Python or R programming for Data Science is mandatory, including data wrangling, visualization, scientific computing, databases. I recommend two free books:
    - “R for Data Science” by Hadley Wickham
    - “Python for Data Analysis” by Wes McKinney
    - Of course, the more experience you have in one or both, the better
- This is completely optional, but if you have to deal with financial statements at your job or have an interest in finance, I strongly suggest you check out A. Damodaran’s crash course on accounting and corporate finance.
Besides the technical prerequisites, there is a list of concepts and ideas from research design which can be helpful in practice, but are often not part of the statistical curriculum. It is worth investing a bit of time to get familiar with the following:
Taken in isolation, topics like measurement and sampling are very abstract. Therefore, they have to be integrated into the case-studies whenever they bring added value. This is not easy, so... work in progress.
- Measurement is especially relevant and tricky in social science. Sometimes, we do not measure what we think we do (underlying construct). For measuring abstract things like happiness or personality, we need to dive into scales, instrumentation, operationalization, validity, and reliability.
- Sampling isn’t limited to simple random samples, as these rarely occur in practice. We need to be aware of what tools exist so that we can generalize to the population of interest – for example, weighting and post-stratification.
- Threats to validity, including confounds, artifacts, and biases. On the one hand, we use statistics as a guardrail against foolishness; on the other hand, a lot of these same problems can compromise our own study.
- Reproducible research and literate programming have become a de-facto standard in the data science world. We have amazing tools right now, but it’s still not easy to achieve end-to-end reproducibility and automation.
- Academic writing is an important skill, even in businesses. Writing is a way of thinking, problem-solving, and communicating. In my opinion, the structure of a scientific paper is very helpful for crystallizing your ideas.
Last, but not least, there is value in understanding the scientific process in its broad sense. In section v., I adapt it to the context of data science in businesses and show how most processes and workflows, like CRISP-DM and Tukey’s EDA, are part of it.
Module 1: Business School for Data Scientists
Remember the advice “think of yourself as a business person with superpowers”? It means that as a data scientist, it’s helpful to think question, action, and outcome first – and only then about models. In other words, we have to deeply understand our client’s business domain and problem space. The best way to achieve that is experience and collaboration on the job. Therefore, my role is to tell as many stories from the trenches as I can, with deep dives in our case-studies, so that you know what to expect in practice.
Often, there is a disconnect between business economics, analytics classes, and ML / stats / data science. We can’t expect them to be coherent by default, but we can carefully design those synergies.
A good starting place is thinking about the purpose of a business from different perspectives. Then, we identify key decisions that firms have to make across many industries and domains. This exercise will show us how widespread data science applications and use-cases are, even in places we don’t suspect.
At this point it is important to clarify what we mean by AI, ML, analytics, data mining, and statistics, and when it is appropriate to choose one approach over another. You’ll be surprised, but statistics is the hardest to define and often misunderstood, because of its inferential, experimental, and causal facets.
Next, we’ll learn a few frameworks for understanding a firm’s performance evolution, value chain, strategy and tie them all together in an analyst’s workflow. These tools will help us ask good (statistical) questions and increase the chances that our modeling efforts result in actionable insights.
In time, I’ll add many more case-studies from advertisement, promotion, merchandising, engagement, CRM, conversion rate optimization, demand planning, supply chain, etc.
Essential aids in this pursuit of improving decision-making at scale are the different processes and methodologies for machine learning, experiment design, causal inference, and exploratory data analysis. It is important to mention that methodology is not a recipe, but a way of structuring our work – a set of guiding principles and constraints over the space of all the things that we could do in an analysis. Don’t think of these constraints as limiting your freedom, but as helpful allies in effective problem-solving.
“What is methodology good for? Sounds like I could use that time to learn ML”
These methodological fundamentals are not “just theory”; they are what will make or break projects in practice. There are so many pitfalls in ML and statistics that we cannot afford to work ad-hoc.
If you worry that the business school module is too conceptual – don’t despair, as we’ll have a few hands-on, technical lectures involving math, modeling, and programming.
- I use the newsvendor problem as a starting point in modeling for demand planning (see the sketch after this list)
- Pricing optimization serves as a starting point towards revenue management
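To make the newsvendor idea concrete, here is a minimal sketch of the classic critical-fractile solution – my own illustration with made-up numbers, not the course’s code – where the optimal order quantity is the demand quantile at the ratio of the underage cost to the total unit cost:

```python
from scipy import stats

# Newsvendor: balance the cost of ordering too little (lost margin)
# against ordering too much (excess inventory sold at salvage value).
price, cost, salvage = 10.0, 6.0, 2.0      # illustrative unit economics
cu = price - cost                          # underage cost per unit
co = cost - salvage                        # overage cost per unit
critical_ratio = cu / (cu + co)            # optimal in-stock probability

# Assume, for illustration, normally distributed demand
mu, sigma = 500, 80
q_star = stats.norm.ppf(critical_ratio, loc=mu, scale=sigma)
print(f"critical ratio = {critical_ratio:.2f}, order quantity = {q_star:.0f} units")
```

With these numbers the critical ratio is 0.5, so the optimal order equals the median demand of 500 units; a richer margin or a better salvage value would shift it accordingly.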
Module 2: Probability and Statistics
I think that most students will benefit from re-framing the fundamentals of probability and statistics. A lot of mistakes in practice come from a fundamental misunderstanding of the nature of statistical inference. If we view statistics as changing our minds and actions in the face of evidence – the good ol’ t-test will shine in a new light. It will become clear why those models and procedures were invented in the first place.
Another benefit of revising these fundamentals is that we can develop an appreciation for the stories and ideas behind these methods. Also, it can be fun.
We start with combinatorics and urn models, probability trees, distributions, conditioning, the laws of large numbers, the central limit theorem, and Bayes’ rule. For each of these topics, I present the story, use-case, and hands-on simulations to show that even very simple tools can be useful.
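As a flavor of those hands-on simulations – an illustrative sketch in plain NumPy, not necessarily the course’s own code – here is the central limit theorem in action: means of samples from a skewed exponential distribution land around the true mean, with a spread that shrinks at the familiar square-root-of-n rate.

```python
import numpy as np

rng = np.random.default_rng(42)

# CLT in action: averages of n draws from a skewed (exponential) distribution
# are approximately normal around the true mean, with sd ~ sigma / sqrt(n).
n, n_experiments = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(n_experiments, n)).mean(axis=1)

print("theoretical mean of the average:", 1.0)
print("simulated mean of the average:  ", round(sample_means.mean(), 3))
print("theoretical sd of the average:  ", round(1.0 / n ** 0.5, 3))
print("simulated sd of the average:    ", round(sample_means.std(), 3))
```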
Perhaps the most important lecture is also a very abstract one, where I formally define what probability and a random variable are, and what the key difference between probability and statistics is. Understanding what a collectivity, a statistical population, and a sample are; what estimands, estimators, and estimates mean; and, formally, what a statistical model is – will give you the right terminology and set you on a good path towards more advanced models.
The module culminates in practical aspects of experiment design and A/B testing, sample size justification, power analysis, and a conversation about the causes and remedies to the replication crisis. After all this effort put into the study of fundamentals, you can go with ease and confidence into Bayesian Statistics, Machine Learning, and Causal Inference.
When discussing hypothesis testing, it is very important that you recognize that most statistical tests can be framed as linear models.
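For example – a minimal sketch of that equivalence, using simulated data and standard libraries rather than course materials – a two-sample t-test is just a regression of the outcome on a group dummy; the slope’s t-statistic and p-value match the classic (equal-variance) t-test:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, size=100)
treated = rng.normal(11.0, 2.0, size=100)

# Classic two-sample t-test (pooled variance)
t_stat, p_value = stats.ttest_ind(treated, control)

# The same test framed as a linear model: y = alpha + beta * group
df = pd.DataFrame({"y": np.concatenate([control, treated]),
                   "group": np.repeat([0, 1], 100)})
fit = smf.ols("y ~ group", data=df).fit()

print(f"t-test:       t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"linear model: t = {fit.tvalues['group']:.3f}, p = {fit.pvalues['group']:.4f}")
```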
In statistics classes, the problem is usually completely framed, and students focus on computation and interpretation. Experiment design in practice is tricky and more complicated than it looks.
My preferred sequence is to go through the “business school” and statistics in parallel, since there is a great synergy: the former focuses on the problem space and processes (over content) of analytics, ML, and statistics. In the case of a very short and intensive course, I also recommend studying Bayesian generalized linear models before the experiment design topics.
Module 3: Applied Bayesian Statistics
Once we have a confident grasp of the fundamentals, we continue on the path of applied Bayesian statistics. It is an extremely flexible and composable approach to building explainable models of increasing complexity and realism.
Instead of applying an out-of-the-box model, we will build custom ones for each individual application in an iterative way. Choosing appropriate probability distributions and modeling assumptions will be critical in this process, along with model critique and validation. There are many ways to teach Bayesian statistics and generalized linear models, but I think R. McElreath has the best approach, giving a lot of importance to DAGs of influence, simulation, and ideas from causal inference.
Many experts in the field argue that this should be the default way we do regression analysis and that we need a very strong justification to show that a linear regression is the appropriate model. To put it bluntly, this modeling approach will enable us to improve on most challenges in decision-making and inference that businesses face.
One notable difference between the Bayesian approach and the traditional way advanced econometrics is taught is that we will focus on computation instead of proofs and heavy-duty mathematical statistics. We declare the model and the sampler does the job for us! If it explodes, we probably modeled it wrong.
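To give a taste of “declare the model and let the sampler do the job” – a minimal sketch assuming PyMC as the probabilistic programming tool (Stan, brms, or NumPyro would work just as well) – here is a simple Bayesian linear regression on simulated data:

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=200)   # simulated data

with pm.Model() as model:
    # Priors encode what we consider plausible before seeing the data
    alpha = pm.Normal("alpha", mu=0, sigma=2)
    beta = pm.Normal("beta", mu=0, sigma=1)
    sigma = pm.Exponential("sigma", lam=1)

    # Likelihood: the declared data-generating process
    pm.Normal("y_obs", mu=alpha + beta * x, sigma=sigma, observed=y)

    idata = pm.sample()   # the sampler does the heavy lifting

print(az.summary(idata, var_names=["alpha", "beta", "sigma"]))
```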
Notes on Causal Inference and ML
Did you get comfortable with building custom statistical models for inference and prediction? What about decisions with high stakes, where we sometimes want to do randomized experiments?
Often, A/B tests and randomized controlled trials are infeasible or unethical. Also realize that we cannot reach a causal conclusion from observational data alone – we can only talk about associations. We need a theory – our understanding of how the “world” works – translated into a statistical model, plus data, which will give us new insight into the causal processes (the evidence).
This motivates a deep-dive into the field of causal inference. Think of it as a link between the theoretical models and observational data, where we sometimes can take advantage of certain “natural experiments”. Causal inference requires deep thinking and understanding, which is truly challenging – an art and science, in contrast with the “auto-magic”, unreasonably effective pattern recognition of ML.
Even if you’re interested only in machine learning, most practitioners will emphasize the importance of mastering regression (often starting with generalized linear models) and of running A/B tests that provide evidence of whether our new model brings an improvement (or not).
This is how we jump through various buckets, highlighting the golden thread linking them all: decisions and uncertainty. Moreover, the tools we learned in Bayesian Statistics are directly applicable in ML – the lines between these two fields are indeed a bit blurry. This is especially the case when considering flexible models like Gaussian Processes and Bayesian Additive Regression Trees.
Other times, we care not just about a single decision or developing better policies, but we have to make tons of little decisions at scale. This is when we switch to a predictive, Machine Learning perspective and walk through our workhorse models, which should serve us decades ahead in a wide range of applications: both predictive and exploratory.
The icing on the cake is a set of miscellaneous topics dear to me and usually not covered in such a course: Demand Forecasting, Recommender Systems, and Natural Language Processing. All are extremely useful in business contexts, but they require significant tweaks to the models discussed before.
Full-Stack Data Apps
I mentioned before that a key prerequisite is competence in Python/R, SQL, data wrangling, EDA, visualization, simulation, literate programming. There are many wonderful resources to learn and practice these skills and topics. During the labs we’ll build from the ground up a tech stack for reproducible data analysis, model and data pipelines.
After finishing the first three modules, a good chance to put the knowledge into practice is to write a 5-page research paper on a question / problem you’re passionate about.8 This will test your study design, modeling, programming, and writing skills.
We will see how easy it is to publish your article / paper / report with Quarto and GitHub Pages.
8 In the labs, I give detailed instructions on what a good project should look like, starting from structure and ending with presentation
If you got to this point and want to continue, you will probably move onto Machine Learning and Causal Inference. There is, however, one huge problem when it comes to bringing added value in businesses – operationalizing models.
I have been speaking at conferences about this topic for 6 years: from data, MLOps, product management, and organizational perspectives – and it’s still very relevant. Of course, the way we operationalize a model and build a system depends on the use-case and many factors – however, the most general case to me is via a “Full Stack Data-Driven App” (with user interface, backend, database; data, training, and serving pipelines).
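As a taste of the serving piece of such an app – a toy sketch assuming FastAPI, with a trivial hard-coded rule standing in for a real trained model – the idea is simply to put predictions behind an HTTP endpoint that the user interface or other services can call:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Demand forecast demo")

class Features(BaseModel):
    price: float
    promo: bool

# Stand-in for a real model loaded from an artifact store or model registry
def predict_demand(price: float, promo: bool) -> float:
    return max(0.0, 120.0 - 8.0 * price + (25.0 if promo else 0.0))

@app.post("/predict")
def predict(features: Features) -> dict:
    return {"forecast": predict_demand(features.price, features.promo)}
```

Run it locally with `uvicorn app:app --reload` (assuming the file is called app.py) and you already have the skeleton of the backend; the database, training pipeline, and frontend then grow around it.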
A full-stack data-driven app is your final project (for the year-two course) and something you can brag about in your portfolio and GitHub profile. It sounds complicated, but we have the tools and frameworks to make it easy for us. This will force you to think about the user need and how your software product solves their problem.
Don’t worry about getting everything perfect, but focus on a problem and single area from the course you’re passionate about: it could be statistical modeling, ML pipelines, or even frontend development and data visualization.