A Method to Learn Mathematics required for “Introduction to Statistical Learning”

Arnuld On Data
6 min read · May 3, 2020

Photo by NeONBRAND on Unsplash

Step Zero

Always start with your outcome. The outcome is to understand the Mathematics I need for my work as a data scientist. Data Science requires expertise in a few things:

  1. Pandas
  2. Scikit-learn
  3. Algorithmic Modeling
  4. Domain

I wrote about why you need to focus on algorithmic modeling for data science here:

From the above post I chose what I needed to study, Introduction to Statistical Learning (ISL):

Source: https://faculty.marshall.usc.edu/gareth-james/ISL/

Step One

ISL has regression as its prerequisite. Regression requires knowing Statistics, which in turn requires knowledge of Probability. So what method of studying Mathematics can I use so that I don’t over-invest my time for a small return, and instead get a good RoI in terms of employable skills as a data scientist? I use this:

and hence I replaced my beliefs of:

“Regression requires knowing Statistics”

with

“Regression requires some basic understanding of Statistics”.

“Statistics requires knowing Probability”

with

“Statistics requires some basic understanding of Probability”

Basic Understanding of Probability

Hence I went on to learn Probability from Wikipedia, which took a few hours. There are two broad families of probability interpretations: Physical and Evidential. Their best-known representatives are the Frequentist and Bayesian approaches, respectively.

Frequentist approach: associated with random physical processes such as tossing a coin. If you toss a coin a thousand times, how many times are you going to get heads? Will there be a pattern if you repeat this experiment a hundred times? What’s the frequency of getting heads and tails?

An experiment is called deterministic if it has only one outcome. An experiment that has exactly two mutually exclusive outcomes is known as a Bernoulli trial.

You may think that rolling a die is then not a Bernoulli trial. Well, that depends on how you think of it. Yes, we do have six numbers here, but what if we think in terms of success and failure? What if we treat a six as “success” and everything else as “failure”? In this case, there are only two possible outcomes: a six or “not a six”. This makes it a Bernoulli trial. The difference in how you frame the experiment changes what kind of trial it is :-)
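To make the “six or not a six” idea concrete, here is a minimal sketch in Python. The function names `roll_is_six` and `success_frequency` are made up for this illustration, and the die is assumed to be fair. It repeats the trial many times and reports how often “success” came up, which is the frequentist way of looking at it.

```python
import random


def roll_is_six(rng):
    """One Bernoulli trial: success (True) if a fair six-sided die shows 6."""
    return rng.randint(1, 6) == 6


def success_frequency(n_trials, seed=0):
    """Fraction of successes over n_trials independent trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = sum(roll_is_six(rng) for _ in range(n_trials))
    return successes / n_trials


if __name__ == "__main__":
    # Assumption: a fair die, so the frequency should settle near 1/6.
    for n in (100, 10_000, 100_000):
        print(n, success_frequency(n))
```

With more trials, the printed frequency should drift toward 1/6 (about 0.167).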

Bayesian approach: an interpretation of probability in which, instead of the frequency of some phenomenon, probability is interpreted as a reasonable expectation representing a state of knowledge or a personal belief. Bayesian probability can be seen as a way of reasoning about propositions whose truth or falsity is unknown. So we assign some prior probability to a hypothesis and then update it with the data observed from the experiment; the updated value is the posterior probability (it is a little bit more complex than that but let’s not get into that).
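A small sketch of the prior-to-posterior step may help here. The numbers are invented for illustration: I assume we hold a coin that is either fair or biased toward heads (80% heads), with no initial reason to prefer either hypothesis. The helper `bayes_update` is a hypothetical name, not a library function.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses after one observation.

    prior:      {hypothesis: P(hypothesis)}
    likelihood: {hypothesis: P(observation | hypothesis)}
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(observation)
    return {h: p / evidence for h, p in unnormalized.items()}


# Made-up numbers: equal prior belief in a fair coin and a biased coin.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
# Probability of seeing heads under each hypothesis (assumed values).
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(4, 5)}

# We toss once and observe heads, then update our belief.
posterior = bayes_update(prior, p_heads)

if __name__ == "__main__":
    print(posterior)
```

After a single head the belief shifts toward the biased coin: the posterior is 5/13 for “fair” and 8/13 for “biased”. Feeding the posterior back in as the next prior repeats the update for each new toss.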

Photo by Alessandro Bianchi on Unsplash

Axioms and Subjects in Probability

Probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms (a.k.a Kolmogorov Axioms). Typically these axioms formalize probability in terms of:

  1. A probability space
  2. A probability measure: assigning values between 0 and 1
  3. A collection of all possible outcomes called the sample space or possibility space, denoted by S. For example, if we are tossing a coin, then there are only two outcomes: a head and a tail. The sample space for this is commonly written as {H, T}. Any specified subset of the sample space is called an event. In our case, the two elementary events are a head, {H}, and a tail, {T}. If we toss two coins together then we have four possible outcomes and hence the sample space S = {HH, HT, TH, TT}. In this case {HH}, {HH, TT} or any other subset of S is an event.
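The three ingredients above can be sketched in a few lines of Python. This is only an illustration under the assumption that both coins are fair, so every outcome is equally likely; the helper names `all_events` and `probability` are my own.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Sample space for tossing two coins: every ordered pair of H and T.
sample_space = {"".join(pair) for pair in product("HT", repeat=2)}


def all_events(space):
    """Every subset of the sample space, i.e. every possible event."""
    items = sorted(space)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]


def probability(event, space):
    """Probability measure, assuming all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))


if __name__ == "__main__":
    print(sorted(sample_space))
    print(len(all_events(sample_space)))           # number of events
    print(probability({"HH", "TT"}, sample_space))  # both coins match
```

A sample space with four outcomes has 2^4 = 16 events, from the empty set (probability 0) up to S itself (probability 1).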

Central subjects in probability theory:

  1. Random Variables
  2. Random Processes
  3. Probability Distributions

We can also write the above as:

  1. Discrete and continuous random variables
  2. Probability Distributions
  3. Stochastic Processes (stochastic is a fancy word for random)

Although it is not possible to perfectly predict individual random events, their behavior over many repetitions follows patterns. Two major results in probability theory describing such patterns are:

  1. Law of large numbers: according to this law, the average of the results obtained from a large number of trials should be close to some value (a.k.a. expected value or mean) and will tend to become closer to this value as more trials are performed
  2. Central limit theorem: when independent random variables (variables that don’t affect the probability of each other) are added, their properly normalized sum tends toward a normal distribution (a.k.a. bell curve) even if the original variables themselves are not normally distributed.
source: Wikimedia

Look at the above picture. This is the graph of the probability of each number appearing when a die is thrown. There are six values on a die, and all have an equal probability of coming up. Now look at this graph down here:

source: Wikimedia

n = 5 means we have 5 dice, we roll all of them together, add the numbers we get, and then graph the sum. Look at how many times we get 17 or 18. We get a bell-like graph. This is what a normal distribution looks like.
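You can reproduce both graphs yourself. The sketch below assumes fair six-sided dice and uses function names I made up (`sum_distribution`, `average_roll`). It counts every possible total exactly for a given number of dice, and it also averages many simulated rolls to show the law of large numbers at work.

```python
import random
from collections import Counter
from itertools import product


def sum_distribution(n_dice):
    """Exact count of each possible total when rolling n_dice fair dice."""
    return Counter(sum(faces) for faces in product(range(1, 7), repeat=n_dice))


def average_roll(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of one fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


if __name__ == "__main__":
    # Central limit theorem: a text histogram of the total of 5 dice.
    counts = sum_distribution(5)
    for total in sorted(counts):
        print(f"{total:2d} {'#' * (counts[total] // 20)}")

    # Law of large numbers: the average approaches the expected value 3.5.
    for n in (10, 1_000, 100_000):
        print(n, average_roll(n))
```

With one die every total is equally likely, while with five dice the histogram peaks at 17 and 18 and falls away symmetrically on both sides.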

Understanding Notation

Understanding the formulas and notation of mathematics has always been a tough nut to crack for me. I could never learn it in 16 years of education. Now I am going to take a different approach. Probability starts with sets, and to understand sets you can work through this very simple and straightforward Wikipedia page:

And what would be the point of learning a Mathematical concept and then not knowing how to use it with a programming language? Head to this excellent introduction to sets in Python by Mike Driscoll:

https://www.blog.pythonlibrary.org/2020/04/28/python-101-learning-about-sets/
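To connect set notation back to probability, here is a short sketch using Python’s built-in `set` type. The events and the `probability` helper are my own examples, assuming one roll of a fair die. Union, intersection and complement in set notation map directly onto the `|`, `&` and `-` operators.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a die
even = {2, 4, 6}                    # event: the roll is even
greater_than_three = {4, 5, 6}      # event: the roll is more than 3

either = even | greater_than_three  # union: even OR more than 3
both = even & greater_than_three    # intersection: even AND more than 3
not_even = sample_space - even      # complement of "even" within S


def probability(event):
    """Probability of an event, assuming a fair die."""
    return Fraction(len(event), len(sample_space))


if __name__ == "__main__":
    print(either, both, not_even)
    # P(A or B) = P(A) + P(B) - P(A and B)
    print(probability(either))
    print(probability(even) + probability(greater_than_three) - probability(both))
```

Both of the last two lines print 2/3, which is the addition rule for probabilities written out with sets.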

Epilogue

This took me a day and a half, not only to understand the material but also to write and edit this post. If you think this post was not about Mathematics in general but only about probability, then you would be wrong. What I have demonstrated here is an approach, a method to learn Mathematics for data science, and a quite efficient one. I chose probability only because I needed an example of my approach and I had just started learning it; you can see how I progress in this 90-day MOOC’athlon learning challenge. The method is made up of:

  1. Knowing your outcome
  2. Why you want this outcome
  3. Changing your beliefs about how much you need to learn
  4. Finding out the RoI of “how much to learn” in the context of your career as a whole
  5. Taking small 1–2 day actions rather than 4–6 week MOOCs

From here you can go further and learn more topics with the same method, e.g. Probability Distributions, Central Limit Theorem, Probability Mass Function, and then on to Statistics. Remember, don’t be rigid in your approach. If one method does not work, try another. If my method of learning doesn’t work for you, then go ahead and try someone else’s. If that doesn’t work either, then go ahead and change it too. Just keep trying to crack through, and one day the door will open and you will begin to make sense of this complexity. The point is to never give up and to always take tiny action steps to move ahead.

FYI, the MOOC’athlon challenge was started by Rassul-Ishame Kalfane (PhD Candidate) (also check the MIT Challenge by Scott H. Young). Based on these two challenges, I started this 90-day Learning MOOC’athlon challenge:

Written by Arnuld On Data

Industrial Software Developer turned Data Scientist. From C to Python. Linkedin: https://www.linkedin.com/in/arnuld-on-data/
