Data As Second Nature
I have been learning for some time now through my 90-day MOOC’athlon learning challenge. I am satisfied in the sense that I complete whatever I put my hands on. But I still feel this is not enough.
Why is this not enough?
The reason I feel “not enough” is that I have worked in the software industry. I wrote computer programs to solve business problems for five years, and as much as it was about learning, it was also about knowing what to code, having it in memory, having it as second nature. I had the entire C language in my head, and whenever I needed to solve a problem, a solution in C would just “come to me”. It is like driving, where you know how and when to speed up, slow down, shift gears, and brake subconsciously. You only need to learn driving once. In the same way, after the initial learning, I didn’t need a MOOC, a course, or anything else to do my everyday work. So the questions I asked myself were: “how do I do the same as a data scientist?”, “how do I make problem-solving using data science my second nature?”
Learning the fundamentals is good. I did the same in the beginning when I was learning to program: I wrote programs for months and months. Now, in 2020, I want to make this Corona lockdown the greatest learning period of my life. How can I achieve this?
Tony Robbins says that when you decide on a goal, the first thing you need to do is define your outcome; it is different from the activity you do to achieve it. The activity can change, but the outcome doesn’t. Some activities are efficient and some are not, some are effective and some are not, and you need to keep changing your approach until you reach your desired outcome. The outcome in my case is the actual work that a data scientist does: data cleaning, wrangling, visualizing, and communicating your insights to non-tech business people. So I need to start with data cleaning and wrangling as the first thing on my list.
So I decided to reproduce Anne Bonner’s original The Ultimate Beginner’s Guide to Data Scraping, Cleaning, and Visualization. It took me a whole day to reproduce her work, an excellent piece of real-life industrial work. I didn’t just copy-paste; I read and worked through each step, modified the code, and changed functions and methods to see how they behave. In fact, I sometimes called methods and functions not used in the original article, ones I found in the official Pandas docs, just to see what I would get and why. It was an exhilarating and tiring mini-project; I created a rather large Jupyter notebook and wrote a very long post about the reproduction. But instead of publishing all of it here, I decided not to post any technical content (the code and the initial Miniforge setup).
The reason was that I wanted this experience to be useful for beginners in data science, and by including a lot of code, I could have discouraged beginners not well versed in Python and Pandas. Second, all of the code is already available in the original post, and Anne did a great job of explaining the steps required to reproduce her work. I follow rule 2 of hackers.
Now everyone knows that they need to have good knowledge of major data science subjects:
- Python
- Pandas, NumPy and Scikit-learn
- Probability and Statistics
- Machine Learning and its Models
- Machine Learning projects (most popular is Kaggle)
The problem is that the majority of data science beginners pour all of their energy into the above. They miss the bigger picture. So I decided to go another way: what non-coding, non-mathematical things did I learn from reproducing Anne’s post? What should one be aware of when beginning the data science learning journey? What is as important as knowing Python and statistics when it comes to getting noticed by hiring managers? There are two things.
The Domain
This is what people who start learning data science don’t know about. It is the skill they are going to ignore, and, from my understanding so far, it is the skill you need to get a recruiter’s attention. It is tough to acquire because it is unfamiliar territory. This is where I should speak the words domain knowledge.
Unlike what you might think, it is not just programming skills that matter; it is also the domain you work in. When I started my job as a computer programmer, I started as a trainee. I already knew the C language, but it took me several months to understand the telecom/communication domain. I knew C but knew nothing about BSD Sockets. BSD Sockets are written in C, but that is half the story; the other half is understanding how to use sockets and how it all works. I didn’t know the domain jargon like Handshake, ACK, Out-of-band data, etc. That is where I spent several months training myself. This is what we mean (in part) when we say real-life software development. The same thing is true for data science. You know Python, you know Pandas and ML algorithms, so which domain do you work in?
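To make the sockets point concrete: the same BSD Sockets API I struggled with in C is wrapped by Python’s standard-library socket module. This is a minimal sketch of my own (not code from my telecom days), and the “handshake” jargon above is exactly what happens invisibly inside connect() and accept():

```python
import socket
import threading

def serve(listener):
    conn, _ = listener.accept()      # accept() completes the TCP three-way handshake
    with conn:
        data = conn.recv(1024)       # read the client's bytes
        conn.sendall(data.upper())   # echo them back, upper-cased

# Listen on localhost; port 0 lets the OS pick any free port
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

t = threading.Thread(target=serve, args=(listener,))
t.start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello, sockets")
    reply = client.recv(1024)

t.join()
listener.close()
print(reply.decode())  # HELLO, SOCKETS
```

Knowing Python (or C) gets you the syntax above; knowing the domain is what tells you why the accept/connect pair, ACKs, and the rest of the protocol behave the way they do.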
Anne’s article was about the text mining domain, specifically a sub-domain called sentiment analysis. We get some jargon there:
- Bigram: A bigram is an n-gram for n=2
- Stopwords: Common words filtered out before processing text. There is no single universal list of stop words used by all natural language processing tools, and not all tools even use such a list.
- Stemming: The process of reducing inflected (or sometimes derived) words to their word stem, base or root form — generally a written word form.
- Anhedonia: A diverse array of deficits in hedonic function, including reduced motivation or ability to experience pleasure
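The first three terms are easy to demystify with plain Python. This is an illustrative sketch only: the tiny stopword list and the toy suffix-stripping “stemmer” are my own simplifications, not the real tools (Anne’s post uses proper NLP libraries for this):

```python
import re

# A toy stopword list -- real tools ship much longer, curated lists
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    # Strip a few common suffixes; nowhere near a real Porter stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bigrams(tokens):
    # An n-gram with n=2: every pair of adjacent tokens
    return list(zip(tokens, tokens[1:]))

tokens = remove_stopwords(tokenize("The cats are cleaning the scraped tweets"))
print([crude_stem(t) for t in tokens])  # ['cat', 'clean', 'scrap', 'tweet']
print(bigrams(tokens))                  # [('cats', 'cleaning'), ('cleaning', 'scraped'), ('scraped', 'tweets')]
```

The point is not the code; it is that each piece of jargon maps to a small, learnable idea once you enter the domain.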
The question you have to ask yourself now is which domain you are going to choose: Healthcare, Finance, IoT, Genomics, to name a few. You gotta know the domain and the jargon, because this is the language most of the people on your team will speak when you join an organization.
Importance of Data Cleaning and Wrangling
This is taken directly from Anne’s post:
the most important part of refining the model to get more accurate results would be the data gathering, cleaning, and preprocessing stage. Until the Tweets were appropriately scraped and cleaned, the model had unimpressive accuracy. By cleaning and processing the Tweets with more care, the robustness of the model improved to 97%.
I had already heard hundreds of times that 70% of the time is spent on data cleaning and wrangling, and I had done some of it using automated Azure and AWS tools too. But I didn’t know it was this time-consuming and head-hurting until I reproduced Anne’s post. It was all manual cleaning. From her words, you can tell how important it is. Your model is only as good as your data.
Companies are getting a lot of data from their customers/consumers, and this data is dirty; it is not in a shape and form that you can pull great insights out of. The outcome can’t be good if the input is messed up. Hence, you need to master this.
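To give a taste of the kind of manual cleaning the quote above refers to, here is a sketch of a tweet-scrubbing function. The patterns and their ordering are my own illustration, not the exact pipeline from Anne’s article:

```python
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"[@#]\w+", "", text)        # drop @mentions and #hashtags
    text = re.sub(r"[^a-zA-Z' ]", " ", text)   # keep only letters, apostrophes, spaces
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip().lower()

raw = "RT @user: Feeling GREAT today!! https://t.co/abc #mood"
print(clean_tweet(raw))  # rt feeling great today
```

Every dataset needs its own version of this: different noise, different rules, discovered only by staring at the raw data. That is where the 70% goes.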
What’s next
This is what I am going to do next:
- Reproduce a project/paper.
- Choose a domain to do that in, one that interests me and fits my background. Keep in mind that one needs to find a balance between interests and the market. If you go purely by your interests or passion, you may find yourself unemployed, unless you are lucky. Don’t count on luck; build it instead. If you go purely by market demand, you will not be able to keep up when things get tough, and they will get tough. Find a balance.
- Get dirty data and then clean and wrangle it manually. All datasets on Kaggle are pre-cleaned for you. So, no cookies from Kaggle =:o)
Finally, the point is to get out of learning mode and into building mode. I think this is what Dan Becker means when he says:
Learn about techniques in the context of applications
You can go ahead and read his post.
If you follow his advice, you will end up several years ahead of everyone else in your profession.