Fundamental concepts in supervised machine learning

This is a memo to share what I have learnt in Machine Learning with Tree-Based Models (using Python), capturing the learning objectives as well as my personal notes. The course is taught by Elie Kawerk from DataCamp.

Photo by Keith Jonson on Unsplash

Decision trees are supervised learning models used for problems involving classification and regression.

I have learnt the following topics:

  • Use Python to train decision trees and tree-based models.
    Decision-Tree Learning, applying the CART algorithm to train decision trees for classification/regression problems.
  • Generalization Error of a supervised learning model, to diagnose underfitting and overfitting using Cross-Validation.
    Ensembling can produce better results than individual decision trees.
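
The topics above can be sketched with scikit-learn. This is a minimal illustration under my own assumptions, not the course's code: the bundled breast cancer dataset, `max_depth=4`, 5-fold cross-validation and a random forest as the ensemble are all choices I made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1  # arbitrary seed, only for reproducibility

# Assumed example data: a small binary classification dataset shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# A single decision tree (scikit-learn uses an optimised version of CART).
tree = DecisionTreeClassifier(max_depth=4, random_state=SEED)

# Cross-validation on the training set estimates the generalization accuracy.
cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring="accuracy")

tree.fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

# An ensemble of trees, to compare against the single tree.
forest = RandomForestClassifier(n_estimators=100, random_state=SEED)
forest.fit(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)

print(f"tree   train accuracy: {train_acc:.3f}")
print(f"tree   CV accuracy   : {cv_scores.mean():.3f}")
print(f"tree   test accuracy : {test_acc:.3f}")
print(f"forest test accuracy : {forest_test_acc:.3f}")
```

A CV accuracy well below the training accuracy suggests overfitting, while low scores on both suggest underfitting.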


A beginner’s guide to the basic concepts of Apache Airflow

This is a memo to share what I have learnt in Apache Airflow, capturing the learning objectives as well as my personal notes. The course is taught by Mike Metzger from DataCamp.

Photo by Jacek Dylag on Unsplash

A data engineer’s job includes writing scripts, adding complex CRON tasks, and trying various ways to meet an ever-changing set of requirements to deliver data on schedule. Airflow can handle all of this while adding scheduling, error handling, and reporting.

I have learnt the following topics:

  • Workflows / DAGs / Tasks
  • Operators (BashOperator, PythonOperator, BranchPythonOperator, EmailOperator)
  • Dependencies between tasks / Bitshift operators
  • Sensors (to react to workflow conditions and…
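
To show the idea behind the bitshift operators, here is a toy sketch in plain Python. The `Task` class and `run_order` helper are hypothetical stand-ins written for this memo, not Airflow's own classes. They only mimic how `>>` and `<<` can record "runs before" relationships between tasks.

```python
class Task:
    """Toy stand-in for a workflow task (hypothetical, not an Airflow class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that should run after this one

    def __rshift__(self, other):
        # `a >> b` means: a runs before b.
        self.downstream.append(other)
        return other  # returning the right operand allows a >> b >> c

    def __lshift__(self, other):
        # `a << b` means: b runs before a.
        other.downstream.append(self)
        return other


def run_order(start):
    """List task ids along a linear chain, following the first downstream link only."""
    order = []
    current = start
    while current is not None:
        order.append(current.task_id)
        current = current.downstream[0] if current.downstream else None
    return order


extract = Task("extract")
transform = Task("transform")
load = Task("load")
report = Task("report")

extract >> transform >> load  # extract, then transform, then load
report << load                # load runs before report

order = run_order(extract)
print(order)
```

The helper handles only a simple chain. A real scheduler would need a proper topological ordering of the whole graph.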


Multiple Linear Regression, R², Adjusted R², MSE, p-value

Statistics and coding are fundamentally important in the data science field. Since a lot of data science work is carried out with code, I would highly recommend learning statistics with a heavy focus on coding, preferably in Python or R.

Photo by Michael Dziedzic on Unsplash

In my previous article, I shared how to code the summary statistics of a dataset (Mean, Median, Mode, Max, Min, Range, Quartile, Inter-Quartile Range, Standard Deviation, Variance) and Simple Linear Regression.

In this article, I shall cover the following topics with code in Python 3:
• multiple linear regression models
• model performance metrics: R², Adjusted R², MSE…
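
As a rough sketch of the metrics named above, the snippet below fits a multiple linear regression by least squares with NumPy and computes MSE, R² and Adjusted R². The data are synthetic and the coefficients (3, 2, -1) are my own assumptions for illustration. P-values are left out because they need extra distribution functions.

```python
import numpy as np

# Synthetic data (assumed for illustration): y depends on two predictors plus noise.
rng = np.random.default_rng(0)
n, p = 50, 2  # n observations, p predictors
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares

mse = ss_res / n
r2 = 1 - ss_res / ss_tot
# Adjusted R² penalises R² for the number of predictors p.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("coefficients:", np.round(coef, 3))
print(f"MSE={mse:.4f}  R2={r2:.4f}  adjusted R2={adj_r2:.4f}")
```

Adjusted R² never exceeds R², and the gap widens as more predictors are added without improving the fit.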


When your ecommerce business grows

Photo by Mark König on Unsplash

If your ecommerce business is progressing to the Cloud, you need to be familiar with these three main types of cloud computing:

  • IaaS — Infrastructure as a Service
  • PaaS — Platform as a Service
  • SaaS — Software as a Service

These are all experiencing a surge in popularity as more businesses move to the Cloud. Gartner forecasts worldwide public cloud revenue to grow 17% in 2020. With growth rates like these, cloud computing will soon be the industry norm, and many businesses are phasing out on-prem software altogether.

Utilizing cloud computing is a great way to future-proof your business.


Machine Learning from labelled data to make predictions

This is a tutorial to share what I have learnt in Supervised Learning with scikit-learn, capturing the learning objectives as well as my personal notes. The course is taught by Hugo Bowne-Anderson from DataCamp.

Photo by Andy Kelly on Unsplash

Is a particular email spam?
Will a tumor be benign or malignant?
Which of your customers will take their business elsewhere?

These questions can be answered by machine learning algorithms, where computers learn from existing data to make predictions on new data.

I have learnt the following topics:

  • Using machine learning techniques to build predictive models
  • for both regression and classification problems
  • using real-world data
  • Concept…


Continue to speak the statistical language of your data

Previous tutorial: Statistical Thinking in Python (Part 1)

This is a tutorial to share what I have learnt in Statistical Thinking in Python (Part 2), capturing the learning objectives as well as my personal notes. The course is taught by Justin Bois from DataCamp.

Photo by ThisisEngineering RAEng on Unsplash

The goal is to build the probabilistic mindset and foundational statistical coding skills needed to dive into data sets and extract useful information from them.

I have learnt the following statistical thinking skills:

1. Perform EDA
(a) Generate effective plots like ECDFs
(b) Compute summary statistics

2. Estimate parameters
(a) By optimisation, including linear regression
(b) Determine confidence intervals

3…
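
Two of the skills listed above, the ECDF and confidence intervals, can be sketched with NumPy. This is a minimal illustration under my own assumptions: the helper names, the synthetic sample, and the use of a bootstrap percentile interval are choices made for this example and may differ from the course's own code.

```python
import numpy as np


def ecdf(data):
    """Return the x and y values of the empirical cumulative distribution function."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def bootstrap_ci(data, func=np.mean, n_reps=2000, level=95, seed=42):
    """Percentile bootstrap confidence interval for a statistic of the data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Resample with replacement, keeping the original sample size.
        sample = rng.choice(data, size=len(data), replace=True)
        reps[i] = func(sample)
    tail = (100 - level) / 2
    return np.percentile(reps, [tail, 100 - tail])


# Synthetic sample (assumed for illustration).
data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=100)

x, y = ecdf(data)
low, high = bootstrap_ci(data)
print(f"sample mean = {data.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Plotting `y` against `x` as points gives the ECDF plot.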


Speak the statistical language of your data

This is a tutorial to share what I have learnt in Statistical Thinking in Python (Part 1), capturing the learning objectives as well as my personal notes. The course is taught by Justin Bois from DataCamp, and it includes 4 chapters.

Photo by Chris Liverani on Unsplash

The end goal of gathering data is to draw clear, concise conclusions from it. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference.

I have learnt the following statistical thinking skills:

  • Graphical exploratory data analysis (EDA), Quantitative EDA
  • Construct (beautiful) instructive plots, including histogram, swarmplot, Empirical Cumulative Distribution Functions (ECDF), Box Plots…


How feature extraction techniques can reduce dimensionality

This is a tutorial to share what I have learnt in Dimensionality Reduction in Python, capturing the learning objectives as well as my personal notes. The course is taught by Jeroen Boeye from DataCamp, and it includes 4 chapters.

Photo by Aditya Chinchure on Unsplash

High-dimensional datasets are complex and can be computationally expensive to process. Reduce dimensionality by dropping features that duplicate other features, dropping irrelevant features, and using feature extraction techniques (through the calculation of uncorrelated principal components).
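
As a small sketch of feature extraction through principal components, the snippet below runs PCA by hand with NumPy's SVD. The synthetic data, where the second feature nearly duplicates the first, and the 95% explained-variance threshold are my own assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumed): feature 2 is almost a scaled copy of feature 1.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=n),  # near-duplicate feature
    rng.normal(size=n),                          # independent feature
])

# Standardise, then take the SVD; the rows of Vt are the principal directions.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

explained = S**2 / np.sum(S**2)  # explained variance ratio per component
components = X_std @ Vt.T        # data projected onto the principal directions

# Keep the smallest number of components that explains at least 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_reduced = components[:, :k]

print("explained variance ratio:", np.round(explained, 4))
print("components kept:", k, "reduced shape:", X_reduced.shape)
```

Because the first two features carry nearly the same information, two components are enough here, and the projected columns are uncorrelated with each other.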

I have learnt the following topics:

  • Why dimensionality reduction is important and when to use it
  • How to explore high dimensional data
  • How…


Using the new Tableau version 2020.x onwards, with The World Bank GDP data preparation in Python 3

Bar chart race in action (music added): https://youtu.be/QQ9dw7gpbIM

Bar chart races have become very popular recently. At the beginning of 2020, Tableau released its 2020.x version with a new Animations feature for dynamic parameters. This means that the bar chart race below can now be built easily in 6 minutes.

https://public.tableau.com/profile/blackraven#!/vizhome/Top10CountriesHistoricalGDPByYear/Top10CountriesHistoricalGDPByYear

This tutorial is a step-by-step guide to build a bar chart race based on historical Gross Domestic Product (GDP) data. To build a bar chart race is to create many discrete pages of bar charts and then string them together, just like how a traditional cartoon animation is built.
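
The "discrete pages" idea can be sketched in pandas: rank the countries within each year, then keep the top few rows per year so that each year becomes one page of bars. The column names, `TOP_N` and the values below are made-up placeholders for illustration, not real GDP figures and not the tutorial's actual preparation script.

```python
import pandas as pd

# Made-up placeholder values, not real GDP figures.
df = pd.DataFrame({
    "country": ["A", "B", "C"] * 2,
    "year": [2000] * 3 + [2001] * 3,
    "gdp": [10.0, 30.0, 20.0, 25.0, 15.0, 35.0],
})

TOP_N = 2  # hypothetical number of bars to show per page

# Rank countries within each year, with the largest GDP ranked 1.
df["rank"] = (
    df.groupby("year")["gdp"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# Each year is one "page": keep only the top rows and order them for display.
pages = (
    df[df["rank"] <= TOP_N]
    .sort_values(["year", "rank"])
    .reset_index(drop=True)
)

print(pages)
```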

Step 1: Prepare the software and data

Download…

Black_Raven (James Ng)

perpetual student, fitness enthusiast, passionate explorer https://www.linkedin.com/in/jnyh/
