Eric Chen's Blog

Car Crash Prediction in NZ - Machine Learning Pipeline

In this article, we build a complete machine learning pipeline: getting data through APIs, performing exploratory data analysis, and formulating a real-world problem as a machine learning model. The dataset used for this post is the New Zealand Crash Analysis Dataset, which the Transport Agency updates on a quarterly basis. It was last updated in October 2018 and covers crashes from January 2000 onward. It contains all traffic crashes reported to the Transport Agency by the NZ police. However, not all crashes are reported to the NZ police: a large portion of minor car crashes are settled on site by the parties involved. The level of reporting increases with the severity of the crash, so non-fatal crashes are believed to be under-reported.

more ...

Valuable Matplotlib & Seaborn Visualization Handbook, Part III

This post summarizes the top 50 most valuable Matplotlib & Seaborn data visualizations in data science. It can be used as a data visualization handbook in which you look up useful visualizations. The 50 visualizations are categorized into 7 application scenarios: Correlation, Deviation, Ranking, Distribution, Composition, Change, and Groups. The content is divided into three parts, and this post is Part III. It covers the last three categories: Composition, Change, and Groups.

more ...

Valuable Matplotlib & Seaborn Visualization Handbook, Part II

This post summarizes the top 50 most valuable Matplotlib & Seaborn data visualizations in data science. It can be used as a data visualization handbook in which you look up useful visualizations. The 50 visualizations are categorized into 7 application scenarios: Correlation, Deviation, Ranking, Distribution, Composition, Change, and Groups. The content is divided into three parts, and this post is Part II. It covers the third and fourth categories: Ranking and Distribution.

more ...

Valuable Matplotlib & Seaborn Visualization Handbook, Part I

This post summarizes the top 50 most valuable Matplotlib & Seaborn data visualizations in data science. It can be used as a data visualization handbook in which you look up useful visualizations. The 50 visualizations are categorized into 7 application scenarios: Correlation, Deviation, Ranking, Distribution, Composition, Change, and Groups. The content is divided into three parts, and this post is Part I. It covers the first two categories: Correlation and Deviation.

more ...

New Airbnb User Booking Prediction

The aim of this notebook is to predict the first destination country of new Airbnb users based on a historical dataset. The work involves a considerable amount of data cleansing. After the dataset is cleaned and preprocessed, I use the popular xgboost classifier as the prediction model, and grid search with 3-fold cross-validation to find the most suitable parameters for the classifier. If you are interested in finding out more about the dataset or the xgboost classifier, please read on.
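The idea of grid search with 3-fold cross-validation can be sketched in plain Python. This is only an illustration under stated assumptions: the nearest-neighbour classifier is a toy stand-in rather than xgboost, and the data, function names, and parameter grid are hypothetical, not taken from the notebook.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_X, train_y, x, n_neighbors):
    """Toy stand-in classifier (NOT xgboost): majority vote of nearest points."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    votes = [train_y[i] for i in order[:n_neighbors]]
    # Ties are broken in favour of the smallest label.
    return max(sorted(set(votes)), key=votes.count)


def cv_accuracy(X, y, params, k=3):
    """Mean accuracy of the toy classifier over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        hits = sum(
            knn_predict(train_X, train_y, X[i], **params) == y[i] for i in fold
        )
        scores.append(hits / len(fold))
    return mean(scores)


def grid_search(X, y, param_grid, k=3):
    """Score every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, params, k)
        if score > best_score:  # keep the first combination on ties
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical 1-D data: two classes separated around x = 5, interleaved so
# that every contiguous fold contains both classes.
X = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 1.5, 8.5, 2.5]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]
best_params, best_score = grid_search(X, y, {"n_neighbors": [1, 3]})
print(best_params, best_score)
```

Each parameter combination is scored on three held-out folds and the combination with the highest mean accuracy is kept. Libraries such as scikit-learn offer the same bookkeeping through `GridSearchCV` with `cv=3`.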

more ...