Data Overview¶
There are 6 files provided. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:
1. train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries.
2. test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
3. sessions.csv – This file contains supplementary data that can be used to train the model and make the final predictions. It records the actions (e.g. clicked on a listing, updated a wish list, ran a search, etc.) taken by the users in both the training and test datasets above.
A glimpse at the data
import pandas as pd
#import data
tr_datapath = "data/train_users_2.csv"
te_datapath = "data/test_users.csv"
df_train = pd.read_csv(tr_datapath, header = 0, index_col = None)
df_test = pd.read_csv(te_datapath, header = 0, index_col = None)
# size of training data
print(df_train.shape)
df_train.head()
# size of test data, which lacks the country_destination column that our model needs to predict
print(df_test.shape)
df_test.head()
Data cleansing¶
From the above snapshot of the training and test data, a few key observations about the integrity of this dataset can be made:
- Firstly, at least two columns have missing values – the age column and the date_first_booking column.
- Secondly, most of the columns provided contain categorical data. In fact, 11 of the 16 columns provided appear to be categorical.
- Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47.
- Fourthly, some columns contain erroneous values – values that are obviously incorrect. This may be a gender column where someone has entered a number, or an age column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or treated as missing.
- Lastly, some columns need to be standardized. For example, when collecting data on country of birth, if users are not given a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values so that there is only one version of each value (see the small sketch after this list).
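As a small illustration of this kind of standardization (hypothetical values, not taken from this dataset), a mapping dictionary plus pandas' replace is usually enough:
# Hypothetical example: collapse multiple spellings into one standard value
country_map = {'USA': 'United States', 'U.S.': 'United States', 'US': 'United States'}
df_example = pd.DataFrame({'country_of_birth': ['USA', 'U.S.', 'United States', 'US']})
df_example['country_of_birth'] = df_example['country_of_birth'].replace(country_map)
print(df_example['country_of_birth'].unique())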
First, let's combine the training data and test data into one DataFrame so that we can do data cleansing at the same time.
# combine df_train and df_test into one DataFrame
df_all = pd.concat((df_train, df_test), axis = 0, ignore_index = True, sort = False)
Fix the format of the dates
Because we will use the date information, cleaning the date timestamps is necessary. If we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates.
# fixing the date_account_created column
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
# fixing the timestamp_first_active column
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
# use the timestamp_first_active column to fill the missing values in the date_account_created column
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)
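To confirm that the conversions worked, the dtypes of the two columns can be checked – both should now be datetime64[ns]:
# quick check of the converted date columns
print(df_all[['date_account_created', 'timestamp_first_active']].dtypes)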
Drop inconsistent columns
There are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. In train_users_2.csv, every user with a first booking country has a value in the date_first_booking column, and for those who have not made a booking (country_destination = NDF) the value is missing. In test_users.csv, however, the date_first_booking column is empty for all records.
This means that this column is not going to be useful for predicting in which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chance that the model predicts NDF, as those are the records without dates in the training dataset.
# Drop the date_first_booking column
df_all.drop('date_first_booking', axis = 1, inplace = True)
Correct the age column
As noticed earlier, there are several age values that are clearly incorrect (unreasonably high or low). In this step, we replace these incorrect values with NaN.
To do this, we create a simple function that takes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). The function then replaces the values in the specified column that fall outside the acceptable range with NaN.
In addition, a significant portion of users did not provide an age at all. After converting the incorrect age values to NaN, we change all the NaN values to -1. After testing other methods of filling the NaN values (the mean, the median, and the most frequent value), using -1 produced the best prediction model.
import numpy as np
import warnings
warnings.filterwarnings('ignore')
#avoid comparison with NaN values
df_all['age'].fillna(-1, inplace=True)
# function to clean incorrect value
def remove_outliers(df, column, min_val, max_val):
col_values = df[column].values
df[column] = np.where(np.logical_or(col_values<min_val, col_values>max_val), np.NaN, col_values)
return df
# Fixing age column
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)
Fill the missing values in the first_affiliate_tracked column
Then view the tail of the DataFrame to check the results.
# Fill missing values in first_affiliate_tracked
df_all['first_affiliate_tracked'].fillna(-1, inplace = True)
df_all.tail()
Data transformation and feature extraction¶
Data transformation is undertaken with the intention of enhancing the ability of the classification algorithm to extract information from the data.
We then use feature extraction to create new features which will help improve the prediction accuracy of our model.
We will first focus on data transformation.
Transforming categorical data - one-hot encoding
The first step we are going to undertake is one-hot encoding – replacing each categorical field in the dataset with multiple binary columns, one for each distinct value of that field.
# one hot encoding function
def convert_to_onehot(df, column_to_convert):
categories = list(df[column_to_convert].drop_duplicates())
for category in categories:
cat_name = str(category).replace(" ", "_").replace(
"(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
col_name = column_to_convert[:5] + '_' + cat_name[:10]
df[col_name] = 0
df.loc[(df[column_to_convert] == category), col_name] = 1
return df
# One-hot encode each categorical column, then drop the original column from df_all
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language',
'affiliate_channel', 'affiliate_provider',
'first_affiliate_tracked', 'signup_app',
'first_device_type', 'first_browser']
for column in columns_to_convert:
df_all = convert_to_onehot(df_all, column)
df_all.drop(column, axis = 1, inplace = True)
df_all.head()
Creating new features
Two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in.
# Add new datetime related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days
# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
if column in df_all.columns:
df_all.drop(column, axis=1, inplace=True)
print(df_all.shape)
Adding new data¶
We will see what new data we can add from the sessions.csv file. The dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining it with the data from train_users_2.csv, as it has to be aggregated so that there is one row per user.
Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second interesting field is the secs_elapsed field. This shows us how long (in seconds) was spent on a particular action.
Import sessions data
# read sessions.csv
session_path = 'data/sessions.csv'
sessions = pd.read_csv(session_path, header = 0, index_col = False)
sessions.head()
Extract the primary and secondary devices for each user
How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values.
# Determine primary device
sessions_device = sessions.loc[:, ['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'],
as_index = False, sort = False).aggregate(np.sum)
index = aggregated_lvl1.groupby(['user_id'], sort = False)[
'secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = pd.DataFrame(aggregated_lvl1.loc[
index, ['user_id', 'device_type', 'secs_elapsed']])
df_primary.rename(columns = {
'device_type': 'primary_device',
'secs_elapsed': 'primary_secs'}, inplace = True)
df_primary = convert_to_onehot(df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis = 1, inplace = True)
# Determine secondary device
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[index])
index = remaining.groupby(
['user_id'], sort = False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = pd.DataFrame(
remaining.loc[index, ['user_id', 'device_type', 'secs_elapsed']])
df_secondary.rename(columns = {
    'device_type': 'secondary_device', 'secs_elapsed': 'secondary_secs'}, inplace = True)
df_secondary = convert_to_onehot(df_secondary, 'secondary_device')
df_secondary.drop('secondary_device', axis = 1, inplace = True)
Determine action counts
We count the occurrences of each value in the three columns action, action_type and action_detail, generating three separate tables, and then join these tables together on user_id.
# function to count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
id_list = df[id_col].drop_duplicates()
df_counts = df.loc[:,[id_col, column_to_convert]]
df_counts['count'] = 1
df_counts = df_counts.groupby(by = [id_col, column_to_convert],
as_index = False, sort = False).sum()
new_df = df_counts.pivot(index = id_col, columns = column_to_convert, values = 'count')
new_df = new_df.fillna(0)
#rename columns
categories = list(df[column_to_convert].drop_duplicates())
for category in categories:
cat_name = str(category).replace(
" ", "_").replace("(", "").replace(")", "").replace(
"/", "_").replace("-", "").lower()
col_name = column_to_convert + '_' + cat_name
new_df.rename(columns = {category: col_name}, inplace = True)
return new_df
# Aggregate and combine actions taken columns
session_actions = sessions.loc[:, ['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
# flag indicating the first loop
first = True
for column in columns_to_convert:
print("Converting " + column + " column...")
current_data = convert_to_counts(df = session_actions, id_col = 'user_id', column_to_convert=column)
if first:
first = False
actions_data = current_data
else:
actions_data = pd.concat([actions_data, current_data], axis = 1, join = 'inner')
Combine data sets¶
The last steps are to combine the various datasets into one large dataset.
First we combine the two device dataframes (df_primary and df_secondary) to create a device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv.
Finally, we combine the sessions dataframe with the user data dataframe.
The first join needs to be an outer join because not all users have a secondary device. The second merge could use either an outer or an inner join, as both the device and actions datasets should contain all users; in this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we still capture them. For the third step we use an inner join for a key reason – we want our final training dataset to include only users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
# Combine device datasets
df_primary.set_index('user_id', inplace = True)
df_secondary.set_index('user_id', inplace = True)
device_data = pd.concat([df_primary, df_secondary], axis = 1, join = 'outer', sort = False)
#Combine device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis = 1, join = 'outer', sort = False)
df_sessions = combined_results.fillna(0)
#Combine user and sessions datasets
df_all.set_index('id', inplace = True)
df_all = pd.concat([df_all, df_sessions], axis = 1, join = 'inner', sort = False)
df_all.head()
Create a Model¶
At this point, the df_all dataset is ready to be used to train and test a model that predicts the first booking destination country for each new user.
The training algorithm we will use is the popular XGBoost, a gradient boosting method that, from my perspective, is superior to a random forest. It builds a first tree, typically shallower than a single standalone decision tree, and makes predictions using that tree. The algorithm then looks at where the current ensemble of trees makes the largest errors and fits a new tree to correct them, so records that are being predicted poorly effectively receive more attention in the next round. This whole process is repeated as many times as the user specifies. Once the specified number of trees has been built, all of them are used to classify the records, with the final prediction obtained by summing the contributions of every tree rather than by a simple majority vote.
Cross validation
To guard against overfitting, we will use k-fold cross validation (the grid search below uses 3 folds). K-fold cross validation involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model on k – 1 of those subsets, and then running the model on the subset that was not used during training. Because all of the data used in cross validation is training data, the correct classification for each record is known, so the predicted category can be compared with the actual one. Once all folds have been completed, the average score across the folds is taken as an estimate of how the model will perform on unseen data.
# Import libraries
import xgboost as xgb
from sklearn import decomposition
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.preprocessing import LabelEncoder
Prepare training data
We previously combined the training and test data to simplify the cleaning and transforming process. To feed these into the model, we also need to split the training data into the three main components – the user IDs (we don’t want to use these for training as they are randomly generated), the features to use for training (X), and the categories we are trying to predict (y).
# Prepare training data for model training
df_train.set_index('id', inplace = True)
df_train = pd.concat([df_train['country_destination'], df_all], axis = 1, join = 'inner', sort = False)
index_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels) # training labels
x = df_train.drop('country_destination', axis = 1, inplace = False) # training data
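Before running the grid search, the cross_validate helper imported above can give a quick k-fold estimate of accuracy for a single, arbitrarily chosen set of parameters (a minimal sketch – the parameter values here are illustrative, not the tuned ones):
# Optional sanity check: k-fold cross validation for one parameter setting
baseline_model = xgb.XGBClassifier(objective = 'multi:softprob', max_depth = 4,
                                   learning_rate = 0.3, n_estimators = 25,
                                   subsample = 0.8, colsample_bytree = 0.8, seed = 0)
cv_results = cross_validate(baseline_model, x, y, cv = 3, scoring = 'accuracy')
print("Mean CV accuracy: %0.3f" % cv_results['test_score'].mean())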
Now that we have our training data ready, we can use GridSearchCV to run the algorithm with a range of parameters, then select the model that has the highest cross-validated score based on the chosen measure of performance (in this case accuracy, but there is a range of metrics we could use depending on our needs).
# Grid Search - used to find the best combination of parameters
XGB_model = xgb.XGBClassifier(objective = 'multi:softprob',
subsample = 0.8, colsample_bytree = 0.8, seed = 0)
param_grid = {'max_depth': [3,4,5],
'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = GridSearchCV(estimator = XGB_model, param_grid = param_grid,
                     scoring = 'accuracy', verbose = 10, n_jobs = 1,
                     refit = True, cv = 3)
# Model training
model.fit(x, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
Make predictions on test data
Now we can use our trained model (with best training parameters) to make predictions on our test data.
# Prepare test data for prediction
df_test.set_index('id', inplace = True)
df_test = pd.merge(df_test.loc[:, ['date_first_booking']],
df_all, how = 'left', left_index=True,
right_index=True, sort=False)
x_test = df_test.drop('date_first_booking', axis = 1, inplace = False)
x_test = x_test.fillna(-1)
id_test = df_test.index.values
# Make predictions
y_pred = model.predict_proba(x_test)
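Finally, the predicted probabilities can be turned into a submission in the style of sample_submission_NDF.csv by taking, for each user, the five most likely countries (a minimal sketch – the id/country column layout and the output filename are assumptions based on the sample file):
# Build a submission-style table with the top 5 predicted countries per user (illustrative)
ids, countries = [], []
for i in range(len(id_test)):
    # indices of the five highest probabilities, most likely first
    top5 = np.argsort(y_pred[i])[::-1][:5]
    ids += [id_test[i]] * 5
    countries += le.inverse_transform(top5).tolist()
submission = pd.DataFrame({'id': ids, 'country': countries})
submission.to_csv('submission.csv', index = False)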