Data Overview¶
There are 6 files provided. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:
1. train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries.
2. test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
3. sessions.csv – This file contains supplementary data that can be used to train the model and make the final predictions. It records the actions (e.g. clicked on a listing, updated a wish list, ran a search, etc.) taken by the users in both the training and test datasets above.
A glimpse at the data
import pandas as pd
#import data
tr_datapath = "data/train_users_2.csv"
te_datapath = "data/test_users.csv"
df_train = pd.read_csv(tr_datapath, header = 0, index_col = None)
df_test = pd.read_csv(te_datapath, header = 0, index_col = None)
# size of training data
print(df_train.shape)
df_train.head()
# size of test data, which lacks the country_destination column that our model needs to predict
print(df_test.shape)
df_test.head()
Data cleansing¶
From the above snapshot of the training and test data, a few key observations about the integrity of this dataset can be made:
- Firstly, at least two columns have missing values – the age column and the date_first_booking column.
- Secondly, most of the columns provided contain categorical data. In fact, 11 of the 16 columns provided appear to be categorical.
- Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47.
- Fourthly, some columns contain erroneous values – values that are obviously incorrect. This may be a gender column where someone has entered a number, or an age column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or treated as missing.
- Lastly, some columns need to be standardized. For example, when collecting data on country of birth, if users are not given a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values so that there is only one version of each value (see the small sketch after this list).
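As a small illustration of this kind of standardization (hypothetical values, not taken from this dataset), a mapping dictionary plus pandas' replace is usually enough:
# Hypothetical example: collapse multiple spellings into one standard value
country_map = {'USA': 'United States', 'U.S.': 'United States', 'US': 'United States'}
df_example = pd.DataFrame({'country_of_birth': ['USA', 'U.S.', 'United States', 'US']})
df_example['country_of_birth'] = df_example['country_of_birth'].replace(country_map)
print(df_example['country_of_birth'].unique())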
First, let's combine the training data and test data into one DataFrame so that we can do data cleansing at the same time.
# combine df_train and df_test into one DataFrame
df_all = pd.concat((df_train, df_test), axis = 0, ignore_index = True, sort = False)
Fix the format of the dates
Because we will use the date information, cleaning the date timestamps is necessary. If we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates.
# fixing the date_account_created column
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
# fixing the timestamp_first_active column
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
# use the timestamp_first_active column to fill the missing values in the date_account_created column
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)
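To confirm that the conversions worked, the dtypes of the two columns can be checked – both should now be datetime64[ns]:
# quick check of the converted date columns
print(df_all[['date_account_created', 'timestamp_first_active']].dtypes)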
Drop inconsistent columns
There are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. In train_users_2.csv, every user with a first booking country has a value in the date_first_booking column, and for those who have not made a booking (country_destination = NDF) the value is missing. In test_users.csv, however, the date_first_booking column is empty for all records.
This means that this column is not going to be useful for predicting in which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chance that the model predicts NDF, as those are the records without dates in the training dataset.
# Drop the date_first_booking column
df_all.drop('date_first_booking', axis = 1, inplace = True)
Correct the age column
As noticed earlier, there are several age values that are clearly incorrect (unreasonably high or low). In this step, we replace these incorrect values with NaN.
To do this, we create a simple function that takes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). The function then replaces the values in the specified column that fall outside the acceptable range with NaN.
In addition, a significant portion of users did not provide an age at all. After converting the incorrect age values to NaN, we change all the NaN values to -1. After testing other methods of filling the NaN values (the mean, the median, and the most frequent value), using -1 produced the best prediction model.
import numpy as np
import warnings
warnings.filterwarnings('ignore')
#avoid comparison with NaN values
df_all['age'].fillna(-1, inplace=True)
# function to clean incorrect value
def remove_outliers(df, column, min_val, max_val):
col_values = df[column].values
df[column] = np.where(np.logical_or(col_values<min_val, col_values>max_val), np.NaN, col_values)
return df
# Fixing age column
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)
Fill the missing values in the first_affiliate_tracked column
Then view the tail of the DataFrame to check the results.
# Fill missing values in first_affiliate_tracked
df_all['first_affiliate_tracked'].fillna(-1, inplace = True)
df_all.tail()
Data transformation and feature extraction¶
Data transformation is undertaken with the intention of enhancing the ability of the classification algorithm to extract information from the data.
We then use feature extraction to create new features which will help improve the prediction accuracy of our model.
We will first focus on data transformation.
Transforming categorical data - one-hot encoding
The first step we are going to undertake is one-hot encoding – replacing each categorical field in the dataset with multiple binary columns, one for each distinct value of that field.
# one hot encoding function
def convert_to_onehot(df, column_to_convert):
categories = list(df[column_to_convert].drop_duplicates())
for category in categories:
cat_name = str(category).replace(" ", "_").replace(
"(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
col_name = column_to_convert[:5] + '_' + cat_name[:10]
df[col_name] = 0
df.loc[(df[column_to_convert] == category), col_name] = 1
return df
# One-hot encode each categorical column, then drop the original column from df_all
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language',
'affiliate_channel', 'affiliate_provider',
'first_affiliate_tracked', 'signup_app',
'first_device_type', 'first_browser']
for column in columns_to_convert:
df_all = convert_to_onehot(df_all, column)
df_all.drop(column, axis = 1, inplace = True)
df_all.head()
Creating new features
Two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in.
# Add new datetime related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days
# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
if column in df_all.columns:
df_all.drop(column, axis=1, inplace=True)
print(df_all.shape)
Adding new data¶
We will see what new data we can add from the sessions.csv file. The dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining it with the data from train_users_2.csv, as it has to be aggregated so that there is one row per user.
Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second interesting field is the secs_elapsed field. This shows us how long (in seconds) was spent on a particular action.
Import sessions data
# read sessions.csv
session_path = 'data/sessions.csv'
sessions = pd.read_csv(session_path, header = 0, index_col = False)
sessions.head()
Extract the primary and secondary devices for each user
How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values.
# Determine primary device
sessions_device = sessions.loc[:, ['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'],
as_index = False, sort = False).aggregate(np.sum)
index = aggregated_lvl1.groupby(['user_id'], sort = False)[
'secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = pd.DataFrame(aggregated_lvl1.loc[
index, ['user_id', 'device_type', 'secs_elapsed']])
df_primary.rename(columns = {
'device_type': 'primary_device',
'secs_elapsed': 'primary_secs'}, inplace = True)
df_primary = convert_to_onehot(df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis = 1, inplace = True)
# Determine secondary device
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[index])
index = remaining.groupby(
['user_id'], sort = False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = pd.DataFrame(
remaining.loc[index, ['user_id', 'device_type', 'secs_elapsed']])
df_secondary.rename(columns = {
    'device_type': 'secondary_device', 'secs_elapsed': 'secondary_secs'}, inplace = True)
df_secondary = convert_to_onehot(df_secondary, 'secondary_device')
df_secondary.drop('secondary_device', axis = 1, inplace = True)
Determine action counts
We count the occurrences of each value in the three columns action, action_type and action_detail, generating three separate tables, and then join these tables together on user_id.
# function to count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
id_list = df[id_col].drop_duplicates()
df_counts = df.loc[:,[id_col, column_to_convert]]
df_counts['count'] = 1
df_counts = df_counts.groupby(by = [id_col, column_to_convert],
as_index = False, sort = False).sum()
new_df = df_counts.pivot(index = id_col, columns = column_to_convert, values = 'count')
new_df = new_df.fillna(0)
#rename columns
categories = list(df[column_to_convert].drop_duplicates())
for category in categories:
cat_name = str(category).replace(
" ", "_").replace("(", "").replace(")", "").replace(
"/", "_").replace("-", "").lower()
col_name = column_to_convert + '_' + cat_name
new_df.rename(columns = {category: col_name}, inplace = True)
return new_df
# Aggregate and combine actions taken columns
session_actions = sessions.loc[:, ['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
# flag indicating the first loop
first = True
for column in columns_to_convert:
print("Converting " + column + " column...")
current_data = convert_to_counts(df = session_actions, id_col = 'user_id', column_to_convert=column)
if first:
first = False
actions_data = current_data
else:
actions_data = pd.concat([actions_data, current_data], axis = 1, join = 'inner')
Combine data sets¶
The last steps are to combine the various datasets into one large dataset.
First we combine the two device dataframes (df_primary and df_secondary) to create a device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv.
Finally, we combine the sessions dataframe with the user data dataframe.
The first join needs to be an outer join because not all users have a secondary device. The second merge could use either an outer or an inner join, as both the device and actions datasets should contain all users; in this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we still capture them. For the third step we use an inner join for a key reason – we want our final training dataset to include only users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
# Combine device datasets
df_primary.set_index('user_id', inplace = True)
df_secondary.set_index('user_id', inplace = True)
device_data = pd.concat([df_primary, df_secondary], axis = 1, join = 'outer', sort = False)
#Combine device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis = 1, join = 'outer', sort = False)
df_sessions = combined_results.fillna(0)
#Combine user and sessions datasets
df_all.set_index('id', inplace = True)
df_all = pd.concat([df_all, df_sessions], axis = 1, join = 'inner', sort = False)
df_all.head()
Create a Model¶
At this point, the df_all dataset is ready to be used to train and test a model that predicts the first booking destination country for each new user.
The training algorithm we will use is the popular XGBoost, a gradient boosting method that, from my perspective, is superior to a random forest. It builds a first tree, typically shallower than a single standalone decision tree, and makes predictions using that tree. The algorithm then looks at where the current ensemble of trees makes the largest errors and fits a new tree to correct them, so records that are being predicted poorly effectively receive more attention in the next round. This whole process is repeated as many times as the user specifies. Once the specified number of trees has been built, all of them are used to classify the records, with the final prediction obtained by summing the contributions of every tree rather than by a simple majority vote.
Cross validation
To guard against overfitting, we will use k-fold cross validation (the grid search below uses 3 folds). K-fold cross validation involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model on k – 1 of those subsets, and then running the model on the subset that was not used during training. Because all of the data used in cross validation is training data, the correct classification for each record is known, so the predicted category can be compared with the actual one. Once all folds have been completed, the average score across the folds is taken as an estimate of how the model will perform on unseen data.
# Import libraries
import xgboost as xgb
from sklearn import decomposition
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.preprocessing import LabelEncoder
Prepare training data
We previously combined the training and test data to simplify the cleaning and transforming process. To feed these into the model, we also need to split the training data into the three main components – the user IDs (we don’t want to use these for training as they are randomly generated), the features to use for training (X), and the categories we are trying to predict (y).
# Prepare training data for model training
df_train.set_index('id', inplace = True)
df_train = pd.concat([df_train['country_destination'], df_all], axis = 1, join = 'inner', sort = False)
index_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels) # training labels
x = df_train.drop('country_destination', axis = 1, inplace = False) # training data
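Before running the grid search, the cross_validate helper imported above can give a quick k-fold estimate of accuracy for a single, arbitrarily chosen set of parameters (a minimal sketch – the parameter values here are illustrative, not the tuned ones):
# Optional sanity check: k-fold cross validation for one parameter setting
baseline_model = xgb.XGBClassifier(objective = 'multi:softprob', max_depth = 4,
                                   learning_rate = 0.3, n_estimators = 25,
                                   subsample = 0.8, colsample_bytree = 0.8, seed = 0)
cv_results = cross_validate(baseline_model, x, y, cv = 3, scoring = 'accuracy')
print("Mean CV accuracy: %0.3f" % cv_results['test_score'].mean())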
Now that we have our training data ready, we can use GridSearchCV to run the algorithm with a range of parameters, then select the model that has the highest cross-validated score based on the chosen measure of performance (in this case accuracy, but there is a range of metrics we could use depending on our needs).
# Grid Search - used to find the best combination of parameters
XGB_model = xgb.XGBClassifier(objective = 'multi:softprob',
subsample = 0.8, colsample_bytree = 0.8, seed = 0)
param_grid = {'max_depth': [3,4,5],
'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}
model = GridSearchCV(estimator = XGB_model, param_grid = param_grid,
                     scoring = 'accuracy', verbose = 10, n_jobs = 1,
                     refit = True, cv = 3)
# Model training
model.fit(x, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
Make predictions on test data
Now we can use our trained model (with best training parameters) to make predictions on our test data.
# Prepare test data for prediction
df_test.set_index('id', inplace = True)
df_test = pd.merge(df_test.loc[:, ['date_first_booking']],
df_all, how = 'left', left_index=True,
right_index=True, sort=False)
x_test = df_test.drop('date_first_booking', axis = 1, inplace = False)
x_test = x_test.fillna(-1)
id_test = df_test.index.values
# Make predictions
y_pred = model.predict_proba(x_test)
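Finally, the predicted probabilities can be turned into a submission in the style of sample_submission_NDF.csv by taking, for each user, the five most likely countries (a minimal sketch – the id/country column layout and the output filename are assumptions based on the sample file):
# Build a submission-style table with the top 5 predicted countries per user (illustrative)
ids, countries = [], []
for i in range(len(id_test)):
    # indices of the five highest probabilities, most likely first
    top5 = np.argsort(y_pred[i])[::-1][:5]
    ids += [id_test[i]] * 5
    countries += le.inverse_transform(top5).tolist()
submission = pd.DataFrame({'id': ids, 'country': countries})
submission.to_csv('submission.csv', index = False)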